I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
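As a quick sanity check before the benchmarks, both functions agree on small inputs. The snippet below restates the two functions with range instead of xrange so it also runs under Python 3; in both, n counts occurrences zero-based (n=0 is the first occurrence):

```python
def findnth(haystack, needle, n):
    # split-based version: the (n+1)-limited split isolates everything
    # after the (n+1)th occurrence in parts[-1]
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    # find-based version: repeatedly resume the search just past
    # the previous hit (or one character past it, if overlap=True)
    l = 1 if overlap else len(x)
    i = -l
    for _ in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

print(findnth("ababab", "ab", 1))               # 2 (second occurrence)
print(find_nth("ababab", "ab", 1))              # 2
print(find_nth("aaaa", "aa", 1, overlap=True))  # 1 (overlapping match)
print(findnth("ababab", "ab", 5))               # -1 (fewer than 6 occurrences)
```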
Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called bigfile. To save memory, we would like to work on an mmap.mmap object representation of the file:
In [1]: import _find_nth, find_nth, mmap
In [2]: f = open('bigfile', 'r')
In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:
In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s
Ouch! Fortunately s still fits in the 4 GB of memory of my MacBook Air, so let's benchmark findnth():
In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop
Clearly terrible performance. Let's see how the approach based on str.find() does:
In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop
Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, so that zero copies of the file are required:
In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop
There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s, compared to findnth's total of 47 s.
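For a self-contained illustration of this zero-copy usage, a tiny temporary file can stand in for bigfile (the path and file contents below are purely illustrative; note that under Python 3, mmap objects hold bytes, so the needle must be b'\n' and the function is written with range so it runs there too):

```python
import mmap
import os
import tempfile

def find_nth(s, x, n=0, overlap=False):
    # same algorithm as above, Python 3 compatible
    l = 1 if overlap else len(x)
    i = -l
    for _ in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

# A small temporary file stands in for 'bigfile'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"line one\nline two\nline three\n")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # find_nth() works on the mmap object directly - no copy of the file is made.
    print(find_nth(mm, b"\n", 1))  # 17, the index of the second newline
    mm.close()

os.remove(path)
```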
I found no cases where the str.find()-based approach was significantly worse than the str.split()-based approach, so at this point I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:
#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        /* only search the part of the buffer not yet scanned */
        b = memchr(b + 1, c, l - (b + 1 - buf));
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;
    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);
    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        /* "n" (Py_ssize_t) avoids truncating offsets beyond INT_MAX */
        return Py_BuildValue("n", (Py_ssize_t) _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("n", (Py_ssize_t) _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}
Here is the setup.py file:

from distutils.core import setup, Extension

module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])
Install as usual with python setup.py install. The C code plays at an advantage here, since it is limited to finding single characters, but let's see how fast it is:
In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop
In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop
In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop
In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop
Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring here.
In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).
There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.
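For readers who want to reproduce the pure-Python comparison without a 1.3 GB file, a minimal harness along these lines can be used (the synthetic data below is illustrative, the functions are written with range so the script also runs under Python 3, and absolute timings will of course differ from the ones above):

```python
import timeit

def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for _ in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

s = ("x" * 99 + "\n") * 100000  # ~10 MB of synthetic data, one newline per 100 chars
n = 50000                       # look for the 50001st newline

# both implementations must agree before timing them
assert findnth(s, "\n", n) == find_nth(s, "\n", n)

for name, fn in (("findnth", findnth), ("find_nth", find_nth)):
    print(name, timeit.timeit(lambda: fn(s, "\n", n), number=3))
```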