I offer some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's `findnth()` (based on `str.split()`) and @tgamblin's / @Mark Byers' `find_nth()` (based on `str.find()`). I will also compare them with a C extension (`_find_nth.so`) to see how fast we can go. Here is `find_nth.py`:
```python
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
```
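For readers on Python 3 (where `xrange` is gone), the same two functions can be sketched as follows, together with a few sanity checks. This is just a direct port of the code above, not a new algorithm:

```python
def findnth(haystack, needle, n):
    # split() based: split at most n+1 times, then recover the index of
    # the n-th occurrence from the length of the last fragment.
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    # find() based: hop from match to match without ever copying s.
    l = 1 if overlap else len(x)
    i = -l
    for _ in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

assert findnth("hello world", "o", 1) == 7
assert find_nth("hello world", "o", 1) == 7
assert find_nth("aaaa", "aa", 1) == 2                 # non-overlapping matches
assert find_nth("aaaa", "aa", 1, overlap=True) == 1   # overlapping matches
assert find_nth("abc", "z") == -1                     # not found
```

Note the `overlap` flag: with `overlap=True`, consecutive matches may share characters, which is why the second `"aa"` in `"aaaa"` is found at index 1 rather than 2.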
Of course, performance matters most for large strings, so suppose we want to find the 1000001st newline (`'\n'`) in a 1.3 GB file called "bigfile". To save memory, we would like to work on an `mmap.mmap` object representation of the file:
```python
In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```
We run into the first problem with `findnth()` right away: `mmap.mmap` objects do not support `split()`. So we actually have to copy the whole file into memory:
```python
In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s
```
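The `split()` limitation is easy to confirm on a small scale. The following is a minimal sketch using a throwaway temporary file (Python 3 bytes semantics; the file contents are purely illustrative):

```python
import mmap
import os
import tempfile

# Write a small file so we have something to mmap.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'line one\nline two\nline three\n')

    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        # mmap objects expose find() but not split() ...
        assert hasattr(mm, 'find') and not hasattr(mm, 'split')

        # ... so find()-based scanning works with zero copies:
        assert mm.find(b'\n') == 8

        # whereas a split()-based approach first needs a full in-memory copy:
        s = mm[:]
        assert s.split(b'\n', 2)[1] == b'line two'

        mm.close()
finally:
    os.remove(path)
```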
Ouch! Fortunately, `s` still fits in the 4 GB of memory of my Macbook Air, so let's benchmark `findnth()`:
```python
In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop
```
Clearly a terrible performance. Let's see how the `str.find()` based approach does:
```python
In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop
```
Much better! Clearly, the problem with `findnth()` is that it is forced to copy the string during `split()`, which is already the second time we have copied the 1.3 GB of data around after `s = mm[:]`. Here comes the second advantage of `find_nth()`: we can use it on `mm` directly, so zero copies of the file are required:
```python
In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop
```
There appears to be a slight performance penalty for operating on `mm` rather than `s`, but this shows that `find_nth()` can get us an answer in 1.2 s, compared to `findnth`'s total of 47 s (29.9 s for the split plus 17.7 s for the copy).
I found no cases where the `str.find()` based `find_nth()` was significantly worse than the `str.split()` based `findnth()`, so at this point I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
In my testing, the version of `find_nth()` above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is `_find_nthmodule.c`:
```c
#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        b = memchr(b + 1, c, l);
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }

    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];

    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}
```
Here is the `setup.py`:
```python
from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])
```
Install as usual with `python setup.py install`. The C code is at an advantage here since it is limited to finding a single character, but let's see how fast it is:
```python
In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop
```
Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting that `_find_nth2()`, which is based on `string.h`'s `memchr()` library function, loses out against the straightforward implementation in `_find_nth()`: the additional "optimizations" in `memchr()` apparently backfire here...
In conclusion, the implementation in `findnth()` (based on `str.split()`) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it does not work on `mmap.mmap` objects at all. The implementation in `find_nth()` (based on `str.find()`) should be preferred in all circumstances (and therefore be the accepted answer to this question).
There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that a dedicated Python library function might be warranted.