I offer some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's `findnth()` (based on `str.split()`) and @tgamblin's / @Mark Byers' `find_nth()` (based on `str.find()`). I will also compare them with a C extension (`_find_nth.so`) to see how fast we can go. Here is `find_nth.py`:
```python
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i
```
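For readers on Python 3 (where `xrange` is gone), the same two functions can be sketched as follows, together with a few sanity checks. This is just a direct port of the code above, not a new algorithm:

```python
def findnth(haystack, needle, n):
    # split() based: split at most n+1 times, then recover the index of
    # the n-th occurrence from the length of the last fragment.
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth(s, x, n=0, overlap=False):
    # find() based: hop from match to match without ever copying s.
    l = 1 if overlap else len(x)
    i = -l
    for _ in range(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

assert findnth("hello world", "o", 1) == 7
assert find_nth("hello world", "o", 1) == 7
assert find_nth("aaaa", "aa", 1) == 2                 # non-overlapping matches
assert find_nth("aaaa", "aa", 1, overlap=True) == 1   # overlapping matches
assert find_nth("abc", "z") == -1                     # not found
```

Note the `overlap` flag: with `overlap=True`, consecutive matches may share characters, which is why the second `"aa"` in `"aaaa"` is found at index 1 rather than 2.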
Of course, performance matters most for large strings, so suppose we want to find the 1000001st newline (`'\n'`) in a 1.3 GB file called "bigfile". To save memory, we would like to work on an `mmap.mmap` object representation of the file:
```python
In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```
We run into the first problem with `findnth()` right away: `mmap.mmap` objects do not support `split()`. So we actually have to copy the whole file into memory:
```python
In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s
```
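The `split()` limitation is easy to confirm on a small scale. The following is a minimal sketch using a throwaway temporary file (Python 3 bytes semantics; the file contents are purely illustrative):

```python
import mmap
import os
import tempfile

# Write a small file so we have something to mmap.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'line one\nline two\nline three\n')

    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        # mmap objects expose find() but not split() ...
        assert hasattr(mm, 'find') and not hasattr(mm, 'split')

        # ... so find()-based scanning works with zero copies:
        assert mm.find(b'\n') == 8

        # whereas a split()-based approach first needs a full in-memory copy:
        s = mm[:]
        assert s.split(b'\n', 2)[1] == b'line two'

        mm.close()
finally:
    os.remove(path)
```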
Ouch! Fortunately, `s` still fits in the 4 GB of memory of my Macbook Air, so let's benchmark `findnth()`:
```python
In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop
```
Clearly a terrible performance. Let's see how the `str.find()` based approach does:
```python
In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop
```
Much better! Clearly, the problem with `findnth()` is that it is forced to copy the string during `split()`, which is already the second time we have copied the 1.3 GB of data around after `s = mm[:]`. Here comes the second advantage of `find_nth()`: we can use it on `mm` directly, so zero copies of the file are required:
```python
In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop
```
There appears to be a slight performance penalty for operating on `mm` rather than `s`, but this shows that `find_nth()` can get us an answer in 1.2 s, compared to `findnth`'s total of 47 s (29.9 s for the split plus 17.7 s for the copy).
I found no cases where the `str.find()` based `find_nth()` was significantly worse than the `str.split()` based `findnth()`, so at this point I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
In my testing, the version of `find_nth()` above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is `_find_nthmodule.c`:
```c
#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        b = memchr(b + 1, c, l);
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }

    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];

    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}
```
Here is the `setup.py`:
```python
from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])
```
Install as usual with `python setup.py install`. The C code is at an advantage here since it is limited to finding a single character, but let's see how fast it is:
```python
In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop
```
Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting that `_find_nth2()`, which is based on `string.h`'s `memchr()` library function, loses out against the straightforward implementation in `_find_nth()`: the additional "optimizations" in `memchr()` apparently backfire here...
In conclusion, the implementation in `findnth()` (based on `str.split()`) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it does not work on `mmap.mmap` objects at all. The implementation in `find_nth()` (based on `str.find()`) should be preferred in all circumstances (and therefore be the accepted answer to this question).
There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that a dedicated Python library function might be warranted.