Find the nth occurrence of a substring in a string

It seems like this should be pretty trivial, but I'm new to Python and want to do it in the most pythonic way.

I want to find the nth occurrence of a substring in a string.

There must be something equivalent to what I want to do, which is:

mystring.find("substring", 2nd)

How can you achieve this in Python?

+101
python string substring
Dec 10 '09 at 20:58
19 answers

I think an iterative approach would be the usual way.

Here's a string-splitting alternative that can often be useful for search-related tasks:

    def findnth(haystack, needle, n):
        parts = haystack.split(needle, n+1)
        if len(parts) <= n+1:
            return -1
        return len(haystack) - len(parts[-1]) - len(needle)
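To see why the split-based version works, it can help to look at the pieces directly. The sketch below is only an illustration: the helper name `explain_findnth` is made up here, and n is 0-based as in findnth above. After splitting at most n + 1 times, the last piece is whatever follows the wanted occurrence, so its length gives the index.

```python
def explain_findnth(haystack, needle, n):
    """Illustrative only: return (parts, index) for the 0-based nth occurrence."""
    # Split at most n + 1 times, so the last piece is the text after occurrence n.
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        # Fewer than n + 1 occurrences were found.
        return parts, -1
    # The occurrence ends exactly where the last piece begins.
    index = len(haystack) - len(parts[-1]) - len(needle)
    return parts, index


parts, index = explain_findnth("foo bar bar bar", "bar", 1)
print(parts)  # ['foo ', ' ', ' bar']
print(index)  # 8
```

Note that the split builds new strings for every piece, which matters for large inputs.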

And here's a quick one-liner (somewhat dirty, in that you have to choose some chaff that cannot match the needle):

 'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar') 
+56
Dec 10 '09 at 21:26

Here's a more Pythonic version of a simple iterative solution:

    def find_nth(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start+len(needle))
            n -= 1
        return start

Example:

    >>> find_nth("foofoofoofoo", "foofoo", 2)
    6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

    def find_nth_overlapping(haystack, needle, n):
        start = haystack.find(needle)
        while start >= 0 and n > 1:
            start = haystack.find(needle, start+1)
            n -= 1
        return start

Example:

    >>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
    3

This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

  • Simple is better than complex.
  • Flat is better than nested.
  • Readability counts.
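The two variants above differ only in how far the search advances after each hit, so they can be folded into one generator. This is a sketch rather than part of the original answer: the names `iter_occurrences` and `nth_occurrence` are made up here, and n is assumed to be 1 or greater.

```python
from itertools import islice


def iter_occurrences(haystack, needle, overlapping=False):
    """Yield the start index of each occurrence of needle, left to right."""
    if not needle:
        raise ValueError("needle must be non-empty")
    # Advance by one character to allow overlaps, else skip the whole match.
    step = 1 if overlapping else len(needle)
    start = haystack.find(needle)
    while start >= 0:
        yield start
        start = haystack.find(needle, start + step)


def nth_occurrence(haystack, needle, n, overlapping=False):
    """Return the index of the nth (1-based) occurrence, or -1 if there is none."""
    hits = iter_occurrences(haystack, needle, overlapping)
    return next(islice(hits, n - 1, None), -1)


print(nth_occurrence("foofoofoofoo", "foofoo", 2))                    # 6
print(nth_occurrence("foofoofoofoo", "foofoo", 2, overlapping=True))  # 3
print(nth_occurrence("foofoofoofoo", "foofoo", 4, overlapping=True))  # -1
```

Because the generator is lazy, the search stops as soon as the nth hit is found.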
+61
Dec 10 '09 at 21:45

This will find the second occurrence of the substring in the string.

    def find_2nd(string, substring):
        return string.find(substring, string.find(substring) + 1)

Edit: I haven't given much thought to performance, but a quick recursion can take care of finding the nth occurrence:

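    def find_nth(string, substring, n):
        if n == 1:
            return string.find(substring)
        previous = find_nth(string, substring, n - 1)
        if previous == -1:
            # Fewer than n - 1 occurrences: stop instead of restarting at index 0.
            return -1
        return string.find(substring, previous + 1)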
+31
Oct 26 '12 at 20:59

Understanding that regex isn't always the best solution, I would probably use it here:

    >>> import re
    >>> s = "ababdfegtduab"
    >>> [m.start() for m in re.finditer(r"ab",s)]
    [0, 2, 11]
    >>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence
    11
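Two details are easy to miss with the regex route: the needle is interpreted as a pattern unless it is escaped, and re.finditer only reports non-overlapping matches. The following sketch addresses both; the function name `regex_find_nth` and its `overlapping` flag are assumptions made for illustration, not something from the answer above.

```python
import re
from itertools import islice


def regex_find_nth(haystack, needle, n, overlapping=False):
    """Return the start of the nth (1-based) literal match of needle, or -1."""
    literal = re.escape(needle)  # treat needle as plain text, not as a pattern
    # A zero-width lookahead matches at every position where needle starts,
    # so overlapping occurrences are reported as well.
    pattern = "(?=" + literal + ")" if overlapping else literal
    matches = re.finditer(pattern, haystack)
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match is not None else -1


print(regex_find_nth("ababdfegtduab", "ab", 3))           # 11
print(regex_find_nth("a.b.c", ".", 2))                    # 3
print(regex_find_nth("aaaa", "aa", 2))                    # 2
print(regex_find_nth("aaaa", "aa", 2, overlapping=True))  # 1
```

Using islice instead of building the full list means matching stops once the nth hit has been produced.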
+19
Dec 10 '09 at 21:36

Here are some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

    def findnth(haystack, needle, n):
        parts= haystack.split(needle, n+1)
        if len(parts)<=n+1:
            return -1
        return len(haystack)-len(parts[-1])-len(needle)

    def find_nth(s, x, n=0, overlap=False):
        l = 1 if overlap else len(x)
        i = -l
        for c in xrange(n + 1):
            i = s.find(x, i + l)
            if i < 0:
                break
        return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

    In [1]: import _find_nth, find_nth, mmap
    In [2]: f = open('bigfile', 'r')
    In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

    In [4]: %time s = mm[:]
    CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
    Wall time: 17.7 s

Ouch! Fortunately, s still fits in the 4 GB of memory on my Macbook Air, so let's benchmark findnth():

    In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
    1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

    In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time the 1.3 GB of data gets copied around, after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, so that zero copies of the file are required:

    In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty for operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s, compared to a total of about 47 s for findnth() (the 17.7 s copy plus the 29.9 s search).
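The session above is Python 2. As a rough sketch of the same idea on Python 3, where mmap objects are searched with bytes rather than str, the following uses a tiny temporary file as a stand-in for 'bigfile'. The file contents are invented for the example, and find_nth is restated with range so the block runs on its own.

```python
import mmap
import os
import tempfile


def find_nth(s, x, n=0, overlap=False):
    """Return the index of the nth (0-based) occurrence of x in s, or -1."""
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)
        if i < 0:
            break
    return i


# Build a small stand-in for "bigfile": five lines of text.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\ntwo\nthree\nfour\nfive\n")
    path = tmp.name

try:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mmap objects take bytes needles in Python 3, hence b"\n".
            third_newline = find_nth(mm, b"\n", 2)
finally:
    os.remove(path)

print(third_newline)  # 13
```

The same function works unchanged on str, bytes and mmap objects, since all three provide a find() method that accepts a start offset.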

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
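For anyone who wants to repeat the comparison without IPython or a 1.3 GB file, a minimal timeit harness might look like the following. The input size and the function names `findnth_split` and `find_nth_find` are arbitrary choices for this sketch, and the timings it prints depend on the machine and Python version, so it should not be expected to reproduce the numbers above.

```python
import timeit


def findnth_split(haystack, needle, n):
    """str.split() based; n is 0-based."""
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)


def find_nth_find(haystack, needle, n):
    """str.find() based; n is 0-based."""
    step = len(needle)
    i = -step
    for _ in range(n + 1):
        i = haystack.find(needle, i + step)
        if i < 0:
            break
    return i


# Synthetic input: 200,000 short lines (an arbitrary size picked for the sketch).
text = "some line of text\n" * 200_000
n = 100_000

# Both functions should agree before their timings are compared.
assert findnth_split(text, "\n", n) == find_nth_find(text, "\n", n)

for func in (findnth_split, find_nth_find):
    seconds = min(timeit.repeat(lambda: func(text, "\n", n), number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.4f} s")
```

Taking the minimum over a few repeats is the usual way to reduce noise from other processes.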

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:

    #include <Python.h>
    #include <string.h>

    off_t _find_nth(const char *buf, size_t l, char c, int n) {
        off_t i;
        for (i = 0; i < l; ++i) {
            if (buf[i] == c && n-- == 0) {
                return i;
            }
        }
        return -1;
    }

    off_t _find_nth2(const char *buf, size_t l, char c, int n) {
        const char *b = buf - 1;
        do {
            b = memchr(b + 1, c, l);
            if (!b) return -1;
        } while (n--);
        return b - buf;
    }

    /* mmap_object is private in mmapmodule.c - replicate beginning here */
    typedef struct {
        PyObject_HEAD
        char *data;
        size_t size;
    } mmap_object;

    typedef struct {
        const char *s;
        size_t l;
        char c;
        int n;
    } params;

    int parse_args(PyObject *args, params *P) {
        PyObject *obj;
        const char *x;

        if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
            return 1;
        }
        PyTypeObject *type = Py_TYPE(obj);

        if (type == &PyString_Type) {
            P->s = PyString_AS_STRING(obj);
            P->l = PyString_GET_SIZE(obj);
        } else if (!strcmp(type->tp_name, "mmap.mmap")) {
            mmap_object *m_obj = (mmap_object*) obj;
            P->s = m_obj->data;
            P->l = m_obj->size;
        } else {
            PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
            return 1;
        }
        P->c = x[0];
        return 0;
    }

    static PyObject* py_find_nth(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
        params P;
        if (!parse_args(args, &P)) {
            return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
        } else {
            return NULL;
        }
    }

    static PyMethodDef methods[] = {
        {"find_nth", py_find_nth, METH_VARARGS, ""},
        {"find_nth2", py_find_nth2, METH_VARARGS, ""},
        {0}
    };

    PyMODINIT_FUNC init_find_nth(void) {
        Py_InitModule("_find_nth", methods);
    }

Here is the setup.py :

    from distutils.core import setup, Extension
    module = Extension('_find_nth', sources=['_find_nthmodule.c'])
    setup(ext_modules=[module])

Install as usual with python setup.py install. The C code is at an advantage here since it is limited to finding single characters, but let's see how fast it is:

    In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
    1 loops, best of 3: 218 ms per loop

    In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
    1 loops, best of 3: 216 ms per loop

    In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
    1 loops, best of 3: 307 ms per loop

    In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
    1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference at the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on the memchr() library function from string.h, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() apparently backfire...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost 4 times faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.

+17
May 05 '14 at 18:16

I would probably do something similar using the find function, which takes an index parameter:

    def find_nth(s, x, n):
        # Start at -len(x) so that the first search begins at index 0.
        i = -len(x)
        for _ in range(n):
            i = s.find(x, i + len(x))
            if i == -1:
                break
        return i

    print(find_nth('bananabanana', 'an', 3))

This is not particularly Pythonic, I think, but it is simple. You could do this using recursion:

    def find_nth(s, x, n, i = 0):
        i = s.find(x, i)
        if n == 1 or i == -1:
            return i
        else:
            return find_nth(s, x, n - 1, i + len(x))

    print(find_nth('bananabanana', 'an', 3))

This is a functional way to solve it, but I do not know if it makes it more Pythonic.

+6
Dec 10 '09 at 21:14

The easiest way?

    text = "This is a test from a test ok"
    firstTest = text.find('test')
    print(text.find('test', firstTest + 1))
+5
Sep 02 '15 at 15:32

Here's another version of re + itertools that should work when looking for either str or RegexpObject . I will freely admit that this is probably over-engineered, but for some reason it entertained me.

    import itertools
    import re

    def find_nth(haystack, needle, n = 1):
        """
        Find the starting index of the nth occurrence of ``needle`` in
        ``haystack``.

        If ``needle`` is a ``str``, this will perform an exact substring
        match; if it is a ``RegexpObject``, this will perform a regex
        search.

        If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
        ``needle`` doesn't appear in ``haystack`` ``n`` times,
        return ``-1``.

        Arguments
        ---------
        * ``needle`` the substring (or a ``RegexpObject``) to find
        * ``haystack`` is a ``str``
        * ``n`` is an ``int`` indicating which occurrence to find;
          defaults to ``1``

        >>> find_nth("foo", "o", 1)
        1
        >>> find_nth("foo", "o", 2)
        2
        >>> find_nth("foo", "o", 3)
        -1
        >>> find_nth("foo", "b")
        -1
        >>> import re
        >>> either_o = re.compile("[oO]")
        >>> find_nth("foo", either_o, 1)
        1
        >>> find_nth("FOO", either_o, 1)
        1
        """
        if (hasattr(needle, 'finditer')):
            matches = needle.finditer(haystack)
        else:
            matches = re.finditer(re.escape(needle), haystack)
        start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
        try:
            return next(start_here)[1].start()
        except StopIteration:
            return -1
+2
Dec 11 '09 at 15:06

Here is another approach using re.finditer.
The difference is that it only looks into the haystack as far as necessary.

    from re import finditer
    from itertools import dropwhile

    needle = 'an'
    haystack = 'bananabanana'
    n = 2  # counted from 0 here, because enumerate starts at 0
    next(dropwhile(lambda x: x[0] < n, enumerate(finditer(needle, haystack))))[1].start()
+1
Dec 10 '09 at 21:45
    >>> s = "abcdefabcdefababcdef"
    >>> j = 0
    >>> for n, i in enumerate(s):
    ...     if s[n:n+2] == "ab":
    ...         print(n, i)
    ...         j = j + 1
    ...         if j == 2: print("2nd occurrence at index position:", n)
    ...
    0 a
    6 a
    2nd occurrence at index position: 6
    12 a
    14 a
+1
Dec 11 '09 at 0:22

This will give you an array of the starting indices for matches in yourstring:

    import re
    indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry will be:

    n = 2
    nth_entry = indices[n-1]

Of course, you have to be careful with the index bounds. You can get the number of matches in yourstring as follows:

 num_instances = len(indices) 
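The bounds check mentioned above could be wrapped up as follows. This is only a sketch, and the function name `nth_match_start` is made up for the example. Checking the lower bound matters too, because indices[n-1] with n = 0 would silently return the last match through negative indexing.

```python
import re


def nth_match_start(pattern, text, n):
    """Return the start of the nth (1-based) regex match, or -1 if out of range."""
    indices = [m.start() for m in re.finditer(pattern, text)]
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return -1


print(nth_match_start(":", "a:b:c", 2))  # 3
print(nth_match_start(":", "a:b:c", 3))  # -1
print(nth_match_start(":", "a:b:c", 0))  # -1
```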
+1
Jan 13 '17 at 2:19

Building on modle13's answer, but without the re module dependency.

    def iter_find(haystack, needle):
        return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I wish this were a built-in string method.

    >>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
    [5, 6, 24, 34, 42]
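One thing to be aware of is that haystack[i:] creates a new string at every position, which gets expensive for long inputs. str.startswith accepts a start offset, so a variant that avoids the copies could look like this (the name `iter_find_no_copy` is just for illustration):

```python
def iter_find_no_copy(haystack, needle):
    """Return every index at which needle starts, without slicing haystack."""
    # str.startswith accepts a start offset, so no haystack[i:] copy is made.
    return [i for i in range(len(haystack)) if haystack.startswith(needle, i)]


print(iter_find_no_copy("http://stackoverflow.com/questions/1883980/", "/"))
# [5, 6, 24, 34, 42]
```

Like the original, this reports overlapping occurrences, since every position is tested.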
+1
Apr 09 '17 at 0:06
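    # Return -1 if the nth occurrence of substr (0-indexed) does not exist,
    # else return its index.
    def find_nth(s, substr, n):
        i = -1  # so that the first search starts at index 0
        while n >= 0:
            n -= 1
            i = s.find(substr, i + 1)
            if i == -1:
                break  # without this, the search would wrap around to the start
        return i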
+1
Jan 17 '18 at 21:36

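The replace one-liner is great, but it only works because XXX and bar have the same length.

A good and general def would be:

    def findN(s, sub, N, replaceString="XXX"):
        pos = s.replace(sub, replaceString, N - 1).find(sub)
        # Undo the length shift from the N - 1 replacements; keep -1 for "not found".
        return -1 if pos == -1 else pos - (len(replaceString) - len(sub)) * (N - 1)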
0
Apr 17 '13 at 22:53

Providing another "complex" solution using split and join.

In your example, we can use

 len("substring".join([s for s in ori.split("substring")[:2]])) 
0
Mar 31 '15 at 5:40

This is the answer you really want:

    def Find(String, ToFind, Occurence=1):
        index = 0
        count = 0
        while index <= len(String):
            # Slicing never raises IndexError, so no try/except is needed.
            if String[index:index + len(ToFind)] == ToFind:
                count += 1
                if count == Occurence:
                    return index
            index += 1
        # Note: False compares equal to 0, so test the result with "is False".
        return False
0
Jul 19 '16 at 18:53

A solution without using loops and recursion.

Put the required pattern in the compile call and the desired occurrence in the variable n; the last statement prints the starting index of the nth occurrence of the pattern in the given string. The iterator returned by finditer is converted to a list so that the nth match can be accessed directly.

    import re

    n = 2
    sampleString = "this is history"
    pattern = re.compile("is")
    matches = pattern.finditer(sampleString)
    # List indices are 0-based, so the nth occurrence is at position n - 1.
    print(list(matches)[n - 1].span()[0])
0
Jun 20 '19 at 11:36

Here is my solution for finding the nth occurrence of b in string a:

    from functools import reduce

    def findNth(a, b, n):
        return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. If n is 0 or too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

    >>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
    7
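Since the reduce expression is fairly dense, here is my reading of it unrolled into an explicit loop. The name `find_nth_loop` is invented for this sketch. The accumulator x holds the index of the previous hit, and the test y > x + 1 can only be true once a search has failed and reset x to -1, which keeps the result at -1 for the remaining iterations.

```python
def find_nth_loop(a, b, n):
    """Loop form of the reduce one-liner: index of the nth (1-based) b in a, or -1."""
    x = -1  # index of the previous hit; -1 before the first search
    for y in range(n):
        if y > x + 1:
            # Only reachable when an earlier search missed and left x at -1.
            x = -1
        else:
            x = a.find(b, x + 1)
    return x


print(find_nth_loop("bibarbobaobaotang", "b", 4))   # 7
print(find_nth_loop("bibarbobaobaotang", "b", 0))   # -1
print(find_nth_loop("bibarbobaobaotang", "b", 99))  # -1
```

Because each search resumes at x + 1, overlapping occurrences are counted separately.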
0
Jul 15 '19 at 21:10

What about:

    import os

    c = os.getcwd().split('\\')
    print('\\'.join(c[0:-2]))
-1
Jun 13 '16 at 16:01