Python efficient way to check if a very large string contains a substring

Question

Python efficient way to check if a very large string contains a substring

Python is not the best language, so I am not so good at the most effective solutions to some of my problems. I have a very large line (from a 30 MB file), and I need to check if this file contains a smaller substring (this line is only a few tens of characters). The way I do it now is:

if small_string in large_string: # logic here

But this seems very inefficient because it checks every possible sequence of characters in the file. I know that there will only be an exact match on a new line, so it would be better to read the file as a list and repeat this list so that it matches?

EDIT: To clear up some confusion regarding the "match for newline only", here is an example:

 small_string = "This is a line" big_string = "This is a line\nThis is another line\nThis is yet another"

If I'm not mistaken, the in keyword will check all sequences, not just every line.

+4

performance python

Jon martin Aug 24 '11 at 11:57

source share

8 answers

Is it really very slow? You are talking about a line of 30 MB; let him try it with an even larger line:

 In [12]: string="agu82934u"*50*1024*1024+"string to be found" In [13]: len(string) Out[13]: 471859218 In [14]: %timeit "string to be found" in string 1 loops, best of 3: 335 ms per loop In [15]: %timeit "string not to be found" in string 1 loops, best of 3: 200 ms per loop

I would not say that 335 ms is looking for a substring in a 450 MB line a lot of time.

+9

Michał Bentkowski Aug 24 '11 at 12:35

source share

How slow is too slow? I just ran a a in b test for a unique line at the end of a line of 170 MB. It ended before my finger left the Enter key.

+8

Marcelo cantos Aug 24 '11 at 12:09

source share

I do not understand how to make it more optimal compared, to be honest. But you can use less memory and lose less time with I / O if you read the file line by line:

 has_small_string = False for line in large_file: if small_string in line: has_small_string = True break if has_small_string: # ... Your stuff here ...

This is not a revolutionary improvement and may be even less useful if you really need a large line in memory, but it can be useful, so I post here :)

+4

brandizzi Aug 24 '11 at 12:07

source share

If you just want to check if this substring exists,

 for line in open("file"): if substring in line: print "exists" sys.exit() # or break to do some other stuff

+2

ghostdog74 Aug 24 '11 at 12:09

source share

Do you mean that only the full string will match? (your EDIT: matching with a newline example seems to be)

Then I guess that

 for line in open('file').readlines(): if line==small_string: return True return False

IE, using ==, is faster than "in" - possibly. I would not be surprised if the base implementation catches the case when the search string and search string are the same length and just try == itself.

woudl will be better.

+1

Matt warren Aug 24 '11 at 13:46

source share

 small_string = "This is a line" big_string = "This is a line This is another line\nThis is yet another" test= big_string.split("This is a line" ,1) if len(test)==2: print "it`s there" elif len(test)!=2: print "it`s not"

0

Mahmoud youssef Nov 05 '15 at 12:49

source share

I would rely on a quick implementation by someone else:

 import subprocess from subprocess import STDOUT import os ... with open(os.devnull, 'w') as devnull: if subprocess.call('grep %s "%s"' % (smallstring, file), shell=True, stdout=devnull, stderr=STDOUT) == 0: pass #do stuff

Will not work with windows.

edit: I'm worried that grep returns 0 if it finds something or not. But now I don’t have a shell, so I can’t test it.

-1

rplnt Aug 24 '11 at 12:12

source share

Pedro loureiro · Accepted Answer · 2011-08-24T12:02:11+0000

You can use one of these algorithms:

Rabin-Karp String Search Algorithm
Knuth-Morris-Pratt algorithm (aka KMP) see implementation here

Although I find KMP more efficient, it is harder to implement. The first link includes some pseudo-code, which should make it very simple to implement in python.

you can search for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm

Python efficient way to check if a very large string contains a substring

More articles: