Python efficient way to check if a very large string contains a substring

Python is not the best language, so I am not so good at the most effective solutions to some of my problems. I have a very large line (from a 30 MB file), and I need to check if this file contains a smaller substring (this line is only a few tens of characters). The way I do it now is:

if small_string in large_string: # logic here 

But this seems very inefficient because it checks every possible sequence of characters in the file. I know that there will only be an exact match on a new line, so it would be better to read the file as a list and repeat this list so that it matches?

EDIT: To clear up some confusion regarding the "match for newline only", here is an example:

 small_string = "This is a line" big_string = "This is a line\nThis is another line\nThis is yet another" 

If I'm not mistaken, the in keyword will check all sequences, not just every line.

+4
source share
8 answers

You can use one of these algorithms:

Although I find KMP more efficient, it is harder to implement. The first link includes some pseudo-code, which should make it very simple to implement in python.

you can search for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm

+3
source

Is it really very slow? You are talking about a line of 30 MB; let him try it with an even larger line:

 In [12]: string="agu82934u"*50*1024*1024+"string to be found" In [13]: len(string) Out[13]: 471859218 In [14]: %timeit "string to be found" in string 1 loops, best of 3: 335 ms per loop In [15]: %timeit "string not to be found" in string 1 loops, best of 3: 200 ms per loop 

I would not say that 335 ms is looking for a substring in a 450 MB line a lot of time.

+9
source

How slow is too slow? I just ran a a in b test for a unique line at the end of a line of 170 MB. It ended before my finger left the Enter key.

+8
source

I do not understand how to make it more optimal compared, to be honest. But you can use less memory and lose less time with I / O if you read the file line by line:

 has_small_string = False for line in large_file: if small_string in line: has_small_string = True break if has_small_string: # ... Your stuff here ... 

This is not a revolutionary improvement and may be even less useful if you really need a large line in memory, but it can be useful, so I post here :)

+4
source

If you just want to check if this substring exists,

 for line in open("file"): if substring in line: print "exists" sys.exit() # or break to do some other stuff 
+2
source

Do you mean that only the full string will match? (your EDIT: matching with a newline example seems to be)

Then I guess that

 for line in open('file').readlines(): if line==small_string: return True return False 

IE, using ==, is faster than "in" - possibly. I would not be surprised if the base implementation catches the case when the search string and search string are the same length and just try == itself.

woudl will be better.

+1
source
 small_string = "This is a line" big_string = "This is a line This is another line\nThis is yet another" test= big_string.split("This is a line" ,1) if len(test)==2: print "it`s there" elif len(test)!=2: print "it`s not" 
0
source

I would rely on a quick implementation by someone else:

 import subprocess from subprocess import STDOUT import os ... with open(os.devnull, 'w') as devnull: if subprocess.call('grep %s "%s"' % (smallstring, file), shell=True, stdout=devnull, stderr=STDOUT) == 0: pass #do stuff 

Will not work with windows.

edit: I'm worried that grep returns 0 if it finds something or not. But now I don’t have a shell, so I can’t test it.

-1
source

All Articles