The fastest way to create a list of overlapping substrings from a string in Python

I am trying to create a list of all overlapping substrings of n-length in a given string.

For example, for n from 6 and the string "hereismystring" I would generate a list ["hereis", "ereism", "reismy", ..., "string"] . The trivial code that I am using now is as follows:

 n = 6 l = len(string) substrings = [string[i:(i + n)] for i in xrange(l - n + 1)] 

Simple enough. The problem is that I would like to speed this up (I have a lot of very long lines). Is there a faster way in Python? Would it even go down to Cython, given that Python string routines are anyway?

For reference, this method accepts about 100US on my machine (new Macbook Pro) for a string length of 500 and n out of 30.

Thanks for the help in advance!

+4
source share
2 answers

Taking a step back from the question of which Python coding method would be the fastest, I would approach the problem in different ways. Since all strings are the same length, and they are all taken from the same source string, why not just work with character ranges directly and not convert them to the correct strings? You would have avoided a lot of highlighting and copying, but you would need to tweak your code to know that each line has n characters.

In other words, just read the ranges from the source line directly when you want to work with a substring. You will work with the characters you want, as quickly as they can be pulled from the cache. You can express a โ€œsubstringโ€ as just an offset to the original string.

Sometimes, if you need ultra-fast performance, you need to leave familiar data structures. Just a thought.

+2
source

What about:

 >>> d = deque("hereismystring") >>> s = ''.join(d)[:6] >>> while not len(s) % 6: ... print s ... _ = d.popleft() ... s = ''.join(d)[:6] ... hereis ereism reismy eismys ismyst smystr mystri ystrin string >>> 

I believe deque is O (1) and lists are O (n)

+1
source

All Articles