Python: speed up this regex

    p = re.compile('>.*\n')
    p.sub('', text)

I want to delete all lines starting with '>'. I have a really huge file (3 GB) that I process in chunks of 250 MB, so the variable "text" is a 250 MB string. (I tried different chunk sizes, but the performance over the full file was the same.)

Now, can I somehow speed up this regular expression? I tried a multiline match, but it was much slower. Or is there an even better way?

(I already tried splitting the text into lines and filtering them like this, but that was also slower. I also tried a lambda instead of def del_line. This may not be working code, it's just from memory:)

    def del_line(x):
        return x[0] != '>'

    def func():
        ....
        text = file.readlines(chunksize)
        text = filter(del_line, text)
        ...

EDIT: As suggested in the comments, I also tried iterating over the lines one by one:

    text = []
    for line in file:
        if line[0] != '>':
            text.append(line)
    text = ''.join(text)

That is also slower: it needs ~12 seconds, while my regular expression needs ~7 seconds. (Yes, that is fast, but it should also work on slower machines.)

EDIT: Of course, I also tried str.startswith('>'); it was slower ...

2 answers

If you have the option of running grep as a subprocess, that is probably the most pragmatic choice.
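
If grep is available, a minimal sketch of that approach might look like the one below. The file names are hypothetical; 'grep -v' keeps only the lines that do not match, and '^>' anchors the match at the start of a line.

    import subprocess

    # Minimal sketch: let grep do the filtering and write the result to a new
    # file. The file names are hypothetical; adjust them to your setup.
    # Note: grep exits with status 1 when it selects no lines at all, so a
    # non-zero return code is not necessarily an error here.
    with open('input.txt', 'rb') as src, open('filtered.txt', 'wb') as dst:
        subprocess.call(['grep', '-v', '^>'], stdin=src, stdout=dst)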

If for some reason you cannot rely on grep, you can try implementing some of the "tricks" that make grep fast. You can read about them from the author himself here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

At the end of the article, the author summarizes the main points. The one that stands out for me the most:

Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!

The idea would be to load a large buffer into memory and iterate over it at the byte level instead of the line level. Only when you find a match do you look for the line boundaries and delete that line.
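
Purely as an illustration (not grep's actual internals), a byte-level version of that idea in Python could look like the sketch below. It assumes each 250 MB chunk starts at a line boundary; whether it actually beats re.sub in CPython is something you would have to measure.

    def drop_gt_lines(buf):
        # Sketch of the "don't break the input into lines" idea: jump from one
        # '>' line to the next with str.find instead of examining every line,
        # and copy everything in between untouched.
        out = []
        pos, n = 0, len(buf)
        while pos < n:
            if buf.startswith('>', pos):
                # A '>' line starts here: skip it entirely.
                end = buf.find('\n', pos)
                pos = n if end == -1 else end + 1
            else:
                # Keep everything up to the start of the next '>' line.
                nxt = buf.find('\n>', pos)
                if nxt == -1:
                    out.append(buf[pos:])
                    break
                out.append(buf[pos:nxt + 1])
                pos = nxt + 1
        return ''.join(out)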

You say that you need to run this on other machines. If it is an option and you are not already doing so, try running it on PyPy instead of CPython (the default interpreter). That may (or may not) improve the running time by a significant factor, depending on the nature of the program.

Also, as noted in some of the comments, benchmark against the actual grep to get a baseline for how fast you can reasonably go. If you are on Windows, it is easy enough to get it through Cygwin.


Isn't this faster?

    def cleanup(chunk):
        return '\n'.join(st for st in chunk.split('\n')
                         if not (st and st[0] == '>'))

EDIT: no, it is not faster. It is actually twice as slow.

Perhaps consider using a subprocess running a tool such as grep, as Ryan P. suggested. You could even use multiprocessing.
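
A minimal sketch of the multiprocessing idea, assuming Python 3 and hypothetical file names: the chunk reader keeps lines intact across chunk boundaries, and each worker applies a line-anchored version of the regex (your original un-anchored pattern could be swapped in instead).

    import re
    from multiprocessing import Pool

    _pattern = re.compile(r'(?m)^>.*\n?')

    def strip_gt_lines(chunk):
        # Remove every line that starts with '>' from one chunk.
        return _pattern.sub('', chunk)

    def read_chunks(path, size=250 * 1024 * 1024):
        # Hypothetical helper: yield chunks that end on line boundaries so no
        # line is split between two workers.
        with open(path) as f:
            leftover = ''
            while True:
                data = f.read(size)
                if not data:
                    if leftover:
                        yield leftover
                    return
                data = leftover + data
                cut = data.rfind('\n') + 1
                yield data[:cut]
                leftover = data[cut:]

    if __name__ == '__main__':
        with Pool() as pool, open('filtered.txt', 'w') as out:
            for cleaned in pool.imap(strip_gt_lines, read_chunks('input.txt')):
                out.write(cleaned)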

