import re

p = re.compile('>.*\n')
text = p.sub('', text)
I want to delete all lines starting with '>'. I have a really huge file (3 GB), which is processed in pieces of 250 MB, so the variable "text" is a string 250 MB in size. (I tried different chunk sizes, but the performance over the full file was the same.)
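For context, the setup described above might look like the following sketch. The function name, file handles, and chunk size are my assumptions, not code from the question; the carry-over of a trailing partial line is also an assumption, added so that a '>' line split across a chunk boundary is still removed.

```python
import re

# Matches from any '>' to the end of that line (the pattern from the question).
pattern = re.compile('>.*\n')

# Assumed chunk size: 250 MB, as described above.
chunksize = 250 * 1024 * 1024

def strip_marked_lines(infile, outfile):
    """Hypothetical driver: process the file in chunks, dropping '>' lines."""
    while True:
        text = infile.read(chunksize)
        if not text:
            break
        # If the chunk ends mid-line, pull in the rest of that line so the
        # regex never sees a '>' line cut in half at a chunk boundary.
        if not text.endswith('\n'):
            text += infile.readline()
        outfile.write(pattern.sub('', text))
```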
Now, can I somehow speed this regular expression up? I tried a multi-line match, but it was much slower. Or are there even better ways?
(I already tried splitting the text into lines and filtering them like this; I also tried a lambda instead of def del_line. This may not be working code, it's just from memory:)

def del_line(x):
    return x[0] != '>'

def func():
    ...
    lines = file.readlines(chunksize)
    lines = filter(del_line, lines)
    ...
EDIT: As suggested in the comments, I also tried iterating line by line:
text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)
This is also slower: it needs ~12 seconds, while my regular expression needs ~7 seconds. (Yes, that is fast, but it should also work on slower machines.)
EDIT: Of course, I also tried str.startswith('>'); it was slower, too.
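To compare the variants reproducibly, a small micro-benchmark like the one below can be used. Everything here is an assumption for illustration: the synthetic data, the line contents, and the repetition count; real timings will depend on the actual file.

```python
import re
import timeit

# Synthetic input: alternating '>' header lines and data lines (assumed shape).
lines = ['>header %d\n' % i if i % 2 else 'ACGTACGT\n' for i in range(100000)]
text = ''.join(lines)
pattern = re.compile('>.*\n')

def with_regex():
    # The regex approach from the question.
    return pattern.sub('', text)

def with_filter():
    # Split into lines (keeping newlines) and keep non-'>' lines.
    return ''.join([ln for ln in text.splitlines(True) if ln[0] != '>'])

def with_loop():
    # Explicit loop, as in the second edit above.
    out = []
    for ln in text.splitlines(True):
        if ln[0] != '>':
            out.append(ln)
    return ''.join(out)

for f in (with_regex, with_filter, with_loop):
    print(f.__name__, timeit.timeit(f, number=3))
```

All three produce the same output on this data, so only the timings differ.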