Python Regular Expression Acceleration

I need to quickly extract text from HTML files. I use the following regular expressions instead of a full-fledged parser because I need to be fast, not exact (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are some good ways to speed up my process? I could implement some parts in C, but I wonder whether that will help, given that the time is spent inside re.sub, which I assume is already implemented efficiently.

    # Remove scripts, styles, tags, entities, and extraneous spaces:
    scriptRx = re.compile("<script.*?/script>", re.I)
    styleRx = re.compile("<style.*?/style>", re.I)
    tagsRx = re.compile("<[!/]?[a-zA-Z-]+[^<>]*>")
    entitiesRx = re.compile("&[0-9a-zA-Z]+;")
    spacesRx = re.compile("\s{2,}")
    ....
    text = scriptRx.sub(" ", text)
    text = styleRx.sub(" ", text)
    ....
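
For reference, here is a rough sketch of how each substitution can be timed on its own (sample.html is just a stand-in for one of the input files):

    import re
    import timeit

    text = open("sample.html").read()  # placeholder input file

    patterns = {
        "script":   re.compile("<script.*?/script>", re.I),
        "style":    re.compile("<style.*?/style>", re.I),
        "tags":     re.compile("<[!/]?[a-zA-Z-]+[^<>]*>"),
        "entities": re.compile("&[0-9a-zA-Z]+;"),
        "spaces":   re.compile(r"\s{2,}"),
    }

    # Time each substitution separately to see which one dominates.
    for name, rx in patterns.items():
        t = timeit.timeit(lambda: rx.sub(" ", text), number=10)
        print("%-9s %.3f s" % (name, t))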

Thanks!

+4
6 answers

First, use an HTML parser built for the job, such as BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
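
For example, a minimal text-extraction sketch (using the newer bs4 package name, which may postdate this answer):

    # pip install beautifulsoup4
    from bs4 import BeautifulSoup

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        # Remove <script> and <style> elements outright.
        for tag in soup(["script", "style"]):
            tag.decompose()
        # Collapse the remaining markup to plain text.
        return soup.get_text(" ", strip=True)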

Then you can pinpoint the remaining individual slow spots with the profiler:

http://docs.python.org/library/profile.html
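
For instance, a minimal sketch (clean_file is a hypothetical placeholder for whatever function does the substitutions):

    import cProfile
    import pstats

    # Profile one run and show the ten most expensive calls.
    cProfile.run("clean_file('sample.html')", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)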

And to learn about regular expressions, I found Mastering Regular Expressions very valuable, regardless of programming language:

http://oreilly.com/catalog/9781565922570

also:

How can I debug a regex in python?

Given the clarified use case, I'd say the above is not what you want for this question. My alternative recommendation: Regular Expression Acceleration in Python

+9

You process each file five times, so the first thing you should do (as Paul Sunwald said) is try to reduce that number by combining your regular expressions. I would also avoid reluctant quantifiers, which trade efficiency for convenience. Consider this regular expression:

 <script.*?</script> 

Every time the . consumes one more character, the engine first has to make sure </script> won't match at that position. That amounts to a negative lookahead at every position:

 <script(?:(?!</script>).)*</script> 

But we know the lookahead is pointless unless the next character is < , so we can adjust the regular expression accordingly:

 <script[^<]*(?:<(?!/script>)[^<]*)*</script> 

When I test them in RegexBuddy with this target string:

 <script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script> 

... the reluctant regular expression takes 173 steps to match, while the unrolled one takes only 28.
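
Python's re module doesn't report step counts, but a rough timing comparison shows the same effect (a sketch; absolute numbers will vary by machine and input):

    import re
    import timeit

    lazy     = re.compile(r"<script.*?</script>", re.I)
    unrolled = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

    target = ('<script type="text/javascript">'
              "var imagePath='http://sstatic.net/stackoverflow/img/';"
              "</script>")

    for name, rx in (("lazy", lazy), ("unrolled", unrolled)):
        t = timeit.timeit(lambda: rx.search(target), number=100000)
        print("%-9s %.3f s" % (name, t))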

Combining your first three regular expressions into one, you get this beast:

 <(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>) 

You may want to knock out the <HEAD> element while you're at it (i.e. (script|style|head) ).
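
In Python the combined pattern can be applied in one pass; a sketch (the \1 backreference carries over unchanged):

    import re

    # Scripts, styles, and ordinary tags removed in a single pass;
    # \1 refers back to whichever tag name group 1 captured.
    markupRx = re.compile(
        r"<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>"
        r"|[!/]?[a-zA-Z-]+[^<>]*>)",
        re.I)

    def strip_markup(text):
        return markupRx.sub(" ", text)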

I don't know what you're doing with the fourth regex, the one for character entities - are you just deleting them? I assume the fifth regex has to run separately, since some of the whitespace it cleans up is generated by the previous steps. But try it with just the first three regexes combined and see how the timings differ. That should tell you whether this approach is worth pursuing.

+4

One thing you can do is combine the script/style regular expressions using backreferences. Here is some sample data:

    $ cat sample
    <script>some stuff</script>
    <html>whatever
    </html>
    <style>some other stuff</style>

using perl:

    perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print $1; }' sample

It will match either script or style. Also, I second the recommendation of Mastering Regular Expressions - it's a great book.
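
Incidentally, the same backreference trick works in Python's re module; a line-by-line sketch:

    import re

    # \1 must repeat whatever group 1 captured, so <script>...</script>
    # and <style>...</style> match, but <script>...</style> does not.
    pairRx = re.compile(r"<(script|style)>.*?</\1>")

    for line in open("sample"):
        m = pairRx.search(line)
        if m:
            print(m.group(1))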

+1

The suggestion to use an HTML parser is a good one, since it will probably be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, because it builds a parse tree from the whole file and keeps everything in memory. For a terabyte of HTML you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser , which works at a lower level than BeautifulSoup, but I believe it is a stream parser, so it only holds a bit of the text at a time.
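
A minimal sketch of that approach (using the Python 3 module path html.parser, which may differ from the version current when this was written):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Streams text out of HTML, skipping <script> and <style> bodies."""

        def __init__(self):
            super().__init__()
            self.skip = 0
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1

        def handle_data(self, data):
            if not self.skip:
                self.chunks.append(data)

    # Feed the file in pieces so only a small buffer is in memory at a time.
    parser = TextExtractor()
    with open("giant.html") as f:
        for chunk in iter(lambda: f.read(65536), ""):
            parser.feed(chunk)
    text = " ".join(parser.chunks)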

+1

If your use case really is to extract a few things from each of millions of documents, then my answer above won't help. I'd recommend some heuristics, such as running a couple of "plain text" regexes on them first - something as simple as /script/ and /style/ to throw documents out quickly if you can. In fact, do you really need to check for the end tag at all? Isn't <style good enough? Leave validation to someone else. If the quick checks succeed, then put the rest into a single regular expression, like /<script|<style|\s{2,}|etc.../ , so it doesn't have to scan all that text once for each regex.
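
A sketch of that heuristic (the exact patterns are placeholders to adapt):

    import re

    # One combined pass instead of several; extend the alternation as needed.
    combinedRx = re.compile(r"<script|<style|\s{2,}")

    def quick_clean(text):
        # Cheap plain-substring test first; skip the regex machinery
        # entirely when a document has nothing expensive in it.
        if "<script" not in text and "<style" not in text:
            return text
        return combinedRx.sub(" ", text)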

+1

I would use a simple program with ordinary Python string partitioning, something like this - but note it has only been tested with one sample style file:

    ## simple filtering; assumes discarded tags are not nested
    ## inside other discarded tags
    start_tags = ('<style', '<script')
    end_tags = ('</style>', '</script>')

    ##print("input:\n%s" % open('giant.html').read())
    out = open('cleaned.html', 'w')
    end_tag = ''
    for line in open('giant.html'):
        line = ' '.join(line.split())
        if end_tag:
            if end_tag in line:
                # end of a discarded region; keep what follows it
                _, tag, end = line.partition(end_tag)
                if end.strip():
                    out.write(end)
                end_tag = ''
            continue  ## discard rest of line if no end tag found in line
        # indexes of the start tags that occur in this line
        found = (index for index, start_tag in enumerate(start_tags)
                 if start_tag in line)
        for index in found:
            start, tag, end = line.partition(start_tags[index])
            # drop until closing angle bracket of start tag
            tag, _, end = end.partition('>')
            # check if closing tag already in same line
            if end_tags[index] in end:
                _, tag, end = end.partition(end_tags[index])
                if end.strip():
                    out.write(end)
                end_tag = ''  # end tag reset after found
            else:
                end_tag = end_tags[index]  # no end tag on this line; discard the rest
        if not end_tag:
            out.write(line + '\n')
    out.close()
    ##print('result:\n%s' % open('cleaned.html').read())
0

Source: https://habr.com/ru/post/1316096/

