C#: Lots of regular expressions on strings - too much memory

Basically, what I would like to do is run several (15-25) regular expression replaces over a single string with the best possible memory management.

Overview: a text file (sometimes HTML) is streamed via FTP and appended to a StringBuilder to build one very large string. File sizes range from 300 KB to 30 MB.
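Roughly, the download is the standard stream-and-append pattern; a minimal sketch, assuming FtpWebRequest, with a placeholder URL and no error handling:

    using System.IO;
    using System.Net;
    using System.Text;

    // Sketch: stream an FTP file into one large string. The URL is a
    // placeholder; error handling and encoding detection are omitted.
    static string DownloadToString(string url)
    {
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(url);
        request.Method = WebRequestMethods.Ftp.DownloadFile;

        StringBuilder sb = new StringBuilder();
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                sb.Append(buffer, 0, read); // the whole file accumulates here
            }
        }
        return sb.ToString(); // ToString() makes a second full copy
    }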

The regular expressions are semi-complex and span several lines of the file (sections of a book, for example), so splitting at arbitrary line breaks, or running the replaces during each read cycle, is out of the question.

A sample replace:

    Regex re = new Regex("<A.*?>Table of Contents</A>", RegexOptions.IgnoreCase);
    source = re.Replace(source, "");

Memory skyrockets with every replace run. I know this is because strings are immutable in C# and every replace has to make a copy - even calling GC.Collect() does not help enough for a 30 MB file.

Any tips on a better approach, or a way to run multiple regular expressions over the memory in a read-only fashion? (At worst: make two copies, so 60 MB in memory, do a replace, then free one copy back down to 30 MB.)

Update:

There is no single simple answer, but for future readers: I used a combination of all the answers below to get this to an acceptable state:

  • If possible, split the string into chunks as the file is being read, looking for suitable break points - see manojlds' answer for this approach.

  • If you cannot split it as it streams in, at least split it afterwards if possible - see ChrisWue's answer for some external tools that can help pipe files through this process.

  • Optimize your regular expressions, avoid expensive constructs like the lazy .*? quantifier, and give the engine as little work to do as possible - see Sylverdrag's answer.

  • Combine regexes when possible; this reduces the number of replacement passes when the regexes do not depend on each other's output (useful in this case for cleaning bad input) - see Brian Reichle's answer for sample code.

Thanks everyone!

4 answers

Answer from Brian Reichle:

Depending on the nature of the regexes, you may be able to combine them into a single regular expression and use the Replace() overload that accepts a MatchEvaluator, which determines the replacement from the matched string.

 Regex re = new Regex("First Pattern|Second Pattern|Super(Mega)*Delux", RegexOptions.IgnoreCase); source = re.Replace(source, delegate(Match m) { string value = m.Value; if(value.Equals("first pattern", StringComparison.OrdinalIgnoreCase) { return "1st"; } else if(value.Equals("second pattern", StringComparison.OrdinalIgnoreCase) { return "2nd"; } else { return ""; } }); 

Of course, this falls apart if later patterns need to match the result of earlier replacements.

Answer from manojlds:

Take a look at this post, which talks about searching a stream with regular expressions rather than holding everything in a memory-hungry string:

http://www.developer.com/design/article.php/3719741/Building-a-Regular-Expression-Stream-Search-with-the-NET-Framework.htm
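The article's code is not reproduced here, but the underlying chunking idea (the first bullet of the update above) looks roughly like this. This is a sketch that assumes "</p>" is a safe split point in your documents; it breaks if any pattern can span a boundary:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    // Sketch: run the replaces chunk by chunk, only cutting at an
    // assumed-safe boundary so no match straddles two chunks.
    static void ProcessInChunks(TextReader reader, TextWriter writer, Regex re)
    {
        const string Boundary = "</p>";
        StringBuilder pending = new StringBuilder();
        char[] buffer = new char[64 * 1024];
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending.Append(buffer, 0, read);
            string text = pending.ToString();
            int cut = text.LastIndexOf(Boundary, StringComparison.OrdinalIgnoreCase);
            if (cut >= 0)
            {
                cut += Boundary.Length;
                writer.Write(re.Replace(text.Substring(0, cut), ""));
                pending.Clear();
                pending.Append(text, cut, text.Length - cut);
            }
        }
        writer.Write(re.Replace(pending.ToString(), "")); // the tail
    }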

Answer from Sylverdrag:

I have a pretty similar situation.

Use the compiled option for your regular expressions:

    source = Regex.Replace(source, pattern, replace, RegexOptions.Compiled);

Depending on your situation, this can significantly affect speed.

Not a complete solution, especially for files larger than 3-4 MB.
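On top of that, it may help to construct each Regex once and reuse it across files, so the compilation cost is paid once per pattern rather than once per call. A sketch, reusing the pattern from the question:

    using System.Text.RegularExpressions;

    // Sketch: build each pattern once as a static field and reuse it,
    // instead of re-specifying pattern + options on every call.
    static readonly Regex TocLink = new Regex(
        "<A[^<>]*>Table of Contents</A>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Per file:
    // source = TocLink.Replace(source, "");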

If the regular expressions themselves are under your control (I can't tell from here), you should optimize them wherever possible and avoid costly operations. For example, avoid lazy quantifiers, and avoid lookahead and lookbehind.

Instead of using:

 <a.*?>xxx 

use:

 <a[^<>]*>xxx 

The reason is that the lazy .*? quantifier forces the regex engine to try the rest of the expression at every character position, while [^<>] only has to compare the current character against < and >, stopping as soon as it hits one. On a large file this can be the difference between half a second and the application freezing.

This does not completely solve the problem, but it should help.
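To verify the difference on your own data, a quick timing harness along these lines should show the gap (a sketch; the numbers depend entirely on the input):

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    // Sketch: time the lazy quantifier against the character class.
    static void ComparePatterns(string source)
    {
        Regex lazy    = new Regex("<a.*?>",    RegexOptions.IgnoreCase);
        Regex charCls = new Regex("<a[^<>]*>", RegexOptions.IgnoreCase);

        Stopwatch sw = Stopwatch.StartNew();
        lazy.Replace(source, "");
        Console.WriteLine("lazy .*?   : {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        charCls.Replace(source, "");
        Console.WriteLine("[^<>] class: {0} ms", sw.ElapsedMilliseconds);
    }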

Answer from ChrisWue:

Assuming the documents you download have some kind of structure, you might be better off writing a parser that breaks the large string up into that structure, and then working on the smaller pieces.

One problem with one large string is that any object over 85,000 bytes is considered a large object and goes on the large object heap, which is not compacted; this can lead to unexpected out-of-memory situations.

Another option would be to pipe it through an external tool like sed or awk.
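For example, the Table of Contents replace from the question could be streamed through sed without ever holding the whole file in the managed heap. This is a sketch, assuming GNU sed (for the case-insensitive I flag) and placeholder file names:

    using System.Diagnostics;
    using System.IO;

    // Sketch: pipe the file through sed and capture its output.
    var psi = new ProcessStartInfo
    {
        FileName = "sed",
        Arguments = "-e \"s/<A[^>]*>Table of Contents<\\/A>//Ig\" input.html",
        RedirectStandardOutput = true,
        UseShellExecute = false
    };

    using (Process sed = Process.Start(psi))
    using (StreamWriter output = new StreamWriter("output.html"))
    {
        string line;
        while ((line = sed.StandardOutput.ReadLine()) != null)
        {
            output.WriteLine(line);
        }
        sed.WaitForExit();
    }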

