Fast text preprocessing

My project involves a lot of text processing, and I have found that preprocessing can be very slow, so I would like to ask if you know how to optimize my code. The pipeline is as follows:

get HTML page → (convert to plain text → stemming → remove stop words) → further text processing

The preprocessing steps are in parentheses. The application runs in about 10.265 s, but preprocessing takes 9.18 s of that! This is the time to preprocess 50 HTML pages (excluding loading).

I use the HtmlAgilityPack library to convert HTML to plain text. That part is fast: it takes about 2.5 ms per document.

The problem comes later. Stemming a single document takes up to 120 ms. Unfortunately, these HTML pages are in Polish, and there is no stemmer for Polish in C#. I know of only 2 free ones, both written in Java: stempel and morfologic. I precompiled stempel.jar into a stempel.dll using IKVM, so there is not much more I can do there.

Removing stop words also takes a lot of time (~70 ms per document). It is done as follows:

 result = Regex.Replace(text.ToLower(),
     @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
 while (stopwords.MoveNext())
 {
     string stopword = stopwords.Current.ToString();
     result = Regex.Replace(result, "(\\b" + stopword + "\\b)", " ");
 }
 return result;

First I delete all numbers, special characters, and 1- and 2-letter words. Then, in the loop, I delete the stop words. There are about 270 stop words.

Is it possible to do it faster?

Edit:

What I want to do is delete everything that is not a word of more than two letters. So I want to strip out all special characters (including numbers and ".", ",", "?", "!", etc.) and the stop words. I only need pure words that I can use for data mining.

+6
c# regex text-processing
6 answers

Well, I know that SO is not an ordinary forum, and maybe I should not answer my own question, but I want to share my results.

Finally, thanks to you guys, I was able to optimize my text preprocessing considerably. First of all, I simplified the long regular expression from my question (following Josh Kelly's answer):

  [0-9]|[^\w]|(\b\w{1,2}\b)

It does the same as the original but is much simpler. Then, following Josh Kelly's suggestion, I compiled this regex into an assembly; I found a good example of compiling expressions into assemblies. I did this because this regular expression is used many, many times. After reading several articles on compiled regexes, this was my solution. I also dropped the final expression that ran after stop-word removal (it had no real effect).
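For anyone who wants to try the same trick, here is a minimal sketch using RegexOptions.Compiled (the class name and sample text are my own; Regex.CompileToAssembly can be used instead to persist the compiled pattern into a separate DLL, as I did):

```csharp
using System;
using System.Text.RegularExpressions;

class CompiledRegexDemo
{
    // RegexOptions.Compiled makes .NET emit IL for the pattern once,
    // which pays off when the same regex is applied many times
    static readonly Regex Cleaner =
        new Regex(@"[0-9]|[^\w]|(\b\w{1,2}\b)", RegexOptions.Compiled);

    public static string Clean(string input)
    {
        // replace digits, non-word characters and 1-2 letter words with spaces
        return Cleaner.Replace(input, " ").Trim();
    }

    static void Main()
    {
        Console.WriteLine(Clean("ab abc 123 x!")); // prints "abc" - only the 3+ letter word survives
    }
}
```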

With this, the runtime on a 12 KB text file was ~15 ms. That is for the expression above alone.

The last step was the stop words. I decided to benchmark 3 different options (runtimes are for the same 12 KB text file).

One big regex

with all the stop words, compiled into an assembly (mquander's suggestion). Nothing special to add here.

  • Lead time: ~ 215ms

String.Replace()

People say it can be faster than Regex, so for each stop word I used the string.Replace() method. That is a lot of loop iterations, with the result:

  • Runtime: ~65 ms

LINQ

the approach submitted by LBushkin. Nothing more to say.

  • Runtime: ~2.5 ms

I can only say: wow. Just compare the runtime of the first option with the last! Thanks a lot, LBushkin!

+3

Iteratively replacing words will be the biggest bottleneck in your implementation. On each iteration you scan the entire string for the stop word, and then the replace operation has to allocate a new string and fill it with the post-replacement text. That will not be fast.

A more efficient approach is to tokenize the string and perform the filtering in a streaming fashion. Split the input into individual words on whitespace or delimiter characters. You can do this incrementally, so you do not need to allocate extra memory for it. For each word (token), look it up in a hash set of stop words; if there is a match, drop it, otherwise append the token unchanged to a StringBuilder that accumulates the final text. This approach should have O(n) performance, since it scans the string only once and uses a HashSet for the stop-word lookups.
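A rough sketch of that streaming idea (the method name and the simple char.IsLetter tokenization are my own assumptions, not code from this answer):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class StreamingFilter
{
    // Single pass over the input: words are scanned out character by character
    // and appended to one StringBuilder unless they are stop words or too short.
    public static string RemoveStopwords(string input, IEnumerable<string> stopwords)
    {
        var stopSet = new HashSet<string>(stopwords, StringComparer.OrdinalIgnoreCase);
        var result = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // skip delimiters
            while (i < input.Length && !char.IsLetter(input[i])) i++;
            int start = i;
            while (i < input.Length && char.IsLetter(input[i])) i++;
            if (i > start)
            {
                string word = input.Substring(start, i - start);
                if (word.Length > 2 && !stopSet.Contains(word))
                {
                    if (result.Length > 0) result.Append(' ');
                    result.Append(word);
                }
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveStopwords("Ala ma kota i psa", new[] { "ma", "i" }));
        // prints "Ala kota psa"
    }
}
```

Note that one substring per kept word is still allocated; as the answer says below, avoiding that entirely buys little, since comparison against the stop-word set needs a string anyway.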

Below is one approach that I would expect to perform better. Although it is not fully streaming (it uses String.Split(), which allocates an array of extra strings), it does all of the processing in one pass. Refining the code to avoid that extra allocation probably would not yield much improvement, since you would still need to extract substrings to compare against your stop words.

The code below returns the sequence of words, excluding all stop words and all words of two letters or fewer, as the result. It also uses case-insensitive stop-word comparison.

 public IEnumerable<string> SplitIntoWords(string input, IEnumerable<string> stopwords)
 {
     // use case-insensitive comparison when matching stopwords
     var comparer = StringComparer.InvariantCultureIgnoreCase;
     var stopwordsSet = new HashSet<string>(stopwords, comparer);
     var splitOn = new char[] { ' ', '\t', '\r', '\n' };

     // if your splitting is more complicated, you could use RegEx instead...
     // if this becomes a bottleneck, you could loop over the string using
     // string.IndexOf() - but you would still need to allocate an extra string
     // to perform the comparison, so it's unclear if that would be better or not
     var words = input.Split(splitOn, StringSplitOptions.RemoveEmptyEntries);

     // return all words longer than 2 letters that are not stopwords...
     return words.Where(w => !stopwordsSet.Contains(w) && w.Length > 2);
 }
+15

Instead of running a regex replacement in a loop, why not dynamically build one monster regular expression matching any of your stop words, and then run a single replacement, replacing them with nothing? Something like "\b(what|ok|yeah)\b" if your stop words are "what", "ok", and "yeah". That seems likely to be more efficient.
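A sketch of how such a combined pattern could be built (the stop-word list and helper name are illustrative; Regex.Escape guards against stop words that contain regex metacharacters):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class CombinedStopwordRegex
{
    public static string RemoveStopwords(string text, string[] stopwords)
    {
        // one alternation over all stop words, applied in a single replace pass
        string pattern = @"\b(?:" + string.Join("|", stopwords.Select(Regex.Escape)) + @")\b";
        var stopRegex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

        string result = stopRegex.Replace(text, " ");
        return Regex.Replace(result, @"\s+", " ").Trim();  // collapse leftover spaces
    }

    static void Main()
    {
        Console.WriteLine(RemoveStopwords("ok so what do you think", new[] { "what", "ok", "yeah" }));
        // prints "so do you think"
    }
}
```

In real use the combined Regex would be built once and cached (e.g. in a static field) rather than reconstructed per call as in this sketch.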

+2

Speed up regular expressions

Your regular expressions could use some work.

For example, this line:

 result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " "); 

uses parentheses to capture the stop word for later use, but then never uses the capture. Perhaps the .NET regex engine is smart enough to skip the capture in this case, perhaps not.

This regex is too complicated:

 "(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])" 
  • "([-]|[.]|[-.]|[0-9])?" is identical to "([-.0-9])?". (Unless you were trying to match "-." as one of your alternatives? I assume not.) If you do not need the capture (and you do not in your example), then it is identical to "[-.0-9]?".
  • "[-.0-9]?" is a bit redundant before "[0-9]*". You can simplify the pair to "[-.]?[0-9]*".
  • Similarly, if you do not need the capture, then "([.]|[,])*" is identical to "[,.]*".
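A quick way to sanity-check simplifications like these is to run both patterns over the same sample and compare the results (the sample string below is my own):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexSimplificationCheck
{
    const string Original   = @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])";
    const string Simplified = @"([-.]?[0-9]*[,.]*[0-9]+)|(\b\w{1,2}\b)|([^\w])";

    public static bool SameResult(string sample)
    {
        // if the simplification is correct, both patterns strip exactly the same text
        return Regex.Replace(sample, Original, " ")
            == Regex.Replace(sample, Simplified, " ");
    }

    static void Main()
    {
        Console.WriteLine(SameResult("abc -12.5 de 3,14 x! slowo"));
    }
}
```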

Lastly, check if compiling your regular expressions can provide better performance.

Reduce regular expressions and string manipulation

Building a bunch of strings, constructing a bunch of Regex objects from them, and then throwing them all away, as you do in this loop, is not likely to be fast:

 result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " "); 

Try preprocessing the stop words into an array of Regex objects, or create one precompiled monster Regex (as others have suggested).

Restructure Algorithm

It looks like you only want to process the non-stop-word text, not the punctuation, numbers, etc.

To do this, your algorithm currently takes the following approach:

  • Start with a string of all the text (including stop words).
  • Use regular expressions (not necessarily the fastest approach) to replace non-words with spaces, which requires constantly reshuffling the string body.
  • Use regular expressions (again, not necessarily the fastest approach) to replace stop words with spaces, one stop word at a time.

I had started writing up a different approach here, but LBushkin beat me to it. Do what he says. Keep in mind that changing your algorithm usually yields greater improvements than micro-optimizations such as better regular-expression usage.

+1

You may have run into the Schlemiel the Painter problem. In C# (and other languages), when you append or concatenate strings, you actually create a completely new string. Doing this in a loop often causes a LOT of memory allocation that could otherwise be avoided.
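A minimal illustration of the difference (the loop size is arbitrary): naive += concatenation copies the whole accumulated string on every iteration, while StringBuilder appends into a growable buffer.

```csharp
using System;
using System.Text;

class SchlemielDemo
{
    public static string NaiveConcat(int n)
    {
        // O(n^2): each += allocates a brand-new string and copies everything so far
        string s = "";
        for (int i = 0; i < n; i++) s += "x";
        return s;
    }

    public static string WithBuilder(int n)
    {
        // O(n): appends go into an internal buffer that grows geometrically
        var sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.Append('x');
        return sb.ToString();
    }

    static void Main()
    {
        // both produce the same text, but with very different allocation behavior
        Console.WriteLine(NaiveConcat(10000) == WithBuilder(10000)); // prints True
    }
}
```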

0

I agree with mquander, and here is a bit more information. Every time you use a regular expression, C# creates a table for matching the text. That is fine and dandy if you only call the regex function a couple of times, but what you are doing here creates about 270 new tables and processes them for every HTML document.

I would try creating just one Regex object, using the | operator to match any of the different stop words, as the first filter. After that, you should compile the regular expression to an assembly so that the JIT generates machine code.

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Compiling_Regular_Expressions

You should see a dramatic speedup from that alone.

0
