Speed up regular expressions
Your regular expressions could use some work.
For example, this line:
result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
uses parentheses to capture the stop word for later use, and then never uses it. Perhaps the .NET regex engine is smart enough to skip the capture in this case, perhaps not.
This regex is too complicated:
"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])"
- "([-]|[.]|[-.]|[0-9])?" is identical to "([-.0-9])?". (Unless you are trying to match "-." as one of your possibilities? I assume not for now.) If you do not need the capture group (and in your example you do not), it is identical to "[-.0-9]?".
- "[-.0-9]?" is partly redundant in front of "[0-9]*". You can simplify the pair to "[-.]?[0-9]*".
- Similarly, if you do not need the capture group, "([.]|[,])*" is identical to "[,.]*".
Lastly, check whether compiling your regular expressions (RegexOptions.Compiled) gives better performance.
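As a rough sketch of the points above, the numeric part of the pattern can be written in its simplified form, built once, and compiled. The class and member names here (NumberPatterns, AgreeOn) are made up for illustration, and the anchors "^" and "$" are added only so the two versions can be compared on whole strings. Whether RegexOptions.Compiled actually helps depends on how often the regex is reused, so measure it.

```csharp
using System.Text.RegularExpressions;

public static class NumberPatterns
{
    // The numeric alternative of the original pattern, capture groups included.
    public const string Original = @"([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+";

    // The same alternative after the simplifications described above.
    public const string Simplified = @"[-.]?[0-9]*[.,]*[0-9]+";

    // Built once and reused. (?:...) groups without capturing.
    public static readonly Regex OriginalRegex =
        new Regex("^(?:" + Original + ")$");

    // RegexOptions.Compiled costs more at construction time, so it only
    // pays off when the same instance is used many times.
    public static readonly Regex SimplifiedRegex =
        new Regex("^(?:" + Simplified + ")$", RegexOptions.Compiled);

    // Hypothetical helper: true when both versions accept or both reject the input.
    public static bool AgreeOn(string input)
    {
        return OriginalRegex.IsMatch(input) == SimplifiedRegex.IsMatch(input);
    }
}
```

Calling AgreeOn on a handful of representative inputs is a cheap way to check that a simplification has not changed what the pattern matches.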
Reduce regular expressions and string manipulation
Building a bunch of strings, turning them into a bunch of Regex objects, and then throwing them away, as you do in this loop, is unlikely to be very fast:
result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
Try pre-processing the stop words into an array of Regex objects, or create one pre-compiled monster Regex (as others have suggested).
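A minimal sketch of the single-Regex variant might look like the following. StopWordRemover, Build and Remove are hypothetical names, and the code assumes the stop words are available as a collection of plain strings. It keeps the original behavior of replacing each stop word with a space, but drops the unused capture group.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordRemover
{
    // Builds one regex of the form \b(?:word1|word2|...)\b, once, up front.
    public static Regex Build(IEnumerable<string> stopWords)
    {
        // Regex.Escape guards against stop words that contain regex metacharacters.
        string alternation = string.Join("|", stopWords.Select(w => Regex.Escape(w)));

        // (?:...) groups the alternatives without capturing them.
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.Compiled);
    }

    // Replaces every stop word with a single space in one pass over the text.
    public static string Remove(Regex stopWordRegex, string text)
    {
        return stopWordRegex.Replace(text, " ");
    }
}
```

The Regex returned by Build is meant to be created once and reused for every document, so the cost of constructing and compiling it is paid a single time rather than once per stop word per call.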
Restructure your algorithm
It looks like you are only interested in processing the non-stop-word text, and not punctuation, numbers, etc.
To do this, your algorithm uses the following approach:
- Start with a string containing all of the text (including stop words?).
- Use regular expressions (not necessarily the fastest approach) to replace non-words with spaces (which requires constantly rearranging the body of the string).
- Use regular expressions (again, not necessarily the fastest approach) to replace stop words with spaces (again), one stop word at a time.
I started to write up a different approach here, but L. Bushkin beat me to it. Do what he says. Keep in mind that, as a rule, changing your algorithm gives bigger improvements than micro-optimizations such as better use of regular expressions.
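For illustration only, and not necessarily what L. Bushkin proposed, one way to restructure is to walk the text once, collect runs of letters as words, and check each word against a set of stop words, so the string is never rewritten. WordFilter and its methods are hypothetical names. The sketch assumes a "word" is a run of letters and skips words of one or two characters, roughly mirroring the "\b\w{1,2}\b" part of the original pattern.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WordFilter
{
    // Walks the text once and returns the words that are not stop words.
    public static List<string> ExtractWords(string text, ISet<string> stopWords)
    {
        var words = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(c);
                continue;
            }
            // Punctuation, digits and whitespace all end the current word.
            Flush(current, stopWords, words);
        }

        // The text may end in the middle of a word.
        Flush(current, stopWords, words);
        return words;
    }

    private static void Flush(StringBuilder current, ISet<string> stopWords, List<string> words)
    {
        if (current.Length == 0) return;

        string word = current.ToString();
        current.Clear();

        // Assumption: drop 1-2 letter words, as the original regex did.
        if (word.Length > 2 && !stopWords.Contains(word))
        {
            words.Add(word);
        }
    }
}
```

With a HashSet<string> as the stop word set, each lookup is a hash probe, so the cost no longer grows with the number of stop words the way the one-Regex-per-stop-word loop does.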