Improving Regular Expression Performance

Question

Improving Regular Expression Performance

I have about 100 thousand Outlook mail items that have about 500-600 characters per body. I have a list of 580 keywords that should search for each body, and then add the words below.

I believe that I have increased the efficiency of most functions, but it still takes a lot of time. Even for 100 emails, this takes about 4 seconds.

I run two functions for each keyword list (290 keywords of each list).

public List<string> Keyword_Search(HtmlNode nSearch) { var wordFound = new List<string>(); foreach (string currWord in _keywordList) { bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase); if (isMatch) { wordFound.Add(currWord); } } return wordFound; }

In any case, can I increase the efficiency of this function?

Another thing that can slow it down is that I use the HTML Agility Pack to navigate some nodes and pull out the body (nSearch.InnerHtml). _keywordList is a List element, not an array.

+6

c # regex

cam Mar 30 '10 at 13:23

source share

10 answers

I do not think this is work for regular expressions. You might be better off looking for each message in a word and checking each word against your list of words. With your approach, you search for each message n times, where n is the number of words you want to find - no wonder it takes some time.

+2

Jeff yates Mar 30 '10 at 13:37

source share

In most cases, form matches occur that fail, so you want to minimize glitches.

If the search keyword is not frequent, you can test all of them at the same time (with regexp \b(aaa|bbb|ccc|....)\b ), then you exclude emails without coincidence. One that has at least one match, you conduct a thorough search.

+2

Dan andreatta Mar 30 '10 at 13:48

source share

one thing you can easily do is combine all the words in one go by constructing an expression like:

\ B (: word1 | word2 | word3 | ....) \ b

Then you can precompile the template and reuse it to search for all events for each letter (you don’t know how you do it with the .Net API, but there should be a way).

Another thing, instead of using the ignorecase flag, if you convert everything to lowercase, which may give you a slight speedup (you need to profile it as an implementation dependent one). Remember to warm up the CLR with the profile.

+1

ddimitrov Mar 30 '10 at 13:42

source share

It could be faster. You can use Regex Groups as follows:

  public List<string> Keyword_Search(HtmlNode nSearch) { var wordFound = new List<string>(); // cache inner HTML string innerHtml = nSearch.InnerHtml; string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)"; Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase); MatchCollection myMatches = myRegex.Matches(innerHtml); foreach (Match myMatch in myMatches) { // Group 0 represents the entire match so we skip that one for (int i = 1; i < myMatch.Groups.Count; i++) { if (myMatch.Groups[i].Success) wordFound.Add(_keywordList[i-1]); } } return wordFound; }

This way you use only one regular expression. And group indices should correlate with your _keywordList with an offset of 1, hence the string wordFound.Add(_keywordList[i-1]);

UPDATE:

After I looked at my code again, I realized that putting matches into groups really wasn’t necessary. And Regex groups have some overhead. Instead, you can remove the bracket from the template, and then simply add the matches themselves to the wordFound list. This will give the same effect, but will be faster.

It will be something like this:

 public List<string> Keyword_Search(HtmlNode nSearch) { var wordFound = new List<string>(); // cache inner HTML string innerHtml = nSearch.InnerHtml; string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b"; Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase); MatchCollection myMatches = myRegex.Matches(innerHtml); foreach (Match myMatch in myMatches) { wordFound.Add(myMatch.Value); } return wordFound; }

+1

Steve wortham Mar 30 '10 at 16:41

source share

Regular expressions can be optimized quite a bit if you just want to match a fixed set of constant strings. Instead of a few matches, for example. against "winter", "victory" or "wombat", you can simply match it with "w(in(ter)?|ombat)" , for example (Jeffrey Friedl’s book can give you a lot of ideas like this). Such optimization is also built into some programs, in particular emacs ('regexp-opt'). I am not very familiar with .NET, but I assume that someone has programmed similar functionality - google for "regex optimization".

0

Kilian foth Mar 30 '10 at 13:33

source share

If the regular expression is really the neck of the bottle and even optimizes it (by combining the search words into one expression) does not help, consider using a search algorithm with several patterns, for example, Wu-Manber.

Ive posted a very simple implementation here on Stack Overflow. It is written in C ++, but since the code is simple, it is easy to translate it to C #.

Note that this will find words anywhere, not just at word boundaries. However, this can be easily checked after you have checked whether the text contains any words; either again with a regex (now you only check individual emails - much faster) or manually, checking the characters before and after individual hits.

0

Konrad Rudolph Mar 30 '10 at 14:08

source share

If your problem is finding Outlook items that contain a specific string, you should profit from using search tools in Outlook ...

see: http://msdn.microsoft.com/en-us/library/bb644806.aspx

0

Adrian Mar 30 '10 at 14:18

source share

If the keyword search is a direct literal, i.e. does not contain additional matches with a regular expression, then another method may be more appropriate. The following code demonstrates one such method, this code only goes through each email address once, your code went through each message 290 times (twice)

  public List<string> FindKeywords(string emailbody, List<string> keywordList) { // may want to clean up the input a bit, such as replacing '.' and ',' with a space // and remove double spaces string emailBodyAsUppercase = emailbody.ToUpper(); List<string> emailBodyAsList = new List<string>(emailBodyAsUppercase.Split(' ')); List<string> foundKeywords = new List<string>(emailBodyAsList.Intersect(keywordList)); return foundKeywords; }

0

Adrian Mar 30 '10 at 15:52

source share

If you can use .Net 3.5+ and LINQ, you can do something like this.

 public static class HtmlNodeTools { public static IEnumerable<string> MatchedKeywords( this HtmlNode nSearch, IEnumerable<string> keywordList) { //// as regex //var innerHtml = nSearch.InnerHtml; //return keywordList.Where(kw => // Regex.IsMatch(innerHtml, // @"\b" + kw + @"\b", // RegexOptions.IgnoreCase) // ); //would be faster if you don't need the pattern matching var innerHtml = ' ' + nSearch.InnerHtml + ' '; return keywordList.Where(kw => innerHtml.Contains(kw)); } } class Program { static void Main(string[] args) { var keyworkList = new string[] { "hello", "world", "nomatch" }; var h = new HtmlNode() { InnerHtml = "hi there hello other world" }; var matched = h.MatchedKeywords(keyworkList).ToList(); //hello, world } }

... regex example again ...

 public static class HtmlNodeTools { public static IEnumerable<string> MatchedKeywords( this HtmlNode nSearch, IEnumerable<KeyValuePair<string, Regex>> keywordList) { // as regex var innerHtml = nSearch.InnerHtml; return from kvp in keywordList where kvp.Value.IsMatch(innerHtml) select kvp.Key; } } class Program { static void Main(string[] args) { var keyworkList = new string[] { "hello", "world", "nomatch" }; var h = new HtmlNode() { InnerHtml = "hi there hello other world" }; var keyworkSet = keyworkList.Select(kw => new KeyValuePair<string, Regex>(kw, new Regex( @"\b" + kw + @"\b", RegexOptions.IgnoreCase) ) ).ToArray(); var matched = h.MatchedKeywords(keyworkSet).ToList(); //hello, world } }

0

Matthew whited Mar 30 '10 at 16:10

source share

Dirk vollmar · Accepted Answer · 2010-03-30T14:04:02+0000

I assume that the nSearch.InnerHtml COM call nSearch.InnerHtml rather slow, and you repeat the call for every single word that you check. You can simply cache the result of the call:

 public List<string> Keyword_Search(HtmlNode nSearch) { var wordFound = new List<string>(); // cache inner HTML string innerHtml = nSearch.InnerHtml; foreach (string currWord in _keywordList) { bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase); if (isMatch) { wordFound.Add(currWord); } } return wordFound; }

Another optimization will be the one proposed by Jeff Yates. For example. using one template:

 string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";

Improving Regular Expression Performance

More articles: