Minimize LINQ Token Counter

This is a follow-up to an earlier question.

Is there any way to shorten this and avoid the separate String.Split call? The goal is an associative container of {token, count}.

    string src = "for each character in the string, take the rest of the " +
                 "string starting from that character " +
                 "as a substring; count it if it starts with the target string";
    string[] target = src.Split(new char[] { ' ' });
    var results = target.GroupBy(t => new { str = t, count = target.Count(sub => sub.Equals(t)) });

4 answers

As you have it right now, it will work (to some extent), but it is terribly inefficient. As written, the result is a sequence of groupings, not the (word, count) pairs you might expect.

This overload of GroupBy() takes a function that selects a key. You are effectively performing the count for every item in the collection. Without going down the regular-expression route, and ignoring punctuation, it should be written like this:

    string src = "for each character in the string, take the rest of the " +
                 "string starting from that character " +
                 "as a substring; count it if it starts with the target string";

    var results = src.Split()               // default split by whitespace
                     .GroupBy(str => str)   // group words by the value
                     .Select(g => new
                     {
                         str = g.Key,       // the value
                         count = g.Count()  // the count of that value
                     });

    // sort the results by the words that were counted
    var sortedResults = results.OrderByDescending(p => p.str);
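
To make the difference concrete, here is a minimal sketch that runs the question's query next to the corrected one on a shortened input. The class and method names (`GroupingDemo`, `OriginalKeys`, `Corrected`) and the sample sentence are made up for illustration; only `GroupBy`, `Count` and `Select` are standard LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingDemo
{
    // The question's query: the key is an anonymous {str, count} object,
    // and target.Count(...) rescans the whole array for every element.
    public static List<string> OriginalKeys(string[] target)
    {
        var results = target.GroupBy(
            t => new { str = t, count = target.Count(sub => sub.Equals(t)) });

        // Each result is an IGrouping; the pair is hidden inside its Key.
        return results.Select(g => g.Key.str + "=" + g.Key.count).ToList();
    }

    // The corrected query: group by the word, then count each group once.
    public static List<string> Corrected(string[] target)
    {
        return target.GroupBy(str => str)
                     .Select(g => g.Key + "=" + g.Count())
                     .ToList();
    }

    public static void Main()
    {
        string[] words = "the cat and the hat".Split();
        Console.WriteLine(string.Join(", ", OriginalKeys(words)));
        Console.WriteLine(string.Join(", ", Corrected(words)));
    }
}
```

Both queries end up with the same pairs, but the first one gets there by counting the whole array once per word, and you have to dig the pair out of each grouping's `Key`.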

The Regex method is 3-4 times slower, but more accurate:

    // needs: using System.Diagnostics; using System.Linq; using System.Text.RegularExpressions;
    string src = "for each character in the string, take the rest of the " +
                 "string starting from that character " +
                 "as a substring; count it if it starts with the target string";
    var regex = new Regex(@"\w+", RegexOptions.Compiled);
    var sw = new Stopwatch();

    // Regex tokenizer
    for (int i = 0; i < 100000; i++)
    {
        var dic = regex
            .Matches(src)
            .Cast<Match>()
            .Select(m => m.Value)
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
        if (i == 1000) sw.Start(); // start timing only after a warm-up phase
    }
    Console.WriteLine(sw.Elapsed);
    sw.Reset();

    // String.Split tokenizer
    for (int i = 0; i < 100000; i++)
    {
        var dic = src
            .Split(' ')
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
        if (i == 1000) sw.Start(); // same warm-up as above
    }
    Console.WriteLine(sw.Elapsed);

For example, the Regex method will not treat "string" and "string," as two separate entries, and it will correctly report "substring" instead of "substring;".
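
The effect is easy to check on a short input. The following is only a sketch: the class and method names (`TokenizerComparison`, `CountWithSplit`, `CountWithRegex`) and the sample sentence are assumptions for illustration, and the two pipelines are the same ones used in the benchmark above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerComparison
{
    // Tokenize on single spaces; punctuation stays attached to the word.
    public static Dictionary<string, int> CountWithSplit(string src)
    {
        return src.Split(' ')
                  .GroupBy(s => s)
                  .ToDictionary(g => g.Key, g => g.Count());
    }

    // Tokenize with \w+; punctuation is never part of a match.
    public static Dictionary<string, int> CountWithRegex(string src)
    {
        return Regex.Matches(src, @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .GroupBy(s => s)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        string src = "the string, then the string";
        var bySplit = CountWithSplit(src);
        var byRegex = CountWithRegex(src);

        Console.WriteLine(bySplit["string,"]);             // 1
        Console.WriteLine(bySplit["string"]);              // 1
        Console.WriteLine(byRegex["string"]);              // 2
        Console.WriteLine(byRegex.ContainsKey("string,")); // False
    }
}
```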

EDIT

I read your previous question and realized that my code does not quite match your specification. Even so, it still demonstrates the benefit and the cost of using Regex.


Here's a LINQ version without ToDictionary(), which can add unnecessary overhead depending on your needs:

    var dic = src.Split(' ')
                 .GroupBy(s => s, (str, g) => new { str, count = g.Count() });

Or in query syntax:

    var dic = from str in src.Split(' ')
              group str by str into g
              select new { str, count = g.Count() };

Getting rid of String.Split does not leave many options on the table. One option is Regex.Matches, as shown by spender, and the other is Regex.Split (which does not give us anything new).
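
For completeness, a Regex.Split variant might look like the sketch below. The class and method names (`RegexSplitDemo`, `CountWithRegexSplit`) and the `\W+` separator pattern are my own choices for illustration, not something taken from the question; structurally it is the same split, group, count pipeline as before.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Split on runs of non-word characters, then group and count as before.
    public static Dictionary<string, int> CountWithRegexSplit(string src)
    {
        return Regex.Split(src, @"\W+")
                    // Regex.Split yields an empty string when the input
                    // starts or ends with a separator, so filter those out.
                    .Where(s => s.Length > 0)
                    .GroupBy(s => s)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var counts = CountWithRegexSplit("a, b; a.");
        foreach (var pair in counts)
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```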

Instead of grouping, you can use any of these approaches:

    var target = src.Split(new[] { ' ', ',', ';' }, StringSplitOptions.RemoveEmptyEntries);

    var result = target.Distinct()
                       .Select(s => new { Word = s, Count = target.Count(w => w == s) });

    // or dictionary approach
    var result = target.Distinct()
                       .ToDictionary(s => s, s => target.Count(w => w == s));

The Distinct call is necessary to avoid duplicate entries. I also went ahead and expanded the set of split characters to get actual words without punctuation. Using spender's benchmarking code, I found the first approach to be the fastest.

Going back to the requirement from your earlier question to order the results, you can easily extend the first approach as follows:

    var result = target.Distinct()
                       .Select(s => new { Word = s, Count = target.Count(w => w == s) })
                       .OrderByDescending(o => o.Count);

    // or in query form
    var result = from s in target.Distinct()
                 let count = target.Count(w => w == s)
                 orderby count descending
                 select new { Word = s, Count = count };

EDIT: Got rid of the Tuple, since an anonymous type was at hand.

