Dictionary Exclusion

I read documents and split them into words to build a dictionary of every word they contain. How can I exclude certain words (for example, "the", "a", "an")?

This is my function:

    private void Splitter(string[] file)
    {
        try
        {
            tempDict = file
                .SelectMany(i => File.ReadAllLines(i)
                    .SelectMany(line => line.Split(
                        new[] { ' ', ',', '.', '?', '!', },
                        StringSplitOptions.RemoveEmptyEntries))
                    .AsParallel()
                    .Distinct())
                .GroupBy(word => word)
                .ToDictionary(g => g.Key, g => g.Count());
        }
        catch (Exception ex)
        {
            Ex(ex);
        }
    }

Also, in this scenario, where should I add .ToLower() to make all the words from the file lowercase? I was thinking of something like this, placed before the query ( temp = file ..):

 file.ToList().ConvertAll(d => d.ToLower()); 
2 answers

Do you want to filter stop words?

    HashSet<String> StopWords = new HashSet<String> { "a", "an", "the" };

    ...

    tempDict = file
        .SelectMany(i => File.ReadAllLines(i)
            .SelectMany(line => line.Split(
                new[] { ' ', ',', '.', '?', '!', },
                StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Select(word => word.ToLower())            // <- To lower case
            .Where(word => !StopWords.Contains(word))  // <- No stop words
            .Distinct())
        .GroupBy(word => word)
        .ToDictionary(g => g.Key, g => g.Count());

However, this code is only a partial solution. Proper names such as Berlin will be converted to lowercase (berlin), acronyms such as KISS (Keep It Simple, Stupid) will become just kiss, and some numbers will be incorrect: because '.' and ',' are separators, a value like 3.14 is split into 3 and 14.
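One possible way to soften these limitations is sketched below. It is not part of the original answer: the class name WordCounter, the method Count and the token pattern are made up for illustration. It assumes that comparing words case-insensitively (instead of lowercasing them) is acceptable, and that a token is either a run of letters or a number with optional '.' or ',' groups. Unlike the code above, it takes lines rather than file paths and counts every occurrence of a word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical stop-word list; the comparer makes "The" and "the" both match.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    // Assumed token shape: a number such as 3.14 or 1,000, or a run of letters.
    private static readonly Regex Token =
        new Regex(@"\d+(?:[.,]\d+)*|\p{L}+", RegexOptions.Compiled);

    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Token.Matches(line).Cast<Match>())
            .Select(m => m.Value)
            .Where(word => !StopWords.Contains(word))
            // Group without changing case, so each key keeps its first-seen spelling.
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }
}
```

With this approach Berlin and KISS keep the spelling of their first occurrence while still being counted together with berlin and kiss. The trade-off is that a comma-separated list of digits without spaces, such as 1,2, would be treated as a single number.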


I would do this:

    var ignore = new [] { "the", "a", "an" };

    tempDict = file
        .SelectMany(i => File
            .ReadAllLines(i)
            .SelectMany(line => line
                .ToLowerInvariant()
                .Split(
                    new[] { ' ', ',', '.', '?', '!', },
                    StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
        .Where(x => !ignore.Contains(x))
        .GroupBy(word => word)
        .ToDictionary(g => g.Key, g => g.Count());

You can change ignore to a HashSet<string> if performance becomes an issue, but that is unlikely, since reading the files from disk will dominate the running time.
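For reference, the swap could look roughly like this. The class IgnoreList and the method Apply are hypothetical names used only to keep the sketch self-contained; in the query above, only the declaration of ignore would need to change.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class IgnoreList
{
    // Same three words as the array, stored in a hash set for constant-time lookups.
    public static readonly HashSet<string> Words =
        new HashSet<string>(new[] { "the", "a", "an" });

    // Drops every word that appears in the set; mirrors the Where clause of the query.
    public static IEnumerable<string> Apply(IEnumerable<string> words)
    {
        return words.Where(x => !Words.Contains(x));
    }
}
```

The Where clause stays the same because HashSet<string> has its own Contains method. With only three entries the difference from scanning an array is negligible.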
