Splitting a sentence into words, but having trouble with punctuation in C#

I have seen several similar questions, but I am trying to achieve this.

Given the string str = "The moon is our natural satellite, i.e. it rotates around the Earth!", I want to extract the words and store them in an array. The expected array elements would be:

the moon is our natural satellite i.e. it rotates around the earth

I tried using String.Split(',', '\t', '\r'), but this does not work correctly. I also tried removing the . and other punctuation marks, but I want a string like "i.e." to be parsed out as well. What is the best way to achieve this? I also tried using Regex.Split, to no avail:

 string[] words = Regex.Split(line, @"\W+"); 

I would certainly appreciate a nudge in the right direction.

+8
split c# regex words
4 answers

A regex solution.

 (\b[^\s]+\b) 

And if you really want to keep the final . on i.e., you can use this.

 ((\b[^\s]+\b)((?<=\.\w).)?) 

Here is the code I'm using.

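 // Requires: using System; using System.Text.RegularExpressions;
 var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
 // The optional lookbehind group keeps a trailing '.' that follows ".<letter>", as in "i.e."
 var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");
 foreach (Match match in matches)
 {
     Console.WriteLine(match.Value);
 }

Results:

 The
 moon
 is
 our
 natural
 satellite
 i.e.
 it
 rotates
 around
 the
 Earth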
+26

I suspect the solution you are looking for is much more complex than you think. You are looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered that it could do both?

Consider adding a dictionary of valid words containing punctuation marks. This may be the easiest way to solve your problem.
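
A minimal sketch of the dictionary idea, assuming a small hand-maintained allow-list of abbreviations. The names `Tokenizer`, `ExtractWords`, and `KnownAbbreviations` are hypothetical and not part of any library. Each whitespace-separated token is first trimmed of punctuation other than the period; if the result is in the allow-list it is kept intact, otherwise all surrounding punctuation is stripped.

```csharp
using System;
using System.Collections.Generic;

public static class Tokenizer
{
    // Hypothetical allow-list of words that legitimately contain punctuation.
    private static readonly HashSet<string> KnownAbbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Trim everything except '.', so that "i.e.," still matches "i.e.".
            var candidate = TrimPunctuation(token, true);
            if (!KnownAbbreviations.Contains(candidate))
            {
                // Not a known abbreviation: strip periods from the ends as well.
                candidate = TrimPunctuation(token, false);
            }
            if (candidate.Length > 0)
            {
                words.Add(candidate);
            }
        }
        return words;
    }

    // Removes punctuation from both ends of the token, optionally sparing '.'.
    private static string TrimPunctuation(string token, bool keepPeriods)
    {
        int start = 0;
        int end = token.Length - 1;
        while (start <= end && IsTrimmable(token[start], keepPeriods)) start++;
        while (end >= start && IsTrimmable(token[end], keepPeriods)) end--;
        return token.Substring(start, end - start + 1);
    }

    private static bool IsTrimmable(char c, bool keepPeriods)
    {
        return char.IsPunctuation(c) && !(keepPeriods && c == '.');
    }
}

public static class Program
{
    public static void Main()
    {
        var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
        // Prints: The|moon|is|our|natural|satellite|i.e.|it|rotates|around|the|Earth
        Console.WriteLine(string.Join("|", Tokenizer.ExtractWords(input)));
    }
}
```

The ambiguity raised above remains: a sentence that ends in "etc." keeps its period under this sketch, because the allow-list cannot tell whether the period also ends the sentence.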

+8

This works for me.

 var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
 // Split on spaces and tabs only; punctuation stays attached to the words.
 var a = str.Split(new char[] { ' ', '\t' });
 for (int i = 0; i < a.Length; i++)
 {
     Console.WriteLine(" -{0}", a[i]);
 }

Results:

  -The
  -moon
  -is
  -our
  -natural
  -satellite,
  -i.e.
  -it
  -rotates
  -around
  -the
  -Earth!

You can then post-process the results to remove commas, semicolons, and so on.
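
The post-processing step could look something like the following sketch, which trims a fixed set of punctuation characters from both ends of each token using `String.Trim(char[])`. The `SplitAndTrim` class, its `Words` method, and the exact character set are assumptions made for illustration.

```csharp
using System;
using System.Linq;

public static class SplitAndTrim
{
    // Punctuation to strip from the ends of each token; extend this set as needed.
    private static readonly char[] Punctuation =
        { ',', ';', ':', '!', '?', '.', '"', '(', ')' };

    public static string[] Words(string text)
    {
        return text
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(Punctuation))   // trims the ends only, not the middle
            .Where(word => word.Length > 0)             // drop tokens that were pure punctuation
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
        // Prints: The|moon|is|our|natural|satellite|i.e|it|rotates|around|the|Earth
        Console.WriteLine(string.Join("|", SplitAndTrim.Words(input)));
    }
}
```

Because '.' is in the trim set, "i.e." comes out as "i.e" with its final period removed. Keeping that period would need either a narrower trim set or a special case for known abbreviations.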

+2
 Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value) 
+1