Regex ignore underscores

I have a regular expression ([-@.\/,':\w]*[\w])* and it matches all the words in the text (including dotted words such as I.B.M.), but I want it to exclude underscores, and I cannot figure out how to do this. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words into letters. I want to keep matching words as before, but without matching words that contain underscores or words that consist entirely of underscores.

What is the right way to do this?

PS

  • My application is written in C# (if that matters).
  • I can’t use A-Za-z0-9 because I need to match words regardless of the language (it may be Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M. should be parsed as one word w_o_r_d! Russian should work too: the multiplex of historical events."

Matches must be:

I.B.M.  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too  
  
  
  

Please note that w_o_r_d must not match.

+5
3 answers

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

The \w class consists of [\p{L}\p{Nd}\p{Pc}] when performing Unicode matching (.NET's documentation also lists \p{Mn}, non-spacing marks, in that set), or simply [a-zA-Z0-9_] when performing non-Unicode matching.

It is the \p{Pc} Unicode category (Punctuation, Connector) that causes the problem by matching underscores, so we explicitly match the other categories without including it.
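To see how this behaves in C#, here is a minimal sketch. The class and method names (WordMatcher, LooseWords, StrictWords) are hypothetical. LooseWords uses the pattern from this answer verbatim; because of its outer *, it also produces empty matches, which are filtered out here. Note that it still splits w_o_r_d into single letters. StrictWords is my own assumption and not part of the answer: it adds lookarounds so that a token touching an underscore is skipped entirely, which seems to be what the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Pattern from the answer, unchanged.
    private static readonly Regex Loose =
        new Regex(@"([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*");

    // Assumption, not from the answer: the same token without the outer *,
    // wrapped in lookarounds that reject a token directly preceded or
    // followed by a letter, a digit, or an underscore.
    private static readonly Regex Strict =
        new Regex(@"(?<![\p{L}\p{Nd}_])[-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}](?![\p{L}\p{Nd}_])");

    public static List<string> LooseWords(string text) =>
        Loose.Matches(text).Cast<Match>()
             .Where(m => m.Length > 0)   // drop the empty matches caused by the outer *
             .Select(m => m.Value)
             .ToList();

    public static List<string> StrictWords(string text) =>
        Strict.Matches(text).Cast<Match>()
              .Select(m => m.Value)
              .ToList();
}
```

For the input "one w_o_r_d two", LooseWords should return one, w, o, r, d, two, while StrictWords should return only one and two. Both variants end a token on a letter or digit, so "I.B.M." comes back as "I.B.M" without the final dot.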


+6

Tue \w.

A-Za-z0-9.

+2

Building on LukeH's answer:

([-@.\/,':\p{L}]*\p{L})*

\p{L} covers the Unicode letter categories Lu, Ll, Lt, Lo, and Lm.

+1
