Regex ignore underscores

Question

I have a regular expression ([-@.\/,':\w]*[\w])* and it matches all the words in the text (including dotted words such as I.B.M.), but I want it to exclude underscores, and I cannot figure out how to do this. I tried adding ^[_] (e.g. (^[_][-@.\/,':\w]*[\w])*), but that just breaks all the words up into single letters. I want to keep the word matching as it is, but I do not want words that contain underscores, or words that consist entirely of underscores.

What is the right way to do this?

PS

My application is written in C# (if that matters).
I can’t use A-Za-z0-9 because I need to match words regardless of the language (maybe Chinese, Russian, Japanese, German, English).

Update
Here is an example:

"I.B.M. should be parsed as one word w_o_r_d! Russian should work too: the multiplex of historical events."

Matches must be:

I.B.M.  
should  
be  
parsed  
as  
one  
word  
Russian  
should  
work  
too

Please note that w_o_r_d must not match.

+5

c# regex regex-negation pattern-matching

Kiril Mar 30 '11 at 23:52

3 answers

\w matches underscores too.

Use A-Za-z0-9 instead.

+2

sidyll Mar 30 '11 at 23:57

Adding to LukeH's answer:

([-@.\/,':\p{L}]*\p{L})*

\p{L} matches any character in the Unicode letter categories Lu, Ll, Lt, Lo, and Lm.

+1

jb. Mar 31 '11 at 1:44

LukeH · Accepted Answer · 2011-03-31T00:33:07+0000

Try this instead:

([-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*

The class \w consists of [\p{L}\p{Nd}\p{Pc}] when performing Unicode matching. (Or simply [a-zA-Z0-9_] if you are performing non-Unicode matching.)

It is the \p{Pc} Unicode category (punctuation, connector) that causes the problem by matching underscores, so we explicitly match the other categories without including it.

(See the .NET documentation on character classes and Unicode categories.)
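For anyone who wants to try this out, here is a minimal C# sketch of the idea. The class name WordMatcher, the method Words, and the strict flag are hypothetical names made up for this illustration. Two things differ from the pattern above and are assumptions on my part: the outer (...)* is dropped, because a starred group can also match the empty string, and the optional strict variant adds lookarounds so that a token touching an underscore is rejected as a whole instead of being split into pieces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Character classes from the answer: letters (\p{L}) and decimal digits
    // (\p{Nd}), leaving out \p{Pc}, the category that contains "_".
    // Assumption: the outer (...)* is dropped so empty matches cannot occur.
    private const string Token = @"[-@.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}]";

    // The token pattern on its own: "w_o_r_d" is split into w, o, r, d.
    private static readonly Regex Loose = new Regex(Token);

    // Hypothetical stricter variant: the lookarounds reject a token that has
    // a letter, digit, or underscore directly before or after it.
    private static readonly Regex Strict = new Regex(
        @"(?<![\p{L}\p{Nd}_])" + Token + @"(?![\p{L}\p{Nd}_])");

    // Returns every match in text, in order of appearance.
    public static List<string> Words(string text, bool strict)
    {
        var result = new List<string>();
        foreach (Match m in (strict ? Strict : Loose).Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Calling WordMatcher.Words("w_o_r_d", false) gives the four single letters, while passing true gives no matches for that input. Note that either variant returns I.B.M without the trailing dot, because the pattern requires a token to end in a letter or digit. The strict variant is only a sketch: a token joined by punctuation to an underscore-containing part (for example foo.bar_baz) can still match partially.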
