Is there a way to match any Unicode non-alphabetic character?

Question

I have some documents that went through OCR conversion from PDF to HTML. As a result, they ended up with a lot of random Unicode characters in places where the converter got confused (e.g. ellipses). They also legitimately contain a lot of non-English but still alphabetic characters, such as é and Russian letters.

Is there a way to write a regex that will match any alphabetic Unicode character (from the alphabet of any language)? Or one that matches only non-alphabetic characters? Either would be very helpful. I use Perl, if that changes anything. Thanks!

+7

regex perl unicode character-properties

Eli May 14, '11 at 23:32

2 answers

Depending on which language you use, the regex engine may or may not be Unicode-aware. If it is, it may or may not support the \p{} property syntax. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regular expression tutorial.

You can use \p{Latin}, if supported, to match everything that is from a language that uses any of the Latin Unicode blocks, or \P{Latin} to match everything that is not.
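A minimal Perl sketch of that idea, assuming a Unicode-aware Perl 5 (the sample string and variable names are made up for illustration, and the characters are written as \x{...} escapes so the file encoding does not matter):

```perl
use strict;
use warnings;

# Hypothetical sample: "café", a space, Cyrillic "Москва", and an ellipsis (U+2026).
my $text = "caf\x{E9} \x{41C}\x{43E}\x{441}\x{43A}\x{432}\x{430}\x{2026}";

# \p{Latin} matches characters in the Latin script; \P{Latin} is its negation.
my @latin     = $text =~ /\p{Latin}/g;
my @not_latin = $text =~ /\P{Latin}/g;

binmode(STDOUT, ':encoding(UTF-8)');
print "Latin: ", join('', @latin), "\n";
print "Not Latin: ", join('', @not_latin), "\n";
```

Note that \P{Latin} also catches the Cyrillic letters and the space, so on its own it is probably too broad for separating OCR junk from real non-English text.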

+2

Mike 'Pomax' Kamermans May 14, '11 at 23:46

mpdaugherty · Accepted Answer · 2011-05-14T23:42:05+0000

Check out the Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think you are probably looking for

\p{L}

which will match any letter or ideograph. You may also want to include letters with combining marks on them, so you could use

 \p{L}\p{M}*

In any case, all the different types of character properties are described in detail at the link above.
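As a rough sketch in Perl (the sample text is hypothetical, and treating "not a letter, mark, or whitespace" as junk is an assumption you would want to tune for real documents, since it also flags digits and ordinary punctuation):

```perl
use strict;
use warnings;

# Hypothetical OCR output: "naïve" with a decomposed ï (i + U+0308),
# Cyrillic "да", an ellipsis (U+2026), a space, and a bullet (U+2022).
my $text = "nai\x{308}ve \x{434}\x{430}\x{2026} \x{2022}";

# Each match is one base letter plus any combining marks attached to it.
my @letters = $text =~ /(\p{L}\p{M}*)/g;

# Characters that are neither letters, marks, nor whitespace:
# candidates for OCR junk.
my @junk = $text =~ /([^\p{L}\p{M}\s])/g;

printf "%d letters, %d suspicious characters\n", scalar @letters, scalar @junk;
```

Here the decomposed ï counts as a single letter because \p{M}* absorbs the combining diaeresis, and only the ellipsis and the bullet end up in @junk.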

Edit: You can also see this answer, which discusses whether \w matches Unicode characters. It suggests that you can also use \p{Word} or \p{Alnum}: Does \w match all alphanumeric characters defined in the Unicode standard?
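To see how these properties differ from \p{L}, here is a small Perl comparison (the test characters and the %matches hash are just illustrative choices):

```perl
use strict;
use warnings;

# Hypothetical test characters: Latin é, Cyrillic ж, a digit,
# an underscore, and an ellipsis.
my @chars = ("\x{E9}", "\x{436}", "7", "_", "\x{2026}");

my %matches;
for my $c (@chars) {
    my $key = sprintf 'U+%04X', ord $c;
    # Record which of the three properties the single character satisfies.
    $matches{$key} = {
        L     => ($c =~ /\A\p{L}\z/     ? 1 : 0),
        Alnum => ($c =~ /\A\p{Alnum}\z/ ? 1 : 0),
        Word  => ($c =~ /\A\p{Word}\z/  ? 1 : 0),
    };
    printf "%s  L=%d Alnum=%d Word=%d\n",
        $key, @{ $matches{$key} }{qw(L Alnum Word)};
}
```

In Perl, the digit matches \p{Alnum} and \p{Word} but not \p{L}, and the underscore matches only \p{Word}, so \p{L} is the tighter choice if you strictly want letters.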
