Is there a way to match any Unicode non-alphabetic character?

I have some documents that went through OCR conversion from PDF to HTML. Because of this, they ended up with a lot of random Unicode punctuation characters where the converter got confused (e.g. ellipses, etc.). They also correctly contain a bunch of non-English, but still alphabetic, characters, such as é, Russian characters, etc.

Is there a way to write a regex that will match any Unicode alphabetic character (from the alphabet of any language)? Or one that matches only non-alphabetic characters? Either would be very helpful and awesome. I use Perl, if that changes anything. Thanks!

+7
2 answers

Take a look at Unicode character properties: http://www.regular-expressions.info/unicode.html#prop . I think you are probably looking for

\p{L} 

which will match any letter or ideograph. You may also want to include letters with combining marks on them, so you could use

 \p{L}\p{M}* 

In any case, all the different kinds of character properties are described in detail at the first link.
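Since the question mentions Perl, here is a minimal sketch of how the two patterns above might be applied. The sub names `letter_runs` and `non_letters` and the sample string are made up for illustration, and the sample is written with `\x{...}` escapes so it does not depend on the source file encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of letters, keeping any combining
# marks (\p{M}) attached to the letter they follow.
sub letter_runs {
    my ($text) = @_;
    return [ $text =~ /((?:\p{L}\p{M}*)+)/g ];
}

# Hypothetical helper: return every character that is neither a letter,
# a combining mark, nor whitespace (i.e. the likely OCR debris).
sub non_letters {
    my ($text) = @_;
    return [ $text =~ /([^\p{L}\p{M}\s])/g ];
}

# Sample: "cafe" with U+00E9, an ellipsis (U+2026), a Russian word,
# a bullet (U+2022), and "naive" spelled with a combining diaeresis (U+0308).
my $sample = "caf\x{e9} \x{2026} \x{41f}\x{440}\x{438}\x{432}\x{435}\x{442} "
           . "\x{2022} nai\x{308}ve";

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @{ letter_runs($sample) }), "\n";
print join('|', @{ non_letters($sample) }), "\n";
```

To strip the debris rather than list it, the same negated class could be used in a substitution, e.g. `s/[^\p{L}\p{M}\s]//g`, though that would also remove digits and ordinary punctuation.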

Edit: You may also want to look at this answer, which discusses whether \w matches Unicode characters. It suggests that you can also use \p{Word} or \p{Alnum}: Does \w match all alphanumeric characters defined in the Unicode standard?
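To see how these classes differ from \p{L}, a small Perl sketch like the following may help. The sub name `classes_for` is hypothetical, and the `/u` modifier (which assumes Perl 5.14 or later) is there to force Unicode rules for \w.

```perl
use strict;
use warnings;

# Hypothetical helper: report which classes a single character belongs to.
sub classes_for {
    my ($ch) = @_;
    return {
        word   => ($ch =~ /\A\w\z/u       ? 1 : 0),  # letters, marks, digits, underscore
        alnum  => ($ch =~ /\A\p{Alnum}\z/ ? 1 : 0),  # alphabetic characters and digits
        letter => ($ch =~ /\A\p{L}\z/     ? 1 : 0),  # letters only
    };
}

# Underscore, a digit, a Cyrillic letter (U+0436), an ellipsis (U+2026).
for my $ch ('_', '7', "\x{436}", "\x{2026}") {
    my $c = classes_for($ch);
    printf "U+%04X word=%d alnum=%d letter=%d\n",
        ord($ch), $c->{word}, $c->{alnum}, $c->{letter};
}
```

So if digits and underscores should count as junk too, \p{L} is the stricter choice; \w and \p{Alnum} let them through.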

+19

Depending on which language you use, the regex engine may or may not be Unicode-aware. If it is, it may or may not support the \p{} property syntax. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regular expression tutorial.

You can use \p{Latin}, if supported, to detect everything that is (or, with \P{Latin}, is not) written in the Latin script, i.e. text from any language that uses Latin letters.
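In Perl, which does support \p{Latin}, a rough sketch of separating Latin from non-Latin letters could look like this. The sub names and the sample string are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: runs of characters in the Latin script.
sub latin_runs {
    my ($text) = @_;
    return [ $text =~ /(\p{Latin}+)/g ];
}

# Hypothetical helper: runs of letters that are NOT in the Latin script.
# The negative lookahead rejects Latin characters before \p{L} accepts a letter.
sub non_latin_letter_runs {
    my ($text) = @_;
    return [ $text =~ /((?:(?!\p{Latin})\p{L})+)/g ];
}

# Sample: "resume" with U+00E9 accents, then the same word in Russian.
my $sample = "r\x{e9}sum\x{e9} \x{440}\x{435}\x{437}\x{44e}\x{43c}\x{435}";

binmode STDOUT, ':encoding(UTF-8)';
print "Latin:     ", join('|', @{ latin_runs($sample) }), "\n";
print "Non-Latin: ", join('|', @{ non_latin_letter_runs($sample) }), "\n";
```

Note that for the documents in the question this is probably too narrow, since the Russian text there is legitimate; \p{L} keeps it, while \p{Latin} alone would not.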

+2