Unicode text search using regular expressions

Searching a file written in Hindi (Devanagari script, UTF-16) led to the following problem.

The file contains:

त्रास ततत जुग न ीं द ना हा बु

Note that the first character, 'त्र', consists of multiple code points: त + ् + र. When I search for "त", I get 4 matches, including the त inside that first character. I am using Java.

How can I search for "त" so that it matches only where it is not part of a character made up of several code points?

Any help would be appreciated. :)

+4
2 answers

You can do this using Unicode properties, I suppose.

त(?!\p{M}+) 

This matches the code point only if it is not followed by a code point in category M (Mark), that is, a character that must be combined with other characters. It uses a negative lookahead to make this assertion.

Edit: if this does not work right away, try

 \uxxxx(?!\p{M}+) 

where xxxx is the four-digit hexadecimal code point of the character (0924 for त).
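For what it's worth, here is a minimal Java sketch of the lookahead approach, assuming the text has already been decoded into a String. The class name StandaloneLetterSearch and the method findStandalone are made up for illustration, and the sample is the beginning of the text from the question, written with escapes so that the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandaloneLetterSearch {

    // U+0924 is the letter TA. The negative lookahead rejects a match
    // when the next code point is a combining mark (general category M).
    private static final Pattern STANDALONE_TA =
            Pattern.compile("\u0924(?!\\p{M})");

    // Hypothetical helper: returns the char offsets of every standalone TA.
    static List<Integer> findStandalone(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = STANDALONE_TA.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // First two words of the sample: the conjunct word, a space, three TAs.
        String text = "\u0924\u094D\u0930\u093E\u0938 \u0924\u0924\u0924";
        // The TA at offset 0 is followed by the virama U+094D, so it is skipped.
        System.out.println(findStandalone(text)); // prints [6, 7, 8]
    }
}
```

Be aware that this also skips a त followed by a dependent vowel sign such as ा, because those signs are in category M as well. Whether that is desirable depends on what counts as a standalone letter for your search.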

+1

It seems that the glyph "त्र" is actually a ligature (a conjunct), not a single character encoded as multiple code points. So I assume you are getting the expected result, unless you want to match on glyphs. See http://en.wikipedia.org/wiki/Devanagari#Conjuncts .
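To check how such a conjunct is stored, independent of how it is rendered, one can list the code points of the string. This is only a sketch: the class name ConjunctCodePoints and the method describe are made up for illustration, and the input is the sequence त + ् + र given in the question.

```java
public class ConjunctCodePoints {

    // Hypothetical helper: formats each code point of s as U+XXXX.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // TA + virama + RA, as described in the question.
        String conjunct = "\u0924\u094D\u0930";
        System.out.println(describe(conjunct)); // prints U+0924 U+094D U+0930

        // The virama is a non-spacing mark, so it falls under category M.
        boolean viramaIsMark =
                Character.getType(0x094D) == Character.NON_SPACING_MARK;
        System.out.println(viramaIsMark); // prints true
    }
}
```

A regular expression works on this stored sequence, not on the rendered glyph, which is why a plain search for त finds the one inside the conjunct.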

0