Unicode text search using regular expression

Question


Searching a file written in Hindi (Devanagari script, UTF-16 encoded) gives me the following problem.

The file contains:

त्रास ततत जुग न ीं द ना हा बु

Note that the first character 'त्र' is made up of multiple code points: त + ् + र. When I search for "त", I get 4 matches, including the त that is part of the first character. I am using Java.

How can I search for "त" so that it only matches occurrences that are not part of a multi-code-point character?

Any help would be appreciated. :)


java unicode character-properties

user162703 Aug 25 '09 at 13:09


2 answers

Sean · Answer 1 · 2009-08-25T13:28:20+0000

You can do this using Unicode properties, I suppose.

त(?!\p{M}+)

This matches the code point only if it is not followed by any code points in category M (Mark), which are characters meant to be combined with other characters. It uses a negative lookahead to make this assertion.

Edit: if this does not work right away, try

 \uxxxx(?!\p{M}+)

Where xxxx is the code point number of the character.
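As a rough check of the pattern above, here is a minimal Java sketch. The class name `MarkAwareSearch` and the helper `countMatches` are made up for illustration, and the sample string is only the first two words of the question's text, written with Unicode escapes on the assumption that the file really contains त + ् + र as the question describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class MarkAwareSearch {

    // Counts how many times the regex matches anywhere in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // TA, VIRAMA, RA, vowel sign AA, SA, space, then TA three times.
        String text = "\u0924\u094D\u0930\u093E\u0938 \u0924\u0924\u0924";

        // Plain search: also hits the TA inside the conjunct.
        System.out.println(countMatches("\u0924", text));            // prints 4

        // Negative lookahead: skip a TA that is followed by a mark (category M).
        System.out.println(countMatches("\u0924(?!\\p{M})", text));  // prints 3
    }
}
```

Inside a negative lookahead the `+` after `\p{M}` makes no difference, so the sketch leaves it out. Note that this pattern also skips a त that carries a vowel sign, because vowel signs are in category M as well.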

fbonnet · Answer 2 · 2009-08-25T13:29:31+0000

It seems that the glyph "त्र" is actually a ligature, or conjunct, rather than a single character made of multiple code points. So I would guess that the result you get is the expected one (unless you want to match glyphs). See http://en.wikipedia.org/wiki/Devanagari#Conjuncts .
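To see what the conjunct looks like at the code point level, a small Java sketch can list each code point and say whether it is a mark. The class `ConjunctInspector` and its methods are hypothetical names chosen for this example, and the input assumes the sequence त + ् + र given in the question.

```java
// Hypothetical helper class, not part of any library.
final class ConjunctInspector {

    // True if the code point is in general category M (Mn, Mc or Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Builds one line per code point: its hex value and whether it is a mark.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.append(String.format("U+%04X %s%n", cp, isMark(cp) ? "mark" : "non-mark")));
        return sb.toString();
    }

    public static void main(String[] args) {
        // TA, VIRAMA, RA: the sequence behind the conjunct from the question.
        System.out.print(describe("\u0924\u094D\u0930"));
    }
}
```

The conjunct is stored as three separate code points, and only the middle one (the virama) is a mark. The joining into a single glyph happens when the text is rendered.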
