Convert a Unicode string to Unicode characters in C# for Indian languages

Question


I need to split a Unicode string into its individual Unicode characters.

For example, in Tamil:

"கமலி" => 'க', 'ம', 'லி'

I can get the Unicode bytes, but producing the Unicode characters from them has become a problem.

    byte[] stringBytes = Encoding.Unicode.GetBytes("கமலி");
    char[] stringChars = Encoding.Unicode.GetChars(stringBytes);
    foreach (var crt in stringChars)
    {
        Trace.WriteLine(crt);
    }

It gives the result as:

'க' => 0x0b95

'ம' => 0x0bae

'ல' => 0x0bb2

'ி' => 0x0bbf

So the problem is how to extract the character "லி" as a single "லி", without splitting it into 'ல' and 'ி'.

Representing consonants and vowel signs as separate characters is natural for Indian languages, but it makes parsing in C# difficult.

All I need is for the string to break into 3 characters.


c# .net unicode .net-2.0 tamil

arun kumar non ascii Dec 20 '12 at 6:27


1 answer

porges · Accepted Answer · 2012-12-20T07:08:31+0000

To iterate over graphemes, you can use the methods of the StringInfo class.

Each combination of a base character plus its combining characters is called a "text element" in the .NET documentation, and you can iterate over them using a TextElementEnumerator:

    var str = "கமலி";
    var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(str);
    while (enumerator.MoveNext())
    {
        Console.WriteLine(enumerator.Current);
    }

Output:

    க
    ம
    லி
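If you need the pieces as a collection rather than printed output, the same enumerator can fill a list. The following is only a minimal sketch: the `TextElements` class and its `Split` method are hypothetical names made up for illustration, not part of .NET. It avoids LINQ so it should also compile for .NET 2.0, which the question is tagged with.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElements
{
    // Hypothetical helper (not a .NET API): returns each text element
    // (base character plus its combining characters) as its own string.
    public static string[] Split(string text)
    {
        List<string> elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element typed as string,
            // whereas Current returns it typed as object.
            elements.Add(enumerator.GetTextElement());
        }
        return elements.ToArray();
    }

    static void Main()
    {
        string[] parts = Split("கமலி");
        Console.WriteLine(parts.Length); // 3
        foreach (string part in parts)
        {
            Console.WriteLine(part);
        }
    }
}
```

Each array entry is a string rather than a char, because a text element such as "லி" spans more than one UTF-16 code unit and cannot fit in a single char.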
