Convert a Unicode string to Unicode characters in C# for Indian languages

I need to split a Unicode string into its individual Unicode characters.

For example, in Tamil:

"கமலி" => 'க', 'ம', 'லி'

I can get the Unicode bytes, but building the Unicode characters from them has become a problem.

    byte[] stringBytes = Encoding.Unicode.GetBytes("கமலி");
    char[] stringChars = Encoding.Unicode.GetChars(stringBytes);
    foreach (var crt in stringChars)
    {
        Trace.WriteLine(crt);
    }

It gives the result as:

'க' => 0x0b95

'ம' => 0x0bae

'ல' => 0x0bb2

'ி' => 0x0bbf

So the problem is how to extract "லி" as a single character, without separating it into 'ல' and 'ி'.

Representing a consonant and its vowel sign as separate code points is natural for Indian languages, but it makes parsing in C# difficult.

All I need is to break the string into 3 characters.

1 answer

To iterate over graphemes, you can use the methods of the StringInfo class.

Each combination of a base character plus its combining characters is called a text element in the .NET documentation, and you can enumerate them using TextElementEnumerator:

    var str = "கமலி";
    var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(str);
    while (enumerator.MoveNext())
    {
        Console.WriteLine(enumerator.Current);
    }

Output:

    க
    ம
    லி
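
If the text elements are needed as a collection rather than printed one by one, the same enumerator can fill a list. This is a minimal sketch, not a definitive implementation: the class name TextElementDemo and the helper SplitTextElements are hypothetical names introduced here for illustration, and only StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement come from .NET itself.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper (not part of .NET): returns the text elements
    // of the input, each as its own string.
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element typed as string,
            // so no cast from object (enumerator.Current) is needed.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}
```

Calling TextElementDemo.SplitTextElements("கமலி") should give a list of three strings, with 'ல' and 'ி' kept together in the last one, matching the output shown above.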