In many non-European languages (for example, all Indic languages), a sequence of code points constitutes a single syllable / letter / symbol.
So when you compute the length or look for a substring (substring searches definitely come up in practice; say we are implementing a hangman game), you need to advance syllable by syllable, not code point by code point.
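To make the mismatch concrete, here is a small sketch using only Python's standard `unicodedata` module. The sample word (the Devanagari spelling of "namaste") is my own choice for illustration, not something from the question. A reader sees three syllables in it, but the string holds six code points.

```python
import unicodedata

# "namaste" in Devanagari, written with escapes so each code point is explicit.
word = "\u0928\u092E\u0938\u094D\u0924\u0947"  # नमस्ते

# len() counts code points, not syllables.
print(len(word))  # 6

# List every code point with its Unicode name.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

# Slicing by code point index can stop in the middle of a syllable:
# this prefix ends with the VIRAMA that joins SA to the following TA.
prefix = word[:4]
print(unicodedata.name(prefix[-1]))  # DEVANAGARI SIGN VIRAMA
```

Any length or substring logic built directly on these indices will therefore count and cut at the wrong places.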
Thus, the definition of a character / syllable, and where you actually break the string into syllable-sized pieces, depends on the nature of the language you are dealing with. For example, the syllable pattern in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following:
V (vowel in its primary form, appearing at the beginning of a word)
C (consonant)
C + V (consonant + vowel in its secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the patterns above, both to split the string and to find substrings.
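The pattern matching described above could be sketched roughly as follows, for Devanagari only. Everything here is an assumption for illustration: the function name `split_syllables` is hypothetical, and the character classes are a simplified subset of the Devanagari block, so rarer signs fall through to the single-character fallback. Other Indic scripts would need their own ranges.

```python
import re

# Simplified Devanagari character classes (an assumption, not a full grammar).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
NUKTA = "\u093C"
VIRAMA = "\u094D"                 # joins consonants into a cluster
INDEPENDENT_VOWEL = "[\u0904-\u0914]"  # vowels in their primary form
VOWEL_SIGN = "[\u093E-\u094C]"    # vowels in their secondary (dependent) form
MODIFIER = "[\u0900-\u0903]"      # candrabindu, anusvara, visarga

SYLLABLE = re.compile(
    # C, C+V, C+C+V, C+C+C+V, ...: consonants joined by virama,
    # then an optional vowel sign (or a final virama) and modifiers.
    f"(?:{CONSONANT}{NUKTA}?{VIRAMA})*{CONSONANT}{NUKTA}?"
    f"(?:{VIRAMA}|{VOWEL_SIGN})?{MODIFIER}*"
    # V: an independent vowel with optional modifiers.
    f"|{INDEPENDENT_VOWEL}{MODIFIER}*"
    # Fallback: any other single code point (spaces, digits, Latin, ...).
    f"|.",
    re.DOTALL,
)


def split_syllables(text: str) -> list[str]:
    """Split text into syllable-like units using the simplified pattern."""
    return SYLLABLE.findall(text)


word = "\u0928\u092E\u0938\u094D\u0924\u0947"  # नमस्ते
syllables = split_syllables(word)
print(syllables)       # three units: NA, MA, and the cluster SA+VIRAMA+TA+E
print(len(syllables))  # 3, compared with len(word) == 6
```

For the hangman case, you would compare the player's guess against the elements of this list rather than against individual code points. Because every code point matches one of the alternatives, joining the pieces always gives back the original string.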
I don't think there is a general-purpose method that can magically split any Unicode string (or sequence of code points) in the manner described above, since a pattern that works for one language may not apply to another.
I suspect there may be methods / libraries that take some definition / configuration parameters as input and split Unicode strings into such syllables. Not sure, though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source tools.
SRKJ Oct 20 2018-12-12T00: 00Z