If I understand your question correctly, you have a Unicode string containing codepoints and you want to convert it into an array of graphemes?
I work on an open source Python library for tasks like this, built for a Tamil language website.
I have not used PHP in a while, so I will post the logic instead. You can take a look at the code in the split_letters() function in amuthaa/TamilWord.py.
As Ruach mentioned, Tamil graphemes are built from codepoints.
The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all the combinations (உயிர்-மெய் எழுத்து) in the "a" column (அ வரி, i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use one codepoint.
Each consonant consists of two codepoints: the a-combination letter + the pulli, e.g. ப் = ப + ்
Each combination other than the a-combinations also consists of two codepoints: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை
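To see this composition for yourself, here is a small Python sketch that lists the codepoints behind a few graphemes. The helper name describe is made up for illustration; it only relies on the standard unicodedata module.

```python
import unicodedata


def describe(text):
    """Return (hex codepoint, Unicode name) for every codepoint in text.

    'describe' is a hypothetical helper name, not part of any library.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# An a-combination, a consonant with pulli, and two other combinations.
for grapheme in ["ப", "ப்", "பி", "தை"]:
    print(grapheme, len(grapheme), describe(grapheme))
```

The a-combination reports a length of 1, while the other three report a length of 2: a base letter followed by a marking.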
So your logic would be something like this:

    initialize an empty array
    for each codepoint in word:
        if the codepoint is a vowel, an a-combination or the aytham, it starts a new grapheme, so add it to the array
        otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g. ி or ை), so append it to the end of the last element of the array
This, of course, assumes that your string is well formed and that you don't have things like two markings in a row.
Here's the Python code, in case you find it useful. If you want to help us port this to PHP, let me know:
    @staticmethod
    def split_letters(word=u''):
        """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """
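Only the signature and docstring are shown above. As a rough illustration of the loop described earlier, here is a minimal standalone sketch. It is not the library's actual code: the name split_letters_sketch is hypothetical, and it assumes that a "marking" can be recognised by its Unicode general category (Mn or Mc) rather than by an explicit table of Tamil signs.

```python
import unicodedata


def split_letters_sketch(word=""):
    """Split a Tamil word into graphemes (hypothetical sketch).

    Assumption: any codepoint whose Unicode category starts with "M"
    (the pulli and the vowel signs) is a marking that belongs to the
    letter before it. Everything else starts a new grapheme.
    """
    graphemes = []
    for ch in word:
        is_marking = unicodedata.category(ch).startswith("M")
        if is_marking and graphemes:
            # Attach the marking to the last grapheme collected so far.
            graphemes[-1] += ch
        else:
            # Vowel, a-combination, aytham, or a stray leading marking.
            graphemes.append(ch)
    return graphemes


print(split_letters_sketch("தமிழ்"))
```

For the word தமிழ் (five codepoints) this should give three graphemes: த, மி and ழ். A marking at the very start of the string has nothing to attach to, so the sketch keeps it as its own element instead of failing.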
Ashwin Balamohan Jan 25 '13 at 16:48