How to split Tamil characters in a string in PHP

How do I split Tamil characters in a string?

When I use preg_match_all('/./u', $str, $results), I get the individual code points "த", "ம", "ி", "ழ" and "்".

How can I get the combined characters "த", "மி" and "ழ்" instead?

+10
php unicode string-split tamil
Jan 10
2 answers

I think you should use the grapheme_extract function to iterate over the combined characters (which are technically called "grapheme clusters").
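As an illustration, that iteration might look roughly like the sketch below. It assumes the intl extension is loaded (that is where grapheme_extract lives), and the helper name split_graphemes is made up for this example.

```php
<?php
// Sketch only: requires the intl extension. split_graphemes() is a
// hypothetical helper name, not a PHP built-in.
function split_graphemes(string $str): array
{
    $clusters = [];
    $offset = 0;             // byte offset where the next cluster starts
    $length = strlen($str);  // length in bytes, matching the byte offsets below

    while ($offset < $length) {
        $next = 0;
        // Extract exactly one grapheme cluster starting at $offset;
        // $next receives the byte offset just past that cluster.
        $cluster = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
        if ($cluster === false || $next <= $offset) {
            break; // invalid input or no progress: stop instead of looping forever
        }
        $clusters[] = $cluster;
        $offset = $next;
    }

    return $clusters;
}
```

With the string from the question, split_graphemes('தமிழ்') should return "த", "மி" and "ழ்".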

Alternatively, if you prefer a regex approach, I think you can use this:

 preg_match_all('/\pL\pM*|./u', $str, $results) 

where \pL matches any Unicode letter and \pM matches any Unicode mark (a combining character such as a vowel sign or the pulli).
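Spelled out on the string from the question, the regex approach could be used like this. The variable names are the ones from the question; the sample string is just the word used there.

```php
<?php
$str = 'தமிழ்';

// \pL   one Unicode letter (the base character)
// \pM*  zero or more Unicode marks attached to it (vowel signs, pulli)
// |.    fallback: any other single character (digits, spaces, punctuation)
// /u    treat pattern and subject as UTF-8
preg_match_all('/\pL\pM*|./u', $str, $results);

// $results[0] holds the full matches, one per letter-plus-marks group.
print_r($results[0]);
```

For this string there are three matches: த, மி and ழ். Note that without the s modifier the dot does not match a newline, so line breaks in $str would be dropped from the result.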

(Disclaimer: I have not tested any of these approaches.)

+11
Jan 10

If I understand your question correctly, you have a Unicode string made up of code points and want to convert it into an array of graphemes?

I am developing an open-source Python library for tasks like this, for a Tamil-language site.

I have not used PHP for a while, so I will post the logic instead. You can take a look at the code in the split_letters() function in the amuthaa/TamilWord.py file.

As Ruach mentioned, each Tamil grapheme is built from one or more code points:

  • Vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all combinations (உயிர்-மெய் எழுத்து) in the "a" column (அ வரி, i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use one code point.

  • Each consonant consists of two code points: the a-combination letter + the pulli. For example, ப் = ப + ்

  • Each combination other than the a-combinations also consists of two code points: the a-combination letter + a marking. For example, பி = ப + ி and தை = த + ை
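To see this breakdown for yourself, a small helper can list the code points behind a piece of text. This is only a sketch: codepoints_of is a hypothetical name, and it assumes the mbstring extension is available for mb_ord (PHP 7.2 or later).

```php
<?php
// Sketch only: requires the mbstring extension for mb_ord().
// codepoints_of() is a hypothetical helper name.
function codepoints_of(string $text): array
{
    // Split the UTF-8 string into single code points; fall back to an
    // empty list if the input is not valid UTF-8 (preg_split returns false).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $result = [];
    foreach ($chars as $char) {
        // Format each code point in the usual U+XXXX notation.
        $result[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }

    return $result;
}
```

Called on ப it returns one entry (U+0BAA), on ப் it returns U+0BAA and U+0BCD (the pulli), and on பி it returns U+0BAA and U+0BBF (the vowel sign).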

So your logic would be something like this:

 initialize an empty array
 for each codepoint in word:
     if the codepoint is a vowel, a-combination or aytham, it is also its grapheme, so add it to the array
     otherwise, the codepoint is a marking such as the pulli (ie ்) or one of the combination extensions (eg ி or ை), so append it to the end of the last element of the array

This, of course, assumes that your string is well formed and that you don't have things like two markings in a row.
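In PHP, the loop described above might be sketched as follows. This is not a port of the library code: instead of the library's lookup tables it uses the Unicode Mark category (\p{M}) as a stand-in for "marking", which is an assumption made for this example, and the function name split_tamil_letters is hypothetical.

```php
<?php
// Sketch only: split_tamil_letters() is a hypothetical name, and \p{M}
// (any Unicode mark) is assumed to cover the pulli and the vowel signs.
function split_tamil_letters(string $word): array
{
    $letters = [];

    // Split the UTF-8 string into individual code points.
    $codepoints = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if ($codepoints === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    foreach ($codepoints as $cp) {
        if (preg_match('/^\p{M}$/u', $cp) === 1) {
            // A marking cannot stand alone: it needs a preceding letter.
            if ($letters === []) {
                throw new InvalidArgumentException("$cp cannot be the first character of a word");
            }
            // Append the marking to the last grapheme collected so far.
            $letters[count($letters) - 1] .= $cp;
        } else {
            // Vowel, a-combination, aytham or anything else: its own grapheme.
            $letters[] = $cp;
        }
    }

    return $letters;
}

$letters = split_tamil_letters('தமிழ்');
print_r($letters); // three elements: த, மி, ழ்
```

Unlike the pseudocode, this version does not check that the input is actually Tamil, and it accepts several markings in a row by attaching all of them to the preceding letter.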

Here's the Python code if you find this useful. If you want to help us port this to PHP, let me know:

    @staticmethod
    def split_letters(word=u''):
        """ Returns the graphemes (ie the Tamil characters) in a given word as a list """

        # ensure that the word is a valid word
        TamilWord.validate(word)

        # list (which will be returned to user)
        letters = []

        # a tuple of all combination endings and of all அ combinations
        combination_endings = TamilLetter.get_combination_endings()
        a_combinations = TamilLetter.get_combination_column(u'அ').values()

        # loop through each codepoint in the input string
        for codepoint in word:

            # if codepoint is an அ combination, a vowel, aytham or a space,
            # add it to the list
            if codepoint in a_combinations or \
                    TamilLetter.is_whitespace(codepoint) or \
                    TamilLetter.is_vowel(codepoint) or \
                    TamilLetter.is_aytham(codepoint):

                letters.append(codepoint)

            # if codepoint is a combination ending or a pulli ('்'), add it
            # to the end of the previously-added codepoint
            elif codepoint in combination_endings or \
                    codepoint == TamilLetter.get_pulli():

                # ensure that at least one character already exists
                if len(letters) > 0:
                    letters[-1] = letters[-1] + codepoint

                # otherwise raise an Error. However, validate_word()
                # should catch this
                else:
                    raise ValueError("""%s cannot be first character of a word""" % (codepoint))

        return letters

+2
Jan 25 '13 at 16:48