Do Arabic characters have different Unicode code points based on their position in a word?

Do Arabic characters have different Unicode code points depending on their position in a word, or is the change in shape purely visual?

This is the same word three times, with and without spaces. It looks like each letter keeps the same Unicode value.

عربى
عرب ى
ع ربى

What I need to do is scan a list of Arabic strings and get the code point of each letter. Using these values, I will select the icon to display for the corresponding letter. However, if the code point is the same regardless of position, I will have to write my own positional logic in code, which I want to avoid.

2 answers

The different positional forms do have their own Unicode code points. For example, the letter ت (\u062A) has these codes for its forms: \uFE95 ﺕ (isolated), \uFE97 ﺗ (initial), \uFE98 ﺘ (medial), \uFE96 ﺖ (final).

However, Arabic text is almost always stored using the base, unshaped code points. The shaped forms are used only for rendering. Therefore, if you inspect your text in a program, you will find it mostly unshaped.

If you want all letters to be shaped, you can use a reshaper library, for example Python Arabic Reshaper:
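To check whether a given string is stored unshaped, you can look for code points in the two Arabic presentation-forms blocks. This is a minimal sketch; the function name `has_presentation_forms` is hypothetical, and the ranges are simply the block boundaries U+FB50..U+FDFF and U+FE70..U+FEFE (U+FEFF is left out because it is the byte order mark, not a letter form).

```python
def has_presentation_forms(text):
    """Return True if any character is an Arabic presentation (shaped) form."""
    for ch in text:
        cp = ord(ch)
        # Arabic Presentation Forms-A, then Forms-B without U+FEFF (the BOM).
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFE:
            return True
    return False


# The word from the question, written with base code points only.
stored = "\u0639\u0631\u0628\u0649"
# The same letter TEH in its isolated presentation form.
shaped = "\uFE95"

print(has_presentation_forms(stored))
print(has_presentation_forms(shaped))
```

If this returns False for your strings, every letter is a base code point and the position has to be derived from the neighbouring characters.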

  import arabic_reshaper

  # Replace base letters with their positional presentation forms.
  reshaped_text = arabic_reshaper.reshape(u'اللغة العربية رائعة')

If you want all letters to be unshaped, use the shaping map below to convert the letters back to their base form.

Here is the shaping map:

  # Base letter -> presentation forms.
  # Four entries: (isolated, initial, medial, final).
  # Two entries: (isolated, final). One entry: a single form.
  SHAPING = {
      u'\u0621': (u'\uFE80',),
      u'\u0622': (u'\uFE81', u'\uFE82'),
      u'\u0623': (u'\uFE83', u'\uFE84'),
      u'\u0624': (u'\uFE85', u'\uFE86'),
      u'\u0625': (u'\uFE87', u'\uFE88'),
      u'\u0626': (u'\uFE89', u'\uFE8B', u'\uFE8C', u'\uFE8A'),
      u'\u0627': (u'\uFE8D', u'\uFE8E'),
      u'\u0628': (u'\uFE8F', u'\uFE91', u'\uFE92', u'\uFE90'),
      u'\u0629': (u'\uFE93', u'\uFE94'),
      u'\u062A': (u'\uFE95', u'\uFE97', u'\uFE98', u'\uFE96'),
      u'\u062B': (u'\uFE99', u'\uFE9B', u'\uFE9C', u'\uFE9A'),
      u'\u062C': (u'\uFE9D', u'\uFE9F', u'\uFEA0', u'\uFE9E'),
      u'\u062D': (u'\uFEA1', u'\uFEA3', u'\uFEA4', u'\uFEA2'),
      u'\u062E': (u'\uFEA5', u'\uFEA7', u'\uFEA8', u'\uFEA6'),
      u'\u062F': (u'\uFEA9', u'\uFEAA'),
      u'\u0630': (u'\uFEAB', u'\uFEAC'),
      u'\u0631': (u'\uFEAD', u'\uFEAE'),
      u'\u0632': (u'\uFEAF', u'\uFEB0'),
      u'\u0633': (u'\uFEB1', u'\uFEB3', u'\uFEB4', u'\uFEB2'),
      u'\u0634': (u'\uFEB5', u'\uFEB7', u'\uFEB8', u'\uFEB6'),
      u'\u0635': (u'\uFEB9', u'\uFEBB', u'\uFEBC', u'\uFEBA'),
      u'\u0636': (u'\uFEBD', u'\uFEBF', u'\uFEC0', u'\uFEBE'),
      u'\u0637': (u'\uFEC1', u'\uFEC3', u'\uFEC4', u'\uFEC2'),
      u'\u0638': (u'\uFEC5', u'\uFEC7', u'\uFEC8', u'\uFEC6'),
      u'\u0639': (u'\uFEC9', u'\uFECB', u'\uFECC', u'\uFECA'),
      u'\u063A': (u'\uFECD', u'\uFECF', u'\uFED0', u'\uFECE'),
      u'\u0640': (u'\u0640',),
      u'\u0641': (u'\uFED1', u'\uFED3', u'\uFED4', u'\uFED2'),
      u'\u0642': (u'\uFED5', u'\uFED7', u'\uFED8', u'\uFED6'),
      u'\u0643': (u'\uFED9', u'\uFEDB', u'\uFEDC', u'\uFEDA'),
      u'\u0644': (u'\uFEDD', u'\uFEDF', u'\uFEE0', u'\uFEDE'),
      u'\u0645': (u'\uFEE1', u'\uFEE3', u'\uFEE4', u'\uFEE2'),
      u'\u0646': (u'\uFEE5', u'\uFEE7', u'\uFEE8', u'\uFEE6'),
      u'\u0647': (u'\uFEE9', u'\uFEEB', u'\uFEEC', u'\uFEEA'),
      u'\u0648': (u'\uFEED', u'\uFEEE'),
      u'\u0649': (u'\uFEEF', u'\uFEF0'),
      u'\u064A': (u'\uFEF1', u'\uFEF3', u'\uFEF4', u'\uFEF2'),
  }
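Converting shaped text back to base letters amounts to inverting the map. The following is only a sketch of that idea: `SHAPING_EXCERPT`, `build_unshape_table` and `unshape` are hypothetical names, and the excerpt holds just two letters so the example stays self-contained. Pass the full map instead to cover every letter.

```python
# Two entries copied from the shaping map, for illustration only.
SHAPING_EXCERPT = {
    "\u062A": ("\uFE95", "\uFE97", "\uFE98", "\uFE96"),  # TEH
    "\u0631": ("\uFEAD", "\uFEAE"),                      # REH
}


def build_unshape_table(shaping):
    """Invert base -> forms into a str.translate table: form -> base."""
    table = {}
    for base, forms in shaping.items():
        for form in forms:
            table[ord(form)] = base
    return table


def unshape(text, table):
    """Replace every known presentation form with its base letter."""
    return text.translate(table)


table = build_unshape_table(SHAPING_EXCERPT)
# Initial TEH followed by final REH becomes base TEH + base REH.
base_text = unshape("\uFE97\uFEAE", table)
print([hex(ord(ch)) for ch in base_text])
```

Characters that are not in the table, including letters that are already unshaped, pass through unchanged.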

Unicode reserves several character blocks for Arabic, including these five:

  • U+0600..U+06FF Arabic
  • U+0750..U+077F Arabic Supplement
  • U+08A0..U+08FF Arabic Extended-A
  • U+FB50..U+FDFF Arabic Presentation Forms-A
  • U+FE70..U+FEFF Arabic Presentation Forms-B

The sample text in the question is encoded using four code points:

  • UTF-8 0xD8 0xB9 = U+0639 = ARABIC LETTER AIN
  • UTF-8 0xD8 0xB1 = U+0631 = ARABIC LETTER REH
  • UTF-8 0xD8 0xA8 = U+0628 = ARABIC LETTER BEH
  • UTF-8 0xD9 0x89 = U+0649 = ARABIC LETTER ALEF MAKSURA

In addition, there are spaces and some occurrences of:

  • UTF-8 0xE2 0x80 0x8E = U+200E = LEFT-TO-RIGHT MARK (LRM)
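A listing like the one above can be reproduced with Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (UTF-8 bytes, code point, Unicode name) for each character."""
    rows = []
    for ch in text:
        utf8 = " ".join("0x%02X" % b for b in ch.encode("utf-8"))
        code_point = "U+%04X" % ord(ch)
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((utf8, code_point, name))
    return rows


# The word from the question, followed by a LEFT-TO-RIGHT MARK.
for utf8, code_point, name in describe("\u0639\u0631\u0628\u0649\u200E"):
    print("UTF-8 %s = %s = %s" % (utf8, code_point, name))
```

Running it on each of the three spellings in the question shows the same four letter code points every time; only the spaces and marks between them differ.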

The fact that Arabic letters are displayed differently even though the same Unicode code point is stored means that you will need to choose the glyph yourself, based on the letter's position relative to the other characters (beginning, middle or end of the word, or isolated). You can read Chapter 9 (Middle East-I) of the Unicode Standard to learn a lot more about processing Arabic text.
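As a rough illustration of that positional logic, the sketch below labels each letter as isolated, initial, medial or final. It is a simplification under stated assumptions, not a full implementation of Arabic shaping: joining behaviour is inferred from how many forms a letter has in a shaping map like the one in the other answer (four forms means it joins on both sides, two forms means it joins only to the preceding letter), and diacritics, tatweel and ligatures are ignored. All names here are hypothetical, and the excerpt covers only the four letters of the sample word.

```python
# Excerpt of a shaping map: (isolated, initial, medial, final) or (isolated, final).
SHAPING_EXCERPT = {
    "\u0639": ("\uFEC9", "\uFECB", "\uFECC", "\uFECA"),  # AIN
    "\u0631": ("\uFEAD", "\uFEAE"),                      # REH
    "\u0628": ("\uFE8F", "\uFE91", "\uFE92", "\uFE90"),  # BEH
    "\u0649": ("\uFEEF", "\uFEF0"),                      # ALEF MAKSURA
}


def joins_backward(ch, shaping):
    """True if ch has a final form, so it can connect to the previous letter."""
    return len(shaping.get(ch, ())) >= 2


def joins_forward(ch, shaping):
    """True if ch has initial and medial forms, so it can connect to the next letter."""
    return len(shaping.get(ch, ())) == 4


def positional_forms(text, shaping):
    """Return (character, label) pairs; label is None for unmapped characters."""
    result = []
    for i, ch in enumerate(text):
        if ch not in shaping:
            result.append((ch, None))
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        linked_prev = joins_forward(prev_ch, shaping) and joins_backward(ch, shaping)
        linked_next = joins_forward(ch, shaping) and joins_backward(next_ch, shaping)
        if linked_prev and linked_next:
            label = "medial"
        elif linked_prev:
            label = "final"
        elif linked_next:
            label = "initial"
        else:
            label = "isolated"
        result.append((ch, label))
    return result


# The three spellings from the question: no space, space before the last
# letter, space after the first letter.
for sample in ("\u0639\u0631\u0628\u0649",
               "\u0639\u0631\u0628 \u0649",
               "\u0639 \u0631\u0628\u0649"):
    print([label for _, label in positional_forms(sample, SHAPING_EXCERPT)])
```

The label, or the matching entry in the map's tuple, could then serve as the key for choosing an icon. For production use, a shaping library is likely to handle the cases this sketch skips.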
