Arabic Presentation Forms-B support in C#

Question

I tried to convert a file from UTF-8 to Arabic (Windows-1256) using the encoding APIs in C#, but I ran into a strange problem: some characters are not converted correctly. For example, the "لا" in the phrase "محمد صلا ح عادل" comes out as "محمد ص? ح عادل". Some of my friends told me that this is because these characters come from Arabic Presentation Forms-B. I created the file in Notepad++ and saved it as UTF-8.

Here is the code I use:

    StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
    string str = sr.ReadLine();
    StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
    sw.Write(str);
    sw.Flush();
    sw.Close();

But I do not know how to convert the file correctly when it contains these presentation forms in C#.

+4

c# encoding forms arabic presentation

Maged 21 sept '10 at 7:42

3 answers

To give a more general answer:

The Windows-1256 encoding is a legacy 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
- "normal" Arabic characters, U+0600 to U+06FF. These are meant for normal Arabic text, including text in other languages that use the Arabic script, such as Farsi. For example, "لا" is U+0644 (ل) followed by U+0627 (ا).
- "presentation form" characters, U+FB50 to U+FDFF ("Presentation Forms-A") and U+FE70 to U+FEFF ("Presentation Forms-B"). These are not intended for representing Arabic text. They exist primarily for compatibility, especially with font file formats that require a separate code point for each ligated form of each character and character combination. The ligature "لا" is represented by a single code point (U+FEFB), even though it stands for two characters.
When encoding to Windows-1256, the .NET encoding automatically converts characters from the presentation-form blocks into "normal text" characters, because it has no other choice (other than turning them all into question marks, of course). For obvious reasons, it can only do this for characters that actually have an "equivalent".
When decoding from Windows-1256, the .NET encoding always produces characters from the "normal text" block.

As we have discovered, your input file contains characters that cannot be represented in Windows-1256. Such characters turn into question marks (?). In addition, those presentation-form characters that do have a normal-text equivalent will change their ligation behavior, because that is how normal Arabic text behaves.
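
To make the difference between the two representations visible, a minimal sketch along the following lines can list the code points of a string and flag those that fall in the presentation-form blocks. The names `PresentationFormCheck`, `IsPresentationForm` and `Dump` are hypothetical and chosen only for illustration.

```csharp
using System;

static class PresentationFormCheck
{
    // Hypothetical helper: true if c lies in Arabic Presentation Forms-A
    // (U+FB50..U+FDFF) or Arabic Presentation Forms-B (U+FE70..U+FEFF).
    public static bool IsPresentationForm(char c)
    {
        return (c >= '\uFB50' && c <= '\uFDFF') || (c >= '\uFE70' && c <= '\uFEFF');
    }

    // Prints every UTF-16 code unit of s, marking presentation forms with '*'.
    public static void Dump(string label, string s)
    {
        Console.Write(label + ":");
        foreach (char c in s)
        {
            Console.Write(" U+{0:X4}{1}", (int)c, IsPresentationForm(c) ? "*" : "");
        }
        Console.WriteLine();
    }

    static void Main()
    {
        Dump("normal", "\u0644\u0627");  // lam followed by alef, two code points
        Dump("ligature", "\uFEFB");      // one presentation-form code point
    }
}
```

Running the same `Dump` over the text read from the input file should show whether it really contains presentation-form characters.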

+3

Timwi 21 sept '10 at 8:48

First of all, the two characters you mention are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which come from the standard Arabic block. However, to make sure, I also tried \xFEFB, which is the "equivalent" character (not strictly equivalent, but you know what I mean) for لا from the Presentation Forms block, and it works just fine too.

Secondly, I assume that you are referring to the Windows-1256 encoding, which is intended for legacy 8-bit Arabic text.

So, I tried the following:

    var input = "لا";
    var encoding = Encoding.GetEncoding("windows-1256");
    var result = encoding.GetBytes(input);
    Console.WriteLine(string.Join(", ", result));

The output I get is 225, 199. So let's try converting it back:

    var bytes = new byte[] { 225, 199 };
    var result2 = encoding.GetString(bytes);
    Console.WriteLine(result2);

Admittedly, the console does not display the result correctly, but the Watch window in the debugger tells me that the answer is correct (it shows "لا"). I can also copy the output from the console, and it is correct on the clipboard.

Therefore, the Windows-1256 encoding works fine, and it is not clear what the problem is.

My recommendation:

Write a short piece of code that detects the problem.
Submit a new question with this code snippet.
In this question, describe what result you get and what result you expect.
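
One possible shape for such a short snippet is to ask the encoder to throw instead of silently substituting, and report where it fails. This is only a sketch: `EncodeCheck` and `FindFirstUnencodable` are made-up names, the sample string is an assumption, and the provider registration assumes a modern .NET runtime (.NET Core 3.0 or later), where code page encodings are not available by default. On .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

static class EncodeCheck
{
    static EncodeCheck()
    {
        // Assumption: modern .NET, where windows-1256 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: returns the index reported for the first character
    // that Windows-1256 refuses to encode, or -1 if the whole string encodes.
    public static int FindFirstUnencodable(string s)
    {
        Encoding strict = Encoding.GetEncoding(
            "windows-1256",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(s);
            return -1;
        }
        catch (EncoderFallbackException ex)
        {
            return ex.Index;
        }
    }

    static void Main()
    {
        // Sample text (an assumption): replace with a line read from the real file.
        string sample = "\u0635\uFEFC\u062D";
        int index = FindFirstUnencodable(sample);
        if (index < 0)
            Console.WriteLine("All characters can be encoded.");
        else
            Console.WriteLine("First failure at index {0}: U+{1:X4}", index, (int)sample[index]);
    }
}
```

Posting the reported code point together with the expected output makes the problem much easier to reproduce.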

0

Timwi 21 sept '10 at 7:52

Hans Passant · Accepted Answer · 2010-09-21T08:58:04+0000

Yes, your string contains many ligatures that cannot be represented in code page 1256. You will have to expand them before writing the string. Like this:

    str = str.Normalize(NormalizationForm.FormKD);
    sw.Write(str);
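
Combined with the code from the question, the whole conversion might look like the sketch below. It reuses the file paths from the question, reads the entire file rather than only its first line, and assumes a modern .NET runtime where the code page provider has to be registered. `Utf8ToWindows1256` and `ExpandPresentationForms` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8ToWindows1256
{
    // Hypothetical helper: compatibility decomposition turns presentation-form
    // ligatures such as U+FEFB into their plain letters (U+0644 U+0627).
    public static string ExpandPresentationForms(string text)
    {
        return text.Normalize(NormalizationForm.FormKD);
    }

    static void Main()
    {
        // Assumption: .NET Core 3.0 or later; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding target = Encoding.GetEncoding("windows-1256");

        // Paths taken from the question.
        string text = File.ReadAllText(@"C:\utf-8.txt", Encoding.UTF8);
        File.WriteAllText(@"C:\windows-1256.txt", ExpandPresentationForms(text), target);
    }
}
```

Note that FormKD also splits precomposed letters such as U+0623 (alef with hamza above) into a base letter plus a combining mark. If such letters come out wrong, NormalizationForm.FormKC, which recomposes them after expanding the ligatures, may be worth trying instead.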
