What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically determines the encoding by looking at the byte order mark (BOM) at the beginning of the file and/or scanning the first 256 bytes for byte values > 0x7F.
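For reference, here is a minimal sketch of that detection scheme (BOM check, then a scan of the first 256 bytes for values above 0x7F); the function name, the BOM table, and the "unknown-8-bit" placeholder are illustrative, not taken from any particular editor:

```python
def guess_ascii_or_unicode(data: bytes) -> str:
    """Guess the encoding as described above: check for a BOM first,
    then scan the first 256 bytes for values above 0x7F."""
    boms = [
        # UTF-32 BOMs must be tested before UTF-16, since they share a prefix.
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    # No BOM: any byte above 0x7F in the first 256 bytes rules out plain ASCII.
    if any(byte > 0x7F for byte in data[:256]):
        return "unknown-8-bit"   # needs further detection
    return "ascii"
```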

What other encodings should be supported, and what characteristics would make those encodings easy to detect automatically?

+4
6 answers

Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.

As far as I know, there is no reliable way to detect this automatically (although the probability of an erroneous diagnosis can be reduced to a very small amount by scanning).
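A sketch of that scanning idea, leaning on Python's built-in UTF-8 decoder as the validity check (the function name is illustrative):

```python
def probably_utf8(data: bytes) -> bool:
    """Random 8-bit text very rarely forms valid multi-byte UTF-8 sequences,
    so a clean decode is strong (though not absolute) evidence of UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Note that pure ASCII passes this check too, which is harmless because ASCII is a subset of UTF-8.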

+4

I don't know about encodings, but make sure it can support several different line-ending conventions! (\n vs. \r\n)
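A small sketch of line-ending detection on already-decoded text, assuming a simple majority vote is good enough (the function name and the tie-breaking default are arbitrary):

```python
def detect_line_ending(text: str) -> str:
    """Return the dominant line-ending convention in decoded text."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf    # LF not part of a CRLF pair
    cr = text.count("\r") - crlf    # CR not followed by LF (classic Mac OS)
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    if not any(counts.values()):
        return "\n"                 # empty or single-line file: pick a default
    return max(counts, key=counts.get)
```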

If you have not checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/

In particular, this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx

+3

There is no way to determine the encoding for certain. The best you can do is something like IE does and rely on the distribution of letters in the various languages, as well as on characters that are standard for a language. But that is a long shot at best.

I would suggest getting access to a large library of character sets (check out projects like iconv) and making them all available to the user. But don't bother with automatic detection. Just let the user choose their preferred default encoding, which itself defaults to UTF-8.
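A minimal sketch of that "let the user choose, default to UTF-8" idea, using Python's codec registry where the answer suggests iconv (the function and parameter names are illustrative):

```python
import codecs

def read_with_user_encoding(path, user_encoding=None):
    """Decode a file with the user's chosen encoding, defaulting to UTF-8.
    codecs.lookup() raises LookupError for names the library does not know,
    which is a cheap way to expose everything the library supports."""
    encoding = user_encoding or "utf-8"
    codecs.lookup(encoding)                 # fail early on a bad preference
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```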

+1

Latin-1 (ISO-8859-1) and its Windows extension CP-1252 definitely have to be supported for Western users. One might argue that UTF-8 is the superior choice, but people often don't have that choice. Chinese users will need GB-18030, and remember there are also Japanese, Russian, and Greek users who all have their own encodings alongside Unicode/UTF-8.

Regarding detection, most encodings cannot be detected safely. In some (e.g. Latin-1), certain byte values are simply invalid. In UTF-8, any byte value may appear, but not every sequence of byte values is valid. In practice, however, you will not do the decoding yourself but use an encoding/decoding library, try to decode, and catch errors. So why not support all the encodings that library supports?
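A sketch of that try-and-catch approach, with an illustrative candidate list drawn from the encodings mentioned in this thread; note that Python's latin-1 codec accepts every byte value, so it only makes sense as the last entry:

```python
CANDIDATES = ["utf-8", "cp1252", "gb18030", "shift_jis", "latin-1"]

def first_clean_decode(data: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly,
    or (None, None) if none of them do."""
    for name in CANDIDATES:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue        # invalid bytes for this encoding, try the next one
    return None, None
```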

You could also build heuristics, such as decoding with a specific encoding and then checking the result for strange characters or character combinations, or for the frequency of such characters. But this will never be safe, and I agree with Wilks that you shouldn't bother. In my experience, people usually know that a file has a specific encoding, or that only two or three are possible. So if they see you picked the wrong one, they can easily adapt. And take a look at other editors: the cleverest solution is not always the best, especially if people are used to other programs.
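If you do want such a heuristic, one possible sketch is to score each candidate decoding by how many implausible characters it produces and keep the lowest score; the category choices and the normalisation by length are assumptions here, not a standard recipe:

```python
import unicodedata

def weirdness_score(text: str) -> float:
    """Fraction of characters that rarely occur in real documents:
    control characters (other than tab/newline), unassigned code points,
    and the U+FFFD replacement character."""
    suspicious = 0
    for ch in text:
        if ch in "\t\r\n":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Cn"):
            suspicious += 1
    return suspicious / max(len(text), 1)
```

Lower is more plausible; comparing scores across candidate encodings gives a rough ranking, nothing more.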

+1

UTF-16 is not very common in text files. UTF-8 is much more common as it is backward compatible with ASCII and is specified in standards such as XML.

1) Check for the byte order marks (BOMs) of the various Unicode encodings. If one is found, use that encoding.
2) If there is no BOM, check whether the file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (many files are almost entirely ASCII but may contain a few accented characters or smart quotes) or the file ends. If it is valid UTF-8, use UTF-8.
3) If it is not Unicode, it is probably in the platform's current default code page.
4) Some encodings are easy to detect; for example, Japanese Shift-JIS makes heavy use of the prefix bytes 0x82 and 0x83, which introduce hiragana and katakana.
5) Give the user the chance to change the encoding if the program's guess turns out to be wrong.
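A sketch of steps 1–4 of this list, assuming Python; locale.getpreferredencoding() stands in for "the platform's current default code page", the 0x82/0x83 threshold is an arbitrary choice, and steps 3 and 4 are swapped so the specific heuristic runs before the generic fallback:

```python
import locale

BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_encoding(data: bytes) -> str:
    # Step 1: a BOM settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: valid UTF-8 (which includes pure ASCII) -> treat as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 4: many 0x82/0x83 bytes hint at Shift-JIS hiragana/katakana.
    kana_leads = data.count(0x82) + data.count(0x83)
    if data and kana_leads / len(data) > 0.05:
        return "shift_jis"
    # Step 3: otherwise assume the platform's current default code page.
    return locale.getpreferredencoding(False)
    # Step 5 (letting the user override a wrong guess) belongs in the UI layer.
```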

+1

Whatever you do, use more than 256 bytes for the sniff test. It's important to get it right, so why not check the whole document? Or at least the first 100 KB or so.

Try UTF-8 and the obvious UTF-16 (lots of alternating zero bytes), then fall back to the ANSI code page for the current locale.
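A sketch combining both points above: sniff a large sample, try UTF-8 first, look for the alternating-zero-byte pattern of ASCII-heavy UTF-16, then fall back to the locale's ANSI code page. The 100 KB sample size and the zero-byte thresholds are arbitrary choices, not part of the answer:

```python
import locale

def sniff_encoding(data: bytes) -> str:
    sample = data[:100_000]      # much more than 256 bytes, per the advice above
    # (A multi-byte sequence split at the cut could wrongly fail this check;
    #  a real implementation would back off a few bytes before decoding.)
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # ASCII-heavy UTF-16 has a zero byte in roughly every other position.
    half = max(len(sample) // 2, 1)
    even_zeros = sample[0::2].count(0) / half
    odd_zeros = sample[1::2].count(0) / half
    if odd_zeros > 0.4 and even_zeros < 0.05:
        return "utf-16-le"
    if even_zeros > 0.4 and odd_zeros < 0.05:
        return "utf-16-be"
    # Otherwise fall back to the ANSI code page for the current locale.
    return locale.getpreferredencoding(False)
```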

0