What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically determines the encoding by looking at the byte order mark (BOM) at the beginning of the file and/or scanning the first 256 bytes for byte values > 0x7F.
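For reference, here is a minimal sketch of that detection scheme (BOM check, then a scan of the first 256 bytes for values above 0x7F); the function name, the BOM table, and the "unknown-8-bit" placeholder are illustrative, not taken from any particular editor:

```python
def guess_ascii_or_unicode(data: bytes) -> str:
    """Guess the encoding as described above: check for a BOM first,
    then scan the first 256 bytes for values above 0x7F."""
    boms = [
        # UTF-32 BOMs must be tested before UTF-16, since they share a prefix.
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    # No BOM: any byte above 0x7F in the first 256 bytes rules out plain ASCII.
    if any(byte > 0x7F for byte in data[:256]):
        return "unknown-8-bit"   # needs further detection
    return "ascii"
```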

What other encodings should be supported, and what characteristics would make those encodings easy to detect automatically?

+4
6 answers

Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.

As far as I know, there is no reliable way to detect this automatically (although the probability of an erroneous diagnosis can be reduced to a very small amount by scanning).
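A sketch of that scanning idea, leaning on Python's built-in UTF-8 decoder as the validity check (the function name is illustrative):

```python
def probably_utf8(data: bytes) -> bool:
    """Random 8-bit text very rarely forms valid multi-byte UTF-8 sequences,
    so a clean decode is strong (though not absolute) evidence of UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Note that pure ASCII passes this check too, which is harmless because ASCII is a subset of UTF-8.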

+4

I don't know about encodings, but make sure it can support several different line-ending conventions! (\n vs. \r\n)
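A small sketch of line-ending detection on already-decoded text, assuming a simple majority vote is good enough (the function name and the tie-breaking default are arbitrary):

```python
def detect_line_ending(text: str) -> str:
    """Return the dominant line-ending convention in decoded text."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf    # LF not part of a CRLF pair
    cr = text.count("\r") - crlf    # CR not followed by LF (classic Mac OS)
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    if not any(counts.values()):
        return "\n"                 # empty or single-line file: pick a default
    return max(counts, key=counts.get)
```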

If you have not checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/

In particular, this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx

+3

There is no way to determine the encoding for certain. The best you can do is something like IE does and rely on the distribution of letters in the various languages, as well as on characters that are standard for a language. But that is a long shot at best.

I would suggest getting access to a large library of character sets (check out projects like iconv) and making them all available to the user. But don't bother with automatic detection. Just let the user choose their preferred default encoding, which itself defaults to UTF-8.
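A minimal sketch of that "let the user choose, default to UTF-8" idea, using Python's codec registry where the answer suggests iconv (the function and parameter names are illustrative):

```python
import codecs

def read_with_user_encoding(path, user_encoding=None):
    """Decode a file with the user's chosen encoding, defaulting to UTF-8.
    codecs.lookup() raises LookupError for names the library does not know,
    which is a cheap way to expose everything the library supports."""
    encoding = user_encoding or "utf-8"
    codecs.lookup(encoding)                 # fail early on a bad preference
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```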

+1

Latin-1 (ISO-8859-1) and its Windows extension CP-1252 definitely have to be supported for Western users. One might argue that UTF-8 is the superior choice, but people often don't have that choice. Chinese users will need GB-18030, and remember there are also Japanese, Russian, and Greek users who all have their own encodings alongside Unicode/UTF-8.

Regarding detection, most encodings cannot be detected safely. In some (e.g. Latin-1), certain byte values are simply invalid. In UTF-8, any byte value may appear, but not every sequence of byte values is valid. In practice, however, you will not do the decoding yourself but use an encoding/decoding library, try to decode, and catch errors. So why not support all the encodings that library supports?
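A sketch of that try-and-catch approach, with an illustrative candidate list drawn from the encodings mentioned in this thread; note that Python's latin-1 codec accepts every byte value, so it only makes sense as the last entry:

```python
CANDIDATES = ["utf-8", "cp1252", "gb18030", "shift_jis", "latin-1"]

def first_clean_decode(data: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly,
    or (None, None) if none of them do."""
    for name in CANDIDATES:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue        # invalid bytes for this encoding, try the next one
    return None, None
```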

You could also build heuristics, such as decoding with a specific encoding and then checking the result for strange characters or character combinations, or for the frequency of such characters. But this will never be safe, and I agree with Wilks that you shouldn't bother. In my experience, people usually know that a file has a specific encoding, or that only two or three are possible. So if they see you picked the wrong one, they can easily adapt. And take a look at other editors: the cleverest solution is not always the best, especially if people are used to other programs.
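If you do want such a heuristic, one possible sketch is to score each candidate decoding by how many implausible characters it produces and keep the lowest score; the category choices and the normalisation by length are assumptions here, not a standard recipe:

```python
import unicodedata

def weirdness_score(text: str) -> float:
    """Fraction of characters that rarely occur in real documents:
    control characters (other than tab/newline), unassigned code points,
    and the U+FFFD replacement character."""
    suspicious = 0
    for ch in text:
        if ch in "\t\r\n":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Cn"):
            suspicious += 1
    return suspicious / max(len(text), 1)
```

Lower is more plausible; comparing scores across candidate encodings gives a rough ranking, nothing more.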

+1

UTF-16 is not very common in text files. UTF-8 is much more common as it is backward compatible with ASCII and is specified in standards such as XML.

1) Check for the byte order marks (BOMs) of the various Unicode encodings. If one is found, use that encoding.
2) If there is no BOM, check whether the file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (many files are almost entirely ASCII but may contain a few accented characters or smart quotes) or the file ends. If it is valid UTF-8, use UTF-8.
3) If it is not Unicode, it is probably in the platform's current default code page.
4) Some encodings are easy to detect; for example, Japanese Shift-JIS makes heavy use of the prefix bytes 0x82 and 0x83, which introduce hiragana and katakana.
5) Give the user the chance to change the encoding if the program's guess turns out to be wrong.
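A sketch of steps 1–4 of this list, assuming Python; locale.getpreferredencoding() stands in for "the platform's current default code page", the 0x82/0x83 threshold is an arbitrary choice, and steps 3 and 4 are swapped so the specific heuristic runs before the generic fallback:

```python
import locale

BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_encoding(data: bytes) -> str:
    # Step 1: a BOM settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: valid UTF-8 (which includes pure ASCII) -> treat as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 4: many 0x82/0x83 bytes hint at Shift-JIS hiragana/katakana.
    kana_leads = data.count(0x82) + data.count(0x83)
    if data and kana_leads / len(data) > 0.05:
        return "shift_jis"
    # Step 3: otherwise assume the platform's current default code page.
    return locale.getpreferredencoding(False)
    # Step 5 (letting the user override a wrong guess) belongs in the UI layer.
```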

+1

Whatever you do, use more than 256 bytes for the sniff test. It's important to get it right, so why not check the whole document? Or at least the first 100 KB or so.

Try UTF-8 and the obvious UTF-16 (lots of alternating zero bytes), then fall back to the ANSI code page for the current locale.
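A sketch combining both points above: sniff a large sample, try UTF-8 first, look for the alternating-zero-byte pattern of ASCII-heavy UTF-16, then fall back to the locale's ANSI code page. The 100 KB sample size and the zero-byte thresholds are arbitrary choices, not part of the answer:

```python
import locale

def sniff_encoding(data: bytes) -> str:
    sample = data[:100_000]      # much more than 256 bytes, per the advice above
    # (A multi-byte sequence split at the cut could wrongly fail this check;
    #  a real implementation would back off a few bytes before decoding.)
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # ASCII-heavy UTF-16 has a zero byte in roughly every other position.
    half = max(len(sample) // 2, 1)
    even_zeros = sample[0::2].count(0) / half
    odd_zeros = sample[1::2].count(0) / half
    if odd_zeros > 0.4 and even_zeros < 0.05:
        return "utf-16-le"
    if even_zeros > 0.4 and odd_zeros < 0.05:
        return "utf-16-be"
    # Otherwise fall back to the ANSI code page for the current locale.
    return locale.getpreferredencoding(False)
```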

0