How to determine the character set in a string?

Question

I have several files in several different languages. I thought they were all encoded in UTF-8, but now I'm not so sure. Some characters look fine, some don't. Is there a way to go through the files line by line and try to identify the character set? Maybe split on whitespace and then check each word? Finally, is there an easy way to convert characters from one set to UTF-8?

+7

perl utf-8 character-encoding

anon Nov 25 '08 at 10:18

3 answers

Determining whether a file is probably UTF-8 should be fairly simple. Determining the encoding when it is not UTF-8 is very difficult in general.

If the file is encoded in UTF-8, the high-order bits of every byte must follow a pattern. If a character is one byte long, its most significant bit is cleared (zero). Otherwise, for an n-byte character (where n is 2 to 4), the first byte has its n high-order bits set to one, followed by one zero bit. The following n - 1 bytes must each have the highest bit set and the second-highest bit cleared.

If all the bytes in your file follow these rules, it is probably encoded in UTF-8. I say "probably" because anyone can invent a new encoding that happens to follow the same rules, intentionally or by accident, but interprets the codes differently.
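The rules above can be sketched in Perl roughly like this. It is only a structural check of the lead and continuation bits as described, not a full validator: it does not reject overlong forms or surrogates, and the name looks_like_utf8 is made up for illustration.

```perl
use strict;
use warnings;

# Return true if $bytes follows the UTF-8 lead/continuation bit pattern.
# Structural check only: overlong forms and surrogates are not rejected.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;    # one integer (0-255) per byte
    my $i = 0;
    while ($i < @b) {
        my $lead = $b[$i++];
        my $extra;                  # number of continuation bytes expected
        if    (($lead & 0x80) == 0x00) { $extra = 0 }    # 0xxxxxxx
        elsif (($lead & 0xE0) == 0xC0) { $extra = 1 }    # 110xxxxx
        elsif (($lead & 0xF0) == 0xE0) { $extra = 2 }    # 1110xxxx
        elsif (($lead & 0xF8) == 0xF0) { $extra = 3 }    # 11110xxx
        else                           { return 0 }      # stray continuation or invalid lead
        for (1 .. $extra) {
            return 0 if $i >= @b;                        # sequence cut off at end of input
            return 0 if ($b[$i++] & 0xC0) != 0x80;       # must be 10xxxxxx
        }
    }
    return 1;
}

print looks_like_utf8("caf\xC3\xA9") ? "maybe UTF-8\n" : "not UTF-8\n";
print looks_like_utf8("caf\xE9")     ? "maybe UTF-8\n" : "not UTF-8\n";
```

The first call passes a two-byte sequence (C3 A9) and the second a lone Latin-1 byte (E9), which looks like a three-byte lead with nothing after it, so only the first should pass.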

Note that a file encoded in US-ASCII will also follow these rules, with the high bit of every byte being zero. It is fine to treat such a file as UTF-8, since the two are compatible in that range. If the bytes do not follow the rules, the file is in some other encoding, and there is no inherent test to tell which one. You will have to use contextual knowledge to guess.

+6

erickson Nov 25 '08 at 10:39

Take a look at iconv

http://www.gnu.org/software/libiconv/

Text::Iconv

+2

rebra Nov 25 '08 at 10:27

Leon Timmermans · Accepted Answer · 2008-11-25T22:37:34+0000

If you don't know the character set for sure, you can basically only guess. utf8::valid might help you with that, but you still can't know for sure. If you know that, when it isn't Unicode, it must be one specific character set (Latin-1, for example), you're in luck. If you don't, you're screwed. In any case, you should always assume the entire file is in a single character set unless something indicates otherwise. If you don't, you will lose your sanity.
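A minimal sketch of the "UTF-8, or else one known character set" case might look like the following. Note the assumptions: decode_guess is a hypothetical helper, and it uses a strict decode from Encode rather than utf8::valid, because as far as I know utf8::valid only reports whether Perl's internal representation is consistent and returns true for plain byte strings. The fallback encoding is something you have to supply yourself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes as UTF-8 if it is well-formed,
# otherwise fall back to $fallback (an assumption the caller supplies).
# Returns the decoded text and the name of the encoding that was used.
sub decode_guess {
    my ($bytes, $fallback) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($text, 'UTF-8') if defined $text;
    return (decode($fallback, $bytes), $fallback);
}

my ($text, $used) = decode_guess("caf\xE9", 'ISO-8859-1');
print "decoded as $used\n";
```

Keep in mind this is still a guess: almost any byte sequence decodes without error as Latin-1, so the fallback never fails even when it is wrong.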

As for your question about how to convert between character sets: Encode will do that for you.
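For example, converting with Encode could look roughly like this. It is a sketch that assumes you already know the source encoding (ISO-8859-1 here is just an example), and to_utf8 is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a byte string from a known source encoding to UTF-8 bytes.
# $from must be known in advance; Encode will not guess it for you.
sub to_utf8 {
    my ($bytes, $from) = @_;
    my $chars = decode($from, $bytes);    # bytes -> Perl character string
    return encode('UTF-8', $chars);       # characters -> UTF-8 bytes
}

my $utf8 = to_utf8("caf\xE9", 'ISO-8859-1');
# Show the resulting bytes in hex
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8), "\n";
```

Encode also has from_to, which does the same conversion in place on a byte string, if you would rather not go through an intermediate character string yourself.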
