Best way to decode an unknown encoding in Python 2.5

Am I doing this the right way? I parse a lot of HTML, but I don't always know what encoding it was meant to be in (a surprising number of pages lie about it). The code below shows what I have done so far, but I'm sure there is a better way. Your suggestions would be much appreciated.

    import logging
    import codecs
    from utils.error import Error

    class UnicodingError(Error):
        pass

    # these encodings should be in most-likely order to save time
    encodings = [
        "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437",
        "cp500", "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857",
        "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869",
        "cp874", "cp875", "cp932", "cp949", "cp950", "cp1006", "cp1026",
        "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255",
        "cp1256", "cp1257", "cp1258", "euc_jp", "euc_jis_2004", "euc_jisx0213",
        "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp",
        "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004", "iso2022_jp_3",
        "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3",
        "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7", "iso8859_8",
        "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15",
        "johab", "koi8_r", "koi8_u", "mac_cyrillic", "mac_greek", "mac_iceland",
        "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis",
        "shift_jis_2004", "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le",
        "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig",
    ]

    def to_unicode(string):
        '''make unicode by trying each candidate encoding in turn'''
        for enc in encodings:
            try:
                logging.debug("unicoder is trying " + enc + " encoding")
                result = unicode(string, enc)   # the builtin unicode()
                logging.info("unicoder is using " + enc + " encoding")
                return result
            except UnicodeDecodeError:
                pass
        raise UnicodingError("still don't recognise encoding after trying to guess")
python html encoding unicode character-encoding
3 answers

There are two general-purpose libraries for detecting unknown encodings:

  • chardet
  • UnicodeDammit (part of Beautiful Soup)

chardet is supposed to be a port of the way Firefox does it.
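If you go the chardet route, usage looks roughly like this (a sketch; the file name is made up, and chardet.detect() returns a dict with 'encoding' and 'confidence' keys):

    import chardet

    raw = open("page.html", "rb").read()   # hypothetical input file
    guess = chardet.detect(raw)            # e.g. {'encoding': 'windows-1252', 'confidence': 0.87}
    text = raw.decode(guess["encoding"])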

You can use the following regular expression to detect UTF-8 in byte strings:

    import re

    utf8_detector = re.compile(r"""^(?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$""", re.X)

In practice, if you are dealing with English, I found the following works in 99.9% of cases (see the sketch after this list):

  • if it passes the above regular expression, it's ASCII or UTF-8
  • if it contains any bytes in the range 0x80-0x9f but not 0xa4, it's Windows-1252
  • if it contains 0xa4, assume it's latin-15 (ISO-8859-15)
  • otherwise assume it's latin-1 (ISO-8859-1)
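A minimal sketch of that heuristic (the guess_encoding helper name is mine; it assumes the utf8_detector regex above):

    def guess_encoding(data):
        '''Guess the encoding of a byte string, assuming mostly-English text.'''
        if utf8_detector.match(data):
            return "utf_8"       # also covers pure ASCII
        if any("\x80" <= c <= "\x9f" for c in data) and "\xa4" not in data:
            return "cp1252"      # Windows-1252 maps 0x80-0x9f to printable characters
        if "\xa4" in data:
            return "iso8859_15"  # 0xa4 is the euro sign in latin-15
        return "latin_1"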

I ran into the same problem and found that there is no way to determine a content's encoding without metadata about the content. That is why I ended up with the same approach you are using here.

My only additional advice on what you have done: rather than ordering the list of candidate encodings from most to least likely, order it by specificity. I found that some character sets are subsets of others, so if you check utf_8 as your second choice, you will never detect the encodings that are subsets of utf_8 (I think one of the Korean character sets uses the same number space as utf).
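For illustration (my example, not the answerer's): latin_1 accepts any byte sequence, so if it is tried before utf_8 it "succeeds" on UTF-8 input and you get mojibake - the most restrictive codecs have to come first:

    data = "\xc3\xa9"                    # the UTF-8 encoding of u'\xe9' (e-acute)
    print repr(data.decode("utf_8"))     # u'\xe9'     - the intended character
    print repr(data.decode("latin_1"))   # u'\xc3\xa9' - also "succeeds", but gives the wrong text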


Since you are using Python, you can try UnicodeDammit. It is part of Beautiful Soup, which you may also find useful.

As the name suggests, UnicodeDammit will try its best to get proper unicode out of the crap you find in the wild.
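A short usage sketch, assuming the Beautiful Soup 3 API in which UnicodeDammit exposes .unicode for the converted text and .originalEncoding for its guess:

    from BeautifulSoup import UnicodeDammit

    dammit = UnicodeDammit(raw_html)   # raw_html is a hypothetical byte string
    text = dammit.unicode              # the converted text, or None if conversion failed
    print dammit.originalEncoding      # the encoding UnicodeDammit settled on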

