JSON dump from string in unknown character encoding

I am trying to dump HTML from websites into JSON, and I need a way to handle various character encodings.

I read that if it is not UTF-8, it is probably ISO-8859-1, so now I am doing the following:

    json = None
    for possible_encoding in ["utf-8", "ISO-8859-1"]:
        try:
            # post_dict contains, among other things, website html retrieved
            # with urllib2
            json = simplejson.dumps(post_dict, encoding=possible_encoding)
            break
        except UnicodeDecodeError:
            pass
    if json is None:
        raise UnicodeDecodeError

This, of course, crashes if I come across any other encodings, so I wonder if there is a way to solve this problem in the general case.

The reason I'm trying to serialize the HTML is that I need to send it in a POST request to our NodeJS server. So if someone has another solution that lets me do this (perhaps without JSON serialization at all), I would be happy to hear that as well.

1 answer

You must know the character encoding regardless of the media type you use to send the POST request (unless you want to send opaque binary blobs). To get the character encoding of your HTML content, see "A good way to get the charset/encoding of an HTTP response in Python".
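As a minimal sketch of that approach: the charset usually comes from the Content-Type response header (e.g. "text/html; charset=utf-8"). The helper name and the ISO-8859-1 fallback below are my own choices, not from the original answer; the fallback mirrors the historical HTTP default when no charset is declared.

    from email.message import Message

    def charset_from_content_type(content_type, fallback="ISO-8859-1"):
        # Reuse the stdlib header parser: get_content_charset() extracts
        # the charset parameter from a Content-Type value, if present.
        msg = Message()
        msg["Content-Type"] = content_type
        return msg.get_content_charset() or fallback

For example, charset_from_content_type("text/html; charset=UTF-8") yields "utf-8", while a bare "text/html" falls back to "ISO-8859-1". (In practice the header can lie, so a sniffing library like chardet is a common second line of defense.)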

To send post_dict as JSON, make sure all the strings in it are Unicode (just decode the HTML to Unicode as soon as you fetch it) and don't pass the encoding parameter to json.dumps() at all. That parameter won't help you anyway when different websites (where you get your HTML strings) use different encodings.

