I am trying to unload HTML from websites into JSON, and I need a way to handle various character encodings.
I read that if it is not utf-8, it is probably ISO-8859-1, so now I am doing the following:
for possible_encoding in ["utf-8", "ISO-8859-1"]: try: # post_dict contains, among other things, website html retrieved # with urllib2 json = simplejson.dumps(post_dict, encoding=possible_encoding) break except UnicodeDecodeError: pass if json is None: raise UnicodeDecodeError
This, of course, crashes if I come across any other encodings, so I wonder if there is a way to solve this problem in the general case.
The reason I'm trying to serialize HTML is because I need to send a POST request to our NodeJS server. So, if someone has another solution that allows me to do this (perhaps without serialization for JSON at all), I would be happy to hear that as well.
source share