Is Python 3.3 better than 2.7 for decoding scraped web text and re-encoding it as UTF-8? How much better?

There are, apparently, a million questions about Python Unicode errors of the form "...ordinal not in range(128)". The vast majority seem to involve Python 2.x.

I know about these errors because I am currently in decoding hell. For a third-party project, I scrape web pages and try to normalize the text so that it does not show up on our site with crazy characters. To normalize the data, I rely on HTMLParser's HTMLParser() and entitydefs, and I also decode the text from whatever its original encoding is (string.decode('[original encoding]', 'ignore')) and encode it as UTF-8 (string.encode('utf-8', 'ignore')).

However, there always seems to be one site where my best efforts fail and raise the same old UnicodeError: ...ordinal not in range(128). It is so annoying.

I have read (here and here) that in Python 3 all text is Unicode. Although I have read a lot about Unicode, I am not a software engineer, so I don't know whether Unicode by default is objectively better (i.e., has a lower failure rate) than the ASCII default in 2.x. I would assume it is better, but I would like someone more knowledgeable and experienced to give some perspective.

I would like to know whether I should upgrade to Python 3 for its (improved) handling of text scraped from the web. I hope someone here can explain (or suggest resources that explain) the pros and cons of Python 3's approach to text processing. Is it better? Has anyone who dealt with my problem already migrated to Python 3? Would they recommend that I start using Python 3, assuming the 2to3 migration is not a problem?

Thanks in advance for any help. I need it very much.

python encoding unicode
1 answer

I will speak from the point of view of a Python 2.7 user.

It is true that Python 3 introduces some big changes in how Unicode is handled. I won't say that working with encodings in Python 3 is easier, but its approach is definitely smarter when it comes to i18n.
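To make the difference concrete, here is a minimal sketch written for Python 3 (the variable names are only illustrative). In Python 2, mixing a byte string with a unicode object triggers an implicit ASCII decode, which is where "ordinal not in range(128)" usually comes from. Python 3 keeps bytes and str apart and refuses to mix them, so the failure shows up as a TypeError at the point of the mistake instead.

```python
# Illustrative sketch for Python 3; names are made up for this example.
raw = b"caf\xc3\xa9"          # UTF-8 bytes, as they might arrive from a scraper
text = raw.decode("utf-8")    # explicit decode: bytes -> str (Unicode text)

# What Python 2 does implicitly when bytes and unicode are mixed,
# written out explicitly here: an ASCII decode of non-ASCII bytes.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)    # message ends with "ordinal not in range(128)"

# Python 3 never attempts that implicit decode; mixing the types fails fast.
try:
    text + raw
    mixing_rejected = False
except TypeError:
    mixing_rejected = True
```

This does not make a wrongly declared encoding any less wrong; it only makes it harder to forget the decode step.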

As I said, I am using Python 2.7, and so far I have been able to handle every encoding problem I have run into. You just need to understand what is happening under the hood and have a reasonable grasp of what encodings are. This is, of course, the best article for understanding encodings.

In that article, Joel states what you need to keep in mind every time you run into an encoding problem:

It makes no sense to have a string without knowing which encoding it uses.

Having said that, my suggestion to approach your problem with Python 2.7 would be something like this:

  • Read Joel's article, of course (an excellent read that takes 30 minutes or less).
  • Find out which encoding the web page uses (you can get an idea by looking at the response headers or by inspecting the page with BeautifulSoup).
  • .decode() the extracted string using the encoding you found.
  • Once you decode, you no longer have a str object; you have a unicode object.
  • Unicode is just an internal representation, not a real encoding, so if you want to output the content somewhere you will need to .encode() it, and I suggest you use utf-8, of course.
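The steps above can be sketched roughly as follows. This is only an illustration: the helper names charset_from_content_type and to_utf8 are made up for the example, the header value and sample bytes are invented, and falling back to UTF-8 when no charset is declared is an assumption rather than a rule. The code runs unchanged on Python 2.7 and Python 3.

```python
import re

def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset out of a Content-Type header value.

    Hypothetical helper; the default is an assumption for pages
    that declare nothing.
    """
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type, re.IGNORECASE)
    return match.group(1) if match else default

def to_utf8(raw_bytes, source_encoding):
    """Decode bytes to Unicode text, then encode that text as UTF-8."""
    text = raw_bytes.decode(source_encoding)   # bytes -> Unicode
    return text.encode("utf-8")                # Unicode -> UTF-8 bytes

# Invented sample data: "café" encoded as ISO-8859-1.
header = "text/html; charset=ISO-8859-1"
raw = b"caf\xe9"

encoding = charset_from_content_type(header)
utf8_bytes = to_utf8(raw, encoding)
```

The important part is the order: decode exactly once at the input boundary, work with Unicode in between, and encode exactly once at the output boundary.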

Now, there are a few points you need to understand. The web page you are scraping may not declare an encoding at all, or it may declare one and then not stick to it. That is the webmaster's mistake, but you still have to deal with it somehow. You have 3 options:

  • Ignore the characters that cause problems (the 'ignore' error handler). Just silently skip them.
  • There are good Python libraries that try to figure out which encoding a string uses. They are quite accurate, but certainly not a silver bullet. They can guess wrong, especially if the content is malformed.
  • Get angry and give up on the project ;) (I really do not recommend this one)
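As a rough illustration of the first option, the sketch below decodes bytes whose declared encoding is wrong. The sample bytes are invented: they are ISO-8859-1 data that a page has wrongly labeled as UTF-8. The error handlers 'strict' (the default), 'ignore' and 'replace' are standard arguments to .decode(), and the code runs on both Python 2.7 and Python 3.

```python
# Invented sample: ISO-8859-1 bytes from a page that claims to be UTF-8.
raw = b"caf\xe9 au lait"

# Default ('strict') handling: the bad byte raises an exception.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the byte that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# 'replace' keeps a visible marker (U+FFFD) where data was lost.
replaced = raw.decode("utf-8", "replace")
```

Both handlers lose information: 'ignore' hides the damage completely, while 'replace' at least leaves a trace that can be searched for later.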

Getting encodings right takes some discipline from both the source and the client. You need to write your program properly, but you also need the encoding the source declares to match the encoding it actually uses.

Python 3 improves Unicode handling, but if you don't understand what is going on, that will probably not help you. The best thing you can do is understand encodings (it is not that hard, read Joel!), and once you do, you will be able to handle them in Python 2.7, Python 3.3 and even PHP ;)

Hope this helps!
