How to use unidecode in python (3.3)

I am trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode

It should take a string and convert all non-ASCII characters to their nearest ASCII equivalents. I used the same module in Perl quite easily, just by calling while (<input>) { $_ = unidecode($_); }, and since this one is a direct port of the Perl module, the documentation indicates it should work much the same.

I'm sure this is something simple; I just don't understand enough about character and file encodings to know what the problem is. My source file is encoded in UTF-8 (converted from UCS-2LE). The problem probably has more to do with my lack of encoding knowledge and incorrect string handling than with the module itself, but I hope someone can explain why. I've tried everything I can think of, short of randomly changing code and searching through the errors I get, with no luck.

Here is my Python code:

    from unidecode import unidecode

    def toascii():
        origfile = open(r'C:\log.convert', 'rb')
        convertfile = open(r'C:\log.toascii', 'wb')
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)
        origfile.close()
        convertfile.close()

    toascii()

If I do not open the source file in byte mode ( origfile = open('file.txt', 'r') ), I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the line for line in origfile: .

If I open it in byte mode 'rb' , I get TypeError: ord() expected string length 1, but int found from line = unidecode(line) .
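For context (a minimal demonstration, not part of the original question): in Python 3, indexing or iterating a bytes object yields ints rather than one-character strings, which is why unidecode's internal ord() call fails on a bytes line:

```python
# In Python 3, bytes elements are ints, not 1-character strings.
line = b'hello'
print(type(line[0]))   # <class 'int'>
print(list(line[:3]))  # [104, 101, 108]

# ord() accepts only a length-1 string, so code written for str breaks on bytes:
try:
    ord(line[0])       # ord(104) -> TypeError
except TypeError as e:
    print(e)
```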

If I cast the line to a string with line = unidecode(str(line)) , then it is written to the file, but... not correctly: \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It writes out the characters \n, \r, etc. and the escaped byte sequences instead of converting anything.
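That garbled output can be reproduced with a small sketch (the sample text here is hypothetical): calling str() on a bytes object in Python 3 produces its repr, escape sequences and all, rather than decoding it.

```python
# str() on bytes gives the b'...' literal, escapes included, not decoded text.
raw = '\u2550\u2550 box \u2550\u2550'.encode('utf8')
print(str(raw))            # b'\xe2\x95\x90\xe2\x95\x90 box \xe2\x95\x90\xe2\x95\x90'
print(raw.decode('utf8'))  # the actual text, which is what unidecode() expects
```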

If I cast the line to a string as above and open the output file in byte mode 'wb' , I get TypeError: 'str' does not support the buffer interface .

If I open the output file in text mode 'w' instead of 'wb' and call unidecode(line) without the cast, I again get TypeError: ord() expected string length 1, but int found .

python encoding unicode
1 answer

The unidecode module takes Unicode string values and returns a Unicode string in Python 3. You are giving it binary data instead. Either decode the data to Unicode yourself or open the input file in text mode, and either encode the result to ASCII before writing it or open the output file in text mode.
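A minimal sketch of that point (the sample text is hypothetical): decode the bytes to str before passing them to unidecode().

```python
from unidecode import unidecode

# A hypothetical byte line, like what reading origfile in 'rb' mode yields:
raw_line = '\u00dcn\u00efc\u00f6d\u00e9 t\u00ebxt'.encode('utf8')

# Decode bytes to str first; unidecode() in Python 3 expects str, not bytes.
text = raw_line.decode('utf8')
print(unidecode(text))  # 'Unicode text'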

Quote from the module documentation:

The module exports a single function that takes a Unicode object (Python 2.x) or string (Python 3.x), and returns a string ( that can be encoded to ASCII bytes in Python 3.x )

Emphasis mine.

This should work:

    def toascii():
        with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, \
             open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
            for line in origfile:
                line = unidecode(line)
                convertfile.write(line)

This opens the input file in text mode (using UTF-8, which, judging by your sample line, is the correct codec) and writes in text mode as well (encoding to ASCII).

You need to be explicit about the encodings of the files you open; if you omit the encoding, the current system locale is used (the result of the locale.getpreferredencoding(False) call), which usually is not the right codec if your code needs to be portable.
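You can check what open() would fall back to, and see that an explicit encoding removes the platform dependence (a small sketch; the temp-file usage is just for illustration):

```python
import locale
import tempfile

# The codec open() uses when encoding= is omitted; this varies by platform
# and locale, e.g. 'UTF-8' on most Linux systems, 'cp1252' on Windows.
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly makes the behavior the same everywhere:
with tempfile.NamedTemporaryFile('w', encoding='ascii', suffix='.txt',
                                 delete=False) as f:
    f.write('explicit encodings are portable\n')
    name = f.name

with open(name, 'r', encoding='ascii') as f:
    print(f.read())
```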

