I am trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode
It should take a string and convert every non-ASCII character to its closest ASCII equivalent. I used the same module in Perl quite easily, just by calling while (<input>) { $_ = unidecode($_); }, and this one is a direct port of the Perl module, so according to the documentation it should work the same way.
I'm sure this is something simple; I just don't understand enough about character and file encodings to know what the problem is. My source file is encoded in UTF-8 (converted from UCS-2LE). The problem probably has more to do with my lack of encoding knowledge and incorrect string handling than with the module, and I hope someone can explain why. I have tried everything I know, short of randomly inserting code, and searched for the errors I'm getting, with no luck so far.
Here is my Python:

    from unidecode import unidecode

    def toascii():
        origfile = open(r'C:\log.convert', 'rb')
        convertfile = open(r'C:\log.toascii', 'wb')
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)
        origfile.close()
        convertfile.close()

    toascii();
If I do not open the source file in byte mode (origfile = open('file.txt','r')), I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the line for line in origfile:
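As far as I can tell, this first error can be reproduced without the file at all. The sketch below assumes that text mode on my Windows machine falls back to cp1252 (that is my guess at what 'charmap' refers to, not something I have confirmed) and uses the UTF-8 bytes of the box-drawing character that appears in my log:

```python
# UTF-8 encoding of U+2550 (box-drawing double horizontal), which appears in my log.
raw = b"\xe2\x95\x90"

# Assumption: cp1252 is the default text-mode codec on my Windows setup.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    # Byte 0x90 has no mapping in cp1252, so decoding fails here.
    cp1252_error = exc

# Decoding the same bytes with the encoding the file actually uses succeeds.
decoded = raw.decode("utf-8")

print(cp1252_error)
print(repr(decoded))
```

If that assumption holds, the failure comes from the codec chosen when the file is opened, before unidecode is ever called.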
If I open it in 'rb' byte mode, I get TypeError: ord() expected string of length 1, but int found from line = unidecode(line).
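The ord() error can also be reproduced in isolation. This is only a sketch of what I think is happening: I am assuming that unidecode walks over its argument element by element and calls ord() on each one, which I have not verified in its source.

```python
# A line as it comes out of a file opened in 'rb' mode under Python 3.
line = b"ab\r\n"

# Iterating over a bytes object yields integers, not 1-character strings.
elements = list(line)

try:
    ord(elements[0])  # ord() on an int is what raises the TypeError
    ord_error = None
except TypeError as exc:
    ord_error = exc

# After decoding, iteration yields 1-character strings that ord() accepts.
text_elements = list(line.decode("utf-8"))

print(elements)
print(ord_error)
print(ord(text_elements[0]))
```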
If I cast the line to a string, line = unidecode(str(line)), then it does write to the file, but not correctly:

    \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\

It writes out \n, \r, etc. and the escaped bytes literally instead of converting them to anything.
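The odd output looks like what str() produces for a bytes object in Python 3, namely its printable representation rather than decoded text. A minimal sketch, using a shortened sample line modelled on my log (the sample bytes are my own, and the utf-8-sig codec is my assumption about how to handle the leading BOM):

```python
# A shortened line as read in 'rb' mode, starting with the UTF-8 BOM.
line = b"\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name\r\n"

# str() on bytes returns the repr, including the b'...' wrapper and escapes.
as_repr = str(line)

# Decoding turns the bytes into real text; 'utf-8-sig' also drops the BOM.
as_text = line.decode("utf-8-sig")

print(as_repr)
print(repr(as_text))
```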
If I convert the line to a string as above and open the output file in byte mode 'wb', I get TypeError: 'str' does not support the buffer interface.
If I open the output file in byte mode 'wb' and call unidecode(line) without converting the line to a string first, I again get TypeError: ord() expected string of length 1, but int found.
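Putting these observations together, my current guess is that both files should be opened in text mode with explicit encodings, so that the conversion function only ever sees str. The following is a sketch under that assumption, not a confirmed fix: convert_file and drop_non_ascii are hypothetical names of my own, and drop_non_ascii is only a stand-in that deletes non-ASCII characters so the example runs without the package installed. In my real script I would pass unidecode in its place.

```python
import os
import tempfile


def convert_file(src_path, dst_path, transliterate):
    """Read src as UTF-8 (dropping a BOM), transliterate each line, write ASCII."""
    # newline="" keeps the original \r\n line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(transliterate(line))


def drop_non_ascii(text):
    """Hypothetical stand-in for unidecode: simply removes non-ASCII characters."""
    return text.encode("ascii", "ignore").decode("ascii")


# Small demonstration on a temporary file instead of my real log.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "log.convert")
    dst_path = os.path.join(tmp, "log.toascii")
    with open(src_path, "wb") as f:
        f.write(b"\xef\xbb\xbfcaf\xc3\xa9 \xe2\x95\x90\r\n")
    convert_file(src_path, dst_path, drop_non_ascii)
    with open(dst_path, "rb") as f:
        result = f.read()

print(result)
```

Passing the conversion function as a parameter is only there to keep the sketch independent of the package. Whether unidecode behaves well on every line of my log is something I still need to check.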