Automatically open the file in the correct encoding

I am having trouble with files that arrive in several different encodings. We receive files from another company and have to read them (the files are in CSV format).

Strangely, the files appear to be encoded in UTF-16. I can read them, but only by opening them with the codecs module and specifying the encoding explicitly, like this:

    import codecs
    import csv

    ENCODING = 'utf-16'
    with codecs.open(test_file, encoding=ENCODING) as descriptor:
        # Autodetect dialect
        dialect = csv.Sniffer().sniff(descriptor.read(1024))
        descriptor.seek(0)
        input_file = csv.reader(descriptor, dialect=dialect)
        for line in input_file:
            do_funny_things()

But, just as I can detect the dialect automatically, I think it would be great to be able to open files automatically with their proper encoding, at least for text files. Other programs, such as vim, manage to do this.

Does anyone know a way to do this in Python 2.6?

PS: I hope this will be solved in Python 3, since all strings are Unicode there...

+6
python
4 answers

chardet can help you.

Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source.
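A minimal sketch of how this could be wired into opening a file, assuming chardet is installed (`pip install chardet`). `chardet.detect` is the library's real entry point; the helper names `guess_encoding` and `open_guessed`, the `sample_size` parameter, and the UTF-8 fallback are my own assumptions for illustration.

```python
import io

try:
    import chardet  # third-party package; assumed to be installed
except ImportError:  # keeps the sketch importable without chardet
    chardet = None


def guess_encoding(raw, default="utf-8"):
    """Return chardet's best guess for the bytes in `raw`, or `default`."""
    if chardet is None or not raw:
        return default
    # chardet.detect returns a dict with 'encoding' and 'confidence' keys;
    # 'encoding' is None when the detector cannot decide.
    return chardet.detect(raw)["encoding"] or default


def open_guessed(path, sample_size=4096, default="utf-8"):
    """Open `path` as text, guessing the encoding from its first bytes."""
    with io.open(path, "rb") as f:
        raw = f.read(sample_size)
    return io.open(path, encoding=guess_encoding(raw, default))
```

The result is only a guess, so it is probably worth checking the `confidence` value as well, or falling back to an encoding you agree on with the sender when the guess looks wrong.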

+8

It will not be “fixed” in Python 3, because it is not a fixable problem. Many documents are valid in several encodings, so the only way to determine the correct encoding is to know something about the document. Fortunately, in most cases we do know something about it: for example, most of its characters will cluster in a few Unicode blocks. An English document mostly contains characters from the first 128 code points. A Russian document will mostly contain Cyrillic code points. Most documents will contain spaces and newlines. These hints can be used to make reasonable guesses about which encoding is in use. Better yet, use a library written by someone who has already done this work (such as chardet, mentioned in Desintegr's answer).
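To illustrate the kind of hint-based guessing described above, here is a rough standard-library sketch. It checks for a byte order mark first, then uses the fact that mostly-ASCII text in UTF-16 has a NUL in every other byte. The function name `sniff_encoding`, the 1024-byte sample, and the 60% threshold are arbitrary assumptions of mine, and this is far cruder than what chardet does.

```python
import codecs

# BOMs are checked longest-first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw, default="utf-8"):
    """Guess an encoding name for the bytes in `raw` from simple hints."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name  # these codecs consume the BOM when decoding

    # No BOM: mostly-ASCII UTF-16 text has NULs in alternating bytes.
    sample = raw[:1024]
    if len(sample) >= 4:
        even_nuls = sample[0::2].count(b"\x00")
        odd_nuls = sample[1::2].count(b"\x00")
        half = len(sample) // 2
        if odd_nuls > 0.6 * half and even_nuls == 0:
            return "utf-16-le"
        if even_nuls > 0.6 * half and odd_nuls == 0:
            return "utf-16-be"

    return default  # nothing recognised; the caller's assumption wins
```

A heuristic like this only covers the easy cases (it says nothing about Latin-1 versus Windows-1252, for instance), which is exactly why a dedicated library is the better choice.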

+5

csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for solutions.
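A sketch of the usual workaround, along the lines of the recoding example in the csv documentation: decode the file, re-encode each line as UTF-8 for `csv.reader`, then decode every cell again. The generator name `unicode_csv_rows` is hypothetical, and the Python 3 branch is only there so the same snippet runs on both versions; I have not tested the 2.x branch on 2.6 specifically.

```python
import csv
import io
import sys


def unicode_csv_rows(path, encoding, **kwargs):
    """Yield rows of Unicode strings from a CSV file in `encoding`."""
    if sys.version_info[0] >= 3:
        # Python 3: csv.reader works on text directly.
        with io.open(path, encoding=encoding, newline="") as f:
            for row in csv.reader(f, **kwargs):
                yield row
    else:
        # Python 2: csv.reader wants byte strings, so feed it UTF-8
        # and decode each cell on the way out.
        with io.open(path, encoding=encoding) as f:
            utf8_lines = (line.encode("utf-8") for line in f)
            for row in csv.reader(utf8_lines, **kwargs):
                yield [cell.decode("utf-8") for cell in row]
```

Extra keyword arguments such as `dialect=` are passed straight through to `csv.reader`, so this should combine with the `csv.Sniffer` approach from the question.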

0

If it is fixed in Python 3, it should also be fixed by using

    from __future__ import unicode_literals

-4