Automatically open the file in the correct encoding

Question


I am running into problems with files that come in several different encodings. We receive files from another company and have to read them (the files are in CSV format).

Strangely, the files appear to be encoded in UTF-16. I can read them, but I have to open them with the codecs module and specify the encoding explicitly, like this:

    import codecs
    import csv

    ENCODING = 'utf-16'
    with codecs.open(test_file, encoding=ENCODING) as csv_file:
        # Autodetect dialect
        dialect = csv.Sniffer().sniff(csv_file.read(1024))
        csv_file.seek(0)
        input_file = csv.reader(csv_file, dialect=dialect)
        for line in input_file:
            do_funny_things()

But just as I can detect the dialect automatically, I think it would be great to be able to open files automatically with their proper encoding, at least for text files. Other programs, such as vim, manage to do this.

Does anyone know a way to do this in Python 2.6?

PS: I hope this will be resolved in Python 3, since all strings are Unicode there...

+6

python

Khelben Feb 26 '10 at 14:34


4 answers

It will not be “fixed” in Python 3, because it is not a fixable problem. Many documents are valid in several encodings, so the only way to determine the correct encoding is to know something about the document. Fortunately, in most cases we do know something about it: for example, most characters will be clustered in particular Unicode blocks. An English document will mostly contain characters from the first 128 code points. A Russian document will mostly contain Cyrillic code points. Most documents will contain spaces and newlines. These hints can be used to make reasonable guesses about which encoding is in use. Better yet, use a library written by someone who has already done this work (like chardet, mentioned in Desintegr's answer).
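As a rough illustration of such hints, here is a minimal sketch, not a substitute for a real detector. It assumes one extra hint that is not mentioned above, a byte order mark (BOM) at the start of the data, and otherwise scores each candidate encoding by how many decoded characters look like ordinary text. The function name `guess_encoding` and the candidate list are hypothetical.

```python
import codecs

# BOMs to look for. The UTF-32 marks come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(data, candidates=('utf-8', 'utf-16', 'cp1251', 'latin-1')):
    """Return a best-guess codec name for a byte string (hypothetical helper)."""
    # Hint 1 (an assumption of this sketch): a BOM names the encoding outright.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # Hint 2: real text is mostly letters, digits, whitespace and punctuation.
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        plausible = sum(1 for ch in text
                        if ch.isalnum() or ch.isspace() or ch in u',;.:"\'-')
        score = plausible / float(len(text) or 1)
        # Strict comparison: on a tie, the earlier candidate wins.
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

A caller would then decode with something like `data.decode(guess_encoding(data))`. Note the weakness: single-byte encodings such as cp1251 and latin-1 both decode almost any input into letters, so this scoring cannot tell them apart and the result depends on candidate order. Telling them apart takes per-language frequency statistics, which is the work a dedicated library does.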

+5

jcdyer Feb 26 '10 at 16:14


csv.reader cannot handle Unicode strings in 2.x. See the bottom of the csv documentation and this question for solutions.

0

Mark Tolonen Feb 26 '10 at 17:22


If it is fixed in Python 3, it should also be fixed by using:

 from __future__ import unicode_literals

-4

Rdv Feb 26 '10 at 15:04


Desintegr · Accepted Answer · 2010-02-26T14:39:06+0000

chardet can help you.

Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source.
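A minimal sketch of how chardet might be used to open a file, assuming the package is installed (for example with `pip install chardet`). The helper name `open_with_detected_encoding`, the sample size, and the UTF-8 fallback are illustrative choices, not part of chardet itself.

```python
import io

try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None


def open_with_detected_encoding(path, sample_size=65536, default='utf-8'):
    """Open path as text using the encoding chardet guesses (hypothetical helper)."""
    encoding = None
    if chardet is not None:
        # Read raw bytes first: detection has to happen before any decoding.
        with open(path, 'rb') as raw_file:
            sample = raw_file.read(sample_size)
        # detect() returns a dict such as {'encoding': ..., 'confidence': ...}.
        encoding = chardet.detect(sample)['encoding']
    # chardet reports None when it cannot decide; fall back to the default.
    return io.open(path, 'r', encoding=encoding or default)
```

The returned object yields Unicode text, so it can be used wherever the `codecs.open` call with a hard-coded encoding was used. The detection result is still a guess; checking the `confidence` value in the returned dict before trusting it may be worthwhile for short or unusual files.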

