"" ... GB18030. I thought this would be a solution because it read the first few files and decoded them perfectly. "" "- explain what you mean. For me there are two criteria for successful decoding: firstly, this raw_bytes.decode (" some_encoding ") did not work, and secondly, the resulting unicode when displayed makes sense on a specific language. Each file in the Universe will pass the first test when decoding with latin1 aka iso_8859_1 . Many files in East Asian languages ββpass the first test with gb18030 , because mostly used characters in Chinese, Japanese and Korean are encoded using the same blocks double byte sequences How many second tests did you do?
Do not look at the data in an IDE or text editor. Look at it in a web browser; browsers are usually better at detecting encodings.
How do you know that it is a euro symbol? By looking at the screen of a text editor that is decoding the raw bytes using what encoding? cp1252?
How do you know that it contains Chinese characters? Are you sure it is not Japanese? Or Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, possibly Macau, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take a sample of the file and feed it to chardet; chardet usually does a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean. However, if someone has been hand-editing the file in a text editor using a single-byte encoding, a few invalid characters may prevent the encoding used for the other 99.9% of the characters from being detected.
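If you have not used chardet before, a minimal sketch (chardet must be installed separately; the filename is a placeholder):

    import chardet

    raw_bytes = open('some_file.txt', 'rb').read()  # hypothetical filename
    print chardet.detect(raw_bytes)
    # e.g. {'encoding': 'GB2312', 'confidence': 0.99}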
You might like to do print repr(line) on, say, 5 lines from the file and edit the output into your question.
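Something like this sketch (placeholder filename):

    f = open('some_file.txt', 'rb')  # hypothetical filename
    for i, line in enumerate(f):
        if i >= 5:
            break
        print repr(line)  # shows the raw bytes, unmangled by any editor
    f.close()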
If the file is not confidential, you can make it available for download.
Was the file created on Windows? How do you read it in Python? (show code)
Update after OP comments:
Notepad etc. do not attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the euro symbol is the raw byte \x80 decoded by your editor using the default encoding for your environment -- the usual suspect is cp1252. Do not use such an editor to edit the file.
Earlier you were talking about "the first few errors". Now you say you have 5 errors in total. Please explain.
If the file is indeed almost-correct gb18030, you should be able to decode it line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going, as in the sketch below. I am very interested in whether the \x80 appears as the first or the second of the two bytes. If it does not appear at all, the "euro symbol" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
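A minimal sketch of that loop (placeholder filename); note that UnicodeDecodeError carries the offsets directly as e.start and e.end, so there is no need to parse the message text:

    f = open('some_file.txt', 'rb')  # hypothetical filename
    for lino, line in enumerate(f, 1):
        try:
            line.decode('gb18030')
        except UnicodeDecodeError as e:
            print 'line %d: %s' % (lino, e)
            # the two bytes at the reported offset
            print 'offending bytes:', repr(line[e.start:e.start + 2])
    f.close()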
It is a good idea to find out what the problem is before trying to fix it. Trying to fix it by bashing the file about with Notepad etc. in "ANSI" mode is not a good idea.
You should look hard at how you decided that the gb18030 decoding results made sense. In particular, I would closely scrutinise the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-ASCII characters that are not Chinese ...
Here is a better way of inspecting for damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors with result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
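Sketched out, with placeholder filenames:

    raw_bytes = open('some_file.txt', 'rb').read()  # hypothetical filename
    result = raw_bytes.decode('gb18030', 'replace')
    print 'error count:', result.count(u'\ufffd')
    out = open('some_file.utf8.txt', 'wb')          # hypothetical filename
    out.write(result.encode('utf8'))
    out.close()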
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
Update after additional information
All those \\ are confusing. It appears that "getting the bytes" involved repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, either type the bare name bytes (you will get an implicit repr()), or print repr(bytes) (which won't get an implicit repr())
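For example (bytes_ is just an illustrative name, to avoid shadowing a builtin):

    >>> bytes_ = '\xcb\xbe\x80\x80'
    >>> print repr(bytes_)        # one level of repr(): readable hex escapes
    '\xcb\xbe\x80\x80'
    >>> print repr(repr(bytes_))  # two levels: the doubled backslashes you posted
    "'\\xcb\\xbe\\x80\\x80'"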
The blank space: I presume you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnamed viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software, unsurprisingly, does not have a glyph for U+E28B in the font it is using.
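You can check the PUA membership yourself:

    >>> ch = '\xf8\xf8'.decode('gb18030')
    >>> ch
    u'\ue28b'
    >>> 0xE000 <= ord(ch) <= 0xF8FF  # the BMP Private Use Area
    True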
Perhaps the source of the files deliberately uses the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudo-secret information. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
"Euro": you said "" Because of the data, I cannot split the entire string, but what I called the euro char is in: \ xcb \ xbe \ x80 \ x80 "[I Assuming a \ was omitted from at the very beginning, but " is a literal]. The" Euro symbol ", when it appears, is always in the same column that I do not need, so I was hoping to just use ignore. Unfortunately, since the "euro char" is next to the quotation marks in the file, sometimes "ignore" also gets rid of the euro character, as well as [like] quotation marks, which creates a problem for the csv module to determine the columns ","
It would help a lot if you could show patterns of where these \x80 bytes appear in relation to quotes and Chinese characters -- keep it readable by just showing the hex, and hide your confidential data, e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
    C1 C2 C1 C2 cb be 80 80 22
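A throwaway helper (my naming, not a library function) that produces that kind of hex view, so you can mask the Chinese pairs by hand before posting:

    >>> def hexdump(s):
    ...     return ' '.join('%02x' % ord(c) for c in s)
    ...
    >>> print hexdump('\xcb\xbe\x80\x80\x22')
    cb be 80 80 22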
Please supply examples of cases (1) where the " is not lost by "replace" or "ignore", (2) where the quote is lost. In your sole example to date, the " is not lost:
    >>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
    u'\u53f8"'
And the offer to send you the debugging code (see the sample output below) is still open.
    >>> import decode_debug as de
    >>> def logger(s):
    ...     sys.stderr.write('*** ' + s + '\n')
    ...
    >>> import sys
    >>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
    *** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
    *** input[3:5] ('\x80"') doesn't start with a plausible code sequence
    u'\u53f8\ufffd\ufffd"'
    >>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
    *** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
    *** input[3:5] ('\x80"') doesn't start with a plausible code sequence
    u'\u53f8"'
    >>>
Eureka: -- Probable cause of sometimes losing the quote character --
There seems to be a bug in the gb18030 decoder's replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected, the decoder should attempt to resynchronise with the NEXT byte. However, it seems to ignore both the \x80 AND the following byte:
    >>> '\x80abcd'.decode('gb18030', 'replace')
    u'\ufffdbcd'
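Until that is fixed, one workaround is to resynchronise manually: on each error, emit U+FFFD, skip exactly ONE byte, and retry. A sketch of the idea (my function name, not a stdlib API), not a drop-in fix:

    def decode_with_resync(raw, encoding='gb18030'):
        # Decode, replacing each bad byte with U+FFFD and resuming at
        # the very next byte instead of swallowing it.
        out, pos = [], 0
        while pos < len(raw):
            try:
                out.append(raw[pos:].decode(encoding))
                break
            except UnicodeDecodeError as e:
                # bytes before e.start are valid; decode and keep them
                out.append(raw[pos:pos + e.start].decode(encoding))
                out.append(u'\ufffd')
                pos += e.start + 1  # skip ONE byte only, then retry
        return u''.join(out)

    print repr(decode_with_resync('\x80abcd'))  # -> u'\ufffdabcd': the 'a' survives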