Python decoding problem with Chinese characters

I am using Python 3.5 and trying to take a block of byte text that may or may not contain special Chinese characters and write it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and always appear in addition to the English spelling of the name. The text is JSON-formatted and needs to be decoded before I can load it. The decoding seems to go fine and gives me no errors. When I try to write the decoded text to a file, I get the following error message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to <undefined>

Here is an example of the source data that I get before I do anything with it:

b' "isBulkRecipient": "false",\r\n "name": "Name in, English \xef' b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n 

Here is the code I'm using:

 recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
 recipientName = recipientData['signers'][0]['name']
 pprint(recipientName)
 with open('envelope recipient list.csv', 'a', newline='') as fp:
     a = csv.writer(fp, delimiter=',')
     csvData = [[recipientName]]
     a.writerows(csvData)

recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice would be greatly appreciated!

Update:

I made manual workarounds for each record that broke, and found other entries that did not contain Chinese characters but did contain special characters from other languages, and these broke the program as well. The special characters appear only in the name field. So a name may be something like "Ałex", a mixture of ordinary and special characters. Before I decode the string containing this information, I can print it to the screen, and it looks like this: b'name": "A\xc5ex",\r\n

But after I decode it as utf-8, I get an error if I try to output it. Error message: UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 2: character maps to <undefined>

I looked up what it was, and it is the special character (ł).

3 answers

Warning: shotgun solution ahead

Assuming you just want to get rid of all foreign characters in the whole file (i.e. they are not important for your later processing of the other fields), you can simply ignore all non-ASCII characters by replacing

 recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))

with

 recipientData = json.loads(recipientContent.decode('ascii', 'ignore'))

This way, all non-ASCII characters are removed before any further processing.

I called this a shotgun solution because it may not work correctly under certain circumstances:

  • Obviously, if the non-ASCII characters are needed later on.
  • If the bytes b'\\' or b'"' appear as part of another character, for example a UTF-16 character.
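As a rough illustration of what the 'ignore' handler does with the ASCII codec, here is a minimal sketch. The sample string is made up for the example and is not the asker's real data:

```python
# Hypothetical sample: the UTF-8 bytes of a JSON fragment whose name
# contains one non-ASCII letter (U+0142).
raw = '"name": "A\u0142ex"'.encode("utf-8")

# Decoding as ASCII with errors="ignore" silently drops every byte that
# is not valid ASCII, so the letter disappears instead of raising an error.
stripped = raw.decode("ascii", "ignore")

print(stripped)  # "name": "Aex"
```

Note that the name is silently altered rather than transliterated, which is why this only suits cases where the dropped characters really do not matter.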
0
source

The error you get occurs when you write to the file.

In Python 3.x, when you open() a file in text mode (the default) without specifying encoding=, Python uses the preferred encoding for your locale or language settings.

If you are working on Windows, this will use the charmap codec to map to the encoding for your language.

Although you could just write the bytes directly to a file, you are doing the right thing by decoding them first. As others have said, you should really decode using the encoding specified by the web server. You could also use the Python Requests module, which does this for you. (Your example does not decode as UTF-8, so I assume that your example is incorrect.)

To solve your immediate error, just pass an encoding to open() that supports the characters you have in your data. UTF-8 is the obvious choice. Therefore, you should change your code as follows:
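To see which encoding applies on a given machine, something along these lines could be used. It is only a sketch: cp1252 is assumed here as a typical Windows default, and the asker's actual code page is not stated in the question.

```python
import locale

# The encoding that open() falls back to in text mode when encoding=
# is omitted (this was the behaviour in Python 3.5).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Assumption: cp1252 stands in for the asker's Windows code page.
# It has no mapping for U+0142, so encoding fails the same way the
# file write did.
try:
    "A\u0142ex".encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```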

 with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp: 
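Put together, a self-contained version of the write step might look like the sketch below. The name value and the temporary directory are assumptions made for the example; in the asker's code the name comes from the decoded JSON.

```python
import csv
import os
import tempfile

# Hypothetical sample value standing in for recipientData['signers'][0]['name'].
recipient_name = "Name in, English \u59d3\u540d"

# Write into a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "envelope recipient list.csv")

# encoding='utf-8' lets the file hold any Unicode character;
# newline='' is what the csv module expects for files it writes.
with open(path, "a", encoding="utf-8", newline="") as fp:
    writer = csv.writer(fp, delimiter=",")
    writer.writerows([[recipient_name]])

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8", newline="") as fp:
    rows = list(csv.reader(fp))

print(rows == [[recipient_name]])
```

Whatever reads the CSV afterwards has to open it as UTF-8 as well, otherwise the non-ASCII characters will show up garbled.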
0
source

Add this line to your code:

 from __future__ import unicode_literals 
0
source