Is this the best way to make sure a Python unicode string is encoded in UTF-8?

Given an arbitrary “string” from a library I have no control over, I want to make sure that the “string” is a unicode type and encoded in UTF-8. I would like to know if this is the best way to do this:

    import types

    input = <some value from a lib I don't have control over>
    if isinstance(input, types.StringType):
        input = input.decode("utf-8")
    elif isinstance(input, types.UnicodeType):
        input = input.encode("utf-8").decode("utf-8")

In my actual code, I wrap this in try / except and handle the errors, but I left that part out here.

+6
python unicode
4 answers

A unicode object is not encoded (internally it is, but that should be transparent to you as a Python user). The line input.encode("utf-8").decode("utf-8") doesn't make much sense: you end up with exactly the same sequence of Unicode characters that you started with.

    if isinstance(input, str):
        input = input.decode('utf-8')

is all you need to convert str objects (byte strings) to Unicode strings.
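As a rough sketch of the same idea that runs on both Python 2 and Python 3, the check can be written against `bytes` (an alias of `str` on Python 2). The helper name `to_text` and the default encoding are assumptions made for illustration; they do not come from the question or from any library.

```python
# Sketch only: `to_text` is a hypothetical helper, not a library function.
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value  # already Unicode: there is nothing to decode


raw = b"caf\xc3\xa9"  # UTF-8 bytes for "cafe" with an acute accent on the e
text = to_text(raw)
print(text == u"caf\xe9")  # True

# Encoding and immediately decoding a Unicode string changes nothing:
print(text.encode("utf-8").decode("utf-8") == text)  # True
```

If the bytes are not valid UTF-8, `decode` raises `UnicodeDecodeError`, which is the case the question's try / except would need to handle.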

+5

Just

    try:
        input = unicode(input.encode('utf-8'))
    except ValueError:
        pass

It is always better to seek forgiveness than to ask permission.

+2

I think you have a misunderstanding of Unicode and encodings. Unicode characters are just numbers. Encodings are a representation of numbers. Think of Unicode characters as a concept, like fifteen, and encodings like 15, 1111, F, XV. You must know the encoding (decimal, binary, hexadecimal, Roman numerals) before you can decode the encoding and "know" the Unicode value.
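To make the analogy concrete, here is a small sketch showing one Unicode character (one number) written out as different byte sequences under different encodings. The character and the list of encodings are arbitrary choices for illustration.

```python
ch = u"\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

print(ord(ch))  # 233: the code point, i.e. the "number" itself

# The same code point represented in three encodings.
encodings = ["latin-1", "utf-8", "utf-16-le"]
encoded = {name: ch.encode(name) for name in encodings}
# latin-1   -> b'\xe9'
# utf-8     -> b'\xc3\xa9'
# utf-16-le -> b'\xe9\x00'

# Decoding with the matching encoding recovers the original character.
for name in encodings:
    assert encoded[name].decode(name) == ch

# Decoding with the wrong encoding silently yields different text.
wrong = encoded["utf-8"].decode("latin-1")
print(len(wrong))  # 2: two characters instead of one
```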

Unless you have control over the input string, it's hard to convert it to anything. For example, if the input was read from a file, you would need to know the encoding of the text file in order to decode it meaningfully to Unicode, and then encode it to "UTF-8" for your C++ library.
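The decode-then-encode step described above could look roughly like this. The function name `transcode_to_utf8` is hypothetical, and Latin-1 as the source encoding is purely an assumption for the example; in practice the source encoding has to be known from somewhere.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode bytes from a known encoding, then re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# Stand-in for file contents: "naive" with a diaeresis on the i, in Latin-1.
raw = b"na\xefve"
utf8 = transcode_to_utf8(raw, "latin-1")
print(utf8 == b"na\xc3\xafve")  # True
```

Passing the wrong `source_encoding` either raises `UnicodeDecodeError` or, worse, succeeds and produces the wrong characters, which is why the encoding must be known rather than assumed.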

+2

Are you sure you want UTF-8 encoded strings in a unicode type? Typically, Python stores the characters in types.UnicodeType using UCS-2 or UCS-4, sometimes called "wide" characters, which should be able to hold characters from all reasonably common scripts.

I wonder what kind of lib this is that sometimes outputs types.StringType and sometimes types.UnicodeType. If I had to make a wild guess, the lib always produces types.StringType but does not say what encoding it is in. If so, what you are really looking for is code that can guess which encoding a types.StringType value is in.

In most cases this is easy, because you can assume the data is in one of a few encodings, for example Latin-1 or UTF-8. If the text can really be in any odd encoding (for example, incoming mail without a proper header), you need a library that guesses the encoding. See http://chardet.feedparser.org/ .
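For the easy case, a minimal sketch is to try the candidate encodings in order and keep the first one that decodes cleanly. This is a plain fallback chain and not what an encoding-detection library does; the name `decode_guess` and the candidate list are assumptions for illustration. Latin-1 must come last because every byte sequence decodes under it without error.

```python
def decode_guess(raw, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings fit")


print(decode_guess(b"caf\xc3\xa9")[1])  # utf-8
print(decode_guess(b"caf\xe9")[1])      # latin-1 (not valid UTF-8)
```

Both calls return the same text, `u"caf\xe9"`, but from different byte representations. Note that a successful decode does not prove the guess was right, only that the bytes are valid in that encoding.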

0