Working with encodings is very confusing.
I believe that if you enter data through the command line, it arrives encoded in whatever your system encoding is, and it is not unicode. (Even copy/paste should do this.)
So it should be correct to decode it into unicode using the system encoding:
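    import codecs
    import sys

    # The raw argument is a byte string in the system encoding
    first_arg = sys.argv[1]
    print first_arg
    print type(first_arg)

    # Decode it to unicode using the system encoding
    first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
    print first_arg_unicode
    print type(first_arg_unicode)

    # Open the named file and decode its contents from UTF-8
    f = codecs.open(first_arg_unicode, 'r', 'utf-8')
    unicode_text = f.read()
    print type(unicode_text)
    print unicode_text.encode(sys.getfilesystemencoding())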
Running the following produces the output below:

    Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"
    PC・ソフト申請書08.09.24.txt
    <type 'str'>
    <type 'unicode'>
    PC・ソフト申請書08.09.24.txt
    <type 'unicode'>
    ?日本語
Here "PC・ソフト申請書08.09.24.txt" contains the text "日本語". (I encoded the file as UTF-8 using Windows Notepad, and I'm a little stumped as to why a "?" appears at the beginning when printing. Is it something to do with how Notepad saves UTF-8?)
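A likely explanation for the stray "?", offered as an assumption rather than something checked against the original file: Notepad of that era writes a byte order mark (BOM, U+FEFF) at the start of files it saves as UTF-8. The plain 'utf-8' codec keeps that character in the decoded text, and it probably cannot be represented in the console encoding, so it shows up as "?". The sketch below builds such bytes by hand and compares 'utf-8' with 'utf-8-sig', which strips a leading BOM if one is present.

```python
# -*- coding: utf-8 -*-
import codecs

# Bytes as Notepad would likely have saved them: a UTF-8 BOM, then the text.
# (Assumption: constructed by hand here, not read from the original file.)
raw = codecs.BOM_UTF8 + u"日本語".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # a leading BOM is stripped

print(with_bom[0] == u"\ufeff")        # True: first character is the BOM
print(len(with_bom))                   # 4
print(len(without_bom))                # 3
```

If that is what happened, passing 'utf-8-sig' instead of 'utf-8' to codecs.open should make the "?" go away.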
The string decode method or the built-in unicode() function can be used to convert from an encoding to unicode.
    unicode_str = utf8_str.decode('utf8')
    unicode_str = unicode(utf8_str, 'utf8')
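As a small illustration of that conversion, here is a sketch that decodes UTF-8 bytes into unicode text and encodes them back. The variable names are made up for the example, and a b"..." literal is used so the input is a byte string on Python 3 as well (where the decoded type is called str rather than unicode).

```python
# UTF-8 bytes for the text 日本語: three characters, three bytes each.
utf8_bytes = b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"

text = utf8_bytes.decode("utf8")   # bytes -> unicode text
round_trip = text.encode("utf8")   # unicode text -> bytes again

print(len(utf8_bytes))             # 9 (bytes)
print(len(text))                   # 3 (characters)
print(round_trip == utf8_bytes)    # True
```

The length difference is a quick way to tell whether you are holding encoded bytes or decoded text.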
In addition, if you work with encoded files, you can use the codecs.open() function instead of the built-in open(). It lets you specify the encoding of the file and then uses that encoding to transparently decode the content to unicode.
So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be in unicode.
codecs.open: http://docs.python.org/library/codecs.html?#codecs.open
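To see codecs.open doing the decoding, here is a minimal sketch under a few assumptions: the file is a temporary one created just for the example, and its contents are the UTF-8 bytes for 日本語 written by hand.

```python
import codecs
import os
import tempfile

# Hypothetical file, created only for this example.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "myfile.txt")

# Write raw UTF-8 bytes, much as an editor would have saved them.
with open(path, "wb") as f:
    f.write(b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e")

# codecs.open decodes transparently while reading.
with codecs.open(path, "r", "utf8") as f:
    content = f.read()

print(len(content))  # 3 characters, not 9 bytes

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

On Python 3, the built-in open() accepts an encoding argument and does the same job.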
If I'm missing or misunderstanding something, please let me know.
If you haven't already, I recommend reading Joel's article on Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html