Reading Unicode characters from command line arguments in Python 2.x on Windows

Question

I want my Python script to be able to read Unicode command line arguments on Windows. But it appears that sys.argv holds strings encoded in some local encoding rather than Unicode. How can I read the command line in full Unicode?

Code example: argv.py

    import sys

    first_arg = sys.argv[1]
    print first_arg
    print type(first_arg)
    print first_arg.encode("hex")
    print open(first_arg)

On my PC configured for the Japanese codepage, I get:

    C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
    PC・ソフト申請書08.09.24.doc
    <type 'str'>
    50438145835c83748367905c90bf8f9130382e30392e32342e646f63
    <open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

This is Shift-JIS encoded, I believe, and it "works" for this file name. But it breaks for file names containing characters that are not in the Shift-JIS character set; the final "open" call fails:

    C:\temp>argv.py Jörgen.txt
    Jorgen.txt
    <type 'str'>
    4a6f7267656e2e747874
    Traceback (most recent call last):
      File "C:\temp\argv.py", line 7, in <module>
        print open(first_arg)
    IOError: [Errno 2] No such file or directory: 'Jorgen.txt'

Note: I am talking about Python 2.x, not Python 3.0. I have found that Python 3.0 gives sys.argv as proper Unicode. But it is a bit early yet for me to move to Python 3.0 (due to the lack of third-party library support).

Update:

Several answers have said that I should decode sys.argv according to whatever encoding it is in. The problem with that is that the encoding is not full Unicode, so some characters cannot be represented.
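To make the limitation concrete, here is a minimal sketch. It assumes that Python's "cp932" codec is a fair stand-in for the Japanese code page on the machine described above, and the helper name survives_code_page is made up for illustration. It checks whether a file name can make the round trip through the code page at all. Note that in the transcript above Windows substituted a plain "o" for "ö", whereas Python's own codec either raises an error or, with "replace", substitutes "?"; either way the original character is gone before any decoding can happen.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Assumption: "cp932" approximates the Japanese system code page.
CODE_PAGE = "cp932"


def survives_code_page(name, code_page=CODE_PAGE):
    """Return True if `name` encodes to `code_page` and decodes back unchanged."""
    try:
        return name.encode(code_page).decode(code_page) == name
    except UnicodeEncodeError:
        # At least one character has no representation in the code page.
        return False


japanese_name = u"PC・ソフト申請書08.09.24.doc"
swedish_name = u"J\u00f6rgen.txt"  # "Jörgen.txt"

print(survives_code_page(japanese_name))
print(survives_code_page(swedish_name))

# With errors="replace" the unrepresentable character is lost for good.
print(repr(swedish_name.encode(CODE_PAGE, "replace")))
```

No choice of codec on the receiving side can recover a character that was already replaced when the byte string was built.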

Here is the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some that are not in the system default code page. My Python script does not receive the right Unicode file names through sys.argv whenever the characters cannot be represented in the current code page encoding.

Surely there is some Windows API for reading the command line in full Unicode (and Python 3.0 does this). I assume the Python 2.x interpreter is not using it.

+26

python command-line windows unicode python-2.x

Craig McQueen May 11, '09 at 5:44

4 answers

Dealing with encodings is very confusing.

I believe that if you input data via the command line, the data will be encoded in whatever your system encoding is, and is not unicode. (Even copy/paste should do this.)

So it should be correct to decode it into unicode using the system encoding:

    import codecs
    import sys

    first_arg = sys.argv[1]
    print first_arg
    print type(first_arg)

    # Decode the byte string using the system (file system) encoding.
    first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
    print first_arg_unicode
    print type(first_arg_unicode)

    # Open the named file, decoding its contents as UTF-8.
    f = codecs.open(first_arg_unicode, 'r', 'utf-8')
    unicode_text = f.read()
    print type(unicode_text)
    print unicode_text.encode(sys.getfilesystemencoding())

Running the following:

    Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

will output:

    PC・ソフト申請書08.09.24.txt
    <type 'str'>
    <type 'unicode'>
    PC・ソフト申請書08.09.24.txt
    <type 'unicode'>
    ?日本語

Here "PC・ソフト申請書08.09.24.txt" contains the text "日本語". (I encoded the file as UTF-8 using Windows Notepad. I am a little stumped as to why there is a "?" at the beginning of the printed text. Is it something to do with how Notepad saves UTF-8?)
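The stray "?" is most likely the byte order mark (BOM) that Notepad writes at the start of UTF-8 files: the "utf-8" codec keeps it as the character U+FEFF, which then cannot be encoded for the console. A minimal sketch, assuming that is the cause, compares the "utf-8" and "utf-8-sig" codecs on a temporary file written with a BOM. The helper name read_text is made up for illustration.

```python
from __future__ import print_function

import codecs
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file through codecs.open using the given encoding."""
    f = codecs.open(path, "r", encoding)
    try:
        return f.read()
    finally:
        f.close()


# Simulate a Notepad-style file: UTF-8 BOM followed by the UTF-8 text.
text = u"\u65e5\u672c\u8a9e"  # "日本語"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as out:
    out.write(codecs.BOM_UTF8 + text.encode("utf-8"))

with_bom = read_text(path, "utf-8")         # keeps U+FEFF as the first character
without_bom = read_text(path, "utf-8-sig")  # strips a leading BOM if present
os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

The "utf-8-sig" codec also reads files that have no BOM, so it is a reasonable default when the files may come from Notepad.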

The string decode() method or the built-in unicode() can be used to convert an encoded byte string into unicode.

    unicode_str = utf8_str.decode('utf8')
    unicode_str = unicode(utf8_str, 'utf8')

In addition, if you are working with encoded files, you may want to use the codecs.open() function instead of the built-in open(). It lets you specify the encoding of the file, and it then uses that encoding to transparently decode the content into unicode.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be in unicode.

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

If I am misunderstanding something, please let me know.

If you have not already, I recommend reading Joel's article on Unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

+10

monkut May 11 '09 at 7:45 a.m.

Try the following:

    import sys
    print repr(sys.argv[1].decode('UTF-8'))

You may need to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
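As a rough illustration of why the codec name matters, the sketch below decodes the same hypothetical bytes with three codecs. The byte string and the helper try_decode are made up for the example; they are not taken from a real command line.

```python
from __future__ import print_function

# Hypothetical argument bytes: "Jörgen.txt" as a CP1252 console would pass it.
raw = b"J\xf6rgen.txt"


def try_decode(data, encoding):
    """Decode `data` with `encoding`, returning None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_cp1252 = try_decode(raw, "cp1252")  # the intended text
as_cp437 = try_decode(raw, "cp437")    # decodes, but to the wrong character
as_utf8 = try_decode(raw, "utf-8")     # 0xF6 is not valid UTF-8 here

print(repr(as_cp1252))
print(repr(as_cp437))
print(repr(as_utf8))
```

A wrong single-byte codec fails silently by producing a different character, so it is worth confirming the code page rather than guessing. Even the right codec cannot restore characters that were outside the code page to begin with.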

+2

pts May 11 '09 at 5:58 a.m.

The command line might be in a Windows encoding. Try decoding the arguments into unicode objects:

 args = [unicode(x, "iso-8859-9") for x in sys.argv]

0

a paid nerd May 11 '09 at 5:57 a.m.

Craig McQueen · Accepted Answer · 2009-05-11 06:21

Here is a solution that is just what I am looking for, making calls to the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState): http://code.activestate.com/recipes/572200/

But I made a few changes to make it easier to use and better handle certain uses. Here is what I use:

win32_unicode_argv.py

    """
    win32_unicode_argv.py

    Importing this will replace sys.argv with a full Unicode form.
    Windows only.

    From this site, with adaptations:
        http://code.activestate.com/recipes/572200/

    Usage: simply import this module into a script. sys.argv is changed to
    be a list of Unicode strings.
    """

    import sys

    def win32_unicode_argv():
        """Uses kernel32.GetCommandLineW and shell32.CommandLineToArgvW to get
        sys.argv as a list of Unicode strings.

        Versions 2.x of Python don't support Unicode in sys.argv on
        Windows, with the underlying Windows API instead replacing multi-byte
        characters with '?'.
        """

        from ctypes import POINTER, byref, cdll, c_int, windll
        from ctypes.wintypes import LPCWSTR, LPWSTR

        GetCommandLineW = cdll.kernel32.GetCommandLineW
        GetCommandLineW.argtypes = []
        GetCommandLineW.restype = LPCWSTR

        CommandLineToArgvW = windll.shell32.CommandLineToArgvW
        CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
        CommandLineToArgvW.restype = POINTER(LPWSTR)

        cmd = GetCommandLineW()
        argc = c_int(0)
        argv = CommandLineToArgvW(cmd, byref(argc))
        if argc.value > 0:
            # Remove Python executable and commands if present
            start = argc.value - len(sys.argv)
            return [argv[i] for i in
                    xrange(start, argc.value)]

    sys.argv = win32_unicode_argv()

Now, the way I use it is simply:

    import sys
    import win32_unicode_argv

and from now on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
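The line start = argc.value - len(sys.argv) in the module is the least obvious part, so here is a minimal, platform-independent sketch of the same arithmetic. The function name strip_interpreter_args and the sample command line are hypothetical; on Windows the full list would come from CommandLineToArgvW. The idea is that the raw command line also contains the interpreter path and any interpreter options, while sys.argv does not, so only the trailing len(sys.argv) entries are kept.

```python
from __future__ import print_function


def strip_interpreter_args(full_argv, script_argc):
    """Keep only the last `script_argc` entries of the full command line.

    `full_argv` stands for the list CommandLineToArgvW would return and
    `script_argc` for len(sys.argv) as Python computed it.
    """
    start = len(full_argv) - script_argc
    return full_argv[start:]


# Hypothetical command line: interpreter, one interpreter option, script, one argument.
full_argv = [u"C:\\Python26\\python.exe", u"-u", u"argv.py", u"J\u00f6rgen.txt"]
script_argc = 2  # sys.argv would hold the script name and one argument

unicode_argv = strip_interpreter_args(full_argv, script_argc)
print(unicode_argv)
```

This arithmetic assumes that each entry of sys.argv corresponds to exactly one trailing token of the command line, which holds when a script file is run directly. Other ways of launching Python may need different handling.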
