How can I extract non-ASCII character strings using OptParse?

Question

I use the optparse module to read a string value from the command line. optparse only returns str strings, not unicode.

So let's say I start my script with:

 ./someScript --some-option ééééé

French characters such as 'é' arrive as str, and a UnicodeDecodeError is raised when the value is read in the code:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

I experimented with the built-in unicode function, but I either get an error or the character disappears:

 >>> unicode('é');
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
 >>> unicode('é', errors='ignore');
 u''

Is there a way to get unicode / UTF-8 strings out of optparse?

UPDATE

It seems that the string can be retrieved and printed fine, but when I then use it with sqlite (through the APSW module), it is somehow converted to unicode inside cursor.execute("..."), and that is where the error occurs.

Here is an example of a program that causes an error:

 #!/usr/bin/python
 # coding: utf-8
 import os, sys, optparse
 parser = optparse.OptionParser()
 parser.add_option("--some-option")
 (opts, args) = parser.parse_args()
 print unicode(opts.some_option)

python unicode ascii

user610650 Oct 29 '12 at 12:48

4 answers

You can decode the arguments before the parser processes them. Example:

 #!/usr/bin/python
 # coding: utf-8
 import os, sys, optparse
 parser = optparse.OptionParser()
 parser.add_option("--some-option")
 # Decode the command line arguments to unicode
 for i, a in enumerate(sys.argv):
     sys.argv[i] = a.decode('ISO-8859-15')
 (opts, args) = parser.parse_args()
 print type(opts.some_option), opts.some_option

This gives the following result:

 C:\workspace>python file.py --some-option préférer
 <type 'unicode'> préférer

I chose the ISO/IEC 8859-15 encoding, which seemed the most suitable for your case. Adapt it if necessary.
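The right codec depends on the terminal that produced the bytes: the 0xc3 byte in the question's traceback suggests UTF-8 input rather than ISO-8859-15. The following is only a sketch of the same decoding step as a reusable function. The name `decode_args` and its `encoding` parameter are hypothetical, and the byte strings stand in for what Python 2 places in `sys.argv`:

```python
def decode_args(raw_args, encoding):
    """Return the arguments as text, decoding any byte strings with `encoding`."""
    decoded = []
    for arg in raw_args:
        # In Python 2, `bytes` is an alias for `str`, so this matches argv entries.
        if isinstance(arg, bytes):
            arg = arg.decode(encoding)
        decoded.append(arg)
    return decoded


# Hypothetical argv as a UTF-8 terminal would deliver it ("préférer" in UTF-8).
utf8_argv = [b"file.py", b"--some-option", b"pr\xc3\xa9f\xc3\xa9rer"]

right = decode_args(utf8_argv, "utf-8")        # matching codec
wrong = decode_args(utf8_argv, "iso-8859-15")  # mismatched codec, no error raised
```

Decoding UTF-8 bytes with ISO-8859-15 does not fail; it silently yields two wrong characters per accented letter, so a mismatch is easy to miss.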

jro Oct 29 '12 at 13:16

I believe your error is related to the following:

For example, to write Unicode literals including the Euro currency symbol, the ISO-8859-15 encoding can be used, with the Euro symbol having the ordinal value 164. This script will print the value 8364 (the Unicode code point corresponding to the Euro symbol) and then exit:

 # -*- coding: iso-8859-15 -*-
 currency = u"€"
 print ord(currency)
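The point of the quoted passage is that the same byte maps to different characters depending on the codec. A small sketch of that, using an explicit byte string instead of a source-encoding declaration (the variable names are illustrative only):

```python
euro_byte = b"\xa4"  # byte value 164

# ISO-8859-15 maps byte 164 to the Euro sign (U+20AC).
as_latin9 = euro_byte.decode("iso-8859-15")
# ISO-8859-1 maps the same byte to the generic currency sign (U+00A4).
as_latin1 = euro_byte.decode("iso-8859-1")

euro_codepoint = ord(as_latin9)
currency_codepoint = ord(as_latin1)
```

Here `euro_codepoint` is 8364 and `currency_codepoint` is 164, which is why the encoding has to be named explicitly whenever bytes are turned into text.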

Woot4moo Oct 29 '12 at 12:56

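 #!/usr/bin/python
 # coding: utf-8
 import os, sys, optparse
 # Python 2 only: make utf-8 the default codec for implicit str/unicode conversions
 reload(sys)
 sys.setdefaultencoding('utf-8')
 parser = optparse.OptionParser()
 parser.add_option(u"--some-option")
 (opts, args) = parser.parse_args()
 # print_help() is a method of the parser, not of the parsed options
 parser.print_help()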

lionyue Oct 29 '14 at 8:15

Mark tolonen · Accepted Answer · 2012-10-30T12:12:51+0000

The input is returned in the console encoding, so based on your updated example, use:

 print opts.some_option.decode(sys.stdin.encoding)

unicode(opts.some_option) uses ascii as the default encoding, which is why it fails on non-ASCII input.
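As a rough sketch of the difference, the snippet below decodes the same bytes with ascii and with an explicit encoding. The helper `to_text` is a hypothetical name, not part of optparse, and the fallback chain is an assumption: `sys.stdin.encoding` may be None (for example when input is piped in Python 2), in which case the locale's preferred encoding is tried, then UTF-8.

```python
import locale
import sys


def to_text(raw, encoding=None):
    """Decode a byte string from the command line; pass text through unchanged."""
    if not isinstance(raw, bytes):
        return raw
    if encoding is None:
        # Assumed fallback order: console encoding, locale encoding, UTF-8.
        encoding = (getattr(sys.stdin, "encoding", None)
                    or locale.getpreferredencoding()
                    or "utf-8")
    return raw.decode(encoding)


raw_option = b"\xc3\xa9\xc3\xa9"  # two 'é' characters as UTF-8 bytes

# The ascii codec rejects the first byte (0xc3), as in the question's traceback.
try:
    raw_option.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors="ignore" drops every undecodable byte, leaving an empty string.
ignored = raw_option.decode("ascii", "ignore")

# An explicit, matching encoding recovers the text.
decoded = to_text(raw_option, "utf-8")
```

The resulting unicode value is what should then be handed on to other libraries, rather than the raw byte string.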
