Why are Chinese characters garbled when using webpy, but fine when using MySQLdb directly?

Question

I created a database in MySQL and use webpy to build my web server.

But webpy and MySQLdb behave very differently with Chinese characters when each is used to access the database.

Here is my problem:

My table is t_test (in a utf8 database):

  id   name
  1    测试

The utf8 bytes for "测试" are: \xe6\xb5\x8b\xe8\xaf\x95

When I use MySQLdb to do a "select" as follows:

  c = conn.cursor()
  c.execute("SELECT * FROM t_test")
  items = c.fetchall()
  c.close()
  eval_items = items[0]  # first row (assumed; the original snippet never defined eval_items)
  print "items=%s, name=%s" % (eval_items, eval_items[1])

the result is normal, it prints:

  items=(127L, '\xe6\xb5\x8b\xe8\xaf\x95'), name=测试

But when I use webpy, I do the same:

  db = web.database(dbn='mysql', host="127.0.0.1", user='test', pw='test', db='db_test', charset="utf8")
  eval_items = db.select('t_test')
  comment = eval_items[0].name
  print "comment code=%s" % repr(comment)
  print "comment=%s" % comment.encode("utf8")

The Chinese comes out garbled. It prints:

  comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'
  comment=忙碌鈥姑€

I know that webpy's database module also depends on MySQLdb, yet the two approaches give very different results. Why?

BTW, because of this I can use MySQLdb directly to get around the problem with Chinese characters, but then I lose the column names of the table, which is very inconvenient. How can I solve this problem while still using webpy?

+6

python web.py mysql-python

eason Nov 07 '12 at 10:45

1 answer

jsbueno · Answer 1 · 2012-11-07T21:43:12+0000

In fact, something very wrong is happening. As you said in your comment, the utf-8 bytes for "测试" are E6B5 8BE8 AF95, which displays correctly on my utf-8 terminal here:

 >>> d
 '\xe6\xb5\x8b\xe8\xaf\x95'
 >>> print d
 测试

But look at the bytes in the unicode "comment" object:

 comment code=u'\xe6\xb5\u2039\xe8\xaf\u2022'

Part of the content is the raw utf-8 bytes of the comment (the characters shown as "\xYY") and part is encoded as Unicode code points (the characters shown as "\uYYYY") - this indicates serious garbage.
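To make that mix visible, here is a minimal sketch that lists each character of the garbled value with its code point. It is written for Python 3 (the snippets in this thread are Python 2), and the `garbled` literal is copied from the `repr(comment)` output in the question; the variable names are only illustrative.

```python
import unicodedata

# Value reported by repr(comment) in the question.
garbled = "\xe6\xb5\u2039\xe8\xaf\u2022"

for ch in garbled:
    # Characters below U+0100 look like raw bytes; the others are real code points.
    kind = "byte-like (\\xYY)" if ord(ch) < 0x100 else "code point (\\uYYYY)"
    print("U+%04X  %-20s  %s" % (ord(ch), kind, unicodedata.name(ch, "?")))

# The two characters that cannot be raw bytes.
high = [ch for ch in garbled if ord(ch) >= 0x100]
```

Only two of the six characters fall outside the byte range, and both sit exactly where the original utf-8 sequence has the bytes 0x8b and 0x95.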

MySQL has some quirks when it comes to correctly decoding (utf-8 or otherwise) encoded text; one way to deal with them is to pass the correct "charset" parameter to the connection. But you did that already.

One thing you can try is to open the connection with the use_unicode=False option and decode the utf-8 strings in your own code.

 db = web.database(dbn='mysql', host="127.0.0.1", user='test', pw='test', db='db_test', charset="utf8", use_unicode=False)
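If the connection then really does hand back raw byte strings, the decoding step could look like the sketch below. It is written for Python 3, and both the `decode_row` helper and the sample row are hypothetical: they stand in for whatever row object the database layer actually returns.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of a row (column name -> value) with byte strings decoded."""
    return {
        column: value.decode(encoding) if isinstance(value, bytes) else value
        for column, value in row.items()
    }


# Hypothetical row, shaped like the t_test record from the question.
raw_row = {"id": 127, "name": b"\xe6\xb5\x8b\xe8\xaf\x95"}
row = decode_row(raw_row)
print(row["name"])
```

Non-byte values such as the integer id pass through untouched, so the column names are preserved.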

See the MySQLdb documentation for this and other connection parameters you can try:

http://mysql-python.sourceforge.net/MySQLdb.html

Regardless of whether the tips above make it work correctly, I have a workaround for you - it looks like the Unicode characters (as opposed to the raw utf-8 bytes inside the unicode object) in your garbled string come from decoding with one of these encodings: ("cp1258", "cp1252", "palmos", "cp1254")

Of these, cp1252 is almost the same as "latin1", which is the default character set MySQL uses if it does not receive the "charset" argument in the connection. But the problem is not only that web2py fails to pass it to the database, since you get mangled characters and not just a wrong encoding - it is as if web2py encoded and decoded your string back and forth, ignoring encoding errors.

With any of these encodings, I could recover the original string "测试" as a utf-8 byte string, for example:

 comment = comment.encode("cp1252", errors="ignore")
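For readers on Python 3, the same round trip can be sketched as follows. In Python 2 the line above already yields a utf-8 byte string; in Python 3 the bytes need one more decode to become text. The `garbled` literal is the value from the question, and the other names are only illustrative.

```python
# The wrongly decoded value from the question.
garbled = "\xe6\xb5\u2039\xe8\xaf\u2022"

# Step 1: undo the wrong cp1252 decoding to get the original bytes back.
raw = garbled.encode("cp1252")

# Step 2: decode those bytes with the encoding they were really written in.
repaired = raw.decode("utf-8")

print(repaired)
```

Note that errors="ignore" silently drops any character cp1252 cannot represent, so leaving it out (as in this sketch) makes a failed repair raise an exception instead of returning truncated text.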

So, adding this line may work for you for now, but guessing around with unicode is never good - the proper thing to do is to narrow down what web2py is doing to hand you these half-decoded utf-8 strings in the first place, and make it stop.

Update

I checked here, and this is what happens: the correct utf-8 string '\xe6\xb5\x8b\xe8\xaf\x95' is read from MySQL, and before it is delivered to you (in the use_unicode=True case) these bytes are decoded as if they were "cp1252", which gives the wrong unicode u'\xe6\xb5\u2039\xe8\xaf\u2022'. This is probably a web2py bug; for example, it may not be passing your charset="utf8" parameter to the actual connection. When you set use_unicode=False, instead of giving you the raw bytes, it seems to take the wrong unicode and encode it using "utf-8", which gives the '\xc3\xa6\xc2\xb5\xe2\x80\xb9\xc3\xa8\xc2\xaf\xe2\x80\xa2' sequence that you commented below (which is even more wrong).
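The chain described above can be reproduced without a database. This is a minimal Python 3 sketch of the encoding steps only; it does not show what the driver or the web framework actually executes, and the variable names are illustrative.

```python
# What is stored in MySQL: the correct utf-8 bytes.
correct_bytes = "测试".encode("utf-8")

# Decoding those bytes as cp1252 gives the wrong unicode from the question.
wrong_unicode = correct_bytes.decode("cp1252")

# Encoding the wrong unicode as utf-8 gives the doubly encoded byte sequence.
double_encoded = wrong_unicode.encode("utf-8")

print(ascii(wrong_unicode))
print(double_encoded)
```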

In general, the workaround I mentioned above seems to be the only way to get the original correct string back, i.e. given the wrong unicode, do u'\xe6\xb5\u2039\xe8\xaf\u2022'.encode("cp1252", errors="ignore") - that, or do something else to configure the connection to the database (or perhaps update web2py or the MySQL drivers).

Update 2

I further checked the code in web2py's dal.py file. It tries to set up the connection as utf-8 by default, but it looks like it will try both the MySQLdb and pymysql drivers. If you have both installed, try removing pymysql and leaving only MySQLdb.