Python 2.7: test if characters in a string are all Chinese characters

Question

The following code checks whether all the characters in a string are Chinese characters. It works in Python 3, but not in Python 2.7. How can I do this in Python 2.7?

 for ch in name:
     if ord(ch) < 0x4e00 or ord(ch) > 0x9fff:
         return False

+7

python python-2.7

Sugar tang May 08, '13 at 13:10

2 answers

This works fine for me in Python 2.7, provided name is a unicode value:

 >>> ord(u'\u4e00') < 0x4e00
 False
 >>> ord(u'\u4dff') < 0x4e00
 True

You do not need ord() here; you can compare the character directly with unicode literals:

 >>> u'\u4e00' < u'\u4e00'
 False
 >>> u'\u4dff' < u'\u4e00'
 True

The data from the incoming request has not yet been decoded to Unicode, so you need to do that first. Explicitly set the accept-charset attribute on the form tag to ensure that the browser uses the correct encoding:

 <form accept-charset="utf-8" action="...">

Then decode the data on the server side:

 name = self.request.get('name').decode('utf8')
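To illustrate why the undecoded value fails the range check, here is a minimal sketch. It assumes the request delivers UTF-8 bytes; the variable names are made up for the example. Each CJK character becomes three bytes, and every byte value is below 0x4e00, so a byte-by-byte check rejects the whole string.

```python
# Two CJK characters (U+4E2D, U+6587), written as escapes to stay ASCII-only.
text = u'\u4e2d\u6587'

# Assumption: this is what an undecoded UTF-8 request value looks like.
raw = text.encode('utf-8')

# bytearray yields integer byte values on both Python 2 and Python 3.
byte_values = list(bytearray(raw))
print(len(byte_values))                      # 6 bytes for 2 characters
print(all(b < 0x4e00 for b in byte_values))  # every byte is below 0x4e00

# After decoding, iteration yields whole characters again.
decoded = raw.decode('utf-8')
print(len(decoded))
print(all(0x4e00 <= ord(ch) <= 0x9fff for ch in decoded))
```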

+5

Martijn Pieters May 08 '13 at 13:14

root · Accepted Answer · 2013-05-08T13:32:51+0000

 # byte str (you probably get from GAE)
 In [1]: s = """Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related language varieties, several of which are not mutually intelligible,"""

 # unicode str
 In [2]: us = u"""Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related language varieties, several of which are not mutually intelligible,"""

 # convert to unicode using str.decode('utf-8')
 In [3]: print ''.join(c for c in s.decode('utf-8') if u'\u4e00' <= c <= u'\u9fff')
 汉语漢語中文

 In [4]: print ''.join(c for c in us if u'\u4e00' <= c <= u'\u9fff')
 汉语漢語中文

To make sure all characters are Chinese, something like this should do:

 all(u'\u4e00' <= c <= u'\u9fff' for c in name.decode('utf-8'))
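The one-liner above could be wrapped in a small helper along these lines. This is only a sketch: the name is_all_chinese and its encoding parameter are hypothetical, and rejecting empty input is a design choice made here, since all() returns True for an empty sequence.

```python
def is_all_chinese(name, encoding='utf-8'):
    """Return True if every character lies in U+4E00..U+9FFF.

    Hypothetical helper; accepts either a byte string or a unicode string.
    """
    # `bytes` is an alias of `str` on Python 2.7 and a separate type on
    # Python 3, so this decodes undecoded input on both.
    if isinstance(name, bytes):
        name = name.decode(encoding)
    # all() is True for an empty iterable, so empty input is rejected first.
    return bool(name) and all(u'\u4e00' <= ch <= u'\u9fff' for ch in name)


print(is_all_chinese(u'\u4e2d\u6587'))                  # unicode input
print(is_all_chinese(u'\u4e2d\u6587'.encode('utf-8')))  # byte input
print(is_all_chinese(u'\u4e2da'))                       # mixed with ASCII
```

Note that U+4E00..U+9FFF is the CJK Unified Ideographs block only; characters in the extension blocks (for example Extension A, U+3400..U+4DBF) fall outside this range and would be rejected.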

In your Python application, use unicode internally: decode as soon as data comes in and encode only when it goes out, creating a "unicode sandwich".
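A minimal sketch of the unicode sandwich, assuming UTF-8 at both boundaries. The function keep_chinese is a hypothetical name used only for illustration: bytes are decoded on the way in, all processing happens on unicode text, and the result is encoded again on the way out.

```python
def keep_chinese(raw_bytes, encoding='utf-8'):
    """Hypothetical example: strip everything outside U+4E00..U+9FFF."""
    # Bottom slice: decode bytes as soon as they enter the program.
    text = raw_bytes.decode(encoding)
    # Filling: work only with unicode text in between.
    filtered = u''.join(ch for ch in text if u'\u4e00' <= ch <= u'\u9fff')
    # Top slice: encode only when the data leaves the program.
    return filtered.encode(encoding)


incoming = u'abc \u4e2d\u6587 xyz'.encode('utf-8')
outgoing = keep_chinese(incoming)
print(outgoing == u'\u4e2d\u6587'.encode('utf-8'))
```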
