Python regex with utf8 problem

Question

I have a file that contains many lines of plain UTF-8 text. For example, the line below, in Chinese:

PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 订单号：W12022910079166 交易金额：0.01元 交易状态：true 2012-2-29 10:13:08

The file itself is saved as UTF-8 and is named xx.txt.

Here is my Python code (environment: Python 2.7):

    #coding: utf-8
    import re

    pattern = re.compile(r'交易金额：(\d+)元')
    for line in open('xx.txt'):
        match = pattern.match(line.decode('utf-8'))
        if match:
            print match.group()

The problem is that I do not get any results.

I want to extract the decimal number from 交易金额：0.01元, here 0.01.

Why is this code not working? Can someone explain this to me? I have no idea.

+7

python python-2.7 regex utf-8

castiel May 11 '12 at 6:25

3 answers

You need to use .search(), since .match() behaves as if your regular expression started with ^, i.e. it only tries to match at the beginning of the string.
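A minimal sketch of the difference, written for Python 3 (an assumption; on Python 2.7 the literals would need the u prefix). The shortened sample line and the pattern `[\d.]+` are illustrative choices, not taken from the question:

```python
import re

# Shortened version of the log line from the question (illustrative).
line = "订单号：W12022910079166 交易金额：0.01元 交易状态：true"
pattern = re.compile(r"交易金额：([\d.]+)元")

anchored = pattern.match(line)   # tries the pattern at position 0 only
scanned = pattern.search(line)   # tries the pattern at every position

print(anchored)           # None: the line does not start with 交易金额
print(scanned.group(1))   # 0.01
```

If the whole line had to match, .match() with an explicit leading `.*` would also work, but .search() states the intent more directly.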

+3

Thiefmaster May 11 '12 at 6:27

If you use utf-8, you can use flags = re.LOCALE

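    #coding: utf-8
    import re

    pattern = re.compile(r'交易金额：(\d+\.?\d+)元', flags=re.LOCALE)
    for line in open('xx.txt'):
        # search(), not match(): the amount is not at the start of the line
        match = pattern.search(line)
        if match:
            print match.group(1)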

See re.LOCALE for more details. No need to convert utf-8 to unicode.
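As a hedged aside: matching on the raw UTF-8 bytes should work even without a flag, because the Chinese characters are matched as literal byte sequences and, as far as the documentation describes it, re.LOCALE only changes how classes such as \w and \b behave. A sketch in Python 3 (an assumption; the thread uses 2.7), with an illustrative shortened line and an illustrative pattern:

```python
import re

# Pattern and input are both UTF-8 encoded byte strings, so the Chinese
# text is compared byte for byte and no decoding step is needed.
pattern = re.compile("交易金额：([0-9]+(?:\\.[0-9]+)?)元".encode("utf-8"))
raw_line = "交易号：2012022900000109 交易金额：0.01元 交易状态：true".encode("utf-8")

match = pattern.search(raw_line)
# The captured group is bytes; the digits and the period are plain ASCII.
amount = match.group(1).decode("ascii") if match else None
print(amount)
```

The trade-off is that character classes and case-insensitive matching then operate on bytes, so this only suits patterns whose non-ASCII parts are fixed literals.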

0

Cathy Lin Oct 31 '16 at 10:22

uhz · Accepted Answer · 2012-05-11T06:45:59+0000

There are several problems in the code. You should compile the pattern from a unicode literal: re.compile(ur'<unicode string>'). It's also a good idea to add the re.UNICODE flag (not sure if you really need it here). Even then you will not get a match, since \d+ matches only a run of digits, not a decimal number; use \d+\.?\d+ instead (digits, optionally a period, then more digits). Code example:

    #coding: utf-8
    import re

    text = u"PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 订单号：W12022910079166 交易金额：0.01元 交易状态：true 2012-2-29 10:13:08"
    pattern = re.compile(ur'交易金额：(\d+\.?\d+)元', re.UNICODE)
    print pattern.search(text).group(1)
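Putting the pieces together, here is a sketch written for Python 3 (an assumption; there str is already Unicode, so neither the ur prefix nor .decode() is needed). The helper name extract_amounts and the pattern `\d+(?:\.\d+)?` are illustrative choices. The pattern differs from `\d+\.?\d+` in that it also accepts a single-digit amount such as 5, whereas `\d+\.?\d+` needs at least two digits:

```python
import re

# Digits, optionally followed by a period and more digits.
AMOUNT_RE = re.compile(r"交易金额：(\d+(?:\.\d+)?)元")


def extract_amounts(lines):
    """Return the amount found on each line that contains one, as strings."""
    amounts = []
    for line in lines:
        match = AMOUNT_RE.search(line)  # search(): the amount is mid-line
        if match:
            amounts.append(match.group(1))
    return amounts


sample = [
    "PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 "
    "订单号：W12022910079166 交易金额：0.01元 交易状态：true 2012-2-29 10:13:08",
    "交易金额：5元",
    "no amount here",
]
print(extract_amounts(sample))  # ['0.01', '5']
```

To run it on the file from the question, the lines could come from open('xx.txt', encoding='utf-8'), which yields already decoded text in Python 3.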
