Python regex with utf8 problem

I have a file that contains many lines of plain text utf-8. For example, below, in Chinese.

PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08 

The file itself was saved in utf-8 format. file name xx.txt

here is my python code, env - python2.7

 #coding: utf-8 import re pattern = re.compile(r'交易金额:(\d+)元') for line in open('xx.txt'): match = pattern.match(line.decode('utf-8')) if match: print match.group() 

The problem is here - I did not get any results.

I want to get the decimal line from 交易金额:0.01元 , here 0.01 .

Why is this code not working? Can someone explain this to me, I had no idea.

+7
source share
3 answers

There are several problems in the code. You must use re.compile(ur'<unicode string>') . It's also a good idea to add the re.UNICODE flag (not sure if you really need one here). The next is that you will not get a match yet, since \d+ does not process decimals only by a series of numbers, instead you should use \d+\.?\d+ (you want a number, possibly a period and a number). Code example:

 #coding: utf-8 text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012-2-29 10:13:08" import re pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE) print pattern.search(text).group(1) 
+17
source

You need to use .search() , since .match() like starting your regular expression with ^ , i.e. it only checks at the beginning of a line.

+3
source

If you use utf-8, you can use flags = re.LOCALE

 #coding: utf-8 import re pattern = re.compile(r'交易金额:(\d+\.?\d+)元', flags=re.LOCALE) for line in open('xx.txt'): match = pattern.match(line) 

See re.LOCALE for more details. No need to convert utf-8 to unicode.

0
source

All Articles