How to match Chinese characters using Perl regular expressions

I need to match some Chinese characters in UTF-8 encoded HTML, and I wrote the test code below:

    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;
    use Encode;

    my $ua      = new LWP::UserAgent;
    my $request = HTTP::Request->new('GET');
    my $url     = 'http://www.boc.cn/sourcedb/whpj/';
    $request->url($url);
    my $res = $ua->request($request);

    my $str_chinese = encode("utf8", "英磅");
    # my $str_chinese = "英磅";
    my $str_english = "English";

    #my $html = decode("utf8", $res->content);
    my $html = $res->content;

    if ( $html =~ /$str_chinese/ ) {
        print "chinese word matched";
    } else {
        print "chinese word unmatched\n";
    }

    if ( $html =~ /$str_english/i ) {
        print "english word matched\n";
    } else {
        print "english word unmatched\n";
    }

The result shows that the script does not match Chinese characters that exist in the HTML. Could you give me some hints on how to solve my problem?

3 answers

Use the decoded_content method from the HTTP::Message class instead of content. It decodes the response body into Perl characters for you, so manual decoding is not required.

    #!/usr/bin/env perl
    use utf8;
    use strict;
    use LWP::UserAgent;

    my $html = LWP::UserAgent->new
        ->get('http://www.boc.cn/sourcedb/whpj/')
        ->decoded_content;

    my $str_chinese = '首页';
    my $str_english = 'English';

    if ($html =~ /$str_chinese/) {
        print "chinese word matched\n";
    } else {
        print "chinese word unmatched\n";
    }

    if ($html =~ /$str_english/i) {
        print "english word matched\n";
    } else {
        print "english word unmatched\n";
    }

Output:

    chinese word matched
    english word matched
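
As an offline sketch of why this helps: the snippet below uses a hypothetical page fragment in place of the real response, and Encode's decode as a stand-in for the charset decoding that decoded_content performs on text responses. The names $raw_bytes, $needle and matches are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                         # literals in this file are character strings
use Encode qw(encode decode);

# Assumption: a small stand-in for the raw bytes that $res->content would return.
my $raw_bytes = encode('UTF-8', '<a href="/">首页</a> English');

# Decoding turns the UTF-8 bytes into Perl characters.
my $html = decode('UTF-8', $raw_bytes);

my $needle = '首页';

# Returns 1 if $needle occurs literally in $text, otherwise 0.
sub matches {
    my ($text, $needle) = @_;
    return $text =~ /\Q$needle\E/ ? 1 : 0;
}

printf "raw bytes:    %s\n", matches($raw_bytes, $needle) ? 'matched' : 'unmatched';
printf "decoded text: %s\n", matches($html,      $needle) ? 'matched' : 'unmatched';
```

The character literal only matches the decoded text. Against the raw bytes it fails, because each Chinese character is stored there as three separate bytes.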

Since you added UTF-8 characters to the source code, you need to:

 use utf8; 

It tells Perl that your script is written in UTF-8.
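
A quick way to see the difference is to compare the length of the same literal with and without the pragma. This is a minimal sketch that assumes the script file itself is saved as UTF-8; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Without the pragma, the literal is read as raw bytes (three per character here).
my $as_bytes = do { no utf8; "首页" };

# With the pragma, the same literal is read as two characters.
my $as_chars = do { use utf8; "首页" };

printf "without use utf8: length %d\n", length($as_bytes);
printf "with use utf8:    length %d\n", length($as_chars);
```

The pragma is lexically scoped, which is why it can be switched on and off per block here. In a real script it normally goes once at the top of the file.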


I ran your code and the Chinese characters did not match.

Then I checked the HTML, and it does not contain these characters, so this may be the cause of the mismatch. I then tried another character (联) and also removed the encode call, i.e. my $str_chinese = "联";

With this change, the code matches the character.
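
One possible reason why removing the encode call matters: without use utf8 the literal is already a string of UTF-8 bytes, so passing it through encode encodes it a second time. The sketch below illustrates this offline. It assumes the file is saved as UTF-8, and $page_bytes is a hypothetical stand-in for what $res->content returns.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: stand-in for the raw bytes returned by $res->content.
my $page_bytes = do { no utf8; "<a>首页</a>" };

# Without "use utf8" the literal is already a string of UTF-8 bytes.
my $literal = do { no utf8; "首页" };

# encode() treats each of those bytes as a character and encodes it again.
my $double_encoded = encode('UTF-8', $literal);

printf "literal: %d bytes, after encode(): %d bytes\n",
    length($literal), length($double_encoded);
printf "plain literal:   %s\n",
    $page_bytes =~ /\Q$literal\E/        ? 'matched' : 'unmatched';
printf "encoded literal: %s\n",
    $page_bytes =~ /\Q$double_encoded\E/ ? 'matched' : 'unmatched';
```

Matching bytes against bytes works, but it compares byte sequences rather than characters, so constructs such as character classes or /i will not behave as expected on non-ASCII text.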
