How to read a Chinese word in a file using regular expression in perl?

Question

How to read a Chinese word in a file using regular expression in perl?

I tried the following Perl code to read words from a Chinese file. It seems to run, but the results are not correct. Any help is appreciated.

Error message

    Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
    Total things = 125, valid words =

It seems to me that the problem is the file format. The "Total things" value is 125, which is the number of lines (125 lines). The strangest part is that my console displays all the individual Chinese words correctly, without any problems. The utf8 pragma is enabled.

    #!/usr/bin/perl -w
    use strict;
    use utf8;
    use Encode qw(encode);
    use Encode::HanExtra;

    my $input_file = "sample_file.txt";
    my ($total, $valid);
    my %count;

    open (FILE, "< $input_file") or die "Can't open $input_file: $!";
    while (<FILE>) {
        foreach (split) { #break $_ into words, assign each to $_ in turn
            $total++;
            next if /\W|^\d+/; #strange words skip the remainder of the loop
            $valid++;
            $count{$_}++; # count each separate word stored in a hash
            ## next comes here ##
        }
    }
    print "Total things = $total, valid words = $valid\n";
    foreach my $word (sort keys %count) {
        print "$word \t was seen \t $count{$word} \t times.\n";
    }

    ##---Data---- sample_file.txt

    那天约二更时,只见封肃方回来,欢天喜地.众人忙问端的.他乃说道:"原来本府新升的太爷姓贾名化,本贯胡州人氏,曾与女婿旧日相交.方才在咱门前过去,因见娇杏那丫头买线, 所以他只当女婿移住于此.我一一将原故回明,那太爷倒伤感叹息了一回,又问外孙女儿,我说看灯丢了.太爷说:`不妨,我自使番役务必探访回来.'说了一回话, 临走倒送了我二两银子."甄家娘子听了,不免心中伤感.一宿无话.至次日, 早有雨村遣人送了两封银子,四匹锦缎,答谢甄家娘子,又寄一封密书与封肃,转托问甄家娘子要那娇杏作二房. 封肃喜的屁滚尿流,巴不得去奉承,便在女儿前一力撺掇成了,乘夜只用一乘小轿,便把娇杏送进去了.雨村欢喜,自不必说,乃封百金赠封肃, 外谢甄家娘子许多物事,令其好生养赡,以待寻访女儿下落.封肃回家无话.


regex perl embedding cjk

Ivan Jan 6 '11 at 3:19


2 answers

Hugmeir · Answer 1 · 2011-01-06T03:48:31+0000

We set STDOUT to the :utf8 I/O layer so that the output is not garbled, then open the file with the same layer so that the diamond operator does not read garbled data. After that, inside the while loop, rather than splitting on the empty string, we use a regular expression with the "East_Asian_Width: Wide" Unicode property.

The use utf8 line is only there for my own sanity checking and can be removed.

    use strict;
    use warnings;
    use 5.010;
    use utf8;
    use autodie;

    # Encode output and decode input through the :utf8 I/O layer.
    binmode(STDOUT, ':utf8');
    open my $fh, '<:utf8', 'sample_file.txt';

    my ($total, $valid);
    my %count;

    while (<$fh>) {
        $total += length;          # characters read, newline included
        for (/(\p{Ea=W})/g) {      # every East_Asian_Width=Wide character
            $valid++;
            $count{$_}++;
        }
    }

    say "Total things = $total, valid words = $valid";
    for my $word (sort keys %count) {
        say "$word \t was seen \t $count{$word} \t times.";
    }
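One thing worth checking with this approach: East_Asian_Width=Wide is not limited to ideographs, so CJK punctuation such as the ideographic full stop is counted as well. The sketch below compares it with the \p{Han} script property on a made-up test string. The helper name count_matches and the sample text are illustrative only, not part of the answer above.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal Chinese characters

# Count how many times a pattern matches in a string.
sub count_matches {
    my ($re, $text) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Hypothetical sample: four ideographs plus one ideographic full stop.
my $text = "你好。世界";

my $wide = count_matches(qr/\p{Ea=W}/, $text);   # includes the full stop
my $han  = count_matches(qr/\p{Han}/,  $text);   # ideographs only

print "wide=$wide han=$han\n";
```

If only ideographs should count as "valid words", \p{Han} may be the closer fit. The sample file in the question uses ASCII punctuation, so there the two should give the same counts.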

EDIT: J-16 SDiZ and daxim pointed out that the chances of sample_file.txt being in UTF-8 are ... slim. Read their comments, then take a look at the Encode module in perldoc, specifically the "Encoding via PerlIO" section.
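A minimal sketch of what that could look like, assuming the file turns out to be GBK (cp936). That encoding is a guess for illustration only; the question does not say what the file really uses, and an encoding such as GB18030 would need Encode::HanExtra. To keep the example self-contained it reads GBK bytes from an in-memory filehandle instead of sample_file.txt, and the helper name count_wide_chars is made up.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Count each East_Asian_Width=Wide character read from a decoded filehandle.
sub count_wide_chars {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $count{$_}++ for $line =~ /(\p{Ea=W})/g;
    }
    return \%count;
}

# Stand-in for a legacy-encoded file: GBK bytes held in memory.
my $bytes = encode('cp936', "封肃回家,封肃无话.\n");

# The :encoding(...) layer decodes the bytes into characters while reading.
# ASSUMPTION: cp936 is only an example; use whatever the file is really in.
open my $fh, '<:encoding(cp936)', \$bytes
    or die "Cannot open in-memory file: $!";
my $count = count_wide_chars($fh);
close $fh;

binmode(STDOUT, ':encoding(UTF-8)');
for my $char (sort keys %$count) {
    print "$char \t was seen \t $count->{$char} \t times.\n";
}
```

For a real file, the open call would name the file instead of the scalar reference, for example open my $fh, '<:encoding(cp936)', 'sample_file.txt'.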

johne · Answer 2 · 2011-01-06T03:49:23+0000

I can offer some insight, but it's hard to say whether my answer will be "useful." First, I only speak and read English, so I obviously do not speak or read Chinese. I am the author of RegexKitLite, which is an Objective-C wrapper around the ICU regex engine. This is obviously not perl :).

Regardless, the ICU regex engine has a feature that sounds amazingly like what you are trying to do. In particular, the ICU regex engine has the UREGEX_UWORD option, which can be turned on dynamically with the usual (?w:...) syntax. This option does the following:

Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either "word" or "non-word", which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

You can use this in a regular expression, for example (?w:\b(.*?)\b), to "extract" words from a string. The ICU regex engine has a rather powerful word-breaking mechanism, designed specifically to find word breaks in written languages that do not have an explicit space character the way English does. Again, without reading or writing these languages, I understand that itisroughlysomethinglikethis. The ICU word-breaking mechanism uses heuristics, and sometimes dictionaries, to find word breaks. As far as I understand, Thai is an especially difficult case. In fact, I use ฉันกินข้าว (Thai for "I eat rice," or so I was told) with the regular expression (?w)\b\s* to perform a split operation on a string to extract words. Without (?w), the string cannot be split on word breaks. With (?w), it yields the words ฉัน, กิน and ข้าว.

If the above "sounds like the problem you are facing," then this may be the reason. If so, I don't know how to do this in perl, and I would not consider this an authoritative answer, since I use the ICU regex engine more often than perl and I am clearly not motivated to find a working perl solution when I already have one :). Hope this helps.
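For readers staying in Perl, here is a hedged sketch of the nearest built-in counterpart I am aware of: Perl 5.22 and later accept \b{wb}, a word boundary based on the Unicode UAX 29 rules. As far as I know Perl applies these rules without the dictionaries that ICU uses, so this is not expected to segment Chinese or Thai into dictionary words the way (?w) does in ICU. The example therefore uses English text only. The helper name uax29_words is made up.

```perl
use 5.022;    # \b{wb} requires Perl 5.22 or later; also enables strict and say
use warnings;

# Split on UAX 29 word boundaries and keep only the pieces that contain
# at least one word character (drops spaces and punctuation).
sub uax29_words {
    my ($text) = @_;
    return grep { /\w/ } split /\b{wb}/, $text;
}

my @words = uax29_words("Don't panic, it's fine.");
say join '|', @words;
```

Unlike a plain \b, the UAX 29 boundary does not split at the apostrophe, so "Don't" and "it's" each stay in one piece.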
