Convert UTF8 string to numeric values in Perl

Question

How can I convert a UTF-8 string to numeric values in Perl?

For example,

my $str = '中國c'; # Chinese language of china

I want to print the numeric value of each character:

 20013,22283,99

+6

perl unicode utf-8

Howard Aug 22 '10 at 17:19


4 answers

See perldoc -f ord:

 foreach my $c (split(//, $str)) { print ord($c), "\n"; }

Or compressed into one line: my @chars = map { ord } split //, $str;

Dumped with Data::Dumper, this gives:

 $VAR1 = [ 20013, 22283, 99 ];

+3

Ether Aug 22 '10 at 17:35


For UTF-8 in your source code to be recognized as such, you must say use utf8; beforehand:

 $ perl
 use utf8;
 my $str = '中國c'; # Chinese language of china
 foreach my $c (split(//, $str)) {
     print ord($c), "\n";
 }
 __END__
 20013
 22283
 99

or, more concisely:

 print join ',', map ord, split //, $str;
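Note that use utf8; only affects literals in the source file. If the string instead arrives as raw bytes (for example, read from a file or socket without an encoding layer), it would need to be decoded before ord sees characters rather than bytes. A minimal sketch, assuming the input is valid UTF-8; the helper name codepoints_from_bytes is hypothetical and not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw UTF-8 bytes into a list of code points.
sub codepoints_from_bytes {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps decode from modifying $bytes.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    return map { ord } split //, $chars;
}

# Simulate undecoded input: the UTF-8 bytes of the string from the question.
my $bytes = encode('UTF-8', "\x{4E2D}\x{570B}c");

# Without decoding, ord reports one number per byte.
print join(',', map { ord } split //, $bytes), "\n";    # 228,184,173,229,156,139,99

# After decoding, ord reports one number per character.
print join(',', codepoints_from_bytes($bytes)), "\n";   # 20013,22283,99
```

The first line of output is what you would see if the string were never decoded, which is a common reason for getting seven numbers instead of three.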

+3

ysth Aug 22 '10 at 18:20


http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

 #!/usr/bin/env perl
 use utf8;                    # so literals and identifiers can be in UTF-8
 use v5.12;                   # or later to get "unicode_strings" feature
 use strict;                  # quote strings, declare variables
 use warnings;                # on by default
 use warnings qw(FATAL utf8); # fatalize encoding glitches
 use open qw(:std :utf8);     # undeclared streams in UTF-8
 # use charnames qw(:full :short); # unneeded in v5.16

 # http://perldoc.perl.org/functions/sprintf.html
 # vector flag
 # This flag tells Perl to interpret the supplied string as a vector of
 # integers, one for each character in the string.
 my $str = '中國c';
 printf "%*vd\n", ",", $str;
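To make the vector flag a little more concrete, here is a small sketch of how the separator is chosen. It writes the string with \x{...} escapes so it does not depend on the encoding of the source file; the variable names are only illustrative:

```perl
use strict;
use warnings;

# Same text as '中國c', written with escapes instead of literal characters.
my $str = "\x{4E2D}\x{570B}c";

my $dotted = sprintf '%vd',  $str;        # default join string is '.'
my $commas = sprintf '%*vd', ',', $str;   # '*' takes the join string from the arguments
my $hex    = sprintf '%*vX', ':', $str;   # other integer conversions work too

print "$dotted\n";   # 20013.22283.99
print "$commas\n";   # 20013,22283,99
print "$hex\n";      # 4E2D:570B:63
```

Because the result is already a joined string, this form suits printing; if you need the values as a list, one of the other answers is probably a better fit.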

+2

nk3181544 Jan 10 '14 at 11:38


cjm · Accepted Answer · 2010-08-22T21:59:48+0000

unpack will be more efficient than split and ord because it does not need to create a bunch of temporary 1-character strings:

 use utf8;
 my $str = '中國c'; # Chinese language of china
 my @codepoints = unpack 'U*', $str;
 print join(',', @codepoints) . "\n"; # prints 20013,22283,99

A quick test shows it to be about 3 times faster than split+ord:

 use utf8;
 use Benchmark 'cmpthese';
 my $str = '中國中國中國中國中國中國中國中國中國中國中國中國中國中國c';
 cmpthese(0, {
     'unpack'     => sub { my @codepoints = unpack 'U*', $str; },
     'split-map'  => sub { my @codepoints = map { ord } split //, $str },
     'split-for'  => sub { my @cp; for my $c (split(//, $str)) { push @cp, ord($c) } },
     'split-for2' => sub { my $cp; for my $c (split(//, $str)) { $cp = ord($c) } },
 });

Results:

                Rate split-map split-for split-for2 unpack
 split-map   85423/s        --       -7%       -32%   -67%
 split-for   91950/s        8%        --       -27%   -64%
 split-for2 125550/s       47%       37%         --   -51%
 unpack     256941/s      201%      179%       105%     --

The difference is less pronounced with a shorter string, but unpack is still more than twice as fast. (split-for2 is a bit faster than the other split variants because it does not build a list of code points.)
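As a quick sanity check on the unpack approach, the 'U*' template can also be used with pack to go the other way, so the code points can be turned back into the original string. A minimal sketch, again using escapes so that it does not depend on the source file's encoding:

```perl
use strict;
use warnings;

# Same text as '中國c', written with escapes instead of literal characters.
my $str = "\x{4E2D}\x{570B}c";

# String to code points, as in the answer above.
my @codepoints = unpack 'U*', $str;

# Code points back to a character string.
my $rebuilt = pack 'U*', @codepoints;

print join(',', @codepoints), "\n";                          # 20013,22283,99
print $rebuilt eq $str ? "round trip ok\n" : "mismatch\n";   # round trip ok
```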
