Perl: utf8::decode vs. Encode::decode

Question

I am getting some interesting results while trying to work out the differences between Encode::decode("utf8", $var) and utf8::decode($var). I have already discovered that calling the former multiple times on a variable will eventually result in the error "Cannot decode string with wide characters at ...", whereas the latter will happily run as many times as you want, simply returning false.
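The difference described above can be reproduced in isolation. The following is a minimal sketch, not taken from the original script: the variable names are illustrative, and the input is assumed to be the plain UTF-8 octets of U+8ACB followed by a newline.

```perl
use strict;
use warnings;
use Encode ();

# Assumed input: the UTF-8 octets of U+8ACB plus a newline.
my $octets = "\xE8\xAB\x8B\x0A";

# Encode::decode returns a new character string and leaves $octets alone
# (no CHECK argument is passed, so the source is not modified).
my $chars = Encode::decode('UTF-8', $octets);

# Decoding the result again dies, because U+8ACB cannot be downgraded to a byte.
my $ok = eval { Encode::decode('UTF-8', $chars); 1 };
my $encode_error = $ok ? '' : $@;

# utf8::decode works in place and signals failure through its return value.
my $copy   = $octets;
my $first  = utf8::decode($copy);   # true: the octets were valid UTF-8
my $second = utf8::decode($copy);   # false: $copy now holds a wide character

print "Encode::decode, second call: ", ($ok ? "ok" : "died: $encode_error");
print "utf8::decode, first call:  ", ($first  ? "true" : "false"), "\n";
print "utf8::decode, second call: ", ($second ? "true" : "false"), "\n";
```

Note that Encode::decode only dies once the string contains a character above 0xFF; a flagged string whose characters all fit in a byte is quietly downgraded and decoded again.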

I am having trouble understanding how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "double-encoded" UTF-8 text from an external file. To demonstrate the issue, I created a text file test.txt with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double encoding of the Unicode character U+8acb, along with a newline character. The file was saved to disk as UTF-8. I then run the following Perl script:

 #!/usr/bin/perl
 use strict;
 use warnings;
 require "Encode.pm";
 require "utf8.pm";

 open FILE, "test.txt" or die $!;
 my @lines = <FILE>;
 my $test = $lines[0];

 print "Length: " . (length $test) . "\n";
 print "utf8 flag: " . utf8::is_utf8($test) . "\n";
 my @unicode = (unpack('U*', $test));
 print "Unicode:\n@unicode\n";
 my @hex = (unpack('H*', $test));
 print "Hex:\n@hex\n";

 print "==============\n";

 $test = Encode::decode("utf8", $test);
 print "Length: " . (length $test) . "\n";
 print "utf8 flag: " . utf8::is_utf8($test) . "\n";
 @unicode = (unpack('U*', $test));
 print "Unicode:\n@unicode\n";
 @hex = (unpack('H*', $test));
 print "Hex:\n@hex\n";

 print "==============\n";

 $test = Encode::decode("utf8", $test);
 print "Length: " . (length $test) . "\n";
 print "utf8 flag: " . utf8::is_utf8($test) . "\n";
 @unicode = (unpack('U*', $test));
 print "Unicode:\n@unicode\n";
 @hex = (unpack('H*', $test));
 print "Hex:\n@hex\n";

This gives the following result:

  Length: 7
 utf8 flag: 
 Unicode:
 195 168 194 171 194 139 10
 Hex:
 c3a8c2abc28b0a
 ===============
 Length: 4
 utf8 flag: 1
 Unicode:
 232 171 139 10
 Hex:
 c3a8c2abc28b0a
 ===============
 Length: 2
 utf8 flag: 1
 Unicode:
 35531 10
 Hex:
 e8ab8b0a

This is what I would expect. The length is initially 7 because perl thinks that $test is just a series of bytes. After the first decoding, perl knows that $test is a series of UTF-8-encoded characters (i.e., instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test still occupies 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I expect, since Encode::decode took the 4 code points and interpreted them as UTF-8-encoded bytes, resulting in 2 characters. The strange part comes when I change the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test)).

This gives an almost identical result, only the length result is different:

 Length: 7
 utf8 flag: 
 Unicode:
 195 168 194 171 194 139 10
 Hex:
 c3a8c2abc28b0a
 ===============
 Length: 4
 utf8 flag: 1
 Unicode:
 232 171 139 10
 Hex:
 c3a8c2abc28b0a
 ===============
 Length: 4
 utf8 flag: 1
 Unicode:
 35531 10
 Hex:
 e8ab8b0a

It seems that perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why does this switch happen? Is there a flaw in my understanding of how these decoding functions work?
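For comparison, the 7, 4, 2 progression from the first run can be reproduced without a file. This is only a sketch: it assumes the seven octets from the hex dump are fed in through pack, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The seven octets from the hex dump, built without reading test.txt.
my $raw = pack('H*', 'c3a8c2abc28b0a');

my $once  = decode('UTF-8', $raw);    # undo the outer layer of encoding
my $twice = decode('UTF-8', $once);   # undo the inner layer

# length counts octets for $raw and characters for the decoded strings.
my @lengths = map { length } ($raw, $once, $twice);
print "Lengths: @lengths\n";
printf "First code point after two decodes: U+%04X\n", ord($twice);
```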

Thanks,
Matt

encoding perl utf-8 decoding

Matt Dec 02 '10 at 20:12

2 answers

daxim · Answer 1 · 2010-12-03T14:04:04+0000

You should not use functions from the utf8 pragma module. Its documentation says:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.

You are mistaken in assuming that the octets E8 AB 86 0A are the result of double-encoding the characters 諆 and newline as UTF-8. They are the single UTF-8 encoding of these characters. Perhaps all the confusion on your side stems from this error.
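The distinction between single and double encoding can be checked directly. A minimal sketch, assuming the text is U+8AC6 plus a newline as stated above (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character string: U+8AC6 followed by a newline.
my $text = "\x{8AC6}\n";

# Single encoding: characters to UTF-8 octets.
my $single = encode('UTF-8', $text);

# Double encoding: the octets are mistaken for Latin-1 characters
# and encoded a second time.
my $double = encode('UTF-8', $single);

my $single_hex = unpack('H*', $single);
my $double_hex = unpack('H*', $double);
print "single: $single_hex\n";   # e8ab860a
print "double: $double_hex\n";   # c3a8c2abc2860a
```

Four octets correspond to the single encoding; a double-encoded file would hold seven.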

length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.

 #!/usr/bin/env perl
 use strict;
 use warnings FATAL => 'all';
 use Devel::Peek qw(Dump);
 use Encode qw(decode);

 my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
 # or read the octets without implicit decoding from a file, does not matter

 Dump $test;
 #  FLAGS = (PADMY,POK,pPOK)
 #  PV = 0x8d8520 "\350\253\206\n"\0

 $test = decode('UTF-8', $test, Encode::FB_CROAK);
 Dump $test;
 #  FLAGS = (PADMY,POK,pPOK,UTF8)
 #  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
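When a full Dump is more detail than needed, the two notions of length can be kept apart by asking for each one explicitly. This is a sketch under the same assumed input as above, not part of the original answer; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same four octets as in the Devel::Peek example.
my $octets = "\xE8\xAB\x86\x0A";
my $chars  = decode('UTF-8', $octets);

# Length in characters: length applied to the decoded string.
my $char_count = length($chars);

# Length in octets: length applied to an explicitly encoded copy.
my $octet_count = length(encode('UTF-8', $chars));

print "characters: $char_count\n";
print "octets:     $octet_count\n";
```

Encoding explicitly before measuring avoids depending on how the string happens to be stored internally.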

Matt · Answer 2 · 2011-10-21T18:45:00+0000

It turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190 .
