I have run into some interesting behavior while trying to work out the difference between Encode::decode("utf8", $var) and utf8::decode($var). I have already discovered that calling the former more than once on the same variable eventually results in the error "Cannot decode string with wide characters at ...", while the latter can be called as many times as you like and simply returns false once there is nothing left to decode.
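For example, here is a minimal sketch of that difference (the error text is what my Perl prints; exact wording and behavior may vary by version):

use strict;
use warnings;
use Encode;

my $str = Encode::decode("utf8", "\xe8\xab\x8b");   # one decode: $str is U+8ACB
eval { Encode::decode("utf8", $str) };              # a second decode on a wide-character string
print "Encode::decode died: $@" if $@;              # "Cannot decode string with wide characters at ..."

my $ok = utf8::decode($str);                        # the same second attempt with utf8::decode
print "utf8::decode just returned false\n" unless $ok;  # no exception on my perl, only a false return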
I am having trouble understanding why the length function returns different results depending on which method is used to decode. The problem arises because I am dealing with "double-encoded" UTF-8 text from an external file. To demonstrate, I created a text file test.txt containing a single line with the following Unicode characters: U+00E8, U+00AB, U+008B, U+000A. These are the double-encoded form of the Unicode character U+8ACB, followed by a newline, and the file was saved to disk encoded as UTF-8.
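In case it helps to reproduce, here is one way to generate such a file (a sketch; any method that writes the same bytes will do):

use strict;
use warnings;
use Encode;

my $once  = Encode::encode("utf8", "\x{8acb}");     # e8 ab 8b
my $twice = Encode::encode("utf8", $once . "\n");   # c3 a8 c2 ab c2 8b 0a
open my $fh, ">", "test.txt" or die $!;
binmode $fh;                                        # write the raw bytes untouched
print $fh $twice;
close $fh;

With that file in place, I ran the following Perl script: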
#!/usr/bin/perl
use strict;
use warnings;

require "Encode.pm";
require "utf8.pm";

open FILE, "test.txt" or die $!;
my @lines = <FILE>;
my $test  = $lines[0];

# Raw line, before any decoding
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# First decode
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# Second decode
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
This gives the following result:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
===============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
===============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
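(As a side check, not part of the script above: the character length and the internal byte length of a decoded string can be compared directly with the bytes pragma; a quick sketch:)

use strict;
use warnings;
use Encode;

my $s = Encode::decode("utf8", "\xc3\xa8\xc2\xab\xc2\x8b\x0a");
print length($s), "\n";        # 4 characters
{
    use bytes;                 # length now counts the internal UTF-8 bytes
    print length($s), "\n";    # 7
}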
This is what I would expect. The length is initially 7 because Perl treats $test as a plain sequence of bytes. After the first decode, Perl knows $test is a sequence of characters decoded from UTF-8, so instead of reporting a length of 7 bytes it reports a length of 4 characters (even though $test still occupies 7 bytes in memory, as the bytes-pragma sketch above confirms). After the second decode, $test holds 2 characters (internally the 4 bytes e8 ab 8b 0a), which I also expect: Encode::decode took the 4 code points, reinterpreted them as UTF-8-encoded bytes, and produced 2 characters. Things get strange when I change the code to call utf8::decode instead, replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test);.
This gives almost identical output; only the final length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
===============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
===============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems that Perl counts bytes before any decoding (as expected) and counts characters after the first decode (also as expected), but then counts bytes again after the second decode (not expected). Why does this switch happen? Or is my understanding of how these decode functions work flawed?
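Boiled down, the smallest reproduction of the discrepancy I can make is below (a sketch; the counts in the comments are what my Perl prints, and other versions may well differ):

use strict;
use warnings;

my $s = "\xc3\xa8\xc2\xab\xc2\x8b\x0a";  # the double-encoded bytes from test.txt
utf8::decode($s);                        # first decode
utf8::decode($s);                        # second decode
my @cp = unpack('U*', $s);
print scalar(@cp), "\n";                 # 2 code points (U+8ACB and the newline)
print length($s), "\n";                  # 4 on my perl, where I would expect 2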
Thanks,
Matt