Perl: string literal in module in latin1 - I want utf8

In the module Date::Holidays::DK names of some Danish holidays are written in Latin1 encoding. For example, January 1 is Nytårsdag. What should I do with $x below to get the correct utf8 encoded string?

    use Date::Holidays::DK;
    my $x = is_dk_holiday(2011,1,1);

I tried various combinations of use utf8 and no utf8 before and after use Date::Holidays::DK, but this seems to have no effect. I also tried decode from Encode, with no luck. More specifically,

    use Date::Holidays::DK;
    use Encode;
    use Devel::Peek;

    my $x = decode("iso-8859-1", is_dk_holiday(2011,1,1));
    Dump($x);
    print "January 1st is '$x'\n";

gives

    SV = PV(0x15eabe8) at 0x1492a10
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
      CUR = 10
      LEN = 16
    January 1st is 'Nyt sdag'

(with an invalid character between t and s).

2 answers

I tried various combinations of use utf8 and no utf8 before and after use Date::Holidays::DK, but this seems to have no effect.

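Right. The utf8 pragma only declares that the source code of the file it appears in is written in UTF-8. It is lexically scoped, so it does not change how string literals inside another module are interpreted.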
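
To illustrate the scope of the pragma, here is a minimal sketch that compiles the same literal once without and once with use utf8. It assumes the script itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Same literal twice; only the pragma in effect differs.
# Assumption: this file is saved as UTF-8.
my $as_bytes = do { no utf8;  "Nytårsdag" };   # å stays as two raw bytes
my $as_chars = do { use utf8; "Nytårsdag" };   # å becomes one character

printf "no utf8:  length %d\nuse utf8: length %d\n",
    length($as_bytes), length($as_chars);
```

Neither variant reaches into Date::Holidays::DK, whose literals were compiled under that module's own settings.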

I also tried decode from Encode, with no luck.

You misjudged the result: decoding was actually the right thing to do. You now have a string of Perl characters and can manipulate it.

with an invalid character between t and s

You misinterpret this as well; it is actually the character å.
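
One way to check this without involving the terminal is to inspect the decoded string instead of printing it. The sketch below uses the byte string "Nyt\xE5rsdag" as a stand-in for the return value of is_dk_holiday(2011,1,1); that is an assumption made so the example runs without the module installed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the ISO-8859-1 octets returned by is_dk_holiday(2011,1,1)
# (assumption: the module returns "Nytårsdag" encoded as Latin-1).
my $latin1 = "Nyt\xE5rsdag";

my $x = decode('iso-8859-1', $latin1);

# Report the length in characters and the code point of the fourth one.
printf "length=%d, char 4 is U+%04X\n",
    length($x), ord(substr($x, 3, 1));
```

A length of 9 with U+00E5 in fourth position means the string holds the character å; only the output step is still missing.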


You want to output UTF-8, so you are missing an encoding step.

    my $octets = encode 'UTF-8', $x;
    print $octets;

Read http://p3rl.org/UNI for an introduction to the topic of encoding. You must always decode and encode, either explicitly or implicitly.


use utf8 is just a hint to the perl interpreter/compiler that your source file is encoded in UTF-8. If you have string literals with the high bit set (non-ASCII characters), perl will automatically decode them into Unicode character strings.

If you have a variable encoded in iso-8859-1, you must decode it. Your variable is then in Perl's internal Unicode format. This happens to be utf8, but you should not care which encoding perl uses internally.

Now, if you want to print such a string, you need to convert the Unicode string back to a byte string. You need to call encode on this string. If you don't do the encoding manually, perl transcodes it back to iso-8859-1 when printing, as long as every character fits in that encoding. This is the default encoding.
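
The difference can be made visible by printing to in-memory filehandles and looking at the bytes that arrive. This is a minimal sketch, assuming no default I/O layers are configured (no use open, no -C switch or PERL_UNICODE); the in-memory handles merely stand in for STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Character string, produced the same way as in the question.
my $x = decode('iso-8859-1', "Nyt\xE5rsdag");

# No encoding step: perl falls back to Latin-1 on output.
open my $plain, '>', \my $default_bytes or die "open failed: $!";
print {$plain} $x;
close $plain or die "close failed: $!";

# Explicit encoding step: the handle receives UTF-8 octets.
open my $encoded, '>', \my $utf8_bytes or die "open failed: $!";
print {$encoded} encode('UTF-8', $x);
close $encoded or die "close failed: $!";

printf "default: %d bytes (%s)\n", length($default_bytes), unpack('H*', $default_bytes);
printf "encoded: %d bytes (%s)\n", length($utf8_bytes),    unpack('H*', $utf8_bytes);
```

The first handle receives the single byte e5 for å, which a UTF-8 terminal cannot display; the second receives the two bytes c3 a5.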

Before printing the variable $x, you need to do $x = encode('UTF-8', $x).

For proper UTF-8 processing, you always need to decode() every piece of external input that arrives through I/O, and you always need to encode() everything that leaves your program.
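
As a rough sketch of that rule, the hypothetical helper below (latin1_to_upper_utf8 is an illustrative name, not part of any module) decodes at the input boundary, works on characters in between, and encodes at the output boundary.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes ISO-8859-1 octets from the outside world
# and returns upper-cased UTF-8 octets that are ready to be printed.
sub latin1_to_upper_utf8 {
    my ($octets_in) = @_;
    my $chars = decode('iso-8859-1', $octets_in);   # input boundary: bytes -> characters
    $chars = uc $chars;                             # in between: character operations
    return encode('UTF-8', $chars);                 # output boundary: characters -> bytes
}

# Stand-in input for the holiday name (assumed to be Latin-1 octets).
my $octets_out = latin1_to_upper_utf8("Nyt\xE5rsdag");
print $octets_out, "\n";
```

Because uc runs on the decoded string, å is upper-cased to Å as a character before the result is turned back into bytes.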

To change the default input/output encoding, you can use something like this.

    use utf8;
    use open ':encoding(UTF-8)';
    use open ':std';

The first line says that your source code is encoded in utf8. The second line says that all input/output should automatically be decoded from and encoded to utf8. It is important to note that every open() then also opens its file in utf8 mode. If you are working with binary files, you need to call binmode() on the handle.

But the second line does not change the processing of STDIN, STDOUT or STDERR. The third line will change this.
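
If changing the defaults for the whole file is more than you need, the same layer can be attached to individual handles. This is a sketch of that per-handle alternative, not something the pragma lines above require; the in-memory handle is used only so the resulting bytes can be counted.

```perl
use strict;
use warnings;

my $x = "Nyt\x{E5}rsdag";   # a character string, e.g. the result of decode()

# Explicit layer on a single handle: every print is encoded on the way out.
open my $out, '>:encoding(UTF-8)', \my $bytes or die "open failed: $!";
print {$out} $x;
close $out or die "close failed: $!";

printf "%d characters became %d bytes\n", length($x), length($bytes);

# The same layer can be pushed onto an already-open handle such as STDOUT.
binmode(STDOUT, ':encoding(UTF-8)');
print "January 1st is '$x'\n";
```

With the layer in place, print accepts character strings directly and no manual encode() call is needed for that handle.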

You can probably use the utf8::all module, which makes this process easier. But it is always good to understand how all of this works behind the scenes.

To fix your example, here is one possible way:

    #!/usr/bin/env perl
    use Date::Holidays::DK;
    use Encode;
    use Devel::Peek;

    my $x = decode("iso-8859-1", is_dk_holiday(2011,1,1));
    Dump($x);
    print encode("UTF-8", "January 1st is '$x'\n");