Question about path name encoding

What have I done to get such a strange encoding in this path name?
In my file manager (Dolphin), the path name looks good.

    #!/usr/local/bin/perl
    use warnings;
    use 5.014;
    use utf8;
    use open qw( :encoding(UTF-8) :std );
    use File::Find;
    use Devel::Peek;
    use Encode qw(decode);

    my $string;

    find( sub { $string = $File::Find::name }, 'Delibes, Léo' );

    $string =~ s|Delibes,\ ||;
    $string =~ s|\..*\z||;
    my ( $s1, $s2 ) = split m|/|, $string, 2;

    say Dump $s1;
    say Dump $s2;

    # SV = PV(0x824b50) at 0x9346d8
    #   REFCNT = 1
    #   FLAGS = (PADMY,POK,pPOK,UTF8)
    #   PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
    #   CUR = 4
    #   LEN = 16

    # SV = PV(0x7a7150) at 0x934c30
    #   REFCNT = 1
    #   FLAGS = (PADMY,POK,pPOK,UTF8)
    #   PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
    #   CUR = 8
    #   LEN = 16

    say $s1;
    say $s2;

    # Léo
    # Lakmé

    $s1 = decode( 'utf-8', $s1 );
    $s2 = decode( 'utf-8', $s2 );

    say $s1;
    say $s2;

    # L o
    # Lakmé
+4
2 answers

Unfortunately, your operating system's path name API is another "binary interface" where you will need to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat path names as a sequence of octets (i.e., bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is up to the application. Therefore, the value returned by readdir() is just a sequence of octets, and File::Find does not know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that is how you ended up with a mix of code points and octets.

Rule of thumb: Whenever you pass a path name to the OS, Encode::encode() it to make sure it is a sequence of octets. Whenever you get a path name from the OS, Encode::decode() it into the character string your application wants.

You can fix your program by calling find like this:

 find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') ); 

And then calling Encode::decode() using the value of $File::Find::name :

 my $path = Encode::decode('utf8', $File::Find::name); 
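Putting the two calls together, here is a minimal sketch of the rule of thumb. The helper name find_decoded is hypothetical (it is not part of File::Find), and the sketch assumes that the file system stores names as UTF-8 octets, which is common on Linux but not guaranteed.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Find;

# Hypothetical helper, not part of File::Find: takes a directory name as a
# character string and returns every path found below it as character strings.
# Assumption: the file system stores names as UTF-8 octets.
sub find_decoded {
    my ($dir) = @_;
    my @paths;
    find(
        {
            no_chdir => 1,    # do not change the working directory while walking
            wanted   => sub {
                my $octets = $File::Find::name;             # octets from the OS
                push @paths, decode( 'UTF-8', $octets );    # characters for the application
            },
        },
        encode( 'UTF-8', $dir ),                            # octets for the OS
    );
    return @paths;
}
```

With `use utf8` and a UTF-8 output layer in place, as in the question, `say for find_decoded('Delibes, Léo');` should then print the names without the doubled encoding.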

To be clearer, this is how $File::Find::name was formed:

    use Encode;

    # This is a way to get $dir to be represented as a UTF-8 string
    my $dir = 'L' . chr(233) . 'o' . chr(256);
    chop $dir;

    say "dir: ", d($dir);    # length = 3

    # This is what readdir() is returning:
    my $leaf = encode('utf8', 'Lakem' . chr(233));

    say "leaf: ", d($leaf);    # length = 7

    $File::Find::name = $dir . '/' . $leaf;

    say "File::Find::name: ", d($File::Find::name);

    sub d { join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0])) }
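As a follow-up sketch (the names hex_codes, $mixed and $fixed are made up for illustration), the same concatenation can be done twice: once with the raw octets, and once after decoding the leaf. It assumes the leaf name is UTF-8. It prints the values directly with ord, so no output layer is involved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex dump of every element of a string (same idea as d() above).
sub hex_codes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $str;
}

my $dir  = "L\x{e9}o";                         # characters: L, e-acute, o
my $leaf = encode( 'UTF-8', "Lakm\x{e9}" );    # octets, as readdir() would return them

# Concatenating characters with octets: each octet is taken as one character.
my $mixed = $dir . '/' . $leaf;

# Decoding the octets first gives a consistent character string.
my $fixed = $dir . '/' . decode( 'UTF-8', $leaf );

print hex_codes($mixed), "\n";    # 4C E9 6F 2F 4C 61 6B 6D C3 A9
print hex_codes($fixed), "\n";    # 4C E9 6F 2F 4C 61 6B 6D E9
```

In $mixed the é of the leaf remains as the two elements C3 A9. When a UTF-8 output layer encodes those again, you get the doubled encoding visible in the Dump output of the question.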
+13

The POSIX file system API is broken because no encoding is enforced. Period.

Many problems can happen. For example, a single path name can even contain both Latin-1 and UTF-8, depending on how the various file systems along the path handle encoding (if they handle it at all).
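If you really do have to cope with such mixed paths, one possible heuristic (a sketch only, not a general solution) is to decode each path component separately: try strict UTF-8 first and fall back to Latin-1. The helper names decode_component and decode_path are hypothetical, and the Latin-1 fallback is an assumption. Octets that happen to be valid UTF-8 will always be read as UTF-8, even if they were meant as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one path component.
# Try strict UTF-8 first; if the octets are not valid UTF-8, assume Latin-1.
sub decode_component {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from modifying $octets in place.
        decode( 'UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return defined $chars ? $chars : decode( 'ISO-8859-1', $octets );
}

# Hypothetical helper: decode a whole path, one component at a time.
sub decode_path {
    my ($path) = @_;
    return join '/', map { decode_component($_) } split m{/}, $path, -1;
}
```

Note that the result can no longer be turned back into the original octets with a single Encode::encode() call, so keep the raw octets around if you still need to open the file.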

-2