How can I make Perl 6 round-trip safe for Unicode data?

A naive Perl 6 program is not round-trip safe with respect to Unicode. It appears to use Normalization Form C (NFC) internally for the Str type:

$ perl -CO -E 'say "e\x{301}"' | perl6 -ne '.say' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+00e9
U+000a

Poking through the docs, I see nothing about this behavior, and I find it very shocking. I can't believe that you have to drop down to the byte level to round-trip text:

$ perl -CO -E 'say "e\x{301}"' | perl6 -e 'while (my $byte = $*IN.read(1)) { $*OUT.write($byte) }' | perl -CI -ne 'printf "U+%04x\n", ord for split //'
U+0065
U+0301
U+000a

Do all text files have to be in NFC to round-trip safely through Perl 6? What if the document is supposed to be in NFD? I must be missing something. I can't believe this is intentional behavior.

+8
unicode perl6
2 answers

The answer seems to be to use the Uni type (the base class for NFD, NFC, etc.), but it doesn't really work that way at the moment, and there is no good way to get a file into a Uni string. So, until some unspecified point in the future, you can't round-trip an unnormalized file unless you treat it as bytes.

+5

Use UTF8-C8. From the docs:

You can use UTF8-C8 with any file handle to read the exact bytes as they are on disk. They may look funny when printed out if you print using a UTF8 handle. If you print to a handle where the output is UTF8-C8, then it will render as you'd normally expect, and be a byte-for-byte exact copy.
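Applied to the pipeline from the question, that might look like the sketch below. It assumes a perl6 binary on the PATH and that calling the encoding method on $*IN and $*OUT before any I/O is enough to switch both handles. I have not verified this on every release, so treat it as a starting point rather than a guarantee.

```shell
# Sketch: switch both standard handles to utf8-c8 before any reading or writing,
# then copy stdin to stdout as text. Needs a perl6 (Raku) binary on PATH.
perl -CO -E 'say "e\x{301}"' \
  | perl6 -e '$*IN.encoding("utf8-c8"); $*OUT.encoding("utf8-c8"); print $*IN.slurp' \
  | perl -CI -ne 'printf "U+%04x\n", ord for split //'
```

If the encoding behaves as the quoted documentation describes, the code points should come out the way they went in, without dropping down to read and write on raw bytes.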

+2