How to get the character length of Unicode input in Perl via Ajax or CGI?

This should be very simple, but I have searched everywhere for an answer and also read the following thread: How to find the length of a Unicode string in Perl?

It doesn’t help me. I know how to get Perl to treat the string constant as UTF-8 and return the correct number of characters (instead of bytes), but for some reason this doesn't work when Perl gets the string through my AJAX call.

Below I send the three Greek letters alpha, beta and omega as Unicode. Perl tells me the length is 6 (bytes) when it should be 3 (characters). How do I get the correct character count?

    #!/usr/bin/perl
    use strict;

    if ($ENV{CONTENT_LENGTH}) {
        binmode (STDIN, ":utf8");
        read (STDIN, $_, $ENV{CONTENT_LENGTH});
        s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
        print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
        exit;
    }

    print "Content-Type: text/html; charset=UTF-8\n\n";
    print qq[<html><head><script>
    var oRequest;
    function MakeRequest () {
        oRequest = new XMLHttpRequest();
        oRequest.onreadystatechange = zxResponse;
        oRequest.open ('POST', '/test/unicode.cgi', true);
        oRequest.send (encodeURIComponent (document.oForm.oInput.value));
    }
    function zxResponse () {
        if (oRequest.readyState==4 && oRequest.status==200) {
            alert (oRequest.responseText);
        }
    }
    </script></head><body>
    <form name="oForm" method="POST">
    <input type="text" name="oInput" value="&#x03B1;&#x03B2;&#x03A9;">
    <input type="button" value="Ajax Submit" onClick="MakeRequest();">
    </form>
    </body></html>
    ];

By the way, the code is greatly simplified (I know how to make cross-browser AJAX calls, etc.), and using the CGI Perl module is not an option.

+4
3 answers

For a “native” way of doing this, you can convert while copying, using this method:

Open an in-memory filehandle on the string with the desired encoding layer and read from it. The conversion happens as the characters are read.

    use strict;
    use warnings;

    my $utf_str = "αβΩ";   # alpha, beta, omega
    print "$utf_str is ", length $utf_str, " characters\n";

    use open ':encoding(utf8)';
    open my $fh, '<', \$utf_str;
    my $new_str;
    { local $/; $new_str = <$fh>; }

    binmode(STDOUT, ":utf8");
    print "$new_str ", length $new_str, " characters";

    # output:
    # αβΩ is 6 characters
    # αβΩ 3 characters

If you want to convert the encoding in place, you can use this:

    my $utf_str = "αβΩ";
    print "$utf_str is ", length $utf_str, " characters\n";

    binmode(STDOUT, ":utf8");
    utf8::decode($utf_str);
    print "$utf_str is ", length $utf_str, " characters\n";

    # output:
    # αβΩ is 6 characters
    # αβΩ is 3 characters

However, you should not shy away from Encode.
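Applied to the CGI script in the question, the in-place approach might look like the sketch below. It assumes the request body is the percent-encoded UTF-8 that encodeURIComponent produces; the helper name decode_body is hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the raw percent-encoded request body
# and returns it as a decoded character string.
sub decode_body {
    my ($raw) = @_;

    # Undo the percent-encoding; the result is a string of UTF-8 octets.
    (my $octets = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

    # Convert the octets to characters in place. utf8::decode returns
    # false when the octets are not well-formed UTF-8.
    utf8::decode($octets)
        or die "request body is not valid UTF-8\n";

    return $octets;
}

# alpha, beta, Omega as encodeURIComponent would send them
my $text = decode_body('%CE%B1%CE%B2%CE%A9');
print length($text), "\n";
```

For the three Greek letters from the question this prints 3. The key point is the order: percent-decoding has to happen first, because only then does the string contain the UTF-8 octets that utf8::decode can turn into characters.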

+4

You need to decode the string before calling length. For instance:

    use Encode;
    my $utf_string = decode_utf8($_);   ## decode the UTF-8 octets into characters
    print length($utf_string);

From the Encode manual:

$string = decode_utf8($octets [, CHECK]);

Equivalent to $string = decode("utf8", $octets [, CHECK]). The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so this call may fail. For CHECK, see Handling Malformed Data.
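Because the call can fail, it may help to pass a CHECK value and trap the error. The following is a minimal sketch, assuming you want undef back for malformed input; safe_char_length is a hypothetical name, while decode and Encode::FB_CROAK come from Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the number of characters in a string of
# UTF-8 octets, or undef when the octets are not valid UTF-8.
sub safe_char_length {
    my ($octets) = @_;

    # FB_CROAK makes decode die on malformed input instead of quietly
    # substituting replacement characters. With a CHECK value set,
    # decode may modify its source argument, so it is given the local
    # copy in $octets rather than the caller's variable.
    my $chars = eval { decode("utf8", $octets, Encode::FB_CROAK) };

    return defined $chars ? length($chars) : undef;
}

my $good = safe_char_length("\xCE\xB1\xCE\xB2\xCE\xA9");   # alpha, beta, Omega
my $bad  = safe_char_length("\x80");                       # lone continuation byte

print "good: $good\n";
print "bad:  ", (defined $bad ? $bad : "malformed"), "\n";
```

Whether to die, warn, or substitute on bad input is a design choice; FB_CROAK is used here only because it makes the failure visible.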

+8

Use utf8::decode if you know the string is UTF-8. It is part of the Perl core and carries no memory penalty:

Baseline memory usage of a bare loop, without loading anything:

    $ perl -e 'sleep 1 while 1' &
    [1] 17372
    $ ps u | grep 17372 | grep -v grep
    okram 17372 0.0 0.1 5464 1172 pts/0 S 01:24 0:00 perl -e [...]

Memory usage with Encode:

    $ perl -MEncode -e 'sleep 1 while 1' &
    [1] 17488
    $ ps u | grep 17488 | grep -v grep
    okram 17488 0.7 0.2 6020 2224 pts/0 S 01:27 0:00 perl [...]

The proposed method:

    $ perl -e '$str="ææææ";utf8::decode $str;print length $str,"\n\n"; sleep 1 while 1' &
    [1] 17554
    $ 4

    $ ps u | grep 17554| grep -v grep
    okram 17554 0.0 0.1 5464 1176 pts/0 S 01:28 0:00 perl -e [...]

As you can see, the length of this UTF-8 string after utf8::decode is 4, and the memory usage is essentially the same as the baseline (the first example). Encode seems to consume a bit more memory ...

+1
