How do I get the length of a Unicode input string in Perl via Ajax or CGI?

Question


This should be very simple, but I have searched everywhere for an answer, and I have also read the following thread: How to find the length of a Unicode string in Perl?

It doesn’t help me. I know how to get Perl to treat the string constant as UTF-8 and return the correct number of characters (instead of bytes), but for some reason this doesn't work when Perl gets the string through my AJAX call.

Below I send the three Greek letters alpha, beta and omega in Unicode. Perl tells me that the length is 6 (bytes) when it should tell me 3 (characters). How do I get the correct character count?

    #!/usr/bin/perl
    use strict;

    if ($ENV{CONTENT_LENGTH}) {
        binmode (STDIN, ":utf8");
        read (STDIN, $_, $ENV{CONTENT_LENGTH});
        s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
        print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
        exit;
    }

    print "Content-Type: text/html; charset=UTF-8\n\n";
    print qq[<html><head><script>
    var oRequest;
    function MakeRequest () {
        oRequest = new XMLHttpRequest();
        oRequest.onreadystatechange = zxResponse;
        oRequest.open ('POST', '/test/unicode.cgi', true);
        oRequest.send (encodeURIComponent (document.oForm.oInput.value));
    }
    function zxResponse () {
        if (oRequest.readyState==4 && oRequest.status==200) {
            alert (oRequest.responseText);
        }
    }
    </script></head><body>
    <form name="oForm" method="POST">
    <input type="text" name="oInput" value="&#x03B1;&#x03B2;&#x03A9;">
    <input type="button" value="Ajax Submit" onClick="MakeRequest();">
    </form>
    </body></html>
    ];

By the way, the code is greatly simplified (I know how to make cross-browser AJAX calls, etc.), and using the CGI Perl module is not an option.

+4

ajax perl unicode utf-8

W3coder Sep 13 '10 at 21:52


3 answers

You need to decode the string before calling length. For instance:

    use Encode;
    my $utf_string = decode_utf8($_);  ## decode the UTF-8 octets into characters
    print length($utf_string);

From the Encode manual:

$string = decode_utf8($octets [, CHECK]);
Equivalent to $string = decode("utf8", $octets [, CHECK]). The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so it is possible for this call to fail. For CHECK, see Handling Malformed Data.
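Applied to the question's setup, the order matters: undo the percent-encoding first (which yields raw UTF-8 octets), then decode those octets into characters. Here is a minimal sketch, assuming the request body is exactly what encodeURIComponent produced; decode_body is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: takes the raw, percent-encoded request body
# and returns the decoded character string.
sub decode_body {
    my ($raw) = @_;

    # Step 1: undo the percent-encoding. The result is a string of
    # UTF-8 octets, one Perl "character" per byte.
    (my $octets = $raw) =~ s{%([a-fA-F0-9]{2})}{ pack('C', hex($1)) }eg;

    # Step 2: decode the octets into logical characters.
    return decode_utf8($octets);
}

# encodeURIComponent("αβΩ") yields this percent-encoded form.
my $text = decode_body('%CE%B1%CE%B2%CE%A9');
print length($text), "\n";    # prints 3
```

In the CGI script this would replace the in-place s{...}{...}eg on $_, with length called on the decoded result.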

+8

Ivan Nevostruev Sep 13 '10 at 22:02


Use utf8::decode if you know the string is in UTF-8. It is in core and carries no memory-usage penalty:

Baseline memory usage of a loop without it:

    $ perl -e 'sleep 1 while 1' &
    [1] 17372
    $ ps u | grep 17372 | grep -v grep
    okram 17372 0.0 0.1 5464 1172 pts/0 S 01:24 0:00 perl -e [...]

Memory usage with Encode:

    $ perl -MEncode -e 'sleep 1 while 1' &
    [1] 17488
    $ ps u | grep 17488 | grep -v grep
    okram 17488 0.7 0.2 6020 2224 pts/0 S 01:27 0:00 perl [...]

The proposed method:

    $ perl -e '$str="ææææ";utf8::decode $str;print length $str,"\n\n"; sleep 1 while 1' &
    [1] 17554
    $ 4
    $ ps u | grep 17554| grep -v grep
    okram 17554 0.0 0.1 5464 1176 pts/0 S 01:28 0:00 perl -e [...]

As you can see, the length of the string after utf8::decode is 4 for this UTF-8 string, and the memory usage is largely the same as the baseline. Encode seems to consume a bit more memory ...
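One thing worth knowing when using utf8::decode on input from a request: it works in place and returns false when the bytes are not well-formed UTF-8, so the return value can be checked. A small sketch under that assumption; utf8_char_count is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the character count of a UTF-8 byte
# string, or undef if the bytes are not well-formed UTF-8.
sub utf8_char_count {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode modifies its argument in place
    return undef unless utf8::decode($copy);
    return length $copy;
}

my $good = "\xCE\xB1\xCE\xB2\xCE\xA9";    # UTF-8 octets for alpha, beta, omega
my $bad  = "\xCE";                        # truncated two-byte sequence

my $good_count = utf8_char_count($good);
my $bad_count  = utf8_char_count($bad);

print "good: $good_count characters\n";
print "bad: ", (defined $bad_count ? $bad_count : "not valid UTF-8"), "\n";
```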

+1

mfontani Sep 14 '10 at 0:32


dawg · Accepted Answer · 2010-09-14T00:12:58+0000

For a “native” way of doing this, you can convert while copying, using this method:

Open an in-memory file on the string with the desired encoding layer and read from it. The conversion happens as the characters are read.

    use strict; use warnings;

    my $utf_str = "αβΩ";    # alpha, beta, omega
    print "$utf_str is ", length $utf_str, " characters\n";

    use open ':encoding(utf8)';
    open my $fh, '<', \$utf_str;

    my $new_str;
    { local $/; $new_str=<$fh>; }

    binmode(STDOUT, ":utf8");
    print "$new_str ", length $new_str, " characters";

    # output:
    # αβΩ is 6 characters
    # αβΩ 3 characters

If you want to convert the encoding in place, you can use this:

    my $utf_str = "αβΩ";
    print "$utf_str is ", length $utf_str, " characters\n";

    binmode(STDOUT, ":utf8");
    utf8::decode($utf_str);
    print "$utf_str is ", length $utf_str, " characters\n";

    # output:
    # αβΩ is 6 characters
    # αβΩ is 3 characters

However, you should not shy away from Encode.
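One reason to reach for Encode is its CHECK argument, which lets you decide what happens with malformed input. The sketch below is only an illustration, assuming you want a strict decode that reports failure rather than dying; strict_length is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strictly decodes UTF-8 octets and returns the
# character count, or undef if the octets are malformed.
sub strict_length {
    my ($octets) = @_;

    # FB_CROAK makes decode die on malformed data; LEAVE_SRC keeps
    # decode from modifying $octets. eval traps the exception.
    my $chars = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };

    return defined $chars ? length($chars) : undef;
}

my $count = strict_length("\xCE\xB1\xCE\xB2\xCE\xA9");    # octets for alpha, beta, omega
print "$count characters\n";                              # prints "3 characters"
```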
