Byte array output in PHP differs from the JavaScript version

I use a function that transcribes strings into an array of bytes. I have this function in both PHP and JavaScript, but the two behave differently when I pass in these characters: 㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠

How can I make the results the same?

My code is:

PHP:

    function bytesFromWords($string) {
        $bytes = array();
        $j = strlen($string);
        for ($i = 0; $i < $j; $i++) {
            $char = ord(mb_substr($string, $i, 1));
            $bytes[] = $char >> 8;
            $bytes[] = $char & 0xFF;
        }
        return $bytes;
    }

    echo bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠');
    // result: 0,227,0,172,0,129,0,230,0,132,0,131,0,232,0,134,0,152,0,198,0,152,0,225,0,131,0,128,0,228,0,154,0,144,0,226,0,166,0,128,0,233,0,163,0,160,0,229,0,153,0,139,0,38,0,211,0,161,0,224,0,185,0,168,0,227,0,143,0,131,0,230,0,163,0,177,0,236,0,140,0,140,0,216,0,181,0,228,0,140,0,160

JavaScript:

    function bytesFromWords (string) {
        var bytes = [];
        for (var i = 0; i < string.length; i++) {
            var char = string.charCodeAt(i);
            bytes.push(char >>> 8);
            bytes.push(char & 0xFF);
        }
        return bytes;
    }

    console.log(bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠').toString());
    // result: 59,1,97,3,129,152,1,152,16,192,70,144,41,128,152,224,86,75,0,38,4,225,14,104,51,195,104,241,195,12,6,53,67,32
+5
3 answers

Issues:

  • strlen() counts bytes, not Unicode characters.
  • ord() returns only the value of the first byte, not the code point of the character.
  • chr() produces a single byte, so it cannot build a multibyte Unicode character.

Problem with strlen

'γ¬ζ„ƒθ†˜Ζ˜αƒ€δšβ¦€ι£ ε™‹&Σ‘ΰΉ¨γƒζ£±μŒŒΨ΅δŒ '.length returns 17 and strlen('γ¬ζ„ƒθ†˜Ζ˜αƒ€δšβ¦€ι£ ε™‹&Σ‘ΰΉ¨γƒζ£±μŒŒΨ΅δŒ ') returns 46, use: to fix it

 $j = preg_match_all('/.{1}/us', $string, $data); 
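To see where the two numbers come from, here is a small JavaScript sketch. It assumes a runtime that provides TextEncoder (modern browsers and Node.js). The helper name countUnits is made up for illustration, and the sample string is written with escapes so the example does not depend on the file encoding.

```javascript
// Hypothetical helper: reports how JavaScript and PHP's strlen() "see" a string.
function countUnits(str) {
  return {
    // String.prototype.length counts UTF-16 code units (what charCodeAt() indexes).
    codeUnits: str.length,
    // TextEncoder always produces UTF-8; this is the count strlen() reports
    // for a UTF-8 encoded PHP string.
    utf8Bytes: new TextEncoder().encode(str).length,
  };
}

// The string from the question, written as escapes.
const sample =
  '\u3B01\u6103\u8198\u0198\u10C0\u4690\u2980\u98E0\u564B&' +
  '\u04E1\u0E68\u33C3\u68F1\uC30C\u0635\u4320';

console.log(countUnits('\u3B01')); // { codeUnits: 1, utf8Bytes: 3 }
console.log(countUnits(sample));   // { codeUnits: 17, utf8Bytes: 46 }
```

Looping up to strlen() therefore runs 46 times over a string that only has 17 characters.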

Problem with ord

'㬁'.charCodeAt(0) returns 15105, while ord('㬁') returns 227. To fix it, use:

    function unicode_ord($char) {
        list(, $ord) = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
        return $ord;
    }

Source: fooobar.com/questions/591213 / ...
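The relation between 227 and 15105 can be shown from the JavaScript side. This is only a sketch, assuming TextEncoder is available. compareOrdinals and decodeThreeByteSequence are hypothetical names, and the decoder handles nothing but well-formed three-byte UTF-8 sequences.

```javascript
// Hypothetical helper: shows the UTF-8 bytes of a character next to its code unit.
function compareOrdinals(ch) {
  const utf8 = Array.from(new TextEncoder().encode(ch));
  return {
    utf8Bytes: utf8,            // what PHP stores internally
    firstByte: utf8[0],         // what ord() looks at
    codeUnit: ch.charCodeAt(0), // what JavaScript reports
  };
}

// Rebuilds a code point from a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
function decodeThreeByteSequence(b0, b1, b2) {
  return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
}

const info = compareOrdinals('\u3B01'); // the character from the answer
console.log(info.utf8Bytes); // [ 227, 172, 129 ]
console.log(info.firstByte); // 227
console.log(info.codeUnit);  // 15105
console.log(decodeThreeByteSequence(227, 172, 129)); // 15105
```

So 227 is just the lead byte of the three-byte sequence that encodes code point 15105.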

Problem with chr

String.fromCharCode(15105) returns 㬁, while chr(15105) returns a single non-printable byte that displays as blank. To fix it, use:

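    // Builds a UTF-8 character from a code point via a numeric HTML entity.
    // Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2;
    // mb_chr($u, 'UTF-8') (PHP 7.2+) is an alternative.
    function unicode_chr($u) {
        return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
    }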

Source: fooobar.com/questions/773629 / ...
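Why chr() comes back blank for the code of 㬁 (15105) can be imitated in JavaScript. The sketch below assumes PHP's documented behaviour of keeping only the low 8 bits of values outside 0..255. lowByteChar is a hypothetical helper, not a PHP or JavaScript built-in.

```javascript
// Imitates chr(): only the low 8 bits of the argument survive.
function lowByteChar(code) {
  return String.fromCharCode(code & 0xFF);
}

// JSON.stringify makes the control character visible as an escape.
console.log(JSON.stringify(String.fromCharCode(15105))); // "㬁"
console.log(JSON.stringify(lowByteChar(15105)));         // "\u0001"
```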


Full code:

    <?php

    // Returns the Unicode code point of a single UTF-8 character.
    function unicode_ord($char) {
        list(, $ord) = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
        return $ord;
    }

    // Returns the UTF-8 character for a Unicode code point.
    function unicode_chr($u) {
        return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
    }

    // Rebuilds the string from pairs of (high byte, low byte).
    function bytesToWords($bytes) {
        $str = '';
        $j = count($bytes);
        for ($i = 0; $i < $j; $i += 2) {
            $char = $bytes[$i] << 8;
            if ($bytes[$i + 1]) {
                $char |= $bytes[$i + 1];
            }
            $str .= unicode_chr($char);
        }
        return $str;
    }

    // Splits each character's code point into a high byte and a low byte.
    function bytesFromWords($string) {
        $bytes = array();
        $j = preg_match_all('/.{1}/us', $string, $data);
        $data = $data[0];
        foreach ($data as $char) {
            $char = unicode_ord($char);
            $bytes[] = $char >> 8;
            $bytes[] = $char & 0xFF;
        }
        return $bytes;
    }

    $data = bytesFromWords('㬁愃膘ƘჀ䚐⦀飠噋&ӡ๨㏃棱쌌ص䌠');
    echo implode(', ', $data), '<br>';
    echo bytesToWords($data);
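For comparison, a JavaScript counterpart to this round trip could look like the sketch below. bytesFromWords is the function from the question. bytesToWords is an assumed mirror of the PHP helper of the same name and is not part of the original answer.

```javascript
// From the question: two bytes (high, low) per UTF-16 code unit.
function bytesFromWords(string) {
  var bytes = [];
  for (var i = 0; i < string.length; i++) {
    var char = string.charCodeAt(i);
    bytes.push(char >>> 8);
    bytes.push(char & 0xFF);
  }
  return bytes;
}

// Assumed inverse: joins each (high, low) pair back into one code unit.
function bytesToWords(bytes) {
  var str = '';
  for (var i = 0; i < bytes.length; i += 2) {
    // A missing low byte (odd-length input) is treated as 0.
    str += String.fromCharCode((bytes[i] << 8) | (bytes[i + 1] & 0xFF));
  }
  return str;
}

var data = bytesFromWords('\u3B01\u6103'); // the first two characters of the sample
console.log(data.toString());              // 59,1,97,3
console.log(bytesToWords(data) === '\u3B01\u6103'); // true
```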
+2

JavaScript strings are sequences of 16-bit code units (UTF-16, which is identical to UCS-2 for characters in the Basic Multilingual Plane). To get the same ordinal values in PHP, you first need to convert your string to that encoding, for example with mb_convert_encoding() or iconv().

A quick way to get the ordinal values from the converted string is unpack().

    function bytesFromWords($string) {
        // Convert to 16-bit units, then read every byte as an unsigned char.
        $x = mb_convert_encoding($string, 'UCS-2', 'UTF-8');
        $data = unpack('C*', $x);
        return array_values($data);
    }
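One caveat on the JavaScript side is worth sketching. All characters in the question are inside the Basic Multilingual Plane, so each is exactly one 16-bit unit. A character outside it (an emoji is used here purely as an assumed example) occupies two code units in JavaScript, which UCS-2 cannot represent. The helper name codeUnits is hypothetical.

```javascript
// Hypothetical helper: lists the UTF-16 code units of a string.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

const bmp = '\u3B01';       // inside the BMP: one code unit
const astral = '\u{1F600}'; // outside the BMP: a surrogate pair

console.log(codeUnits(bmp));        // [ 15105 ]
console.log(codeUnits(astral));     // [ 55357, 56832 ]
console.log(astral.codePointAt(0)); // 128512
```

If such characters can occur in the input, converting to UTF-16BE instead of UCS-2 on the PHP side would presumably be needed to match; that is an assumption and was not tested here.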

+2

You are using mb_substr(), which can return a multibyte string (even if it is only one code point).

But ord() does not handle that: it only looks at the first byte of the string it is given, not at the whole character.

To get what you want, you just have to split the string and take the single bytes:

    $bytes = str_split($string);
    foreach ($bytes as &$chr) {
        $chr = ord($chr);
    }

Admittedly, this is not what you get in JavaScript. There, string.charCodeAt() returns the code unit of the character, not a sequence of UTF-8 bytes.

The trick for getting the UTF-8 bytes in JavaScript is (copied from fooobar.com/questions/210013 / ... ~ Jonathan Lonowski ):

    var utf8 = unescape(encodeURIComponent(string));
    var arr = [];
    for (var i = 0; i < utf8.length; i++) {
        arr.push(utf8.charCodeAt(i));
    }
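In current runtimes the same bytes can be obtained without the deprecated unescape(). The sketch below wraps both variants in functions so they can be compared. It assumes TextEncoder is available, and both function names are made up for this example.

```javascript
// The approach quoted above, wrapped in a function.
function utf8BytesLegacy(string) {
  var utf8 = unescape(encodeURIComponent(string));
  var arr = [];
  for (var i = 0; i < utf8.length; i++) {
    arr.push(utf8.charCodeAt(i));
  }
  return arr;
}

// Assumed modern equivalent: TextEncoder always encodes to UTF-8.
function utf8BytesModern(string) {
  return Array.from(new TextEncoder().encode(string));
}

var input = '\u3B01&\u0635'; // three characters taken from the question's sample
console.log(utf8BytesLegacy(input).toString()); // 227,172,129,38,216,181
console.log(utf8BytesModern(input).toString()); // 227,172,129,38,216,181
```

These are the same byte values that appear as the second number of each pair in the PHP output from the question.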

But if you need the Unicode code point in PHP, a quick search will help (for example, "How to get the code point number for a given character in utf-8 string?").

+1