Finding a good Unicode-compatible alternative to the PHP function ord()

After quite a bit of searching and testing, the simplest method I found for a Unicode-compatible alternative to the PHP function ord() is this:

    $utf8Character = 'Ą';
    list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
    echo $ord; # 260

I found it here. However, it was mentioned that this method is rather slow. Does anyone know a more efficient method that is almost as simple? And what does UCS-4BE mean?

3 answers

You can also implement this function using iconv(), but the mb_convert_encoding method you have seems reasonable. Just make sure $utf8Character is a single character and not a longer string, and it will work quite well.
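A minimal sketch of the iconv() variant mentioned above, assuming the iconv extension is loaded and its underlying library accepts the encoding name UCS-4BE. The function name utf8_ord_iconv is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 character via iconv().
function utf8_ord_iconv($utf8Character)
{
    // Convert to UCS-4BE: one 32-bit big-endian integer per character.
    // The @ suppresses the notice iconv() raises on invalid input.
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $utf8Character);

    if ($ucs4 === false || strlen($ucs4) !== 4) {
        // Invalid UTF-8, an empty string, or more than one character.
        return false;
    }

    // 'N' reads an unsigned 32-bit big-endian integer.
    list(, $ord) = unpack('N', $ucs4);
    return $ord;
}

echo utf8_ord_iconv('Ą'); // 260
```

The length check is what enforces the "single character" requirement: anything other than exactly four bytes of UCS-4BE output is rejected instead of silently returning the first character's code point.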

UCS-4BE is a Unicode encoding that stores each character as a 32-bit (4-byte) integer. That explains the "UCS-4"; the suffix "BE" indicates that the integers are stored in big-endian byte order (most significant byte first). The reason for using this encoding is that, unlike variable-width encodings (for example, UTF-8 or UTF-16), it needs neither multi-byte sequences nor surrogate pairs: each character has a fixed size, so its code point can be read directly as an integer.
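To make the byte layout concrete, here is a small sketch that inspects the UCS-4BE bytes of the character from the question. It assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Convert one UTF-8 character (U+0104) to UCS-4BE.
$ucs4 = mb_convert_encoding('Ą', 'UCS-4BE', 'UTF-8');

echo strlen($ucs4), "\n";   // 4
echo bin2hex($ucs4), "\n";  // 00000104

// 'N' = unsigned 32-bit, big-endian: matches the "BE" in UCS-4BE.
list(, $bigEndian) = unpack('N', $ucs4);
echo $bigEndian, "\n";      // 260

// 'V' = unsigned 32-bit, little-endian: the same bytes read in the wrong order.
list(, $littleEndian) = unpack('V', $ucs4);
echo $littleEndian, "\n";   // 67174400
```

This is why the snippet in the question pairs 'UCS-4BE' with the 'N' format of unpack(): both sides have to agree on the byte order.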


I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:

  • It defines the functions mb_ord and mb_chr only if they do not already exist. If they exist in your framework or in a future version of PHP, the polyfill will be ignored.

  • It uses the widely used mbstring extension for the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.

I also added functions for encoding / decoding HTML entities and for encoding / decoding in JSON format, as well as some demo code showing how to use these functions.


Code:

    // JSON-based escaping: non-ASCII characters become \uXXXX sequences.
    if (!function_exists('codepoint_encode')) {
        function codepoint_encode($str) {
            return substr(json_encode($str), 1, -1);
        }
    }

    if (!function_exists('codepoint_decode')) {
        function codepoint_decode($str) {
            return json_decode(sprintf('"%s"', $str));
        }
    }

    // Fallbacks based on iconv, used only when mbstring is not loaded.
    if (!function_exists('mb_internal_encoding')) {
        function mb_internal_encoding($encoding = NULL) {
            // Fixed: the original tested the undefined variable $from_encoding.
            return ($encoding === NULL)
                ? iconv_get_encoding('internal_encoding')
                : iconv_set_encoding('internal_encoding', $encoding);
        }
    }

    if (!function_exists('mb_convert_encoding')) {
        function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
            return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
        }
    }

    if (!function_exists('mb_chr')) {
        function mb_chr($ord, $encoding = 'UTF-8') {
            if ($encoding === 'UCS-4BE') {
                return pack("N", $ord);
            } else {
                return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
            }
        }
    }

    if (!function_exists('mb_ord')) {
        function mb_ord($char, $encoding = 'UTF-8') {
            if ($encoding === 'UCS-4BE') {
                // 4 bytes: read as a 32-bit integer; otherwise read the first 2 bytes as a 16-bit integer.
                list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
                return $ord;
            } else {
                return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
            }
        }
    }

    // Replaces every non-ASCII character with a numeric HTML entity.
    if (!function_exists('mb_htmlentities')) {
        function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
            return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
                return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
            }, $string);
        }
    }

    if (!function_exists('mb_html_entity_decode')) {
        function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
            return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
        }
    }

How to use:

    echo "Get string from numeric DEC value\n";
    var_dump(mb_chr(50319, 'UCS-4BE'));
    var_dump(mb_chr(271));

    echo "\nGet string from numeric HEX value\n";
    var_dump(mb_chr(0xC48F, 'UCS-4BE'));
    var_dump(mb_chr(0x010F));

    echo "\nGet numeric value of character as DEC int\n";
    var_dump(mb_ord('ď', 'UCS-4BE'));
    var_dump(mb_ord('ď'));

    echo "\nGet numeric value of character as HEX string\n";
    var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
    var_dump(dechex(mb_ord('ď')));

    echo "\nEncode / decode to DEC based HTML entities\n";
    var_dump(mb_htmlentities('tchüß', false));
    var_dump(mb_html_entity_decode('tch&#252;&#223;'));

    echo "\nEncode / decode to HEX based HTML entities\n";
    var_dump(mb_htmlentities('tchüß'));
    var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

    echo "\nUse JSON encoding / decoding\n";
    var_dump(codepoint_encode("tchüß"));
    var_dump(codepoint_decode('tch\u00fc\u00df'));

Output:

    Get string from numeric DEC value
    string(4) "ď"
    string(2) "ď"

    Get string from numeric HEX value
    string(4) "ď"
    string(2) "ď"

    Get numeric value of character as DEC int
    int(50319)
    int(271)

    Get numeric value of character as HEX string
    string(4) "c48f"
    string(3) "10f"

    Encode / decode to DEC based HTML entities
    string(15) "tch&#252;&#223;"
    string(7) "tchüß"

    Encode / decode to HEX based HTML entities
    string(15) "tch&#xFC;&#xDF;"
    string(7) "tchüß"

    Use JSON encoding / decoding
    string(15) "tch\u00fc\u00df"
    string(7) "tchüß"


Here is my approach for converting a string to an int using this formula: it adds up the code points of all characters in the string. You can also split the string into characters and use array_reduce to sum them up.

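    /**
     * Sums the Unicode code points of all characters in a UTF-8 string.
     *
     * @param string $string
     * @param int $index
     * @param int $carryResult
     * @return int
     */
    function convertEncoding($string, $index = 0, $carryResult = 0) {
        // Stop once every character has been processed.
        if ($index >= mb_strlen($string, 'UTF-8')) {
            return $carryResult;
        }
        // mb_substr picks a whole character; $string[$index] would pick a single byte.
        $currentCharacter = mb_substr($string, $index, 1, 'UTF-8');
        list(, $ord) = unpack('N', mb_convert_encoding($currentCharacter, 'UCS-4BE', 'UTF-8'));
        // Plain function, so recurse without $this.
        return convertEncoding($string, $index + 1, $carryResult + $ord);
    }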
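The array_reduce variant mentioned above could look roughly like this. It is a sketch, assuming the mbstring extension is loaded and the input is valid UTF-8; it splits the string with preg_split and the /u modifier rather than explode(), since explode() needs a delimiter. The function name sum_code_points is hypothetical.

```php
<?php
// Hypothetical helper: sum of the code points of all characters in a UTF-8 string.
function sum_code_points($string)
{
    // Split into single characters; the /u modifier makes the pattern UTF-8 aware.
    // Assumption: $string is valid UTF-8 (preg_split returns false otherwise).
    $characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    return array_reduce($characters, function ($carry, $character) {
        list(, $ord) = unpack('N', mb_convert_encoding($character, 'UCS-4BE', 'UTF-8'));
        return $carry + $ord;
    }, 0);
}

echo sum_code_points('AĄ'); // 325 (65 + 260)
```

Compared with the recursive version, this avoids one function call per character and does not grow the call stack on long strings.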