How to reverse a Unicode string

In a comment on an answer to this question, it was suggested that PHP cannot reverse Unicode strings.

As for Unicode, it "works" in PHP because most applications process it as binary. Yes, PHP is 8-bit clean. Try the equivalent of this in PHP: perl -Mutf8 -e 'print scalar reverse("ほげほげ")'. You will get garbage, not "げほげほ". - jrockway

And, unfortunately, it is correct that PHP's Unicode support is at the moment "lacking" at best. This will hopefully change drastically with PHP6.

PHP's MultiByte (mbstring) functions provide the basic functionality needed to work with Unicode, but they are inconsistent and many functions are missing. One of the missing ones is a function to reverse a string.
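To make the problem concrete, here is a minimal sketch of what the byte-oriented strrev() does to UTF-8 text. The helper name is_valid_utf8() is made up for this illustration; it relies on preg_match() returning false when the subject is not valid UTF-8 under the /u modifier.

```php
<?php
// Hypothetical helper (not part of PHP): true if $s is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for an invalid UTF-8 subject.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$text     = "ほげ";          // 2 characters, 6 bytes in UTF-8
$reversed = strrev($text);   // strrev() reverses bytes, not characters

echo bin2hex($text), "\n";      // e381bbe38192
echo bin2hex($reversed), "\n";  // 9281e3bb81e3

var_dump(is_valid_utf8($text));      // bool(true)
var_dump(is_valid_utf8($reversed));  // bool(false)
```

Each character occupies three bytes here, so reversing byte by byte puts the continuation bytes in front of their lead byte, and the result is no longer valid UTF-8.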

Of course, I wanted to reverse this text for no other reason than to find out whether it was possible. So I wrote a function to accomplish the enormously complex task of reversing this Unicode text, so you can relax a bit longer until PHP6 arrives.

Test code:

    $enc = 'UTF-8';
    $text = "ほげほげ";
    $defaultEnc = mb_internal_encoding();

    echo "Showing results with encoding $defaultEnc.\n\n";

    $revNormal = strrev($text);
    $revInt = mb_strrev($text);
    $revEnc = mb_strrev($text, $enc);

    echo "Original text is: $text .\n";
    echo "Normal strrev output: " . $revNormal . ".\n";
    echo "mb_strrev without encoding output: $revInt.\n";
    echo "mb_strrev with encoding $enc output: $revEnc.\n";

    if (mb_internal_encoding($enc)) {
        echo "\nSetting internal encoding to $enc from $defaultEnc.\n\n";

        $revNormal = strrev($text);
        $revInt = mb_strrev($text);
        $revEnc = mb_strrev($text, $enc);

        echo "Original text is: $text .\n";
        echo "Normal strrev output: " . $revNormal . ".\n";
        echo "mb_strrev without encoding output: $revInt.\n";
        echo "mb_strrev with encoding $enc output: $revEnc.\n";
    } else {
        echo "\nCould not set internal encoding to $enc!\n";
    }
6 answers

The grapheme functions handle UTF-8 strings more correctly than the mbstring and PCRE functions, which may break a character apart from its combining marks. You can see the difference between them by running the following code.

    function str_to_array($string)
    {
        $length = grapheme_strlen($string);
        $ret = [];

        for ($i = 0; $i < $length; $i += 1) {
            $ret[] = grapheme_substr($string, $i, 1);
        }

        return $ret;
    }

    function str_to_array2($string)
    {
        $length = mb_strlen($string, "UTF-8");
        $ret = [];

        for ($i = 0; $i < $length; $i += 1) {
            $ret[] = mb_substr($string, $i, 1, "UTF-8");
        }

        return $ret;
    }

    function str_to_array3($string)
    {
        return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    }

    function utf8_strrev($string)
    {
        return implode(array_reverse(str_to_array($string)));
    }

    function utf8_strrev2($string)
    {
        return implode(array_reverse(str_to_array2($string)));
    }

    function utf8_strrev3($string)
    {
        return implode(array_reverse(str_to_array3($string)));
    }

    // http://www.php.net/manual/en/function.grapheme-strlen.php
    $string = "a\xCC\x8A"  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
            . "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6)

    var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
    [
        'should be' => "o\xCC\x88"."a\xCC\x8A",
        'grapheme' => utf8_strrev($string),
        'mbstring' => utf8_strrev2($string),
        'pcre' => utf8_strrev3($string)
    ]));

The result:

    array(4) {
      ["should be"]=>
      string(12) "6FCC8861CC8A"
      ["grapheme"]=>
      string(12) "6FCC8861CC8A"
      ["mbstring"]=>
      string(12) "CC886FCC8A61"
      ["pcre"]=>
      string(12) "CC886FCC8A61"
    }
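If the intl extension is not installed, PCRE's \X escape, which matches one extended grapheme cluster, may serve as a fallback. The sketch below assumes a PCRE build whose \X behaves correctly (older PHP versions reportedly had problems with it); the function name grapheme_strrev_pcre() is made up for this example.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string by grapheme cluster using PCRE only.
function grapheme_strrev_pcre(string $str): string
{
    // \X matches one extended grapheme cluster; /u treats pattern and subject as UTF-8.
    // preg_match_all() returns false if the subject is not valid UTF-8.
    if (preg_match_all('/\X/u', $str, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return implode('', array_reverse($matches[0]));
}

// Same decomposed test string as above: "a" + combining ring, "o" + combining diaeresis.
$string = "a\xCC\x8A" . "o\xCC\x88";
echo strtoupper(bin2hex(grapheme_strrev_pcre($string))), "\n"; // 6FCC8861CC8A
```

For this input the output matches the "should be" value, because each base letter stays together with its combining mark.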

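IntlBreakIterator can be used as of PHP 5.5 (intl 3.0). Note that createCodePointInstance() iterates over code points; createCharacterInstance() iterates over grapheme cluster boundaries instead.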

    function utf8_strrev($str)
    {
        $it = IntlBreakIterator::createCodePointInstance();
        $it->setText($str);

        $ret = '';
        $pos = 0;
        $prev = 0;

        foreach ($it as $pos) {
            $ret = substr($str, $prev, $pos - $prev) . $ret;
            $prev = $pos;
        }

        return $ret;
    }

Here is another approach using regex:

    function utf8_strrev($str){
        preg_match_all('/./us', $str, $ar);
        return implode(array_reverse($ar[0]));
    }

Here is another way. It seems to work without specifying an encoding (tested with a couple of different mb_internal_encoding values):

    function mb_strrev($text)
    {
        return join('', array_reverse(
            preg_split('~~u', $text, -1, PREG_SPLIT_NO_EMPTY)
        ));
    }

Answer

    function mb_strrev($text, $encoding = null)
    {
        $funcParams = array($text);
        if ($encoding !== null)
            $funcParams[] = $encoding;
        $length = call_user_func_array('mb_strlen', $funcParams);

        $output = '';
        $funcParams = array($text, $length, 1);
        if ($encoding !== null)
            $funcParams[] = $encoding;
        while ($funcParams[1]--) {
            $output .= call_user_func_array('mb_substr', $funcParams);
        }
        return $output;
    }

Another method:

    function mb_strrev($str, $enc = null)
    {
        if(is_null($enc)) $enc = mb_internal_encoding();
        $str = mb_convert_encoding($str, 'UTF-16BE', $enc);
        return mb_convert_encoding(strrev($str), $enc, 'UTF-16LE');
    }
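One caveat with the UTF-16 trick: characters outside the Basic Multilingual Plane (many emoji, for example) are stored as surrogate pairs, and reversing the bytes swaps the two halves of each pair, which produces invalid UTF-16. A possible variation, sketched here under the assumption that mbstring is available, uses the fixed-width UTF-32 encoding instead. The name mb_strrev_utf32() is made up for this example.

```php
<?php
// Hypothetical variant of the byte-swap trick using fixed-width UTF-32.
// Every code point is exactly 4 bytes, so reversing the bytes of a UTF-32BE
// string yields the code points in reverse order, each encoded as UTF-32LE.
function mb_strrev_utf32(string $str, ?string $enc = null): string
{
    if ($enc === null) {
        $enc = mb_internal_encoding();
    }

    $utf32 = mb_convert_encoding($str, 'UTF-32BE', $enc);

    return mb_convert_encoding(strrev($utf32), $enc, 'UTF-32LE');
}
```

Calling mb_strrev_utf32("a😀", 'UTF-8') should return "😀a". Like the other code-point-based approaches, this still separates combining marks from their base characters.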

Easy: utf8_strrev( $str ). See the relevant source code from my library, copied below:

    function utf8_strrev( $str ) {
        return implode( array_reverse( utf8_split( $str ) ) );
    }

    function utf8_split( $str , $split_length = 1 ) {
        $str = ( string ) $str;
        $ret = array( );

        if( pcre_utf8_support( ) ) {
            $str = utf8_clean( $str );
            // Split at every position except the start and end of the string
            $ret = preg_split( '/(?<!^)(?!$)/u' , $str );

            // \X is buggy in many recent versions of PHP
            //preg_match_all( '/\X/u' , $str , $ret );
            //$ret = $ret[0];
        } else {
            //Fallback: walk the bytes and group them by the UTF-8 lead byte
            $len = strlen( $str );

            for( $i = 0 ; $i < $len ; $i++ ) {
                if( ( $str[$i] & "\x80" ) === "\x00" ) {
                    // 1 byte (ASCII)
                    $ret[] = $str[$i];
                } else if( ( ( $str[$i] & "\xE0" ) === "\xC0" ) && ( isset( $str[$i+1] ) ) ) {
                    // 2-byte sequence
                    if( ( $str[$i+1] & "\xC0" ) === "\x80" ) {
                        $ret[] = $str[$i] . $str[$i+1];
                        $i++;
                    }
                } else if( ( ( $str[$i] & "\xF0" ) === "\xE0" ) && ( isset( $str[$i+2] ) ) ) {
                    // 3-byte sequence
                    if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) ) {
                        $ret[] = $str[$i] . $str[$i+1] . $str[$i+2];
                        $i = $i + 2;
                    }
                } else if( ( ( $str[$i] & "\xF8" ) === "\xF0" ) && ( isset( $str[$i+3] ) ) ) {
                    // 4-byte sequence
                    if( ( ( $str[$i+1] & "\xC0" ) === "\x80" ) && ( ( $str[$i+2] & "\xC0" ) === "\x80" ) && ( ( $str[$i+3] & "\xC0" ) === "\x80" ) ) {
                        $ret[] = $str[$i] . $str[$i+1] . $str[$i+2] . $str[$i+3];
                        $i = $i + 3;
                    }
                }
            }
        }

        if( $split_length > 1 ) {
            $ret = array_chunk( $ret , $split_length );
            $ret = array_map( 'implode' , $ret );
        }

        // isset() guards against an empty result from the fallback branch
        if( isset( $ret[0] ) && $ret[0] === '' ) {
            return array( );
        }

        return $ret;
    }

    function utf8_clean( $str , $remove_bom = false ) {
        // Keep well-formed byte sequences ($1), drop any other byte
        $regx = '/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./s';
        $str = preg_replace( $regx , '$1' , $str );

        if( $remove_bom ) {
            $str = utf8_str_replace( utf8_bom( ) , '' , $str );
        }

        return $str;
    }

    function utf8_str_replace( $search , $replace , $subject , &$count = 0 ) {
        return str_replace( $search , $replace , $subject , $count );
    }

    function utf8_bom( ) {
        return "\xef\xbb\xbf";
    }

    function pcre_utf8_support( ) {
        static $support;

        if( !isset( $support ) ) {
            $support = @preg_match( '//u', '' ); //Cached the response
        }

        return $support;
    }
