Truncate a UTF-8 string to match the given number of bytes in PHP

Question

Say we have a UTF-8 string $s and we need to shorten it so that it can be stored in N bytes. Blindly truncating it to N bytes can corrupt it by cutting through a multibyte character. But decoding it to find the character boundaries is a drag. Is there a neat way?

[Edit 20100414] In addition to S.Mark's answer, mb_strcut(), I recently found another function that does the job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension. Since intl is an ICU wrapper, I have a lot of confidence in it.

+6

string php unicode utf-8 truncate

user213154 Dec 28 '09 at 0:32

6 answers

Edit: S.Mark's answer is actually better than mine - PHP has a (poorly documented) built-in function that solves exactly this problem.

My original, back-to-the-bits answer follows:

Truncate the desired number of bytes
If the last byte begins with 110 (binary), discard it as well
If the second-last byte begins with 1110 (binary), discard the last 2 bytes
If the third-last byte begins with 11110 (binary), discard the last 3 bytes

This ensures that you don't have an incomplete character dangling at the end, which is the main thing that can go wrong when truncating UTF-8.
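
The steps above can be sketched as a small function. This is only a minimal sketch, not the answerer's code: the name utf8_truncate_bytes is hypothetical, the input is assumed to be valid UTF-8, and the sketch also treats a 3- or 4-byte lead byte sitting in the very last position as incomplete, a case the steps do not spell out.

```php
<?php
// Hypothetical helper: truncate valid UTF-8 to at most $maxBytes bytes
// without leaving an incomplete character at the end.
function utf8_truncate_bytes(string $s, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($s) <= $maxBytes) {
        return $s; // already fits (assumes the input is valid UTF-8)
    }

    $t = substr($s, 0, $maxBytes); // blind byte truncation
    $len = strlen($t);

    // Walk back over at most 3 trailing bytes looking for the lead byte
    // of the last character.
    for ($back = 1; $back <= 3 && $back <= $len; $back++) {
        $byte = ord($t[$len - $back]);

        if (($byte & 0xC0) === 0x80) {
            continue; // 10xxxxxx: continuation byte, keep looking
        }

        // Number of bytes the character starting here needs.
        if ($byte >= 0xF0) {
            $need = 4; // 11110xxx
        } elseif ($byte >= 0xE0) {
            $need = 3; // 1110xxxx
        } elseif ($byte >= 0xC0) {
            $need = 2; // 110xxxxx
        } else {
            $need = 1; // 0xxxxxxx (ASCII)
        }

        // Drop the character if fewer bytes survived than it needs.
        return $need > $back ? substr($t, 0, $len - $back) : $t;
    }

    // Three trailing continuation bytes: in valid UTF-8 the 4-byte
    // character they belong to is complete.
    return $t;
}

echo utf8_truncate_bytes("na\u{ef}ve", 3), "\n"; // prints "na"
```

Only the last few bytes are inspected, so the cost does not depend on the length of the string.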

Unfortunately (as Andrei reminds me in the comments), there are also cases where two separately encoded Unicode code points form a single character (mostly diacritics: an accent, for example, can be represented as a separate code point that modifies the preceding letter).

Handling this kind of thing requires advanced Unicode-fu that is not available in PHP and may not even be possible for all cases (there are some weird scripts out there!), but fortunately it is relatively rare, at least for Latin-based languages.

+11

Michael Borgwardt Dec 28 '09 at 0:55

I wrote this simple function for this purpose; it requires the mbstring extension.

    function str_truncate($string, $bytes = null)
    {
        if (isset($bytes) === true) {
            // to speed things up
            $string = mb_substr($string, 0, $bytes, 'UTF-8');

            while (strlen($string) > $bytes) {
                $string = mb_substr($string, 0, -1, 'UTF-8');
            }
        }

        return $string;
    }

While this code also works, S.Mark's answer is obviously the way to go.

+1

Alix Axel Dec 28 '09 at 2:00

Here is a test for mb_strcut(). It does not prove that it does exactly what we are looking for, but I find it pretty convincing.

    <?php
    ini_set('default_charset', 'UTF-8');

    $strs = array(
        'Iñtërnâtiônàlizætiøn',
        'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
        'ايران لا ترى تغييرا في الموقف الأمريكي',
        '独・米で死傷者を出した銃の乱射事件',
        '國會預算處公布驚人的赤字數據後',
        '이며 세계 경제 회복에 걸림돌이 되고 있다',
        '      ',
        'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
        'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
        'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
        'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
        'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
        'რუსეთი ასევე გეგმავს სამხედრო'
    );

    // Cut every string at 10, 15, 20, 25 and 30 bytes.
    for ($i = 10; $i <= 30; $i += 5) {
        foreach ($strs as $s) {
            $t = mb_strcut($s, 0, $i, 'UTF-8');
            // Columns: length in characters, length in bytes, validity, result.
            print(
                sprintf('%3s%3s ', mb_strlen($t, 'UTF-8'), mb_strlen($t, 'latin1'))
                . (mb_check_encoding($t, 'UTF-8') ? ' OK ' : ' Bad ')
                . $t . "\n"
            );
        }
    }
    ?>

+1

user213154 Dec 28 '09 at 11:57

In addition to S.Mark's answer, which was mb_strcut(), I recently found another function that does a similar job: grapheme_extract($s, $n, GRAPHEME_EXTR_MAXBYTES); from the intl extension.

The functionality is slightly different: the mb_strcut() documentation says that it cuts at the nearest UTF-8 character boundary, so it does not take graphemes made of several code points into account, whereas grapheme_extract() does. So, depending on what you need, grapheme_extract() might be better (for example, for displaying a string) or mb_strcut() might be better (for example, for indexing). In any case, I thought I would mention it.

(And since intl is an ICU wrapper, I have a lot of confidence in it.)
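
To make the difference concrete, here is a minimal sketch that cuts the same string with both functions. The wrapper names cut_by_code_point and cut_by_grapheme are hypothetical, and the sample string (an "a" followed by "e" plus a combining acute accent) is my own choice. With a budget of 3 bytes, mb_strcut() is expected to keep the bare "e" and drop only its accent, while grapheme_extract() should drop the whole accented letter; I have not included the expected output for the intl call, so run it to check.

```php
<?php
// Hypothetical wrapper: cut at a code point boundary (needs mbstring).
function cut_by_code_point(string $s, int $maxBytes): string
{
    return mb_strcut($s, 0, $maxBytes, 'UTF-8');
}

// Hypothetical wrapper: cut at a grapheme boundary (needs intl).
function cut_by_grapheme(string $s, int $maxBytes): string
{
    $result = grapheme_extract($s, $maxBytes, GRAPHEME_EXTR_MAXBYTES);
    return $result === false ? '' : $result;
}

// "a", then "e" followed by U+0301 COMBINING ACUTE ACCENT:
// 4 bytes, 3 code points, 2 graphemes.
$s = "ae\u{0301}";

if (extension_loaded('mbstring')) {
    echo 'mb_strcut:        ', bin2hex(cut_by_code_point($s, 3)), "\n";
}
if (extension_loaded('intl')) {
    echo 'grapheme_extract: ', bin2hex(cut_by_grapheme($s, 3)), "\n";
}
```

The output is printed as hex so that the missing accent is visible even if the terminal renders the two results alike.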

+1

user213154 Apr 14 '10 at 17:20

~~No. There is no way to do this other than decoding.~~ However, the decoding is pretty mechanical. See the nice table in the Wikipedia article.

Edit: Michael Borgwardt shows us how to do this without decoding the entire string. Clever.

0

John Knoeller Dec 28 '09 at 12:49

YOU · Accepted Answer · 2009-12-28T02:18:01+0000

I think you don't need to reinvent the wheel; you can just use mb_strcut and make sure that you set the encoding to UTF-8.

    mb_internal_encoding('UTF-8');
    echo mb_strcut("\xc2\x80\xc2\x80", 0, 3); // from byte offset 0, cut at most 3 bytes

This returns

    \xc2\x80

because in \xc2\x80\xc2 the last byte starts a character that is cut off, so it is not valid.
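
As a usage sketch for the original question (fitting a string into N bytes), the call can be wrapped so that the encoding is passed explicitly instead of relying on mb_internal_encoding(). The helper name fit_bytes is hypothetical and the sample string is borrowed from the test in another answer; this assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: shorten $s to at most $maxBytes bytes of valid UTF-8.
function fit_bytes(string $s, int $maxBytes): string
{
    // The explicit 'UTF-8' argument makes the call independent of the
    // global mb_internal_encoding() setting.
    return mb_strcut($s, 0, $maxBytes, 'UTF-8');
}

if (function_exists('mb_strcut')) {
    $s = 'Iñtërnâtiônàlizætiøn';
    foreach ([1, 2, 3, 5, 10] as $n) {
        $t = fit_bytes($s, $n);
        // Show the requested budget, the bytes actually used, and the result.
        printf("%2d -> %2d bytes: %s\n", $n, strlen($t), $t);
    }
}
```

The result may be shorter than the budget, because a character that does not fit completely is dropped rather than split.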
