How to get each character from a word with special encoding

Question

I need to get an array with all the characters of a word, but the word contains non-ASCII letters such as á. When I execute the following code:

    $word = 'withá';
    $word_arr = array();
    for ($i = 0; $i < strlen($word); $i++) {
        $word_arr[] = $word[$i];
    }

or

 $word_arr = str_split($word);

I get:

array (6) {[0] => string (1) "w" [1] => string (1) "i" [2] => string (1) "t" [3] => string (1) "h" [4] => string (1) "Ã" [5] => string (1) "¡"}

How can I get each character as follows?

array (5) {[0] => string (1) "w" [1] => string (1) "i" [2] => string (1) "t" [3] => string (1) "h" [4] => string (1) "á"}

+8

php encoding tokenize character-encoding

leticia Nov 21 '12 at 20:42

4 answers

~~I think mb_split will do it for you: http://www.php.net/manual/en/function.mb-split.php~~

If you use special encodings, you probably want to familiarize yourself with how PHP handles multibyte encoding in general ...

EDIT: No, I can't figure out how to make mb_split do it myself, but looking around, SO has a few similar questions that were answered with preg_split. I tested this and it seems to do exactly what you want:

 preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
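
A minimal sketch of why the `u` pattern modifier matters here, assuming the string is valid UTF-8. Without `u`, an empty pattern matches between bytes; with `u`, PCRE treats the subject as UTF-8 and matches between code points. The variable names `$bytes` and `$chars` are only for illustration, and the á is written as explicit bytes so the result does not depend on how the file is saved.

```php
<?php
// "with" followed by U+00E1 (á), spelled out as its two UTF-8 bytes.
$word = "with\xC3\xA1";

// No u modifier: the empty pattern matches at every byte boundary.
$bytes = preg_split('//', $word, -1, PREG_SPLIT_NO_EMPTY);

// u modifier: the subject is handled as UTF-8, so the split is per code point.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($bytes)); // int(6)
var_dump(count($chars)); // int(5)
var_dump($chars[4] === "\xC3\xA1"); // bool(true)
```

If the input is not valid UTF-8, `preg_split` with `u` returns `false`, so real code would probably want to check the return value.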

I would still highly recommend that you read up on multibyte characters in PHP. It's kind of a mess, IMHO.
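
To make the byte versus character distinction concrete, here is a small sketch, assuming the mbstring extension is loaded and the string is UTF-8. It shows that `strlen()` and `$word[$i]` work on bytes, while `mb_strlen()` and `mb_substr()` work on characters.

```php
<?php
// "with" followed by á as its two UTF-8 bytes (0xC3 0xA1).
$word = "with\xC3\xA1";

var_dump(strlen($word));                // int(6): number of bytes
var_dump(mb_strlen($word, 'UTF-8'));    // int(5): number of characters

// Offsets 4 and 5 are the two halves of á, which is what the question's
// loop printed as two separate "characters".
var_dump(bin2hex($word[4] . $word[5])); // string(4) "c3a1"

// mb_substr() returns the whole character.
$last = mb_substr($word, 4, 1, 'UTF-8');
var_dump($last === "\xC3\xA1");         // bool(true)
```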

Here are some good links: http://www.joelonsoftware.com/articles/Unicode.html and http://akrabat.com/php/utf8-php-and-mysql/, and much more can be found ...

+2

Aerik Nov 21 '12 at 20:46

You must use the multibyte functions for all multibyte encodings! I assume mb_split is the counterpart:

http://php.net/manual/en/function.mb-split.php
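
Note that `mb_split()` takes a regular-expression pattern as its first argument, so it is closer to a multibyte `split`/`preg_split` than to `str_split`. On PHP 7.4 or later (an assumption about your environment, as that function did not exist when this was asked), `mb_str_split()` is the direct multibyte counterpart. A minimal sketch, assuming mbstring is loaded and the string is UTF-8:

```php
<?php
// "with" followed by á as explicit UTF-8 bytes.
$word = "with\xC3\xA1";

// Split into chunks of 1 character each, interpreting the input as UTF-8.
$word_arr = mb_str_split($word, 1, 'UTF-8');

var_dump(count($word_arr));             // int(5)
var_dump($word_arr[4] === "\xC3\xA1");  // bool(true)
```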

0

wegus Nov 21 '12 at 20:51

as found at: http://www.php.net/manual/en/function.str-split.php#107658

    function str_split_unicode($str, $l = 0) {
        if ($l > 0) {
            // Split into chunks of $l characters each.
            $ret = array();
            $len = mb_strlen($str, "UTF-8");
            for ($i = 0; $i < $len; $i += $l) {
                $ret[] = mb_substr($str, $i, $l, "UTF-8");
            }
            return $ret;
        }
        // Default: split into single characters.
        return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
    }

    $word = 'withá';
    $word = str_split_unicode($word);
    var_dump($word);

0

Slavenko miljic Nov 21 '12 at 20:52

Tim withers · Accepted Answer · 2012-11-21T20:52:45+0000

Since this is a UTF-8 string, just do

    $word = 'withá';
    $word = utf8_decode($word);
    $word_arr = array();
    for ($i = 0; $i < strlen($word); $i++) {
        $word_arr[] = $word[$i];
    }

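The reason for this is that although the word looks like five characters in your script, the á is stored as a multibyte (two-byte) UTF-8 sequence, so byte-wise functions such as strlen() and str_split() cut it in half (which is why the mb functions work). utf8_decode() converts the string from UTF-8 to the single-byte ISO-8859-1 encoding, where one byte is one character. To get there you can use the mb functions or just call utf8_decode().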
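
As a hedged alternative: `utf8_decode()` is deprecated as of PHP 8.2, and the same conversion can be sketched with `mb_convert_encoding()`, assuming mbstring is loaded. This only works for characters that exist in ISO-8859-1; anything outside that range cannot be represented after the conversion, and the resulting array holds ISO-8859-1 bytes rather than UTF-8 strings.

```php
<?php
// "with" followed by á as explicit UTF-8 bytes.
$word = "with\xC3\xA1";

// Convert UTF-8 to single-byte ISO-8859-1, then split byte by byte.
$latin1   = mb_convert_encoding($word, 'ISO-8859-1', 'UTF-8');
$word_arr = str_split($latin1);

var_dump(count($word_arr));      // int(5)
var_dump(bin2hex($word_arr[4])); // string(2) "e1", the ISO-8859-1 byte for á
```

If the characters need to stay in UTF-8 (for example, to be echoed into a UTF-8 page), splitting the original string with a multibyte-aware function is probably the safer route.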
