str_word_count() function does not handle Arabic correctly

I made the following function to return a certain number of words from text:

    function brief_text($text, $num_words = 50) {
        $words = str_word_count($text, 1);
        $required_words = array_slice($words, 0, $num_words);
        return implode(" ", $required_words);
    }

and it works well with English, but when I try to use it in Arabic, it fails and does not return words as expected. For instance:

    $text_en = "Cairo is the capital of Egypt and Paris is the capital of France";
    echo brief_text($text_en, 10);

prints Cairo is the capital of Egypt and Paris is the, and

    $text_ar = "القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا";
    echo brief_text($text_ar, 10);

prints .

I know that the problem is with str_word_count(), but I don’t know how to fix it.

UPDATE

I already wrote another function that works well with English and Arabic, but I was looking for a solution to the problem caused by str_word_count() when used with Arabic. Anyway, here is my other function:

    function brief_text($string, $number_of_required_words = 50) {
        // Collapse runs of whitespace into single spaces
        $string = trim(preg_replace('/\s+/', ' ', $string));
        $words = explode(" ", $string);
        // Get a specific number of elements from the array
        $required_words = array_slice($words, 0, $number_of_required_words);
        return implode(" ", $required_words);
    }
2 answers

Try using this function to count words:

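    // You can name the function whatever you like
    if (!function_exists('mb_str_word_count')) {
        function mb_str_word_count($string, $format = 0, $charlist = '[]')
        {
            // $charlist is kept for signature compatibility but is not used
            mb_internal_encoding('UTF-8');
            mb_regex_encoding('UTF-8');

            // Split on every run of characters outside the Arabic block (U+0600 to U+06FF)
            $words = mb_split('[^\x{0600}-\x{06FF}]+', $string);
            // Drop empty strings left by leading or trailing separators
            $words = array_values(array_filter($words, 'strlen'));

            switch ($format) {
                case 0:
                    return count($words);
                default: // formats 1 and 2 both return the list of words
                    return $words;
            }
        }
    }

    echo mb_str_word_count("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا") . PHP_EOL;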

Recommendations

  • Use <meta charset="UTF-8"/> in HTML files
  • Always send a Content-type: text/html; charset=utf-8 header when serving pages
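
As a rough sketch of the two recommendations above, a PHP page could declare UTF-8 in both the HTTP header and the markup. The `$title` variable and the page content are only illustrative, and the call to `mb_internal_encoding()` assumes the mbstring extension is installed:

```php
<?php
// Send the charset in the HTTP header; header() must be called before any output.
header('Content-Type: text/html; charset=utf-8');

// Make UTF-8 the default encoding for the mb_* functions as well.
mb_internal_encoding('UTF-8');

// Illustrative content only: escape the text, telling PHP it is UTF-8.
$title = htmlspecialchars("القاهرة", ENT_QUOTES, 'UTF-8');

echo '<!DOCTYPE html><html><head><meta charset="UTF-8"/></head>',
     "<body>$title</body></html>", PHP_EOL;
```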

To accept ASCII characters as well:

    if (!function_exists('mb_str_word_count')) {
        function mb_str_word_count($string, $format = 0, $charlist = '[]')
        {
            // $charlist is kept for signature compatibility but is not used
            $string = trim($string);
            if ($string === '') {
                $words = array();
            } else {
                // Split on runs of anything that is not a letter, digit, or apostrophe
                $words = preg_split('~[^\p{L}\p{N}\']+~u', $string, -1, PREG_SPLIT_NO_EMPTY);
            }

            switch ($format) {
                case 0:
                    return count($words);
                default: // formats 1 and 2 both return the list of words
                    return $words;
            }
        }
    }
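
To tie this back to the question, here is a minimal sketch of how the original brief_text() could be built on the same Unicode-aware split. The name `brief_text_unicode` is hypothetical, and adding `\p{M}` to the pattern is my own assumption, so that Arabic words written with diacritics (which are combining marks, not letters) are not cut apart:

```php
<?php
// Hypothetical helper (not from the original posts): return the first
// $num_words words of a UTF-8 string, whatever script it is written in.
function brief_text_unicode($text, $num_words = 50)
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    // \p{L} = letters, \p{M} = combining marks, \p{N} = digits.
    $words = preg_split('~[^\p{L}\p{M}\p{N}\']+~u', (string) $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return ''; // preg_split failed, e.g. the input was not valid UTF-8
    }
    return implode(' ', array_slice($words, 0, $num_words));
}

echo brief_text_unicode("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا", 4), PHP_EOL;
echo brief_text_unicode("Cairo is the capital of Egypt", 3), PHP_EOL;
```

The first call should print القاهرة هى عاصمة مصر and the second Cairo is the. Note that, unlike the explode()-based version in the question, this drops punctuation between words.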