Remove Unicode Zero Width Space in PHP

I have Burmese text in UTF-8, and I am processing it with PHP. Somewhere along the way a number of zero width spaces (ZWSPs) crept in, and I would like to remove them. I tried two different ways of deleting the characters, and neither works.

At first I tried to use:

$newBody = str_replace("&#8203;", "", $newBody); 

to find the HTML entity and delete it, as it appears in the Web Inspector. The spaces are not removed. I also tried it as:

  $newBody = str_replace("&#8203", "", $newBody); 

and got the same result.

The second method I tried came from the question "Remove the ZERO WIDTH NON-JOINER character from a string in PHP",

which is as follows:

  $newBody = str_replace("\xE2\x80\x8C", "", $newBody); 

but that did not work either. The ZWSPs were not removed.

An example word in the text ($newBody) looks like this: ယူကရိန်း (with an invisible ZWSP after ယူ and after က), and I want it to look like this: ယူကရိန်း

Any ideas? Would preg_replace work better?

So I tried

 $newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody); 

and it seems to work, but now there is another problem.

 <a class="defined" title="Ukraine">ယူ&#8203;က&#8203;ရိန်း</a> 

converted to

 <a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a> 

I do not want it to add all the extra stuff. Any idea why this is happening? Short of somehow targeting only the text, is there another way to keep this extra markup from being added when I use preg_replace? By the way, I am using Google Chrome on a Mac. It seems to behave a little differently in Firefox ...
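One way to narrow this down, as a rough sketch: run the same preg_replace on a sample string and inspect what PHP itself produces. The sample anchor below is an assumption, rebuilt from the example above with raw U+200B bytes written as escapes, and the variable names are only illustrative. The pattern matches nothing but the three ZWSP bytes, so if the result has no style attribute, the extra markup is probably being added after PHP has finished, for example by something running in the browser.

```php
<?php
// Hypothetical sample input: the anchor from the question, with raw
// U+200B bytes (E2 80 8B) between the syllables.
$before = "<a class=\"defined\" title=\"Ukraine\">ယူ\xE2\x80\x8Bက\xE2\x80\x8Bရိန်း</a>";

// Same call as in the question: a byte-level match on the ZWSP sequence.
$after = preg_replace("/\xE2\x80\x8B/", "", $before);

// Only the two 3-byte ZWSP sequences should have disappeared.
$lengthDiff = strlen($before) - strlen($after);

// Check whether PHP's output contains a style attribute at all.
$hasStyle = strpos($after, 'style=') !== false;

var_dump($lengthDiff, $hasStyle);
```

Comparing the page source (or the raw HTTP response) with what the Web Inspector shows should tell you the same thing, since the inspector displays the DOM after scripts and extensions have modified it.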

php unicode str-replace
1 answer

This:

 $newBody = str_replace("&#8203;", "", $newBody); 

assumes that the text is HTML-encoded, so that the ZWSP appears as the literal entity &#8203;. This:

 $newBody = str_replace("\xE2\x80\x8C", "", $newBody); 

should work if the offending characters are not HTML-encoded, but it matches the wrong character: 0xE2808C is U+200C (ZERO WIDTH NON-JOINER). To match the same character as &#8203;, which is U+200B (ZERO WIDTH SPACE), you need 0xE2808B:

 $newBody = str_replace("\xE2\x80\x8B", "", $newBody); 
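If you are not sure which form the ZWSP takes in your data, a small helper can cover both cases. This is only a sketch, assuming PHP 7 or later; `stripZwsp` is a made-up name, not a built-in. It removes the raw UTF-8 bytes as well as the decimal and hexadecimal numeric entities, and leaves U+200C alone. The two `bin2hex` calls show why the byte sequence in the question did not match.

```php
<?php
/**
 * Remove zero width spaces (U+200B) from a UTF-8 string.
 * Hypothetical helper: handles the raw character and its numeric entities.
 */
function stripZwsp(string $text): string
{
    // Raw UTF-8 encoding of U+200B is the byte sequence E2 80 8B.
    $text = str_replace("\xE2\x80\x8B", "", $text);

    // HTML numeric character references for the same code point:
    // decimal (&#8203;) and hexadecimal (&#x200B;), case-insensitive.
    // Fall back to the input if preg_replace reports an error (null).
    return preg_replace('/&#(?:0*8203|x0*200b);/i', '', $text) ?? $text;
}

// The two code points differ only in the last byte.
echo bin2hex("\u{200B}"), "\n"; // e2808b (ZERO WIDTH SPACE)
echo bin2hex("\u{200C}"), "\n"; // e2808c (ZERO WIDTH NON-JOINER)
```

Usage would be `$newBody = stripZwsp($newBody);`. The entity pattern requires the trailing semicolon so that it cannot clip the front of a longer number such as `&#82035;`.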