UTF-8 and HTML entities

Question

I am trying to extract text from a Word .doc file using PHP. Everything looks fine, except that I get something like

&#x0421;&#x0423;&#x0414;&#x041e;&#x0412;&#x0410; &#x0411;&#x0423;&#x0425;&#x0413;&#x0410;&#x041b;&#x0422;&#x0415;&#x0420;&#x0406;&#x042f;

instead of the Russian text. I tried using html_entity_decode and utf8_encode, but they did not help. Is there a simple solution?

php utf-8

Ximik Jun 04 '11 at 15:31

1 answer

Gumbo · Accepted Answer · 2011-06-04T15:33:03+0000

html_entity_decode should work if you pass the appropriate parameters, in particular an explicit UTF-8 character set:

html_entity_decode($str, ENT_QUOTES, 'UTF-8')

This converts the character references to UTF-8. Prior to PHP 5.4.0, the default value for the character set parameter was ISO-8859-1. ISO-8859-1 does not contain Cyrillic characters, so with that default the references cannot be converted.
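
As a minimal sketch, here is that call applied to the first word of the string from the question. The variable names $str and $decoded are illustrative and not part of the original post, and the script assumes its own source file is saved as UTF-8:

```php
<?php
// First word of the extracted text, still encoded as numeric character references.
$str = '&#x0421;&#x0423;&#x0414;&#x041e;&#x0412;&#x0410;';

// Pass the character set explicitly so the result does not depend on
// the default of the PHP version in use.
$decoded = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n"; // prints СУДОВА when the output is treated as UTF-8
```

The decoded string is UTF-8, so a page or terminal that displays it needs to be set to UTF-8 as well, otherwise the text may still look garbled.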
