UTF-8 and HTML entities

I am trying to extract text from a Word .doc file using PHP. Everything works, except that I get something like

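&#1057;&#1059;&#1044;&#1054;&#1042;&#1040; &#1041;&#1059;&#1061;&#1043;&#1040;&#1051;&#1058;&#1045;&#1056;&#1030;&#1071;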

instead of the Russian text. I tried using html_entity_decode and utf8_encode, but they did not help. Is there a simple solution?

1 answer

html_entity_decode should work if you pass the character set explicitly (this is needed on PHP versions before 5.4.0, where UTF-8 is not the default):

html_entity_decode($str, ENT_QUOTES, 'UTF-8')

This converts the character references to UTF-8. Prior to PHP 5.4.0, the default value of the character set parameter was ISO-8859-1. With that default, the Cyrillic character references cannot be converted, because ISO-8859-1 does not contain those characters.
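
As a minimal sketch of the difference, assuming the extracted text contains decimal numeric character references (the sample value of $str is made up for illustration and is not taken from the asker's file):

```php
<?php
// Hypothetical sample input: numeric character references for three Cyrillic letters.
$str = '&#1057;&#1059;&#1044;';

// Name the target character set explicitly, so the result does not
// depend on the default of the PHP version in use.
$decoded = html_entity_decode($str, ENT_QUOTES, 'UTF-8');

// With ISO-8859-1 as the target, these references have no representation,
// so html_entity_decode is expected to leave them undecoded.
$latin1 = html_entity_decode($str, ENT_QUOTES, 'ISO-8859-1');

echo $decoded, "\n"; // the UTF-8 encoded Cyrillic text
echo $latin1, "\n";
```

Passing 'UTF-8' explicitly is harmless on newer PHP versions, so the same call can be used regardless of which version the script runs on.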
