Convert Word document to usable HTML in PHP

I have a set of Word documents that I want to publish using a PHP tool I wrote. I copy and paste the Word documents into a text box and then save them to MySQL from PHP. The problem I have comes from all the non-standard characters that Word documents contain, for example curly quotes and ellipses ("..."). What I'm doing at the moment is finding and replacing such things manually (as well as foreign characters such as e-acute) with either plain text or HTML entities (&eacute;, etc.). Is there a function in PHP that I can call that will take the Word output and convert everything that should be an entity into an entity, and convert other characters that do not display properly in Firefox into displayable characters?

Thanks!

+5
php ms-word
Oct 13 '08 at 19:20
5 answers

The best solution would be to make sure your database is configured to store UTF-8. The additional characters available in that extended set should cover all the "non-standard" characters you are talking about.

Otherwise, if you really have to convert these characters to HTML entities, use htmlentities().
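A minimal sketch of that htmlentities() call, assuming the text has already reached PHP as valid UTF-8 (the question does not say which encoding the form submits, so treat that as an assumption). The sample string and variable names are made up for illustration.

```php
<?php
// Hypothetical sample: curly quotes, an ellipsis and e-acute, written as UTF-8.
$text = "He said \u{201C}wait\u{2026}\u{201D} caf\u{00E9}";

// Naming the charset explicitly avoids depending on the default_charset setting.
// ENT_HTML401 selects the HTML 4.01 named-entity table.
$html = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $html, "\n";
```

If the input is not valid UTF-8, htmlentities() returns an empty string unless ENT_IGNORE or ENT_SUBSTITUTE is set, so an empty result usually points to an encoding mismatch rather than a bug in the call.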

+3
Oct 13 '08 at 19:27

This has served me well in the past:

$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
+5
Oct. 13 '08 at 19:34

I think all these answers miss one vital point. Windows itself uses the Windows Latin-1 character set (Windows-1252), so if you paste special characters (for example, curly quotes) into a form on a Windows machine and they are sent to a Unix box (or anything non-Microsoft), whether into a database or anywhere else, some of the characters do not correspond to anything the Unix system understands, hence the garbled characters. This means that even if you have a UTF-8 database and you use htmlentities, some nasty things will still get through, because they are character codes the OS does not recognize: they are not even part of UTF-8, they are Microsoft-only inventions. I would love to know of a slick solution. What I do is keep a hand-built list of the Microsoft-only character codes I have come across, paired with a (also hand-built) list of UTF-8 replacements, run str_replace over all of them, and THEN you can do whatever you want with the text: iconv, htmlentities, saving straight to the UTF-8 database; it doesn't matter anymore.

My grasp on this is all a little shaky. See http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for the excellent explanation that I have mangled into the short form above. If anyone has a better solution (surely there is one!) for the problem that article describes, I would love to hear it!
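One possibly smoother route, offered as a sketch rather than a definitive fix: if the pasted bytes really are Windows-1252, iconv() can transcode them to UTF-8 in one step, after which htmlentities() or a UTF-8 database can take over. The helper name cp1252_to_utf8 is hypothetical, and the rule "anything that is not valid UTF-8 must be Windows-1252" is an assumption that will not hold for every input.

```php
<?php
// Hypothetical helper (the name is not from this thread): normalise pasted text to UTF-8.
function cp1252_to_utf8(string $raw): string
{
    // preg_match() with the /u modifier returns 1 only when the subject is valid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return $raw; // already UTF-8 (or plain ASCII), leave it untouched
    }

    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    $converted = iconv('Windows-1252', 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException('Input is not valid Windows-1252');
    }
    return $converted;
}

// Curly quotes, an ellipsis and e-acute as raw Windows-1252 bytes.
$raw  = "\x93Hello\x94 \x85 caf\xe9";
$utf8 = cp1252_to_utf8($raw);
$html = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

echo $html, "\n";
```

The validity check matters because running the conversion on text that is already UTF-8 would double-encode it. A few byte values (such as 0x81) are unassigned in Windows-1252, which is why the iconv() failure case is handled explicitly.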

+1
May 18 '10 at 9:32 a.m.

htmlspecialchars() takes you a long way, but be careful, because Word documents are messy.

0
Oct. 13 '08 at 19:28

Here is the solution I put together for the problem of the non-portable Windows character set. It replaces the offending Windows-1252 characters with their equivalent HTML entities.

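 // The wrapper function name is illustrative; the original snippet only showed the body.
 function convert_windows_chars($input) {
     // reference from http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
     $translation = array(
         "\x82" => "&sbquo;",
         "\x83" => "&fnof;",
         "\x84" => "&bdquo;",
         "\x85" => "&hellip;",
         "\x86" => "&dagger;",
         "\x87" => "&Dagger;",
         "\x88" => "&circ;",
         "\x89" => "&permil;",
         "\x8a" => "&Scaron;",
         "\x8b" => "&lsaquo;",
         "\x8c" => "&OElig;",
         "\x91" => "&lsquo;",
         "\x92" => "&rsquo;",
         "\x93" => "&ldquo;",
         "\x94" => "&rdquo;",
         "\x95" => "&bull;",
         "\x96" => "&ndash;",
         "\x97" => "&mdash;",
         "\x98" => "&tilde;",
         "\x99" => "&trade;",
         "\x9a" => "&scaron;",
         "\x9b" => "&rsaquo;",
         "\x9c" => "&oelig;",
         "\x9f" => "&Yuml;",
     );
     // Replace each Windows-1252 byte with its entity.
     return str_replace(array_keys($translation), array_values($translation), $input);
 }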

It works for me™.
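A short sketch of how a byte-level table like this behaves, using an abbreviated, hypothetical subset of the mappings and strtr() in place of str_replace(). It assumes the input is still raw Windows-1252; the second half shows why the same table should not be run over text that is already UTF-8, where bytes in the 0x80 to 0x9F range occur inside multi-byte sequences.

```php
<?php
// Abbreviated table for illustration only (five of the Windows-1252 bytes).
$map = [
    "\x85" => '&hellip;',
    "\x91" => '&lsquo;',
    "\x92" => '&rsquo;',
    "\x93" => '&ldquo;',
    "\x94" => '&rdquo;',
];

// Raw Windows-1252 input: each special character is a single byte.
$raw   = "\x93It\x92s done\x85\x94";
$clean = strtr($raw, $map);
echo $clean, "\n";

// The same table applied to UTF-8 text: an em dash (U+2014) is the bytes
// E2 80 94, so the trailing 0x94 is wrongly rewritten as a closing quote.
$utf8    = "a \u{2014} b";
$damaged = strtr($utf8, $map);
var_dump($damaged === $utf8);
```

In other words, the replacement has to happen before any conversion to UTF-8, or not at all if the form already submits UTF-8.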

0
Jul 03 2018-11-11T00:


