From MS Word or Libre Office to clean HTML

People who post content to my website use Word, so I get a lot of Word documents for conversion to HTML. I want to keep only the basic formatting - headings, lists and emphasis - without images.

When I convert them with LibreOffice's "Save as HTML", the resulting files are huge. For example, a 112K .doc file becomes 450K of HTML, most of which is useless FONT and SPAN tags (for some reason, each punctuation mark is enclosed in its own span!).

I tried this script, which is based on tidy and sed: http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708 . It reduced the size to 150K, but there are still many useless SPANs.

I tried copying and pasting into KompoZer, an HTML editor, and then saving as HTML, but it converted all my non-Latin (Hebrew) letters into entities, such as "&#1456;", which increased the size to 750K!
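If the entity bloat is the only problem with an editor's output, it can be undone after the fact. Below is a minimal Python sketch, assuming the file uses numeric character references (for example `&#1456;` for a Hebrew point) and will be saved as UTF-8. The function name `decode_numeric_entities` is made up for this illustration.

```python
import re

# Matches decimal (&#1456;) and hexadecimal (&#x5B0;) character references.
NUMERIC_REF = re.compile(r"&#([0-9]+|[xX][0-9a-fA-F]+);")

def decode_numeric_entities(html_text: str) -> str:
    """Replace non-ASCII numeric character references with the characters themselves."""
    def repl(match):
        body = match.group(1)
        code = int(body[1:], 16) if body[0] in "xX" else int(body)
        # Keep ASCII references (such as &#60; for "<") so escaped markup stays escaped,
        # and skip values that are not valid, encodable code points.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return NUMERIC_REF.sub(repl, html_text)

sample = "<p>&#1513;&#1456;&#1500;&#1493;&#1501; &#60;b&#62;</p>"
cleaned = decode_numeric_entities(sample)
```

Each Hebrew letter shrinks from seven characters of markup to a single character (two bytes in UTF-8), as long as the page declares UTF-8 as its encoding.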

I tried docvert: https://github.com/holloway/docvert/issues/6 , but found out that it requires a Python library, which requires other libraries, and so on. It seems to have an endless dependency chain...

Is there an easy way to create clean HTML from Office documents?

6 answers

I understand that this question is old, but the other answers never answered the question. If you don't mind writing PHP code, CubicleSoft Ultimate Web Scraper Toolkit has the TagFilter class:

https://github.com/cubiclesoft/ultimate-web-scraper/blob/master/support/tag_filter.php

You pass in two things: an array of options and the data to parse as HTML.

To clean up broken HTML, the default options from TagFilter::GetHTMLOptions() are a good starting point. Those options form the basis of valid HTML content and, without doing anything else, will clean up any input into something that another tool, such as Simple HTML DOM, can correctly parse into a DOM model.

However, the other way to use the class is to modify the default options and add a "callback" option to the options array. For every tag in the HTML, the specified callback function will be called. The callback is expected to return what to do with each tag, which is where the real power of TagFilter comes in. You can keep any tag and some or all of its attributes (or modify them), get rid of the tag but keep the inner content, keep the tag but get rid of the content, modify the content (for closing tags), or get rid of both the tag and its inner content. This approach gives you very fine-grained control over even the most convoluted HTML and processes the input in a single pass. See the test suite in the same repository for example usage of TagFilter.

The only downside is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects things based on a DOM-like model. But that is only a downside if the document being processed has things like "id" and "class" attributes. Most Word/LibreOffice HTML does not, which means it is a giant blob of unrecognizable HTML as far as DOM processing tools go.
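For anyone who would rather not write PHP, the same per-tag idea can be sketched with Python's standard `html.parser` module. This is not TagFilter and does not use its API; it only illustrates the technique of deciding, tag by tag, whether to keep the tag (without attributes), unwrap it, or drop it together with its content. The class name `TagWhitelist`, the function `clean_html`, and the two tag sets are assumptions chosen for this example.

```python
from html import escape
from html.parser import HTMLParser

# Tags to keep (attributes are discarded); everything else is unwrapped.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
        "em", "strong", "b", "i"}
# Tags whose content is thrown away as well.
DROP_WITH_CONTENT = {"style", "script", "title"}

class TagWhitelist(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        # State carried between callbacks: are we inside a dropped element?
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"<{tag}>")  # keep the tag, drop its attributes

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; re-escape markup characters.
            self.out.append(escape(data, quote=False))

def clean_html(source: str) -> str:
    parser = TagWhitelist()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)

messy = ('<style>p{margin:0}</style><h1 class="western"><font face="Arial">Title</font></h1>'
         '<p><span lang="he-IL">Hello</span><span>,</span> <em>world</em><span>!</span></p>')
result = clean_html(messy)
```

The `skip_depth` counter is a small instance of the drawback mentioned above: the handlers have to remember where they are between calls.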

In your situation, you may need to work in passes: first convert the main parts of your document, then go back and clean up any leftover tags. If you do not mind this approach, consider this solution...

  • After saving your document as a web page, open that web page in Notepad++.
  • Then open the Replace dialog for this document.
  • In the "Find what" box, enter <[^>]+>
  • Under "Search Mode" in the same dialog, select "Regular expression".

From this point, all you need to do is click "Find Next" until you reach a tag you want to remove, and click "Replace" for each such tag. Make sure the "Replace with" field is blank.
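If clicking through every match gets tedious, the same regular-expression idea can be scripted. Here is a rough Python sketch that narrows the pattern `<[^>]+>` to specific tag names, assuming SPAN and FONT are the tags you want gone. The names `UNWANTED` and `strip_unwanted_tags` are made up for this illustration.

```python
import re

# The general tag pattern <[^>]+>, restricted to opening and closing
# SPAN and FONT tags (case-insensitive, with or without attributes).
UNWANTED = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)

def strip_unwanted_tags(html_text: str) -> str:
    """Remove the matched tags but keep the text between them."""
    return UNWANTED.sub("", html_text)

sample = '<P><FONT SIZE=3><SPAN LANG="en-US">Hello</SPAN><SPAN>,</SPAN> world</FONT></P>'
stripped = strip_unwanted_tags(sample)
```

Like the Notepad++ search, this will misfire on a tag whose attribute value contains a ">" character, so check the result before publishing it.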

I don't know if there is a more convenient way, but this method is 100% free and makes it easy to strip HTML tags using Notepad++.

Regarding the conversion of inline styles to external CSS (which I recommend as a second process after replacing unnecessary tags), try this app ... http://inlinecssextractor.com/home.html

Good luck.

I have found these two cleaners to be effective. First, I ran the Word filtered HTML through

http://textism.com/wordcleaner/

Then I used some regular expressions to convert some bulleted paragraph elements to list items (li). Then I ran the result through

http://infohound.net/tidy/

to wrap the list items in unordered lists (ul) and fix other errors. I was very pleased with the result, which went from 1.5M to 225K.
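The regular-expression step might look roughly like the Python sketch below. It assumes each bulleted paragraph starts with a literal bullet character ("•", "·" or `&middot;`) right after the opening `<p>` tag, which may not match every Word export. It also wraps the list items in `<ul>` itself, as a stand-in for what Tidy did in my case. The names `BULLET_P`, `LI_RUN` and `paragraphs_to_list` are made up for this illustration.

```python
import re

# A paragraph whose text starts with a bullet character.
BULLET_P = re.compile(r"<p\b[^>]*>\s*(?:&middot;|[•·])\s*(.*?)</p>",
                      re.IGNORECASE | re.DOTALL)
# A run of adjacent <li> elements separated only by whitespace.
LI_RUN = re.compile(r"<li>.*?</li>(?:\s*<li>.*?</li>)*", re.DOTALL)

def paragraphs_to_list(html_text: str) -> str:
    # Step 1: turn each bulleted paragraph into a list item.
    items = BULLET_P.sub(r"<li>\1</li>", html_text)
    # Step 2: wrap every run of list items in an unordered list.
    return LI_RUN.sub(lambda m: "<ul>" + m.group(0) + "</ul>", items)

sample = ('<p>Intro</p>\n'
          '<p class="MsoListParagraph">· First</p>\n'
          '<p class="MsoListParagraph">· Second</p>\n'
          '<p>Outro</p>')
converted = paragraphs_to_list(sample)
```

Paragraphs without a leading bullet are left untouched, so the sketch can be run over the whole file.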

I used http://word2cleanhtml.com/ until I realized that MS Word itself can save a document as HTML.

With this option, the .docx file becomes an .html file, and the result is the best HTML version of a Word document that I have seen. It is definitely better than all these online tools.

Here is a set of PowerShell scripts that will clean up Word filtered HTML and correctly tag superscripts and subscripts in about 95% of cases. (No, you can't do much better; Word is made for print.)

https://github.com/suzumakes/replaceit

The ReadMe has instructions. If you come across any additional characters that need to be caught, or come up with any fixes or improvements, I would be glad to see your pull request.

ophir.php does a pretty nice job of creating a clean HTML file from .odt files. To run it, you need a PHP hosting environment.
