Convert docx to custom XML

I am trying to convert docx files to a custom XML format that I designed. My users want their data in this XML format to make it easier to query content in their web application, and they want docx files as the input.

I looked for a Java converter API, but none matched my requirements. I looked at docx4j, but as far as I can tell it only converts to HTML and PDF. I am hoping there is a converter API into which I can plug an intermediate translator (XSLT, say), so that the output is my custom XML populated with the data from the docx.

Is there an existing tool for this? If not, any suggestions on the approach I should take to write my own converter? For example, starting from OpenXML, should I convert to XSL-FO first and then to my custom XML?

I would like to hear from the community.

Thank you very much.

3 answers

docx4j can be used to convert OpenXML to arbitrary XML through XSLT.

Assuming a javax.xml.transform.Templates object named xslt and a javax.xml.transform.stream.StreamResult object named result, you would do something like this:

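    import org.docx4j.XmlUtils;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
    import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

    // Load the docx and get its main document part
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
    MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
    // DOM document to input to transform
    org.w3c.dom.Document doc = XmlUtils.marshaltoW3CDomDocument(mdp.getJaxbElement());
    XmlUtils.transform(doc, xslt, null, result);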
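The xslt and result objects assumed above can be built with the JDK alone. What follows is only a sketch: the class name CustomXmlStylesheet, the stylesheet, and the target element names article and para are all made up for illustration, and a real stylesheet would follow your own target schema.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CustomXmlStylesheet {
    // Namespace used by WordprocessingML elements such as w:p and w:t.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical stylesheet: every paragraph (w:p) becomes a <para>
    // inside a made-up <article> root; only text runs (w:t) are copied.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='" + W_NS + "' exclude-result-prefixes='w'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<article><xsl:apply-templates select='//w:p'/></article>"
        + "</xsl:template>"
        + "<xsl:template match='w:p'>"
        + "<para><xsl:for-each select='.//w:t'>"
        + "<xsl:value-of select='.'/></xsl:for-each></para>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Compile the stylesheet once; a Templates object can be reused.
    static Templates compile(String stylesheet) {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Stylesheet did not compile", e);
        }
    }

    // Apply the compiled stylesheet to an XML string, collecting the
    // output through a StreamResult backed by a StringWriter.
    static String apply(Templates xslt, String xml) {
        StringWriter out = new StringWriter();
        StreamResult result = new StreamResult(out);
        try {
            xslt.newTransformer()
                .transform(new StreamSource(new StringReader(xml)), result);
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transform failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<w:document xmlns:w='" + W_NS + "'><w:body>"
            + "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.println(apply(compile(XSLT), sample));
    }
}
```

The Templates returned by compile and a StreamResult like the one in apply are the kinds of objects the docx4j snippet expects as xslt and result.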

However, if all you want to do is convert to XML, then docx4j (and Apache POI, for that matter) is more than you need. You can simply use OpenXML4J directly.

Whether converting via XSLT is the best approach depends on whether your target XML is document oriented or data oriented.

If it is document oriented, XSLT is a good approach.

If it is data oriented, you may want to consider content control data binding. (There was a different approach called customXml, but the i4i patent farce may make that approach unsuitable if you rely on Word for editing.)


As far as I know, docx files are just XML files in a ZIP container. To convert them to an XML format of your own design, you need to unzip the file (into a new folder or in memory), load the relevant XML document, and apply an XSLT to it. I don't think you mention anything about your development environment other than the docx4j tag. Are you developing in Java? If so, I'm afraid I don't know which ZIP-handling and XML transformation libraries to point you to (although I know they exist, and it should only take 5 minutes to find them).
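For Java, the JDK's own java.util.zip and javax.xml.transform packages are enough for the unzip-and-transform approach described above. The sketch below assumes the main document part is stored as word/document.xml. That is the usual location, but strictly speaking the package relationships decide it. The class name DocxUnzipTransform and the command-line arguments are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class DocxUnzipTransform {
    // Assumption: the conventional name of the main document part.
    static final String MAIN_PART = "word/document.xml";

    // Scan the ZIP container in memory and return the main part's bytes.
    static byte[] readMainPart(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (MAIN_PART.equals(e.getName())) {
                    return zip.readAllBytes(); // reads the current entry only
                }
            }
            throw new IllegalArgumentException("No " + MAIN_PART + " entry found");
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Apply an XSLT stylesheet (given as text) to the extracted XML.
    static String transform(byte[] xml, String stylesheet) {
        StringWriter out = new StringWriter();
        try {
            TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)))
                .transform(new StreamSource(new ByteArrayInputStream(xml)),
                           new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transform failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DocxUnzipTransform <file.docx> <stylesheet.xsl>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            String xslt = Files.readString(Path.of(args[1]));
            System.out.println(transform(readMainPart(in), xslt));
        }
    }
}
```

Nothing is written to disk here. The entry is read straight from the stream and handed to the transformer, which corresponds to the "in memory" variant.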

To inspect the XML files inside a docx, simply change the file extension from ".docx" to ".zip" and open the ZIP archive in your favorite utility.
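The same inspection can be done from code without renaming anything. This is a minimal sketch that lists the entry names in the container; the class name DocxPartLister is made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxPartLister {
    // Return the names of all entries in the ZIP container, in stored order.
    static List<String> listParts(InputStream docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DocxPartLister <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            listParts(in).forEach(System.out::println);
        }
    }
}
```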


I have had the most success saving the docx as HTML directly from Word. The HTML is not XHTML, so you will need to handle it carefully. Otherwise, it works well enough if you must use a Word-based workflow. You can write a VBA script that has Word open the file and save it as HTML.
