From HTML <figure> and <figcaption> to Microsoft Word

I have HTML with the tags figure, img and figcaption, and I would like to convert it to a Microsoft Word document.

The image referenced by img must be inserted into the Word document, and the figcaption must become its caption (keeping the figure number as well).

I tried opening the HTML in Word 2013, but the figcaption is not converted to a caption of the picture; it is just plain text below the image.

Is there a minimal working example for this? I looked at https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Word_XML_Format_example , but it is too much to take in when all I need is a Hello-world-sized sample.

 figure .image {
   width: 100%;
 }
 figure {
   text-align: center;
   display: table;
   max-width: 30%; /* demo; set some amount (px or %) if you can */
   margin: 10px auto; /* not needed unless you want centered */
 }
 article {
   counter-reset: figures;
 }
 figure {
   counter-increment: figures;
 }
 figcaption:before {
   content: "Fig. " counter(figures) " - "; /* For I18n support; use data-counter-string. */
 }

 <figure>
   <p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/c/ca/Matterhorn002.jpg"></p>
   <figcaption>Il monte Cervino.</figcaption>
 </figure>
 <figure>
   <p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/2/26/Banner_clouds.jpg"></p>
   <figcaption>La nuvola che spesso è vicino alla vetta.</figcaption>
 </figure>

I tried pandoc on Windows:

 pandoc -f html -t docx -o hello.docx hello.html 

but no luck: the captions in the resulting document have no "Fig. 1" or "Fig. 2" prefix.

My pandoc:

 c:\temp>.\pandoc.exe -v
 pandoc.exe 1.19.2.1
 Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4
 Default user data directory: C:\Users\ale\AppData\Roaming\pandoc
 Copyright (C) 2006-2016 John MacFarlane
 Web: http://pandoc.org
 This is free software; see the source for copying conditions.
 There is no warranty, not even for merchantability or fitness
 for a particular purpose.

Edit 1

Using some C# for this would also be fine. Perhaps I can convert the HTML to the Word XML format with a C# program.

4 answers

This may be more roundabout than you would like, but if you save the file as a PDF (I used Adobe to create a PDF from an HTML file containing figure/figcaption, though you could do it programmatically) and then export that PDF to Word, you end up with a Word document. The intermediate step may be overkill, but it works!

Hope this helps (maybe a PDF would do?)


EDIT 1: I just found a jQuery plugin from Mark Windsoll that converts HTML to Word. I made a CodePen that includes a figure/figcaption in it. When you click the button, it prints as Word. (I suppose you can save it too, but his original CodePen didn't actually do anything when you clicked the link that says export to doc.. sigh..)

 jQuery(document).ready(function print($) {
   // Export the #page-content element when the button is clicked.
   $(".word-export").click(function(event) {
     $("#page-content").wordExport();
   });
 });

 img { width: 300px; height: auto; }
 figcaption { width: 350px; text-align: center; }
 h1 { margin-top: 10px; }
 h1, h2 { margin-left: 35px; }
 p { width: 95%; padding-top: 20px; margin: 0px auto; }
 button { margin: 15px 30px; padding: 5px; }

 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
 <script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/FileSaver.js"></script>
 <script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/jquery.wordexport.js"></script>
 <link href="https://www.jqueryscript.net/css/jquerysctipttop.css" rel="stylesheet"/>
 <h1>jQuery Word Export Plugin Demo</h1>
 <div id="page-content">
   <h2>Lovely Trees</h2>
   <!-- figcaption belongs inside its figure -->
   <figure>
     <img src="http://www.rachelgallen.com/images/autumntrees.jpg">
     <figcaption>Autumn Trees</figcaption>
   </figure>
   <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec vehicula bibendum lacinia. Pellentesque placerat interdum nisl non semper. Integer ornare, nunc non varius mattis, nulla neque venenatis nibh, vitae cursus risus quam ut nulla. Aliquam erat volutpat. Aliquam erat volutpat. </p>
   <p>And some more text here, but that quite enough lorem ipsum rubbish!</p>
 </div>
 <button class="word-export" onclick="print();"> Export as .doc </button>
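If the exported document should carry the "Fig. N - " prefix that the question's CSS counter generates, one option is to write the prefix into the markup before exporting. This is only a sketch: I have not checked whether the plugin keeps CSS-generated content, and `numberFigcaptions` is a hypothetical helper of mine, not part of the plugin.

```javascript
// Hypothetical helper (not part of the plugin): writes "Fig. N - " into each
// <figcaption> so the number is real text rather than CSS-generated content.
function numberFigcaptions(html, label = "Fig.") {
  let n = 0;
  // Lazy [\s\S]*? stops at the nearest closing tag, even across line breaks.
  return html.replace(
    /<figcaption([^>]*)>([\s\S]*?)<\/figcaption>/gi,
    (_match, attrs, inner) => {
      n += 1;
      return `<figcaption${attrs}>${label} ${n} - ${inner}</figcaption>`;
    }
  );
}
```

In the browser it could be applied to the container's innerHTML (for example the #page-content element) just before calling wordExport(). It is a plain string rewrite, so it assumes figcaption elements are not nested inside each other.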

EDIT 2: To convert HTML to Word using C#, you can use GemBox, which is free as long as you do not buy the professional version (you can also use it for free for a while to evaluate it).

C# code

 using GemBox.Document;

 ComponentInfo.SetLicense("FREE-LIMITED-KEY"); // free, size-limited mode
 // Convert HTML to Word (DOCX) document.
 DocumentModel.Load("Document.html").Save("Document.docx");

Rachel


I have never used pandoc; I suspect it does not currently support many advanced CSS3 features.

1. Using Aspose.Words

I copied your CSS and HTML into an HTML file called figure.htm and converted it with Aspose.Words; the result is just what you hoped for.


I used the following C# code:

 using Aspose.Words;

 Document doc = new Document();
 DocumentBuilder builder = new DocumentBuilder(doc);

 // Read the whole HTML file and insert it at the builder's current position.
 using (System.IO.StreamReader sr = new System.IO.StreamReader("./figure.htm"))
 {
     string html = sr.ReadToEnd();
     builder.InsertHtml(html);
 }

 doc.Save("d:\\DocumentBuilder.InsertTableFromHtml Out.doc");

My version of Aspose.Words is 16.7.0.0.

2. Preprocessing the figcaption tags

There is another way that lets you keep using pandoc. You can preprocess the HTML file to fix its format before converting it with pandoc. The underlying problem in your question is that pandoc cannot handle many advanced CSS3 features (here, the counter that generates the "Fig. N - " prefix), so if you write the prefix into the HTML yourself, this works well too.

Here is some test code that uses System.Text.RegularExpressions. After running it, figure1.htm is a new HTML file in which the inner HTML of every figcaption has been replaced with the numbered form.

 using System.IO;
 using System.Text;
 using System.Text.RegularExpressions;

 // Match each <figcaption>...</figcaption> pair directly. The lazy .*? and
 // Singleline keep a match from running past its own closing tag, so captions
 // nested inside <figure> are found even when the HTML is all on one line.
 Regex regex = new Regex("<figcaption>(?<html>.*?)</figcaption>",
     RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

 string html = File.ReadAllText("./figure.htm", Encoding.UTF8);
 int i = 1;

 // Prefix every caption with its running figure number.
 string newHtml = regex.Replace(html, m =>
 {
     string text = m.Groups["html"].Value;
     return $"<figcaption>Fig. {i++} - {text}</figcaption>";
 });

 File.WriteAllText("./figure1.htm", newHtml, Encoding.UTF8);

I hope my answer helps!


Pandoc already downloads the images and inserts them into the docx file with the command you posted.

I just implemented and submitted a pull request that parses the HTML figure and figcaption elements, and it has been merged (it will be in the nightly builds soon, or later in pandoc 2.0). With this code, your example produces a docx file in which the caption text has the paragraph style "Image Caption".
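For comparison, a caption that Word numbers by itself is, as far as I know, a paragraph in the built-in Caption style that contains a SEQ field. The sketch below builds the WordprocessingML for one such paragraph. `captionParagraph` and `escapeXml` are hypothetical helpers, and the sketch assumes the `w:` namespace is declared on the enclosing document element, that a style with the id "Caption" exists in the document's styles, and that the label is a single word. It does not describe what pandoc itself writes.

```javascript
// Escape the characters that are special in XML text and attribute values.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: returns the WordprocessingML for one caption paragraph.
// `number` is only the cached display value; Word recomputes it from the SEQ
// field when fields are updated.
function captionParagraph(number, text, label = "Figure") {
  const safeLabel = escapeXml(label);
  return (
    "<w:p>" +
    '<w:pPr><w:pStyle w:val="Caption"/></w:pPr>' +
    `<w:r><w:t xml:space="preserve">${safeLabel} </w:t></w:r>` +
    `<w:fldSimple w:instr=" SEQ ${safeLabel} \\* ARABIC ">` +
    `<w:r><w:t>${number}</w:t></w:r>` +
    "</w:fldSimple>" +
    `<w:r><w:t xml:space="preserve"> - ${escapeXml(text)}</w:t></w:r>` +
    "</w:p>"
  );
}
```

For example, captionParagraph(1, "Il monte Cervino.") gives a paragraph that Word displays as "Figure 1 - Il monte Cervino." and renumbers together with the other Figure captions in the document.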


Expanding on Rachel Gallen's great find: the following code, I think, could be used to run the converter on a string containing the full HTML page generated by the loop.

Will this work to convert the output of the process that creates the page (the loop)? (The JavaScript and CSS are loaded by wp_enqueue .. before this code is called.)

 <?php
 // $post_output contains an HTML page with doctype/head/body/etc. that was generated by the loop
 $x = $post_output;
 $dom = new DOMDocument;
 libxml_use_internal_errors(true); // suppress errors (true enables internal error handling)
 $dom->loadHTML($x, LIBXML_NOERROR); // suppress errors
 ?>
 <script type="text/javascript">
   $dom.wordExport();
 </script>

... Rick ...
