XML to TeX, or how to get beautiful PDFs from an XHTML source

On the surface, an easy question: how can I get a great PDF from my XML document? In fact, my input is a subset of XHTML with a few custom attributes added (to store some information about the sources of quotes, etc.). I have been exploring some routes and would like to get feedback from anyone who has tried any of this before.

Note: I looked at XSL-FO for creating PDFs, but I have heard that the typographic quality of the open-source tools is still far behind TeX. I guess the most advanced one is Apache FOP. But I am really interested in great-looking PDFs (otherwise I could just use my browser's print dialog). Any thoughts or updates on this?

So, I was thinking about using XSLT to convert my custom XML/XHTML dialect to DocBook and going on from there (DocBook via XSLT to proper HTML seems to work quite well, so I could use it for that too). But how do I get from DocBook to TeX? I came across a number of solutions.

  • db2latex: a collection of XSLT stylesheets that output LaTeX.
  • dblatex: started as a db2latex clone, but now provides tighter integration with LaTeX packages and a single script for PDF output, which is pretty nice.
  • PassiveTeX: instead of XSLT, it uses an XML parser written in TeX.
  • TeXML: essentially an XML serialization of LaTeX that can be used as an intermediate format, plus an accompanying Python tool that converts this XML format to LaTeX/ConTeXt. Its authors claim that this avoids the problems of the existing solutions, which mishandle special characters, lose some braces or spaces, and only support Latin-1 encoding. (Is that still the case?)
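
To make the special-character problem from the TeXML item concrete, here is a minimal Python sketch of the escaping step that any XML-to-LaTeX converter has to get right. The helper name `escape_latex` and the mapping table are illustrative assumptions, not code taken from any of the tools listed above. It works one character at a time so that the backslashes it inserts are never escaped a second time, and it leaves non-ASCII text untouched on the assumption that a Unicode-aware engine handles it.

```python
# Hypothetical helper for illustration; not part of dblatex, db2latex,
# PassiveTeX or TeXML.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters, one input character at a time.

    Mapping each character independently avoids double escaping, which
    is what happens with naive chained str.replace calls when the
    backslash is handled after the other characters.
    """
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_latex("50% of {R&D} costs, see C:\\docs"))
```

Characters outside this table, such as accented letters or quotation marks, pass through unchanged, so whether they typeset correctly still depends on the engine and fonts used.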

Since my input XML can contain quite a few special characters represented as Unicode, the last point is especially important to me. I also thought about using XeTeX instead of pdfTeX to get around this problem. (I might lose some typographic quality, but would it still be better than the current open-source XSL-FO processors?) So db2latex and TeXML seem to be the favorites. Can anyone comment on their reliability?

As an alternative, I might be able to use ConTeXt directly, since there seems to be considerable interest in XML in the ConTeXt community. In particular, I could take a deeper look at “My Way: Getting Web Content and PDF Output from One Source” and “Working with XML in ConTeXt MkIV”. Both documents describe an approach that uses ConTeXt in conjunction with LuaTeX. (DocBook In ConTeXt seems to do roughly the same thing, but its latest version is from 2003.) The second document notes:

You may wonder why we do these manipulations in TeX and not use XSLT instead. The advantage of an integrated approach is that it simplifies usage. Think of not only processing a document, but also using XML for managing resources in the same run. An XSLT approach is just as verbose (after all, you still need to produce TeX code) and probably less readable. In the case of MkIV, the integrated approach is also faster and gives us the option to manipulate content at runtime using Lua.

What do you think about this? Please keep in mind that I have some experience with both XSLT and TeX, but I have never gone terribly deep into either of them. I have never tried many different LaTeX packages or alternatives such as ConTeXt (or XeTeX/LuaTeX instead of pdfTeX), but I am willing to learn some new things in order to eventually get my beautiful PDF files ;)

Also, I came across Pandoc, but could not find any information on how it compares with the other approaches mentioned. And finally, there is fairly extensive documentation on how to use TeXML with ConTeXt.

+7
3 answers

In the end, I decided to go with Pandoc; it seems to be a very polished and reliable code base. One potential drawback is that you have to limit yourself to the set of markup features available in Pandoc's internal representation, which corresponds mostly one-to-one to its extended Markdown.

Because I didn't think that generating Markdown from my XHTML-like source was a good idea, I got a Pandoc component that reads DocBook up and running; it now lives in the main development branch of the Pandoc repository. So now I have a simple XSLT stylesheet that converts from my XHTML dialect to DocBook (which is also XML), and then I use Pandoc to export to other formats, including PDF via ConTeXt.
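
The two-step pipeline described here could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes `xsltproc` and `pandoc` are installed and on the PATH, and the file names (`quotes.xhtml`, `to-docbook.xsl`, the `build` directory) and the helper names `build_pipeline` and `run_pipeline` are hypothetical. The commands are built as plain lists first, so they can be inspected without running anything.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pipeline(source: Path, stylesheet: Path, workdir: Path) -> List[List[str]]:
    """Return the two commands: XHTML dialect -> DocBook -> ConTeXt source.

    File naming is an assumption made for this sketch.
    """
    docbook = workdir / (source.stem + ".docbook.xml")
    context = workdir / (source.stem + ".tex")
    return [
        # Step 1: apply the custom XSLT stylesheet to produce DocBook.
        ["xsltproc", "--output", str(docbook), str(stylesheet), str(source)],
        # Step 2: let Pandoc read DocBook and write a standalone ConTeXt file.
        ["pandoc", "--from=docbook", "--to=context", "--standalone",
         "--output", str(context), str(docbook)],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = build_pipeline(Path("quotes.xhtml"), Path("to-docbook.xsl"), Path("build"))
    for cmd in cmds:
        print(" ".join(cmd))
    # run_pipeline(cmds)  # uncomment once the input files and tools exist
```

The resulting `.tex` file would then be compiled with ConTeXt in a separate step to obtain the PDF.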

+1

I have done something like this in the past (that is, I maintained the master versions of documents in XML and wanted to generate LaTeX output from them).

I used PassiveTeX at one point, but I found that creating the stylesheets was hard work, which is the usual result of writing in two languages at the same time. I got it to work, and the result looked very good, but it was probably more effort than it was worth. However, if the amount of styling you need to add is small, this can be a good route, because it is a single step.

The most successful route (readable, flexible, and attractive) was to use XSLT to transform the document into structural LaTeX that matches the intended structure of the resulting document, but does not try to do more than minimal formatting. Depending on your document, this may be plain LaTeX, or it may use custom structures. Then write or adapt a LaTeX style or class file that formats the output into something attractive. That way you use XSLT for its strengths (and do not push it beyond them, which is very frustrating), use LaTeX for its strengths, and do not confuse yourself.
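
As a rough illustration of what "structural LaTeX" means here, the following sketch maps a tiny XHTML-like input to LaTeX that carries structure only and leaves all formatting to a class or style file. It is written in Python with `xml.etree.ElementTree` rather than XSLT, purely to keep the example self-contained; it is not the stylesheet described in this answer. The `source` attribute, the `sourcedquote` environment, and the handled element set are assumptions for the example, and the input is assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Inline elements mapped to standard LaTeX commands.
INLINE = {"em": r"\emph", "strong": r"\textbf"}


def esc(text):
    """Minimal escaping for this sketch; braces and backslashes are not handled."""
    out = text or ""
    for ch in "&%#_$":
        out = out.replace(ch, "\\" + ch)
    return out


def inline(elem):
    """Render the mixed content of elem: its text plus inline children."""
    parts = [esc(elem.text)]
    for child in elem:
        macro = INLINE.get(child.tag)
        body = inline(child)
        parts.append(f"{macro}{{{body}}}" if macro else body)
        parts.append(esc(child.tail))
    return "".join(parts)


def block(elem):
    """Render one block-level element as structural LaTeX."""
    if elem.tag == "h1":
        return f"\\section{{{inline(elem)}}}"
    if elem.tag == "p":
        return inline(elem).strip()
    if elem.tag == "blockquote":
        # 'sourcedquote' is a hypothetical environment that the style
        # file would define; the converter does no formatting itself.
        source = esc(elem.get("source", ""))
        inner = "\n\n".join(block(c) for c in elem)
        return f"\\begin{{sourcedquote}}{{{source}}}\n{inner}\n\\end{{sourcedquote}}"
    raise ValueError(f"unsupported element: {elem.tag}")


def to_latex(xml_text):
    """Convert the children of the root element to LaTeX blocks."""
    root = ET.fromstring(xml_text)
    return "\n\n".join(block(child) for child in root)


if __name__ == "__main__":
    sample = ('<body><h1>Intro</h1><blockquote source="Knuth">'
              '<p>Fast &amp; <em>good</em>.</p></blockquote></body>')
    print(to_latex(sample))
```

How a sourced quote actually looks on the page is then decided entirely by the definition of the environment in the style file, which is the separation of concerns this route relies on.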

That is, this more or less matches the approach of your first two alternatives. Whether you go with them or write (or customize) your own XSLT with a matching LaTeX style file depends on how comfortable you feel with LaTeX style files, and on how much complex or specialized formatting you need to do.

Since you say that you need to handle Unicode characters in the input, then yes, XeLaTeX would be a good choice for the LaTeX part of the pipeline.

+2

You might want to check out the questions tagged XML on TeX.sx, especially this one. I suggest you use ConTeXt; the current version has no problems with Unicode, does an excellent job with OpenType, and is programmable in Lua. The most commonly used LaTeX-based alternative is XMLTeX, but it requires a lot of TeX-fu.

If your documents can be processed by Pandoc, use it: you will have several output options, more than with any TeX-based system.

+1
