How to create a tagged PDF file from a "complex" XML file

Question

I have a complex XML document, and I am using the iText library to create a tagged PDF from it. I followed the examples in chapter 15 of the book iText in Action, but they only cover a simple XML file with a flat hierarchy in which all elements are siblings.

How can I extend my algorithm, which works for a flat structure, so that it can process hierarchical XML like the example below?

An example of a "complex" XML document:

<?xml version="1.0" encoding="UTF-8" ?>
<movies>
  <movie duration="141" imdb="0062622" year="1968">
    <title>2001: A Space Odyssey</title>
    <directors>
      <director>Kubrick, Stanley</director>
    </directors>
    <countries>
      <country>United Kingdom</country>
      <country>United States</country>
    </countries>
  </movie>
</movies>

pdf-generation itext tagged-pdf

Shriram Kalpathy Mohan Feb 23 '12 at 5:26

1 answer

Shriram Kalpathy Mohan · Accepted Answer · 2012-02-24T04:00:39+0000

My teammate came up with a solution to this problem. The idea is to build a tree of DefaultMutableTreeNode objects, each of which holds a PdfStructureElement. The tree should mirror the XML hierarchy. Take the XML snippet in the question as an example: the first DefaultMutableTreeNode holds a PdfStructureElement (PdfName "movies") whose parent is writer.getStructureTreeRoot(). The child of this node holds another PdfStructureElement (PdfName "movie") whose parent is the PdfStructureElement named "movies", and so on.
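The tree-building pass described above might look roughly like the sketch below. It is not the code from our project: to keep it runnable without iText on the classpath, a hypothetical `Tag` class stands in for PdfStructureElement, and a `null` parent stands in for writer.getStructureTreeRoot(). The class and method names (`StructureTreeSketch`, `build`) are made up for illustration. A stack of open nodes tracks the current parent while a SAX parser reads the XML.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Enumeration;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StructureTreeSketch {

    /** Hypothetical stand-in for PdfStructureElement: a tag name plus its parent. */
    static final class Tag {
        final Tag parent;   // null stands in for writer.getStructureTreeRoot()
        final String name;
        Tag(Tag parent, String name) { this.parent = parent; this.name = name; }
        @Override public String toString() { return name; }
    }

    /** First pass: mirror the XML hierarchy as a tree of nodes, one Tag per node. */
    static DefaultMutableTreeNode build(String xml) throws Exception {
        final Deque<DefaultMutableTreeNode> open = new ArrayDeque<>();
        final DefaultMutableTreeNode[] root = new DefaultMutableTreeNode[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                DefaultMutableTreeNode parentNode = open.peek(); // null at the root
                Tag parentTag = parentNode == null
                        ? null : (Tag) parentNode.getUserObject();
                // With iText, the real structure element would be created here.
                DefaultMutableTreeNode node =
                        new DefaultMutableTreeNode(new Tag(parentTag, qName));
                if (parentNode == null) {
                    root[0] = node;
                } else {
                    parentNode.add(node);
                }
                open.push(node);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop(); // this element is closed; its parent is current again
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        DefaultMutableTreeNode root = build(
                "<movies><movie><title>2001: A Space Odyssey</title>"
                + "<countries><country>United Kingdom</country>"
                + "<country>United States</country></countries></movie></movies>");
        // Print each tag indented by its depth in the tree.
        Enumeration<?> nodes = root.preorderEnumeration();
        while (nodes.hasMoreElements()) {
            DefaultMutableTreeNode n = (DefaultMutableTreeNode) nodes.nextElement();
            for (int i = 0; i < n.getLevel(); i++) System.out.print("  ");
            System.out.println(n.getUserObject());
        }
    }
}
```

In a real iText program, the line that creates `new Tag(parentTag, qName)` would presumably create the PdfStructureElement instead, passing either the structure tree root or the parent element together with a PdfName built from qName.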

Once the steps above are complete (this is essentially parsing the structure), we have a tree of PdfStructureElements. Next we need to parse the content. While parsing the content, we walk through each node in the tree. If the current node is a leaf, we take the PdfStructureElement stored in that node. If it is a non-leaf node, we only need the PdfName of the PdfStructureElement in that node; in other words, we can simply use the qName variable.

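  if (node.isLeaf()) {
      // Leaf: mark the content with the structure element stored in the node.
      PdfStructureElement element = (PdfStructureElement) node.getUserObject();
      canvas.beginMarkedContentSequence(element);
  } else {
      // Non-leaf: mark with the tag name only (the method takes a PdfName, not a String).
      canvas.beginMarkedContentSequence(new PdfName(qName));
  }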
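To show how the leaf versus non-leaf decision plays out over a whole tree, here is a small self-contained sketch of the content pass. It is an illustration under assumptions, not our production code: the nodes carry plain strings instead of PdfStructureElement objects, the iText calls appear only as comments, and `MarkedContentSketch` and `mark` are hypothetical names. Instead of writing to a canvas, it records which call each node would trigger.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;

public class MarkedContentSketch {

    /** Second pass: visit every node and record the marked-content call it would get. */
    static void mark(DefaultMutableTreeNode node, List<String> out) {
        String tag = String.valueOf(node.getUserObject());
        if (node.isLeaf()) {
            // iText: canvas.beginMarkedContentSequence(
            //            (PdfStructureElement) node.getUserObject());
            out.add("begin element " + tag);
        } else {
            // iText: canvas.beginMarkedContentSequence(new PdfName(tag));
            out.add("begin name " + tag);
            for (int i = 0; i < node.getChildCount(); i++) {
                mark((DefaultMutableTreeNode) node.getChildAt(i), out);
            }
        }
        // iText: canvas.endMarkedContentSequence();
        out.add("end " + tag);
    }

    public static void main(String[] args) {
        // Hand-built tree mirroring part of the XML from the question.
        DefaultMutableTreeNode movies = new DefaultMutableTreeNode("movies");
        DefaultMutableTreeNode movie = new DefaultMutableTreeNode("movie");
        movies.add(movie);
        movie.add(new DefaultMutableTreeNode("title"));

        List<String> calls = new ArrayList<>();
        mark(movies, calls);
        for (String call : calls) {
            System.out.println(call);
        }
    }
}
```

In the SAX-based version, the "begin" branch would sit in startElement and the matching end call in endElement, so the recursion here only imitates the order in which the parser fires those events.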
