I have a set of images on which I run an OCR application. The process produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there a simple, free way to do this?
Some information:
I do not want to use Acrobat's OCR features.
The OCR process leads to an XML file that contains elements such as:
<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
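To make the goal concrete, here is a rough sketch of how one such `<line>` element could be mapped to PDF content-stream operators for invisible text (text rendering mode 3, set with `3 Tr`). Several things here are assumptions rather than facts about my OCR output: that the coordinates are pixels with the origin at the top-left of the image, that the scan resolution is 300 dpi, that the page height in pixels is known, and that a font is registered on the page under the resource name `/F1`. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample element from the OCR output shown above.
SAMPLE = ('<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
          'This is a sample line of text from an image</line>')


def pdf_escape(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def line_to_pdf_ops(line_xml, page_height_px, dpi=300, font="/F1"):
    """Turn one OCR <line> element into invisible-text PDF operators.

    Assumes pixel coordinates with a top-left origin (hypothetical; check
    what the OCR tool really emits). PDF user space uses points (1/72 inch)
    with the origin at the bottom-left, so the y axis has to be flipped.
    """
    el = ET.fromstring(line_xml)
    scale = 72.0 / dpi                      # pixels -> points
    left = int(el.get("l"))
    top = int(el.get("t"))
    bottom = int(el.get("b"))
    baseline = int(el.get("baseline"))

    x = left * scale
    y = (page_height_px - baseline) * scale  # flip the y axis
    size = (bottom - top) * scale            # crude font size from box height

    # BT/ET delimit a text object, "3 Tr" selects invisible rendering,
    # Tf sets font and size, Td positions the line, Tj shows the string.
    return "BT 3 Tr {} {:.2f} Tf {:.2f} {:.2f} Td ({}) Tj ET".format(
        font, size, x, y, pdf_escape(el.text or ""))


if __name__ == "__main__":
    # 3300 px is the height of an 11 inch page scanned at 300 dpi.
    print(line_to_pdf_ops(SAMPLE, page_height_px=3300))
```

This only builds the operator string. Writing it into each page's content stream (on top of the image) still needs a PDF library, and the sketch does not stretch the text to the width of the box (`r` minus `l`) or handle non-ASCII characters.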
Update: there may be another way to do what I want. Suppose you already have a PDF file, generated from a set of images, that already contains the OCRed text. Is it possible (possibly programmatically) to access just the image of each page, process it (for example, convert it to monochrome), and save it back into the PDF? If so, the OCRed text would not be lost.
[Should I put this update on a separate question?]
xml pdf ocr
kepler