How to embed an external OCR into an existing PDF?

Question


I have a set of images on which I run an OCR application. The process produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there a simple and free way to do this?

Some information:

I do not want to use Acrobat's OCR features.
The OCR process produces an XML file that contains elements such as:
<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>

Update: it may be possible to achieve what I want in a different way. Suppose you already have a PDF file that was generated from a set of images and already contains the OCRed text. Is it possible (ideally programmatically) to access only the image of each page, process it (for example, convert it to monochrome) and save it back into the PDF file? If so, the OCRed text would not be lost.

[Should I post this update as a separate question?]

+6

xml pdf ocr

kepler Sep 28 '09 at 21:35


2 answers

If all you want to do is convert the existing PDF to grayscale, try ImageMagick:

 convert foo.pdf -colorspace Gray -compress zip gray.pdf

I do not think this will change any other attributes of your PDF.

-1

Dave Parillo 01 Oct '09 at 16:15


Jukka Matilainen · Accepted Answer · 2009-10-05T22:28:48+0000

For your follow-up question about processing PDF files without losing the hidden text layer: I believe Ghostscript can do this. For example, the following command should convert the PDF to grayscale:

 gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
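
For the original question (turning the `<line>` elements into an invisible text layer), here is a minimal Python sketch of the coordinate conversion involved. It is not a complete solution. It assumes that `l`, `t`, `r`, `b` and `baseline` are pixel coordinates measured from the top-left corner of the page image, that the scan resolution is known (300 dpi is only a guess), and that a font is registered under the name `F1` in the page resources; the function name `line_to_pdf_ops` is made up for this example. It only produces the PDF content-stream operators for one line, using text rendering mode 3 (invisible). Appending that stream to the right page still has to be done with a PDF library, and matching the text width to `r - l` would need font metrics, which are left out here.

```python
import xml.etree.ElementTree as ET


def line_to_pdf_ops(line_xml, page_height_px, dpi=300, font_name="F1"):
    """Turn one OCR <line> element into PDF content-stream operators
    that place its text invisibly (text rendering mode 3).

    Assumptions (not confirmed by the OCR tool's documentation):
    coordinates are pixels with the origin at the top-left, and
    font_name is already defined in the page's font resources.
    """
    el = ET.fromstring(line_xml)
    scale = 72.0 / dpi  # pixels -> PDF points (1 point = 1/72 inch)

    left = int(el.get("l")) * scale
    # PDF puts the origin at the bottom-left, so flip the y axis.
    baseline = (page_height_px - int(el.get("baseline"))) * scale
    # Use the box height as a rough font size.
    font_size = (int(el.get("b")) - int(el.get("t"))) * scale

    text = el.text or ""
    # Backslashes and parentheses must be escaped in PDF literal strings.
    escaped = (
        text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )

    return "\n".join(
        [
            "BT",                                   # begin text object
            "3 Tr",                                 # invisible text
            f"/{font_name} {font_size:.2f} Tf",     # select font and size
            f"{left:.2f} {baseline:.2f} Td",        # move to line start
            f"({escaped}) Tj",                      # show the string
            "ET",                                   # end text object
        ]
    )


if __name__ == "__main__":
    sample = (
        '<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
        "This is a sample line of text from an image</line>"
    )
    # 3300 px is the height of a hypothetical US Letter page at 300 dpi.
    print(line_to_pdf_ops(sample, page_height_px=3300))
```

Non-ASCII text would additionally need an encoding that matches the chosen font, which this sketch does not handle.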
