How to embed external OCR output into an existing PDF?

I have a set of images that I run through an OCR application. The process produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there a simple, free way to do this?

Some information:

  • I do not want to use the Acrobat OCR features,

  • The OCR process leads to an XML file that contains elements such as:

    <line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
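To make the data concrete, here is a minimal sketch of how such `line` elements could be read and mapped to PDF coordinates. Several things are assumptions rather than facts about the OCR tool: that the attributes are pixel values measured from the top-left corner of the image, that the scan resolution is 300 DPI, and that the page height in pixels is known. The names `OcrLine` and `parse_lines` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class OcrLine(NamedTuple):
    text: str
    x: float      # left edge, in PDF points
    y: float      # baseline, in PDF points measured from the bottom of the page
    width: float  # line width, in PDF points


def parse_lines(xml_text: str, page_height_px: int, dpi: int = 300) -> List[OcrLine]:
    """Read <line> elements and convert pixel coordinates to PDF points.

    Assumes pixel coordinates with a top-left origin (an assumption about
    the OCR output); PDF user space has its origin at the bottom-left.
    """
    scale = 72.0 / dpi  # 72 PDF points per inch
    root = ET.fromstring(xml_text)
    lines = []
    for el in root.iter("line"):
        left = int(el.get("l"))
        right = int(el.get("r"))
        baseline = int(el.get("baseline"))
        lines.append(OcrLine(
            text=(el.text or "").strip(),
            x=left * scale,
            y=(page_height_px - baseline) * scale,  # flip the vertical axis
            width=(right - left) * scale,
        ))
    return lines
```

Each resulting entry could then be drawn at `(x, y)` with a PDF library, using text rendering mode 3 (invisible text) and a font size or horizontal scaling chosen so that the text spans `width`.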

Update: there may be a different way to get what I want. Suppose you already have a PDF file that was generated from a set of images and already contains OCRed text. Is it possible (ideally programmatically) to access only the image of each page, process it (for example, convert it to monochrome) and save it back to the PDF file? That way, the OCRed text would not be lost.

[Should I put this update on a separate question?]

2 answers

As for your follow-up question about processing PDF files without losing the hidden text layer: I believe Ghostscript can do this. For example, the following command should convert the PDF to grayscale:

 gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf 
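
If there are many files to process, the same invocation could be scripted. The following is only a sketch: it assumes the `gs` executable is on the PATH, and the helper names `gs_gray_command` and `convert_to_gray` are made up for illustration. The flags are exactly the ones from the command above.

```python
import subprocess
from pathlib import Path
from typing import List


def gs_gray_command(src: Path, dst: Path) -> List[str]:
    """Build the Ghostscript argument list for a grayscale conversion."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-dColorConversionStrategy=/Gray",
        "-dProcessColorModel=/DeviceGray",
        f"-sOutputFile={dst}",
        str(src),
    ]


def convert_to_gray(src: Path, dst: Path) -> None:
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(gs_gray_command(src, dst), check=True)
```

Calling `convert_to_gray(Path("input.pdf"), Path("output.pdf"))` would then be equivalent to the command line shown above.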

If all you want to do is convert the existing PDF to grayscale, try ImageMagick:

 convert foo.pdf -colorspace Gray -compress zip gray.pdf 

I do not think this will change any other attributes of your PDF.
