Is there any open source solution for creating searchable PDFs in Windows?

I am looking for an all-in-one solution that creates searchable PDF files (via OCR) from image-only PDF files (scanned documents) in a single step, for example by calling a command-line tool from another program.

I found several software packages:

  • pdfsandwich (hard to port to Windows systems)
  • watchOCR (discontinued :-()

I played around with Tesseract, but it only accepts a single TIFF image as input. I would then have to combine the OCR result with the image and bind all the combined pages into a new PDF document.

I am writing a Java-based program that checks PDF files and, if necessary, must convert them to searchable PDFs (PDFs with a text layer containing the text recognized from the images via OCR).

It would be great if someone knew how I could simplify all these individual steps and use Tesseract for the following workflow:

PDF with scanned images ====> (processing) ====> searchable PDF with recognized text

Thank you very much in advance

Regards

Shannon

+4
3 answers

There are some Java-based hOCR-to-PDF solutions listed on the Tesseract 3rdParty page. You will first need to convert the PDF to images (using Ghostscript) before sending them to Tesseract for conversion to hOCR format.
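The first two steps described above could be driven from Java roughly as follows. This is only a minimal sketch under several assumptions: Ghostscript's Windows console executable `gswin64c` and `tesseract` are both on the PATH, 300 dpi grayscale TIFF is an acceptable intermediate format, and the class and method names (`OcrPipelineSketch`, `ghostscriptCommand`, `tesseractCommand`) are made up for illustration. Merging each hOCR file with its page image into a PDF is left to whichever hOCR-to-PDF tool you pick.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class OcrPipelineSketch {

    // Assumption: both executables are on the PATH; adjust the names otherwise.
    static final String GHOSTSCRIPT = "gswin64c";
    static final String TESSERACT = "tesseract";

    /** Builds the Ghostscript command that renders every PDF page to a 300 dpi grayscale TIFF. */
    public static List<String> ghostscriptCommand(Path pdf, Path outDir) {
        return List.of(
                GHOSTSCRIPT, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffgray", "-r300",
                "-sOutputFile=" + outDir.resolve("page-%03d.tif"),
                pdf.toString());
    }

    /** Builds the Tesseract command that writes an hOCR file next to outputBase for one page image. */
    public static List<String> tesseractCommand(Path image, Path outputBase) {
        return List.of(TESSERACT, image.toString(), outputBase.toString(), "hocr");
    }

    /** Runs an external command and fails if it exits with a non-zero status. */
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: OcrPipelineSketch <input.pdf> <output-dir>");
            return;
        }
        Path pdf = Paths.get(args[0]);
        Path outDir = Files.createDirectories(Paths.get(args[1]));

        // Step 1: PDF pages -> page-001.tif, page-002.tif, ...
        run(ghostscriptCommand(pdf, outDir));

        // Step 2: each page image -> hOCR output with the same base name
        try (DirectoryStream<Path> pages = Files.newDirectoryStream(outDir, "page-*.tif")) {
            for (Path page : pages) {
                String base = page.getFileName().toString().replaceFirst("\\.tif$", "");
                run(tesseractCommand(page, outDir.resolve(base)));
            }
        }
    }
}
```

Keeping the command construction in separate methods makes it easy to swap the Ghostscript device or resolution without touching the process handling.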

+1

There is a .NET project, NAPS2, that takes an image file as input and creates a searchable PDF file. It also provides a command-line utility for automation.

+1

If an online OCR solution is acceptable, the free ocr.space API includes the ability to make PDF files searchable.

This is a one-step solution. You send the image or PDF to the API, and it returns a link to a downloadable PDF.
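From Java (11 or later), such a request might look like the sketch below. The endpoint `https://api.ocr.space/parse/image` and the form fields `apikey`, `url` and `isCreateSearchablePdf` are written from memory of the service's documentation and should be treated as assumptions to verify; the class and method names are hypothetical. The sketch submits the address of a publicly reachable document and prints the raw JSON response, which is where the download link would be found.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class OcrSpaceSketch {

    // Assumption: endpoint as remembered from the ocr.space docs; verify before use.
    static final String ENDPOINT = "https://api.ocr.space/parse/image";

    /** Builds the URL-encoded form body; the field names are assumptions. */
    public static String formBody(String apiKey, String documentUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("apikey", apiKey);
        fields.put("url", documentUrl);
        fields.put("isCreateSearchablePdf", "true");
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Wraps the form body in a POST request to the assumed endpoint. */
    public static HttpRequest buildRequest(String apiKey, String documentUrl) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(apiKey, documentUrl)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: OcrSpaceSketch <api-key> <document-url>");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(args[0], args[1]), HttpResponse.BodyHandlers.ofString());
        // The JSON response is printed as-is; parse it with the JSON library of your choice.
        System.out.println(response.body());
    }
}
```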

0