Is there any open source solution for creating searchable PDFs in Windows?

Question

I am looking for an all-in-one solution that creates searchable PDF files (via OCR) from image-only PDF files (scanned documents) in a single step, for example by calling a command-line tool from another program.

I found several software packages:

pdfsandwich (hard to port to Windows)
watchOCR (discontinued :-( )

I played around with Tesseract, but it only accepts a single TIFF image as input, so I would then need to combine the OCR result with the image and bind all the combined pages into a new PDF document.

I am writing a Java-based program that checks PDF files and, if necessary, must convert them to searchable PDFs (PDFs with a text layer produced by running OCR on the page images).

It would be great if someone knew how I could simplify all these individual steps and use Tesseract for the following workflow:

PDF with scanned images =====> OCR (processing) =====> searchable PDF with recognized text

Thank you very much in advance

Regards

Shannon

windows pdf ocr searchable

Shannon Sep 19 '13 at 10:30

3 answers

nguyenq · Answer 1 · 2013-09-20T23:26:35+0000

There are some Java-based hOCR-to-PDF solutions listed on the Tesseract 3rdParty page. You will first need to convert the PDF to images (using Ghostscript) before sending them to Tesseract for conversion to hOCR format.
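
The first two steps described above (PDF to images with Ghostscript, then images to hOCR with Tesseract) could be driven from Java roughly as follows. This is only a minimal sketch under some assumptions: the class and method names are made up for illustration, `gswin64c` and `tesseract` are assumed to be installed and on the PATH, and 300 dpi grayscale TIFF is just one plausible choice of rendering settings. Merging the hOCR output and the page images back into a PDF is left to one of the hOCR-to-PDF tools mentioned above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
class OcrPipeline {

    // Ghostscript command that renders every PDF page to its own grayscale TIFF.
    // Assumption: "gswin64c" (the 64-bit Windows console build) is on the PATH.
    static List<String> ghostscriptCommand(Path pdf, Path outDir) {
        return List.of(
            "gswin64c", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=tiffgray", "-r300",
            "-sOutputFile=" + outDir.resolve("page-%03d.tif"),
            pdf.toString());
    }

    // Tesseract command that writes hOCR for one image.
    // Tesseract appends the file extension to the output base name itself.
    static List<String> tesseractCommand(Path image) {
        String name = image.toString();
        String base = name.substring(0, name.lastIndexOf('.'));
        return List.of("tesseract", name, base, "hocr");
    }

    // Runs an external command and fails on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed (" + exitCode + "): " + command);
        }
    }

    // Renders the PDF to images, then runs OCR on each page in order.
    static void pdfToHocr(Path pdf, Path outDir) throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        run(ghostscriptCommand(pdf, outDir));

        List<Path> pages = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outDir, "page-*.tif")) {
            for (Path page : stream) {
                pages.add(page);
            }
        }
        Collections.sort(pages); // zero-padded names sort in page order
        for (Path page : pages) {
            run(tesseractCommand(page));
        }
    }
}
```

A call such as `OcrPipeline.pdfToHocr(Paths.get("scan.pdf"), Paths.get("out"))` would then leave one TIFF and one hOCR file per page in `out`. The command builders are kept separate from `run` so that the exact arguments can be inspected or adjusted without launching either tool.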

Hassan nazeer · Answer 2 · 2016-05-18T12:22:36+0000

There is a .NET project, NAPS2, that takes an image file as input and creates a searchable PDF file. It also provides a command-line utility for automation.

Tim b · Answer 3 · 2017-09-28T20:41:46+0000

If an online OCR solution is acceptable, the free ocr.space API includes the ability to make PDF files searchable.

This is a one-step solution. You send the image or PDF to the API, and it returns a download link for the resulting PDF.
