I am looking for an all-in-one solution that creates searchable PDF files (via OCR) from image-only PDF files (scanned documents) in one step, for example by calling a command-line tool from another program.
I found several software packages:
- pdfsandwich (it is hard to port to Windows systems)
- watchOCR (discontinued :-( )
I played around with Tesseract, but it only supports a single TIFF image as input, so I then need to combine the OCR result with the image and bind all the combined pages into a new PDF document.
I am writing a Java-based program that checks PDF files and, if necessary, must convert them to searchable PDFs (PDFs with a text layer, with the images recognized via OCR).
It would be great if someone knew how I could simplify all these individual steps and use Tesseract for the following workflow:
PDF with scanned images ====> OCR (processing) ====> searchable PDF with a text layer
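To make the per-page step of this workflow concrete, here is a rough sketch of how I imagine driving it from Java with ProcessBuilder. It assumes a `tesseract` binary on the PATH that ships the `pdf` output config (available in newer Tesseract releases, not necessarily in every installed version), so that each page image yields a one-page PDF with a text layer. The class name, method names, and the `page-0001` naming scheme are made up for illustration. Splitting the input PDF into page images and merging the per-page PDFs into one document are not covered here and would still need another tool.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds and runs one Tesseract call per scanned page.
public class OcrPipelineSketch {

    // Command line for one page: tesseract <image> <outputBase> -l <language> pdf
    // Assumption: the installed Tesseract provides the "pdf" config file,
    // in which case it writes <outputBase>.pdf containing image plus text layer.
    public static List<String> tesseractCommand(Path pageImage, Path outputBase, String language) {
        return Arrays.asList(
                "tesseract",
                pageImage.toString(),
                outputBase.toString(),
                "-l", language,
                "pdf");
    }

    // One command per page image; output bases are numbered page-0001, page-0002, ...
    public static List<List<String>> commandsForPages(List<Path> pageImages, Path workDir, String language) {
        List<List<String>> commands = new ArrayList<>();
        for (int i = 0; i < pageImages.size(); i++) {
            Path outputBase = workDir.resolve(String.format("page-%04d", i + 1));
            commands.add(tesseractCommand(pageImages.get(i), outputBase, language));
        }
        return commands;
    }

    // Runs a single command and returns its exit code (0 normally means success).
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward Tesseract's console output to this process
                .start();
        return process.waitFor();
    }
}
```

The calling program would loop over `commandsForPages(...)`, pass each entry to `run(...)`, and stop on the first non-zero exit code. Keeping command construction separate from execution makes it possible to check the generated command lines without Tesseract installed.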
Thank you very much in advance
Regards
Shannon