How to identify PDF files that require OCR?

I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there any way to tell which files already contain OCR text and which are just images?

It would take far too long to run every single file through the OCR processor.

2 answers

I would write a small script that extracts the text from each PDF file and checks whether the result is "empty". If there is text, the PDF has already been OCR'd. You can use Ghostscript or XPDF to extract the text.

EDIT: This should get you started:

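# Extract the text of every PDF in the current folder with pdftotext.
$pdftotext = "\path\to\xpdf\pdftotext.exe"
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # '-' makes pdftotext write to stdout; join the output lines into one string
    $pdftext = (& $pdftotext $pdffile.FullName '-') -join "`n"
    Write-Host $pdffile.FullName
    # A trimmed length of 0 means no text could be extracted
    Write-Host $pdftext.Trim().Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}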

If pdftotext returns no text for a PDF (only whitespace or page breaks), that file is probably image-only and still needs OCR.
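For a batch of 30,000 files it may be easier to collect just the list of files that need OCR. Here is a rough Python sketch of the same text-extraction check. It assumes `pdftotext` is on the PATH and accepts `-` to write to stdout, as in the script above; the function names `has_text`, `extract_text` and `find_pdfs_needing_ocr` are made up for this example.

```python
import subprocess
from pathlib import Path


def has_text(extracted: str, min_chars: int = 1) -> bool:
    """True if the text has at least min_chars non-whitespace characters."""
    # str.split() drops spaces, newlines and the form feeds between pages.
    return len("".join(extracted.split())) >= min_chars


def extract_text(pdf_path: Path, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext on one file and return whatever it prints to stdout."""
    result = subprocess.run(
        [pdftotext, str(pdf_path), "-"],
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on odd byte sequences in the output
        check=False,       # a broken PDF simply yields empty text
    )
    return result.stdout


def find_pdfs_needing_ocr(folder: Path) -> list:
    """Return the PDFs in folder for which no text could be extracted."""
    return [
        pdf for pdf in sorted(folder.glob("*.pdf"))
        if not has_text(extract_text(pdf))
    ]


if __name__ == "__main__":
    for pdf in find_pdfs_needing_ocr(Path(".")):
        print(pdf)
```

Raising `min_chars` may help if some scans carry a few stray characters (a stamp or a page number) but no real text layer. The right threshold depends on your files.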


XPDF offers another way to check this, although I am not sure it covers every case.

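Look at the fonts used in the PDF. Run pdffonts.exe on the file and it lists the fonts it finds. An "empty" PDF looks like this: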

> Config Error: No display font for 'Symbol' 
> Config Error: No display font for 'ZapfDingbats' 
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- --------- 
> Helvetica                            Type 1            no  no  no       7  0

whereas a PDF with "text" looks like this:

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> ABCDEE+Calibri                       TrueType          yes yes no       7  0
> ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
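To compare the font listings across many files, the pdffonts output can be parsed instead of read by eye. The Python sketch below is only an illustration: it assumes the column layout shown in the samples above (name, type, emb, sub, uni, object ID), and other pdffonts versions may print different columns. `parse_pdffonts` is a hypothetical helper name. It only reports what is listed; which combination of fonts counts as "already OCR'd" is something to check against a few known files first.

```python
def parse_pdffonts(output: str) -> list:
    """Return (name, type, embedded) tuples from pdffonts output.

    Assumes each font row ends with: emb sub uni <object number> <generation>.
    """
    fonts = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("----"):
            # The dashed rule separates the header from the font rows.
            in_table = True
            continue
        if not in_table:
            continue  # skips "Config Error" lines and the column header
        tokens = line.split()
        if len(tokens) < 7:
            continue  # blank or unexpected line
        name = tokens[0]
        font_type = " ".join(tokens[1:-5])  # the type may contain a space, e.g. "Type 1"
        embedded = tokens[-5] == "yes"
        fonts.append((name, font_type, embedded))
    return fonts


sample = """Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
ABCDEE+Calibri                       TrueType          yes yes no       7  0
ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
"""

for name, font_type, embedded in parse_pdffonts(sample):
    print(name, font_type, "embedded" if embedded else "not embedded")
```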