PDF to TIFF with ImageMagick

I am trying to convert PDF files to TIFF images for subsequent OCR. I use "-density 300x300 -depth 8" as parameters. The first problem is that I get a 72 MB TIFF file from a 500 KB PDF file. The second problem is poor image quality, which makes the OCR fail. You can see it for yourself. TIFF image generated (printed) by Adobe Acrobat Reader: [image]

ImageMagick TIFF image: [image]

The difference is huge. How can I get Adobe-quality output using ImageMagick? It does not have to be identical; other formats would also be fine.

UPD: I found the "antialias" option, and the output is now much better. Still, the OCR result is not as accurate as with the Adobe version.

2 answers

My suggestion: use the Ghostscript command line directly, since ImageMagick uses Ghostscript in the background anyway (the IM technical term for this: Ghostscript is the "delegate" for some conversions, such as PDF -> TIFF).

Here is the command line that should work well for letter-format pages in a multi-page PDF file:

 gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiffg4 ^
   -r720x720 ^
   -g6120x7920 ^
   input.pdf

The -g... parameter sets the absolute width and height of the output pages in device pixels (6120x7920 pixels at 720 dpi is 8.5 x 11 inches, which is Letter size).
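The arithmetic behind the -g value can be checked with a small shell helper. This is only a sketch: the function name page_pixels is made up for illustration, and it assumes a POSIX shell with awk available.

```shell
# Hypothetical helper: convert a page size in inches plus a resolution in dpi
# into the WIDTHxHEIGHT pixel string that Ghostscript's -g option expects.
page_pixels() {
    # $1 = width in inches, $2 = height in inches, $3 = resolution in dpi
    awk -v w="$1" -v h="$2" -v dpi="$3" \
        'BEGIN { printf "%dx%d\n", w * dpi + 0.5, h * dpi + 0.5 }'
}

page_pixels 8.5 11 720   # Letter at 720 dpi -> 6120x7920
page_pixels 8.5 11 300   # Letter at 300 dpi -> 2550x3300
```

If you change -r, the -g value has to change with it, otherwise the page no longer maps to Letter size.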

These TIFF Pages ...

  • ... will be black + white,
  • ... will have a resolution of 720 dpi,
  • ... will be G4-compressed, and
  • ... will be much smaller than the 300 dpi output from your IM command line.
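To see why the bilevel output can be smaller despite the higher resolution, here is a rough size estimate. It is a sketch under assumptions: a Letter-size page, 24-bit RGB for the ImageMagick output (the question does not say what color mode was actually written), and no compression on either side. The helper name raw_bytes is made up.

```shell
# Hypothetical helper: uncompressed image size in bytes for a page of
# $1 x $2 pixels at $3 bits per pixel.
raw_bytes() {
    echo $(( $1 * $2 * $3 / 8 ))
}

raw_bytes 2550 3300 24   # Letter, 300 dpi, 24-bit RGB -> 25245000
raw_bytes 6120 7920 1    # Letter, 720 dpi, 1-bit b/w  -> 6058800
```

Under these assumptions a single uncompressed color page is already about 25 MB, so a few pages are enough to reach a file size like the one in the question, while the 1-bit page starts at about 6 MB before G4 compression reduces it further.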

Your IM -depth 8 parameter is not well suited to the later OCR step, because it creates shades of gray around the letters, which do not help recognition.

With the bilevel output, your OCR results should be much better than before.

If your OCR cannot handle the TIFF G4 format (which I doubt), you can generate other TIFF subformats using Ghostscript. For instance:

 gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiffgray ^
   -r720x720 ^
   -g6120x7920 ^
   -sCompression=lzw ^
   input.pdf

or

 gswin32c.exe ^
   -o page_%03d.tif ^
   -sDEVICE=tiff24nc ^
   -r720x720 ^
   -g6120x7920 ^
   -sCompression=lzw ^
   input.pdf

The tiffgray device generates 8-bit grayscale output. The tiff24nc device generates 24-bit RGB color output (8 bits per channel). Both kinds of TIFF will, of course, be larger than the tiffg4 output.
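If you switch between the three devices often, the choice can be wrapped in a small shell function. This is a sketch only: the function name gs_tiff_opts and the mode names bw, gray and color are invented for illustration, and it targets a Unix shell with the gs binary rather than gswin32c.exe.

```shell
# Hypothetical wrapper: print the Ghostscript device options for one of
# three output modes, using the devices and settings from the commands above.
gs_tiff_opts() {
    # $1 = mode: bw, gray or color
    case "$1" in
        bw)    echo "-sDEVICE=tiffg4 -r720x720" ;;
        gray)  echo "-sDEVICE=tiffgray -r720x720 -sCompression=lzw" ;;
        color) echo "-sDEVICE=tiff24nc -r720x720 -sCompression=lzw" ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage (requires Ghostscript; input.pdf is a placeholder file name):
# gs -o page_%03d.tif $(gs_tiff_opts gray) -g6120x7920 input.pdf
```

The unquoted $(...) in the usage line is intentional, so that the shell splits the printed options into separate arguments.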


For the European paper size A4 on Unix/Linux:

 gs -o output.tif -sDEVICE=tiffg4 -r720x720 -sPAPERSIZE=a4 input.pdf 