Pdf
You have various options.
pdftotext:
Download the XPDF utilities. The .zip file contains several command line utilities, one of which is pdftotext(.exe). It can extract all text content from a well-behaved PDF file. Run pdftotext -help to see the available command line options.
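If you want to call pdftotext from a script rather than by hand, here is a minimal Python sketch. The helper names (build_pdftotext_cmd, extract_text) are made up for illustration, and it assumes pdftotext is on your PATH. The options used (-f, -l, -layout, -enc, and "-" as the output file to write to standard output) are ones pdftotext documents, but check pdftotext -help against your own build.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None,
                        layout=False, exe="pdftotext"):
    """Build the pdftotext argument list (hypothetical helper, not part of XPDF)."""
    cmd = [exe, "-enc", "UTF-8"]          # ask for UTF-8 so decoding below is predictable
    if first_page is not None:
        cmd += ["-f", str(first_page)]    # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]     # last page to convert
    if layout:
        cmd.append("-layout")             # try to keep the physical page layout
    cmd += [pdf_path, "-"]                # "-" sends the text to standard output
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(build_pdftotext_cmd(pdf_path, **options),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds and prints the command; call extract_text() to actually run it.
    print(build_pdftotext_cmd("input.pdf", first_page=3, last_page=7))
```

Building the command as a list (instead of one string) avoids quoting problems with paths that contain spaces.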
Ghostscript:
Install the latest version of Ghostscript (v8.71 at the time of writing). Ghostscript is a PostScript and PDF interpreter. You can also use it to extract text from a PDF:
gswin32c.exe ^
    -q ^
    -sFONTPATH=c:/windows/fonts ^
    -dNODISPLAY ^
    -dSAFER ^
    -dDELAYBIND ^
    -dWRITESYSTEMDICT ^
    -dSIMPLE ^
    -f ps2ascii.ps ^
    -dFirstPage=3 ^
    -dLastPage=7 ^
    input.pdf ^
    -dQUIET
This will output the text contained in pages 3-7 of input.pdf to standard output. You can redirect this to a file by adding > /path/to/output.txt to the command. (Make sure the PostScript ps2ascii.ps utility is present in your Ghostscript lib subdirectory.)
If you omit the -dSIMPLE option, the text output will try to guess line breaks and word spacing. See the comments inside the ps2ascii.ps file itself for details. You can even replace this option with -dCOMPLEX to get additional text formatting information.
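To try the three variants (-dSIMPLE, no option, -dCOMPLEX) without retyping the whole command, something like the following Python sketch may help. It only reassembles the Ghostscript command shown above; the function names and the "simple"/"default"/"complex" mode labels are my own invention, and it assumes gswin32c.exe is on your PATH and ps2ascii.ps is in the Ghostscript lib directory.

```python
import subprocess

# Hypothetical mode labels mapped to the ps2ascii.ps switches discussed above.
MODE_FLAGS = {"simple": ["-dSIMPLE"], "default": [], "complex": ["-dCOMPLEX"]}


def build_ps2ascii_cmd(pdf_path, first_page, last_page, mode="simple",
                       gs_exe="gswin32c.exe", font_path="c:/windows/fonts"):
    """Rebuild the Ghostscript command from the answer as an argument list."""
    if mode not in MODE_FLAGS:
        raise ValueError("mode must be one of: " + ", ".join(MODE_FLAGS))
    return [
        gs_exe, "-q",
        "-sFONTPATH=" + font_path,
        "-dNODISPLAY", "-dSAFER", "-dDELAYBIND", "-dWRITESYSTEMDICT",
        *MODE_FLAGS[mode],
        "-f", "ps2ascii.ps",
        "-dFirstPage=%d" % first_page,
        "-dLastPage=%d" % last_page,
        pdf_path,
        "-dQUIET",
    ]


def extract_pages_to_file(pdf_path, out_path, first_page, last_page, mode="simple"):
    """Run Ghostscript and write its standard output to out_path."""
    cmd = build_ps2ascii_cmd(pdf_path, first_page, last_page, mode)
    with open(out_path, "wb") as out:
        # stdin is closed so Ghostscript cannot sit waiting for interactive input.
        subprocess.run(cmd, stdout=out, stdin=subprocess.DEVNULL, check=True)


if __name__ == "__main__":
    # Only prints the command lines; call extract_pages_to_file() to run them.
    for mode in MODE_FLAGS:
        print(" ".join(build_ps2ascii_cmd("input.pdf", 3, 7, mode)))
```

Writing stdout to a file from Python does the same job as adding > /path/to/output.txt on the command line.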
Kurt Pfeifle