Some PDF files are created without the special information that is critical for successfully extracting text from them, even with Adobe tools. Specifically, such files do not contain information about mapping glyphs to characters.
Such files will be displayed and printed just fine (because the shapes of the characters are correctly defined), but the text from them cannot be correctly copied / extracted (because there is no information about the meaning of the glyphs / shapes used).
For example, Distiller creates such files when the "Smallest file size" preset is used.
Other than OCR, there is no way to get text out of such files, I'm afraid.
Complementing the original answer
The original answer mentions the “meaning of the glyphs / shapes used”. This information should be contained in a PDF structure called a /ToUnicode table. Such a table is required for each font that is embedded as a subset and uses a non-standard (Custom) encoding.
To quickly evaluate the chances of extracting text content, you can use the pdffonts command-line utility. It prints a table with several items of information about each font used by the PDF. The presence of a /ToUnicode table is indicated by the uni column.
A few example outputs:
$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

$ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
good.pdf allows the text content for both fonts to be extracted correctly, because both fonts have an accompanying /ToUnicode table.
With bad1.pdf and bad2.pdf, text extraction fails for every font marked no in the uni column, because those fonts lack a /ToUnicode table; it succeeds only for a font that has one.
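To automate the check described above, the pdffonts listing can be parsed and the fonts without a /ToUnicode table reported. The following is only a minimal sketch: the function names are hypothetical, and it assumes that pdffonts is on the PATH and that its output keeps the column layout shown in the examples (name, type, encoding, emb, sub, uni, object ID). Each row is read from the right-hand side, since font type names such as "Type 1" can contain spaces.

```python
import subprocess


def run_pdffonts(pdf_path):
    """Run pdffonts on a file and return its text output.

    Assumes the pdffonts executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pdffonts(output):
    """Return a list of (font_name, has_tounicode) pairs.

    Assumes the column order shown in the examples:
    name, type, encoding, emb, sub, uni, object ID (two numbers).
    """
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # A font row has at least 8 tokens; the third token from the
        # right is the uni column and must be "yes" or "no". This also
        # skips the header, the dashed separator, and prompt lines.
        if len(tokens) < 8 or tokens[-3] not in ("yes", "no"):
            continue
        fonts.append((tokens[0], tokens[-3] == "yes"))
    return fonts


def fonts_without_tounicode(output):
    """Names of fonts whose uni column says 'no'."""
    return [name for name, has_uni in parse_pdffonts(output) if not has_uni]


if __name__ == "__main__":
    # Sample listing in the layout shown above (one font lacks /ToUnicode).
    sample = """\
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
"""
    print(fonts_without_tounicode(sample))
```

An empty result does not guarantee clean extraction, since a /ToUnicode table can be present and still contain wrong mappings, but a non-empty result is a quick warning that text set in those fonts will probably not extract correctly.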
I, Kurt Pfeifle, recently created a series of hand-coded PDF files to demonstrate the impact of existing, buggy, manipulated, or missing /ToUnicode tables in the PDF source code. These PDF files are extensively commented and suitable for study with a text editor. The pdffonts output examples above were created using these hand-coded files. (There are several more PDF files showing different results that an interested reader might want to study ...)
Bobrovsky