Extract text from a distorted PDF file

I have a PDF file with valuable textual information.

The problem is that I cannot extract the text; all I get is a bunch of garbled characters. The same thing happens if I copy and paste text from a PDF reader into a text file. Even File → Save As Text in Acrobat Reader fails.

I have tried every tool I could get my hands on, and the result is always the same. I think this has something to do with font embedding, but I don't know what exactly.

My questions:

  • What causes this strange garbling of the text?
  • How can I extract the text content from such a PDF (programmatically, with a tool, by manipulating the bits directly, etc.)?
  • How can I fix the PDF so that the text is not garbled when copied?
+8
pdf file-format text-analysis
3 answers

I have turned to many people for help, and OCR is the only solution to this problem.

+11

Some PDF files are produced without the special information that is crucial for successfully extracting text from them, even with Adobe's own tools. Basically, such files do not contain any information that maps glyphs to characters.

Such files will be displayed and printed just fine (because the shapes of the characters are correctly defined), but the text from them cannot be correctly copied / extracted (because there is no information about the meaning of the glyphs / shapes used).

For example, Distiller creates such files when the "Smallest file size" preset is used.

I'm afraid there is no way other than OCR to get text from such files.


Complementing the original answer

The original answer mentions the "meaning of the glyphs / shapes used". This information should be contained in a PDF structure called the /ToUnicode table. Such a table is required for each font that is embedded as a subset and uses a non-standard (Custom) encoding.
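To make the role of the /ToUnicode table more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular extractor is implemented: the CMap fragment is hand-written for this example, the function names are made up, and only the simple `bfchar` entries of a CMap are handled (real /ToUnicode streams can also contain `bfrange` sections and are usually compressed).

```python
import re

# Hypothetical, hand-written fragment of the kind found in a /ToUnicode CMap:
# each entry maps a character code used in the page content to a Unicode value.
SAMPLE_CMAP = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} built from bfchar entries only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode(codes, to_unicode=None):
    """Decode character codes from a content stream.

    Without a mapping, fall back to treating each code as a code point,
    which is roughly why extraction yields garbage in that case.
    """
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)

codes = [0x01, 0x02, 0x03]  # what the page content actually stores
print(repr(decode(codes, parse_bfchar(SAMPLE_CMAP))))  # with a /ToUnicode table
print(repr(decode(codes)))                             # without one
```

With the mapping, the three codes decode to readable text; without it, the extractor only sees the arbitrary codes assigned when the font was subset, even though the glyph shapes still render correctly.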

To quickly evaluate the chances of extracting the text content, you can use the pdffonts command-line utility. It prints a table with several pieces of information about each font used by the PDF. The presence of a /ToUnicode table is indicated by the uni column.

A few example outputs:

    $ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-good.pdf
    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes yes     13  0

    $ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad1.pdf
    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes no      12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

    $ kp@mbp:git.PDF101.angea> pdffonts handcoded/textextract/textextract-bad2.pdf
    name                     type        encoding   emb sub uni object ID
    ------------------------ ----------- ---------- --- --- --- ---------
    BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
    CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0

For good.pdf, the text content can be extracted correctly for both fonts, because both fonts have an accompanying /ToUnicode table.

For bad1.pdf and bad2.pdf, text extraction succeeds only for fonts that have a /ToUnicode table and fails for every font that lacks one (uni column shows no).
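The check of the uni column can also be automated. The following Python sketch parses pdffonts output that has already been captured as text and lists the fonts without a /ToUnicode table. It assumes the column layout shown in this answer (a header row followed by a dashed ruler, with single-token values in all columns to the right of uni) and font names without spaces; the function name and the sample text are made up for illustration.

```python
# Sample text in the layout printed by pdffonts in this answer (assumed layout).
SAMPLE_OUTPUT = """\
name                     type        encoding   emb sub uni object ID
------------------------ ----------- ---------- --- --- --- ---------
BAAAAA+Helvetica         TrueType    WinAnsi    yes yes yes     12  0
CAAAAA+Helvetica-Bold    TrueType    WinAnsi    yes yes no      13  0
"""

def fonts_missing_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column reads 'no'."""
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    # The dashed ruler separates the header row from the font rows.
    ruler = next(i for i, line in enumerate(lines) if line.lstrip().startswith("---"))
    header = lines[ruler - 1].split()
    # Count the 'uni' position from the right, because the 'type' column
    # may contain spaces (e.g. "Type 1") and would break left-based indexing.
    uni_pos = header.index("uni") - len(header)
    missing = []
    for row in lines[ruler + 1:]:
        tokens = row.split()
        if len(tokens) >= -uni_pos and tokens[uni_pos] == "no":
            missing.append(tokens[0])
    return missing

print(fonts_missing_tounicode(SAMPLE_OUTPUT))
```

To run this against a real file, the text could be captured with `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout`, provided pdffonts is installed. Any font reported by the function is a likely source of garbled characters on copy or extraction.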

I, Kurt Pfeifle, recently created a series of hand-coded PDF files to demonstrate the impact of existing, buggy, manipulated, or missing /ToUnicode tables in the PDF source code. These PDF files are extensively commented and suitable for study with a text editor. The pdffonts output examples above were created using these hand-coded files. (There are several more PDF files showing different results that an interested reader might want to study ...)

+22

I had the same problem. Uploading the file to Google Drive, opening it with Google Docs, and copying the text from there worked for me.

+3