Copying + pasting text from PDF leads to garbage

I am writing a master's thesis on an NLP system. One of its components is a text extractor.

It extracts plain text from PDF files. Several PDF files cannot be extracted correctly. For these, the extractor (the PDFBox library) returns a line like the following:

"┤xDn║if | d├gDF" Ti & cD╬lh d FÁhis ~ n ╗xd f "d┤ffih" h "

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that has this extraction problem, and the text of these files cannot be copied correctly from a PDF reader either (I tried Adobe Reader and Foxit Reader). The files display fine in these readers, but when I select the content and copy it to the clipboard, I get the same incorrect text as described above: lines of meaningless characters, or lines of numbers and letters.

Can someone help me?

+7
pdf pdfbox
7 answers

If you can successfully select and copy text in Adobe Reader (which indicates that the PDF file contains text objects), but the text turns into a bunch of garbage characters when you paste it into Notepad, then the problem is probably related to the CMap used by the selected text.

The PDF specification offers many options for displaying text content and, correspondingly, for extracting it. A CMap specifies the mapping from character codes to character selectors. The PDF specification describes some predefined CMaps, but other CMaps can also be embedded in the file.

My guess is that either the CMap for this text is corrupted, or the PDFBox library does not support this CMap. I suggest trying a different SDK just to see whether you get different results.
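To illustrate why a missing or incomplete mapping produces garbage, here is a toy sketch in Python. It is not how PDFBox or any real extractor is implemented; the `decode` function, the character codes, and the mapping table are all made up for illustration, and the Latin-1 fallback is just one assumed behaviour an extractor might have.

```python
# Toy model only: real PDFs store this mapping in CMap / ToUnicode streams.
# All codes and table entries below are hypothetical.

def decode(codes, to_unicode=None):
    """Turn character codes into text roughly the way an extractor might."""
    if to_unicode is None:
        # No mapping available: assume the extractor falls back to
        # interpreting each code as a Latin-1 byte.
        return bytes(codes).decode("latin-1")
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

# Hypothetical subset font whose glyphs were numbered in order of first use.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}
codes = [1, 2, 3, 3, 4]

print(decode(codes, to_unicode))   # readable text
print(repr(decode(codes)))         # control characters, i.e. garbage
```

The page still renders correctly in both cases, because the viewer draws glyphs by code; only the step from code back to a character is affected.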

+5

What created the PDF file? Some PDF files do not contain any encoding information, only the data needed to draw the text. In that case it is impossible to extract the text.

+1

Very often, in cases where you cannot select and copy the text from the Acrobat (Reader) window, there is another option that may work:

  • open the File menu,
  • select "Save as ...",
  • select "Text (plain) (*.txt)",
  • find the target directory,
  • enter the name you want to use for the text file.

You will get all the text from all the pages in one file, and you will then need to locate the passage you originally wanted to copy, so it is not as convenient as a direct copy. But it works more reliably.

It also works with acroread on Linux (but you should select "Save as text ..." in the file menu).

Update

You can use the pdffonts command-line utility to get a quick analysis of the fonts used by a PDF.

Here is an example output that shows where a problem with text extraction is very likely. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide example PDF files which are well commented and can easily be opened in a text editor:

 $ pdffonts textextract-bad2.pdf
 name                            type         encoding    emb sub uni object ID
 ------------------------------- ------------ ----------- --- --- --- ---------
 BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
 CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

  • The above PDF file uses two subsetted fonts (as shown by the prefixes BAAAAA+ and CAAAAA+ in their names, as well as the yes entries in the sub column): Helvetica and Helvetica-Bold.
  • Both fonts are of type TrueType.
  • Both fonts use WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs to be drawn). However, only the /Helvetica font has a /ToUnicode table available inside the PDF; /Helvetica-Bold does not, as indicated by the yes / no entries in the uni column.

The /ToUnicode table should provide the reverse mapping, from character identifiers / character codes back to Unicode characters.

A missing /ToUnicode table for a particular font is almost always a sure indicator that text strings using this font cannot be extracted or copied from the PDF. (Even if a /ToUnicode table is present, extracting text may still be a problem, because the table may be damaged, incorrect or incomplete. This can be seen in many real-world PDF files, and is also shown by several related files in the aforementioned GitHub repository.)
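If you have many files to check, the uni column can be read by a script. The following Python sketch is only an illustration: the function name `fonts_without_tounicode` is made up, it assumes the seven-column layout shown above, and it assumes font names contain no spaces. It parses each row from the right because the type column may contain spaces (for example "Type 1").

```python
def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column says 'no'."""
    flagged = []
    in_table = False
    for line in pdffonts_output.splitlines():
        if line.startswith("---"):
            in_table = True  # font rows start after the dashed separator
            continue
        tokens = line.split()
        if not in_table or len(tokens) < 8:
            continue
        # Counting from the right: object number, generation, then 'uni'.
        if tokens[-3] == "no":
            flagged.append(tokens[0])
    return flagged

# Sample text copied from the pdffonts output shown in this answer.
SAMPLE = """\
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""

print(fonts_without_tounicode(SAMPLE))
```

To run it on a real file you could capture the tool's output first, for instance with `subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout`, provided pdffonts is installed. An empty result does not guarantee clean extraction, since an existing table can still be wrong.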

+1

When the file is opened as a Gmail attachment in Chrome (using its built-in PDF viewer), copying yields normal, readable characters!

This worked for me when I had this problem, and for others as well. I think the Chrome PDF viewer automatically uses Google Drive's OCR... It feels like magic!

+1

The best way to handle this (assuming you have Adobe Acrobat or something similar; I am not sure whether Reader can do this) is to save the document as JPEG images. Then recombine all the images into a single PDF file and run the OCR function to recognize the text on the pages. After that you can copy and paste the text.

0

Select the text you want to copy. Right-click. Select the Export As option. In the dialog box, choose a file name and save the new file in Rich Text Format (RTF). Open the RTF file to see the text!

-1

PDF is not a text document format. It is rather a vector graphics format, which may sometimes contain text. Thus, there are some documents from which you cannot extract text unless you are willing to do OCR. That is just how it is.

-2