Very often, when you cannot select and copy text directly from the Acrobat (Reader) window, there is another option that may work:
- Open the File menu,
- select "Save as ...",
- select "Text (plain) (*.txt)",
- find the target directory,
- enter the name you want to use for the text file.
You will get all the text from all the pages in the file, so you will need to locate the passage you originally wanted to copy. That makes it less convenient than a direct copy and paste, but it works more reliably.
It also works with acroread on Linux (but you should select "Save as text ..." in the file menu).
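Since the exported file contains the text of every page, locating the passage you wanted can be tedious. A minimal sketch, assuming the export was saved as UTF-8 text: the helper name `find_passages` and the file name `exported.txt` are made up for illustration and are not part of Acrobat or acroread.

```python
def find_passages(text, phrase, context=1):
    """Return (line_number, snippet) pairs for each line containing phrase.

    The match is case-insensitive; `context` is the number of lines to
    include before and after each matching line.
    """
    lines = text.splitlines()
    needle = phrase.lower()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[lo:hi])))
    return hits


def search_exported_file(path, phrase, context=1):
    """Read a text file exported from the PDF viewer and search it."""
    # errors="replace" keeps going even if the export has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_passages(handle.read(), phrase, context)
```

Calling `search_exported_file("exported.txt", "some phrase")` would then list the line number and surrounding lines of every match, which you can copy from.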
Update
You can use the pdffonts command-line utility to get a quick analysis of the fonts used by a PDF.
Here is an example output that shows where text extraction is very likely to be a problem. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide sample PDF files which are well commented and can easily be opened in a text editor:
    $ pdffonts textextract-bad2.pdf
    name                            type         encoding    emb sub uni object ID
    ------------------------------- ------------ ----------- --- --- --- ---------
    BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
    CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
How to interpret this table?
- The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
- Both fonts are of type TrueType.
- Both fonts use a WinAnsi encoding (a font encoding maps the char identifiers used in the PDF source code to the glyphs that should be drawn). However, only for the /Helvetica font is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes / no in the uni column.
The /ToUnicode table should provide the reverse mapping, from character identifiers / character codes back to characters.
A missing /ToUnicode table for a particular font is almost always a reliable indicator that text strings using this font cannot be extracted or copied from the PDF. (Even if a /ToUnicode table is present, extracting text may still be a problem, because the table may be damaged, incorrect or incomplete, as can be seen in many real-world PDF files and as several related files in the aforementioned GitHub repository also demonstrate.)
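The check described above (look for a no in the uni column) can be automated. This is only a sketch under stated assumptions: pdffonts is installed and on the PATH, its output has the column layout shown in the example, and font names contain no spaces. The helper names `parse_pdffonts`, `fonts_without_tounicode` and `run_pdffonts` are invented for illustration.

```python
import subprocess


def parse_pdffonts(output):
    """Parse the text printed by pdffonts into a list of dicts."""
    lines = output.splitlines()
    # Skip the header: data rows start after the dashed separator line.
    start = 0
    for i, line in enumerate(lines):
        if line.startswith("---"):
            start = i + 1
            break
    fonts = []
    for line in lines[start:]:
        # Split from the right, because the type column may contain
        # spaces (for example "Type 1"), while the last six fields do not.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        name_and_type, encoding, emb, sub, uni, obj_num, obj_gen = parts
        if uni not in ("yes", "no"):
            continue  # not a data row (e.g. the header line)
        # Assumption: the font name itself contains no spaces.
        name, _, font_type = name_and_type.strip().partition(" ")
        fonts.append({
            "name": name,
            "type": font_type.strip(),
            "encoding": encoding,
            "emb": emb,
            "sub": sub,
            "uni": uni,
            "object_id": (int(obj_num), int(obj_gen)),
        })
    return fonts


def fonts_without_tounicode(fonts):
    """Return the names of fonts whose uni column says 'no'."""
    return [font["name"] for font in fonts if font["uni"] == "no"]


def run_pdffonts(pdf_path):
    """Run pdffonts on a file and return its parsed output."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdffonts(result.stdout)
```

For the example file, `fonts_without_tounicode(run_pdffonts("textextract-bad2.pdf"))` should report only the bold font. Keep in mind that a yes in the uni column does not guarantee that extraction will work, because the table itself may still be wrong.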
Kurt Pfeifle