Advanced PDF analysis using Python (extracting text without tables, etc.): What is the best library?

Question

I am looking for a PDF library that will allow me to extract text from a PDF document. I looked at PyPDF, and it extracts text from a PDF document very nicely. The problem is that if the document contains tables, the text in the tables is extracted inline with the rest of the document text. This is problematic because it produces sections of text that are not useful and look garbled (for example, lots of numbers run together).

I am looking for something more advanced. I would like to extract text from a PDF document, excluding any tables and special formatting. Is there a library that does this? Or am I forced to do some post-processing of the output text to get rid of these sections?

+79

python parsing pdf information-extraction text-extraction

Mike Cialowicz Dec 04 '09 at 17:28

2 answers

This is a difficult problem to solve, since visually similar PDF files can have completely different internal structures depending on how they were created. In the worst case, the library would essentially have to act as an OCR engine. On the other hand, a PDF may contain enough structure and metadata to make removing tables and figures easy, and a library can be tailored to take advantage of that.

I'm fairly sure there are no open source tools that solve your problem for a wide variety of PDF files, but I remember hearing about commercial software that claims to do what you ask. I am sure you will come across it while searching the Internet.

0

akaihola Dec 04 '09 at 23:14

Etienne · Accepted Answer · 2009-12-05 03:07

You can also take a look at PDFMiner, another PDF parser in Python.

A useful feature of PDFMiner is that you can control how it groups pieces of text as it extracts them. You do this by specifying the spacing allowed between lines, words, characters, and so on. By tuning these values you may be able to get what you want (it depends on how much your documents vary). PDFMiner can also give you the location of each piece of text on the page, retrieve data by object ID, and more. So dig into PDFMiner and be creative!
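
As a rough sketch of the grouping controls described above, the following assumes the maintained fork pdfminer.six is installed; its high-level API is newer than this answer. `LAParams` carries the character, word, and line margins, and each text box exposes its position through `bbox`. The `looks_tabular` helper and its 0.5 threshold are hypothetical heuristics added for illustration, not part of PDFMiner, and would need tuning for real documents.

```python
import re


def looks_tabular(text, max_numeric_ratio=0.5):
    """Hypothetical heuristic: flag a text box whose tokens are mostly numbers."""
    tokens = text.split()
    if not tokens:
        return False
    # A token counts as numeric if it is digits with optional sign, separators, or %.
    numeric = sum(1 for t in tokens if re.fullmatch(r"[-+(]?\d[\d.,]*[%)]?", t))
    return numeric / len(tokens) > max_numeric_ratio


def extract_prose(path):
    """Return (page id, top y, text) for text boxes that do not look tabular."""
    # Imported here so the heuristic above can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # Margins that decide how characters are merged into words, lines, and boxes.
    laparams = LAParams(char_margin=2.0, word_margin=0.1, line_margin=0.5)
    kept = []
    for page in extract_pages(path, laparams=laparams):
        for element in page:
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                if not looks_tabular(text):
                    x0, y0, x1, y1 = element.bbox  # position on the page
                    kept.append((page.pageid, round(y1), text.strip()))
    return kept
```

Calling `extract_prose("report.pdf")` (the file name is a placeholder) would return only the boxes that pass the filter. Whether a table ends up in its own box depends heavily on the margins chosen and on how the PDF was produced.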

But your problem is genuinely hard to solve, because the text in a PDF is not continuous: it consists of many small groups of characters, each positioned absolutely on the page. The main purpose of PDF is to keep the layout intact. It is focused on presentation, not on content.
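
To illustrate why absolute positioning makes this hard, here is a small sketch in plain Python, with no PDF library involved. It assumes the fragments have already been pulled out as hypothetical `(x, y, text)` tuples and rebuilds reading order by bucketing them on their vertical position. The function name, the tuple layout, and the tolerance value are all assumptions made for this example.

```python
from collections import defaultdict


def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so fragments on nearly the same baseline share a row.
        rows[round(y / y_tolerance)].append((x, text))
    lines = []
    # PDF coordinates grow upward, so larger y values come first on the page.
    for key in sorted(rows, reverse=True):
        lines.append(" ".join(text for _, text in sorted(rows[key])))
    return lines


# Fragments arrive in drawing order, which need not match reading order.
sample = [(72, 700, "Hello"), (300, 650, "42"), (110, 700, "world"), (72, 650, "Total")]
result = rebuild_lines(sample)
```

This bucketing is crude: two fragments on the same visual line can fall into different buckets if they straddle a boundary, and nothing here distinguishes a table row from a sentence. That ambiguity is exactly what layout analysis has to resolve.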
