Extract text from PDF

Question

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many utilities available for this, all formatting is lost and the tabular data in the PDF gets jumbled. Is it possible to use Python to extract text from a PDF by specifying positions, etc.?

Thanks.

+7

python pdf

Mridang agarwalla Jun 30 '10 at 11:31

4 answers

I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/. One of its utilities is pdftotext, but I think it all comes down to how the PDF file was created.

+1

Verakso Feb 10 '11 at 22:14

$ pdftotext -layout thingwithtablesinit.pdf

will create a text file thingwithtablesinit.txt in which the tables keep their original layout.
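Since the question mentions a whole batch of files and Python, the same command can be driven from a script. The following is only a sketch for Python 3: it assumes the pdftotext binary is installed and on the PATH, and the helper names build_pdftotext_command and convert_folder are made up for this illustration.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for `pdftotext -layout <pdf> <txt>`."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Run pdftotext -layout on every *.pdf in `folder`.

    Returns the list of .txt paths that were requested.
    Assumes pdftotext is installed and reachable via PATH.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext reports failure
        subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
        written.append(txt_path)
    return written
```

Calling convert_folder("reports") would write one .txt file next to each PDF in a (hypothetical) reports directory.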

+1

John lawrence aspden Jan 13 '12 at 13:07

As explained in other answers, extracting text from a PDF is not a straightforward task. However, there are Python libraries, such as pdfminer (pdfminer3k for Python 3), that handle it quite well.

The code snippet below shows a Python class that can be created to extract text from a PDF. This will work in most cases.

(source - https://gist.github.com/vinovator/a46341c77273760aa2bb )

    # Python 2.7.6
    # PdfAdapter.py

    """ Reusable library to extract text from pdf file
    Uses pdfminer library; For Python 3.x use pdfminer3k module
    Below links have useful information on components of the program
    https://euske.imtqy.com/pdfminer/programming.html
    http://denis.papathanasiou.org/posts/2010.08.04.post.html
    """

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    # From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    # from pdfminer.pdfdevice import PDFDevice
    # To raise exception whenever text extraction from PDF is not allowed
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.converter import PDFPageAggregator
    import logging

    __doc__ = "Reusable library to extract text from pdf file"
    __name__ = "pdfAdapter"

    """ Basic logging config """
    log = logging.getLogger(__name__)
    log.addHandler(logging.NullHandler())


    class pdf_text_extractor:
        """ Modules overview:
        - PDFParser: fetches data from pdf file
        - PDFDocument: stores data parsed by PDFParser
        - PDFPageInterpreter: processes page contents from PDFDocument
        - PDFDevice: translates processed information from PDFPageInterpreter
          to whatever you need
        - PDFResourceManager: Stores shared resources such as fonts or images
          used by both PDFPageInterpreter and PDFDevice
        - LAParams: A layout analyzer returns a LTPage object for each page
          in the PDF document
        - PDFPageAggregator: Extends the device to a page aggregator to get
          LT object elements
        """

        def __init__(self, pdf_file_path, password=""):
            """ Class initialization block.
            pdf_file_path - Full path of pdf including name
            password - If not passed, assumed as none
            """
            self.pdf_file_path = pdf_file_path
            self.password = password

        def getText(self):
            """ Algorithm:
            1) Transfer information from PDF file to PDF document object
               using parser
            2) Open the PDF file
            3) Parse the file using PDFParser object
            4) Assign the parsed content to PDFDocument object
            5) Now the information in this PDFDocument object has to be
               processed. For this we need PDFPageInterpreter, PDFDevice
               and PDFResourceManager
            6) Finally process the file page by page
            """

            # Open and read the pdf file in binary mode
            with open(self.pdf_file_path, "rb") as fp:

                # Create parser object to parse the pdf content
                parser = PDFParser(fp)

                # Store the parsed content in PDFDocument object
                document = PDFDocument(parser, self.password)

                # Check if document is extractable, if not abort
                if not document.is_extractable:
                    raise PDFTextExtractionNotAllowed

                # Create PDFResourceManager object that stores shared resources
                # such as fonts or images
                rsrcmgr = PDFResourceManager()

                # set parameters for analysis
                laparams = LAParams()

                # Create a PDFDevice object which translates interpreted
                # information into desired format
                # Device to connect to resource manager to store shared resources
                # device = PDFDevice(rsrcmgr)
                # Extend the device to a page aggregator to get LT object elements
                device = PDFPageAggregator(rsrcmgr, laparams=laparams)

                # Create interpreter object to process content from PDFDocument
                # Interpreter needs to be connected to resource manager for shared
                # resources and device
                interpreter = PDFPageInterpreter(rsrcmgr, device)

                # Initialize the text
                extracted_text = ""

                # Ok now that we have everything to process a pdf document,
                # lets process it page by page
                for page in PDFPage.create_pages(document):
                    # As the interpreter processes the page stored in PDFDocument
                    # object
                    interpreter.process_page(page)
                    # The device renders the layout from interpreter
                    layout = device.get_result()
                    # Out of the many LT objects within layout, we are interested
                    # in LTTextBox and LTTextLine
                    for lt_obj in layout:
                        if (isinstance(lt_obj, LTTextBox) or
                                isinstance(lt_obj, LTTextLine)):
                            extracted_text += lt_obj.get_text()

            return extracted_text.encode("utf-8")

Note: there are other libraries, such as PyPDF2, that are good at manipulating PDFs, for example merging PDF pages or splitting or cropping specific pages out of a PDF.

0

Hvs Jul 12 '16 at 14:07

mark stephens · Accepted Answer · 2010-07-01T07:09:04+0000

PDF files do not contain tabular data unless they contain structured content. Some tools include heuristics that try to guess the data structure and reconstruct it. I wrote a blog article explaining the problems with extracting PDF text at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
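To give a rough idea of what such a heuristic can look like, here is a small Python 3 sketch that groups positioned text fragments into table rows by their vertical coordinate. It is not the method used by any particular tool. The (x0, y0, text) tuples are hand-written sample data (pdfminer layout objects, for instance, expose comparable coordinates through their bbox attribute), and the function name group_into_rows and the tolerance of 2.0 units are assumptions made for this illustration.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x0, y0, text) fragments into rows, top to bottom.

    PDF coordinates grow upwards, so a larger y0 is higher on the page.
    Fragments whose y0 lies within `y_tolerance` of the first fragment
    of the current row are treated as belonging to that row.
    """
    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Within each row, order the cells from left to right
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, as they might come out of a layout analysis
sample = [
    (200, 700.5, "Price"),
    (50, 700.0, "Item"),
    (50, 680.0, "Apple"),
    (200, 679.2, "1.20"),
]
table = group_into_rows(sample)
```

Real documents need more than this (column detection, wrapped cells, rotated text), which is why the results of such guessing vary from one PDF to another.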
