Finding text in PDF using Python?

Problem
I am trying to determine what type a document is (e.g. prayer, correspondence, agenda, etc.) by looking through its text, preferably using Python. All of the PDFs are searchable, but I have not found a way to parse them directly with Python and search them with a script (short of converting each one to a text file first, which could be resource-intensive for n documents).

What I have done so far
I have looked at pypdf, pdfminer, the Adobe PDF documentation, and any related questions I could find (though none seemed to solve this problem directly). pdfminer seems to have the most potential, but after reading the documentation I'm not even sure where to start.

Is there a simple, effective method for reading PDF text, whether by page, line, or the entire document? Or any other workarounds?

+13
python text parsing pdf
6 answers

This is called PDF mining, and it is very difficult because:

  • PDF is a document format designed for printing, not for parsing. Inside a PDF document, the text is in no particular order (unless the order matters for printing); most of the time the original text structure is lost (letters may not be grouped into words, words may not be grouped into sentences, and the order in which they are placed on the page is often random).
  • There are many programs that create PDF files, many of which are defective.

Tools like pdfminer use heuristics to re-group letters and words based on their position on the page. I agree the interface is rather low-level, but it makes more sense once you know what problem they are trying to solve (in the end, what matters is choosing how close a letter/word/line must be to its neighbors in order to be considered part of the same paragraph).
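
For what it's worth, here is a minimal sketch of that kind of extraction, assuming the newer pdfminer.six fork is installed (the LAParams values are illustrative starting points, not tuned for any particular document):

    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Layout-analysis parameters control how aggressively letters and lines
    # are grouped back into words and paragraphs.
    laparams = LAParams(line_margin=0.5, word_margin=0.1)

    text = extract_text("document.pdf", laparams=laparams)
    print(text[:500])  # first 500 characters of the recovered text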

An expensive alternative (in terms of time/computing power) is to render each page as an image and feed it to OCR; it may be worth a try if you have a very good OCR engine.
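
A rough sketch of that route, assuming the pdf2image and pytesseract packages (plus the poppler and Tesseract binaries they wrap) are available; these package choices are mine, not part of the original answer:

    from pdf2image import convert_from_path
    import pytesseract

    # Render each PDF page to an image, then OCR it back to text.
    pages = convert_from_path("document.pdf", dpi=300)
    for number, page in enumerate(pages, start=1):
        text = pytesseract.image_to_string(page)
        print("page %d: %d characters recovered" % (number, len(text)))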

So my answer is no, there is no simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proved wrong.

[Update]

My answer has not changed, but recently I was involved in two projects: one uses computer vision to extract data from scanned hospital forms, the other extracts data from court records. What I learned:

  1. As of 2018, computer vision is within reach of mere mortals. If you have a good sample of already-classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine the type of a document.

  2. If the PDFs you are analyzing are "searchable", you can extract all the text very quickly using software such as pdftotext and then classify it with a Bayesian filter (the same kind of algorithm used to classify spam); a rough sketch follows this list.
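
As a hedged sketch of that second idea, assuming scikit-learn is installed and you already have a handful of labelled example PDFs (the file names and labels below are made up purely for illustration):

    import subprocess
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def pdf_to_text(path):
        # pdftotext writes to stdout when the output file is "-"
        return subprocess.run(["pdftotext", path, "-"],
                              capture_output=True, text=True).stdout

    # Hypothetical training set: already-classified PDFs and their labels.
    train_files = ["prayer1.pdf", "letter1.pdf", "agenda1.pdf"]
    train_labels = ["prayer", "correspondence", "agenda"]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit([pdf_to_text(f) for f in train_files], train_labels)

    print(classifier.predict([pdf_to_text("unknown.pdf")]))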

So there is no reliable and effective method for extracting text from PDF files, but you may not need one to solve the problem at hand (document type classification).

+24

I have written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, etc.), and @Paulo Scardine is right - there is no completely reliable and easy way to do this. However, the fastest, most reliable, and least compute-intensive way is to use pdftotext, part of the xpdf suite of tools. It quickly converts searchable PDFs to text files, which you can then read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text; some PDF files contain only images and no text at all.
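
A minimal sketch of driving pdftotext from Python, assuming the pdftotext binary is on your PATH (file names are placeholders):

    import subprocess

    # Convert the PDF to text, preserving the physical layout of the page.
    subprocess.run(["pdftotext", "-layout", "input.pdf", "output.txt"], check=True)

    with open("output.txt", encoding="utf-8") as handle:
        text = handle.read()

    print("agenda" in text.lower())  # simple keyword check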

+8

I agree with @Paulo: PDF data mining is a huge pain. But you may have success with pdftotext, which is part of the Xpdf package available here:

http://www.foolabs.com/xpdf/download.html

This should be enough for your purpose if you are just looking for single keywords.

pdftotext is a command-line utility, but it is very easy to use. It will give you text files, which are much easier to work with.

+3

I recently started using ScraperWiki to do what you described.

Here's an example of using ScraperWiki to extract PDF data.

The scraperwiki.pdftoxml() function returns an XML structure.

Then you can use BeautifulSoup to parse that into a navigable tree.

Here is my code for it -

    import scraperwiki, urllib2
    from bs4 import BeautifulSoup

    def send_Request(url):
        # Get content, regardless of whether it is an HTML, XML or PDF file
        pageContent = urllib2.urlopen(url)
        return pageContent

    def process_PDF(fileLocation):
        # Use this to get the PDF and convert it to XML
        pdfToProcess = send_Request(fileLocation)
        pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
        return pdfToObject

    def parse_HTML_tree(contentToParse):
        # Returns a navigable tree, which you can iterate through
        soup = BeautifulSoup(contentToParse)
        return soup

    pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
    pdfToSoup = parse_HTML_tree(pdf)
    soupToArray = pdfToSoup.findAll('text')
    for line in soupToArray:
        print line

This code will print a whole, big, ugly pile of <text> tags. Each page is separated with a </page>, if that is any consolation.

If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents

If you only need each line of text, not including tags, use line.getText()

This is messy and painful, but it will work if you are searching PDFs. So far I have found it to be accurate, if painful.

+3

I am a complete novice, but somehow this script works for me:

    # import packages
    import PyPDF2
    import re

    # open the pdf file
    object = PyPDF2.PdfFileReader("test.pdf")

    # get number of pages
    NumPages = object.getNumPages()

    # define keyterms
    String = "Social"

    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i))
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        print(ResSearch)
+2

Here is a solution that I found convenient for this problem. In the text variable you get the text from the PDF, so you can search in it. But I also kept the idea of splitting the text into keywords, as I found on this website: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f . From that I made this solution; setting up nltk was not very simple, but it can be useful for further purposes:

    import PyPDF2
    import textract
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords

    def searchInPDF(filename, key):
        occurrences = 0
        pdfFileObj = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        num_pages = pdfReader.numPages
        count = 0
        text = ""
        while count < num_pages:
            pageObj = pdfReader.getPage(count)
            count += 1
            text += pageObj.extractText()
        if text == "":
            # Fall back to OCR when the PDF has no extractable text
            text = textract.process(filename, method='tesseract', language='eng')
        tokens = word_tokenize(text)
        punctuation = ['(', ')', ';', ':', '[', ']', ',']
        stop_words = stopwords.words('english')
        keywords = [word for word in tokens if not word in stop_words and not word in punctuation]
        for k in keywords:
            if key == k:
                occurrences += 1
        return occurrences

    pdf_filename = '/home/florin/Downloads/python.pdf'
    search_for = 'string'
    print searchInPDF(pdf_filename, search_for)
+1
