Find which page a search string is on in a PDF using Python

Which Python packages can I use to find out which page a specific search string is on?

I looked at several Python PDF packages but couldn't figure out which one to use. PyPDF does not seem to have this functionality, and PDFMiner seems too complex for such a simple task. Any advice?

More precisely: I have several PDF documents, and I would like to extract the pages that lie between the "Begin" line and the "End" line.

python pdf pypdf
3 answers

I finally figured out that pyPdf can do this. I am posting my solution in case it helps someone else.

(1) Function to search for a string

    def fnPDF_FindText(xFile, xString):
        # xFile   : the PDF file in which to look
        # xString : the string to look for (used as a regular expression;
        #           the page text is lowercased, so pass it in lowercase)
        # Returns the 0-based index of the first matching page, or -1.
        import pyPdf, re
        PageFound = -1
        pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
        for i in range(0, pdfDoc.getNumPages()):
            content = ""
            content += pdfDoc.getPage(i).extractText() + "\n"
            content1 = content.encode('ascii', 'ignore').lower()
            ResSearch = re.search(xString, content1)
            if ResSearch is not None:
                PageFound = i
                break
        return PageFound

(2) Function to extract the pages of interest

    def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
        from pyPdf import PdfFileReader, PdfFileWriter
        output = PdfFileWriter()
        pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
        # Copy pages xPageStart .. xPageEnd - 1 (0-based; xPageEnd is excluded)
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        outputStream = file(xFileNameOutput, "wb")
        output.write(outputStream)
        outputStream.close()
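To connect the two functions to the original question (pages between "Begin" and "End"), here is a minimal sketch of the range logic only. It works on a plain list of page texts standing in for the `extractText()` output, so it runs without any PDF library. The helper names `find_page` and `pages_between` and the sample texts are made up for illustration.

```python
import re

def find_page(page_texts, pattern, start=0):
    """Return the 0-based index of the first page at or after `start`
    whose text matches `pattern`, or -1 if no page matches."""
    for i in range(start, len(page_texts)):
        if re.search(pattern, page_texts[i]):
            return i
    return -1

def pages_between(page_texts, begin_pattern, end_pattern):
    """Return (first, last_exclusive) for use with range(), or None."""
    first = find_page(page_texts, begin_pattern)
    if first == -1:
        return None
    # Look for the end marker only from the "Begin" page onwards
    last = find_page(page_texts, end_pattern, start=first)
    if last == -1:
        return None
    return first, last + 1  # +1 so the page containing "End" is included

# Hypothetical page texts, one string per page
pages = ["cover", "intro Begin here", "body", "the End", "appendix"]
print(pages_between(pages, "Begin", "End"))  # (1, 4)
```

The resulting pair could be passed as `xPageStart` and `xPageEnd` to `fnPDF_ExtractPages`, because its `range(xPageStart, xPageEnd)` loop also excludes the end index. Note that `fnPDF_FindText` lowercases the page text, so with that function the markers would have to be given in lowercase.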

I hope this will be useful for someone else


In addition to what @user1043144 mentioned, here is what to change to use the code with Python 3.x:

Use PyPDF2

 import PyPDF2 

Use open instead of file

 PdfFileReader(open(xFile, 'rb')) 

I got the expected output with the code below.

The code:

    import PyPDF2
    import re

    # Open the PDF file ("reader" rather than "object", which shadows a built-in)
    reader = PyPDF2.PdfFileReader(r"C:\TEST.pdf")

    # Get the number of pages
    NumPages = reader.getNumPages()

    # Enter the text to search for here (it is used as a regular expression)
    String = "Enter_the_text_to_Search_here"

    # Extract the text of each page and search it
    for i in range(0, NumPages):
        PageObj = reader.getPage(i)
        Text = PageObj.extractText()
        if re.search(String, Text):
            # i is the 0-based page index
            print("Pattern Found on Page: " + str(i))

Example output:

 Pattern Found on Page: 7 
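One caveat with `re.search(String, Text)`: the search string is interpreted as a regular expression, so characters such as `(`, `)` or `.` do not match literally. A small sketch of a literal, case-insensitive check, using only the standard `re` module, might look like this. The function name `page_contains` and the sample text are made up for illustration.

```python
import re

def page_contains(text, search_string):
    """True if `search_string` occurs literally in `text`, ignoring case."""
    # re.escape neutralises regex metacharacters such as "(", ")" and "."
    return re.search(re.escape(search_string), text, re.IGNORECASE) is not None

sample = "Total cost (USD): 3.50"  # hypothetical page text

print(page_contains(sample, "cost (usd)"))  # True
print(page_contains(sample, "3x50"))        # False
```

Without `re.escape`, the pattern `3.50` would also match the text `3x50`, because `.` matches any character.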
