Find which page a search string is on in a PDF using Python

Question

Which Python packages can I use to find out which page a specific "search string" is on?

I looked at several Python PDF packages, but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems too complex for such a simple task. Any advice?

More precisely: I have several PDF documents, and I would like to extract pages that are between the "Begin" line and the "End" line.

python pdf pypdf

user1043144 Sep 24 '12 at 19:50

3 answers

user1043144 · Answer 1 · 2013-01-17T21:35:40+0000

I finally figured out that pyPdf can do this. I am posting my solution in case it helps someone else.

(1) string search function

    def fnPDF_FindText(xFile, xString):
        # xFile   : the PDF file in which to look
        # xString : the string to look for (the page text is lowercased before matching)
        # returns : 0-based index of the first page that matches, or -1 if none does
        import pyPdf, re
        PageFound = -1
        pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
        for i in range(0, pdfDoc.getNumPages()):
            content = ""
            content += pdfDoc.getPage(i).extractText() + "\n"
            content1 = content.encode('ascii', 'ignore').lower()
            ResSearch = re.search(xString, content1)
            if ResSearch is not None:
                PageFound = i
                break
        return PageFound

(2) function to extract pages of interest

    def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
        # Copies pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF file
        from pyPdf import PdfFileReader, PdfFileWriter
        output = PdfFileWriter()
        pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        outputStream = file(xFileNameOutput, "wb")
        output.write(outputStream)
        outputStream.close()

I hope this will be useful to someone else.
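For the original goal (the pages between the "Begin" line and the "End" line), the page range has to be worked out before function (2) can be called. The following is only a sketch under assumptions: `find_page_range` is a hypothetical helper name, it is written for Python 3, and `page_texts` is assumed to be a list holding one extracted text string per page (for example collected from `extractText()`).

```python
import re


def find_page_range(page_texts, begin_pattern, end_pattern):
    """Return (start, stop) as 0-based page indices, stop exclusive.

    start is the first page matching begin_pattern; stop is one past the
    first page at or after start that matches end_pattern.
    Returns None if either marker is not found in that order.
    """
    start = None
    for index, text in enumerate(page_texts):
        # Look for the "Begin" marker until it has been seen once.
        if start is None and re.search(begin_pattern, text):
            start = index
        # Only accept the "End" marker once "Begin" has been seen.
        if start is not None and re.search(end_pattern, text):
            return start, index + 1
    return None
```

The returned pair follows the same convention as `range(xPageStart, xPageEnd)` in function (2), so it could be passed on as `xPageStart` and `xPageEnd`.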

Supernova · Answer 2 · 2018-12-03T04:10:29+0000

In addition to what @user1043144 mentioned:

To use this with Python 3.x:

Use PyPDF2

 import PyPDF2

Use open instead of file

 PdfFileReader(open(xFile, 'rb'))
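One more thing changes under Python 3: `content.encode('ascii', 'ignore')` returns `bytes`, and `re.search` raises a `TypeError` when a `str` pattern is applied to `bytes`, so the encode step has to go. Below is a rough sketch of what a Python 3 port of the search function might look like. The names `find_text_in_pages` and `find_text_in_pdf` are hypothetical, and the code assumes a PyPDF2 release older than 3.0, where `PdfFileReader`, `getNumPages`, `getPage` and `extractText` are still available.

```python
import re


def find_text_in_pages(page_texts, pattern):
    """Return the 0-based index of the first page matching pattern, or -1.

    Page text is lowercased before matching, as in the original function,
    so the pattern should be given in lowercase.
    """
    for index, text in enumerate(page_texts):
        if re.search(pattern, text.lower()):
            return index
    return -1


def find_text_in_pdf(path, pattern):
    """Search a PDF file page by page (assumes PyPDF2 older than 3.0)."""
    import PyPDF2  # imported here so the helper above works without PyPDF2

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfFileReader(handle)
        # Generator: pages are extracted one at a time and the search
        # stops at the first match, while the file is still open.
        texts = (reader.getPage(i).extractText()
                 for i in range(reader.getNumPages()))
        return find_text_in_pages(texts, pattern)
```

Keeping the matching logic separate from the file handling makes it possible to try the search on plain strings without a PDF at hand.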

Prathamesh tanawade · Answer 3 · 2019-06-28T11:17:59+0000

I was able to successfully get the output using the code below.

The code:

    import PyPDF2
    import re

    # Open the PDF file
    reader = PyPDF2.PdfFileReader(r"C:\TEST.pdf")

    # Get the number of pages
    NumPages = reader.getNumPages()

    # Enter the text to search for here
    String = "Enter_the_text_to_Search_here"

    # Extract the text of each page and search it
    for i in range(0, NumPages):
        PageObj = reader.getPage(i)
        Text = PageObj.extractText()
        if re.search(String, Text):
            print("Pattern Found on Page: " + str(i))

Output Example:

 Pattern Found on Page: 7
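Two details of the code above are easy to miss: the printed number is the 0-based index used by `getPage`, so the first page is reported as 0, and the search string is treated as a regular expression, so characters such as `(`, `$` or `.` have a special meaning. If a literal, case-insensitive match is wanted, something along these lines might work. It is a sketch only: `pages_containing` is a hypothetical helper, and `page_texts` is assumed to be a list with one extracted text string per page.

```python
import re


def pages_containing(page_texts, needle, ignore_case=True):
    """Return the 0-based indices of every page that contains needle.

    needle is matched literally: re.escape neutralises regex
    metacharacters such as "(", "$" and ".".
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(needle), flags)
    return [index for index, text in enumerate(page_texts)
            if pattern.search(text)]
```

Adding 1 to each index gives the page numbers as a PDF viewer would usually count them.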
