Create index from pdf

Possible duplicate:
How to specify PDF files and keyword search?

Create an index from a PDF.

+5
source share
1 answer

I think you can use pyPdf Python for this (http://pybrary.net/pyPdf/). This code shows the page numbers that include the desired word:

from pyPdf import PdfFileReader

input = PdfFileReader(file("YourPDFFile.pdf", "rb"))

numberOfPages = input.getNumPages()

i = 1
while i <  numberOfPages:
    oPage = input.getPage(i)
    text = oPage.extractText()
    text.encode('utf8', 'ignore')
    if text.find('What are you looking for') != -1:
        print i
    i += 1

The same thing, but working with Python 3

from pyPdf import PdfFileReader

input = PdfFileReader(open("YourPDFFile.pdf", "rb"))

numberOfPages = input.getNumPages()

i = 1
while i <  numberOfPages:
    oPage = input.getPage(i)
    text = oPage.extractText()
    text.encode('utf8', 'ignore')
    if text.find('What are you looking for') != -1:
        print(i)
    i += 1
+1
source

All Articles