I recently started using ScraperWiki to do what you described.
Here's an example of using ScraperWiki to extract PDF data.
The scraperwiki.pdftoxml() function returns an XML structure.
Then you can use BeautifulSoup to parse that XML into a navigable tree.
Here is my code:
import scraperwiki, urllib2
from bs4 import BeautifulSoup

def send_Request(url):
    # Get the content, regardless of whether it is an HTML, XML or PDF file
    pageContent = urllib2.urlopen(url)
    return pageContent

def process_PDF(fileLocation):
    # Use this to fetch the PDF and convert it to XML
    pdfToProcess = send_Request(fileLocation)
    pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
    return pdfToObject

def parse_HTML_tree(contentToParse):
    # Returns a navigable tree, which you can iterate through
    soup = BeautifulSoup(contentToParse)
    return soup

pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
    print line
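The code above is written for Python 2 (urllib2 and print statements). If you are on Python 3, the same flow should work with urllib.request instead; this is only a minimal sketch, assuming the scraperwiki package and the pdftohtml tool it relies on are installed, and pdf_url_to_soup is just a helper name I made up:

import urllib.request

import scraperwiki
from bs4 import BeautifulSoup

def pdf_url_to_soup(url):
    # Download the raw PDF bytes (any PDF URL should work here).
    pdf_bytes = urllib.request.urlopen(url).read()
    # pdftoxml() converts the PDF into an XML string.
    xml = scraperwiki.pdftoxml(pdf_bytes)
    # Parse the XML into a navigable BeautifulSoup tree.
    return BeautifulSoup(xml, 'html.parser')

soup = pdf_url_to_soup('http://greenteapress.com/thinkstats/thinkstats.pdf')
for line in soup.find_all('text'):
    print(line.get_text())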
This code will print a whole big, ugly pile of <text> tags. Each page is separated by a closing </page> tag, if that is any consolation.
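Since every <text> element sits inside a <page> element, you can also walk the output page by page. Here is a rough sketch built on the pdfToSoup tree from the code above; the number, top and left attributes are ones pdftohtml's XML output normally carries, so treat them as an assumption rather than a guarantee:

for page in pdfToSoup.findAll('page'):
    # Each <page> element should carry a number attribute.
    print 'Page %s' % page.get('number', '?')
    for line in page.findAll('text'):
        # top/left give the rough position of the line on the page (assumed attributes).
        print '  top=%s left=%s: %s' % (line.get('top'), line.get('left'), line.getText())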
If you want the content inside the <text> tags, which might for example include headings wrapped in <b> tags, use line.contents.
If you just want each line of text, without the tags, use line.getText().
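To make that difference concrete, here is a small sketch that treats any <text> line containing a <b> child as a possible heading; the heading check is just my own heuristic, not something from the original code:

for line in soupToArray:
    if line.find('b') is not None:
        # line.contents keeps the child tags, e.g. the <b> wrapper around a heading.
        print 'Possible heading:', line.contents
    else:
        # getText() strips the tags and returns just the text of the line.
        print line.getText()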
This is messy and painful, but it will work for searchable PDFs. So far I have found it to be accurate, but painful.