How to create a Web crawler that can extract specific information from any site?

So, I'm trying to create a web crawler that I can point at any site and have it reliably scrape the user reviews out of the page text. That is, instead of building one scraper for Amazon and another for Overstock, I want a single scraper that can pull product reviews from both, even at some cost in accuracy. I briefly spoke with one of my professors, and he mentioned that I could implement some heuristics and collect data that way (as a basic example, just take all the text inside &lt;p&gt; tags). Right now I'm really just looking for pointers on which direction to take.

(If it matters, I'm currently using mechanize and lxml (Python) to crawl the individual sites.)
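For reference, the heuristic my professor described is only a few lines with lxml (a rough sketch; the sample HTML is made up for illustration):

```python
# Sketch of the "grab all <p> text" heuristic using lxml.
# The sample HTML below is invented for illustration.
import lxml.html

def paragraph_text(html):
    """Return the text content of every <p> element in the page."""
    tree = lxml.html.fromstring(html)
    return [p.text_content().strip() for p in tree.findall('.//p')]

html = "<html><body><p>Great product!</p><p>Would buy again.</p></body></html>"
print(paragraph_text(html))  # ['Great product!', 'Would buy again.']
```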

Thanks!

3 answers

This isn't really an answer, but for the benefit of anyone who comes across this question:

The concept of a "universal" scraper is, at best, an interesting academic exercise. It is unlikely to be achievable in any useful way.

Two useful projects to look at: Scrapy, a Python web-scraping framework, and NLTK (http://www.nltk.org/), the Natural Language Toolkit, a large collection of Python modules for processing natural-language text.


Back in the day (around 1993), I wrote a spider to extract targeted content from various sites, using a set of "rules" for each specific site.

The rules were expressed as regular expressions and were classified as "preparation rules" (those that massaged the retrieved pages to better identify/isolate the extractable data) and "extraction rules" (those that actually pulled out the useful data).

For example, on page:

    <html>
      <head><title>A Page</title></head>
      <body>
        <!-- Other stuff here -->
        <div class="main">
          <ul>
            <li>Datum 1</li>
            <li>Datum 2</li>
          </ul>
        </div>
        <!-- Other stuff here -->
        <div>
          <ul>
            <li>Extraneous 1</li>
            <li>Extraneous 2</li>
          </ul>
        </div>
        <!-- Other stuff here -->
      </body>
    </html>

The rules for retrieving only the "Datum" values could be:

  • trim the leading text using '^.*?<div class="main">'
  • trim the trailing text using '</div>.+</html>$'
  • extract the results using '<li>([^<]+)</li>'

This worked well for the majority of sites, until they changed their layout, at which point the rules for that site needed adjusting.
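The three rules above can be applied directly with Python's re module; the rule strings come from the answer, and everything else (the sample page, the variable names) is illustrative:

```python
# Applying the preparation and extraction rules with regular expressions.
# The rule patterns are from the answer; the sample page is illustrative.
import re

page = """<html> <head><title>A Page</title></head> <body>
<div class="main"> <ul> <li>Datum 1</li> <li>Datum 2</li> </ul> </div>
<div> <ul> <li>Extraneous 1</li> <li>Extraneous 2</li> </ul> </div>
</body> </html>"""

# Preparation rule 1: drop everything up to and including the opening div.
page = re.sub(r'^.*?<div class="main">', '', page, flags=re.DOTALL)
# Preparation rule 2: drop the closing </div> and everything after it.
page = re.sub(r'</div>.+</html>$', '', page, flags=re.DOTALL)
# Extraction rule: pull out the list-item contents.
data = re.findall(r'<li>([^<]+)</li>', page)
print(data)  # ['Datum 1', 'Datum 2']
```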

Today, I would do the same thing using Dave Raggett's HTMLTidy to normalize all retrieved pages into legal XHTML, and XPath/XSLT to massage the pages into the correct format.
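A sketch of that normalize-then-XPath pipeline in Python, using lxml's lenient parser as a stand-in for HTMLTidy (lxml repairs broken markup in much the same spirit; the messy fragment below is invented):

```python
# Normalize broken HTML, then extract with XPath instead of regex rules.
# lxml's parser is used here in place of HTMLTidy; the input is made up.
import lxml.html

messy = '<html><body><div class="main"><ul><li>Datum 1<li>Datum 2</ul></div></body></html>'
tree = lxml.html.fromstring(messy)  # parser closes the unclosed <li> tags

# XPath replaces the hand-written extraction rule:
items = tree.xpath('//div[@class="main"]//li/text()')
print(items)  # ['Datum 1', 'Datum 2']
```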


There is an RDF vocabulary for reviews, as well as a review microformat (hReview). If your reviews are in one of those formats, they will be easy to parse.

