Back in the day (around 1993), I wrote a spider that extracted target content from different sites, using a set of "rules" for each specific site.
The rules were expressed as regular expressions and fell into two classes: "training" rules (those that massaged the fetched pages to better identify and isolate the target data) and "extraction" rules (those that actually pulled out the useful data).
For example, given the page:

    <html>
      <head><title>A Page</title></head>
      <body>
        <div class="main">
          <ul>
            <li>Datum 1</li>
            <li>Datum 2</li>
          </ul>
        </div>
        <div>
          <ul>
            <li>Extraneous 1</li>
            <li>Extraneous 2</li>
          </ul>
        </div>
      </body>
    </html>
The rules for retrieving only the "Datum" values could be (a sketch of them in action follows the list):
- trimming off the beginning of the page using '^.*?<div class="main">'
- trimming off the tail of the page using '</div>.+</html>$'
- extracting the result using '<li>([^<]+)</li>'
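
A minimal, hypothetical sketch of how such rules could be applied, using Python's re module purely for illustration (the original spider's implementation is not described in this much detail):

    import re

    PAGE = ('<html> <head><title>A Page</title></head> <body> '
            '<div class="main"> <ul> <li>Datum 1</li> <li>Datum 2</li> </ul> </div> '
            '<div> <ul> <li>Extraneous 1</li> <li>Extraneous 2</li> </ul> </div> '
            '</body> </html>')

    # "Training" rules: massage the page so only the region of interest remains.
    trimmed = re.sub(r'^.*?<div class="main">', '', PAGE, flags=re.DOTALL)  # drop everything before the main div
    trimmed = re.sub(r'</div>.+</html>$', '', trimmed, flags=re.DOTALL)     # drop the main div's close and everything after it

    # "Extraction" rule: pull the useful data out of what is left.
    data = re.findall(r'<li>([^<]+)</li>', trimmed)
    print(data)  # ['Datum 1', 'Datum 2']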
This worked well for the majority of sites, as long as they did not change their layout; whenever a site did, its rules had to be adjusted.
Today I would do the same thing using Dave Raggett's HTML Tidy to normalize all of the fetched pages into legal XHTML, and XPath/XSLT to massage the pages into the correct format.
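
A rough sketch of that modern approach, assuming lxml as a stand-in for the HTML Tidy normalization step (its HTML parser tolerates tag soup) and its XPath support in place of a full XSLT pipeline:

    from lxml import html  # tolerant HTML parser, standing in for HTML Tidy + XPath/XSLT

    PAGE = ('<html><head><title>A Page</title></head><body>'
            '<div class="main"><ul><li>Datum 1</li><li>Datum 2</li></ul></div>'
            '<div><ul><li>Extraneous 1</li><li>Extraneous 2</li></ul></div>'
            '</body></html>')

    tree = html.fromstring(PAGE)
    # A single XPath expression replaces both rule types: it locates the
    # region of interest and extracts the useful data in one step.
    data = tree.xpath('//div[@class="main"]//li/text()')
    print(data)  # ['Datum 1', 'Datum 2']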