Find repeating patterns on web pages in Ruby

I am trying to find a way to detect repeating patterns on web pages so that I can extract the content into my database.

EDIT: I don’t know what the repeating pattern is beforehand, so I can’t just search for a given pattern with a regex or something like that.

For example, say you have 10 sites selling cars, and the sites are all different. Looking at each site, the cars are listed in the HTML in a repeating way down the page for that site.

Other sites will list them differently, but each will have a repeating pattern.

Does anyone know how to do this, or have experience with this kind of thing?

I love Ruby, so I was hoping to do it in Ruby, if anyone knows how or knows of any libs / gems that can help me.

+5
2 answers

Rick, machine pattern matching is a complex topic, not something you will find in an off-the-shelf Ruby library.

Kyle's answer is a start. Once you have fetched the page with Ruby, the typical technology for this is XPath, the "XML Path Language".

Using XPath, you can write a simple selector that extracts every element matching a pattern. For example, every link in an HTML document is //a, every h1 is //h1, and every image directly inside a div, where the image has a class of “car”, is something like this: //div/img[@class="car"].

XPath can also select attributes, text content, and more.
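As a minimal sketch of the selectors above, here is how they might be evaluated from Ruby. It uses REXML (bundled with Ruby) only so that it runs without extra gems, and it assumes well-formed markup; Nokogiri's `doc.xpath(...)` accepts the same expressions and copes better with real-world HTML. The markup and the method names `sample_listing`, `link_texts`, and `car_image_sources` are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed listing markup used only for illustration.
def sample_listing
  <<~XML
    <html>
      <body>
        <h1>Cars for sale</h1>
        <div><img class="car" src="/img/golf.jpg"/><a href="/cars/1">VW Golf</a></div>
        <div><img class="car" src="/img/civic.jpg"/><a href="/cars/2">Honda Civic</a></div>
        <div><img class="logo" src="/img/logo.png"/></div>
      </body>
    </html>
  XML
end

# Returns the text of every link (//a) in the document.
def link_texts(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//a').map(&:text)
end

# Returns the src attribute of every img with class "car"
# that is a direct child of a div.
def car_image_sources(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//div/img[@class="car"]').map do |img|
    img.attributes['src']
  end
end

puts link_texts(sample_listing).inspect
puts car_image_sources(sample_listing).inspect
```

Note that the logo image is skipped because the predicate only matches elements whose class attribute is exactly "car".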

In Ruby, Nokogiri is available for evaluating XPath expressions.

For Ruby, use Nokogiri to parse HTML/XML, and for crawling there is Anemone, a Ruby web-spider framework.
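XPath only helps once you know which selector to write. When the pattern is not known beforehand, one rough heuristic (not something this answer proposes, just an illustrative sketch) is to give every element a coarse "shape" made of its tag and its children's tags, then look for the parent whose children repeat the same shape most often. The method names `element_shape` and `most_repeated_block` are hypothetical, and REXML is used here on the assumption that the markup is well-formed; messy real-world HTML would need a lenient parser such as Nokogiri first.

```ruby
require 'rexml/document'

# Coarse structural signature: the element's tag plus its direct children's tags.
def element_shape(element)
  child_tags = element.elements.map(&:name)
  "#{element.name}(#{child_tags.join(',')})"
end

# Finds the parent element whose direct children repeat one shape most often.
# Returns a hash with :parent (XPath of the parent), :shape and :count,
# or nil when no shape occurs at least twice under a single parent.
def most_repeated_block(xml)
  doc = REXML::Document.new(xml)
  best = nil
  REXML::XPath.each(doc, '//*') do |parent|
    counts = Hash.new(0)
    parent.elements.each { |child| counts[element_shape(child)] += 1 }
    shape, count = counts.max_by { |_, n| n }
    next if count.nil? || count < 2
    if best.nil? || count > best[:count]
      best = { parent: parent.xpath, shape: shape, count: count }
    end
  end
  best
end
```

For a page whose car listings are `li` items that each contain an image, a link, and a price, this would point at the enclosing list, and an XPath selector for the repeated items can then be written from the reported parent and shape. It is only a starting point: it ignores classes, nesting depth, and items that vary slightly from one another.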

+2

In Ruby, you can fetch a web page with Net::HTTP, using its get method.

require 'net/http'
Net::HTTP.get 'www.target-site.com', '/target-page.html'

After that, you can run the result through an XML parser to process it, for example Hpricot.

-1
