Which language / tool should be used for HTML parsing?

I have several websites from which I want to extract data, and based on previous experience, it is not as simple as it seems. Why? Because the HTML pages I have to parse are malformed (missing closing tags, etc.).

Given that I have no restrictions on the technology, language, or tool I can use, what would you suggest for easily parsing and extracting data from HTML pages? I tried HTML Agility Pack and BeautifulSoup, but even these tools are not perfect (HTML Agility Pack is buggy, and BeautifulSoup's parser does not work with the pages I pass to it).

Thanks!

+5
6 answers

You can use almost any language you like; just don’t try to parse HTML with regular expressions.

So let me rephrase that: you can use any language you like that has an HTML parser, which is pretty much everything invented in the past 15-20 years.
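To illustrate the point about regular expressions, here is a minimal sketch in Python using only the standard library's `html.parser`. The sample markup, the `LinkCollector` class, and both helper functions are hypothetical names made up for this illustration; the regex shown is just one typical naive pattern, not the only way people attempt it.

```python
import re
from html.parser import HTMLParser

# Hypothetical malformed input: unclosed <li> and <a> tags, an unquoted
# and a single-quoted attribute value, and a link inside a comment.
BROKEN_HTML = """
<ul>
  <li><a href=/first>First
  <li><a class="x" href='/second'>Second</a>
  <!-- <a href="/commented-out">old</a> -->
</ul>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_via_parser(html):
    """Extract link targets with a real (tolerant) HTML parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def links_via_regex(html):
    """Naive attempt: assumes href comes first and is double-quoted."""
    return re.findall(r'<a\s+href="([^"]*)"', html)


if __name__ == "__main__":
    print("parser:", links_via_parser(BROKEN_HTML))
    print("regex: ", links_via_regex(BROKEN_HTML))
```

On this input the parser finds the two real links despite the missing closing tags, while the naive regex misses both and picks up only the link that is commented out. A more elaborate regex could patch each of these cases, but every patch tends to expose another one, which is why a parser is the safer starting point.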

If you are having problems with specific pages, I suggest running them through HTML Tidy to repair the markup first.

+4

In Ruby, hpricot or scrubyt can be used for HTML scraping: http://scrubyt.org/

http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb

require 'rubygems'
require 'scrubyt'

# Simple example for scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml
+2

For Java, use Jsoup.

+2
0

In PHP, use DOMDocument. It can load HTML even when the markup is malformed; to keep libxml from emitting a warning for every parse error, switch to internal error handling while loading:

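$html = file_get_contents('http://example.com');

$dom = new DOMDocument;
// Collect libxml parse errors internally instead of emitting warnings.
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
// Restore the previous error-handling setting.
libxml_use_internal_errors($oldValue);

// Print the document as re-serialized by DOMDocument.
echo $dom->saveHTML();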
0

Parse the HTML into a DOM tree and work with that.

In Perl, use HTML::TreeBuilder.

0
