Which language / tool should be used for HTML parsing?

Question

I have several websites from which I want to extract data, and based on previous experience, it is not as simple as it seems. Why? Because the HTML pages I have to parse are not well formed (missing closing tags, etc.).

Given that I have no restrictions on the technology, language, or tool that I can use, what would you suggest for easily parsing and extracting data from HTML pages? I tried HTML Agility Pack and BeautifulSoup, but even these tools are not perfect (HTML Agility Pack is buggy, and BeautifulSoup's parser does not work with the pages that I pass to it).

Thanks!

+5

html html-parsing screen-scraping

Martin Feb 24 '09 at 14:25

6 answers

hpricot is a good choice. Adding scrubyt on top gives you an HTML scraping interface in Ruby: http://scrubyt.org/

Example from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:

require 'rubygems'
require 'scrubyt'

# Simple example for scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml

+2

Stewart Robinson 24 . '09 14:48

For Java, you can use Jsoup.

+2

cuneytykaya 04 . '13 12:28

hpricot may be worth a look.

0

Colin Pickard 24 . '09 14:31

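You could try PHP's DOMDocument class, which can load HTML content. It helps to make sure the HTML has a DOCTYPE, and to inspect the parsed result with Firebug, because DOMDocument may rearrange invalid HTML. The parsing is done by libxml, and the example below silences its warnings while the markup is loaded.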

$html = file_get_contents('http://example.com');

$dom = new DOMDocument;
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($oldValue);

echo $dom->saveHTML();

0

Ionuț G. Stan 24 . '09 14:45

Use something that parses the HTML into a DOM tree.

In Perl, you can use HTML::TreeBuilder.

0

Boris Ivanov 09 . '15 21:17

cletus · Accepted Answer · 2009-02-24T14:26:40+0000

You can use almost any language you like, just don’t try to parse HTML with regular expressions.

So let me rephrase that and say: you can use any language you like, as long as it has an HTML parser, which covers pretty much every language invented in the last 15-20 years.

If you are having problems with specific pages, I suggest looking into cleaning them up with HTML Tidy.
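
To illustrate the parser-over-regex advice, here is a minimal sketch using Python's standard-library `html.parser`, which does not require well-formed markup. The `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for this example and do not come from any of the answers; a real page may need a more forgiving library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from <a> tags, tolerating unclosed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._flush()    # a new <a> implicitly ends an unclosed one
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()        # pick up a trailing <a> that was never closed

    def _flush(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []


def extract_links(html):
    """Hypothetical helper: return every (href, text) pair found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample with missing </li> and </a> tags
BROKEN = '<ul><li><a href="/one">One</a><li><a href="/two">Two</ul>'

print(extract_links(BROKEN))  # → [('/one', 'One'), ('/two', 'Two')]
```

The parser reports start tags, end tags, and text as it meets them and never insists that tags balance, so the missing closing tags do not cause an error. A regular expression would have to anticipate each such case by hand.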
