How to extract useful text from HTML

Question

How to extract useful text from HTML

I would like to analyze an HTML page and extract the meaningful text from it. Does anyone know of any good algorithms?

I develop my applications on Rails, but I think Ruby is a bit slow for this, so a good library in C would be appropriate.

Thanks!!

PS: Please do not recommend anything in Java.

UPDATE: I found this link.

Sorry, it is in Python.

+6

c html ruby html-parsing html-content-extraction

Nisanio Oct 19 '10 at 14:30


4 answers

Ruby Integrated Solutions

Use Nokogiri, as recommended by Amigable Clark Kant.
Use Hpricot.

External solutions

If your HTML is well-formed, you can use the Expat XML parser to do this.
For something more HTML-specific, the W3C has released the code for LibWWW, which contains a simple HTML parser (documentation).

+2

haylem Oct 19 '10 at 14:45


Lynx can do this. It is open source, if you want to take a look at how it works.

-1

mouviciel Oct 19 '10 at 14:36


You should strip out everything enclosed in angle brackets, and then collapse the whitespace. In theory, < and > should not appear anywhere else: pages use &lt; and &gt; in their place.

Collapsing whitespace: convert all tabs, newlines, etc. into spaces, and then replace each run of spaces with a single space.

UPDATE: And you should start by looking for the <body> tag.
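
For illustration, the steps described above could be sketched in plain Ruby roughly as follows. This is only a minimal sketch, not a real HTML parser: the method name naive_text and the ENTITIES table are made up for this example, only a handful of entities are decoded, and dropping script and style blocks is an extra assumption that the answer itself does not mention.

```ruby
# Small, illustrative subset of HTML entities (an assumption, not a full table).
ENTITIES = {
  '&lt;' => '<', '&gt;' => '>', '&amp;' => '&', '&quot;' => '"'
}.freeze

# Naive text extraction: start at <body>, cut out tags, collapse whitespace.
def naive_text(html)
  # Start at the <body> tag when there is one; otherwise use the whole input.
  body_start = html =~ /<body\b[^>]*>/i
  fragment = body_start ? html[body_start..-1] : html

  # Extra step (assumption): drop script and style blocks with their content.
  fragment = fragment.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, ' ')

  # Cut out everything enclosed in angle brackets.
  text = fragment.gsub(/<[^>]*>/, ' ')

  # Decode the few entities listed above, e.g. &lt; and &gt;.
  text = text.gsub(/&(?:lt|gt|amp|quot);/, ENTITIES)

  # Collapse tabs, newlines and runs of spaces into a single space.
  text.gsub(/\s+/, ' ').strip
end
```

Calling naive_text on a page string returns one line of text. It will still break on malformed markup, comments that contain angle brackets, and similar cases, which is the usual objection to this approach.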

-3

Notinlist Oct 19 '10 at 14:37


Prof. · Accepted Answer · 2010-10-19T14:41:48+0000

Use Nokogiri, a fast HTML parser for Ruby that is written in C.

(Using regular expressions to parse recursive structures such as HTML is notoriously difficult and error-prone, and I would not go that way. The question seems to come up again and again.)

Using a real parser such as Nokogiri also has the advantage of preserving the structure and logic of the HTML document, and sometimes you really need those cues.
