How to extract useful text from HTML

I would like to analyze an HTML page and extract the meaningful text from it. Does anyone know any good algorithms?

I develop my applications in Rails, but I think Ruby is a bit slow for this, so a good library written in C would be appropriate.

Thanks!!

PS: Please do not recommend anything in Java.

UPDATE: I found this link text

Sorry, it is in Python.

+6
c html ruby html-parsing html-content-extraction
4 answers

Use Nokogiri, which is fast and written in C for Ruby.

(Using regular expressions to parse recursive structures such as HTML is generally difficult and error-prone, and I would not go that way. This question seems to come up again and again.)

Using a real parser, such as the Nokogiri mentioned above, has the added advantage of preserving the structure and logic of the HTML document, and sometimes you really need those clues.

+6

Ruby Integrated Solutions

External solutions

+2

Lynx can do this. It is open source, if you want to take a look at how it works.

-1

You need to cut out every angle-bracketed tag, brackets included, and then collapse the whitespace. In theory, < and > should not appear anywhere else: pages write &lt; and &gt; in their place.

Collapse whitespace: convert every tab, newline, etc. into a space, and then replace each run of spaces with a single space.

UPDATE: And you should start by looking for the <body> tag.
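The steps above can be sketched in plain Ruby roughly as follows. This is only an illustration of the naive tag-stripping idea, not a real HTML parser: the helper name naive_text and the sample string are hypothetical, and decoding entities with CGI.unescapeHTML is an added assumption that the answer does not spell out.

```ruby
require "cgi"

# Hypothetical helper sketching the naive approach; it does not parse HTML.
def naive_text(html)
  # Start at <body> if present; otherwise use the whole input.
  start = html =~ /<body\b/i
  fragment = start ? html[start..-1] : html
  # Replace every <...> run with a space so adjacent words do not fuse.
  no_tags = fragment.gsub(/<[^>]*>/, " ")
  # Decode entities such as &lt; and &gt; only after the tags are gone.
  decoded = CGI.unescapeHTML(no_tags)
  # Turn tabs, newlines, etc. into spaces and squeeze each run into one.
  decoded.gsub(/\s+/, " ").strip
end

sample = "<html><head><title>T</title></head>\n<body>\n<p>1 &lt; 2\tand\n<b>bold</b></p></body></html>"
puts naive_text(sample)  # prints "1 < 2 and bold"
```

Note that this sketch keeps whatever sits between <script> or <style> tags inside the body, and it can be confused by a literal > inside an attribute value or a comment, which is the kind of fragility that makes a real parser the safer choice.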

-3