How to filter CDATA and get text from HTML?

Question


I am parsing an HTML file with Nokogiri. The parsing itself works, but I only need the text, not the CDATA sections or JavaScript. The script and div tags are scattered throughout the file.


ruby nokogiri

Ramil Aug 19 '10 at 7:31


1 answer

akuhn · Answer 1 · 2011-07-07T01:11:30+0000

You can remove all script elements first,

doc.search('script').remove

... and then select all remaining text nodes

doc.xpath('//text()')

... or select only the text nodes inside div elements

doc.xpath('//div//text()')
