How to extract child text using Nokogiri?

Question

I came across this HTML:

<div class='featured'>
    <h1>
        How to extract this?
        <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
        <span class="moredetail ">
            <a href="/hello" title="hello">hello</a>
        </span>
        <div class="clear"></div>
    </h1>
</div>

I want to extract only the text that belongs directly to the <h1>, i.e. "How to extract this?". How should I do it?

I tried the following code, but the result also includes the other elements nested inside the <h1>. I am not sure how to exclude them so that I get only the <h1> text.

require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open(url))
records = doc.css(".featured h1")

ruby ruby-on-rails nokogiri

Tony takeshi Dec 18 '11 at 4:58

1 answer

Joshua Cheek · Answer 1 · 2011-12-18T05:31:33+0000

#css returns a collection; use #at_css to get the first matching node. All of its contents, even text, are children, and in this case the text is its first child. You can also do something like children.reject(&:element?) if you want all the children that are not elements.

data = '
<div class="featured">
    <h1>
        How to extract this?
        <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
        <span class="moredetail ">
            <a href="/hello" title="hello">hello</a>
        </span>
        <div class="clear"></div>
    </h1>
</div>
'

require 'nokogiri'
text = Nokogiri::HTML(data).at_css('.featured h1').children.first.text
text # => "\n        How to extract this?\n        "

The same with XPath:

Nokogiri::HTML(data).at_xpath('//*[@class="featured"]/h1/text()').text
