Get the first few elements of an HTML fragment with XPath in Ruby

Question

For a project such as a blog, I want to take the first few paragraphs, headings, lists, or other elements, up to a given number of characters, from the generated HTML fragment and display them as a summary.

So, if I have

    <h1>hello world</h1>
    <p>Lets say these are 100 chars</p>
    <ul>
      <li>some bla bla, 40 chars</li>
    </ul>
    <p>some other text</p>

Suppose I want the summary to cover the first 150 characters. It does not need to be very precise: I could simply take the first 150 characters, tags included, and work from there, but that would probably leave artifacts at the tail that are more complicated to clean up. The result should give me the h1, the p and the ul, but not the final p (which would be truncated). If the first element alone has more than 150 characters, I would take the whole first element.

How can I do this? With XPath or a regex? I am a bit out of ideas here...

Edit

First, I want to say THANKS to everyone who answered!

While I got really excellent answers in this thread, it turned out to be much easier for me to hook in before the Markdown interpreter runs, take the first n text blocks separated by \r\n\r\n, and pass just those to the Markdown generation.

    class String
      def summarize_md(length)
        arr = self.split(/\r\n\r\n/)
        sum = ""
        arr.each do |ea|
          break if sum.length + ea.length > length
          sum = sum + "#{ea}\r\n\r\n"
        end
        sum
      end
    end

Although this code could probably be reduced to one line, it is still much simpler and more CPU-friendly than any of the proposed solutions. In any case, since my question can be read as having HTML as the starting point (rather than the Markdown text), I will just accept the answer from the first person who replied... I hope that is fair...
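The same idea can be written as a standalone method instead of a String extension. This is only a minimal sketch: the name summarize_markdown, the acceptance of both LF and CRLF blank-line separators, and the rule that the first block is always kept (to match the "take the full first element" requirement above) are assumptions, not part of the original code.

```ruby
# Hypothetical helper (not from the original post): keeps leading Markdown
# blocks while their combined length stays within +limit+ characters.
def summarize_markdown(markdown, limit)
  # Blocks are separated by blank lines; accept both LF and CRLF endings.
  blocks = markdown.split(/(?:\r?\n){2,}/)
  kept = []
  total = 0
  blocks.each do |block|
    # Always keep the first block, even if it alone exceeds the limit.
    break if kept.any? && total + block.length > limit
    kept << block
    total += block.length
  end
  kept.join("\n\n")
end

text = "# hello world\r\n\r\n#{'a' * 100}\r\n\r\n* #{'b' * 40}\r\n\r\nsome other text"
puts summarize_markdown(text, 160)
```

With a limit of 160 the heading, the long paragraph and the list item are kept, and the last paragraph is dropped because it would push the total past the limit.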

+4

html ruby regex xpath markdown

Jan Oct 20 '10 at 23:48

4 answers

Pure XPath 1.0 solution :

substring(/*, 1, 150)

where the parent of the provided XHTML fragment is the top element ( /* or /html ).

There is a very accurate XPath 2.0 solution :

  for $t in (//text())[not(sum((.| preceding::text())/string-length(.)) gt 150)] return ($t, '&#xA;')

Note: the XML document should be parsed in a mode that discards whitespace-only text nodes. Otherwise, string-length(.) must be replaced with string-length(normalize-space(.)).
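For readers more comfortable with Ruby than XPath 2.0, the selection logic of that expression can be mirrored on a plain array of strings. This is a sketch under assumptions: texts_within and its normalize option are hypothetical names, the array stands in for the document's text nodes in document order, and split.join(' ') only approximates normalize-space(). Because the running total never decreases, stopping at the first node that exceeds the limit selects the same nodes as the XPath filter.

```ruby
# Hypothetical helper: +texts+ stands in for the document's text nodes,
# in document order. A string is kept while the running total of lengths
# (including the current string) does not exceed +limit+.
def texts_within(texts, limit, normalize: false)
  total = 0
  texts.take_while do |text|
    # Approximation of normalize-space(.): trim and collapse whitespace runs.
    measured = normalize ? text.split.join(' ') : text
    total += measured.length
    total <= limit
  end
end

nodes = ['hello world', "\n  ", 'a' * 100, "\n  ", 'b' * 38, 'some other text']
p texts_within(nodes, 150).length
p texts_within(nodes, 150, normalize: true).length
```

The two calls differ only in whether whitespace-only nodes count toward the limit, which is the situation the note about normalize-space() addresses.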

+1

Dimitre novatchev Oct 21 '10 at 2:42

How can i get this?

XSLT, of course!

This stylesheet:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:strip-space elements="*"/>
      <xsl:param name="pMaxLength" select="73"/>
      <xsl:template match="node()">
        <xsl:param name="pPrecedingLength" select="0"/>
        <xsl:variable name="vContent">
          <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()[1]">
              <xsl:with-param name="pPrecedingLength" select="$pPrecedingLength"/>
            </xsl:apply-templates>
          </xsl:copy>
        </xsl:variable>
        <xsl:variable name="vLength" select="$pPrecedingLength + string-length($vContent)"/>
        <xsl:if test="$pMaxLength > $vLength and (string-length($vContent) or not(node())) or not($pPrecedingLength)">
          <xsl:copy-of select="$vContent"/>
          <xsl:apply-templates select="following-sibling::node()[1]">
            <xsl:with-param name="pPrecedingLength" select="$vLength"/>
          </xsl:apply-templates>
        </xsl:if>
      </xsl:template>
    </xsl:stylesheet>

Output:

    <html>
      <h1>hello world</h1>
      <p>Lets say these are 100 chars</p>
      <ul>
        <li>some bla bla, 40 chars</li>
      </ul>
    </html>

+1

user357812 Oct 21 '10 at 16:06

For my purposes, I have always wanted to strip the tags, because they can include all kinds of nasty things that could completely mess up the display of the summary. They can also seriously skew the character count, depending on the tags and whether they carry attributes.

I have used something like this many times.

    require 'nokogiri'

    html = %q{
      <h1>hello world</h1>
      <p>Lets say these are 100 chars</p>
      <ul>
        <li>some bla bla, 40 chars</li>
      </ul>
      <p>some other text</p>
    }

    doc = Nokogiri::HTML(html)
    # Flatten the text content to one line and keep the first 150 characters.
    puts doc.content.gsub(/\n/, ' ').squeeze(' ').strip[0, 150]

Which outputs:

 hello world Lets say these are 100 chars some bla bla, 40 chars some other text

I will leave it to you to figure out how to ignore or subtract the text from the last <p>, but finding that tag, grabbing its contents and then removing them from the end of the string should not be too hard.

+1

the tin man Oct 21 '10 at 22:42

Mark thomas · Accepted Answer · 2010-10-21T00:30:56+0000

Using XPath is the most reliable and flexible approach. Here is an example:

    require 'rubygems'
    require 'nokogiri'

    # <<~ strips the common leading indentation from the heredoc body.
    html = <<~End
      <h1>hello world</h1>
      <p>Lets say these are 100 chars.......................................................................</p>
      <ul>
        <li>some bla bla, 40 chars</li>
      </ul>
      <p>some other text</p>
    End

    LIMIT = 150
    summary = ""

    doc = Nokogiri::HTML.parse(html)
    # Append text nodes in document order until the limit would be reached.
    doc.xpath('//text()').each do |node|
      text = node.text
      break if summary.length + text.length >= LIMIT
      summary << text
    end

    puts summary
    puts summary.length

The XPath //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can.
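As one way of being more specific, the loop can walk whole top-level elements instead of individual text nodes, so the summary keeps its markup and never cuts an element in half. The sketch below is not from the answer: it uses REXML from Ruby's standard distribution rather than Nokogiri (an assumption made to avoid an extra gem, which also means the fragment must be well-formed XML), and the names text_length and summarize_fragment are hypothetical.

```ruby
require 'rexml/document'

# Total length of all text beneath +node+ (text nodes at any depth).
def text_length(node)
  case node
  when REXML::Text    then node.value.length
  when REXML::Element then node.children.sum { |child| text_length(child) }
  else 0 # comments, processing instructions, ...
  end
end

# Hypothetical helper: returns the markup of the leading top-level elements
# whose combined text length stays within +limit+ characters.
def summarize_fragment(fragment, limit)
  # Wrap the fragment in a single root so REXML can parse it as XML.
  doc = REXML::Document.new("<div>#{fragment}</div>")
  kept = []
  total = 0
  # The XPath '*' selects the element children of the wrapper, in order.
  REXML::XPath.each(doc.root, '*') do |element|
    length = text_length(element)
    # Always keep the first element, even if it alone exceeds the limit.
    break if kept.any? && total + length > limit
    kept << element.to_s
    total += length
  end
  kept.join("\n")
end

html = "<h1>hello world</h1>\n<p>#{'a' * 100}</p>\n" \
       "<ul><li>#{'b' * 38}</li></ul>\n<p>some other text</p>"
puts summarize_fragment(html, 150)
```

Here the limit is measured on text only, so tags and attributes do not count toward the 150 characters; measuring element.to_s.length instead would count the markup as well.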
