Ruby Nokogiri HTML Table Parsing

I am using Mechanize / Nokogiri and need to parse the following HTML. Can someone help me with the XPath syntax for this, or suggest any other method that will work?

 <table>
   <tr class="darkRow">
     <td> <span> <a href="?x=mSOWNEBYee31H0eV-V6JA0ZejXANJXLsttVxillWOFoykMg5U65P4x7FtTbsosKRbbBPuYvV8nPhET7b5sFeON4aWpbD10Dq"> <span>4242YP</span> </a> </span> </td>
     <td> <span>Subject of Meeting</span> </td>
     <td> <span> <span>01:00 PM</span> <span>Nov 11 2009</span> <span>America/New_York</span> </span> </td>
     <td> <span>30</span> </td>
     <td> <span> <span> example@email.com </span> </span> </td>
     <td> <span>39243368</span> </td>
   </tr>
   . . . <more table rows with the same format>
 </table>

I want this as the result:

 "4242YP","Subject of Meeting","01:00 PM Nov 11 2009 America/New_York","30"," example@email.com ", "39243368"
 . . . <however many rows exist in the html table>
2 answers

Something like this?

 items = doc.xpath('//tr').map do |row|
   row.xpath('.//span/text()').map(&:text).select { |text| text =~ /\w/ }
 end

This returns:

 => [["4242YP", "Subject of Meeting", "01:00 PM", "Nov 11 2009", "America/New_York", "30", " example@email.com ", "39243368"], ["ABCDEFG"]]

The select keeps only those text nodes that contain at least one word character (i.e. it excludes the whitespace-only text that some of your spans contain). You may need to refine the select filter for your particular case.
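To see what that filter does in isolation, here is a small sketch on plain strings. The sample strings are made up for illustration, and the second filter is just one possible refinement, not something from the answer above.

```ruby
# Hypothetical text-node contents, including whitespace-only and
# punctuation-only entries.
texts = ["4242YP", "\n   ", " example@email.com ", "--", " ", "30"]

# Same idea as the select above: keep strings with at least one word character.
with_word_chars = texts.select { |t| t.match?(/\w/) }

# A looser alternative: keep anything that is not blank, so "--" survives.
non_blank = texts.reject { |t| t.strip.empty? }

p with_word_chars
p non_blank
```

Which of the two fits depends on whether punctuation-only cells carry meaning in your table.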

I added a minimal second row containing a single span with ABCDEFG so you can see the nested array.


Here is an XSL stylesheet that transforms your input, if you have an XSLT processor available:

 <?xml version="1.0"?>
 <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output method="text"/>
   <xsl:template match="/">
     <xsl:apply-templates select="//tr"/>
   </xsl:template>
   <xsl:template match="tr">
     "<xsl:value-of select="td/span/a/span"/>","<xsl:value-of select="td[position()=2]/span"/>","<xsl:value-of select="td[position()=3]/span/span[position()=1]"/>"
   </xsl:template>
 </xsl:stylesheet>

The output is as follows:

 "4242YP","Subject of Meeting","01:00 PM"
 "4242YP","Subject of Meeting","01:00 PM"

(I duplicated the first row of your table.)

The select expressions in the stylesheet should give you an idea of the XPath you need to extract the remaining columns.
