Trouble with child elements when using Mojo::DOM

Question

I am trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.

vBulletin does not use HTML and CSS for semantic markup, and I am having trouble using Mojo::DOM->children to get at specific elements.

These vBulletin posts are structured differently depending on their content.

Single message:

    <div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>

One message quoting another user:

    <div id="postid_12345">
      <div>
        <table>
          <tr>
            <td>
              <div>Quote originally posted by Bob</div>
              <div>Everyone knows the sky is blue.</div>
            </td>
          </tr>
        </table>
      </div>
      I disagree with you, Bob. It's obviously green.
    </div>

One message with spoilers:

    <div id="postid_12345">
      <div class="spoiler">Yoda is Luke's father!</div>
    </div>

One message quoting another user with spoilers:

    <div id="postid_12345">
      <div>
        <table>
          <tr>
            <td>
              <div>Quote originally posted by Fred</div>
              <div class="spoiler">Yoda is Luke's father!</div>
            </td>
          </tr>
        </table>
      </div>
      <div class="spoiler">No waaaaay!</div>
    </div>

Assuming the HTML above and an array populated with the necessary post IDs:

    for (@post_ids) {
        $mech->get($full_url_of_specific_forum_post);
        my $dom = Mojo::DOM->new($mech->content);
        # at() takes a CSS selector, so an element ID needs a leading '#'
        my $div_id = '#postid_' . $_;
        say $dom->at($div_id)->children('div')->first;
        say $dom->at($div_id)->text;
    }

Using $dom->at($div_id)->all_text gives me everything in one continuous line, which makes it difficult to tell what is quoted and what is original in the post.

Using $dom->at($div_id)->text skips all of the children, so quoted text and spoilers are not picked up.

I have tried variations of $dom->at($div_id)->children('div')->first, but that gives me everything, including the HTML.

Ideally, I would like to get all of the text in each post, with each child on its own line, for example:

    POSTID12345:
    + Quote originally posted by Bob
    + Everyone knows the sky is blue.
    I disagree with you, Bob. It's obviously green.

I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after reading the documentation and fiddling with it for several hours, my brain is mush and I'm at a loss. I just don't understand how Mojo::DOM and Mojo::Collection work.

Any help would be greatly appreciated.

+4

perl mojolicious

Chuck h Dec 28 '12 at 17:38

2 answers

Joel berger · Answer 1 · 2012-12-28T22:14:51+0000

Looking at the source of Mojo::DOM, the all_text method basically walks the DOM recursively and collects all of the text. Use that source as a starting point for writing your own DOM-walking function. Its recursive function returns a single string; yours could instead return an array with whatever context you need.
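As a rough illustration of that idea (not the actual Mojo::DOM source, and not necessarily what the answerer had in mind), here is a minimal sketch of such a recursive walker. It assumes a reasonably recent Mojolicious in which Mojo::DOM provides child_nodes, type and content. The helper name text_lines and the "+ " prefix for nested text are hypothetical choices made to match the output format requested in the question.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: walk $node recursively and return one string per
# non-empty text chunk. Text found inside child elements (depth > 0) gets
# a "+ " prefix; text sitting directly in the post's div does not.
sub text_lines {
    my ($node, $depth) = @_;
    my @lines;
    for my $child ($node->child_nodes->each) {
        my $type = $child->type;
        if ($type eq 'text' || $type eq 'cdata') {
            my $text = $child->content;
            $text =~ s/\s+/ /g;         # collapse runs of whitespace
            $text =~ s/^ | $//g;        # trim the ends
            push @lines, ($depth ? '+ ' : '') . $text if length $text;
        }
        elsif ($type eq 'tag') {
            push @lines, text_lines($child, $depth + 1);
        }
    }
    return @lines;
}

# Sample markup taken from the question (quoting another user).
my $html = <<'HTML';
<div id="postid_12345">
  <div><table><tr><td>
    <div>Quote originally posted by Bob</div>
    <div>Everyone knows the sky is blue.</div>
  </td></tr></table></div>
  I disagree with you, Bob. It's obviously green.
</div>
HTML

my $post = Mojo::DOM->new($html)->at('#postid_12345');
say 'POSTID12345:';
say for text_lines($post, 0);
```

Because the function returns a list rather than one joined string, the caller decides how to present it. Carrying $depth (or the tag name, or the class attribute for spoilers) along with each chunk is one way to keep the context that all_text throws away.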

EDIT:

After some discussion on IRC, the web scraping example in the Cookbook has been updated; it might help guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping

creaktive · Answer 2 · 2012-12-29T17:26:08+0000

There is a module for flattening the HTML tree, HTML::Linear. Explaining the purpose of flattening the HTML tree is a bit long and boring, so here is a picture showing the output of the xpathify tool bundled with that module:

As you can see, the nodes of the HTML tree become single key/value pairs, where the key is the XPath for that node and the value is the node's text attribute. In a few keystrokes, this is how you use HTML::Linear:

    #!/usr/bin/env perl
    use strict;
    use utf8;
    use warnings;

    use Data::Printer;
    use HTML::Linear;

    my $hl = HTML::Linear->new;
    $hl->parse_file(q(vboard.html));

    for my $el ($hl->as_list) {
        my $hash = $el->as_hash;
        next unless keys %{$hash};
        p $hash;
    }
