Trouble with child elements when using Mojo::DOM

I am trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.

vBulletin does not use HTML and CSS for semantic markup, and I am having trouble using Mojo::DOM->children to get at specific elements.

These vBulletin posts are structured differently depending on their content.

Single message:

    <div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>

One message quoting another user:

    <div id="postid_12345">
      <div>
        <table>
          <tr>
            <td>
              <div>Quote originally posted by Bob</div>
              <div>Everyone knows the sky is blue.</div>
            </td>
          </tr>
        </table>
      </div>
      I disagree with you, Bob. It's obviously green.
    </div>

One message with spoilers:

    <div id="postid_12345">
      <div class="spoiler">Yoda is Luke's father!</div>
    </div>

One message quoting another user with spoilers:

    <div id="postid_12345">
      <div>
        <table>
          <tr>
            <td>
              <div>Quote originally posted by Fred</div>
              <div class="spoiler">Yoda is Luke's father!</div>
            </td>
          </tr>
        </table>
      </div>
      <div class="spoiler">No waaaaay!</div>
    </div>

Assuming the HTML above and an array populated with the necessary post IDs:

    for (@post_ids) {
        $mech->get($full_url_of_specific_forum_post);
        my $dom = Mojo::DOM->new($mech->content);

        # at() takes a CSS selector, so the id needs a leading '#'
        my $div_id = '#postid_' . $_;

        say $dom->at($div_id)->children('div')->first;
        say $dom->at($div_id)->text;
    }

Using $dom->at($div_id)->all_text gives me everything in one continuous line, which makes it hard to tell what is quoted and what is original in the post.

Using $dom->at($div_id)->text skips all of the children, so quoted text and spoilers are not picked up.

I have tried variations of $dom->at($div_id)->children('div')->first, but that gives me everything, including the HTML.
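To make the three behaviours concrete, here is a minimal, self-contained sketch of what I am seeing. It assumes the quoted-post sample from above inlined as a string (`$html` is just that sample, not the real forum page) and a reasonably recent Mojolicious:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Inlined copy of the "quoting another user" sample; not the real forum page.
my $html = <<'HTML';
<div id="postid_12345">
  <div><table><tr><td>
    <div>Quote originally posted by Bob</div>
    <div>Everyone knows the sky is blue.</div>
  </td></tr></table></div>
  I disagree with you, Bob. It's obviously green.
</div>
HTML

my $post = Mojo::DOM->new($html)->at('#postid_12345');

# text: only the text nodes directly inside the post div,
# so the quoted text is missing.
my $own_text = $post->text;

# all_text: text of the post and all of its descendants,
# with nothing to mark where the quote ends.
my $all_text = $post->all_text;

# children('div')->first is a Mojo::DOM object; printing it
# stringifies it back to HTML, tags included.
my $first_child = $post->children('div')->first;

say "text:     $own_text";
say "all_text: $all_text";
say "child:    $first_child";
```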

Ideally, I would like to get all the text in each post, with each child on its own line, for example:

    POSTID12345:
    + Quote originally posted by Bob
    + Everyone knows the sky is blue.
    I disagree with you, Bob. It's obviously green.

I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after reading the documentation and fiddling with it for several hours, my brain is mush and I'm at a loss. I just don't understand how Mojo::DOM and Mojo::Collection work.

Any help would be greatly appreciated.

2 answers

There is a module for flattening the HTML tree, HTML::Linear. Explaining the purpose of flattening the HTML tree is a bit long and boring, so here is an image showing the output of the xpathify tool bundled with that module:

screenshot

As you can see, the nodes of the HTML tree become single key/value pairs, where the key is the XPath of the node and the value is the node's text content. In a few keystrokes, this is how you use HTML::Linear:

    #!/usr/bin/env perl
    use strict;
    use utf8;
    use warnings;

    use Data::Printer;
    use HTML::Linear;

    # Parse the saved forum page and flatten it into a list of nodes
    my $hl = HTML::Linear->new;
    $hl->parse_file(q(vboard.html));

    for my $el ($hl->as_list) {
        my $hash = $el->as_hash;
        next unless keys %{$hash};    # skip nodes that produced no entries
        p $hash;                      # dump with Data::Printer
    }
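If you would rather stay with Mojo::DOM, something along the following lines might get close to the output you describe. Treat it as a rough sketch, not a tested solution for real vBulletin pages: `clean` and `post_lines` are hypothetical helpers of my own (they are not part of Mojo::DOM), the HTML is your quoted-post sample inlined, and it assumes a Mojolicious recent enough to provide `child_nodes`, `descendant_nodes`, and `type`. The idea is to walk the nodes of the post instead of its elements: text sitting directly in the post div is printed as-is, and text found anywhere inside a child element gets a `+ ` prefix.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: collapse runs of whitespace and trim both ends.
sub clean {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Hypothetical helper: one line per non-empty text node in the post.
# Text directly inside the post is left unprefixed; text nested in a
# child element (quotes, spoilers) is prefixed with '+ '.
sub post_lines {
    my ($post) = @_;
    my @lines;
    for my $child ($post->child_nodes->each) {
        if ($child->type eq 'text') {
            my $text = clean($child->content);
            push @lines, $text if length $text;
        }
        elsif ($child->type eq 'tag') {
            for my $node ($child->descendant_nodes->each) {
                next unless $node->type eq 'text';
                my $text = clean($node->content);
                push @lines, "+ $text" if length $text;
            }
        }
    }
    return @lines;
}

# Inlined copy of the quoted-post sample from the question.
my $html = <<'HTML';
<div id="postid_12345">
  <div><table><tr><td>
    <div>Quote originally posted by Bob</div>
    <div>Everyone knows the sky is blue.</div>
  </td></tr></table></div>
  I disagree with you, Bob. It's obviously green.
</div>
HTML

my $post  = Mojo::DOM->new($html)->at('#postid_12345');
my @lines = post_lines($post);

say 'POSTID12345:';
say for @lines;
```

For the quoted-post sample this should give the two `+ ` lines for Bob's quote followed by the unprefixed reply. A spoiler that is a direct child of the post would also get the `+ ` prefix, so you may want to check `$child->attr('class')` if spoilers need to be marked differently.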