I am trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin does not use HTML and CSS for semantic markup, which makes it difficult for me to pick out specific elements with Mojo::DOM->children.
The posts are structured differently depending on their content.
A simple message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>
A message quoting another user:
<div id="postid_12345">
  <div>
    <table>
      <tr>
        <td>
          <div>Quote originally posted by Bob</div>
          <div>Everyone knows the sky is blue.</div>
        </td>
      </tr>
    </table>
  </div>
  I disagree with you, Bob. It's obviously green.
</div>
A message with spoilers:
<div id="postid_12345">
  <div class="spoiler">Yoda is Luke's father!</div>
</div>
A message quoting another user, with spoilers:
<div id="postid_12345">
  <div>
    <table>
      <tr>
        <td>
          <div>Quote originally posted by Fred</div>
          <div class="spoiler">Yoda is Luke's father!</div>
        </td>
      </tr>
    </table>
  </div>
  <div class="spoiler">No waaaaay!</div>
</div>
Assuming the HTML above and an array populated with the necessary post IDs:
for (@post_ids) {
    $mech->get($full_url_of_specific_forum_post);
    my $dom = Mojo::DOM->new($mech->content);
    # at() expects a CSS selector, so the id needs a leading '#'
    my $div_id = '#postid_' . $_;
    say $dom->at($div_id)->children('div')->first;
    say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything on one continuous line, which makes it difficult to tell what is quoted and what is original to the post.
Using $dom->at($div_id)->text skips all children, so the quoted text and spoilers are not captured.
I tried variations on $dom->at($div_id)->children('div')->first, but that gives me everything, including the HTML.
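To make the three behaviors concrete, here is a minimal sketch that runs against the quoting example directly instead of fetching it with WWW::Mechanize. The inline $html string and the hard-coded post ID are assumptions for illustration only, and the exact whitespace in the output may differ between Mojolicious versions.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Assumption: sample markup pasted inline rather than fetched from the forum.
my $html = <<'HTML';
<div id="postid_12345">
  <div>
    <table>
      <tr>
        <td>
          <div>Quote originally posted by Bob</div>
          <div>Everyone knows the sky is blue.</div>
        </td>
      </tr>
    </table>
  </div>
  I disagree with you, Bob. It's obviously green.
</div>
HTML

# at() takes a CSS selector; '#postid_12345' matches the element by id.
my $post = Mojo::DOM->new($html)->at('#postid_12345');

# all_text: text of the element and all of its descendants, run together.
say $post->all_text;

# text: only the element's own text nodes; the quote block is skipped.
say $post->text;

# children('div')->first: a Mojo::DOM object, which stringifies to HTML.
say $post->children('div')->first;
```

The first call mixes the quote and the reply, the second drops the quote entirely, and the third prints markup rather than text, which is the problem described above.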
Ideally, I would like to get all the text in each message, with each child on its own line, for example:
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after reading the documentation and tinkering for several hours, my head is spinning and I'm at a loss. I just don't understand how Mojo::DOM and Mojo::Collection work.
Any help would be greatly appreciated.