How can I render HTML as text using Perl like Lynx?

Possible duplicate:
Which CPAN module would you recommend for turning HTML into plain text?

Question:

  • Is there a module for rendering HTML as text, in particular one that collects the text while honoring font-style tags such as <tt>, <b>, <i>, etc. and the line-break tag <br>, similar to Lynx?

For example:

# cat test.html

    <body>
    <div id="foo" class="blah">
    <tt>test<br>
    <b>test</b><br>
    whatever<br>
    test</tt>
    </div>
    </body>

# lynx.exe --dump test.html

    test
    test
    whatever
    test

Note: the second line must be bold.

3 answers

Lynx is a large program, and reimplementing its HTML rendering would not be trivial.

How about simply calling it:

    my $lynx = '/path/to/lynx';
    my $html = [ html here ];
    my $txt  = `$lynx --dump --width 9999 -stdin <<EOF\n$html\nEOF\n`;
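One caveat with the backtick version: the HTML passes through the shell, so a `$`, a backtick, or a line reading `EOF` inside the page can break the command. A minimal sketch of a variant that avoids the shell by writing the HTML to a temporary file and using a list-form pipe open follows. The helper name `html_to_text_via_lynx` is made up for illustration, it assumes `lynx` is on your PATH unless you pass a path, and it uses the single-dash `-dump` and `-width=N` spellings of the options.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of any module): render an HTML string
# by handing lynx a temporary file instead of a shell here-document.
sub html_to_text_via_lynx {
    my ($html, $lynx) = @_;
    $lynx //= 'lynx';    # assumption: lynx is on PATH

    # The temp file keeps the HTML away from shell interpolation.
    my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $html;
    close $fh or die "Cannot close $filename: $!";

    # List-form pipe open runs lynx directly, without a shell.
    open my $out, '-|', $lynx, '-dump', '-width=9999', $filename
        or die "Cannot run $lynx: $!";
    my $text = do { local $/; <$out> };    # slurp all of lynx's output
    close $out;

    return $text;
}
```

Usage would be something like `my $txt = html_to_text_via_lynx($html, '/path/to/lynx');`. Note that a plain `-dump` still gives plain text, so the bold requirement from the question is not covered by this alone.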

Go to search.cpan.org and search for "HTML text"; that will give you many options to suit your specific needs. HTML::FormatText is a good baseline, and from there you can branch out to more specific variants, for example HTML::FormatText::WithLinks if you want to keep links as footnotes.
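As a rough sketch of what the HTML::FormatText route looks like, here is the sample from the question run through its `format_string` class method. The margin values are arbitrary choices for illustration. As far as I know this module produces plain text only, so the `<b>` line will not be marked as bold.

```perl
use strict;
use warnings;
use HTML::FormatText;    # from CPAN; relies on HTML::TreeBuilder

# The sample document from the question.
my $html = <<'HTML';
<body> <div id="foo" class="blah"> <tt>test<br> <b>test</b><br> whatever<br> test</tt> </div> </body>
HTML

# format_string parses the HTML and returns the formatted text.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,     # illustrative values, adjust to taste
    rightmargin => 72,
);

print $text;
```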

I am on Windows, so I cannot fully verify this, but you can adapt the htext example script that ships with HTML::Parser:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::Parser 3.00 ();
    use Term::ANSIColor;

    my %inside;    # nesting depth of each currently open tag

    # Called for every start tag (+1) and end tag (-1).
    sub tag {
        my ($tag, $num) = @_;
        $inside{$tag} += $num;
        print " ";    # not for all tags
    }

    # Called for every run of text; colors it according to the open tags.
    sub text {
        return if $inside{script} || $inside{style};

        my $esc = 1;
        if ( $inside{b} or $inside{strong} ) {
            print color 'blue';
        }
        elsif ( $inside{i} or $inside{em} ) {
            print color 'yellow';
        }
        else {
            $esc = 0;
        }

        print $_[0];
        print color 'reset' if $esc;
    }

    HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [ \&tag,  "tagname, '+1'" ],
            end   => [ \&tag,  "tagname, '-1'" ],
            text  => [ \&text, "dtext" ],
        ],
        marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
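To get closer to what the question asks for (a line break at `<br>` and real bold rather than a color), the same idea can be pushed a little further. This is only a sketch: `render_html` is a name I made up, the whitespace handling is a crude approximation of what a browser does, and it returns a string instead of printing.

```perl
use strict;
use warnings;
use HTML::Parser 3.00 ();
use Term::ANSIColor qw(colored);

# Hypothetical helper: returns the text of $html with <br> turned into
# newlines and the content of <b>/<strong> wrapped in ANSI bold.
sub render_html {
    my ($html) = @_;
    my %inside;      # nesting depth per tag name
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run in one piece
        start_h => [ sub {
            my ($tag) = @_;
            $inside{$tag}++;
            $out .= "\n" if $tag eq 'br';
        }, 'tagname' ],
        end_h  => [ sub { $inside{ $_[0] }-- }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $inside{script} || $inside{style};
            $text =~ s/\s+/ /g;                           # collapse whitespace
            $text =~ s/^ // if $out eq '' || $out =~ /\n\z/;  # no leading blank on a new line
            return if $text eq '';
            my $bold = ( $inside{b} || 0 ) > 0 || ( $inside{strong} || 0 ) > 0;
            $out .= $bold ? colored( $text, 'bold' ) : $text;
        }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;

    $out =~ s/[ \t]+$//mg;    # drop trailing blanks on every line
    return $out;
}
```

Calling `print render_html($html), "\n";` on the sample from the question should give four lines, with the second one in bold on a terminal that understands ANSI escapes. On Windows consoles that may need extra setup, which I have not checked.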