How to remove every third HTML tag in Perl?

Question

This is a hastily written script, but I am having trouble with it because I am inexperienced with regular expressions and Perl.

The script is supposed to read in an HTML file. There is one place in the file where I have a bunch of <div>s. I want to remove every third one; they come in groups of four.

My script below will not compile, let alone run.

    #!/usr/bin/perl
    use warnings;
    use strict;

    &remove();

    sub remove {
        my $input = $ARGV[0];
        my $output = $ARGV[1];

        open INPUT, $input or die "couldn't open file $input: $!\n";
        open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";

        my @file = <INPUT>;
        foreach (@file) {
            my $int = 0;
            if ($_ =~ '<div class="cell">') {
                $int++;
                { // this brace was the wrong way
                if ($int % 4 == 3) {
                    $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
                }
        }
        print OUTPUT @file;
    }

Thank you for your help. I know that parsing HTML with a regular expression is wrong, but I just want it to work.

Postmortem: the problem is almost solved. And shame on those who told me that regular expressions are not a good fit; I knew that from the start. But then again, I wanted something quick, and I had written the XSLT that produced this HTML. In this case I did not have the source data to run it again, otherwise I would have fixed it in the XSLT.

+4

html regex perl

Overflown Mar 16 '09 at 1:11

5 answers

When your code does not compile, read the error messages and warnings you get. If they do not make sense, consult perldoc perldiag (or put "use diagnostics;" in your code to have the explanations looked up for you automatically).

+3

ysth Mar 16 '09 at 1:54

Well, you're right that you shouldn't parse HTML with regular expressions. And since that is the case, it probably will not "just work."

Ideally, you need to use an HTML parsing and manipulation library. Do not think of HTML as one big string to be manipulated with text functions: it is a serialized, formatted data structure. You should only monkey with it using a library built for that purpose. The various libraries have already fixed hundreds of bugs that you are likely to run into, which makes it far more likely that a simple manipulation routine written against them will "just work." Master-level Perl programmers generally do not parse HTML this way, and not because they are obsessive or irrational about code quality and purity. It is because they know that reinventing the wheel on their own is unlikely to match the machinery that already exists.

I recommend HTML::Tree because it works the way I think about HTML (and XML). I believe there are several other libraries that may be more popular.

The real truth is that if you cannot even get your program to compile, you need to invest a little more time (half a day or so) in figuring out the basics before coming to ask for help. You have a syntax error in your use of the s///g regex operator, and you need to work out how it is supposed to be used before moving on. It is not difficult, and you can learn what you need from the Camel book, from man perlretut, or from several other sources. If you do not learn how to debug your program now, then any help you get here will probably just carry you to the next syntax error that you cannot get past.
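As a rough illustration of the library approach, here is a minimal sketch using HTML::TreeBuilder from the HTML::Tree distribution. It assumes the module is installed from CPAN (it is not part of core Perl), that the cells are not nested inside each other, and that it is acceptable for the output to be re-serialized as a complete HTML document. The function name remove_third_cells_tree is made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from the HTML::Tree distribution on CPAN (assumed installed)

# Hypothetical helper: parse the HTML, delete the third cell of every
# group of four, and return the re-serialized document.
sub remove_third_cells_tree {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down in list context returns every matching element in document order.
    my @cells = $tree->look_down(_tag => 'div', class => 'cell');

    my $count = 0;
    for my $cell (@cells) {
        $count++;
        $cell->delete if $count % 4 == 3;    # drops the element and its contents
    }

    my $out = $tree->as_HTML;
    $tree->delete;                            # release the parse tree
    return $out;
}

my $sample = "<html><body>"
    . join("", map { qq{<div class="cell">c$_</div>} } 1 .. 8)
    . "</body></html>";
print remove_third_cells_tree($sample), "\n";
```

The counting lives in ordinary Perl rather than in a pattern, so changing which cell of each group is removed only means changing the modulus test.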

+2

skiphoppy Mar 16 '09 at 1:30

Once you get the braces matched up and start using the regular expression correctly for the substitution, you will also need to move

    my $int = 0;

out of the for loop. It is currently reset for every line read, so it will only ever have a value of 0 or 1.
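A minimal sketch of that point, with the counter declared once before the loop so it keeps its value from line to line. The helper name count_cells and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the running counter value recorded at
# each line that contains a cell div.
sub count_cells {
    my @lines = @_;

    my $int = 0;    # declared once, outside the loop, so it is not reset per line
    my @seen;
    foreach my $line (@lines) {
        if ($line =~ /<div class="cell">/) {
            $int++;
            push @seen, $int;
        }
    }
    return @seen;
}

print join(",", count_cells(
    qq{<div class="cell">a</div>\n},
    qq{<p>not a cell</p>\n},
    qq{<div class="cell">b</div>\n},
    qq{<div class="cell">c</div>\n},
)), "\n";
```

With "my $int = 0;" inside the loop instead, every recorded value would be 1, and the "$int % 4 == 3" test could never be true.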

+2

Cebjyre Mar 16 '09 at 2:24

The subroutine has lost its way. Let's start by looking at its structure:

    sub remove { # First opening bracket
        my $input = $ARGV[0];
        my $output = $ARGV[1];
        open INPUT, $input or die "couldn't open file $input: $!\n";
        open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";
        my @file = <INPUT>;
        foreach (@file) { # Second opening bracket
            my $int = 0;
            if ($_ =~ '<div class="cell">') { # Third opening bracket
                $int++;
                { # Fourth opening bracket
                if ($int % 4 == 3) { # Fifth opening bracket
                    $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
                } # First closing bracket
        } # Second closing bracket
        print OUTPUT @file;
    } # Third closing bracket
    # No fourth closing bracket?
    # No fifth closing bracket?

I think you wanted this:

    sub remove {
        my $input = $ARGV[0];
        my $output = $ARGV[1];
        open INPUT, $input or die "couldn't open file $input: $!\n";
        open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";
        my @file = <INPUT>;
        foreach (@file) {
            my $int = 0;
            if ($_ =~ '<div class="cell">') {
                $int++;
            }
            if ($int % 4 == 3) {
                $_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
            }
        }
        print OUTPUT @file;
    }

This will compile, and it leads us to the next question: why are you single-quoting the regular expression? (Also see Cebjyre's point about where to put my $int = 0.)

(To pick up on ysth's point, you can also always run the script with perl -Mdiagnostics script-name to get the longer diagnostic messages.)
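The last statement inside the loop wraps the whole s///g expression in single quotes, so Perl treats that string as a match pattern and never performs a substitution. A small sketch of the difference, assuming for simplicity that a whole cell sits on one line; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = qq{<div class="cell">third</div>\n};

# A quoted string on the right-hand side of =~ is used as a match pattern.
# This only tests the line; it never modifies it.
my $matched = ($line =~ '<div class="cell">');

# The s/PATTERN/REPLACEMENT/ operator, written without quotes around it,
# is what actually edits the string. Work on a copy to keep $line intact.
(my $edited = $line) =~ s/<div class="cell">.*?<\/div>//;

print "matched\n" if $matched;
print "edited line is now just a newline\n" if $edited eq "\n";
```

Note the non-greedy .*? in the pattern: a greedy .* would run to the last </div> on the line rather than the first.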

+1

Telemachus Mar 16 '09 at 1:57

Ken fox · Accepted Answer · 2009-03-16T02:17:47+0000

I agree that HTML cannot really be parsed with regular expressions, but for quick little hacks on HTML whose format you know, regexes work fine. The trick to doing repeated replacements with a regex is to put the repetition into the regex itself. If you do not, you will have trouble keeping the regex's match position in sync with the input you are reading.

Here is a quick and dirty way I would write the Perl. It removes the third div element, even if it is nested in the first two divs. The whole file is read in, and then I use the "g" global substitution modifier to make the regex do the counting. If you have not seen the "x" modifier before, all it does is let you add whitespace for formatting: whitespace inside the regex is ignored.

    remove(@ARGV);

    sub remove {
      my ($input, $output) = @_;

      open(INPUT, "<", $input) or die "couldn't open file $input: $!\n";
      open(OUTPUT, ">", $output) or die "couldn't open file $output: $!\n";

      # Slurp the whole file into one string so the regex can span lines.
      my $content = join("", <INPUT>);
      close(INPUT);

      # $1 captures the first two cell divs, $2 captures everything up to
      # the fourth opening tag; the third div between them is dropped.
      $content =~ s|(.*?<div\s+class="cell">.*?<div\s+class="cell">.*?)
                    <div\s+class="cell">.*?</div>
                    (.*?<div\s+class="cell">)|$1$2|sxg;

      print OUTPUT $content;
      close OUTPUT;
    }
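As a possible alternative to spelling out the whole group of four inside the pattern, the counting can be done in Perl code with the "e" modifier, which evaluates the replacement as an expression. This is a sketch of a different technique from the one above, not the answer's own method. It assumes the cell divs are not nested (the non-greedy match stops at the first </div>), and the function name remove_third_cells is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the third cell div in every group of four.
sub remove_third_cells {
    my ($html) = @_;

    my $count = 0;
    # /g visits every cell, /s lets . match newlines, /x allows layout
    # whitespace in the pattern, /e runs the replacement as Perl code.
    $html =~ s{ ( <div \s+ class="cell"> .*? </div> ) }
              { ++$count % 4 == 3 ? "" : $1 }gsxe;

    return $html;
}

my $sample = join "", map { qq{<div class="cell">c$_</div>\n} } 1 .. 8;
print remove_third_cells($sample);
```

On this eight-cell sample, the cells c3 and c7 are removed and the other six are printed unchanged. Because each cell is matched on its own, a trailing group with fewer than four cells is also handled.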
