Why does my Perl script continue to read from the same file even though I closed it?

I am writing a Perl script that takes two command-line arguments: a directory and a year. The directory contains a large number of text files or HTML files (depending on the year). For example, the 2010 directory contains files named <number>rank.html, with numbers from 2001 to 2212. I want the script to open each file in turn, extract part of the title from the HTML file, and print it to a text file. However, when I run my code, it just prints the title from the first file to the text file over and over. It seems to only ever open the first file, 2001rank.html, and none of the others. I will post the code below; thanks to everyone who helps.

    my $directory = shift or "Must supply directory\n";
    my $year = shift or "Must supply year\n";
    unless (-d $directory) {
        die "Error: Directory must be a directory\n";
    }
    unless ($directory =~ m/\/$/) {
        $directory = "$directory/";
    }
    open COLUMNS, "> columns$year.txt" or die "Can't open columns file";
    my $column_name;
    for (my $i = 2001; $i <= 2212; $i++) {
        if ($year >= 2009) {
            my $html_file = $directory.$i."rank.html";
            open FILE, $html_file;
            #check if opened correctly, if not, skip it
            unless (defined fileno(FILE)) {
                print "skipping $html_file\n";
                next;
            }
            $/ = "\n";
            my $line = <FILE>;
            if (defined $line) {
                $column_name = "";
                $_ = <FILE> until m{</title>};
                $_ =~ m{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i;
                $column_name = $1;
            }
            else {
                close FILE;
                next;
            }
            close FILE;
        }
        else {
            my $text_file = $directory.$i."rank.txt";
            open FILE, $text_file;
            unless (defined fileno(FILE)) {
                print "skipping $text_file\n";
                next;
            }
            $/ = "\r";
            my $line = <FILE>;
            if (defined $line) {
                $column_name = "";
                $_ = <FILE> until /Rank/i;
                $_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i;
                $column_name = $3;
            }
            else {
                close FILE;
                next;
            }
            close FILE;
        }
        print "Adding $column_name to text file\n";
        print COLUMNS "$column_name\n";
    }
    close COLUMNS;

In other words, $column_name gets the same value on every pass through the loop, even though I know the HTML files are different.

+4
3 answers

You can probably debug this much faster if you switch to lexical filehandles instead of global ones, and also enable strict checking:

    use strict;
    use warnings;

    while (...) {
        # ...
        open my $filehandle, $html_file;
        # ...
        my $line = <$filehandle>;
    }

That way the filehandle(s) will go out of scope on each iteration of the loop, so you can see more clearly what exactly is being referenced and where. (Hint: you may have missed a condition where the filehandle is closed, so it gets reused the next time around.)

For more on best practices using open and file descriptors, see

Some other points:

  • Never assign to $_ explicitly; that is asking for trouble. Declare your own variable to hold your data: my $line = <$filehandle> (as in the example above)
  • Pull your matches directly into variables instead of using $1, $2, etc., and put parentheses only around the parts you need: my ($column_name) = ($line =~ m/Rank\s+Country\s+(.+)\s+Date/i);
  • Check error conditions first, so that the bulk of your code can be outdented by one (or more) level(s). This improves readability: when the main part of your algorithm is visible on the screen all at once, you can better see what it does and catch errors.
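
The three points above can be sketched together as follows. This is only an illustration, not the asker's final code: the helper name `column_name_from` is hypothetical, the sample data is made up, and the input is an in-memory string (opened through a scalar reference) so that the snippet runs without any files on disk. The non-greedy `(.+?)` is also my own choice, used to keep trailing whitespace out of the capture.

```perl
use strict;
use warnings;

# Hypothetical helper: return the column name from the first line that
# matches the Rank/Country/.../Date header, or undef if no line matches.
sub column_name_from {
    my ($source) = @_;

    # Error condition first: bail out early so the main logic stays outdented.
    open my $fh, '<', $source or return;

    # Own lexical variable for each line instead of assigning to $_.
    while (my $line = <$fh>) {
        # A list-context match puts the capture straight into a variable.
        my ($column_name) = $line =~ /Rank\s+Country\s+(.+?)\s+Date/i
            or next;
        return $column_name;    # $fh is closed when it goes out of scope
    }
    return;
}

# Made-up sample inputs standing in for two different rank files.
my $first  = "some header\nRank  Country  Population  Date\n";
my $second = "Rank  Country  Area  Date\n";

for my $text ($first, $second) {
    my $name = column_name_from(\$text) // 'none';
    print "$name\n";
}
```

Because every variable here is lexical and freshly declared on each call, nothing can leak from one file to the next.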

If you apply the points above, I am sure you will spot your mistake. I noticed it myself while making this last edit, but I think you will learn more if you discover it on your own. (I'm not trying to be arrogant, trust me on that!)
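
For readers who would rather see the pitfall spelled out, here is a small sketch of how a statement-modifier `until` loop behaves when `$_` still holds a value from an earlier pass. The two in-memory "files" and their titles are invented for the demonstration, and `$_` is initialized to an empty string only to keep `warnings` quiet on the first pass.

```perl
use strict;
use warnings;

# Two made-up in-memory "files" with different titles.
my @files = (
    "<html>\n<title>First</title>\n",
    "<html>\n<title>Second</title>\n",
);

# Pattern from the question: read into the global $_ until it matches.
$_ = '';
my @seen;
for my $content (@files) {
    open my $fh, '<', \$content or die "open: $!";
    my $first_line = <$fh>;
    # $_ is global. After the first file it still holds a line containing
    # </title>, so the condition is already true and nothing is read.
    $_ = <$fh> until m{</title>};
    my ($title) = m{<title>(.+)</title>};
    push @seen, $title;
}
print "@seen\n";     # First First

# Same task with a lexical line variable that is fresh for every file.
my @fixed;
for my $content (@files) {
    open my $fh, '<', \$content or die "open: $!";
    while (my $line = <$fh>) {
        next unless $line =~ m{<title>(.+)</title>};
        push @fixed, $1;
        last;
    }
}
print "@fixed\n";    # First Second
```

The file really is closed and reopened on every pass in the first loop; it is the leftover contents of `$_` that make each pass look the same.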

+5

Your processing is similar for the HTML and text files, so make your life easy and factor out the common part:

    sub scrape {
        my($path,$pattern,$sep) = @_;

        unless (open FILE, $path) {
            warn "$0: skipping $path: $!\n";
            return;
        }

        local $/ = $sep;
        my $column_name;
        while (<FILE>) {
            next unless /$pattern/;
            $column_name = $1;
            last;
        }
        close FILE;

        ($path,$column_name);
    }

Then specialize it for the two input types:

    sub scrape_html {
        my($directory,$i) = @_;

        scrape $directory.$i."rank.html",
               qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
               "\n";
    }

    sub scrape_txt {
        my($directory,$i) = @_;

        scrape $directory.$i."rank.txt",
               qr/Rank\s+Country\s+(.+)\s+Date/i,
               "\r";
    }

Then your main program is simple:

    my $directory = shift or die "$0: must supply directory\n";
    my $year = shift or die "$0: must supply year\n";
    die "$0: $directory is not a directory\n" unless -d $directory;

    # add trailing slash if necessary
    $directory =~ s{([^/])$}{$1/};

    my $columns_file = "columns$year.txt";
    open COLUMNS, ">", $columns_file
        or die "$0: open $columns_file: $!";

    for (my $i = 2001; $i <= 2212; $i++) {
        my $process = $year >= 2009 ? \&scrape_html : \&scrape_txt;
        my($path,$column_name) = $process->($directory,$i);

        next unless defined $path;

        if (defined $column_name) {
            print "$0: Adding $column_name to text file\n";
            print COLUMNS "$column_name\n";
        }
        else {
            warn "$0: no column name in $path\n";
        }
    }

    close COLUMNS or warn "$0: close $columns_file: $!\n";

Notice how careful you have to be about closing global filehandles. Use lexical filehandles instead, as in

    open my $fh, "<", $path or die "$0: open $path: $!";

Passing $fh as a parameter or storing it in a hash is much nicer. In addition, lexical filehandles are closed automatically when they go out of scope, so there is no way to stomp on a handle that someone else is using.
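
As a rough illustration of those two conveniences, the sketch below passes a lexical handle to a sub and keeps several handles in a hash. The helper `first_line`, the hash names, and the sample strings are all hypothetical; the handles read from in-memory strings so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: a lexical handle is an ordinary scalar, so it can be
# passed to a sub like any other argument.
sub first_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Made-up sample contents, one per input type.
my %text = (
    html => "<title>A</title>\nmore\n",
    txt  => "Rank Country B Date\nmore\n",
);

# Handles can be stored in a hash, keyed by whatever is convenient.
my %handle_for;
for my $kind (sort keys %text) {
    open $handle_for{$kind}, '<', \$text{$kind} or die "open $kind: $!";
}

my %first = map { $_ => first_line($handle_for{$_}) } sort keys %handle_for;
print "$_: $first{$_}\n" for sort keys %first;

# Dropping the last reference closes the handle; no explicit close is needed.
delete $handle_for{html};
```

Each handle lives only as long as something refers to it, which is what makes the bareword FILE handle in the earlier code the odd one out.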

+2

Have you considered grep?

Use grep to pull out only the lines of the HTML files that contain the title, and then process grep's output.

That is simpler, since you don't need to write any file-handling code. You didn't say what you want to do with the titles; if you only need a list, you may not need to write any code at all.

Try something like:

    grep -ri title <directoryname>

0