Your processing of the HTML and text files is nearly the same, so make your life easy and factor out the common part:
    sub scrape {
      my($path,$pattern,$sep) = @_;

      unless (open FILE, $path) {
        warn "$0: skipping $path: $!\n";
        return;
      }

      local $/ = $sep;
      my $column_name;
      while (<FILE>) {
        next unless /$pattern/;
        $column_name = $1;
        last;
      }
      close FILE;

      ($path,$column_name);
    }
Then specialize it for the two input types:
    sub scrape_html {
      my($directory,$i) = @_;

      scrape $directory.$i."rank.html",
             qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
             "\n";
    }

    sub scrape_txt {
      my($directory,$i) = @_;

      scrape $directory.$i."rank.txt",
             qr/Rank\s+Country\s+(.+)\s+Date/i,
             "\r";
    }
Then your main program is simple:
    my $directory = shift or die "$0: must supply directory\n";
    my $year = shift or die "$0: must supply year\n";

    die "$0: $directory is not a directory\n" unless -d $directory;
Notice how careful you have to be to close global filehandles. Use lexical filehandles instead, as in
    open my $fh, "<", $path or die "$0: open $path: $!";
Passing $fh as a parameter or storing it in a hash is much nicer. Lexical filehandles are also closed automatically when they go out of scope, and there is no way to stomp on a handle that someone else is using.
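As a rough sketch of that advice applied to the common routine, here is one way scrape could look with a lexical filehandle and three-argument open. The name scrape_lexical is made up for illustration so it does not clash with the sub above; the arguments and return value are assumed to be the same as scrape's.

```perl
use strict;
use warnings;

# Hypothetical variant of scrape() that uses a lexical filehandle.
# Returns ($path, $column_name), or an empty list if $path cannot be opened.
sub scrape_lexical {
    my ($path, $pattern, $sep) = @_;

    # Three-argument open: $path is always treated as a plain file name.
    open my $fh, '<', $path or do {
        warn "$0: skipping $path: $!\n";
        return;
    };

    local $/ = $sep;    # read records delimited by $sep

    my $column_name;
    while (my $record = <$fh>) {
        next unless $record =~ /$pattern/;
        $column_name = $1;    # first capture group of $pattern
        last;
    }

    # No explicit close: $fh is closed when it goes out of scope.
    return ($path, $column_name);
}
```

Because the handle lives in $fh rather than in the global FILE, the early return on a failed open and the normal return at the end both clean up on their own, and a caller that happens to have its own FILE open is unaffected.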