The main problem here is that you said "millions of columns" but did not say how many rows. Checking each value in every row against the corresponding value in every other row means an enormous number of comparisons.
Of course, you can reduce the number of candidate columns as you go, but every surviving column still has to be checked down to the last row. So ... a lot of processing either way.
We can build a "seed" hash from the first two lines:
```perl
use strict;
use warnings;

open my $fh, '<', "inputfile.txt" or die "Cannot open input: $!";

my %matches;
my $line     = <$fh>;
my $nextline = <$fh>;
my $i = 0;
while ($line =~ s/\t(\d+)//) {       # consume one numeric column from line 1
    my $num1 = $1;
    if ($nextline =~ s/\t(\d+)//) {  # consume the same column from line 2
        $matches{$i} = $num1 if $1 == $num1;   # seed: equal so far
    } else {
        die "Mismatched line at line $.";
    }
    $i++;
}
```
Then, using this seed hash, you can read the rest of the lines and knock out any columns whose values no longer match, for example:
```perl
while (my $line = <$fh>) {
    my $i = 0;
    while ($line =~ s/\t(\d+)//) {
        if (defined $matches{$i}) {
            $matches{$i} = undef if $matches{$i} != $1;  # column disqualified
        }
        $i++;
    }
}
```
You could imagine a variant that also deletes the values already proven unique from each line as you go, but that means building an array per line or assembling a regular expression, and I'm not sure it would be any faster than simply scanning through the string.
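As a rough illustration of that idea (a hypothetical variant, not benchmarked, and it does fall back on split, with made-up sample data): instead of rescanning every column, keep only the still-candidate indices and compare just those fields, deleting a key as soon as it fails:

```perl
use strict;
use warnings;

# Assume %matches was seeded from the first two rows as above.
my %matches = (0 => 5, 1 => 7, 2 => 9);
my @rows = ("row3\t5\t8\t9\n");    # remaining lines of the file

for my $line (@rows) {
    chomp(my $copy = $line);
    my @fields = split /\t/, $copy;   # index 0 is the header field
    for my $i (grep { defined $matches{$_} } keys %matches) {
        # +1 skips the header; drop the column on first mismatch
        delete $matches{$i} if $fields[$i + 1] != $matches{$i};
    }
}

my @kept = sort { $a <=> $b } keys %matches;
print "still duplicated: @kept\n";   # still duplicated: 0 2
```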
Then, after processing all the lines, you will have a hash whose defined entries mark the duplicated columns, so you can reopen the file and print only those:
```perl
open my $fh,      '<', "inputfile.txt" or die "Cannot reopen input: $!";
open my $outfile, '>', "outfile.txt"   or die "Cannot open output: $!";

while (my $line = <$fh>) {
    my $i = 0;
    if ($line =~ s/^([^\t]+)(?=\t)//) {   # keep the leading header field
        print $outfile $1;
    } else {
        warn "Missing header at line $.\n";
    }
    while ($line =~ s/(\t\d+)//) {
        print $outfile $1 if defined $matches{$i};
        $i++;
    }
    print $outfile "\n";
}
```
This is a fairly heavy operation and the code above has not been tested; treat it as a sketch of the solution rather than a finished script, and expect the full file to take a while to process. I suggest running some tests to see that it works with your data, and tuning it from there.
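As a quick sanity check, here is a self-contained sketch (the three-row sample data and the use of a temporary file are made up for illustration) that runs the seed pass and the filter pass end to end on a tiny input:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: columns 0 and 2 hold the same value in every
# row; column 1 changes on the last row.
my ($tmp, $file) = tempfile();
print $tmp "row1\t5\t7\t9\n",
           "row2\t5\t7\t9\n",
           "row3\t5\t8\t9\n";
close $tmp;

open my $fh, '<', $file or die "Cannot open $file: $!";

# Seed pass over the first two lines.
my %matches;
my $line     = <$fh>;
my $nextline = <$fh>;
my $i = 0;
while ($line =~ s/\t(\d+)//) {
    my $num1 = $1;
    $nextline =~ s/\t(\d+)// or die "Mismatched line at line $.";
    $matches{$i} = $num1 if $1 == $num1;
    $i++;
}

# Filter pass over the remaining lines.
while (my $rest = <$fh>) {
    my $j = 0;
    while ($rest =~ s/\t(\d+)//) {
        $matches{$j} = undef
            if defined $matches{$j} and $matches{$j} != $1;
        $j++;
    }
}

# Columns 0 and 2 should survive; column 1 should not.
my @kept = sort { $a <=> $b } grep { defined $matches{$_} } keys %matches;
print "kept columns: @kept\n";   # kept columns: 0 2
```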
If only a few columns qualify, it is much easier to split each row and remove them directly, though I hesitate to use split on rows that long. Something like:
```perl
while (my $line = <$fh>) {
    my @line = split /\t/, $line;
    # Only splice the columns that are still flagged as duplicates,
    # from the highest index down; +1 skips the header at index 0.
    for my $key (sort { $b <=> $a } grep { defined $matches{$_} } keys %matches) {
        splice @line, $key + 1, 1;
    }
    $line = join "\t", @line;
    $line =~ s/\n*$/\n/;    # make sure the line still ends in a newline
    print $outfile $line;
}
```
Note that the keys must be sorted in descending order, so that we trim values from the end of the array first. Otherwise each deletion would shift the later elements down and invalidate the remaining indices.
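To see why the descending sort matters, here is a tiny illustration (the array contents are made up): deleting the lower index first shifts every later element down by one, so the second splice at the original index removes the wrong element.

```perl
use strict;
use warnings;

# Suppose we want to delete the elements at indices 1 and 3.
my @asc = qw(a b c d e);
splice @asc, 1, 1;   # removes 'b'; 'c','d','e' all shift left
splice @asc, 3, 1;   # index 3 is now 'e', not 'd' -- wrong element removed
print "ascending:  @asc\n";   # ascending:  a c d

my @desc = qw(a b c d e);
splice @desc, 3, 1;  # removes 'd'; earlier indices are untouched
splice @desc, 1, 1;  # index 1 is still 'b'
print "descending: @desc\n";  # descending: a c e
```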
In any case, this is one possible approach. It is a pretty big operation, so I would keep backups. ;)