What is the fastest way in Perl to get all lines of file1 that don't appear in file2?

I have two (very large) text files. What is the fastest way, in terms of runtime, to create a third file containing all the lines of file1 that do not appear in file2?

So, if file1 contains:

    Sally
    Joe
    Tom
    Suzie

And file2 contains:

    Sally
    Suzie
    Harry
    Tom

Then the output file should contain:

    Joe
+4
6 answers

Build a hash containing each line of file2. Then, for each line of file1, print it if it is not in the hash. This is O(N), the best complexity class you can achieve, given that you have to read all of the input.

Perl implementation:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Carp ();

    my $file1 = 'file1.txt';
    my $file2 = 'file2.txt';

    # Remember every line of file2.
    my %map;
    {
        open my $in, '<', $file2 or Carp::croak("Can't open $file2");
        while (<$in>) {
            $map{$_} = 1;
        }
        close($in) or Carp::carp("error closing $file2");
    }

    # Print the lines of file1 that were never seen in file2.
    {
        open my $in, '<', $file1 or Carp::croak("Can't open $file1");
        while (<$in>) {
            if (!$map{$_}) {
                print $_;
            }
        }
        close $in or Carp::carp("error closing $file1");
    }

If file2 is so large that the hash does not fit into memory, then we have a different problem. A possible solution is to apply the above approach to chunks of file2 (each small enough to fit into memory), writing the results to temporary files. Provided there is enough overlap between file1 and file2, the temporary files should stay a reasonable size. To compute the final result, we take the intersection of the lines in the temporary files: for a line to be in the final result, it must appear in all of them.
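A rough sketch of that scheme (the file names, chunk size, and temp-file naming here are illustrative assumptions, and the output loses file1's original ordering and duplicate lines):

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $chunk_lines = 1_000_000;   # lines of file2 hashed per pass; tune to RAM
    my ($file1, $file2) = ('file1.txt', 'file2.txt');
    my @temps;

    # Assumes file2 is non-empty (otherwise all of file1 qualifies).
    open my $f2, '<', $file2 or die "Can't open $file2: $!";
    my $done = 0;
    until ($done) {
        # Hash the next chunk of file2.
        my %map;
        for (1 .. $chunk_lines) {
            my $line = <$f2>;
            if (!defined $line) { $done = 1; last; }
            $map{$line} = 1;
        }
        last unless %map;

        # One pass over file1: keep the lines that miss this chunk.
        my $tmp = 'not_in_chunk' . scalar(@temps) . '.tmp';
        open my $f1,  '<', $file1 or die "Can't open $file1: $!";
        open my $out, '>', $tmp   or die "Can't write $tmp: $!";
        while (<$f1>) { print {$out} $_ unless $map{$_}; }
        close $f1;
        close $out;
        push @temps, $tmp;
    }
    close $f2;

    # Intersect the temporary files: a line belongs in the final result
    # only if it survived every chunk, i.e. appears in all temp files.
    my %passes;
    for my $tmp (@temps) {
        my %seen;
        open my $in, '<', $tmp or die "Can't open $tmp: $!";
        $seen{$_} = 1 while <$in>;
        close $in;
        $passes{$_}++ for keys %seen;
        unlink $tmp;
    }
    print grep { $passes{$_} == @temps } sort keys %passes;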

+13

Not Perl, but efficient:

diff <(sort file1) <(sort file2)
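Note that diff prefixes each line with < or >, so its output is not quite the requested third file. If only the lines unique to file1 are wanted, the same process substitution (a bash feature) can feed comm instead; something like this should work:

    comm -23 <(sort file1) <(sort file2) > differences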

+7

I am surprised that no one has yet proposed the most memory-efficient method, which is to sort both files separately and then merge the lists. If you are on Unix, the following 3 simple commands will do what you need, quickly (comm -2 -3 suppresses the lines unique to file2 and the lines common to both, leaving only the lines unique to file1):

    sort < file1 > file1.sorted
    sort < file2 > file2.sorted
    comm -2 -3 file1.sorted file2.sorted > differences

If the files are so large that Perl has to page to virtual memory to hold all the lines in a hash table, this method will be much faster. Otherwise, the hash-table approach should be faster.

If you are on Unix, it is better to use your system's external sort command, since it is sensible about memory usage, whereas using Perl's sort() means reading the entire contents of the file into memory.

If you are running Windows, note that the supplied sort command is case-insensitive and this cannot be disabled! Also, there is no comm command on Windows, so you will need to roll your own - replace the third line above with:

    perl subtract_sets.pl file1.sorted file2.sorted > differences.txt

subtract_sets.pl

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $f1, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
    open my $f2, '<', $ARGV[1] or die "Couldn't open $ARGV[1]: $!";

    my $x = <$f1>;
    my $y = <$f2>;

    # Merge-scan the two sorted files, keeping only lines unique to file1.
    while (defined $x && defined $y) {
        if ($x lt $y) {
            print $x;       # $x is only in file1
            $x = <$f1>;
        } elsif ($y lt $x) {
            $y = <$f2>;     # $y is only in file2: skip it
        } else {
            # Lines match: present in both files, skip both
            $x = <$f1>;
            $y = <$f2>;
        }
    }

    # Whatever remains of file1 cannot be in file2.
    while (defined $x) {
        print $x;
        $x = <$f1>;
    }
+5
    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $alpha, '<', 'file1') || die "Couldn't open file1";
    open(my $beta,  '<', 'file2') || die "Couldn't open file2";

    # Build the lookup hash from file2, then print the lines of
    # file1 that it does not contain.
    my %data;
    $data{$_} = 1 for <$beta>;
    print grep { !$data{$_} } <$alpha>;
+4

What matters most to you? Are the files larger than your RAM? The basic answer is to use a hash; if the files are bigger than your RAM, then you need to use a hash tied to a DBM file.
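A minimal sketch of that tied-hash variant, assuming the DB_File module is available and reusing the example's file1.txt/file2.txt names (the scratch DBM file name is arbitrary):

    #!/usr/bin/perl
    use warnings;
    use strict;
    use DB_File;

    # Tie the lookup hash to an on-disk Berkeley DB file so it is not
    # limited by available RAM.
    my %map;
    tie %map, 'DB_File', 'seen.db' or die "Can't tie seen.db: $!";

    # Remember every line of file2.
    open my $in2, '<', 'file2.txt' or die "Can't open file2.txt: $!";
    $map{$_} = 1 while <$in2>;
    close $in2;

    # Print the lines of file1 that file2 never contained.
    open my $in1, '<', 'file1.txt' or die "Can't open file1.txt: $!";
    while (<$in1>) {
        print unless $map{$_};
    }
    close $in1;

    untie %map;
    unlink 'seen.db';   # remove the scratch database when done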

+2

Just some performance benchmarks:

10k lines, at most 10 random characters per line:

              Rate slipset marcog
    slipset 47.6/s      --   -16%
    marcog  56.7/s     19%     --

100k lines, at most 10 characters per line:

              Rate slipset marcog
    slipset 3.02/s      --   -34%
    marcog  4.60/s     52%     --

1M lines, at most 10 characters per line:

            s/iter slipset marcog
    slipset   4.09      --   -33%
    marcog    2.75     49%     --

1k lines, at most 100 characters per line:

             Rate slipset marcog
    slipset 379/s      --   -12%
    marcog  431/s     14%     --

100k lines, at most 100 random characters per line:

              Rate slipset marcog
    slipset 2.15/s      --   -30%
    marcog  3.08/s     44%     --

1k lines, at most 1000 random characters per line:

             Rate slipset marcog
    slipset 133/s      --   -10%
    marcog  148/s     11%     --

100k lines, at most 1000 characters per line:

              Rate slipset marcog
    slipset 1.01/s      --   -18%
    marcog  1.22/s     22%     --
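The tables above are in the format produced by Benchmark's cmpthese. A harness along these lines could generate them (a hypothetical reconstruction, not the original script: the file names and the bodies of the 'marcog' and 'slipset' subs are assumptions based on the two hash-based answers; output goes to a null handle so that printing does not dominate the timings):

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Benchmark qw(cmpthese);

    # Discard the candidates' output so only the algorithms are timed.
    open my $null, '>', '/dev/null' or die "Can't open /dev/null: $!";

    cmpthese(-10, {
        # Hash file2, stream file1 (the top-voted answer).
        marcog => sub {
            my %map;
            open my $in2, '<', 'file2.txt' or die;
            $map{$_} = 1 while <$in2>;
            close $in2;
            open my $in1, '<', 'file1.txt' or die;
            while (<$in1>) { print {$null} $_ unless $map{$_}; }
            close $in1;
        },
        # Slurp both files (the shorter hash-based answer).
        slipset => sub {
            open my $in2, '<', 'file2.txt' or die;
            open my $in1, '<', 'file1.txt' or die;
            my %data;
            $data{$_} = 1 for <$in2>;
            print {$null} grep { !$data{$_} } <$in1>;
            close $in1;
            close $in2;
        },
    });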

Memory efficiency

marcog: 100k lines, at most 1000 random characters per line:

    Memory usage summary: heap total: 163_259_635, heap peak: 61_536_800, stack peak: 17_648
             total calls  total memory  failed calls
    malloc|      307_425   162_378_090             0
    realloc|       1_461        96_878             0  (nomove:1_218, dec:1_026, free:0)
    calloc|       12_762       784_667             0
    free|        307_598   155_133_460

slipset: 100k lines, at most 1000 random characters per line:

    Memory usage summary: heap total: 647_103_469, heap peak: 118_445_776, stack peak: 17_648
             total calls  total memory  failed calls
    malloc|      508_089   186_752_811             0
    realloc|     399_907   459_553_775             0  (nomove:334_169, dec:196_380, free:0)
    calloc|       12_765       796_883             0
    free|        507_584   256_315_688
+2
