What is the best way to compare string arrays in perl

Question

I am trying to compare multiple arrays of strings containing lists of directory files. The goal is to determine which files exist in each directory AND which files do not exist. Consider:

    List1 List2 List3 List4
    a     a     e     f
    b     b     d     g
    c     f     a     h

The result should be:

List1:

          List1 List2 List3 List4
    a     yes   yes   yes   no
    b     yes   yes   no    no
    c     yes   no    no    no

List2:

          List1 List2 List3 List4
    a     yes   yes   yes   no
    b     yes   yes   no    no
    f     no    yes   no    yes

...

I could loop over every array and, for each entry, loop over all the other arrays and do a grep:

    for my $curfile (@currentdirfiles) {
        if ( grep(/$curfile/, @otherarrsfiles) ) {
            # set 'yes'
        } else {
            # set 'no'
        }
    }

My only concern is that I end up with roughly quadratic, O(n^2), running time. Perhaps there is nothing I can do about that, since I have to go through all the arrays anyway. One improvement might be in the grep call, but I'm not sure.

Any thoughts?

+4

grep perl string-comparison

Edj Apr 27 '11 at 4:31

7 answers

When you need to do a lot of string lookups, you generally want to use hashes. Here is one way to do this:

    use strict;
    use warnings;

    # Define the lists:
    my @lists = (
      [qw(a b c)], # List 1
      [qw(a b f)], # List 2
      [qw(e d a)], # List 3
      [qw(f g h)], # List 4
    );

    # For each file, determine which lists it is in:
    my %included;

    for my $n (0 .. $#lists) {
      for my $file (@{ $lists[$n] }) {
        $included{$file}[$n] = 1;
      } # end for each $file in this list
    } # end for each list number $n

    # Print out the results:
    my $fileWidth = 8;

    for my $n (0 .. $#lists) {

      # Print the header rows:
      printf "\nList %d:\n", $n+1;

      print ' ' x $fileWidth;
      printf "%-8s", "List $_" for 1 .. @lists;
      print "\n";

      # Print a line for each file:
      for my $file (@{ $lists[$n] }) {
        printf "%-${fileWidth}s", $file;

        printf "%-8s", ($_ ? 'yes' : 'no') for @{ $included{$file} }[0 .. $#lists];
        print "\n";
      } # end for each $file in this list
    } # end for each list number $n
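As a side note on the `grep(/$curfile/, ...)` approach from the question, here is a minimal sketch of how it differs from a hash lookup. The sample file names are made up for illustration. The point it tries to show is that a regex grep does substring and metacharacter matching, while a hash lookup is exact and does not rescan the other list for every file.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the question.
my @current = ('a', 'b.txt', 'c');
my @other   = ('abc', 'bxtxt', 'c');

# Regex grep: 'a' matches inside 'abc', and the unescaped '.' in
# 'b.txt' matches any character, so 'bxtxt' counts as a hit too.
my @regex_hits = grep { my $f = $_; grep { /$f/ } @other } @current;

# Exact-match grep: correct, but still one linear scan of @other per file.
my @exact_hits = grep { my $f = $_; grep { $_ eq $f } @other } @current;

# Hash lookup: build the index once, then each test is a single lookup.
my %in_other  = map { $_ => 1 } @other;
my @hash_hits = grep { exists $in_other{$_} } @current;

print "regex: @regex_hits\n";
print "exact: @exact_hits\n";
print "hash:  @hash_hits\n";
```

If a regex match is really wanted, wrapping the name as `/^\Q$f\E$/` makes it behave like an exact comparison.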

+2

cjm Apr 27 '11 at 5:05

The easiest way is to use perl5i and autoboxing:

    use perl5i;
    my @list1 = qw(one two three);
    my @list2 = qw(one two four);

    my $missing = @list1->diff(\@list2);
    my $both    = @list1->intersect(\@list2);

In a more restricted environment, use hashes for this, since the file names will be unique:

    sub in_list {
        my ($one, $two) = @_;
        my (@in, @out);
        my %a = map { $_ => 1 } @$one;

        foreach my $f (@$two) {
            if ($a{$f}) {
                push @in, $f;
            }
            else {
                push @out, $f;
            }
        }
        return (\@in, \@out);
    }

    my @list1 = qw(one two three);
    my @list2 = qw(one two four);
    my ($in, $out) = in_list(\@list1, \@list2);

    print "In list 1 and 2:\n";
    print "  $_\n" foreach @$in;

    print "In list 2 and not in list 1\n";
    print "  $_\n" foreach @$out;

+1

Alex Apr 27 '11 at 5:06

Why not just remember where each file is as you read them in?

Let's say you have a list of directories to read from in @dirlist:

    use File::Slurp qw( read_dir );

    my %in_dir;
    my %dir_files;

    foreach my $dir ( @dirlist ) {
        die "No such directory $dir" unless -d $dir;
        foreach my $file ( read_dir($dir) ) {
            $in_dir{$file}{$dir} = 1;
            push @{ $dir_files{$dir} }, $file;
        }
    }

Now $in_dir{filename} will have entries defined for each directory of interest, and $dir_files{directory} will have a list of files for each directory ...

    foreach my $dir ( @dirlist ) {
        print "$dir\n";
        # Header row: an empty first cell, then one column per directory
        print join("\t", "", @dirlist), "\n";
        foreach my $file ( @{ $dir_files{$dir} } ) {
            my @info = ($file);
            foreach my $dir_for_file ( @dirlist ) {
                if ( defined $in_dir{$file}{$dir_for_file} ) {
                    push @info, "Yes";
                } else {
                    push @info, "No";
                }
            }
            print join("\t", @info), "\n";
        }
    }

+1

unpythonic Apr 27 '11 at 5:07

My code is simpler, but the result is not quite what you want:

    @lst1=('a', 'b', 'c');
    @lst2=('a', 'b', 'f');
    @lst3=('e', 'd', 'a');
    @lst4=('f', 'g', 'h');

    %hsh=();

    foreach $item (@lst1) {
        $hsh{$item}="list1";
    }

    foreach $item (@lst2) {
        if (defined($hsh{$item})) {
            $hsh{$item}=$hsh{$item}." list2";
        }
        else {
            $hsh{$item}="list2";
        }
    }

    foreach $item (@lst3) {
        if (defined($hsh{$item})) {
            $hsh{$item}=$hsh{$item}." list3";
        }
        else {
            $hsh{$item}="list3";
        }
    }

    foreach $item (@lst4) {
        if (defined($hsh{$item})) {
            $hsh{$item}=$hsh{$item}." list4";
        }
        else {
            $hsh{$item}="list4";
        }
    }

    foreach $key (sort keys %hsh) {
        printf("%s %s\n", $key, $hsh{$key});
    }

gives:

    a list1 list2 list3
    b list1 list2
    c list1
    d list3
    e list3
    f list2 list4
    g list4
    h list4

0

Ali B Apr 27 '11 at 5:12

Sorry for the late answer. I spent a while polishing this because I did not want yet another negative score (it bums me out).

This is an interesting efficiency problem. I don't know if my solution will work for you, but I thought I would share it anyway. It is probably only efficient if your arrays do not change too often and if they contain many duplicate values. I have not run any performance checks on it.

Basically, the solution removes one dimension of the cross-checking by turning the array values into bits and doing a bitwise comparison against an entire array in one go. The array values are deduplicated, sorted, and each given a serial number. The serial numbers of each array's values are then combined into a single value with bitwise OR. A single array can then be checked for a single serial number with just one operation, for example:

    if ( array & serialno )

Preparing the data takes one pass, and the result can then be stored in a cache or similar. It stays valid until your data changes (for example, files or folders are added or removed). I added a fatal exit on undefined values, which means the data needs to be refreshed when that happens.

Good luck

    use strict;
    use warnings;

    my @list1=('a', 'b', 'c');
    my @list2=('a', 'b', 'f');
    my @list3=('e', 'd', 'a');
    my @list4=('f', 'g', 'h');

    # combine arrays
    my @total = (@list1, @list2, @list3, @list4);

    # dedupe (Thanks Xetius for this code snippet)
    my %unique = ();
    foreach my $item (@total) {
        $unique{$item} ++;
    }
    # Default sort(), don't think it matters
    @total = sort keys %unique;

    # translate to serial numbers
    my %serials = ();
    for (my $num = 0; $num <= $#total; $num++) {
        $serials{$total[$num]} = $num;
    }

    # convert array values to serial numbers, and combine them
    my @tx = ();
    for my $entry (@list1) { $tx[0] |= 2**$serials{$entry}; }
    for my $entry (@list2) { $tx[1] |= 2**$serials{$entry}; }
    for my $entry (@list3) { $tx[2] |= 2**$serials{$entry}; }
    for my $entry (@list4) { $tx[3] |= 2**$serials{$entry}; }

    &print_all;

    sub inList {
        my ($value, $list) = @_;
        # Undefined serial numbers are not accepted
        if (! defined ($serials{$value}) ) {
            print "$value is not in the predefined list.\n";
            exit;
        }
        return ( 2**$serials{$value} & $tx[$list] );
    }

    sub yesno {
        my ($value, $list) = @_;
        return ( &inList($value, $list) ? "yes":"no" );
    }

    #
    # The following code is for printing purposes only
    #
    sub print_all {
        printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4";
        print "-" x 33, "\n";
        &table_print(@list1);
        &table_print(@list2);
        &table_print(@list3);
        &table_print(@list4);
    }

    sub table_print {
        my @list = @_;
        for my $entry (@list) {
            printf "%-6s %-6s %-6s %-6s %-6s\n", $entry,
                &yesno($entry, 0),
                &yesno($entry, 1),
                &yesno($entry, 2),
                &yesno($entry, 3);
        }
        print "-" x 33, "\n";
    }
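One caveat with the masks above: they are built with `2**$n` and combined with integer `|`, so the number of distinct names is bounded by the integer width of your perl (typically 64 bits). As a rough sketch of the same idea without that bound, Perl's built-in `vec` can keep one bit string per list. The names `%serial`, `@bits` and `in_list` are made up for this illustration, and unlike the code above, an unknown name simply yields "no" rather than exiting.

```perl
use strict;
use warnings;

my @lists = ( [qw(a b c)], [qw(a b f)], [qw(e d a)], [qw(f g h)] );

# Give every distinct name a serial number (its bit position).
my %serial;
my $next = 0;
for my $list (@lists) {
    for my $name (@$list) {
        $serial{$name} = $next++ unless exists $serial{$name};
    }
}

# One bit string per list; vec() grows the string as needed, so the
# number of distinct names is not capped by the integer width.
my @bits;
for my $n (0 .. $#lists) {
    $bits[$n] = '';
    vec($bits[$n], $serial{$_}, 1) = 1 for @{ $lists[$n] };
}

# Membership test: is $name present in list number $n (0-based)?
sub in_list {
    my ($name, $n) = @_;
    return 0 unless exists $serial{$name};
    return vec($bits[$n], $serial{$name}, 1);
}

for my $name (qw(a f)) {
    my @row = map { in_list($name, $_) ? 'yes' : 'no' } 0 .. $#lists;
    print "$name: @row\n";
}
```

Whether this beats a plain hash of hashes in practice would need measuring; it mainly trades memory per list for a fixed-cost membership test.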

0

TLP Apr 27 '11 at 21:34

I would build a hash that uses the directory entries as keys, with each value being a hash (really a set) of the listings in which that entry was found. Iterate over each listing: for every entry not yet seen, add it to the outer hash with a one-element set (or hash) containing the identifier of the listing where it was first encountered. For any entry already in the hash, simply add the current listing's identifier to that entry's set.

From there, you can simply process the sorted hash keys and create the rows of your resulting table.

Personally, I think Perl is ugly, but here's a sample in Python:

    #!/usr/bin/env python
    import sys

    if len(sys.argv) < 2:
        print("Must supply arguments", file=sys.stderr)
        sys.exit(1)
    args = sys.argv[1:]

    # build hash entries by iterating over each listing
    d = dict()
    for each_file in args:
        name = each_file
        f = open(each_file, 'r')
        for line in f:
            line = line.strip()
            if line not in d:
                d[line] = set()
            d[line].add(name)
        f.close()

    # post process the hash
    report_template = "%-20s" + (" %-10s" * len(args))
    print(report_template % (("Dir Entries",) + tuple(args)))
    for k in sorted(d.keys()):
        row = list()
        for col in args:
            row.append("yes") if col in d[k] else row.append("no")
        print(report_template % ((k,) + tuple(row)))

This should be mostly legible, as if it were pseudo-code. The expressions (k,) and ("Dir Entries",) may look a little odd, but they are there to force those values to be tuples, which is what the % string operator needs in order to unpack them into the format string. They could also be written as tuple([k]+row), for example (wrapping the first element in [] makes it a list, which can be added to the other list and the whole thing converted to a tuple).

Beyond that, a translation to Perl should be fairly straightforward, just using hashes in place of both the dictionaries and the sets.

(By the way, this example works with an arbitrary number of listings, supplied as arguments and output as columns. Obviously, after a dozen or so columns the output would get rather cumbersome to print or display, but it was an easy generalization to make.)
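For what it's worth, a rough Perl sketch of that hash-of-sets approach might look like the following. It is a loose translation, not a line-by-line port: to keep it self-contained it takes in-memory lists instead of reading listing files named on the command line, and the names `build_index`, `report` and `%listings` are invented for this example.

```perl
use strict;
use warnings;

# Build { entry => { listing_name => 1, ... } } from named listings.
# %$listings maps a listing name to an array ref of its entries.
sub build_index {
    my ($listings) = @_;
    my %seen_in;
    for my $name (keys %$listings) {
        $seen_in{$_}{$name} = 1 for @{ $listings->{$name} };
    }
    return \%seen_in;
}

# Return the report as a list of lines: a header, then one row per entry.
sub report {
    my ($names, $seen_in) = @_;
    my $template = "%-20s" . (" %-10s" x @$names) . "\n";
    my @lines = sprintf $template, "Dir Entries", @$names;
    for my $entry (sort keys %$seen_in) {
        push @lines, sprintf $template, $entry,
            map { exists $seen_in->{$entry}{$_} ? "yes" : "no" } @$names;
    }
    return @lines;
}

# Sample data matching the lists in the question.
my @names    = qw(List1 List2 List3 List4);
my %listings = (
    List1 => [qw(a b c)],
    List2 => [qw(a b f)],
    List3 => [qw(e d a)],
    List4 => [qw(f g h)],
);
print report(\@names, build_index(\%listings));
```

The inner hash plays the role of the Python set, and `exists` is the membership test. Column order comes from `@names`, since hash keys have no guaranteed order.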

-2

Jim dennis Apr 27 '11 at 5:30

Jonathan leffler · Accepted Answer · 2011-04-27T05:09:11+0000

Now that the question has been amended, this produces the answer you want. It works in O(n³) time, which is optimal for the task (there are n³ outputs).

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # List1 List2 List3 List4
    # a     a     e     f
    # b     b     d     g
    # c     f     a     h

    my(@lists) = ( { a => 1, b => 1, c => 1 },
                   { a => 1, b => 1, f => 1 },
                   { e => 1, d => 1, a => 1 },
                   { f => 1, g => 1, h => 1 },
                 );

    my $i = 0;
    foreach my $list (@lists)
    {
        analyze(++$i, $list, @lists);
    }

    sub analyze
    {
        my($num, $ref, @lists) = @_;
        printf "List %d\n", $num;
        # Header row: a blank cell, then one 8-wide column per list
        printf "%-8s", "";
        printf "%-8s", "List$_" foreach 1 .. @lists;
        print "\n";
        foreach my $file (sort keys %{$ref})
        {
            printf "%-8s", $file;
            foreach my $list (@lists)
            {
                my %dir = %{$list};
                printf "%-8s", (defined $dir{$file}) ? "yes" : "no";
            }
            print "\n";
        }
        print "\n";
    }

The output I get is:

    List 1
            List1   List2   List3   List4
    a       yes     yes     yes     no
    b       yes     yes     no      no
    c       yes     no      no      no

    List 2
            List1   List2   List3   List4
    a       yes     yes     yes     no
    b       yes     yes     no      no
    f       no      yes     no      yes

    List 3
            List1   List2   List3   List4
    a       yes     yes     yes     no
    d       no      no      yes     no
    e       no      no      yes     no

    List 4
            List1   List2   List3   List4
    f       no      yes     no      yes
    g       no      no      no      yes
    h       no      no      no      yes
