How to determine the longest similar part of several lines?

Question

How to determine the longest similar part of several lines?

According to the headline, I'm trying to find a way to programmatically determine the longest part of the similarity between multiple lines.

Example:

file:///home/gms8994/Music/tATu/
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I would return file:///home/gms8994/Music/ , because this is the longest part that is common to all three lines.

In particular, I am looking for a Perl solution, but a solution in any language (or even a pseudo-language) is enough.

From the comments: yes, only at the beginning; but it is possible to have another entry in the list that will be ignored for this question.

+8

string algorithm perl similarity

Glen Solsberry Feb 01 '09 at 1:12

source share

7 answers

The link already given by Brett Daniel for the Wikipedia entry “ The longest common substring problem ” is a very good general link (with pseudo-code) for your question, as indicated. However, the algorithm may be exponential. And it looks like you really need an algorithm for the longest common prefix, which is a much simpler algorithm.

Here is the one I'm using for a long generic prefix (and link to the source URL):

 use strict; use warnings; sub longest_common_prefix { # longest_common_prefix( $|@ ): returns $ # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl # find longest common prefix of scalar list my $prefix = shift; for (@_) { chop $prefix while (! /^\Q$prefix\E/); } return $prefix; } my @str = map {chomp; $_} <DATA>; print longest_common_prefix(@ARGV), "\n"; __DATA__ file:///home/gms8994/Music/tATu/ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/

If you really want to implement LCSS, refer to these discussions ( Longest General Substring and Longest General Subsequence) at PerlMonks.org. Tree :: Suffix will probably be the best general solution for you and, as far as I know, implements the best algorithm. Unfortunately, the latest builds are broken. But a working routine exists in the discussions referenced by PerlMonks in this post of Limbic ~ Region (reproduced here with your data).

 #URLref: http://www.perlmonks.org/?node_id=549876 #by Limbic~Region use Algorithm::Loops 'NestedLoops'; use List::Util 'reduce'; use strict; use warnings; sub LCS{ my @str = @_; my @pos; for my $i (0 .. $#str) { my $line = $str[$i]; for (0 .. length($line) - 1) { my $char= substr($line, $_, 1); push @{$pos[$i]{$char}}, $_; } } my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str; my %map; CHAR: for my $char (split //, $sh_str) { my @loop; for (0 .. $#pos) { next CHAR if ! $pos[$_]{$char}; push @loop, $pos[$_]{$char}; } my $next = NestedLoops([@loop]); while (my @char_map = $next->()) { my $key = join '-', @char_map; $map{$key} = $char; } } my @pile; for my $seq (keys %map) { push @pile, $map{$seq}; for (1 .. 2) { my $dir = $_ % 2 ? 1 : -1; my @offset = split /-/, $seq; $_ += $dir for @offset; my $next = join '-', @offset; while (exists $map{$next}) { $pile[-1] = $dir > 0 ? $pile[-1] . $map{$next} : $map{$next} . $pile[-1]; $_ += $dir for @offset; $next = join '-', @offset; } } } return reduce {length($a) > length($b) ? $a : $b} @pile; } my @str = map {chomp; $_} <DATA>; print LCS(@str), "\n"; __DATA__ file:///home/gms8994/Music/tATu/ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/

+5

rivy Feb 01 '09 at 3:15

source share

It looks like you want a k-common subscript algorithm . It is very easy to program and a good example of dynamic programming.

+3

Brett Daniel Feb 01 '09 at 1:38

source share

My first instinct is to start a loop by taking the next character from each line until the characters are equal. Keep counting which position in the line you are in, and then take a substring (from any of the three lines) from 0 to position before the characters are equal.

In Perl, you will have to split the string first into characters, using something like

@array = split(//, $string);

(splitting into an empty character sets each character in its own array element)

Then run a loop, perhaps in general:

 $n =0; @array1 = split(//, $string1); @array2 = split(//, $string2); @array3 = split(//, $string3); while($array1[$n] == $array2[$n] && $array2[$n] == $array3[$n]){ $n++; } $sameString = substr($string1, 0, $n); #n might have to be n-1

Or at least something like that. Forgive me if this does not work, my Perl is a little rusty.

+3

Perchik Feb 01 '09 at 1:48

source share

If you use Google for the “longest common substring”, you will get some good pointers for the general case where sequences should not start at the beginning of lines. For example, http://en.wikipedia.org/wiki/Longest_common_substring_problem .

Mathematica has a function for this inline: http://reference.wolfram.com/mathematica/ref/LongestCommonSubsequence.html (Note that they mean a continuous subsequence, i.e. a substring, that you want.)

If you care only about the longest common prefix, then for the i-th cycle it should be much faster than i, starting from 0 until the i-th characters match and return substr (s, 0, i-1 )

+2

dreeves Feb 01 '09 at 1:40

source share

From http://forums.macosxhints.com/showthread.php?t=33780

 my @strings = ( 'file:///home/gms8994/Music/tATu/', 'file:///home/gms8994/Music/nina%20sky/', 'file:///home/gms8994/Music/A%20Perfect%20Circle/', ); my $common_part = undef; my $sep = chr(0); # assuming it not used legitimately foreach my $str ( @strings ) { # First time through loop -- set common # to whole if ( !defined $common_part ) { $common_part = $str; next; } if ("$common_part$sep$str" =~ /^(.*).*$sep\1.*$/) { $common_part = $1; } } print "Common part = $common_part\n";

+1

Hissohathair Feb 01 '09 at 12:00

source share

Faster than above, uses perl native binary xor function adapted from perlmongers solution ($ + [0] does not work for me):

 sub common_suffix { my $comm = shift @_; while ($_ = shift @_) { $_ = substr($_,-length($comm)) if (length($_) > length($comm)); $comm = substr($comm,-length($_)) if (length($_) < length($comm)); if (( $_ ^ $comm ) =~ /(\0*)$/) { $comm = substr($comm, -length($1)); } else { return undef; } } return $comm; } sub common_prefix { my $comm = shift @_; while ($_ = shift @_) { $_ = substr($_,0,length($comm)) if (length($_) > length($comm)); $comm = substr($comm,0,length($_)) if (length($_) < length($comm)); if (( $_ ^ $comm ) =~ /^(\0*)/) { $comm = substr($comm,0,length($1)); } else { return undef; } } return $comm; }

+1

Erik Aronesty Feb 28 '12 at 21:15

source share

Hynek -Pichi- Vychodil · Accepted Answer · 2009-02-01 09:56

Edit: Sorry for the error. It is a pity that I observed that using the variable my inside countit(x, q{}) is a big mistake. This line is evaluated inside the Benchmark module, and @str is empty there. This solution is not as fast as I imagined. See Correction below. I'm sorry again.

Perl can be fast:

 use strict; use warnings; package LCP; sub LCP { return '' unless @_; return $_[0] if @_ == 1; my $i = 0; my $first = shift; my $min_length = length($first); foreach (@_) { $min_length = length($_) if length($_) < $min_length; } INDEX: foreach my $ch ( split //, $first ) { last INDEX unless $i < $min_length; foreach my $string (@_) { last INDEX if substr($string, $i, 1) ne $ch; } } continue { $i++ } return substr $first, 0, $i; } # Roy implementation sub LCP2 { return '' unless @_; my $prefix = shift; for (@_) { chop $prefix while (! /^\Q$prefix\E/); } return $prefix; } 1;

Test suite:

 #!/usr/bin/env perl use strict; use warnings; Test::LCP->runtests; package Test::LCP; use base 'Test::Class'; use Test::More; use Benchmark qw(:all :hireswallclock); sub test_use : Test(startup => 1) { use_ok('LCP'); } sub test_lcp : Test(6) { is( LCP::LCP(), '', 'Without parameters' ); is( LCP::LCP('abc'), 'abc', 'One parameter' ); is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' ); is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ), 'abcd', 'Some common prefix' ); my @str = map { chomp; $_ } <DATA>; is( LCP::LCP(@str), 'file:///home/gms8994/Music/', 'Test data prefix' ); is( LCP::LCP2(@str), 'file:///home/gms8994/Music/', 'Test data prefix by LCP2' ); my $t = countit( 1, sub{LCP::LCP(@str)} ); diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}"); $t = countit( 1, sub{LCP::LCP2(@str)} ); diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}"); } __DATA__ file:///home/gms8994/Music/tATu/ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

 1..7 ok 1 - use LCP; ok 2 - Without parameters ok 3 - One parameter ok 4 - None of common prefix ok 5 - Some common prefix ok 6 - Test data prefix ok 7 - Test data prefix by LCP2 # LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 20766.06/s (n=22635) # LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @ 16746.73/s (n=17919)

This means that a pure Perl solution using substr approximately 20% faster than the Roy solution in your test case, and one prefix detection takes about 50us. There is no need to use XS if your data or performance expectations are greater.

How to determine the longest similar part of several lines?

More articles: