Regex: Remove duplicate paths from a delimited string

I am trying to remove duplicate file paths from semicolon-delimited strings using a regex. The order of the paths in the result does not matter.

Input Example:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3; 

Output Required:

 C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3; 

I have the following regex that works, but it gets very slow when the input lines are long. Multiply that across thousands of lines and the total runtime becomes unacceptable.

 \b([^;]+)(?=.*;\1;); 

Any tips on how to improve the performance of this are greatly appreciated!

+7
c# regex perl
8 answers

A typical way to remove duplicates in Perl is a hash. See also perlfaq4: "How can I remove duplicate elements from a list or array?"

    my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
    my %seen;
    my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
    print $out, "\n";

    __END__
    # Output:
    # C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6

I added sort there, but you can remove it if you don't need the output sorted.

Although you have not indicated whether the implementation should be in C# or Perl, the same idea applies to C#. (Update: see Patrick Arthier's answer.)

Note that the regular expression is slow because for each candidate match of \b([^;]+) the engine must scan the rest of the line to satisfy the lookahead .*;\1;, so it essentially behaves like a nested loop over the input.
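To see the difference concretely, here is a minimal timing sketch (not from the original answer; it assumes the core Benchmark module and a synthetic input built from made-up paths):

    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    # Synthetic input: 500 fake paths drawn from 50 distinct values,
    # so each path occurs about 10 times.
    my $str = join( ';', map { 'C:\fake\path' . ( $_ % 50 ) } 1 .. 500 ) . ';';

    cmpthese( -2, {
        # The original backtracking regex, applied as a global substitution.
        regex => sub {
            ( my $s = $str ) =~ s/\b([^;]+)(?=.*;\1;);//g;
        },
        # The hash approach: a single linear pass.
        hash => sub {
            my %seen;
            my $s = join ';', grep { !$seen{$_}++ } split /;/, $str;
        },
    } );

The gap between the two widens quickly as the input string grows, since the regex rescans the tail of the string for every path it inspects.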

+7

Or C # version:

    using System;
    using System.Collections.Generic;

    public class Program
    {
        public static void Main()
        {
            var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";
            var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));
            Console.WriteLine(cleaned);
        }
    }

Outputs:

 C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3; 

Split the input on ';', put the pieces into a HashSet<string>(..) to get rid of duplicates, then join them back together with ';'. Note that a HashSet<string> does not guarantee any particular order, which is fine here since the question says order does not matter.


Caveat: if your paths contain ';' as part of a directory name, this breaks; you would need something more creative for that case, but the same would be true for any regex you use.
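To illustrate the caveat (a hypothetical example, shown in Perl for brevity; the directory name containing semicolons is made up):

    use strict;
    use warnings;

    # Made-up path whose directory name itself contains ';' characters.
    my $tricky = 'C:\dir;with;semis\file.txt;C:\other\file.txt;';

    # A naive split yields four "paths", none of which is the real first path.
    print "$_\n" for split /;/, $tricky;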

+7

I think this is much easier to do in Perl using the hash idiom.

Take a look at this example,

    @items = (1, 2, 4, 1, 1, 1);
    my %uniq;
    undef @uniq{ @items };
    my @uniques = keys %uniq;
    print join " ", @uniques;

Output:

 1 2 4 

Each key exists only once in a hash; assigning to the same key multiple times stores only the most recent value. This has its advantages! For example, it lets you find the unique items in a list, as shown above.

Using undef with a hash slice sets the hash values to undef. This idiom is the cheapest way to perform set operations with a hash.

The above is taken from the book Modern Perl. Here is a link for you to check: Hash idioms.

We can clearly use this in your scenario.

    use strict;
    use warnings;
    use feature "say";

    my $sample_text = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

    # Split the paths separated by ';' into an array of paths
    my @path_arr = split /;/, $sample_text;

    say "Path files with duplicates:";
    say join "\n", @path_arr;
    say "-----------------------------";

    my %temp_hash;
    undef @temp_hash{@path_arr};     # <-- THIS
    my @unique = keys %temp_hash;    # <-- IS WHAT YOU WANT

    say "Path files without duplicates:";
    say join "\n", @unique;

Output:

    Path files with duplicates:
    C:\Users\user\Desktop\TESTING\path1
    C:\Users\user\Desktop\TESTING\path5
    C:\Users\user\Desktop\TESTING\path1
    C:\Users\user\Desktop\TESTING\path6
    C:\Users\user\Desktop\TESTING\path1
    C:\Users\user\Desktop\TESTING\path3
    C:\Users\user\Desktop\TESTING\path1
    C:\Users\user\Desktop\TESTING\path3
    -----------------------------
    Path files without duplicates:
    C:\Users\user\Desktop\TESTING\path1
    C:\Users\user\Desktop\TESTING\path3
    C:\Users\user\Desktop\TESTING\path6
    C:\Users\user\Desktop\TESTING\path5

I believe this is one of the fastest ways to achieve what you want, if performance is a concern.

+2

Try the following JavaScript code, which compares the paths case-insensitively:

    var inputStr = "C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path5;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path6;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3;C:\\Users\\user\\Desktop\\TESTING\\path1;C:\\Users\\user\\Desktop\\TESTING\\path3";
    var urlArr = inputStr.split(";");
    var uniqueUrlList = [];

    urlArr.forEach(function (elem) {
        // Case-insensitive lookup: has this path been collected already?
        let foundElem = uniqueUrlList.find(x => x.toUpperCase() === elem.toUpperCase());
        if (foundElem === undefined) {
            uniqueUrlList.push(elem);
        }
    });

    console.log(uniqueUrlList);
+1

Perl, a more optimized single-regex version:

 (?<![^;])([^;]++;)(?=(?>[^;]*;)*?\1) 

On your sample input, your original regex takes ~114,000 steps to find all the matches; this one needs only 567 steps.

Over 40,000 matches found in ~4 seconds:


Live demo

Regex breakdown:

 (?<! # A Negative lookbehind [^;] # Should be anything other than `;` ) # End of lookbehind ( # Capturing group #1 [^;]++; # Match anything up to first `;` ) # End of CG #1 (?= # A Positive lookahead (?>[^;]*;)*? # Skip over next path, don't backtrack \1 # Until an occurrence ) # End of lookahead 
+1

In Perl,

    #!/usr/bin/env perl
    # always use these two
    use strict;
    use warnings;

    my $paths = 'C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';
    print "$paths\n";

    {
        my %temporary_hash = map { $_ => 1 } split( q{;}, $paths );
        $paths = join( q{;}, keys %temporary_hash );
    }

    print "$paths\n";

See perldoc -q duplicate.

+1

In Perl, this takes a single line using the List::Util module, which is core and highly optimized:

 my $newpaths = join ';', uniq split /;/, $paths; 

How does it work? split creates a list of paths by breaking the string at each ';' character; uniq ensures there are no repetitions; join reassembles a single string of paths separated by ';'.

If the case of the paths is not significant, then:

 my $newpaths = join ';', uniq split /;/, lc $paths; 

A complete program may be:

    use strict;
    use warnings;

    use List::Util qw( uniq );

    my $paths = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

    my $newpaths = join ';', uniq split /;/, $paths;
    print $newpaths, "\n";

To make things interesting, let's time this solution against the one using a temporary hash. Here is the timing program:

    use strict;
    use warnings;

    use List::Util qw( uniq );
    use Time::HiRes qw( time );

    my @p;
    for ( my $i = 0; $i < 1000000; $i++ ) {
        push @p, 'C:\This\is\a\random\path' . int( rand( 250000 ) );
    }
    my $paths = join ';', @p;

    my $t = time();
    my $newpaths = join ';', uniq split /;/, $paths;
    $t = time() - $t;
    print 'Time with uniq: ', $t, "\n";

    $t = time();
    my %temp = map { $_ => 1 } split /;/, $paths;
    $newpaths = join ';', keys %temp;
    $t = time() - $t;
    print 'Time with temporary hash: ', $t, "\n";

It generates 1 million random paths with roughly a 4:1 duplicate ratio (about four copies of each distinct path). The times on the server where I tested this:

    Time with uniq: 0.849196910858154
    Time with temporary hash: 1.29486703872681

That makes uniq noticeably faster than the temporary hash. With a 100:1 duplicate ratio:

    Time with uniq: 0.526581048965454
    Time with temporary hash: 0.823433876037598

With a 10000:1 duplicate ratio:

    Time with uniq: 0.423808097839355
    Time with temporary hash: 0.736939907073975

Both algorithms take less time the more duplicates there are, and uniq consistently performs better as the duplicate ratio increases.

Feel free to play with the random number generation to try other ratios.

0

Since these are Windows paths, and Windows paths are case-insensitive, you probably want to remove items that are identical apart from case.

(The next step would be to pass each element through File::Spec::canonpath to discover whether paths are identical but expressed differently, and perhaps to account for links as well, but this answer deals only with case-insensitivity.)
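As a rough sketch of that next step (not part of the original answer; the duplicate variants below are made up, and File::Spec::Win32 is called explicitly so the Windows path rules apply even when the script runs on another OS):

    use strict;
    use warnings;

    use File::Spec::Win32;

    my @paths = (
        'C:\Users\user\Desktop\TESTING\path1',
        'C:\USERS\user\Desktop\TESTING\PATH1',     # same path, different case
        'C:\Users\user\Desktop\TESTING\.\path1',   # same path, expressed differently
    );

    # Dedupe on the case-folded, canonicalized form, keeping first occurrences.
    my %seen;
    my @unique = grep { !$seen{ lc File::Spec::Win32->canonpath($_) }++ } @paths;

    print "$_\n" for @unique;   # prints only the first path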

I don't know whether your requirement to use a regular expression is firm, but as you have found, it is a very inefficient way to do this.

I recommend a simple split on the semicolons, with List::UtilsBy's uniq_by doing the case-independent uniqueness check.

    use strict;
    use warnings 'all';
    use feature 'say';

    use List::UtilsBy 'uniq_by';

    my $p = 'C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;';

    my $newp = join "", map { "$_;" } uniq_by { lc } split /;/, $p;

    say $newp;

Output

 C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3; 
-1
