How to automate the creation of a template based on real data?

Question

I have many providers in the database, and each one formats its data differently. I would like to derive a data validation rule for each provider from its previous data.

Example:

A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12

Goal: if the user enters the string "XZ-217" for provider B, the algorithm should compare it with B's previous data and report that this string does not resemble it.

Is there a good way, or a tool, to do such a comparison? The answer may be a general algorithm or a Perl module.

Edit: I agree that “likeness” is hard to define. But I would like to find an algorithm that could analyze the previous 100 or so samples and then compare the results of that analysis with new data. Similarity could be based on length, on the use of letters versus digits, or on how the strings are built: a similar beginning, end, or middle, with some separators in between.

I suspect this is not an easy task, but on the other hand I think it is a very common one, so I hoped there would already be some hints.

regex pattern-matching perl

wk Jan 13 '12 at 14:55

4 answers

Alien life form · Answer 1 · 2012-01-13T15:39:03+0000

You might want to look at string metrics, for example http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm
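
As a rough illustration of the string-metric idea, here is a minimal sketch in core Perl. It is not taken from Text::Levenshtein; the helper names (edit_distance, shape, nearest_shape_distance) are made up for this example, and comparing "shapes" (letters replaced by A, digits by 9) rather than raw strings is an assumption about what "similar" should mean for identifiers like these.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance, keeping two rows.
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length $t );
    for my $i ( 1 .. length $s ) {
        my @cur = ($i);
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min( $prev[$j] + 1, $cur[ $j - 1 ] + 1, $prev[ $j - 1 ] + $cost );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Reduce a value to its shape: letters become A, digits become 9,
# everything else (separators) is kept as-is.
sub shape {
    my ($value) = @_;
    ( my $shape = $value ) =~ s/[^\W\d_]/A/g;
    $shape =~ s/\d/9/g;
    return $shape;
}

# Smallest shape distance between the candidate and any earlier sample.
sub nearest_shape_distance {
    my ( $candidate, @samples ) = @_;
    return min( map { edit_distance( shape($candidate), shape($_) ) } @samples );
}

my @provider_b = qw(1276 1899 22711);
for my $input (qw(2044 XZ-217)) {
    printf "%-7s nearest shape distance: %d\n",
        $input, nearest_shape_distance( $input, @provider_b );
}
```

A value whose nearest distance exceeds some threshold would be flagged as unlike the provider's history. Where to put that threshold is a tuning decision that depends on the data.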

Joel Berger · Answer 2 · 2012-01-13T17:23:33+0000

For the three providers in the question, the script below generates these patterns:

A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})

The script:

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw'uniq each_arrayref';

my %examples = (
  A => [qw/ XZ-4 XZ-23 XZ-217 /],
  B => [qw/ 1276 1899 22711 /],
  C => [qw/ 12-4 12-75 12 /],
);

foreach my $example (sort keys %examples) {
  print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}

sub gen_regex {
  my @cases = @_;

  my %exploded;

  # ex. $case may be XZ-217
  foreach my $case (@cases) {
    my @parts = 
      grep { defined and length } 
      split( /(\d+|\w+)/, $case );

    # @parts are ( XZ, -, 217 )

    foreach (@parts) {
      if (/\d/) {
        # 217 becomes ['\d' => 3]
        push @{ $exploded{$case} }, ['\d' => length];

      } elsif (/\w/) {
        #XZ becomes ['\w' => 2]
        push @{ $exploded{$case} }, ['\w' => length];

      } else {
        # - becomes ['lit' => '-']
        push @{ $exploded{$case} }, ['lit' => $_ ];

      }
    }
  }

  my $pattern = '';

  # iterate over nth element (part) of each case
  my $ea = each_arrayref(values %exploded);
  while (my @parts = $ea->()) {

    # remove undefined (i.e. optional) parts
    my @def_parts = grep { defined } @parts;

    # check that all (defined) parts are the same type
    my @part_types = uniq map {$_->[0]} @def_parts;
    if (@part_types > 1) {
      warn "Parts not aligned\n";
      return;
    }
    my $type = $part_types[0]; #same so make scalar

    # were there optional parts?
    my $required = (@parts == @def_parts);

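    # keep the values of each part
    # these are either a repetition count or literal strings
    # (uniq is called with parens so that sort does not take it as its comparison sub;
    #  counts are sorted numerically, literals as strings)
    my @values = sort { $type eq 'lit' ? $a cmp $b : $a <=> $b } uniq( map { $_->[1] } @def_parts );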

    # these are for non-literal quantifiers
    my $min = $required ? $values[0] : 0;
    my $max = $values[-1];

    # write the specific pattern for each type
    if ($type eq '\d') {
      $pattern .= '\d' . "{$min,$max}";

    } elsif ($type eq '\w') {
      $pattern .= '\w' . "{$min,$max}";

    } elsif ($type eq 'lit') {
      # quote special characters, - becomes \-
      my @uniq = map { quotemeta } uniq @values;
      # join with alternations, surround by non-capture group, add quantifier
      $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
    }
  }


  # build the qr regex from pattern
  my $regex = qr/$pattern/;
  # test that all original patterns match (@fail should be empty)
  my @fail = grep { $_ !~ $regex } @cases;

  if (@fail) {
    warn "Some cases fail for generated pattern $regex: (@fail)\n";
    return '';
  } else {
    return $regex;
  }
}
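
The generated regexes are not anchored, so when they are used for validation a value that merely contains a matching substring would pass. The following is a minimal sketch of how the patterns printed above might be applied to new input; matches_provider and %pattern_for are names invented for this example, and anchoring with \A and \z is an addition, not part of the script above.

```perl
use strict;
use warnings;

# Patterns copied from the generated output above.
my %pattern_for = (
    A => qr/\w{2,2}(?:\-){1}\d{1,3}/,
    B => qr/\d{4,5}/,
    C => qr/\d{2,2}(?:\-)?\d{0,2}/,
);

# True if $input matches the provider's pattern over its whole length.
sub matches_provider {
    my ( $provider, $input ) = @_;
    my $pattern = $pattern_for{$provider};
    return 0 unless defined $pattern;
    return $input =~ /\A$pattern\z/ ? 1 : 0;
}

# Compare the unanchored and the anchored check for provider B.
for my $input (qw(1899 XZ-217 XZ-21700)) {
    printf "%-9s unanchored: %-8s  anchored: %s\n", $input,
        ( $input =~ $pattern_for{B}         ? 'match' : 'no match' ),
        ( matches_provider( B => $input )   ? 'match' : 'no match' );
}
```

"XZ-21700" contains five consecutive digits, so it passes the unanchored test for provider B but fails the anchored one.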

Axeman · Answer 3 · 2012-01-13T19:43:00+0000

use strict;
use warnings;
use List::Util qw<max min>;

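# Build one anchored regex from a list of profiles.
# Called as a method; takes an array ref of profiles, where each profile
# is an array ref of chunks: [ P => $literal ] or [ W|D => $min, $max ].
sub compile_search_expr { 
    my ( undef, $profile_list ) = @_;
    my $str 
        = join( '|'
              , map { join( ''
                           , grep { defined } 
                             map  {
                                 $_->[0] eq 'P' ? quotemeta( $_->[1] )
                               : $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
                               : $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
                               :                  undef
                               ;
                            } @$_ 
                          )
                } @$profile_list
        );
    return qr/^(?:$str)$/;
}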

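sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) { 
        my $profile_length = @$profile;

        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) { 
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            if ( $old->[0] eq 'P' ) {
                # punctuation chunks hold a literal, not a length range,
                # so they have to be identical rather than merged
                next PROFILE unless $old->[1] eq $new->[1];
                push @merged, $old;
                next;
            }
            # widen the length range to cover both old and new
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] ) 
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    # no compatible profile found: keep the new one as a separate alternative
    push @$profile_list, $new_profile unless $found;
    return;
}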

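# Split the input into chunks: P (punctuation literal), W (word characters,
# with min and max length) or D (digits only, with min and max length).
sub compute_info_profile { 
    shift;
    my @profile_chunks
        = map { 
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length($_), length($_) ]
            :        [ D => length($_), length($_) ]
        }
        grep { length } split /(\W+)/, shift
        ;
    return \@profile_chunks;    # array ref, as merge_profiles expects
}

# Pseudo-Perl
sub process_input_task { 
    my ( $application, $input ) = @_;

    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );

    if    ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        # a human approved the new format: remember it for next time
        $application->merge_profiles( $patterns, $application->compute_info_profile( $input ));
    }
    else { 
        # $customer and INVALID_FORMAT are assumed to be supplied by the application;
        # rejected input is escalated and not processed any further
        return $application->escalate( 
           Incident->new( issue    => INVALID_FORMAT
                        , input    => $input
                        , customer => $customer 
                        ));
    }

    return $application->process_approved_input( $input );
}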

Zaid · Answer 4 · 2012-01-13T15:36:51+0000

If a Tie::StringApproxHash module existed, it would fit the bill here.

I think you're looking for something that combines the fuzzy-matching functionality of String::Approx with the hash interface of Tie::RegexpHash.

The former is more important; the latter would just make the coding easy.
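
To make the "hash keyed by patterns" idea concrete, here is a minimal sketch in core Perl. It does not use Tie::RegexpHash or String::Approx; it only imitates the lookup with an ordered list of pattern/provider pairs. The patterns are the ones generated in Answer 2, anchored here, and the names providers_matching and check_input are hypothetical.

```perl
use strict;
use warnings;

# Ordered list of [ pattern, provider ] pairs.
my @provider_patterns = (
    [ qr/\A\w{2,2}(?:\-){1}\d{1,3}\z/ => 'A' ],
    [ qr/\A\d{4,5}\z/                 => 'B' ],
    [ qr/\A\d{2,2}(?:\-)?\d{0,2}\z/   => 'C' ],
);

# Return every provider whose pattern matches the input, in list order.
sub providers_matching {
    my ($input) = @_;
    return map { $_->[1] } grep { $input =~ $_->[0] } @provider_patterns;
}

# Describe how well the input fits the provider it was entered for.
sub check_input {
    my ( $provider, $input ) = @_;
    my @candidates = providers_matching($input);
    return 'ok' if grep { $_ eq $provider } @candidates;
    return @candidates
        ? "looks like provider @candidates, not $provider"
        : 'matches no known provider';
}

print check_input( B => 'XZ-217' ), "\n";
print check_input( B => '1899' ),   "\n";
```

Note that a value can fit more than one provider: "1276" satisfies both B's and C's generated pattern, so a lookup like this can say an input is plausible for a provider, not that it belongs to exactly one.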
