How to identify and analyze similar models, such as Excel?

You know the functionality in Excel when you enter 3 rows with a specific pattern and drag the column all the way down. Excel is trying to continue the template for you.

for example

type of...

  • test-1
  • test-2
  • test 3

Excel will continue it with:

  • test 4
  • test 5
  • test n ...

The same thing works for some other patterns, such as dates, etc.

I am trying to accomplish a similar thing, but I also want to handle more exceptional cases, such as:

  • test-blue-somethingelse
  • test-yellow-somethingelse
  • test-red-somethingelse

Now, based on these entries, I want to say that the template:

  • test- [dynamic] -What

Continuing with [DYNAMIC] with other colors is another matter, it doesn’t really matter to me now. What interests me most is the discovery of the [DYNAMIC] parts in the template.

I need to detect this from a large number of entries in the pool. Suppose you get 10,000 lines with these types of patterns and want to group these lines based on similarities, and also determine how much of the text is constantly changing ([DYNAMIC]).

Document classification can be useful in this scenario, but I'm not sure where to start.

UPDATE:

I forgot to mention that it is also possible to have multiple [DYNAMIC] templates.

For example:

  • test_ [dynamic] 12 [Dynamic2]

I do not think this is important, but I plan to implement it in .NET, but any hint of using algorithms will be very useful.

+6
algorithm design-patterns similarity fuzzy
source share
4 answers

Once you start looking at the search for the dynamic parts of the form templates: <const1><dynamic1><const2><dynamic2>.... without any other suggestions, you will need to find the longest common subsequence of the samples you provided. For example, if I have test-123-abc and test-48953-defg , then LCS will be test- and - . The dynamic parts will then be the gaps between the LCS result. Then you can find your dynamic part in the corresponding data structure.

The problem of finding LCS from more than 2 lines is very expensive and this will be the bottleneck of your problem. Due to accuracy, you can make this problem acceptable. For example, you can perform LCS between all pairs of rows and group collections of rows with similar LCS results. However, this means that some patterns will not be correctly identified.

Of course, all this can be avoided if you can impose additional restrictions on your lines, for example Excel, which only allows templates of the form <const><dynamic> .

+2
source share

Finding [dynamic] is not a big deal, you can do it with two lines - just start from the beginning and stop when they start to not be equal, do the same from the end, and voila - you got your [dynamic]

something like (pseudocode - view):

 String s1 = 'asdf-1-jkl'; String s2= 'asdf-2-jkl'; int s1I = 0, s2I = 0; String dyn1, dyn2; for (;s1I<s1.length()&&s2I<s2.length();s1I++,s2I++) if (s1.charAt(s1I) != s2.charAt(s2I)) break; int s1E = s1.length(), s2E = s2.length; for (;s2E>0&&s1E>0;s1E--,s2E--) if (s1.charAt(s1E) != s2.charAt(s2E)) break; dyn1 = s1.substring(s1I, s1E); dyn2 = s2.substring(s2I, s2E); 

About your 10k datasets. You will need to name this (or perhaps a slightly more optimized version), with each combination, to figure out your patten (calls 10k x 10k). and then sort the result by pattern (i.e. keep the beginning and end and sort by these fields)

0
source share

I think you need to calculate something like Levenshtein distance to find a group of similar lines, and then in each group of similar lines, you define the dynamic part in a typical algorithm similar to a break.

0
source share

Google Docs may be better than succeeding for this kind of thing, believe it or not.

Google has collected a huge amount of data about sets - for example, in the example you gave it, blue, red, yellow ... will be recognized as part of a set of "colors". It has much more complete pattern recognition than Excel, so it will have a better chance of continuing the template.

0
source share

All Articles