Automatically detect numbering patterns in file names

Introduction

I work at a facility where we have microscopes. They can be asked to produce 4D movies of a sample: they take, for example, 10 images at different Z positions, then wait a while (the next time point) and take 10 slices again. They can be asked to save one file per slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point.

Question

Once you have all these file names, you can use a regex pattern to extract the Z and T values, provided you know the underlying pattern of the file names. This I know how to do.
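For reference, the known-pattern case can be sketched in a few lines of Python. The pattern below is an assumption derived only from the single example name 2009-11-03-experiment1-Z07-T42.tif, and the helper name `extract_z_t` is made up for illustration.

```python
import re

# Assumed pattern, based only on the example name
# 2009-11-03-experiment1-Z07-T42.tif
PATTERN = re.compile(r"Z(?P<z>\d+)-T(?P<t>\d+)\.tif$")


def extract_z_t(filename):
    """Return (z, t) as integers, or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group("z")), int(match.group("t"))


print(extract_z_t("2009-11-03-experiment1-Z07-T42.tif"))  # (7, 42)
```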

Now my question: do you know how to automatically generate a regular expression pattern from a list of file names? For example, there is an excellent online tool that does something similar: txt2re.

What algorithm would you use to analyze the entire list of file names and generate the most likely regular expression pattern?

+4
3 answers

First of all, you are trying to do this the hard way. I suspect it is not impossible, but you would have to apply some artificial intelligence techniques, and it would be far more complicated than it is worth. Either a neural network or a genetic algorithm could be trained to recognize the Z numbers and T numbers, assuming that the formats Z[0-9]+ and T[0-9]+ always appear somewhere in the file name.

What I would do with this problem is write a Python script that processes all of the file names. In this script, I would match each file name twice, once looking for Z[0-9]+ and once looking for T[0-9]+. Each time, I would count the matches for Z-numbers and T-numbers.

I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would record the number of file names with exactly one match and the number with multiple matches. I would also count the total number of files processed.

At the end, I would report the following:

 nnnnnnnnnn filenames processed
 Z-numbers matched only once in nnnnnnnnnn filenames.
 Z-numbers matched multiple times in nnnnnn filenames.
 T-numbers matched only once in nnnnnnnnnn filenames.
 T-numbers matched multiple times in nnnnnn filenames.

If you're lucky, there will be no multiple matches at all, and you can use the expressions above to extract your numbers. However, if there is any significant number of multiple matches, you can run the script again with some print statements to show you examples of the file names that cause them. This will tell you whether a simple tweak to the regex can work.
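The counting script described above might look roughly like this. It is only a sketch: the function names `tally` and `report` and the counter keys are made up for illustration, and the two sample file names are invented to show one clean name and one with an ambiguous T match.

```python
import re
from collections import Counter

Z_RE = re.compile(r"Z[0-9]+")
T_RE = re.compile(r"T[0-9]+")


def tally(filenames):
    """Count, per pattern, file names with exactly one match and with several."""
    counts = Counter()
    for name in filenames:
        counts["total"] += 1
        for label, regex in (("Z", Z_RE), ("T", T_RE)):
            n = len(regex.findall(name))
            if n == 1:
                counts[label + "_once"] += 1
            elif n > 1:
                counts[label + "_multiple"] += 1
    return counts


def report(counts):
    """Format the totals in the layout shown above."""
    return "\n".join([
        "%d filenames processed" % counts["total"],
        "Z-numbers matched only once in %d filenames." % counts["Z_once"],
        "Z-numbers matched multiple times in %d filenames." % counts["Z_multiple"],
        "T-numbers matched only once in %d filenames." % counts["T_once"],
        "T-numbers matched multiple times in %d filenames." % counts["T_multiple"],
    ])


# Invented sample names; "TEST1" contains a stray "T1" match.
names = [
    "2009-11-03-experiment1-Z07-T42.tif",
    "2009-11-03-TEST1-Z08-T42.tif",
]
print(report(tally(names)))
```

In practice the list of names would come from something like `os.listdir()` on the acquisition folder.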

For example, if you have 23,768 file names with multiple matches on T-numbers, have the script print every 500th such file name, which gives you 47 samples to check.
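A possible way to do that sampling, assuming the file names are already in a list. The helper names `multiple_match_names` and `every_nth` are hypothetical.

```python
import re

T_RE = re.compile(r"T[0-9]+")


def multiple_match_names(filenames, regex=T_RE):
    """Return the file names in which the regex matches more than once."""
    return [name for name in filenames if len(regex.findall(name)) > 1]


def every_nth(items, step=500):
    """Return the 500th, 1000th, ... item (every `step`-th element)."""
    return items[step - 1::step]


# With 23,768 offending names and step=500 this yields 47 samples.
offenders = ["TEST1-T%05d.tif" % i for i in range(23768)]
for name in every_nth(multiple_match_names(offenders))[:3]:
    print(name)
```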

Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to reduce the multiple matches to zero while still giving exactly one match for each file name. Or, in the worst case, [0-9][ -/.=]T[0-9]+[ -/.=]
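To see the effect of the delimiters, here is a small comparison on an invented file name that contains a stray "T1". The capturing group around the digits is an addition for convenience and is not part of the expression above.

```python
import re

LOOSE_T = re.compile(r"T[0-9]+")
# Same delimiter classes as above, with a capturing group added around the digits.
STRICT_T = re.compile(r"[ -/.=]T([0-9]+)[ -/.=]")

name = "2009-11-03-TEST1-Z08-T42.tif"  # invented example
print(LOOSE_T.findall(name))   # ['T1', 'T42']
print(STRICT_T.findall(name))  # ['42']
```

Note that inside the character class, " -/" is a range from space to slash, which already covers "-" and ".".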

+1

There is a Perl module called String::Diff that can generate a regular expression from two different strings. The example it gives is

 my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
 print "$diff\n";

outputs:

  this\ is\ (?:Perl|Ruby)

Perhaps you could feed pairs of file names into this to get an initial regular expression. However, it will not give you capture groups for the numbers and so on, so it will not be fully automatic. After getting the diff, you would have to edit it by hand or apply some kind of substitution to arrive at a working final regular expression.
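For readers who prefer Python, the same idea can be approximated with the standard library's difflib. This is not how String::Diff is implemented, just a rough sketch of the technique: shared text is kept literally and each differing stretch becomes an alternation. The function name `diff_regexp` is borrowed from the Perl module for readability only.

```python
import re
from difflib import SequenceMatcher


def diff_regexp(a, b):
    """Build a regex matching both strings: literal common parts, alternation elsewhere."""
    parts = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(re.escape(a[i1:i2]))
        else:
            # replace / delete / insert: one side may be an empty string
            parts.append("(?:%s|%s)" % (re.escape(a[i1:i2]), re.escape(b[j1:j2])))
    return "".join(parts)


pattern = diff_regexp("exp1-Z07-T42.tif", "exp1-Z08-T43.tif")
print(pattern)
print(bool(re.fullmatch(pattern, "exp1-Z07-T42.tif")))
```

As with the Perl version, the result only alternates between the literal values it has seen, so the digit alternations would still need to be replaced by something like ([0-9]+) before the pattern is useful for extraction.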

+2

For Python, see this question about TemplateMaker.

0