Find the "N-gram" substrings with the smallest distance from a target string N characters long

I am looking for an algorithm, preferably in Python, that will find the length-N substrings of an existing string that are closest to a target string of length N.

Consider, for example, a target string 4 characters long:

targetString -> '1111' 

Suppose this is the string I have (I will generate substrings of it to find the best alignment):

 nonEmptySubStrings -> ['110101'] 

Its substrings of length 4 are:

 nGramsSubStrings -> ['0101', '1010', '1101'] 

I want to write / use a "magic function" that selects the substring(s) closest to targetString:

 someMagicFunction -> ['1101'] 

Some more examples:

 nonEmptySubStrings -> ['101011']
 nGramsSubStrings   -> ['0101', '1010', '1011']
 someMagicFunction  -> ['1011']

 nonEmptySubStrings -> ['10101']
 nGramsSubStrings   -> ['0101', '1010']
 someMagicFunction  -> ['0101', '1010']

Is this "magic function" a well-known substring problem?

Essentially, I want to find the minimum number of changes to nonEmptySubStrings so that it contains targetString as a substring.
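For what it's worth, a minimal sketch of such a function, assuming Hamming distance (substitutions only) is the metric wanted, could look like this (`some_magic_function`, `ngrams`, and `hamming` are made-up names):

```python
def ngrams(s, n):
    # all length-n substrings of s
    return [s[i:i+n] for i in range(len(s) - n + 1)]

def hamming(a, b):
    # number of positions where equal-length strings a and b differ
    return sum(x != y for x, y in zip(a, b))

def some_magic_function(target, source):
    # return every length-len(target) substring of source at the
    # minimum Hamming distance from target (deduplicated, sorted)
    grams = ngrams(source, len(target))
    best = min(hamming(target, g) for g in grams)
    return sorted(set(g for g in grams if hamming(target, g) == best))
```

With this sketch, `some_magic_function('1111', '110101')` gives `['1101']`, matching the examples above.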

+4
3 answers

The gist of the comments on the question is what you need:

 import functools

 def edit_distance(str1, str2):
     ...  # implement it here

 f = functools.partial(edit_distance, target_string)
 return min(f(s) for s in slices(string_, len(target_string)))  # use slices from below

This returns the minimum edit distance of any substring to the target string. It does not tell you which substring it was or its index, but it can easily be changed to do so.
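For example, one way to extend it to report the index as well (a sketch, with a Hamming-distance stand-in for whatever `edit_distance` you implement):

```python
def hamming(a, b):
    # stand-in metric; substitute your edit_distance here
    return sum(x != y for x, y in zip(a, b))

def slices(s, n):
    # yield (index, window) for every length-n window of s
    for i in range(len(s) - n + 1):
        yield i, s[i:i+n]

def best_with_index(target, source):
    # (distance, index, substring) of the closest window;
    # tuples compare element-wise, so min picks the smallest distance
    return min((hamming(target, sub), i, sub)
               for i, sub in slices(source, len(target)))
```

Here `best_with_index('1111', '110101')` gives `(1, 0, '1101')`.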


The naive way, which may well be the best, is:

 import functools

 def diff(str1, str2):
     # However you define the distance goes here, e.g. Hamming distance,
     # Levenshtein distance, etc.
     ...

 def slices(string_, L):
     for i in xrange(len(string_) - L + 1):
         yield string_[i:i+L]

 best_match = min(slices(string_, len(target_string)),
                  key=functools.partial(diff, target_string))

This does not return the index at which the substring occurs. But of course, you did not say you needed that in your question ;)

If you want to do better than this, it will depend on how you measure the distance, and it basically comes down to skipping some substrings by showing that you would need to change at least x characters to beat the best match you already have. At that point you can jump ahead x characters.
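One simple form of that idea, for Hamming distance, is to abandon a window as soon as its mismatch count exceeds the best distance found so far (a sketch; `hamming_bounded` and `best_substring` are illustrative names):

```python
def hamming_bounded(a, b, limit):
    # Hamming distance, bailing out once it exceeds `limit`
    d = 0
    for x, y in zip(a, b):
        if x != y:
            d += 1
            if d > limit:
                return limit + 1  # cannot beat the current best
    return d

def best_substring(target, source):
    # scan every window, but let hamming_bounded cut comparisons short
    n = len(target)
    best, best_d = None, n + 1
    for i in range(len(source) - n + 1):
        d = hamming_bounded(target, source[i:i+n], best_d - 1)
        if d < best_d:
            best, best_d = source[i:i+n], d
    return best, best_d
```

This still visits every window, but each comparison stops early once it is provably worse than the current best.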

+1

I believe you need edit distance. Peter Norvig's spelling corrector is an example implementation in Python. Here is an implementation of Levenshtein distance. See also this question.
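For reference, a compact dynamic-programming implementation of Levenshtein distance (a standard textbook version, not taken from the linked page):

```python
def levenshtein(a, b):
    # classic DP over a rolling row: prev[j] holds the edit distance
    # between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]
```

For example, `levenshtein('kitten', 'sitting')` is 3.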

EDIT: This is quite common in bioinformatics. See FASTA and BLAST. Bioinformatics has many variations of this algorithm. See sequence alignment for an overview of methods.

+3

Recently, for a discussion of gene alignment, I wrote this pyparsing example, using the pyparsing CloseMatch class. Normally, pyparsing expressions return a structure containing the matched strings and any named results, but CloseMatch returns a 2-tuple containing the matched string and a list of mismatch locations in the matched string. Here is how CloseMatch would be used:

 searchseq = CloseMatch("TTAAATCTAGAAGAT", 3)
 for g in genedata:
     print "%s (%d)" % (g.id, g.genelen)
     print "-"*24
     for t, startLoc, endLoc in searchseq.scanString(g.gene):
         matched, mismatches = t[0]
         print "MATCH:", searchseq.sequence
         print "FOUND:", matched
         if mismatches:
             print "      ", ''.join(' ' if i not in mismatches else '*'
                                     for i, c in enumerate(searchseq.sequence))
         else:
             print "<exact match>"
         print "at location", startLoc

Here is an example of partial match output:

 organism=Toxoplasma_gondii_RH (258)
 ------------------------
 MATCH: TTAAATCTAGAAGAT
 FOUND: TTAAATTTAGGAGCT
              *   *  *
 at location 195

Note that this class does not find overlapping matches. That can still be done, but with a slightly different approach using scanString (which I will include in the next release of pyparsing).

+2
