Comparing multiple arbitrary string sequences with oriented characters

Question

Comparing multiple arbitrary string sequences with oriented characters

Main problem:

I am looking for an algorithm to calculate the maximum economical distance between a set of rows. With distance, I mean something similar to the Damerau-Levenshtein distance , i.e. minimum number of deletions, insertions, substitutions, and transposing characters or adjacent blocks of characters. But instead of the usual strings, I want to examine strings with oriented characters.

So the line might look like this:

(A,1) (B,1) (C,1) (D,1)

and possible derivatives may be:

(A,1) (C,0) (B,0) (D,1)
(A,1) (C,1) (B,1) (D,1)
(A,1) (B,0) (C,0) (D,1)

Where A,B,C,D are the identifiers of the characters and 1 = forward and 0 = reverse .

Here, the derivative 1. would have a distance of 2, since you could cut the BC block and reinsert it in inverted (1 cut, 1 paste). Derived 2. would also have 2, since you can cut C and re-paste it before B (1 cut, 1 Paste), while number 3. will require a conversion of 4 operations (2 cuts, 2 pastes). Similarly, deleting or inserting blocks will give a distance of 1.

If you define (X,0) and (X,1) as two different non-oriented characters (X0, X1) for all possible X, example 3. will lead to distance 2, since you could cut out block B1C1 and paste block B0C0 into two stages.

Real world example:

Genes in the bacterial genome can be considered as oriented character (A, 0), (B, 0) ... Having determined the distance of the sequence, the genomic orientation of the homology genes in two related bacteria can be used as a trace of evolutionary markers. The fact that bacterial genomes are round lines introduces an additional condition for the ABC boundary, equal to BCA.

Real genomes do have unique genes with no equivalent in a partner, which leads to the appearance of the @ site owner symbol. These place owners reduce the information content of the comparison to the lower boundary, since, for example, (A, 1) (B, 1) @ (C, 1) can be converted to (A, 1) @@@ (B, 1) @ ( C, 1) by inserting the block @@@. However, the orientation partially restores the information content, since you can find (A, 1) @@@ (B, 0) @ (C, 1) by specifying the minimum distance 3. Even better would be an algorithm for comparing several related sequences (genomes) simultaneously , since you could find intermediate links in evolutionary history, which increases resolution.

I understand there are several questions already sent compared to text strings. But they cannot easily expand, including orientation. In addition, there are many methods for controlling biological sequences, in particular for the analysis of multiple sequences. However, they are limited to sequences of macromolecules that do not exist in alternative orientations and usually cause specific weights for any particular match of characters.

If there is already a python library that would allow you to make the necessary settings to solve this problem, this would be fantastic. But any suitable orientation recognition algorithm will be very useful.

+7

python string algorithm character levenshtein distance

user152314 Aug 23 '13 at 15:40

source share

1 answer

Games brainiac · Accepted Answer · 2013-09-27T00:59:57+0000

I believe the following code may help you:

 from itertools import permutations from random import randint from pprint import pprint def generate_genes(): """ Generates a boilerplate list of genes @rtype : list """ tuple_list = [] for i in range(16): binary_var = bin(i)[2:] if len(binary_var) != 4: binary_var = "0" * (4 - len(binary_var)) + binary_var tuple_list.append([('A', (1 if binary_var[0] == '1' else 0)), ('B', (1 if binary_var[1] == '1' else 0)), ('C', (1 if binary_var[2] == '1' else 0)), ('D', (1 if binary_var[3] == '1' else 0))]) return tuple_list def all_possible_genes(): """ Generates all possible combinations of ABCD genes @return: returns a list of combinations @rtype: tuple """ gene_list = generate_genes() all_possible_permutations = [] for gene in gene_list: all_possible_permutations.append([var for var in permutations(gene)]) return all_possible_permutations def gene_stringify(gene_tuple): """ @type gene_tuple : tuple @param gene_tuple: The gene tuple generated """ return "".join(str(var[0]) for var in gene_tuple if var[1]) def dameraulevenshtein(seq1, seq2): """Calculate the Damerau-Levenshtein distance between sequences. This distance is the number of additions, deletions, substitutions, and transpositions needed to transform the first sequence into the second. Although generally used with strings, any sequences of comparable objects will work. Transpositions are exchanges of *consecutive* characters; all other operations are self-explanatory. This implementation is O(N*M) time and O(M) space, for N and M the lengths of the two sequences. >>> dameraulevenshtein('ba', 'abc') 2 >>> dameraulevenshtein('fee', 'deed') 2 It works with arbitrary sequences too: >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e']) 2 """ # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix. # However, only the current and two previous rows are needed at once, # so we only store those. oneago = None thisrow = range(1, len(seq2) + 1) + [0] for x in xrange(len(seq1)): # Python lists wrap around for negative indices, so put the # leftmost column at the *end* of the list. This matches with # the zero-indexed strings and saves extra calculation. twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1] for y in xrange(len(seq2)): delcost = oneago[y] + 1 addcost = thisrow[y - 1] + 1 subcost = oneago[y - 1] + (seq1[x] != seq2[y]) thisrow[y] = min(delcost, addcost, subcost) # This block deals with transpositions if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]): thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) return thisrow[len(seq2) - 1] if __name__ == '__main__': genes = all_possible_genes() list1 = genes[randint(0, 15)][randint(0, 23)] list2 = genes[randint(0, 15)][randint(0, 23)] print gene_stringify(list1) pprint(list1) print gene_stringify(list2) pprint(list2) print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2))

Loans

Michael Homer for the algorithm

Comparing multiple arbitrary string sequences with oriented characters

Loans

More articles: