Main problem:
I am looking for an algorithm to calculate the most parsimonious distance between a set of strings. By distance, I mean something similar to the Damerau-Levenshtein distance, i.e. the minimum number of deletions, insertions, substitutions, and transpositions of characters or adjacent blocks of characters. But instead of ordinary strings, I want to examine strings with oriented characters.
So a string might look like this:
(A,1) (B,1) (C,1) (D,1)
and possible derivatives may be:
1. (A,1) (C,0) (B,0) (D,1)
2. (A,1) (C,1) (B,1) (D,1)
3. (A,1) (B,0) (C,0) (D,1)
Here A, B, C, D are the character identities, 1 = forward, and 0 = reverse.
Here, derivative 1. would have a distance of 2, since you could cut out the block BC and re-paste it inverted (1 cut, 1 paste). Derivative 2. would also have a distance of 2, since you can cut out C and re-paste it in front of B (1 cut, 1 paste), whereas number 3. would require 4 operations (2 cuts, 2 pastes) to be transformed. Analogously, deleting or inserting a block would give a distance of 1.
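To pin down the counting rule, here is a minimal brute-force sketch, not a proposed solution. It assumes one "move" is a cut of a contiguous block followed by a paste at any position, optionally inverted, and that each move costs 2 (1 cut + 1 paste). The names `invert`, `moves` and `cut_paste_distance` are my own, and the search is exponential, so it is only usable on toy strings like the ones above. Insertions and deletions are left out.

```python
from collections import deque

def invert(block):
    """Reverse a block and flip each orientation bit (1 = forward, 0 = reverse)."""
    return tuple((name, 1 - strand) for name, strand in reversed(block))

def moves(seq):
    """Yield every string reachable by one cut plus one (possibly inverted) paste."""
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n + 1):
            block, rest = seq[i:j], seq[:i] + seq[j:]
            for k in range(len(rest) + 1):
                for piece in (block, invert(block)):
                    yield rest[:k] + piece + rest[k:]

def cut_paste_distance(src, dst, max_moves=4):
    """Breadth-first search; returns 2 * number of moves, or None if not found."""
    src, dst = tuple(src), tuple(dst)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        cur, depth = queue.popleft()
        if cur == dst:
            return 2 * depth  # assumed cost model: 1 cut + 1 paste per move
        if depth == max_moves:
            continue
        for nxt in moves(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

base = (("A", 1), ("B", 1), ("C", 1), ("D", 1))
d1 = (("A", 1), ("C", 0), ("B", 0), ("D", 1))
d2 = (("A", 1), ("C", 1), ("B", 1), ("D", 1))
d3 = (("A", 1), ("B", 0), ("C", 0), ("D", 1))
print(cut_paste_distance(base, d1))  # 2
print(cut_paste_distance(base, d2))  # 2
print(cut_paste_distance(base, d3))  # 4
```

This reproduces the distances 2, 2 and 4 for the three derivatives, which is the behaviour I would want from a real algorithm.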
If you defined (X,0) and (X,1) as two different non-oriented characters (X0, X1) for all possible X, example 3. would result in a distance of 2, since you could then cut out the block B1C1 and insert the block B0C0 in two steps.
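A small sketch of that non-oriented encoding, to show why it undercounts. It uses `difflib.SequenceMatcher` from the standard library purely as a convenient block differ; difflib does not guarantee a minimal edit script, and the cost rule (delete block = 1, insert block = 1, replace = 2) as well as the helper names `flatten` and `block_replace_cost` are my assumptions.

```python
from difflib import SequenceMatcher

def flatten(seq):
    """Encode (X, 0) / (X, 1) as the plain, non-oriented tokens 'X0' / 'X1'."""
    return [f"{name}{strand}" for name, strand in seq]

def block_replace_cost(a, b):
    """Count block deletions and insertions reported by difflib (not guaranteed minimal)."""
    cost = 0
    for tag, *_ in SequenceMatcher(None, a, b, autojunk=False).get_opcodes():
        if tag in ("delete", "insert"):
            cost += 1
        elif tag == "replace":
            cost += 2  # one block removed plus one block inserted
    return cost

base = [("A", 1), ("B", 1), ("C", 1), ("D", 1)]
d3 = [("A", 1), ("B", 0), ("C", 0), ("D", 1)]
print(flatten(d3))                                      # ['A1', 'B0', 'C0', 'D1']
print(block_replace_cost(flatten(base), flatten(d3)))   # 2
```

The flattened comparison sees B1C1 simply swapped for B0C0 and reports 2, losing the fact that B and C are the same characters in a different orientation.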
Real world example:
The genes in a bacterial genome can be considered oriented characters (A, 0), (B, 0) ... By determining the sequence distance, the genomic orientation of homologous genes in two related bacteria could be used as an evolutionary marker trace. The fact that bacterial genomes are circular strings introduces the additional boundary condition that ABC is equal to BCA.
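The circular boundary condition could be handled by normalising each genome to a canonical rotation before comparing, roughly as sketched here. The function names are hypothetical. The optional `both_strands` flag additionally treats a genome read from the opposite strand (reversed, with every orientation flipped) as the same circle; that is an assumption on my part and is off by default.

```python
def invert(seq):
    """Reverse the order and flip every orientation bit (1 = forward, 0 = reverse)."""
    return tuple((name, 1 - strand) for name, strand in reversed(seq))

def canonical_rotation(seq, both_strands=False):
    """Return the lexicographically smallest rotation of a circular oriented string."""
    seq = tuple(seq)
    candidates = [seq[i:] + seq[:i] for i in range(len(seq))] or [seq]
    if both_strands:  # assumption: reading the other strand gives the same genome
        rev = invert(seq)
        candidates += [rev[i:] + rev[:i] for i in range(len(rev))]
    return min(candidates)

def same_circular_string(a, b, both_strands=False):
    """True if a and b are rotations of each other (ABC equals BCA)."""
    return canonical_rotation(a, both_strands) == canonical_rotation(b, both_strands)

abc = [("A", 1), ("B", 1), ("C", 1)]
bca = [("B", 1), ("C", 1), ("A", 1)]
acb = [("A", 1), ("C", 1), ("B", 1)]
print(same_circular_string(abc, bca))  # True
print(same_circular_string(abc, acb))  # False
```

Trying all rotations is quadratic in the genome length, which is fine for a sketch but would need something smarter for real genomes.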
Real genomes do have unique genes with no equivalent in the partner, which gives rise to a placeholder character @. These placeholders reduce the information content of the comparison to a lower bound, since, for example, (A, 1) (B, 1) @ (C, 1) can be transformed to (A, 1) @@@ (B, 1) @ (C, 1) by inserting the block @@@. However, the orientation partially restores the information content, since you may find (A, 1) @@@ (B, 0) @ (C, 1), indicating a minimum distance of 3. Even better would be an algorithm that compares several related sequences (genomes) simultaneously, since you could then find intermediates in the evolutionary history, which increases resolution.
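For concreteness, this is roughly how I would read the notation above into a data structure. It is only a parsing sketch: the regular expression, the choice to store a run of placeholders as `("@", run_length)`, and the helper names are my assumptions, not an established format.

```python
import re

# Matches either an oriented character like "(A, 1)" or a run of placeholders "@@@".
TOKEN = re.compile(r"\(\s*(\w+)\s*,\s*([01])\s*\)|(@+)")

def parse_oriented(text):
    """Parse '(A, 1) @@@ (B, 0)' into [('A', 1), ('@', 3), ('B', 0)]."""
    tokens = []
    for match in TOKEN.finditer(text):
        if match.group(3):  # a run of placeholder characters
            tokens.append(("@", len(match.group(3))))
        else:
            tokens.append((match.group(1), int(match.group(2))))
    return tokens

def shared_genes(tokens):
    """Drop placeholder runs, keeping only genes that have a partner."""
    return [token for token in tokens if token[0] != "@"]

genome = parse_oriented("(A, 1) @@@ (B, 0) @ (C, 1)")
print(genome)                # [('A', 1), ('@', 3), ('B', 0), ('@', 1), ('C', 1)]
print(shared_genes(genome))  # [('A', 1), ('B', 0), ('C', 1)]
```

Keeping the run length lets a later step decide whether @@@ counts as one inserted block or three separate characters.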
I realize there are several questions already posted on comparing text strings, but they cannot easily be extended to include orientation. In addition, there is a wealth of methods for dealing with biological sequences, in particular for multiple sequence analysis. However, those are limited to macromolecule sequences, which do not exist in alternative orientations, and they usually apply specific weights to any particular character match.
If there is already a Python library that allows the customization needed to solve this problem, that would be fantastic. But any suitable orientation-aware algorithm would be very helpful.