I have a default by default with ~ 700 keys. The keys are in the format A_B_STRING. I need to divide the key by β_β and compare the distance between βSTRINGβ for each key if A and B are identical. If the distance is <= 2, I want to group the lists for these keys into one group of defaultdict: value values. There will be many keys that must match and be grouped. I also want to keep the new combined defaultdict, all key: value pairs that did not fall into the group.
The input file is in FASTA format , where the headers are keys and the values ββare sequences (defaultdict is used because several sequences have the same header based on the explosion report from the original fasta file).
This is what I have so far:
!/usr/bin/env python
import sys
from collections import defaultdict
import itertools
inp = sys.argv[1]
with open(inp, 'r') as f:
h = []
s = []
for line in f:
if line.startswith(">"):
h.append(line.strip().split('>')[1])
else:
s.append(line.strip())
seqs = dict(zip(h,s))
print 'Total Sequences: ' + str(len(seqs))
groups = defaultdict(list)
for i in seqs:
groups['_'.join(i.split('_')[1:])].append(seqs[i])
def hamming(str1, str2):
""" Simple hamming distance calculator """
if len(str1) == len(str2):
diffs = 0
for ch1, ch2 in zip(str1,str2):
if ch1 != ch2:
diffs += 1
return diff
keys = [x for x in groups]
combos = list(itertools.combinations(keys,2))
combined = defaultdict(list)
for i in combos:
a1 = i[0].split('_')[0]
a2 = i[1].split('_')[0]
b1 = i[0].split('_')[1]
b2 = i[1].split('_')[1]
c1 = i[0].split('_')[2]
c2 = i[1].split('_')[2]
if a1 == a2 and b1 == b2:
d = hamming(c1, c2)
if d <= 2:
combined[i[0]].append(groups[i[0]] + groups[i[1]])
print len(combined)
for c in sorted(combined):
print c, '\t', len(combined[c])
The problem is that this code does not work as expected. When printing keys in a combined defaultdict; I clearly see that there are many that can be combined. However, the combined defaultdict is about half the size of the original.
Edit
Alternative no itertools.combinations:
for a in keys:
tocombine = []
tocombine.append(a)
tocheck = [x for x in keys if x != a]
for b in tocheck:
i = (a,b)
a1 = i[0].split('_')[0]
a2 = i[1].split('_')[0]
b1 = i[0].split('_')[1]
b2 = i[1].split('_')[1]
c1 = i[0].split('_')[2]
c2 = i[1].split('_')[2]
if a1 == a2 and b1 == b2:
if len(c1) == len(c2):
d = hamming(c1, c2)
if d <= 2:
tocombine.append(b)
for n in range(len(tocombine[1:])):
keys.remove(tocombine[n])
combined[tocombine[0]].append(groups[tocombine[n]])
final = defaultdict(list)
for i in combined:
final[i] = list(itertools.chain.from_iterable(combined[i]))
However, with this method, I still lack several of them that do not match any other.