Working with suffix trees in Python

I am relatively new to Python and am starting to work with suffix trees. I can build them, but I run into memory problems when the string gets large. I know that they can be used on DNA strands of size 4^10 or 4^12, but whenever I try to implement one, I run out of memory.

Here is my code for generating a string and suffix tree.

import random

def get_string(length):
    string=""
    for i in range(length):
        string += random.choice("ATGC")
    return string

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to 4**8, I run into serious memory problems. I am new to this, so I'm sure I am missing something about how to store these structures. Any advice would be greatly appreciated.

: , . 16. , 16 , . , .


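Every time through the loop, the slice

string[i+1:]

creates a new string (a copy), starting at position i+1.

Instead of copying, you can store a buffer (a view onto the original string; Python 2 only). For example: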

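def suffixtree(string):
    # Python 2 only: buffer() references the original string instead of copying it.
    # Like the original, this fills the module-level dict `tree`.
    N = len(string)
    for i in xrange(N):
        if string[i] in tree:
            tree[string[i]].append(buffer(string, i+1, N))
        else:
            tree[string[i]] = [buffer(string, i+1, N)]
    return tree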

, 1 8 ^ 11 .

, , , , . ( ) ; . buffer , .
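Note that buffer exists only in Python 2. A version-independent way to get the same saving is to keep only the start offset of each suffix and slice on demand. This is a swapped-in technique, not what the answer above proposes, and the names suffix_offsets and suffix_at are made up for illustration:

```python
from collections import defaultdict

def suffix_offsets(string):
    """Map each character to the start offsets of the suffixes that follow it.

    Same shape as the dict built by suffixtree() above, but each entry is an
    integer offset into `string` rather than a copied slice.
    """
    index = defaultdict(list)
    for i, ch in enumerate(string):
        index[ch].append(i + 1)
    return index

def suffix_at(string, offset):
    """Materialise one suffix only when it is actually needed."""
    return string[offset:]

index = suffix_offsets("banana$")
# index['a'] is [2, 4, 6]; suffix_at("banana$", 2) is 'nana$'
```

Storage is then one integer per character of the input instead of one copied suffix per character.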


. , -.

, , . .


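If the memory problem is in building the suffix tree, are you sure you need one? You can find all matches in a single string like this: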

word=get_string(4**12)+"$"

def matcher(word, match_string):
    # collect every start position of match_string in word (overlaps included)
    positions = [-1]
    while True:
        positions.append(word.find(match_string, positions[-1] + 1))
        if positions[-1] == -1:
            return positions[1:-1]

print matcher(word,'AAAAAAAAAAAA')
# [13331731, 13331732, 13331733]   (varies with the random string)
print matcher('AACTATAAATTTACCA','AT')
# [4, 8]

, 30 , 4 ^ 12. 12- , . - -.

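If you do need a suffix tree, you could use a ready-made implementation, such as the suffixtree module used here: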

import suffixtree
stree = suffixtree.SuffixTree(word)
print stree.find_substring("AAAAAAAAAAAA")

, , . , -, , , , . find_substring ( , , , ).

: ,

, 10 4 ^ 12, 9,5 ( , , ...). , ( , ), . ( , ) , 10 , . , . ( , , word , max_length, , , ):

def split_find(word, search_words, max_length):
    # any tail shorter than max_length is ignored, as in the original
    number_sub_trees = len(word) // max_length
    matches = {}
    for i in xrange(number_sub_trees):
        start = max_length * i
        end = start + max_length
        stree = suffixtree.SuffixTree(word[start:end])
        for search in search_words:
            if search in matches:
                continue
            match = stree.find_substring(search)
            if match > -1:
                matches[search] = match + start, i
            elif i < number_sub_trees - 1:
                # look for a match that straddles the boundary with the next chunk
                window_start = end - len(search)
                match = word[window_start:end + len(search)].find(search)
                if match > -1:
                    matches[search] = match + window_start, i
    return matches

word=get_string(4**12)
search_words = ['AAAAAAAAAAAAAAAA'] # list of all words to find matches for
max_length = 4**10 # as large as your machine can cope with; len(word) should be a multiple of it
print split_find(word,search_words,max_length)

4 ^ 10, 700 . , 4 ^ 12, 10 13 ( , , , , ). 100 , 1.00 * 41sec = 1 .

, 14 , ... 9,5 . , 1,6 1 , , !


The reason for the memory problems is that for the input 'banana' you generate {'b': ['anana$'], 'a': ['nana$', 'na$', '$'], 'n': ['ana$', 'a$']}. This is not a tree structure. You create every possible suffix of the input and store each one in one of the lists. This takes O(n^2) storage space. In addition, for a suffix tree to work correctly, you want the leaf nodes to store index positions.
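To put a number on the quadratic storage, here is a small back-of-the-envelope calculation. The function names are made up for illustration; it counts only characters and ignores per-object overhead, so real memory use would be higher:

```python
def total_suffix_chars(n):
    """Characters stored when every slice string[i+1:] of an n-character string is kept."""
    return sum(n - 1 - i for i in range(n))

def closed_form(n):
    """The same count as a formula: 0 + 1 + ... + (n - 1) = n(n - 1)/2."""
    return n * (n - 1) // 2

small = total_suffix_chars(len("banana$"))   # 21 characters for 7 input characters
large = closed_form(4**8 + 1)                # 2147516416 characters for 4**8 bases plus '$'
```

So at 4**8 the lists already hold about two billion characters, which matches the point where the question reports running out of memory.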

The result that you want to get is {'banana$': 0, 'a': {'$': 5, 'na': {'$': 3, 'na$': 1}}, 'na': {'$': 4, 'na$': 2}}. (This is an optimized representation; a simpler approach would restrict itself to single-character labels.)
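As a rough sketch of the simpler single-character-label variant, here is a naive suffix trie built from nested dicts, with leaves storing the suffix start index. The names build_suffix_trie and find_all are made up for illustration. It is not a linear-time or linear-space construction (that needs compressed edge labels and an algorithm such as Ukkonen's), so it only shows the structure on small inputs:

```python
def build_suffix_trie(text, terminator="$"):
    """Naive suffix trie: nested dicts keyed by single characters.

    Assumes `terminator` does not occur inside `text`. The terminator key of
    a node maps to the start index of the suffix that ends there.
    """
    if not text.endswith(terminator):
        text += terminator
    root = {}
    last = len(text) - 1          # position of the terminator
    for start in range(len(text)):
        node = root
        for pos in range(start, last):
            node = node.setdefault(text[pos], {})
        node[terminator] = start
    return root

def find_all(trie, pattern, terminator="$"):
    """Return the sorted start positions of every occurrence of `pattern`."""
    node = trie
    for ch in pattern:
        child = node.get(ch)
        if not isinstance(child, dict):   # missing edge, or the terminator leaf
            return []
        node = child
    positions = []
    stack = [node]
    while stack:                           # collect every leaf index below the match
        current = stack.pop()
        for key, child in current.items():
            if key == terminator:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("banana")
# find_all(trie, "ana") returns [1, 3]; find_all(trie, "na") returns [2, 4]
```

Lookups cost time proportional to the pattern length plus the number of leaves collected, which is the reason for preferring a tree over the flat lists in the question.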
