Performance issues in Burrows-Wheeler in python

I tried to implement Barrows-Wheeler in python. (This is one of the assignments in the online course, but I hope I did some work to qualify, to seek help).

The algorithm works as follows. Take a line that ends with a special character ($ in my case) and create all the looping lines from that line. Sort all of these lines in alphabetical order, having a special character always smaller than any other character. After that, get the last element of each row.

This gave me an oneliner:

''.join([i[-1] for i in sorted([text[i:] + text[0:i] for i in xrange(len(text))])] 

Which is correct and fast enough for large enough lines (this is enough to solve the problem):

  60 000 chars - 16 secs 40 000 chars - 07 secs 25 000 chars - 02 secs 

But when I tried to process a really huge string with several million characters, I failed (it takes too much time to process).

I assume the problem is storing too many lines in memory.

Is there any way to overcome this?

PS I just want to note that it may also seem like a home problem, my solution is already going through a grader, and I'm just looking for a way to do it faster. In addition, I do not spoil the fun for other people, because if they want to find a solution, there is a version in the wiki similar to mine. I also checked this questio n, which seems similar, but answers the more complex question of how to decode a string encoded using this algorithm.

+7
performance python algorithm
source share
3 answers

It takes a long time to make all of these string snippets with long strings. This is at least O (N ^ 2) (since you are creating N lines of N lengths, and each of them must be copied into memory, taking the original data from the original), which destroys the overall performance and makes sorting inconsequential. Not to mention the memory requirements!

Instead of actually slicing a string, the next thought is to arrange the i values ​​that you use to create the circular rows in the order in which they are matched β€” without actually creating it. It turns out to be a bit complicated. (Removed / edited something here that was incorrect, see @TimPeters answer.)

The approach I made here is to bypass the standard library, which makes it difficult (though not impossible) to compare these strings on demand - and do my own sorting. The natural choice of the algorithm here is radix sorting , since in any case we need to consider strings one character at a time.

Let him tune in first. I am writing code for version 3.2, so the season is to my taste. (In particular, in 3.3 and higher, we could use yield from .) I use the following imports:

 from random import choice from timeit import timeit from functools import partial 

I wrote a universal number sorting function like this one:

 def radix_sort(values, key, step=0): if len(values) < 2: for value in values: yield value return bins = {} for value in values: bins.setdefault(key(value, step), []).append(value) for k in sorted(bins.keys()): for r in radix_sort(bins[k], key, step + 1): yield r 

Of course, we do not need to be universal (our "baskets" can only be marked with single characters, and presumably you really mean to apply the algorithm to the bytes sequence;)), but this will not hurt. Perhaps there is something reusable, right? In any case, the idea is simple: we process the base register, and then we throw each element into "bin" according to the result from the key function, and then we extract the values ​​from the bins in a sorted buffer order, recursively sorting each contents of the basket.

The interface requires that key(value, n) give us n th the "radius" value . Therefore, for simple cases, for example, to compare strings directly, it can be as simple as lambda v, n: return v[n] . Here, however, the idea is to compare the indices in a row, according to the data in the row at that point (considered cyclically). So let's define the key:

 def bw_key(text, value, step): return text[(value + step) % len(text)] 

Now the trick to getting the right results is to remember that we conceptually concatenate the last characters of the strings that we are not actually creating. If we look at a virtual string created using index n , its last character is in index n - 1 , due to the way we wrap it, and for one moment it will think that this still works when n == 0 ;) . [However, when we complete the forwards, we still need to keep the string index within the bounds - hence, the modulo operation is in the key function.]

This is a common key function that must be passed to text , which it will refer to when converting value for comparison. What is where functools.partial comes in - you can also just mess around with lambda , but it is probably cleaner and I found it usually faster.

In any case, now we can easily record the actual conversion using the key:

 def burroughs_wheeler_custom(text): return ''.join(text[i - 1] for i in radix_sort(range(len(text)), partial(bw_key, text))) # Notice I've dropped the square brackets; this means I'm passing a generator # expression to `join` instead of a list comprehension. In general, this is # a little slower, but uses less memory. And the underlying code uses lazy # evaluation heavily, so :) 

Nice and beautiful. Let's see how it turns out, right? We need a standard for comparison:

 def burroughs_wheeler_standard(text): return ''.join([i[-1] for i in sorted([text[i:] + text[:i] for i in range(len(text))])]) 

And the synchronization procedure:

 def test(n): data = ''.join(choice('abcdefghijklmnopqrstuvwxyz') for i in range(n)) + '$' custom = partial(burroughs_wheeler_custom, data) standard = partial(burroughs_wheeler_standard, data) assert custom() == standard() trials = 1000000 // n custom_time = timeit(custom, number=trials) standard_time = timeit(standard, number=trials) print("custom: {} standard: {}".format(custom_time, standard_time)) 

Pay attention to the math I did to solve a few trials inversely related to the length of the string test . This should support the total time spent on testing in a fairly narrow range - right ?;) (Wrong, of course, since we found that the standard algorithm is not less than O (N ^ 2).)

Let's see how this is done (* drumroll *):

 >>> imp.reload(burroughs_wheeler) <module 'burroughs_wheeler' from 'burroughs_wheeler.py'> >>> burroughs_wheeler.test(100) custom: 4.7095093091438684 standard: 0.9819262643716229 >>> burroughs_wheeler.test(1000) custom: 5.532266880287807 standard: 2.1733253807396977 >>> burroughs_wheeler.test(10000) custom: 5.954826800612864 standard: 42.50686064849015 

Hey, this is a little scary jump. Anyway, as you can see, the new approach adds a ton of overhead to short lines, but allows actual sorting to be a bottleneck instead of cutting lines. :)

+10
source share

Just add a little to @KarlKnechtel's answer.

First, the "standard way" to speed up the cyclic permutation is only to insert two copies together and index them directly. After:

 N = len(text) text2 = text * 2 

then the cyclic permutation starting with index i is simply text2[i: i+N] , and the symbol j is that the permutation is just text2[i+j] . It is not necessary to glue two slices or module operations ( % ).

Secondly, you can use the built-in sort() to do this, though:

  • This is funky; -)
  • For strings with several separate characters (compared to the length of the string) sorting Karl radix will almost certainly be faster.

As a proof of concept, here is a replacement replacement for this part of Karl's code (although this applies to Python 2):

 def burroughs_wheeler_custom(text): N = len(text) text2 = text * 2 class K: def __init__(self, i): self.i = i def __lt__(a, b): i, j = ai, bi for k in xrange(N): # use `range()` in Python 3 if text2[i+k] < text2[j+k]: return True elif text2[i+k] > text2[j+k]: return False return False # they're equal inorder = sorted(range(N), key=K) return "".join(text2[i+N-1] for i in inorder) 

Please note that the built-in implementation of sort() calculates the key exactly once for each element in the input file and saves these results during sorting. In this case, the results are lazy little instances of K that simply remember the starting index, and the __lt__ method compares one pair of characters at a time, while "less than!" or "more than!" permitted.

+6
source share

I agree with the previous answer, the line / slicing list in python becomes a bottleneck when doing huge algorithmic calculations. The idea does not cut.

[EDIT: do not cut, but index the list. If you use array.array instead of lists, the execution time is reduced to half. Indexing arrays are simple, indexing lists is a more complicated process)]

Here is a more functional solution to your problem.

An idea that has a generator will act like slicer (rslice). This is similar to itertools.islice, but it comes to the beginning of the line when it reaches the end. And he will stop before reaching the starting position that you indicated when you created it. With this trick, you don’t copy substrings in memory, so in the end you only have pointers moving around your line, not creating copies everywhere.

So, we create a list containing [rslices, lastchar slicer] and we sort it using the rslice key (as you can see in the cf sort function).

When sorting, you will only need to collect for each element in the list a second element (the last element of a previously saved slice).

 from itertools import izip def cf(i1,i2): for i,j in izip(i1[0](),i2[0]()): # We grab the the first element (is a lambda) and execute it to get the generator if i<j: return -1 elif i>j: return 1 return 0 def rslice(cad,pos): # Slice that rotates through the string (it a generator) pini=pos lc=len(cad) while pos<lc: yield cad[pos] pos+=1 pos=0 while pos<pini-1: yield cad[pos] pos+=1 def lambdagen(start,cad): # Closure to hold a generator return lambda: rslice(cad,start) def bwt(txt): lt=len(txt) arry=list(txt)+[None] l=[(lambdagen(0,arry),None)]+[(lambdagen(i,arry),arry[i-1]) for i in range(1,lt+1)] # What we keep in the list is the generator for the rotating-slice, plus the # last character of the slice, so we save the time of going through the whole # string to get the last character l.sort(cmp=cf) # We sort using our cf function return [i[1] for i in l] print bwt('Text I want to apply BTW to :D') # ['D', 'o', 'y', 't', 'o', 'W', 't', 'I', ' ', ' ', ':', ' ', 'B', None, 'T', 'w', ' ', # 'T', 'p', 'a', 't', 't', 'p', 'a', 'x', 'n', ' ', ' ', ' ', 'e', 'l'] 

EDIT: using arrays (runtime reduced by 2):

 def bwt(txt): lt=len(txt) arry=array.array('h',[ord(i) for i in txt]) arry.append(-1) l=[(lambdagen(0,arry),None)]+[(lambdagen(i,arry),arry[i-1]) for i in range(1,lt+1)] l.sort(cmp=cf) return [i[1] for i in l] 
+1
source share

All Articles