How to implement truly efficient bitvector sorting in python

This is actually an interesting topic from the programming of pearls, sorting 10 digits of telephone numbers in limited memory with an efficient algorithm. You can find the whole story here.

I am wondering how fast the implementation can be in python. I made a naive implementation using the module bit vector. The code is as follows:

from BitVector import BitVector import timeit import random import time import sys def sort(input_li): return sorted(input_li) def vec_sort(input_li): bv = BitVector( size = len(input_li) ) for i in input_li: bv[i] = 1 res_li = [] for i in range(len(bv)): if bv[i]: res_li.append(i) return res_li if __name__ == "__main__": test_data = range(int(sys.argv[1])) print 'test_data size is:', sys.argv[1] random.shuffle(test_data) start = time.time() sort(test_data) elapsed = (time.time() - start) print "sort function takes " + str(elapsed) start = time.time() vec_sort(test_data) elapsed = (time.time() - start) print "sort function takes " + str(elapsed) start = time.time() vec_sort(test_data) elapsed = (time.time() - start) print "vec_sort function takes " + str(elapsed) 

I tested from an array size of 100 to 10,000,000 in my macbook (2 GHz Intel Core 2 Duo 2GB SDRAM), the result is as follows:


  • Test Date Size: 1000 Sort Function
  • accepts 0.000274896621704
  • Vec_sort function takes 0.00383687019348

  • Test Date Size: 10000

  • Sort function accepts 0.00380706787109
  • Vec_sort function takes 0.0371489524841

  • Test Date Size: 100000

  • The sort function accepts 0.0520560741425
  • The vec_sort function takes the value 0.374383926392

  • Test Date Size: 1,000,000

  • sorting function takes 0.867373943329
  • Vec_sort function takes 3.80475401878

  • Test_data size: 10000000

  • Sort function accepts 12.9204008579
  • Vec_sort function takes 38.8053860664

What disappoints me is that even when the size of test_data is 100,000,000, the sort function is still faster than vec_sort. Is there a way to speed up the vec_sort function?

+4
source share
2 answers

As Nicky noted, you are comparing a very fast C procedure with Python. Using psyco speeds it up for me, but you can really speed it up using a bit vector module written in C. I used bitarray and then the bit sorting method is superior to the built-in sorting for an array size of around 250,000 using psyco.

Here is the function I used:

 def vec_sort2(input_li): bv = bitarray(len(input_li)) bv.setall(0) for i in input_li: bv[i] = 1 return [i for i in xrange(len(bv)) if bv[i]] 

Notice also that I used list comprehension to build a sorted list that helps a bit. Using psyco and above functions with your functions, I get the following results:

 test_data size is: 1000000 sort function takes 1.29699993134 vec_sort function takes 3.5150001049 vec_sort2 function takes 0.953999996185 

As a side note, BitVector is not particularly optimized even for Python. Before I found bitarray, I made several different settings in the module and using my module with settings, the time for vec_sort is reduced by a second for such an array size. I did not make my changes to it, because bitarray is just a lot faster.

+3
source

My Python is not the best, but it looks like you have a bug in the code:

 bv = BitVector( size = len(input_li) ) 

The size of your battlevector matches the size of your input array. You want the bitvector to be the size of your domain - 10 ^ 10. I'm not sure how Python bitvectors deal with overflows, but if it automatically resizes the bitvector, then you get quadratic behavior.

Also, I believe that the Python sort function is implemented in C and will not have the overhead of sorting implemented exclusively in Python. However, this probably will not cause the O (nlogn) algorithm to run much faster than the O (n) algorithm.

Edit: this view will only work on large datasets. Your algorithm runs in O (n + 10 ^ 10) time (based on your tests, I assume you know that), which will be worse than O (nlogn) for small inputs.

+1
source

Source: https://habr.com/ru/post/1312114/


All Articles