How can I make my greedy set cover implementation faster?

I came up with the following implementation of greedy set cover after much discussion of my original question here. From the help I received, I coded the problem as a weighted greedy set cover, and after receiving additional help here, I came up with the implementation below. I am grateful to everyone for helping me with this. The implementation works fine, but I want to make it scalable / fast.

By scalable / fast, I mean:

  • My dataset contains about 50K-100K sets in S
  • The number of elements in U itself is very small, on the order of 100-500
  • The size of each set in S can be from 0 to 40

And here is my attempt:

U = set([1, 2, 3, 4])
R = U
S = [set([1, 2]), set([1]), set([1, 2, 3]), set([1]), set([3, 4]),
     set([4]), set([1, 2]), set([3, 4]), set([1, 2, 3, 4])]
w = [1, 1, 2, 2, 2, 3, 3, 4, 4]
C = []
costs = []

def findMin(S, R):
    minCost = 99999.0
    minElement = -1
    for i, s in enumerate(S):
        try:
            # Cost effectiveness: weight per newly covered element
            # (float() avoids integer floor division under Python 2).
            cost = float(w[i]) / len(s.intersection(R))
            if cost < minCost:
                minCost = cost
                minElement = i
        except ZeroDivisionError:
            # This set covers no remaining element; skip it.
            pass
    return S[minElement], w[minElement]

while len(R) != 0:
    S_i, cost = findMin(S, R)
    C.append(S_i)
    R = R.difference(S_i)
    costs.append(cost)

print "Cover: ", C
print "Total Cost: ", sum(costs), costs

I am not an expert in Python, so any Python-specific optimization of this code would be very welcome.

+2
performance optimization python algorithm scalability
2 answers

What kind of times are you getting now, and what do you need? Surely most of the runtime is being spent in the set intersections at the C level, so there is not much left to optimize? With some random data (results may of course differ from your data, and I am not sure these are good values): 100,000 sets, 40 elements in each set, 500 unique elements, weights from 1 to 100,

import random

print 'generating test data'
num_sets = 100000
set_size = 40
elements = range(500)
U = set(elements)
R = U
S = []
for i in range(num_sets):
    random.shuffle(elements)
    S.append(set(elements[:set_size]))
w = [random.randint(1, 100) for i in xrange(num_sets)]
C = []
costs = []

I got this performance with cProfile:

  8200209 function calls in 14.391 CPU seconds

  Ordered by: standard name

  ncalls   tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.391   14.391 <string>:1(<module>)
       41    4.802    0.117   14.389    0.351 test.py:23(findMin)
        1    0.001    0.001   14.391   14.391 test.py:40(func)
  4100042    0.428    0.000    0.428    0.000 {len}
       82    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
       41    0.001    0.000    0.001    0.000 {method 'difference' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  4100000    9.160    0.000    9.160    0.000 {method 'intersection' of 'set' objects}

Hmm, so apparently about 1/3 of the time is spent outside the set intersections themselves. But personally I would not optimize any further, especially at the cost of clarity. There is not much you can do about the other 2/3, so why bother?
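For reference, here is a rough sketch of how the greedy loop can be wrapped in a function and profiled with cProfile. The function name func is an assumption, chosen only to mirror the test.py:40(func) entry in the output above.

import cProfile

def func():
    # The greedy loop from the question, unchanged, wrapped in a function so
    # that cProfile can attribute its time to a single call.
    global R  # R is rebound inside the loop, so it must be declared global
    while len(R) != 0:
        S_i, cost = findMin(S, R)
        C.append(S_i)
        R = R.difference(S_i)
        costs.append(cost)

cProfile.run('func()')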

+2

I used a trick when I implemented the well-known greedy algorithm for set cover (without weights) in Matlab. Perhaps you could extend this trick to the weighted case, using set cardinality / set weight instead of set cardinality. Moreover, if you use the NumPy library, porting the Matlab code to Python should be very simple.

Here is the trick:

  • (optional) I sorted the sets in descending order of cardinality (i.e. the number of elements they contain). I also stored their cardinalities.
  • I select a set S (in my implementation the largest one, i.e. the first set in the list) and count how many uncovered elements it contains. Say it contains n uncovered elements.
  • Since I now know there is a set S with n uncovered elements, I do not need to process any set whose cardinality is less than n, because it cannot do better than S. So I only need to look for the best set among the sets of cardinality at least n; thanks to the sorting, it is easy to restrict the search to them. A sketch of this idea for the weighted case follows this list.
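Below is a minimal Python sketch of how this pruning might carry over to the weighted case, as suggested above. The bound it uses (a set's weight divided by its full cardinality is the best cost that set could ever achieve) is an assumption extrapolated from the unweighted trick, and the function and variable names are illustrative only.

def greedy_weighted_cover(U, S, w):
    # Sort set indices once by weight / cardinality: that ratio is the lowest
    # cost a set could ever achieve (when all of its elements are still
    # uncovered), so it is a valid lower bound throughout the run.
    order = sorted([i for i in range(len(S)) if len(S[i]) > 0],
                   key=lambda i: float(w[i]) / len(S[i]))
    R, C, costs = set(U), [], []
    while R:
        best_cost, best_i = float('inf'), -1
        for i in order:
            bound = float(w[i]) / len(S[i])
            if bound >= best_cost:
                break              # no later set in the order can do better
            covered = len(S[i] & R)
            if covered:
                cost = float(w[i]) / covered
                if cost < best_cost:
                    best_cost, best_i = cost, i
        if best_i == -1:
            break                  # remaining elements cannot be covered
        C.append(S[best_i])
        costs.append(w[best_i])    # total cost = sum of chosen weights, as in the question
        R = R - S[best_i]
    return C, costs

Usage would be something like C, costs = greedy_weighted_cover(U, S, w). The early break lets later iterations skip most of the intersection calls once a cheap set has been found, which is where the profiling in the other answer suggests the time is going.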
+3
