Any good greedy set cover implementations for large datasets?

This question follows on from a related question of mine here. @mhum suggested that my problem falls within the scope of covering problems. I have tried to encode my problem as a minimum set cover problem, and currently I have a data set in this form:

Set          Cost
(1,2)        1
(1)          1
(1,2,3)      2
(1)          2
(3,4)        2
(4)          3
(1,2)        3
(3,4)        4
(1,2,3,4)    4

The goal is to find a collection of sets that covers all the numbers while trying to minimize the total cost. My data set is large: at least 30,000 sets, ranging in size from 5 to 40 items. Are there any good greedy implementations that solve this, or am I on my own to implement one? I am not an expert in LP, but any LP solver (from numpy/scipy) that can handle this would also be acceptable.
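For concreteness, the data above could be represented in Python roughly like this (the variable names are only illustrative, not part of the actual code):

    # Each entry pairs a set of covered items with its cost; the table above becomes:
    sets_with_costs = [
        ({1, 2}, 1), ({1}, 1), ({1, 2, 3}, 2), ({1}, 2), ({3, 4}, 2),
        ({4}, 3), ({1, 2}, 3), ({3, 4}, 4), ({1, 2, 3, 4}, 4),
    ]
    universe = set().union(*(s for s, _ in sets_with_costs))   # {1, 2, 3, 4}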

+6
python algorithm numpy scipy linear-programming
2 answers

There is a well-known greedy approximation algorithm for set cover that is also easy to implement in whatever language you choose. The algorithm itself is described here:

http://en.wikipedia.org/wiki/Set_cover_problem#Greedy_algorithm

It is so simple that the easiest thing is to write it from scratch.
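For illustration, a minimal from-scratch sketch in Python of the unweighted version (every set costs 1), assuming `universe` is a plain Python set and `subsets` a list of sets:

    def greedy_set_cover(universe, subsets):
        """Repeatedly pick the set that covers the most still-uncovered elements."""
        uncovered = set(universe)
        chosen = []                      # indices of the selected sets
        while uncovered:
            best = max(range(len(subsets)),
                       key=lambda i: len(subsets[i] & uncovered))
            if not subsets[best] & uncovered:
                raise ValueError("the given sets do not cover the universe")
            chosen.append(best)
            uncovered -= subsets[best]
        return chosen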

Notably, this is also the best polynomial-time approximation algorithm known for set cover. That means that to improve on its worst-case guarantee (i.e., to get a smaller cover), you would need non-polynomial running times (= algorithms that are slow on large inputs).

Unfortunately, the Wikipedia entry does not actually cover weighted set cover, which is what you have here. The extension is simple and is described, for example, here:

http://pages.cs.wisc.edu/~shuchi/courses/880-S07/scribe-notes/lecture03.pdf
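Concretely, the only change relative to the sketch above is the selection rule: instead of the set covering the most uncovered elements, pick the set with the lowest cost per newly covered element. A sketch, where `costs[i]` is assumed to be the cost of `subsets[i]`:

    # Inside the while-loop of the sketch above, replace the selection with:
    best = min(
        (i for i in range(len(subsets)) if subsets[i] & uncovered),
        key=lambda i: costs[i] / len(subsets[i] & uncovered),
        default=None,
    )
    if best is None:
        raise ValueError("the given sets do not cover the universe")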

Some more useful notes:

http://www.cs.ucr.edu/~neal/non_arxiv/Young08SetCover.pdf
http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec20.pdf

+7

My linear-time implementation of greedy set cover in C++ is available on GitHub:

https://github.com/martin-steinegger/setcover

Computing a cover for 40,000,000 sets with an average of 10 elements per set takes about 4 minutes on an Amazon AWS m2.2xlarge instance.

I am still working on some tricks to improve performance:

  • remove sets that are contained in a larger set, found with MinHash (see the sketch after this list)
  • remove all sets that contain only one element that is not in any other set
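A rough sketch of how the first trick could look in Python, under the assumption that a set contained in another set of no greater cost is redundant. The MinHash values are only used to generate candidate superset pairs cheaply; containment is always verified exactly before anything is dropped:

    import random

    def prune_dominated_sets(subsets, costs, num_hashes=16, seed=0):
        """Drop sets that are proper subsets of another set with no greater cost."""
        rng = random.Random(seed)
        masks = [rng.getrandbits(64) for _ in range(num_hashes)]

        # For each hash function, bucket every set by its minimum hash value.
        # A subset shares a bucket with its superset whenever the superset's
        # minimizing element also lies in the subset, so supersets show up as
        # candidates without comparing all pairs of sets.
        buckets = {}
        for i, s in enumerate(subsets):
            for k, mask in enumerate(masks):
                buckets.setdefault((k, min(hash(x) ^ mask for x in s)), []).append(i)

        keep = [True] * len(subsets)
        for members in buckets.values():
            for i in members:
                if not keep[i]:
                    continue
                for j in members:
                    if (i != j
                            and subsets[i] < subsets[j]    # proper subset, checked exactly
                            and costs[i] >= costs[j]):     # ...and not cheaper than its superset
                        keep[i] = False
                        break
        return [i for i in range(len(subsets)) if keep[i]]

This is only one possible interpretation of the trick; because containment and cost are checked exactly, at least one optimal cover always survives the pruning.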
+1
