First, build a mapping from each document identifier to its set of words -- your matrix of 0s and 1s is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (presumably minus the first item), and the "row headings" (docids) are the first item of each row (presumably minus the first row). Then (assuming Python 2.6 or better):
    def makemap(matrix):
        im = iter(matrix)
        words = next(im)[1:]   # first row: the column headings, i.e. the words
        themap = {}
        for row in im:
            mapent = set()
            docid = row[0]     # first item of each row: the docid
            for w, bit in zip(words, row[1:]):
                try:
                    if bit: mapent.add(w)
                except:
                    # debugging aid: show the offending word, then re-raise
                    print 'w is %r' % (w,)
                    raise
            themap[docid] = mapent
        return themap
Now you need to check all the feasible subsets of documents -- the total number of subsets is huge, so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping over itertools.combinations for the various lengths) will of course perform no pruning at all.
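Just for contrast, here is a minimal sketch of that brute-force approach (the name brutefeasible is mine, purely for illustration); it examines every subset of size 2 or more, i.e. 2**N - N - 1 candidates for N documents, which is exactly what we want to avoid:

    import itertools

    def brutefeasible(mapping):
        # try every subset of 2+ docids; no pruning at all, so this is
        # exponential in the number of documents
        result = {}
        docids = sorted(mapping)
        for k in range(2, len(docids) + 1):
            for ds in itertools.combinations(docids, k):
                common = set.intersection(*[mapping[d] for d in ds])
                if len(common) >= 2:
                    result[ds] = common
        return result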
I would start with all 2-combinations (all pairs of docids -- itertools.combinations is great for this, of course) and make the first batch (those pairs that have 2+ words in common) of "feasible length-2 subsets". These can go in another mapping, with tuples or frozensets of docids as keys.
Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking, of course, that the total intersection still has 2+ words). This at least does some pruning rather than blindly trying all 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two ;-).
It might be possible to do even better by recording, for each N-length subset, the docids that proved unable to extend it -- there is no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
There may be further tricks for additional pruning, but none come to mind offhand, so this is where I would start. (For readability, I am not bothering below with microoptimizations such as iteritems instead of items, frozensets instead of tuples, etc. -- they are probably marginal, given that those sequences are all O(N) compared with the exponential size of the computed structures, although of course they are worth trying in the tuning/optimization phase.)
    import itertools

    def allfeasiblesubsets(matrix):
        mapping = makemap(matrix)
        docids = sorted(mapping.keys())
        # length-2 feasible subsets: all pairs with 2+ words in common
        feasible_len2 = {}
        dont_bother = dict((d, set([d])) for d in docids)
        for d1, d2 in itertools.combinations(docids, 2):
            commonw = mapping[d1].intersection(mapping[d2])
            if len(commonw) >= 2:
                feasible_len2[d1, d2] = commonw
            else:
                dont_bother[d1].add(d2)
                dont_bother[d2].add(d1)
        all_feasible = [feasible_len2]
        while all_feasible[-1]:
            feasible_Np1 = {}
            for ds, ws in all_feasible[-1].items():
                md = max(ds)
                for d, w in mapping.items():
                    # only extend with docids > max(ds), so keys stay sorted
                    # and the same subset is never generated twice
                    if d <= md or any(d in dont_bother[d1] for d1 in ds):
                        continue
                    commonw = w.intersection(ws)
                    if len(commonw) >= 2:
                        feasible_Np1[ds+(d,)] = commonw
            all_feasible.append(feasible_Np1)
        return all_feasible[:-1]
You will notice that I have implemented only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and the others. This can help if there are several pairs of such "incompatible docids", and it is simple and reasonably unobtrusive, but it is not as powerful at pruning as the harder "full" alternative. I am also keeping all keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides for the pairs) to avoid duplicates and therefore redundant work.
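For completeness, here is one way that fuller per-subset pruning might look. This is only a sketch of the idea under my own assumptions (the extra "failed" bookkeeping and the function name are mine, and I have dropped the pairwise dont_bother check here, although the two could be combined); I have not measured whether it pays off:

    def allfeasiblesubsets_fullprune(matrix):
        # variant: alongside each feasible subset, carry the set of docids
        # that already failed to extend it (or one of its ancestors), so
        # they are never retried on subsets derived from it
        mapping = makemap(matrix)
        docids = sorted(mapping.keys())
        feasible_len2 = {}
        for d1, d2 in itertools.combinations(docids, 2):
            commonw = mapping[d1].intersection(mapping[d2])
            if len(commonw) >= 2:
                feasible_len2[d1, d2] = (commonw, set())
        all_feasible = [feasible_len2]
        while all_feasible[-1]:
            feasible_Np1 = {}
            for ds, (ws, failed) in all_feasible[-1].items():
                md = max(ds)
                for d in docids:
                    if d <= md or d in failed:
                        continue
                    commonw = mapping[d].intersection(ws)
                    if len(commonw) >= 2:
                        # pass the failures recorded so far down to the child
                        feasible_Np1[ds+(d,)] = (commonw, set(failed))
                    else:
                        failed.add(d)
            all_feasible.append(feasible_Np1)
        # strip the bookkeeping, returning the same shape as above
        return [dict((ds, ws) for ds, (ws, f) in level.items())
                for level in all_feasible[:-1]]

The pruning is sound because a docid whose intersection with a subset's common words is already below 2 can only do worse against any superset of that subset, whose common words are a subset of those.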
Here is a small example I tried it on (in the same file as these functions, after, of course, an import of itertools and collections):
    mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
            ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
            ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
            ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
            ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

    mm = makemap(mat)
    print mm
    afs = allfeasiblesubsets(mat)
    print afs
Results that look OK:
    {'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
    [{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])},
     {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]
but of course there may still be bugs lurking, since I have not tested it thoroughly. By the way, I hope it is clear that the result as supplied here (a list of dicts for the various increasing lengths, each dict with sorted tuples of docids as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
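For example, here is a minimal sketch of one such post-processing step (the nested-list layout and the name as_nested_lists are just one arbitrary choice of mine):

    def as_nested_lists(all_feasible):
        # flatten the list of dicts into [[docids...], [common words...]]
        # pairs, one per feasible subset, in order of increasing length
        result = []
        for level in all_feasible:
            for ds in sorted(level):
                result.append([list(ds), sorted(level[ds])])
        return result

    print as_nested_lists(afs)   # using the afs from the example above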
(Not that it matters, but the text I use in the example is an old Italian proverb ;-).