Best way in Python to determine all possible intersections in a matrix?

So, I have a matrix (list of lists) where each column is a unique word, each row is a separate document, and each entry is 1 or 0, indicating whether the word for that column appears in the document for that row.

What I would like is to identify all combinations of words and documents where more than one word matches more than one document. The result might look something like this:

[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']], [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']], ... 

etc., for every possible combination. I would like both a solution that produces the complete set of such combinations and one that gives only combinations that are not a subset of another - so the result would not include [[Docid_3, Docid_5], ['word1', 'word17']], since it is a complete subset of the first example.

I have a feeling there is an elegant solution that just isn't coming to mind, and the beer isn't helping.

Thanks.

+2
4 answers
  • Normalize the text. You only want strings made up of characters from string.lowercase; split on / strip out everything else.
  • Extract a set of words for each document.
  • Use something like this to get all possible groupings of all sizes:

     import itertools

     def get_all_lengths_combinations_of(elements):
         for no_of_items in range(2, len(elements) + 1):
             for items in itertools.combinations(elements, no_of_items):
                 yield items

    I am sure the real itertools wizards will come up with something better, possibly involving izip().

  • Remember that you will want to use the set.intersection() method (see the sketch after this list for how the pieces fit together):

     set.intersection(*list_of_sets_to_intersect) 
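A minimal sketch of how these pieces might fit together, assuming the per-document word sets have already been extracted (the docs dict below is made-up toy data, and the variable names are mine):

import itertools

def get_all_lengths_combinations_of(elements):
    for no_of_items in range(2, len(elements) + 1):
        for items in itertools.combinations(elements, no_of_items):
            yield items

# made-up toy data: one set of words per document (step 2 above)
docs = {'Docid_3': set(['word1', 'word17', 'word23']),
        'Docid_5': set(['word1', 'word17', 'word23', 'word9']),
        'Docid_9': set(['word2', 'word23'])}

# for every grouping of 2+ documents, intersect their word sets and
# keep only the groupings that still share 2+ words
for group in get_all_lengths_combinations_of(sorted(docs)):
    common = set.intersection(*[docs[d] for d in group])
    if len(common) >= 2:
        print list(group), sorted(common)

Note that this enumerates every grouping blindly, with no pruning; the answer below discusses how to prune that search.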
+3

First, build a mapping from document id to set of words - your matrix of 0s and 1s is a rather unwieldy structure to process directly. If I read your description right, the "column headings" (words) are the first list in the matrix (minus presumably its first item) and the "row headings" (docids) are the first items of each subsequent row. Then (assuming Python 2.6 or better):

def makemap(matrix):
    im = iter(matrix)
    words = next(im)[1:]
    themap = {}
    for row in im:
        mapent = set()
        docid = row[0]
        for w, bit in zip(words, row[1:]):
            try:
                if bit: mapent.add(w)
            except:
                print 'w is %r' % (w,)
                raise
        themap[docid] = mapent
    return themap

Now you need to examine all feasible subsets of documents - the total number of subsets is huge, so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping over itertools.combinations for the various lengths) will not perform any pruning, of course.

I would start with all 2-combinations (all pairs of docids - itertools.combinations is great for this, of course) and make the first batch (those pairs that have 2+ words in common) of "feasible length-2 subsets". Those can go in another mapping, with tuples or frozensets of docids as keys.

Then, to make feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking that the full intersection still has 2+ words, of course). This at least does some pruning, rather than blindly trying all 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two ;-).

It might be possible to do even better by recording all the docids that proved unable to extend a given N-length subset - there is no point in trying them against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.

There may be further pruning tricks, but none come to mind offhand, so that's where I'd start. (For readability, I don't bother below with micro-optimizations such as iteritems in place of items, frozensets instead of tuples, etc. - they're probably marginal, since these sequences are all O(N) compared with the exponentially-sized computed structures, though of course worth trying at the tuning/optimization stage.)

def allfeasiblesubsets(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            for d, w in mapping.items():
                if d <= md or any(d in dont_bother[d1] for d1 in ds):
                    continue
                commonw = w.intersection(ws)
                if len(commonw) >= 2:
                    feasible_Np1[ds+(d,)] = commonw
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]

You will notice I have implemented only a mild form of the "further pruning" suggested above - dont_bother only records "incompatibilities" (<2 words in common) between one docid and others - this may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but not as powerful in pruning as the trickier "complete" alternative. I am also keeping all the keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides for the pairs) to avoid duplication and hence redundant work.
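For what it's worth, here is a rough, untested sketch of what one extension step of that fuller "complete" pruning might look like (the function name and the blocked_N bookkeeping are my own, not part of the code above): each feasible subset carries along the set of docids that already failed to extend it, and none of its supersets ever re-tries them.

def extend_with_memory(mapping, feasible_N, blocked_N):
    # one (N)->(N+1) extension step; blocked_N maps each feasible tuple of
    # docids to the docids already known to be unable to extend it
    feasible_Np1, blocked_Np1 = {}, {}
    for ds, ws in feasible_N.items():
        md = max(ds)
        failed = set(blocked_N.get(ds, ()))
        children = []
        for d, w in mapping.items():
            if d <= md or d in failed:
                continue
            commonw = ws & w
            if len(commonw) >= 2:
                feasible_Np1[ds + (d,)] = commonw
                children.append(ds + (d,))
            else:
                failed.add(d)   # d cannot extend ds, hence no superset of ds either
        for child in children:
            blocked_Np1[child] = failed
    return feasible_Np1, blocked_Np1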

Here is the small example I tried (in the same file as these functions, after, of course, imports for itertools and collections):

mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs

The results, which look OK:

{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']),
 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}

[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']),
  ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']),
  ('qua', 'tre'): set(['lo', 'ci'])},
 {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]

but of course there may still be bugs lurking, since I haven't tested it thoroughly. By the way, I hope it's clear that the result as supplied here (a list of dicts for the various increasing lengths, each dict having ordered tuples of docids as keys and sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
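For example, a tiny post-processing helper (my own, untested) that flattens that structure into the [[docids], [words]] pairs sketched in the question:

def as_nested_lists(all_feasible):
    # all_feasible: list of dicts as returned by allfeasiblesubsets() --
    # one dict per subset length, docid tuples as keys, word sets as values
    result = []
    for feasible in all_feasible:
        for docids, words in feasible.items():
            result.append([list(docids), sorted(words)])
    return result

# e.g.:  print as_nested_lists(allfeasiblesubsets(mat))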

(Not that it matters, but the text I use in the example is an old Italian proverb ;-).

+3

Take a look at the SO question what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there.

For real problem sizes, say 100 documents and 10,000 words, get the good bitarray module (whose docs say, by the way, "the same algorithm in Python ... about 20 times slower than in C").

On "only combinations that are not a subset of the other": define hit22 as a 2x2 submatrix with 11 11, hit23 as a 2x3 submatrix with 111 111 (2 documents, 3 words total), etc.
This hit22 can be in many hit2n s - 2 documents, n words,
as well as in many hitn2 s - n docs, 2 words. It looks fun.
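As a rough illustration (my own sketch, assuming each document row is already stored as a bitarray over the word columns), collecting the hit22s - pairs of docs with 2+ word columns in common - is just a bitwise AND plus a popcount:

import itertools

def hit22s(docs):
    # docs: dict docid -> bitarray, one bit per word column
    for d1, d2 in itertools.combinations(sorted(docs), 2):
        common = docs[d1] & docs[d2]      # columns where both rows have a 1
        if common.count() >= 2:           # 2+ shared words -> a hit22
            yield (d1, d2), [j for j in range(len(common)) if common[j]]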

Added Monday 14 June: little functions using bitarray.
(Dunno about intro material / Python modules for actually classifying documents.)

 """ docs-words with bitarray, randombits """ # google "document classification" (tutorial | python) ... # /questions/231449/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there from __future__ import division import random import sys from bitarray import bitarray # http://pypi.python.org/pypi/bitarray __date__ = "14jun 2010 denis" ndoc = 100 nbits = 1000 exec "\n".join( sys.argv[1:] ) # run this.py ndoc= ... random.seed(1) me = __file__.split('/') [-1] print "%s ndoc=%d nbits=%d" % (me, ndoc, nbits) # bitarray stuff -- def bitslist( bits ): """ 011001 -> [1,2,5] """ return [ j for j in range(len(bits)) if bits[j] ] hex_01 = { "0": "0000", "1": "0001", "2": "0010", "3": "0011", "4": "0100", "5": "0101", "6": "0110", "7": "0111", "8": "1000", "9": "1001", "a": "1010", "b": "1011", "c": "1100", "d": "1101", "e": "1110", "f": "1111", } def to01( x, len_ ): x = "%x" % x s = "".join( hex_01[c] for c in x ) return (len_ - len(s)) * "0" + s def randombits( nbits ): """ -> bitarray 1/16 1, 15/16 0 """ hibit = 1 << (nbits - 1) r = (random.randint( 0, hibit - 1 ) & random.randint( 0, hibit - 1 ) & random.randint( 0, hibit - 1 ) & random.randint( 0, hibit - 1 )) # prob 1/16 return bitarray( to01( r, nbits )) #............................................................................... doc = [ randombits(nbits) for j in range(ndoc) ] # ndoc x nbits def mostsimilarpair(): """ -> (sim, j, k) most similar pair of docs """ mostsim = (-1,-1,-1) for j in range(ndoc): for k in range(j+1, ndoc): # allpairs[j,k] -> scipy.cluster.hier ? sim = (doc[j] & doc[k]).count() # nr bits (words) in common, crude mostsim = max( mostsim, (sim,j,k) ) return mostsim sim, jdoc, kdoc = mostsimilarpair() print "The 2 most similar docs:" , print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) , print "doc %d has %d," % ( kdoc, doc[kdoc].count() ) print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] )) print "" #............................................................................... def docslike( jdoc, thresh ): """ -> (doc index, sim >= thresh) ... """ for j in range(ndoc): if j == jdoc: continue sim = (doc[j] & doc[jdoc]).count() if sim >= thresh: yield (j, sim) thresh = sim // 2 print "Docs like %d, with >= %d words in common:" % (jdoc, thresh) for j, sim in docslike( jdoc, thresh ): print "%3d %s" % ( j, bitslist( doc[j] & doc[jdoc] )) """ The 2 most similar docs: doc 72 has 66 words, doc 84 has 60, 12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872] Docs like 72, with >= 6 words in common: 2 [3, 171, 258, 309, 592, 962] ... """ 
+1

How many documents? How many unique words? How much RAM do you have?

What output do you want produced in the following scenario: document A has words 1, 2, 3; B has words 1, 2; C has words 2, 3?

0
