Find the "largest" dense sub-matrix in a large sparse matrix

Given a large sparse matrix (say 10k + per 1M +), I need to find a subset, not necessarily continuous, of the rows and columns that form a dense matrix (all non-zero elements). I want this sub-matrix to be as large as possible (and not the largest amount, but the largest number of elements) within some aspect ratio restrictions.

Are there any known exact or approximate solutions to this problem?

A quick crawl at Google seems to give a lot of results, but not entirely accurate. What conditions should I look for?


edit: just clarify; the submatrix does not have to be continuous. In fact, the order of rows and columns is completely arbitrary, so adjacency is completely irrelevant.


Thought based on the idea of ​​Chad Oker

  • Order strings from highest to lowest (optional, but may help performance)
  • Select two lines with a large overlap
  • Add all other lines that will not reduce overlap.
  • Burn this set
  • Add any row that reduces the smallest overlap.
  • Repeat at # 3 until the result reaches a small
  • Start C # 2 with a different starting pair
  • Continue until you decide that the result is good enough.
+7
algorithm sparse-matrix discrete-mathematics
source share
4 answers

I assume you want something like this. You have a matrix like

1100101 1110101 0100101 

You need columns 1,2,5,7 and rows 1 and 2, right? This sub-matrix will be 4x2 with 8 elements. Or you could go with columns 1,5,7 with rows 1,2,3, which would be a 3x3 matrix.

If you need an approximate method, you can start with one non-zero element, and then continue searching for another non-zero element and add it to the list of rows and columns. At some point, you will come across a non-zero element that, if rows and columns were added to your collection, your collection would no longer be completely non-zero.

So, for the above matrix, if you added 1,1 and 2,2, you will have rows 1,2 and columns 1,2 in your collection. If you tried adding 3.7, this will cause a problem because the value of 1.3 is zero. Therefore, you could not add it. You could add 2.5 and 2.7. Creating a 4x2 sub-matrix.

Basically, you will iterate until you can find new new rows and columns to add. This would give you a local minimum. You can save the result and start again from a different starting point (perhaps one that did not fit into your current solution).

Then just stop when after a while you cannot find more.

This will obviously take a lot of time, but I don’t know if you can do it faster.

+2
source share

EDIT. This is NOT the same as the problem below .. My bad ...

But based on the last comment below, it could be tantamount to the following:

  • Find the most distant vertical pair of zero points that do not have a zero point between them.
  • Find the farthest horizontally divided pair of zero points that have no zeros between them?
  • Then the horizontal region you are looking for is a rectangle that is between these two pairs of points?

    This exact problem is discussed in the pearl of John Bentley's Programming Pearls book and, as I recall, although there is a solution in one dimension, there is no simple answer for 2nd or higher three-dimensional options ...

Problem 1 = D is to find the largest sum of a continuous subset of a set of numbers:

iterating over the elements, tracking the current amount from a certain previous element and the maximum possible subtotal (and the start and end elements that generated it) ... In each element, if the subtotal is greater than the maximum amount visible so far, maximum visibility so far and endelemnt are reset ... If the maximum total amount drops below zero, the initial element is reset to the current element, and the total amount is reset to zero ...

A two-dimensional problem arose as a result of an attempt to create an algorithm for processing a visual image, which tried to find in the stream of brightness values ​​representing pixels in a 2-color image, to find the “brightest” rectangular area inside the image. those. find the contained 2-D submatrix with the highest sum of brightness values, where "Brightness" was measured by the difference between the pixel brightness value and the total average brightness of the entire image (so many elements had negative values)

EDIT: to search for a 1-D solution, I dug up my copy of the second edition of this book, and in this John Bentley says: "The two-dimensional version remains unresolved, since this edition is being printed ...", which was in 1999.

+1
source share

Is this a Netflix issue ?

MATLAB or some other sparse matrix libraries may have ways to handle this.

Is your intention to write your own?

Perhaps the 1D approach for each row will help you. The algorithm may look like this:

  • Loop over each line
  • Find the index of the first nonzero element
  • Find the index of the non-zero element of the row with the largest spacing between the non-zero columns in each row and save both.
  • Sort rows from largest to smallest range between non-zero columns.

At this point, I'm starting to get fuzzy (sorry, not the algorithm developer). I would try to iterate over each row, building up the starting point indices, looking for the maximum nonzero mileage of the column indices that I could do.

You do not indicate whether the dense matrix should be square. I guess not.

I don't know how effective it is or what Big-O's behavior is. But this is a brute force method to start with.

+1
source share

I know that you are no longer working on this, but I thought that someone might have the same question as I do in the future.

So, realizing that this is an NP-hard problem (by shortening it to MAX-CLIQUE), I decided to come up with a heuristic that so far has worked well for me:

Given the N x M binary / Boolean matrix, find a large dense sub-matrix:

Part I : creating reasonable candidate sub-matrices

  • We consider each of the strings N as an M -dimensional binary vector v_i , where i = 1 to N
  • Calculate the distance matrix for vectors N using the Hamming distance
  • Use the UPGMA algorithm (unweighted pair method with arithmetic mean) for cluster vectors

Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order of combining vectors into submatrices. Thus, each internal node in the hierarchical clustering tree is a candidate submatrix.

Part II : Sub-Category Sub-Categories and Candidate Rank

  • For each submatrix, calculate D the number of elements in a dense subset of vectors for the submatrix, excluding any column with one or more zeros.
  • Choose a submatrix that maximizes D

I also had some considerations regarding the minimum number of rows that needed to be kept from the original full matrix, and I would discard any candidate sub-matrices that did not meet these criteria before choosing a sub-matrix with max D value.

+1
source share

All Articles