How to sort a bunch of N x M binary matrices so that the most similar ones are neighbors in a doubly linked list?

I have a set of 2-D binary matrices, and I need to store a lot of them in some data structure efficiently, so that those most similar to each other end up as "neighbors" in the structure. I do not think a plain map will be effective, since I have about 40,000 matrices that need to be searched efficiently.

My formula for the distance between two matrices

 getSimilarity(matrix toCompare)
     // initialize variable "sum" to 0
     // for each rowT in this and each rowC in toCompare
     //     sum += max(rowT, rowC) - bitwiseAnd(rowT, rowC)
     // return sum
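The pseudocode above could be sketched like this in Python, assuming each row is stored as an unsigned integer bitmask (one bit per column); the names are illustrative, not from the question:

```python
def get_similarity(a, b):
    """Distance between two matrices given as lists of row bitmasks."""
    total = 0
    for row_t, row_c in zip(a, b):
        # max(rowT, rowC) - bitwiseAnd(rowT, rowC), per the pseudocode
        total += max(row_t, row_c) - (row_t & row_c)
    return total

m1 = [0b1010, 0b0110]
m2 = [0b1000, 0b0111]
print(get_similarity(m1, m2))  # → 3
```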

You don't even need to provide me with a data structure; all I need is a way to compare two matrices that ends up placing similar matrices as close as possible to each other.

EDIT: 12/19/12 1:52 PM My rows represent user attributes, and my columns represent page attributes. Each matrix is a representation of which user attributes occur together with which page attributes (for example, the user's age is less than 42, and they visited page 4).

+4
3 answers

I noticed that the similarity operator you have defined on your matrices gives you a metric space. That is:

  • D(M1, M2) = 0 if and only if M1 = M2,
  • D(M1, M2) ≥ 0 for any M1, M2,
  • D(M1, M2) = D(M2, M1), and
  • D(M1, M3) ≤ D(M1, M2) + D(M2, M3) (the triangle inequality).
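As a quick sanity check, the four properties above can be verified numerically on random small matrices (an illustrative sketch, assuming rows are stored as small integer bitmasks):

```python
import itertools
import random

def dist(a, b):
    # the question's distance, rows as integer bitmasks
    return sum(max(x, y) - (x & y) for x, y in zip(a, b))

random.seed(0)
mats = [tuple(random.randrange(8) for _ in range(3)) for _ in range(15)]

for m1, m2 in itertools.product(mats, repeat=2):
    assert dist(m1, m2) >= 0                    # non-negativity
    assert dist(m1, m2) == dist(m2, m1)         # symmetry
    assert (dist(m1, m2) == 0) == (m1 == m2)    # identity of indiscernibles
for m1, m2, m3 in itertools.product(mats, repeat=3):
    assert dist(m1, m3) <= dist(m1, m2) + dist(m2, m3)  # triangle inequality
print("all four axioms hold on this sample")
```

This is only a spot check on a sample, of course, not a proof.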

As a result, one way you could store all your data is in a metric tree, a family of data structures for storing objects in a metric space that makes it easy to find all elements "close" to some query element.

Your data has the added benefit that it forms a discrete metric space, meaning the distance function you provided always returns an integer. That is, you will never have two matrices at distance 1.5 from each other, nor two matrices at distance π.

Therefore, you may want to store your matrices in a BK-tree. The BK-tree is most often used to store strings, but in general it can store elements of any discrete metric space. It lets you search for the nearest neighbors of an individual matrix quite efficiently (usually without having to look at every matrix in your collection), although admittedly it will not thread a doubly linked list through all of your elements.

Intuitively, a BK-tree is built as follows. Pick any matrix as the root node. Then compare every other matrix in the collection against the root matrix and distribute them among subtrees based on their distance from the root. Then recursively build each of those subtrees the same way. Thanks to the triangle inequality, you can search a BK-tree for neighboring matrices with a simple recursive algorithm.
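The construction described above could be sketched as follows (an illustrative Python sketch, not a tuned implementation; matrices are tuples of row bitmasks, `dist` is the question's distance, and a non-empty collection is assumed):

```python
def dist(a, b):
    # the question's matrix distance, rows as integer bitmasks
    return sum(max(x, y) - (x & y) for x, y in zip(a, b))

class BKTree:
    def __init__(self, items):
        it = iter(items)
        self.root = (next(it), {})  # (item, children keyed by distance to parent)
        for item in it:
            self.add(item)

    def add(self, item):
        node, children = self.root
        while True:
            d = dist(item, node)
            if d == 0:
                return              # duplicate, nothing to store
            if d not in children:
                children[d] = (item, {})
                return
            node, children = children[d]

    def within(self, target, radius):
        """All stored items within `radius` of `target`."""
        found, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = dist(target, node)
            if d <= radius:
                found.append(node)
            # the triangle inequality prunes subtrees outside [d-radius, d+radius]
            for key, child in children.items():
                if d - radius <= key <= d + radius:
                    stack.append(child)
        return found
```

For example, `BKTree([(1, 2), (1, 3), (7, 7), (0, 0)]).within((1, 2), 1)` visits only the subtrees whose distance keys fall in the pruning window.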

Hope this helps!

+4

I do not quite understand your similarity function. Are you comparing rows to rows? Also, intuitively a higher bitwiseAnd value should mean higher similarity, yet it appears with a minus sign.

Locality-sensitive hashing (LSH) is often used to solve problems like yours. For example, you could imagine your matrices as black-and-white images and you want to quickly find similar images. The hash functions are designed so that similar images get similar hashes. You build a hash table over your database, look up neighboring elements in the hashed space as candidates, and then run the more expensive full similarity check only on those candidates.
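A toy sketch of this idea, using bit sampling (a standard LSH family for Hamming distance); the matrix layout, names, and parameters here are all illustrative assumptions:

```python
import random
from collections import defaultdict

def to_bits(matrix, width):
    """Flatten a tuple of row bitmasks into a list of 0/1 bits."""
    return [(row >> c) & 1 for row in matrix for c in range(width)]

def make_hash(n_bits, k, rng):
    """Hash = the values at k randomly chosen bit positions."""
    positions = rng.sample(range(n_bits), k)
    return lambda bits: tuple(bits[p] for p in positions)

rng = random.Random(42)
width = 4  # columns per row
mats = [(0b1010, 0b0110), (0b1010, 0b0111), (0b0001, 0b1000)]
h = make_hash(2 * width, k=3, rng=rng)

buckets = defaultdict(list)
for m in mats:
    buckets[h(to_bits(m, width))].append(m)
# similar matrices tend to share a bucket; run the full distance
# check only against candidates from the same bucket
```

Matrices that differ in few bits agree on the sampled positions with high probability, so they tend to collide into the same bucket.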

There are even more advanced techniques called amplification, where you combine several different LSH functions and, for example, require that an element be close in at least two of them before doing the full comparison. Chapter 3 of Mining of Massive Datasets gives a detailed treatment of this kind of problem.

+1

You might want to take a look at Voronoi diagrams as a method of handling nearest-neighbor situations in two or more dimensions.

Is your similarity measure only a scalar (one-dimensional) distance? Always positive? Or would a two-dimensional vector distance be appropriate?

Bitwise AND is not very useful for measuring differences; bitwise XOR makes more sense. If all bits are equally important, you can count the 1-bits in the XOR, which gives the Hamming distance between two unsigned integers.

Difference Counting Function for Boolean Matrices:

 int getSimilarity(matrix other) {
     int sum = 0;
     for (int col = 0; col < M; col++) {
         for (int row = 0; row < N; row++) {
             sum += (this[row, col] != other[row, col]) ? 1 : 0;
         }
     }
     return sum;
 }

This distance function can be adjusted by multiplying the distances between rows / columns by weights.
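The XOR idea with optional per-row weights might look like this (a sketch, assuming rows are stored as integer bitmasks; the names and the weighting scheme are illustrative):

```python
def hamming_rows(a, b, weights=None):
    """Weighted Hamming distance between two matrices of row bitmasks."""
    total = 0
    for i, (x, y) in enumerate(zip(a, b)):
        bits = bin(x ^ y).count("1")  # popcount of the differing bits
        total += bits * (weights[i] if weights else 1)
    return total

print(hamming_rows([0b1010, 0b0110], [0b1000, 0b0111]))  # one bit differs per row → 2
```

With `weights=[2, 3]` the same pair scores 2*1 + 3*1 = 5, letting some rows matter more than others.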

0
