Clustering a sparse binary vector dataset

I have a sparse data set in which each data point is described by a vector of 1000 elements. Each element is either 0 or 1 (mostly 0s, with a few 1s). Do you know of a distance function that could help me cluster these points? Is something like Euclidean distance suitable in this case? I am looking for a simple, convenient distance metric for this situation that I can try on my data.

Thanks.

+5
4 answers

Take a look at the distance functions used for sparse text vectors, such as cosine distance, and those used for comparing sets, such as Jaccard distance.
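
As a rough illustration of both distances, here is a minimal sketch that stores each sparse binary vector as the set of indices where it is 1. The function names and the sample vectors are made up for the example, and the handling of all-zero vectors is just one possible convention.

```python
import math

def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|, where a and b hold the indices of the 1 entries."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention (assumption): two all-zero vectors are identical
    return 1.0 - len(a & b) / union

def cosine_distance(a: set, b: set) -> float:
    """1 - cosine similarity; for 0/1 vectors the dot product is |a & b|."""
    if not a or not b:
        # Cosine is undefined for an all-zero vector; this fallback is an assumption.
        return 0.0 if a == b else 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

# Two hypothetical 1000-dimensional vectors, given by the positions of their 1s.
x = {3, 17, 256, 900}
y = {3, 17, 512}

print(jaccard_distance(x, y))
print(cosine_distance(x, y))
```

Both functions only look at the positions of the 1s, so the cost depends on how many 1s a vector has rather than on its full length of 1000.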

+3

There is no single answer to your question. Best practices depend on the domain.

Once you decide on a similarity metric, clustering is usually done by averaging or by finding medoids. See these papers on clustering binary data for sample algorithms:

  • Carlos Ordonez. Clustering Binary Data Streams with K-means. (PDF)
  • Tao Li. A General Model for Clustering Binary Data. (PDF)
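
To make the medoid variant concrete, here is a minimal sketch of an alternating k-medoids loop using Jaccard distance. It is not the algorithm from either paper above. The function names, the naive choice of the first k points as initial medoids, and the toy data are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets holding the indices of the 1 entries."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union

def k_medoids(points, k, dist, max_iter=100):
    """Alternate between assigning points to medoids and re-picking medoids."""
    medoids = list(range(k))  # naive init (assumption): the first k points
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: the new medoid is the member with the smallest
        # total distance to all members of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return labels, medoids

# Toy data: two interleaved groups of sparse vectors (sets of 1-indices).
points = [
    {1, 2, 3}, {100, 101, 102},
    {1, 2, 3, 4}, {100, 101},
    {2, 3, 4}, {101, 102, 103},
]
labels, medoids = k_medoids(points, k=2, dist=jaccard_distance)
print(labels, medoids)
```

A medoid is always one of the original data points, so it stays a valid 0/1 vector, whereas an average of binary vectors generally does not.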

  • Toit, du S.H.C.; Steyn, A.G.W.; Stumpf, R.H.; ; 3, . 77, 1986; Springer-Verlag.

+10

If there really are many 0s and only a few 1s, you can try clustering on the position of the first or last 1; see http://aggregate.org/MAGIC/#Least Significant 1 bit
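
One way to read this suggestion, sketched under the assumption that each vector is packed into a Python integer (bit i set when element i is 1): compute the position of the lowest or highest 1 bit and use it as a grouping key. The helper names and sample values below are hypothetical.

```python
from collections import defaultdict

def lowest_one(x: int) -> int:
    """Index of the least significant 1 bit, or -1 if x has no 1 bits."""
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

def highest_one(x: int) -> int:
    """Index of the most significant 1 bit, or -1 if x has no 1 bits."""
    return x.bit_length() - 1

def group_by(vectors, key):
    """Bucket the vectors by the value of key(vector)."""
    groups = defaultdict(list)
    for v in vectors:
        groups[key(v)].append(v)
    return dict(groups)

# Hypothetical small vectors; Python integers also handle 1000-bit values.
vectors = [0b1000, 0b1100, 0b0011, 0b0001, 0b101000]
print(group_by(vectors, lowest_one))
print(group_by(vectors, highest_one))
```

This gives only a coarse partition, since two vectors land in the same bucket whenever a single bit position agrees, so it is probably best treated as a cheap first pass.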

0

Many distance and similarity functions have been proposed for binary vectors.

In "A Survey of Binary Similarity and Distance Measures" (Choi, Cha, Tappert, 2010), the authors list 76 such functions.
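
Measures of this kind are commonly written in terms of four counts for a pair of vectors: a (both 1), b (1 only in the first), c (1 only in the second), and d (both 0). The sketch below computes three well-known ones. The function names and sample vectors are assumptions for illustration, and returning 1.0 when both vectors are all zero is just one convention.

```python
def binary_counts(x: set, y: set, n: int):
    """Return (a, b, c, d) for two length-n 0/1 vectors given as sets of 1-indices."""
    a = len(x & y)     # positions where both are 1
    b = len(x - y)     # 1 in x only
    c = len(y - x)     # 1 in y only
    d = n - a - b - c  # positions where both are 0
    return a, b, c, d

def jaccard(a, b, c, d):
    """a / (a + b + c); ignores shared zeros."""
    return a / (a + b + c) if (a + b + c) else 1.0

def dice(a, b, c, d):
    """2a / (2a + b + c); ignores shared zeros, weights matches double."""
    return 2 * a / (2 * a + b + c) if (a + b + c) else 1.0

def simple_matching(a, b, c, d):
    """(a + d) / (a + b + c + d); counts shared zeros as agreement."""
    return (a + d) / (a + b + c + d)

# Hypothetical pair of 1000-dimensional vectors.
counts = binary_counts({3, 17, 256, 900}, {3, 17, 512}, n=1000)
print(counts)
print(jaccard(*counts), dice(*counts), simple_matching(*counts))
```

In this example d is 995 out of 1000, so the simple matching score is dominated by shared zeros, while Jaccard and Dice depend only on the positions of the 1s. That difference is worth keeping in mind when choosing a measure for sparse data.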

0