Large Dataset Clustering

I am trying to cluster a large (gigabytes) dataset. For clustering you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)

Is there a way to cluster subsets of the big data first, and then maybe do some merging of similar clusters?

I don't know if this helps any, but the data are fixed-length binary strings, so I am computing their distances using the Hamming distance (distance = the number of 1 bits in string1 XOR string2).
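A minimal sketch of that distance in Python, assuming each fixed-length string is held as an integer (the function name and sample values are illustrative):

```python
def hamming(a: int, b: int) -> int:
    # Hamming distance = number of 1 bits in the XOR of the two bit strings
    return bin(a ^ b).count("1")

# two 8-bit strings that differ in exactly two positions
assert hamming(0b10110010, 0b10010011) == 2
```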

+5
3 answers

A simplified version of the nice method in Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search, say for pairs with Hamming distance 1:

  • sort all the bit strings on their first 32 bits
  • look at blocks of strings where the first 32 bits are identical; these blocks will be relatively small
  • pdist each of these blocks, checking Hammingdist(left 32) = 0 + Hammingdist(rest) <= 1.

This misses the fraction, e.g. 32/128, of nearby pairs that have Hammingdist(left 32) = 1 + Hammingdist(rest) = 0. If you really want those, repeat the above with "first 32" → "last 32".
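A minimal Python sketch of this blocking scheme, assuming 128-bit strings stored as ints (the names, and the hash-bucket stand-in for the sort, are my own):

```python
from collections import defaultdict
from itertools import combinations

def close_pairs(strings, key):
    """Yield pairs at Hamming distance <= 1 that agree exactly on key(s)."""
    blocks = defaultdict(list)
    for s in strings:
        blocks[key(s)].append(s)              # bucketing == sorting + cutting into blocks
    for block in blocks.values():             # blocks are relatively small
        for a, b in combinations(block, 2):   # brute-force "pdist" inside a block
            if bin(a ^ b).count("1") <= 1:
                yield a, b

first32 = lambda s: s >> 96                   # mismatch must lie in the other 96 bits
last32  = lambda s: s & 0xFFFFFFFF            # second pass catches the rest
```

Running it with first32 and then last32 covers every pair at distance <= 1: the one differing bit lies either outside the first 32 bits or outside the last 32, so at least one key puts the pair in the same block (pairs found by both passes just need deduplicating).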

The method extends further. Take e.g. Hammingdist <= 2 on 4 32-bit words: the mismatches must split across the words like one of 2000 0200 0020 0002 1100 1010 1001 0110 0101 0011, so at least 2 of the 4 words must match exactly, and you can sort (block) on those.
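In the same sketch style (the word extraction and key layout are my own illustration): bucketing once per pair of word positions, C(4,2) = 6 keys in all, co-locates every pair at Hamming distance <= 2, by the pigeonhole argument above:

```python
from itertools import combinations

WORDS, MASK32 = 4, 0xFFFFFFFF

def word(s, i):
    # i-th 32-bit word of a 128-bit string, counting from the most significant
    return (s >> (32 * (WORDS - 1 - i))) & MASK32

# one bucketing key per pair of word positions; a pair at Hamming distance <= 2
# leaves at least two words untouched, hence agrees exactly on one of these keys
KEYS = [lambda s, i=i, j=j: (word(s, i), word(s, j))
        for i, j in combinations(range(WORDS), 2)]
```

Each key is then used exactly like first32/last32 above, with the threshold in the inner check raised to 2 and duplicate pairs removed.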

(Btw, sketchsort-0.0.7.tar is 99% src/boost/, build/, and .svn/.)

+1

[This answer did not survive the source's translation; only punctuation and a reference to comparing the (N-1)-th and N-th elements remain.]

0

The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree, where the dataset is streamed from disk on each iteration.

Additionally, these algorithms can work directly with bit strings and the Hamming distance, which matches this kind of data. See the LMW-tree project page for details.

0
