Large Dataset Clustering

I am trying to cluster a large (gigabytes) dataset. For clustering you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)

Is there a way to cluster subsets of the big data first, and then maybe do some merging of similar clusters?

I don't know if this helps any, but the data are fixed-length binary strings, so I am computing their distances using the Hamming distance (distance = the number of 1 bits in string1 XOR string2).
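A minimal sketch of that distance in Python, assuming each fixed-length string is held as an integer (the function name and sample values are illustrative):

```python
def hamming(a: int, b: int) -> int:
    # Hamming distance = number of 1 bits in the XOR of the two bit strings
    return bin(a ^ b).count("1")

# two 8-bit strings that differ in exactly two positions
assert hamming(0b10110010, 0b10010011) == 2
```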

+5
3 answers

A simplified version of the nice method in Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search, say for pairs with Hamming distance 1:

  • sort all the bit strings on their first 32 bits
  • look at blocks of strings where the first 32 bits are identical; these blocks will be relatively small
  • pdist each of these blocks, checking Hammingdist(left 32) = 0 + Hammingdist(rest) <= 1.

This misses the fraction, e.g. 32/128, of nearby pairs that have Hammingdist(left 32) = 1 + Hammingdist(rest) = 0. If you really want those, repeat the above with "first 32" → "last 32".
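A minimal Python sketch of this blocking scheme, assuming 128-bit strings stored as ints (the names, and the hash-bucket stand-in for the sort, are my own):

```python
from collections import defaultdict
from itertools import combinations

def close_pairs(strings, key):
    """Yield pairs at Hamming distance <= 1 that agree exactly on key(s)."""
    blocks = defaultdict(list)
    for s in strings:
        blocks[key(s)].append(s)              # bucketing == sorting + cutting into blocks
    for block in blocks.values():             # blocks are relatively small
        for a, b in combinations(block, 2):   # brute-force "pdist" inside a block
            if bin(a ^ b).count("1") <= 1:
                yield a, b

first32 = lambda s: s >> 96                   # mismatch must lie in the other 96 bits
last32  = lambda s: s & 0xFFFFFFFF            # second pass catches the rest
```

Running it with first32 and then last32 covers every pair at distance <= 1: the one differing bit lies either outside the first 32 bits or outside the last 32, so at least one key puts the pair in the same block (pairs found by both passes just need deduplicating).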

The method extends further. Take e.g. Hammingdist <= 2 on 4 32-bit words: the mismatches must split across the words like one of 2000 0200 0020 0002 1100 1010 1001 0110 0101 0011, so at least 2 of the 4 words must match exactly, and you can sort (block) on those.
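In the same sketch style (the word extraction and key layout are my own illustration): bucketing once per pair of word positions, C(4,2) = 6 keys in all, co-locates every pair at Hamming distance <= 2, by the pigeonhole argument above:

```python
from itertools import combinations

WORDS, MASK32 = 4, 0xFFFFFFFF

def word(s, i):
    # i-th 32-bit word of a 128-bit string, counting from the most significant
    return (s >> (32 * (WORDS - 1 - i))) & MASK32

# one bucketing key per pair of word positions; a pair at Hamming distance <= 2
# leaves at least two words untouched, hence agrees exactly on one of these keys
KEYS = [lambda s, i=i, j=j: (word(s, i), word(s, j))
        for i, j in combinations(range(WORDS), 2)]
```

Each key is then used exactly like first32/last32 above, with the threshold in the inner check raised to 2 and duplicate pairs removed.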

(Btw, sketchsort-0.0.7.tar is 99% src/boost/, build/, and .svn/.)

+1

[This answer did not survive the source's translation; only punctuation and a reference to comparing the (N-1)-th and N-th elements remain.]

0

The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree, where the dataset is streamed from disk on each iteration.

Additionally, these algorithms can work directly with bit strings and the Hamming distance, which matches this kind of data. See the LMW-tree project page for details.

0
