dist() function in R: vector size limitation

Question

I am trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a large table with 500 thousand rows and 41 columns (the first column is the name), and when I tried

d<-dist(as.matrix(file),method="euclidean")

I got this error

Error: cannot allocate vector of size 1101.1 Gb

How can I get around this limitation? I searched for it and came across the ff package in R, but I don't quite understand whether it can solve my problem.

Thanks!

+4

r cluster-analysis

olala Oct 17 '13 at 20:06


3 answers

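R is not well suited to a problem of this size.

Hierarchical clustering in R is matrix-based, so it needs O(n^2) memory for the pairwise distances.

So yes, 1101.1 Gb is about 1 TB. You almost certainly do not have that much RAM, and even if you did, computing that many distances would take a very long time.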
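As a rough check of that figure, the arithmetic can be sketched in R. The sketch assumes dist() stores one 8-byte double per pair of rows, so n rows need n(n-1)/2 values; `dist_size_gib` is a hypothetical helper name, not a function from any package.

```r
# Approximate size of the object returned by dist() for n rows.
# Assumes one 8-byte double per pair of rows: n * (n - 1) / 2 values.
dist_size_gib <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value / 1024^3
}

genes_as_rows   <- dist_size_gib(500000)  # distances between the 500,000 rows
samples_as_rows <- dist_size_gib(40)      # distances between the 40 samples

print(genes_as_rows)
print(samples_as_rows)
```

With exactly 500,000 rows this comes to about 931 GiB. The 1101.1 Gb in the error message would correspond to roughly 544,000 rows, which fits "500 thousand" being a rounded count. For 40 rows the result is only a few kilobytes.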

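ELKI may scale better here, because it supports index structures. (With a suitable index, a neighbor query can cost about O(log n), so some algorithms run in roughly O(n log n) overall.)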

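Also consider whether hierarchical clustering is the right tool at all. K-means, for example, scales much better, whereas standard hierarchical clustering needs at least O(n^2).

In short: this is not only an R limitation. Whatever tool you use, a full distance matrix over this many rows is impractical.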
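A minimal sketch of the k-means alternative, run on synthetic data at a reduced size. The row count, the number of centers, and the optional follow-up step of clustering the centroids are illustrative assumptions, not something the answer prescribes.

```r
set.seed(123)

# Synthetic stand-in: 5,000 rows (genes) by 40 columns (samples).
x <- matrix(rnorm(5000 * 40), ncol = 40)

# k-means works on the rows directly and never builds an n x n distance matrix.
km <- kmeans(x, centers = 10, iter.max = 50)

# Optional: hierarchical clustering of the 10 centroids, which is cheap.
hc <- hclust(dist(km$centers))

# Number of rows assigned to each cluster
print(table(km$cluster))
```

Memory use here grows with the number of rows times the number of columns, not with the square of the number of rows.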

+3

Anony-Mousse 18 . '13 7:44

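I ran into the same limit on a large dataset.
There are two things you can try.
First, the parallelDist package provides parDist(), a multi-threaded replacement for dist(). It is faster, but it still needs enough RAM to hold the result, which is the same size as the output of dist().
Second, instead of hclust() from stats, use the one from fastcluster. fastcluster is faster and is called the same way as hclust().
Neither of these removes the quadratic memory requirement, though.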
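A hedged sketch of how the two packages slot into the usual workflow, on toy data. It assumes parallelDist and fastcluster are installed and falls back to the base functions when they are not; the matrix size and the "complete" linkage are arbitrary choices for illustration.

```r
set.seed(1)

# Toy data: 40 samples (rows) by 200 features (columns).
x <- matrix(rnorm(40 * 200), nrow = 40)

# parDist() computes the same kind of "dist" object as dist(), using several threads.
d <- if (requireNamespace("parallelDist", quietly = TRUE)) {
  parallelDist::parDist(x, method = "euclidean")
} else {
  dist(x, method = "euclidean")
}

# fastcluster::hclust() takes the same arguments as stats::hclust().
hc <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust(d, method = "complete")
} else {
  stats::hclust(d, method = "complete")
}

plot(hc)
```

Both calls still operate on a full distance object, so they help with run time rather than with the memory limit in the question.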

0

Yoann pageaud Jan 15 '19 at 9:50


zero323 · Accepted Answer · 2013-10-17T20:28:10+0000

In general, hierarchical clustering is not the best approach for working with very large data sets.

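In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong. Observations should be represented as rows, and gene expression (or whatever kind of data you have) as columns.

Let's assume you have data like this: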

data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))

What you want to do is:

 # Create transposed data matrix
 data.matrix.t <- t(as.matrix(data))

 # Create distance matrix
 dists <- dist(data.matrix.t)

 # Clustering
 hcl <- hclust(dists)

 # Plot
 plot(hcl)

Please keep in mind that visualizing high-dimensional data can be misleading.
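The table in the question also has a name column in front of the 40 sample columns, and as.matrix() on a data frame with a character column returns a character matrix. A minimal sketch of handling that before transposing, using a small synthetic table; the column name `name`, the gene labels, and the row count are hypothetical.

```r
set.seed(42)

# Toy stand-in for the question's table: a name column plus 40 sample columns.
n_genes <- 1000
tab <- data.frame(name = paste0("gene", seq_len(n_genes)),
                  matrix(rnorm(n_genes * 40), ncol = 40))

# Drop the name column before converting, and keep the names as row names.
expr <- as.matrix(tab[, -1])
rownames(expr) <- tab$name
stopifnot(is.numeric(expr))

# Transpose so that the 40 samples are rows, then cluster them.
d <- dist(t(expr), method = "euclidean")
hcl <- hclust(d)
plot(hcl)
```

Here dist() only has to store the distances between 40 samples, regardless of how many gene rows the table has.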
