dist() function in R: vector size limitation

I am trying to build a hierarchical clustering of some samples (40 of them) over some features (genes). I have a large table with about 500 thousand rows and 41 columns (the first column is the name), and when I tried

d <- dist(as.matrix(file), method = "euclidean")

I got this error:

Error: cannot allocate vector of size 1101.1 Gb

How can I get around this limitation? I searched for it and came across the ff package in R, but I don't quite understand whether it can solve my problem.

Thanks!

+4
3 answers

In general, hierarchical clustering is not the best approach for working with very large data sets.

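In your case, however, there is a different problem. dist() computes distances between the rows of its input, so if you want to cluster the samples, the samples have to be the rows and the genes (or whatever you measured) the columns.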

Suppose your data looks like this:

data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))

Then what you want to do is:

 # Create transposed data matrix
 data.matrix.t <- t(as.matrix(data))

 # Create distance matrix
 dists <- dist(data.matrix.t)

 # Clustering
 hcl <- hclust(dists)

 # Plot
 plot(hcl)

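With the samples as rows, dist() only has to compute 40 * 39 / 2 = 780 pairwise distances, which is no problem at all.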
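The question mentions that the first column holds the names, which would turn as.matrix() into a character matrix. As a minimal sketch, assuming a table shaped like the one described (the data frame `tab`, its `name` column, and the reduced row count are made up for illustration), the name column can be moved to the row names before transposing:

```r
# Small stand-in for the real table: 1000 "genes" instead of ~500 thousand.
set.seed(42)
n_genes <- 1000
tab <- data.frame(name = paste0("gene", seq_len(n_genes)),
                  matrix(rnorm(n_genes * 40), ncol = 40))

# Keep only the numeric columns and store the names as row names.
expr <- as.matrix(tab[, -1])
rownames(expr) <- tab$name

# dist() works on rows, so transpose to get one row per sample.
d_samples <- dist(t(expr))
attr(d_samples, "Size")   # number of objects being compared (the samples)

hcl <- hclust(d_samples)
plot(hcl)
```

The "Size" attribute of the dist object is a quick way to check that you are clustering the 40 samples and not the genes.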

+4

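You are running into a memory limit in R.

R is matrix-oriented, and dist() stores every pairwise distance, which takes O(n^2) memory. That works for small data sets, but not for hundreds of thousands of rows.

In your case that is the 1101.1 Gb from the error message, roughly 1 TB. Unless you have that much RAM, the distance matrix simply cannot be built this way.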
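To see where a figure of that size comes from, here is a rough back-of-the-envelope sketch. It assumes that dist() stores the lower triangle as n * (n - 1) / 2 doubles of 8 bytes each, and that R's "Gb" means 2^30 bytes; the helper names `dist_bytes` and `gib` are made up for this example.

```r
# Approximate size in bytes of the vector returned by dist() for n rows.
dist_bytes <- function(n) {
  n * (n - 1) / 2 * 8
}

# Convert bytes to units of 2^30 bytes.
gib <- function(bytes) bytes / 2^30

gib(dist_bytes(500000))   # roughly 931 for exactly 500 thousand rows
dist_bytes(40)            # 6240 bytes for the 40 samples
```

The 1101.1 Gb in the error message suggests the real table has somewhat more than 500 thousand rows. Either way, the cost grows quadratically with the number of rows, so halving the rows cuts the memory by about four.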

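Other tools, such as ELKI, can use index structures to accelerate many algorithms. With a suitable index (where the algorithm and the data allow it), some of them avoid the full distance matrix (about O(n log n) overall, with O(log n) per query).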

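Also consider whether hierarchical clustering is what you need here. K-means, for example, scales much better, because it never builds the (pairwise) O(n^2) distance matrix.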
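As an illustration of that last point, here is a small sketch using kmeans() from base R on simulated data. It only matters if you want to cluster the rows (genes); the three shifted groups and the choice of 3 centers are assumptions made for the example, not something taken from the question.

```r
set.seed(1)

# Simulated stand-in: 6000 rows in three well-separated groups, 40 columns.
x <- rbind(matrix(rnorm(2000 * 40, mean = 0),  ncol = 40),
           matrix(rnorm(2000 * 40, mean = 3),  ncol = 40),
           matrix(rnorm(2000 * 40, mean = -3), ncol = 40))

# kmeans() only compares rows with the cluster centers,
# so no n x n distance matrix is ever allocated.
km <- kmeans(x, centers = 3, nstart = 5)

table(km$cluster)   # cluster sizes
dim(km$centers)     # one center per cluster, 40 coordinates each
```

Memory use here grows linearly with the number of rows, so the same call is feasible on a table with hundreds of thousands of rows, although you have to pick the number of clusters yourself.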

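In short: for data of this size, R's dist() and hclust() are not the best choice. Either reduce the problem, or use tools and algorithms designed for large data sets.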

+3

I ran into a related problem, with a smaller table (about 100 thousand rows and 16 columns).
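For the distance matrix, the parallelDist package provides parDist(), which computes the distances in parallel. It is fast, but the result still has to fit in RAM, just as with dist().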
Instead of hclust(), I used fastcluster. fastcluster provides a faster implementation that can be used in the same way as hclust().
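A minimal sketch of how the two packages might be combined, on simulated data. It assumes parallelDist and fastcluster are installed, and falls back to base R if they are not, so it runs either way. The matrix size and the "complete" linkage are arbitrary choices for the example.

```r
set.seed(1)
x <- matrix(rnorm(2000 * 16), ncol = 16)   # small stand-in: 2000 rows, 16 columns

# Distance matrix: parDist() if available, otherwise plain dist().
d <- if (requireNamespace("parallelDist", quietly = TRUE)) {
  parallelDist::parDist(x, method = "euclidean")
} else {
  dist(x, method = "euclidean")
}

# Clustering: fastcluster's hclust() if available, otherwise stats::hclust().
hc <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust(d, method = "complete")
} else {
  stats::hclust(d, method = "complete")
}

length(hc$order)   # one entry per clustered row
```

Neither package changes the O(n^2) size of the distance matrix, so this helps with run time far more than with memory.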

0