I am trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a large table with 500 thousand rows and 41 columns (the first is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I searched for it and came across an ff package in R, but I don’t quite understand if this can solve my problem.
Thanks!
In general, hierarchical clustering is not the best approach for working with very large data sets.
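In your case, however, there is a different problem. If you want to cluster the samples, the structure of your data is wrong. dist() computes distances between rows, so the observations (samples) should be the rows and the gene expression values (or whatever kind of data you have) the columns.
For example, suppose you have data like this: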
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
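As a quick sanity check, the sketch below uses a much smaller synthetic table (1,000 rows instead of 500,000, an assumption made only to keep it fast; the names `small`, `small.t`, `d.small` and `hcl.small` are illustrative). It shows that dist() compares rows, so once the samples are in the rows only 40 x 39 / 2 pairwise distances are stored.

```r
# Small synthetic stand-in for the real table: 1000 genes x 40 samples
set.seed(42)
small <- as.data.frame(matrix(rnorm(1000 * 40), ncol = 40))

# dist() computes distances between ROWS, so put the samples in rows
small.t <- t(as.matrix(small))
d.small <- dist(small.t)

attr(d.small, "Size")  # number of objects being clustered: 40
length(d.small)        # number of stored pairwise distances: 780

# Clustering 40 objects is cheap regardless of how many genes there are
hcl.small <- hclust(d.small)
```

The number of genes only affects how long each of the 780 distances takes to compute, not how many distances have to be kept in memory.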
Note that with high-dimensional data, Euclidean distances can be rather misleading.
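When dealing with large data sets, R is not the best choice.
Most methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations do not scale well to large data unless the matrix is sparse (which a distance matrix, by definition, is not).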
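I don't know if you noticed, but 1101.1 Gb is about 1 terabyte. I don't think you have that much RAM, and you probably would not have the time to wait for this matrix to be computed either.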
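To see where a figure of that size comes from, here is a rough back-of-the-envelope sketch. It assumes a "dist" object stores the lower triangle of the distance matrix as 8-byte doubles and that R reports Gb as 1024^3 bytes; `dist_bytes` and `gib` are hypothetical helper names, not part of R.

```r
# Bytes needed for the n * (n - 1) / 2 pairwise distances of n objects,
# assuming each distance is stored as an 8-byte double
dist_bytes <- function(n) {
  n <- as.numeric(n)  # work in doubles to avoid integer overflow for large n
  n * (n - 1) / 2 * 8
}

# Convert bytes to units of 1024^3 bytes
gib <- function(bytes) bytes / 1024^3

gib(dist_bytes(500000))  # genes as objects: roughly 931
dist_bytes(40)           # samples as objects: 6240 bytes
```

With exactly 500,000 rows this gives about 931 Gb, the same order of magnitude as the 1101.1 Gb in the error, which would be consistent with a table somewhat larger than 500,000 rows. Clustering the 40 samples instead needs only a few kilobytes.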
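ELKI, for example, is much more capable for clustering, because you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage, for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).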
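But of course, this varies from algorithm to algorithm. K-means, for example, which needs only point-to-mean distances, does not need (and cannot use) an O(n^2) distance matrix.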
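As an illustration of a method that never builds a pairwise distance matrix, here is a minimal sketch using kmeans() from base R's stats package. The synthetic data, the number of rows and the choice of 4 centers are all assumptions made for the example, not values from the question.

```r
# Synthetic data: 20000 points in 5 dimensions (illustrative sizes)
set.seed(1)
n <- 20000
x <- matrix(rnorm(n * 5), ncol = 5)

# K-means only compares each point to the current cluster centers,
# so memory grows linearly with n rather than quadratically
km <- kmeans(x, centers = 4, iter.max = 100)

length(km$cluster)  # one cluster label per point
dim(km$centers)     # one row per center, one column per dimension
```

Calling dist(x) on the same matrix would have to store about 200 million distances, whereas K-means keeps only the data, the labels and the 4 centers.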
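So, in the end: I don't think the memory limit of R is your actual problem. The method you want to use simply does not scale.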
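I just ran into a related issue, but with fewer rows (around 100 thousand, for 16 columns). RAM size is the limiting factor.
To limit the memory needed, I used 2 different functions from 2 different packages. From parallelDist, the parDist() function lets you obtain the distances quite quickly. It uses RAM during the process, of course, but the resulting dist object seems to take less memory (no idea why).
Then I used the hclust() function, but from the fastcluster package. fastcluster is actually not that fast on this amount of data, but it seems to use less memory than the default hclust().
Hope this will be useful to anybody who finds this topic.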
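A minimal sketch of that combination might look as follows. It assumes the parallelDist and fastcluster packages are installed and falls back to the base stats functions when they are not. The synthetic data (2,000 rows, 16 columns), the "complete" linkage and the cut into 4 groups are illustrative choices, and `dist_fun` and `hclust_fun` are hypothetical names.

```r
# Synthetic stand-in data: 2000 rows x 16 columns (much smaller than 100k rows)
set.seed(1)
x <- matrix(rnorm(2000 * 16), ncol = 16)

# Use parallelDist::parDist() if available, otherwise stats::dist()
dist_fun <- if (requireNamespace("parallelDist", quietly = TRUE)) {
  parallelDist::parDist
} else {
  stats::dist
}

# Use fastcluster::hclust() if available, otherwise stats::hclust()
hclust_fun <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust
} else {
  stats::hclust
}

d <- dist_fun(x, method = "euclidean")     # pairwise distances between rows
hcl <- hclust_fun(d, method = "complete")  # hierarchical clustering
groups <- cutree(hcl, k = 4)               # cut the tree into 4 groups
```

Both replacements still have to hold the n * (n - 1) / 2 distances in memory, so this helps at around 100 thousand rows only if enough RAM is available, and it does not make 500 thousand rows feasible.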