Summing two distance matrices to obtain a third "common" distance matrix (ecological context)

I am an ecologist, mainly using the vegan R package.

I have two abundance matrices (samples × species; see the simulated data below):

matrix 1: nrow = 6 replicates × 24 sites = 144, ncol = 15 species (fish)
matrix 2: nrow = 3 replicates × 24 sites = 72, ncol = 10 taxa (macro-invertebrates)

In both matrices the sites are the same. I want to obtain an overall resemblance between pairs of sites, taking both matrices into account. I see two options:

Option 1: average the fish and macro-invertebrate abundances over the replicates (at the site scale), cbind the two averaged abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances), and compute the Bray-Curtis dissimilarity.

Option 2: for each assemblage, compute the Bray-Curtis dissimilarity between pairs of samples, then compute the distances between site centroids. Finally, sum the two distance matrices.

If I am not mistaken, I have carried out these two operations in the R code below.

Could you please tell me whether option 2 is correct, and whether it is more suitable than option 1?

thank you in advance.

Pierre

Here are the R code examples.

data generation

 library(plyr); library(vegan)

 # assemblage 1: 15 fish species, 6 replicates per site
 a1.env = data.frame(
   Habitat = paste("H", gl(2, 12 * 6), sep = ""),
   Site = paste("S", gl(24, 6), sep = ""),
   Replicate = rep(paste("R", 1:6, sep = ""), 24))
 summary(a1.env)
 a1.bio = as.data.frame(replicate(15, rpois(144, sample(1:10, 1))))
 names(a1.bio) = paste("F", 1:15, sep = "")
 a1.bio[1:72, ] = 2 * a1.bio[1:72, ]

 # assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
 a2.env = a1.env[a1.env$Replicate %in% c("R1", "R2", "R3"), ]
 summary(a2.env)
 a2.bio = as.data.frame(replicate(10, rpois(72, sample(10:100, 1))))
 names(a2.bio) = paste("I", 1:10, sep = "")
 a2.bio[1:36, ] = 0.5 * a2.bio[1:36, ]

 # environmental data at the site scale
 env = unique(a1.env[, c("Habitat", "Site")])
 env = env[order(env$Site), ]

OPTION 1: averaging abundances and cbind

 a1.bio.mean = ddply(cbind(a1.bio, a1.env), .(Habitat, Site), numcolwise(mean))
 a1.bio.mean = a1.bio.mean[order(a1.bio.mean$Site), ]
 a2.bio.mean = ddply(cbind(a2.bio, a2.env), .(Habitat, Site), numcolwise(mean))
 a2.bio.mean = a2.bio.mean[order(a2.bio.mean$Site), ]
 bio.mean = cbind(a1.bio.mean[, -c(1:2)], a2.bio.mean[, -c(1:2)])
 dist.mean = vegdist(sqrt(bio.mean), "bray")

OPTION 2: calculating, for each assemblage, the distances between site centroids, then summing the two distance matrices

 a1.dist = vegdist(sqrt(a1.bio), "bray")
 a1.coord.centroid = betadisper(a1.dist, a1.env$Site)$centroids
 a1.dist.centroid = vegdist(a1.coord.centroid, "eucl")
 a2.dist = vegdist(sqrt(a2.bio), "bray")
 a2.coord.centroid = betadisper(a2.dist, a2.env$Site)$centroids
 a2.dist.centroid = vegdist(a2.coord.centroid, "eucl")

summing the two distance matrices using Gavin Simpson's fuse()

 dist.centroid = fuse(a1.dist.centroid, a2.dist.centroid, weights = c(15/25, 10/25))

summing the two Euclidean distance matrices (after a correction by Jari Oksanen)

 dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2) 

and "coord.centroid" below for further remote analysis (is this correct?)

 coord.centroid = cmdscale(dist.centroid, k = 23, add = TRUE)

COMPARISON OF OPTIONS 1 AND 2

 pco.mean = cmdscale(vegdist(sqrt(bio.mean), "bray"))
 pco.centroid = cmdscale(dist.centroid)
 comparison = procrustes(pco.centroid, pco.mean)
 protest(pco.centroid, pco.mean)
3 answers

A simpler solution is to flexibly combine the two dissimilarity matrices by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices, the fused dissimilarity matrix is simply

 d.fused = (w * dx) + ((1 - w) * dy) 

where w is a numeric scalar (a length-1 vector). If you have no reason to weight one set of dissimilarities more than the other, just use w = 0.5.
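As a minimal self-contained sketch of this formula (the example data and the dist-skeleton trick below are illustrative, not from the original answer):

 library(vegan)

 set.seed(1)
 # two assemblages sampled at the same 6 sites (made-up counts)
 sites1 <- matrix(rpois(60, 5), nrow = 6)
 sites2 <- matrix(rpois(60, 20), nrow = 6)

 dx <- vegdist(sites1, method = "bray")
 dy <- vegdist(sites2, method = "bray")

 w <- 0.5                       # equal weighting
 d.fused <- dx                  # reuse a "dist" object to keep its attributes
 d.fused[] <- (w * as.vector(dx)) + ((1 - w) * as.vector(dy))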

I have a function that does this for you in my analogue package: fuse(). An example from ?fuse:

  train1 <- data.frame(matrix(abs(runif(100)), ncol = 10)) train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE), ncol = 10)) rownames(train1) <- rownames(train2) <- LETTERS[1:10] colnames(train1) <- colnames(train2) <- as.character(1:10) d1 <- vegdist(train1, method = "bray") d2 <- vegdist(train2, method = "jaccard") dd <- fuse(d1, d2, weights = c(0.6, 0.4)) dd str(dd) 

This idea is used in supervised Kohonen networks (supervised SOMs) to bring several layers of data into a single analysis.

analogue is designed to work closely with vegan, so there will be no problems using the two packages side by side.


Whether it is correct to average distances depends on what you do with those distances. In some applications you may expect that they really are distances: that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.

This issue is related to the controversy over partial Mantel-type analysis of dissimilarities versus analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases the analysis of rectangular data is more robust and more powerful. By rectangular data I mean the usual sampling units × species matrix. The preferred dissimilarity-based methods of vegan map dissimilarities onto rectangular form. These vegan methods include db-RDA (capscale), permutational MANOVA (adonis), and analysis of within-group dispersion (betadisper). Methods that work with dissimilarities as such include mantel, anosim, mrpp, and meandist.
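As a minimal sketch of these rectangular-data methods, reusing the simulated bio.mean and env objects from the question (the formula with Habitat is only an illustration, not part of the original answer):

 ## db-RDA from the rectangular data
 mod.cap <- capscale(sqrt(bio.mean) ~ Habitat, data = env, distance = "bray")

 ## permutational MANOVA from the rectangular data
 mod.adonis <- adonis(sqrt(bio.mean) ~ Habitat, data = env, method = "bray")

 ## within-group dispersion: takes a dissimilarity object plus a grouping factor
 mod.disp <- betadisper(vegdist(sqrt(bio.mean), "bray"), env$Habitat)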

The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data; that is, the mean of the dissimilarities does not correspond to the mean of the data. I think it is in general better to average or handle the data, and then derive the dissimilarities from the transformed data.
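A small illustration of this point, with made-up replicate data (none of these objects come from the answer itself):

 library(vegan)

 set.seed(42)
 # two replicate abundance matrices for the same 5 sites
 rep1 <- matrix(rpois(50, 10), nrow = 5)
 rep2 <- matrix(rpois(50, 10), nrow = 5)

 # the mean of the dissimilarities ...
 d.avg <- (as.vector(vegdist(rep1, "bray")) + as.vector(vegdist(rep2, "bray"))) / 2
 # ... is not the dissimilarity of the averaged data
 d.of.means <- as.vector(vegdist((rep1 + rep2) / 2, "bray"))

 plot(d.avg, d.of.means)
 abline(0, 1)  # points off this line show the discrepancy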

If you want to combine dissimilarities, the analogue::fuse() style approach is the most practical. However, you should understand that fuse() also rescales the dissimilarity matrices to equal maxima. If you have dissimilarities on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not plain averaging but averaging after equalizing the ranges. In addition, you should remember that averaging dissimilarities usually destroys the geometry, and that matters if you use analysis methods for rectangular data (adonis, betadisper, capscale in vegan).
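To see the rescaling, here is a toy check (the numbers are invented; the behaviour shown is the equal-maxima scaling described above):

 library(analogue)

 d1 <- dist(c(0, 1, 3))    # maximum distance 3
 d2 <- dist(c(0, 2, 10))   # maximum distance 10
 ## each matrix is first rescaled to a common maximum, then weighted and summed,
 ## so the result is not the same as (d1 + d2) / 2
 fuse(d1, d2, weights = c(0.5, 0.5))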

Finally, a word on the geometry of combining dissimilarities. Dissimilarities on a 0..1 scale are fractions of the type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, the result will not equal the same fraction computed from the averaged data. This is what I mean by destroying the geometry. Some open-scale indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of sums of squared differences: their squares are additive, but the distances themselves are not.

I demonstrate these things by showing the effect of adding together two dissimilarities (averaging would then mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data shipped with vegan and divide it into two subsets of slightly unequal size. An addition of distances that preserves the geometry of the data subsets should give the same result as the analysis of the complete data:

 library(vegan)     ## data and vegdist
 library(analogue)  ## fuse
 data(BCI)
 dim(BCI)
 ## [1]  50 225
 x1 <- BCI[, 1:100]
 x2 <- BCI[, 101:225]
 ## Bray-Curtis and fuse: not additive
 plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
 ## Summing distances is straightforward (they are vectors), but preserving
 ## their attributes and keeping the dissimilarities needs fuse or some trick
 ## like below, where we make the dist structure dtmp to be replaced with the result
 dtmp <- dist(BCI)  ## dist skeleton with attributes
 dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
 ## Manhattan distances are additive and can be averaged
 plot(dist(BCI, "manhattan"), dtmp)
 ## fuse rescales dissimilarities, and then they are no longer additive
 dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights = c(100/225, 125/225))
 plot(dist(BCI, "manhattan"), dfuse)
 ## Euclidean distances are not additive
 dtmp[] <- dist(x1) + dist(x2)
 plot(dist(BCI), dtmp)
 ## ... but squared Euclidean distances are additive
 dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
 plot(dist(BCI), dtmp)
 ## dfuse would rescale squared Euclidean distances like Manhattan (not shown)

I only considered addition above, but if you cannot add, you cannot average either. Whether this matters is a matter of taste. Brave people will average things that cannot be averaged, but more cautious people prefer to follow the rules. I would rather belong to the second group.


I like the simplicity of this answer, but it only applies to adding two distance matrices:

 d.fused = (w * dx) + ((1 - w) * dy) 

so I wrote my own snippet to combine a list of several distance matrices (not just two), using only standard R functions:

 # generate a list of distance matrices
 x <- matrix(rnorm(100), nrow = 5)
 y <- matrix(rnorm(100), nrow = 5)
 z <- matrix(rnorm(100), nrow = 5)
 dst_array <- list(dist(x), dist(y), dist(z))

 # start the combined distance matrix from the first element of the list
 dst <- dst_array[[1]]

 # loop over the remaining elements, adding each to the combined matrix
 for (jj in 2:length(dst_array)) {
   dst <- dst + dst_array[[jj]]
 }

You can also use a vector of weights of the same length as dst_array to apply scaling factors:

 dst <- dst + my_scale[[jj]] * dst_array[[jj]] 
