In addition to what @agartland suggested, I like to use pairwise_distances or pairwise_distances_chunked with numpy.triu_indices to get a condensed distance vector. This is exactly the output produced by scipy.spatial.distance.pdist.
It is important to note that the k kwarg of triu_indices controls the offset of the diagonal. The default k=0 includes the main diagonal of zeros along with the real distances, so it should be set to k=1 to select only the entries strictly above the diagonal.
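A minimal sketch of the k offset, using a small hand-made symmetric matrix (the values here are illustrative only):

```python
import numpy as np

# Hypothetical 3x3 symmetric distance matrix for illustration.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])

# k=0 would also pick up the zero diagonal; k=1 keeps only the
# entries strictly above the diagonal, matching pdist's condensed form.
iu = np.triu_indices(D.shape[0], k=1)
cond = D[iu]
print(cond)  # [1. 2. 3.]
```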
For large datasets, I ran into pairwise_distances raising a ValueError from struct.unpack when returning a value from a worker, hence my use of pairwise_distances_chunked below.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

gen = pairwise_distances_chunked(X, metric='cosine', n_jobs=-1)
Z = np.concatenate(list(gen), axis=0)
Z_cond = Z[np.triu_indices(Z.shape[0], k=1)]
For me, this is much faster than using pdist, and unlike pdist it scales with the number of available cores.
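A quick check that the triu_indices extraction reproduces pdist's condensed vector. To keep the snippet dependency-light I build the square matrix with scipy's squareform rather than scikit-learn, but the extraction step is the same:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((50, 4))

# Full square distance matrix, as pairwise_distances would return it.
Z = squareform(pdist(X, metric='cosine'))

# Condense it with triu_indices; the result matches pdist's
# row-major upper-triangle ordering exactly.
Z_cond = Z[np.triu_indices(Z.shape[0], k=1)]
print(np.allclose(Z_cond, pdist(X, metric='cosine')))  # True
```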
NB. I think it's also worth noting that in the past there was some confusion about the arguments to scipy.cluster.hierarchy.linkage: the documentation at one point suggested that users could pass either a condensed or a square distance matrix (see scipy issue #2614). In fact this is not so, and the value passed to linkage must be either a condensed distance vector or an m x n array of raw observations.
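A minimal sketch of the correct call, using random data for illustration; note that passing the square matrix instead would silently be treated as raw observations:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((10, 3))

# Correct: pass the condensed distance vector (or X itself).
cond = pdist(X, metric='cosine')
tree = linkage(cond, method='average')

# Incorrect (silently wrong): linkage(squareform(cond)) would treat
# the 10x10 square matrix as 10 observations in 10 dimensions.
print(tree.shape)  # (9, 4)
```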