Parallel construction of the distance matrix

I am working on hierarchical agglomerative clustering of large numbers of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation of this task is the following (here in Python):

    import numpy as np
    from scipy.spatial.distance import cosine  # scipy.spatial.distance.cosine()

    def create_dist_matrix(v):
        '''v = an array (N, d), where rows are the observations and columns the dimensions'''
        N = v.shape[0]
        D = np.zeros((N, N))
        for i in range(N):
            for j in range(i + 1):
                D[i, j] = cosine(v[i, :], v[j, :])
        return D

I was wondering what the best way is to add parallelism to this routine. A simple approach would be to split the outer loop into a series of tasks, e.g. if you have 10 processors, create 10 tasks for different ranges of i and then combine the results. However, this "horizontal" solution does not seem quite right. Are there other parallel algorithms (or existing libraries) for this task? Any help would be greatly appreciated.
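For illustration, the "horizontal" split I have in mind would look roughly like the sketch below (the function names are just placeholders). Because of the triangular inner loop, the row chunks carry very different amounts of work, which is part of why this does not feel right:

    from multiprocessing import Pool

    import numpy as np
    from scipy.spatial.distance import cosine

    def rows_block(args):
        # compute the lower-triangular rows i_start..i_stop-1 of the distance matrix
        v, i_start, i_stop = args
        block = np.zeros((i_stop - i_start, v.shape[0]))
        for i in range(i_start, i_stop):
            for j in range(i + 1):
                block[i - i_start, j] = cosine(v[i, :], v[j, :])
        return i_start, i_stop, block

    def create_dist_matrix_parallel(v, n_procs=10):
        N = v.shape[0]
        bounds = np.linspace(0, N, n_procs + 1, dtype=int)  # naive, unbalanced row split
        tasks = [(v, bounds[k], bounds[k + 1]) for k in range(n_procs)]
        D = np.zeros((N, N))
        with Pool(n_procs) as pool:
            for i_start, i_stop, block in pool.imap_unordered(rows_block, tasks):
                D[i_start:i_stop, :] = block
        return D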

+9
5 answers

It looks like scikit-learn has a parallel version of pdist, called pairwise_distances:

    from sklearn.metrics.pairwise import pairwise_distances

    D = pairwise_distances(X=v, metric='cosine', n_jobs=-1)

where n_jobs = -1 indicates that all processors will be used.
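If the goal is the hierarchical clustering from the question, the square matrix returned by pairwise_distances can then be condensed with scipy.spatial.distance.squareform and handed to linkage. A minimal sketch (checks=False skips the symmetry/zero-diagonal validation, which can otherwise trip over tiny floating-point errors; method='average' is just an example):

    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    D_cond = squareform(D, checks=False)   # condensed 1D vector of length N*(N-1)/2
    Z = linkage(D_cond, method='average')  # agglomerative clustering on precomputed distances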

+11

I doubt you will get it any faster than pdist in the scipy module, probably because its documentation notes:

Please note that you should avoid passing a reference to one of the distance functions defined in this library. For example:

    dm = pdist(X, sokalsneath)

would compute the pairwise distances between the vectors in X using the Python function sokalsneath. This would result in sokalsneath being called n choose 2 times, which is inefficient. Instead, the optimized C version is more efficient, and it is invoked with the following syntax:

    dm = pdist(X, 'sokalsneath')

So no Python function is called when you use pdist(X, 'cosine'). When I run it, it seems to use only one core, so if you have many cores you might be able to go faster. But bear in mind that, to achieve that, your own implementation would have to be at least as fast as SciPy's, which will not be trivial. You would be better off being patient, or going for a different clustering method, e.g. an algorithm that supports a spatial index.
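Applied to the cosine distance from the question, this looks like the following short sketch (the string metric selects the C implementation, and the condensed output can go straight into scipy.cluster.hierarchy.linkage):

    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    # 'cosine' as a string selects the optimized (single-core) C implementation
    dm = pdist(v, 'cosine')            # condensed vector of length N*(N-1)/2
    Z = linkage(dm, method='average')  # method is just an example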
+2

See @agartland's answer: you can specify n_jobs in sklearn.metrics.pairwise.pairwise_distances, or look for a clustering algorithm in sklearn.cluster that has an n_jobs parameter, e.g. sklearn.cluster.KMeans.

However, if you feel adventurous, you can implement the calculation yourself. For example, if you need a 1D condensed distance matrix for scipy.cluster.hierarchy.linkage, you can use:

    #!/usr/bin/env python3
    from multiprocessing import Pool

    import numpy as np
    from time import time as ts

    data = np.zeros((100, 10))  # YOUR data: np.array[n_samples x m_features]
    n_processes = 4             # YOUR number of processors

    def metric(a, b):           # YOUR dist function
        return np.sum(np.abs(a - b))

    n = data.shape[0]
    k_max = n * (n - 1) // 2    # maximum number of elements in the 1D dist array
    k_step = n ** 2 // 500      # ~500 chunks
    dist = np.zeros(k_max)      # resulting 1D dist array

    def proc(start):
        dist = []
        k1 = start
        k2 = min(start + k_step, k_max)
        for k in range(k1, k2):
            # get (i, j) of the 2D distance matrix from index k of the 1D distance matrix
            i = int(n - 2 - int(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5))
            j = int(k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2)
            # store distance
            a = data[i, :]
            b = data[j, :]
            d = metric(a, b)
            dist.append(d)
        return k1, k2, dist

    ts_start = ts()
    with Pool(n_processes) as pool:
        for k1, k2, res in pool.imap_unordered(proc, range(0, k_max, k_step)):
            dist[k1:k2] = res
            print("{:.0f} minutes, {:,}..{:,} out of {:,}".format(
                (ts() - ts_start) / 60, k1, k2, k_max))
    print("Elapsed %.0f minutes" % ((ts() - ts_start) / 60))
    print("Saving...")
    np.savez("dist.npz", dist=dist)
    print("DONE")

Just so you know, the scipy.cluster.hierarchy.linkage implementation is not parallel, and its complexity is at least O(N²). I am not sure whether scipy has a parallel implementation of this function.
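Once the condensed dist array is saved, it can be handed straight to linkage, for example in a sketch like this (method='average' is just an example choice):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    dist = np.load("dist.npz")["dist"]   # the condensed 1D distance array computed above
    Z = linkage(dist, method='average')  # single-threaded agglomerative step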

+2

If you decide to orchestrate the multiprocessing yourself, you will want to split the calculations evenly between the CPUs so that the overall run time is as short as possible. In that case, the answers to this question on evenly splitting a diagonal matrix may come in handy.
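For example, here is a minimal sketch of such an even split for the triangular workload in the question (balanced_row_chunks is just an illustrative helper name; row i of the lower triangle contributes about i pairs, so the cut points are taken on the cumulative pair count rather than on the row count):

    import numpy as np

    def balanced_row_chunks(n, n_chunks):
        """Split rows 0..n-1 of the lower triangle into contiguous chunks of
        roughly equal work (row i contributes ~i pairwise distances)."""
        pairs_per_row = np.arange(n)
        cum = np.cumsum(pairs_per_row)   # cumulative number of pairs up to each row
        total = cum[-1]                  # n*(n-1)/2 pairs overall
        bounds = [int(np.searchsorted(cum, total * k / n_chunks))
                  for k in range(1, n_chunks)]
        starts = [0] + bounds
        stops = bounds + [n]
        return list(zip(starts, stops))

    # e.g. for N = 1000 observations and 4 workers the chunks cover fewer rows
    # towards the bottom of the matrix, where rows are longer:
    print(balanced_row_chunks(1000, 4))
    # roughly [(0, 500), (500, 707), (707, 866), (866, 1000)]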

0

In addition to what @agartland suggested, I like to use pairwise_distances or pairwise_distances_chunked with numpy.triu_indices to obtain a condensed distance vector. This is exactly the output provided by scipy.spatial.distance.pdist.

It is important to note that the k kwarg of triu_indices controls the offset from the diagonal. The default k=0 would include the diagonal of zeros as well as the real distance values, so it should be set to k=1 to avoid this.
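A tiny illustration of the effect of k (a throwaway example matrix, not from the original answer):

    import numpy as np

    D = np.array([[0.0, 0.3, 0.5],
                  [0.3, 0.0, 0.2],
                  [0.5, 0.2, 0.0]])

    print(D[np.triu_indices(3, k=0)])  # [0.  0.3 0.5 0.  0.2 0. ] -- zeros from the diagonal included
    print(D[np.triu_indices(3, k=1)])  # [0.3 0.5 0.2]             -- condensed vector, pdist-style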

For large datasets, I have run into an issue where pairwise_distances raises a ValueError from struct.unpack when returning values from the worker processes. Hence my use of pairwise_distances_chunked below.

    import numpy as np
    from sklearn.metrics import pairwise_distances_chunked

    gen = pairwise_distances_chunked(X, metric='cosine', n_jobs=-1)
    Z = np.concatenate(list(gen), axis=0)
    Z_cond = Z[np.triu_indices(Z.shape[0], k=1)]

For me, this is significantly faster than using pdist, and it scales well with the number of available cores.

N.B. I think it is also worth noting that in the past there was some confusion about the arguments of scipy.cluster.hierarchy.linkage, in that the documentation at one point suggested that users could pass either a condensed or a squareform distance vector/matrix (see scipy issue #2614, "linkage() function mistakes distance matrix as observation vectors"). This is in fact not the case: the values passed to linkage must be either a condensed distance vector or an m x n array of raw observations.

0
