Calculating lots of distances quickly in Python

I have an input of 36,742 points, which means that if I wanted to calculate the lower triangle of the distance matrix (using the Vincenty formula), I would need to generate 36,742 * 36,741 * 0.5 = 674,968,911 distances.

I want to keep the pairs of points that are within 50 km of each other. My current setup is as follows:

    import time
    from geopy.distance import vincenty

    shops = [[id, lat, lon], ...]  # one [id, lat, lon] row per shop

    def lower_triangle_mat(points):
        # Yield each unique pair (i, j) with i < j
        for i in range(len(points) - 1):
            for j in range(i + 1, len(points)):
                yield [points[i], points[j]]

    def return_stores_cutoff(points, cutoff_km=0):
        below_cut = []
        counter = 0
        for x in lower_triangle_mat(points):
            dist_km = vincenty(x[0][1:3], x[1][1:3]).km
            counter += 1
            if counter % 1000000 == 0:
                print("%d out of %d" % (counter, len(points) * (len(points) - 1) * 0.5))
            if dist_km <= cutoff_km:
                below_cut.append([x[0][0], x[1][0], dist_km])
        return below_cut

    start = time.clock()
    stores = return_stores_cutoff(points=shops, cutoff_km=50)
    print(time.clock() - start)

It will obviously take hours and hours. Some approaches I have been considering:

  • Use numpy to vectorise these calculations rather than looping through them (see the sketch after this list)
  • Use some kind of hashing for a quick rough cut-off (all stores within 100 km) and then only calculate the exact distances between those stores.
  • Instead of storing the points in a list, use something like a quad-tree; however, I think that only helps with ranking nearby points rather than with the actual distances -> so perhaps some kind of geo-database
  • I could obviously try haversine, or project the points and use Euclidean distances; however, I am interested in using the most accurate measure possible.
  • Use parallel processing (although I had difficulty seeing how to split the list so that every pair is still covered).
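
For reference, here is a rough sketch of what I mean by vectorising the inner loop with numpy. It uses haversine rather than Vincenty (so it is less accurate), and it assumes shops is the [id, lat, lon] list from above with a 50 km cut-off:

    import numpy as np

    def haversine_km(lat1, lon1, lat2, lon2):
        # Element-wise great-circle distance in km on a spherical Earth
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = (np.sin((lat2 - lat1) / 2.0) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
        return 6371.0 * 2.0 * np.arcsin(np.sqrt(a))

    coords = np.array([s[1:3] for s in shops], dtype=float)  # (n, 2) array of lat/lon
    ids = np.array([s[0] for s in shops])

    below_cut = []
    for i in range(len(coords) - 1):
        # One vectorised call per row compares shop i against every shop j > i
        d = haversine_km(coords[i, 0], coords[i, 1], coords[i + 1:, 0], coords[i + 1:, 1])
        hits = np.nonzero(d <= 50.0)[0]
        below_cut.extend(zip([ids[i]] * len(hits), ids[i + 1 + hits], d[hits]))

This replaces hundreds of millions of scalar Vincenty calls with roughly 37,000 vectorised haversine calls, but it still scales as O(n²), which is why I suspect some kind of spatial index is needed anyway.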

Edit: I think a geohash is needed here - an example using the geoindex package:

    import random
    from geoindex import GeoGridIndex, GeoPoint

    geo_index = GeoGridIndex()
    for _ in range(10000):
        lat = random.random() * 180 - 90
        lng = random.random() * 360 - 180
        geo_index.add_point(GeoPoint(lat, lng))

    center_point = GeoPoint(37.7772448, -122.3955118)
    for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
        print("We found {0} in {1} km".format(point, distance))

However, I would also like to vectorise (rather than loop over) the distance calculations for the stores returned by the geohash.

Edit 2: Following Pouria Hadjibagheri's answer, I tried using lambda and map:

    # [B]: Mapping approach
    import time
    from geopy.distance import vincenty

    lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1)
                  for j in range(i + 1, len(shops)))

    func = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
    # Trying to see if conditional statements slow this down
    func_cond = lambda x: ((x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
                           if vincenty(x[0][1:3], x[1][1:3]).km <= 50 else None)

    start = time.clock()
    out_dist = list(map(func, lwr_tr_mat))
    print(time.clock() - start)

    # The generator is exhausted by the first map, so rebuild it before the second timing
    lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1)
                  for j in range(i + 1, len(shops)))

    start = time.clock()
    out_dist = list(map(func_cond, lwr_tr_mat))
    print(time.clock() - start)

And both took around 61 seconds (I limited the number of stores to 2,000, down from 32,000). Perhaps I am using map incorrectly?

4 answers

This sounds like a classic use case for k-d trees.

If you first convert your points to Euclidean space, you can use the query_pairs method of scipy.spatial.cKDTree:

    from scipy.spatial import cKDTree

    tree = cKDTree(data)  # data is (nshops, ndim): the Euclidean coordinates of each shop, in km
    pairs = tree.query_pairs(50, p=2)  # 50 km radius, L2 (Euclidean) norm

pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤ 50 km apart.
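
One rough way to get those Euclidean coordinates is to map lat/lon onto 3-D Cartesian coordinates on a sphere, so that straight-line (chord) distances approximate great-circle distances; at a 50 km cut-off the chord/arc difference is well under a metre. A minimal sketch, assuming latlon is an (nshops, 2) array of [lat, lon] in degrees:

    import numpy as np

    def to_cartesian_km(latlon_deg):
        # Convert lat/lon (degrees) to 3-D Cartesian coordinates in km on a spherical Earth
        lat = np.radians(latlon_deg[:, 0])
        lon = np.radians(latlon_deg[:, 1])
        r = 6371.0  # mean Earth radius in km
        return np.column_stack((r * np.cos(lat) * np.cos(lon),
                                r * np.cos(lat) * np.sin(lon),
                                r * np.sin(lat)))

    data = to_cartesian_km(latlon)  # feed this into cKDTree above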


If you also need the distances themselves, you can use tree.sparse_distance_matrix, whose output is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you are only interested in unique row/column pairs, you can use scipy.sparse.tril to drop the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the non-zero row and column indices and their corresponding distance values using the .row, .col and .data attributes:

    from scipy import sparse

    tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
    udist = sparse.tril(tree_dist, k=-1)  # strictly lower triangle: drops the diagonal and upper triangle
    ridx = udist.row   # row indices
    cidx = udist.col   # column indices
    dist = udist.data  # distance values
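
A short usage sketch of what you might do with those attributes (shop_ids is a hypothetical sequence of IDs aligned with the rows of data):

    # Pair the sparse-matrix indices back up with shop IDs
    pairs_within_cutoff = [(shop_ids[i], shop_ids[j], d)
                           for i, j, d in zip(udist.row, udist.col, udist.data)]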

Have you tried mapping entire arrays to functions instead of iterating through them? An example would be the following:

    from numpy.random import rand

    my_array = rand(int(5e7), 1)  # An array of 50,000,000 random doubles.

Now, what is usually done is:

    squared_list_iter = [value**2 for value in my_array]

Which, of course, works, but is far from optimal.

An alternative would be to map the array to a function. This is done as follows:

    func = lambda x: x**2  # Here is what I want to do to my array.
    squared_list_map = map(func, my_array)  # Here I am doing it!

Now you might ask: how is this any different, or even better? After all, we have added a function call as well! Here is your answer:

For the first solution (via iteration):

 1 loop: 1.11 minutes. 

Compared to the latter solution (mapping):

    500 loops, 560 ns on average.

Converting the map() to a list via list(map(...)), however, increases the time by a factor of about 10, to roughly 500 ms.
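
If you want to reproduce figures like these, a timeit sketch along the following lines should work (the array size is scaled down here and the exact numbers are machine-dependent; note that the bare map() timing only measures creating the lazy iterator, not evaluating the function over the array):

    import timeit

    setup = ("from numpy.random import rand\n"
             "my_array = rand(int(1e6), 1)\n"
             "func = lambda x: x**2")

    print(timeit.timeit("[v**2 for v in my_array]", setup=setup, number=1))
    print(timeit.timeit("map(func, my_array)", setup=setup, number=500))
    print(timeit.timeit("list(map(func, my_array))", setup=setup, number=1))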

You choose!


"Use some kind of hashing for a quick rough cut-off (all stores within 100 km) and then only calculate the exact distances between those stores." I think this is better called a grid. So first build a dict keyed by a rounded set of coordinates, and put each store into a bucket covering roughly 50 km around that point. Then, when you calculate distances, you only look in the neighbouring buckets rather than iterating over every store in the whole universe.
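
A minimal sketch of that bucketing idea, assuming shops is the [id, lat, lon] list from the question and a cell size of 1 degree (about 111 km of latitude, and at least ~55 km of longitude at latitudes below roughly 60 degrees, so neighbouring cells cover a 50 km radius there):

    from collections import defaultdict
    from itertools import product

    CELL_DEG = 1.0  # hypothetical cell size in degrees

    def cell(lat, lon):
        return (int(lat // CELL_DEG), int(lon // CELL_DEG))

    grid = defaultdict(list)
    for shop in shops:
        grid[cell(shop[1], shop[2])].append(shop)

    def nearby(shop):
        # Shops in the same or neighbouring cells: the only candidates worth an exact
        # distance check. Each qualifying pair comes out twice (once from each end),
        # so de-duplicate by id order if that matters.
        ci, cj = cell(shop[1], shop[2])
        for di, dj in product((-1, 0, 1), repeat=2):
            for other in grid.get((ci + di, cj + dj), []):
                if other is not shop:
                    yield other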


Thank you all for your help. I think I have solved this by incorporating all of the suggestions.

I use numpy to import the geographic coordinates and then project them using "France Lambert - 93". This lets me fill scipy.spatial.cKDTree with the points and then calculate the sparse_distance_matrix with a cut-off of 50 km (my projected points are in metres). Then I extract the lower triangle to a CSV.

    import numpy as np
    import csv
    import time
    from pyproj import Proj, transform

    # http://epsg.io/2154 (accuracy: 1.0m)
    fr = '+proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 \
    +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 \
    +units=m +no_defs'

    # http://epsg.io/27700-5339 (accuracy: 1.0m)
    uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
    +x_0=400000 +y_0=-100000 +ellps=airy \
    +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'

    path_to_csv = '.../raw_in.csv'
    out_csv = '.../out.csv'

    def proj_arr(points):
        inproj = Proj(init='epsg:4326')
        outproj = Proj(uk)
        # origin|destination|lon|lat
        func = lambda x: transform(inproj, outproj, x[2], x[1])
        return np.array(list(map(func, points)))

    tstart = time.time()

    # Import points as geographic coordinates
    # ID|lat|lon
    # Sample to try and replicate
    #points = np.array([
    #    [39007,46.585012,5.5857829],
    #    [88086,48.192370,6.7296289],
    #    [62627,50.309155,3.0218611],
    #    [14020,49.133972,-0.15851507],
    #    [1091, 42.981765,2.0104902]])
    points = np.genfromtxt(path_to_csv, delimiter=',', skip_header=1)
    print("Total points: %d" % len(points))
    print("Triangular matrix contains: %d" % (len(points) * (len(points) - 1) * 0.5))

    # Get projected co-ordinates
    proj_pnts = proj_arr(points)

    # Fill quad-tree
    from scipy.spatial import cKDTree
    tree = cKDTree(proj_pnts)
    cut_off_metres = 1600
    tree_dist = tree.sparse_distance_matrix(tree, max_distance=cut_off_metres, p=2)

    # Extract lower triangle
    from scipy import sparse
    udist = sparse.tril(tree_dist, k=-1)  # drop the diagonal and upper triangle
    print("Distances after quad-tree cut-off: %d " % len(udist.data))

    # Export CSV
    f = open(out_csv, 'w', newline='')
    w = csv.writer(f, delimiter=",")
    w.writerow(['id_a','lat_a','lon_a','id_b','lat_b','lon_b','metres'])
    w.writerows(np.column_stack((points[udist.row], points[udist.col], udist.data)))
    f.close()

    """ Get ID labels """
    id_to_csv = '...id.csv'
    id_labels = np.genfromtxt(id_to_csv, delimiter=',', skip_header=1, dtype='U')

    """ Try vincenty on the un-projected co-ordinates """
    from geopy.distance import vincenty
    vout_csv = '.../out_vin.csv'
    test_vin = np.column_stack((points[udist.row].T[1:3].T,
                                points[udist.col].T[1:3].T))
    func = lambda x: vincenty(x[0:2], x[2:4]).m
    output = list(map(func, test_vin))

    # Export CSV
    f = open(vout_csv, 'w', newline='')
    w = csv.writer(f, delimiter=",")
    w.writerow(['id_a','id_a2','lat_a','lon_a',
                'id_b','id_b2','lat_b','lon_b',
                'proj_metres','vincenty_metres'])
    w.writerows(np.column_stack((list(id_labels[udist.row]), points[udist.row],
                                 list(id_labels[udist.col]), points[udist.col],
                                 udist.data, output)))
    f.close()

    print("Finished in %.0f seconds" % (time.time() - tstart))

This approach took 164 seconds to generate (for 5,306,434 distances) - compared to 9 - and also about 90 seconds to save to disk.

Then I compared the Vincenty distances against the hypotenuse distances (from the projected coordinates).
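
A rough sketch of how that comparison can be done, reusing output and udist from the code above (it assumes no zero projected distances, i.e. no co-located shops):

    diff = np.abs(np.asarray(output) - udist.data)
    print("mean difference: %.1f m" % diff.mean())
    print("mean relative difference: %.4f %%" % (100 * (diff / udist.data).mean()))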

The mean difference was 2.7 metres, and the mean relative difference was 0.0073% - which looks great.

