Effective Nearest Neighbor Scala Search

Let this class of coordinates with Euclidean distance,

case class coord(x: Double, y: Double) { def dist(c: coord) = Math.sqrt( Math.pow(xc.x, 2) + Math.pow(yc.y, 2) ) } 

and let the coordinate grid for example

 val grid = (1 to 25).map {_ => coord(Math.random*5, Math.random*5) } 

Then for any given coordinate

 val x = coord(Math.random*5, Math.random*5) 

closest points to x are

 val nearest = grid.sortWith( (p,q) => p.dist(x) < q.dist(x) ) 

so the first three closest are nearest.take(3) .

Is there a way to make these calculations more time-efficient, especially for a million-dot grid?

+7
algorithm scala nearest-neighbor kdtree r-tree
source share
5 answers

I'm not sure if this is useful (or even stupid), but I thought about it:

You use sorting to sort ALL elements in the grid and then select the first k elements. If you consider the sorting algorithm as a recursive merge-sorting, you have something like this:

  • Half-split split
  • Count on both halves
  • Combine both sorted halves

Perhaps you could optimize such a function for your needs. The unifying part usually combines all the elements from both halves, but you are only interested in the first k , which is the result of the merger. Thus, you can merge only until you have the elements k and ignore the rest.

So, in the worst case, where k >= n ( n is the grid size), you will still only have the difficulty of sorting the merge. O(n log n) Honestly, I cannot determine the complexity of this solution with respect to k . (too tired for this at the moment)

Here is an example implementation of this solution (it is definitely not optimal and not generalized):

 def minK(seq: IndexedSeq[coord], x: coord, k: Int) = { val dist = (c: coord) => c.dist(x) def sort(seq: IndexedSeq[coord]): IndexedSeq[coord] = seq.size match { case 0 | 1 => seq case size => { val (left, right) = seq.splitAt(size / 2) merge(sort(left), sort(right)) } } def merge(left: IndexedSeq[coord], right: IndexedSeq[coord]) = { val leftF = left.lift val rightF = right.lift val builder = IndexedSeq.newBuilder[coord] @tailrec def loop(leftIndex: Int = 0, rightIndex: Int = 0): Unit = { if (leftIndex + rightIndex < k) { (leftF(leftIndex), rightF(rightIndex)) match { case (Some(leftCoord), Some(rightCoord)) => { if (dist(leftCoord) < dist(rightCoord)) { builder += leftCoord loop(leftIndex + 1, rightIndex) } else { builder += rightCoord loop(leftIndex, rightIndex + 1) } } case (Some(leftCoord), None) => { builder += leftCoord loop(leftIndex + 1, rightIndex) } case (None, Some(rightCoord)) => { builder += rightCoord loop(leftIndex, rightIndex + 1) } case _ => } } } loop() builder.result } sort(seq) } 
+3
source share

Profile to find out what is expensive.

Your sorting method is already very inefficient.

Do not reconfigure distances all the time. It's not free - most likely your program spends 99% of its time on computational distances (use the profiler to find out!)

Finally, you can use index structures . For the Euclidean distance, you are probably the largest selection of indices to speed up the search for nearest neighbors. There is a kd tree, but I found that the R tree is often faster. If you want to play with them, I recommend ELKI . It is a Java data mining library (so it should also be easy to use from Scala), and it has a huge selection of index structures.

+3
source share

It was very funny.

 case class Coord(x: Double, y: Double) { def dist(c: Coord) = Math.sqrt(Math.pow(x - cx, 2) + Math.pow(y - cy, 2)) } class CoordOrdering(x: Coord) extends Ordering[Coord] { def compare(a: Coord, b: Coord) = a.dist(x) compare b.dist(x) } def top[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] = { // xs is an ordered sequence of n elements. insert returns xs with e inserted // if it is less than anything currently in the sequence (and in that case, // the last element is dropped) otherwise returns an unmodifed sequence def insert[T](xs: Seq[T], e: T)(implicit ord: Ordering[T]): Seq[T] = { val (l, r) = xs.span(x => ord.lt(x, e)) (l ++ (e +: r)).take(n) } xs.drop(n).foldLeft(xs.take(n).sorted)(insert) } 

Minimally verified. Call it that:

 val grid = (1 to 250000).map { _ => Coord(Math.random * 5, Math.random * 5) } val x = Coord(Math.random * 5, Math.random * 5) top(grid, 3)(new CoordOrdering(x)) 

EDIT: It is quite simple to extend this to (pre) calculate distances only once

 val zippedGrid = grid map {_.dist(x)} zip grid object ZippedCoordOrdering extends Ordering[(Double, Coord)] { def compare(a:(Double, Coord), b:(Double, Coord)) = a._1 compare b._1 } top(zippedGrid,3)(ZippedCoordOrdering).unzip._2 
+2
source share

Here is an algorithm that uses the R-tree data structure. Not useful for the described small data set, but it scales well for a large number of objects.

Use an ordered list whose nodes represent either objects or the bounding rectangles of the R-tree. Ordering closest using any distance function you want. Enter the order on the insert.

Initialize the list by inserting the bounding fields in the root of the R-tree node.

To get the next closest object:

(1) Remove the first item from the list.

(2) If it is an object, it is the closest.

(3) If it is a non-leaf bounding box of an R-tree node, insert all the bounding boxes representing the children of this node into the list in their respective places according to their distance.

(4) If this is the bounding box of a leaf of a node tree, insert objects that are children of this node (objects, not their bounding fields) according to their distance.

(5) Return to step (1).

The list will remain fairly short. At the front there will be close objects that interest us, and later the nodes in the list will be boxes representing collections of objects that are further away.

+1
source share

It depends on exact or approximation .

Since several tests, such as http://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup , show that approximation is a good solution in terms of efficient .

I wrote ann4s , which is an implementation of Scala Annoy

Annoy (Oh Yeah Nearest Neighbor) is a C ++ library with Python bindings for finding points in space close to a given query point. It also creates large read-only file data structures that fit in memory so that many processes can share the same data.

Check out this repo .

0
source share

All Articles