How do I find the closest document using the Google App Engine Search API?

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are distributed worldwide. Some documents may be over 4,000 km from any other document; others may be clustered within meters of each other.

I would like to find the closest document to a specific set of coordinates, but the following code gives incorrect results:

    from google.appengine.api import search

    # coords is a (latitude, longitude) tuple, e.g. (50.123, 1.123)
    search.Document(
        doc_id='meaningful-unique-id',
        fields=[search.GeoField(
            name='location',
            value=search.GeoPoint(coords[0], coords[1]))])

    # Find the closest document; radius is in metres
    def find_document(coords, radius=1000000):
        sort_expr = search.SortExpression(
            expression='distance(location, geopoint(%.3f, %.3f))' % coords,
            direction=search.SortExpression.ASCENDING,
            default_value=0)
        search_query = search.Query(
            query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
                         % (coords[0], coords[1], radius),
            options=search.QueryOptions(
                limit=1,
                ids_only=True,
                sort_options=search.SortOptions(expressions=[sort_expr])))
        index = search.Index(name='document-index')
        return index.search(search_query)

With this code, I get results that are consistent but incorrect. For example, a search for the document nearest to London returned one in Scotland, even though I verified that thousands of documents are closer.

I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is about 12 km (radius=12000). Typically, there are no more than 1,000 documents within a 12 km radius. (This is probably related to search.SortOptions(limit=1000).)

The problem is that if I am in a sparse area of the globe where there are no documents for thousands of miles, my search function will not return anything with radius=12000 (12 km). I want it to return the closest document to me, wherever I am. How can I do this reliably with one call to the Search API?

+6
3 answers

I believe the problem is as follows. Your query selects up to 10,000 matching documents; only those are then sorted by your distance expression and returned. (That is, the sort does not actually cover all 400k documents.) I therefore suspect that some of the geographically close points are not included in this 10k selection. That is why things work better when you shrink the search radius: fewer points in total fall within that radius.
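To make the suspected mechanism concrete, here is a minimal pure-Python sketch of "match, keep at most a fixed number of documents, then sort by distance". It does not call the Search API; the helper names (`haversine_m`, `nearest_after_truncation`, `true_nearest`) are made up for illustration, and keeping the first `cap` matches in list order is only an assumption about how the subset might be chosen.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def nearest_after_truncation(docs, origin, radius_m, cap):
    """Mimic the suspected behaviour: filter by radius, keep at most
    `cap` matches (assumed: in arbitrary order), then sort by distance."""
    matched = [d for d in docs if haversine_m(d, origin) < radius_m]
    selected = matched[:cap]  # the truly nearest document may be cut off here
    if not selected:
        return None
    return min(selected, key=lambda d: haversine_m(d, origin))

def true_nearest(docs, origin):
    """Nearest document when every document takes part in the sort."""
    return min(docs, key=lambda d: haversine_m(d, origin))
```

With a large radius, many far-away documents match and can crowd the nearby one out of the capped selection; with a small radius, the matched set fits under the cap and the answer is correct, which matches what the question describes.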

Essentially, you want the query to pick its (up to) 10,000 candidates in a way that makes sense for what you are asking. You can approach this in at least a couple of ways, which you can combine:

  • Add a ranking so that the most "important" documents (by some criterion that makes sense in your domain) are selected in rank order; those are then sorted by distance.
  • Filter on one or more document fields (for example, a "business category" field if your documents describe businesses) to reduce the number of candidate documents.

(I don't think this 10k threshold is currently in the Search API documentation; I have filed a request to add it.)
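As a rough sketch of the filtering idea, the helper below builds a query string that combines the distance restriction from the question with an extra field restriction. The function name `build_query_string` and the `category` field are hypothetical (the question's documents only have `location`), and the value is not escaped, so this assumes category values contain no quotes.

```python
def build_query_string(coords, radius_m, category=None):
    """Build a Search API query string: distance filter plus an optional
    filter on a hypothetical 'category' field."""
    clauses = ['distance(location, geopoint(%.3f, %.3f)) < %d'
               % (coords[0], coords[1], radius_m)]
    if category is not None:
        # 'category' is an assumed field name, not part of the original index.
        clauses.append('category = "%s"' % category)
    return ' AND '.join(clauses)
```

The resulting string would be passed as `query_string` to `search.Query` in place of the one in the question; the sort expression stays the same.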

+5

I have the same problem, and I don't think it is possible. The problem arises, as you found out yourself, when more documents match than the search will actually sort. The algorithm simply stops once it has loaded up to its limit and then sorts only those results.

I saw the same clustering as you; it is part of how the Search API behaves.

One hack would be to subdivide your search, make several simultaneous calls, and then combine and sort the results.
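One way to picture this hack is to split the search into concentric distance rings, query each ring separately, and merge the per-ring winners. The sketch below is only an illustration: `search_ring` is a hypothetical callable you would implement on top of the Search API (for example with a query restricted to `inner <= distance < outer`), and it is assumed to return a list of `(doc_id, distance_m)` tuples. The loop is sequential for simplicity; the calls could also be issued concurrently.

```python
def nearest_from_rings(search_ring, coords, edges_m):
    """Query each ring [edges_m[i], edges_m[i+1]) and return the closest hit.

    search_ring(coords, inner_m, outer_m) is a hypothetical callable that
    returns a list of (doc_id, distance_m) tuples for that ring.
    """
    candidates = []
    for inner, outer in zip(edges_m[:-1], edges_m[1:]):
        candidates.extend(search_ring(coords, inner, outer))
    if not candidates:
        return None
    # Merge step: the overall nearest is the minimum across all rings.
    return min(candidates, key=lambda c: c[1])
```

A ring's result can only be trusted if that ring matches fewer documents than the sort limit, so the inner rings should be narrow (dense areas) while the outer rings can be wide (sparse areas), e.g. `edges_m = [0, 12000, 100000, 1000000, 20000000]`.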

+1

A wild idea: why not store each document's distance from three fixed reference points, and then calculate from those?

0
