What data structure should I use for geocoding?

Question

What data structure should I use for geocoding?

I am trying to create a Python script that will take an address as input and spit out its latitude and longitude, or latitude and longitude in case of multiple matches, like Nominatim .

Thus, the possible input and output data may be: -

To: New York, USA => Exit: New York (lat: x1 lon: y1)
In: New York => Exit: New York (lat: x1 lon: y1)
To: Pearl Street, New York, USA => Exit: Pearl Street (lat: x2 lon: y2)
In: Pearl Street, USA => Out: Pearl Street (lat: x2 lon: y2), Pearl Street (lat: x3 lon: y3)
B: Pearl Street => Out: Pearl Street (lat: x2 lon: y2), Pearl Street (lat: x3 lon: y3)
B: 103 Alkazam, New York, USA => Exit: New York (lat: x1 lon: y1)

In 6 above, New York was returned because no location was found at 103 Alkazam, New York, USA , but he could have found New York, USA .

I initially thought about creating a tree representing a hierarchy relationship where siblings are sorted alphabetically. It could be like this: -

  GLOBAL | --------------------------------------------- | | ... USA --------------- | | ... CALIFORNIA NEW YORK | | ----------- ------------- | |.. | |.... PEARL STREET PEARL STREET

But the problem was that the user could provide an incomplete address, as in 2, 4 and 5.

So, I thought about how to use the search tree and save the full address in each node. But this is also bad, because: -

This will store data with a high degree of redundancy in each node. Since it will be really big data, therefore questions of conservation of space.
He will not be able to use the fact that the user has narrowed the search space.

I have one additional requirement. I need to detect spelling errors. I assume that this should be considered as a separate issue and can treat each node as common strings.

Update 1

A little development. The input will be a list in which the subscript element is the parent of the element in the higher index; and they, of course, may or may not be immediate parents or children. Therefore, for request 1, the entry will be ["USA", "NEW YORK"] . So, great that USA, New York does not return a result.

The user should be able to find the building if he has an address, and our data is so detailed.

Update 2 (skip case)

If a user requests Pearl Street, USA , then our algo should be able to find the address, since he knows Pearl Street has New York as a parent, and USA is his parent.

Update 3 (redundant case)

Suppose a user requests 101 C, Alley A, Pearl Street, New York . Also suppose our data knows about 101 C , but not about Alley A According to him, 101 C is a direct child of Pearl Street . Even so, he should be able to find the address.

+8

python geocoding large-data openstreetmap

Applegrew Apr 12 '12 at 12:56

source share

3 answers

how to use a key value storage map and full-text search.

for location bar
for location_level and lat & lon values.
search by:
- Split custom input string into separate location words (not just a comma)
- search every word on the map
- return lat & lon least level location

python.dict, memcached, mongodb .... will meet ur needs.

If you have too many location words, separate location_level as a new map, the two searches will speed up.
forget location levels, trace as full-text search
huge data? hash key for short string or numbers

Some questions:

how to store data in a database
how to start a search tree from data, if any
how to extend / change the search tree at runtime
fault tolerance for input / storage
storage space> speed? or speed> storage?

therefore a more usable test case for user input

 101 C, Time Square, New York, US 101 C, Pearl street, New York, US 101 C, Time Square, SomeCity, Mars 101 C 101 C, US 101 C, New York, US 101 C, New York, Time Square, US North Door, 101 C, Time Square, New York, US South Door, 101 C, Time Square, New York, US

for the situation :

fast speed with huge data;
full fault tolerant;
easy to configure with storage and runtime

better solution :( also complex-est)

flat key card vault
full text search
- or hash key with B-tree search

Your program / website may possibly run as fast as google.

+1

fanlix Apr 12 '12 at 13:15

source share

If you try to create a data structure for this problem, I think you will have data redundancy. Rather, you can use tree / graph and try to implement a search algorithm that searches for words from user input with node values. Fuzzy matching can help you generate the most likely results, and you can suggest / show users several of them depending on the level of reliability of their similarities.

It can also take care of errors, etc.

0

0xc0de Apr 12 '12 at 14:18

source share

Applegrew · Accepted Answer · 2012-04-15T15:19:33+0000

Thanks to everyone for their answers, their answers were useful, but did not touch everything that I needed. Finally, I found an approach that took care of all my affairs. The approach is a modified version of what I suggested in the question.

Basic approach

Here I will refer to something called "node", this is an object of the class that will contain geo-information, such as the latitude of the object, longitude, and maybe also the measurement, and the full address.

If the address of the object is "101 C, Pearl Street, New York, USA", then this means that our data structure will contain at least four nodes for - "101 C", "Pearl Street", "New York" 'and' USA '. Each node will have a name and one part address . For “101 C,” the name will be “101 C,” and the address will be “Pearl Street, New York, USA.”

The main idea is to have a search tree for these nodes, where the node name will be used as the key for the search. We can get some matches, so later we need to rank the results on how well the node address matches with the requested one.

  EARTH | --------------------------------------------- | | USA INDIA | | --------------------------- WEST BENGAL | | | NEW YORK CALIFORNIA KOLKATA | | | --------------- PEARL STREET BARA BAZAR | | | PEARL STREET TIME SQUARE 101 C | | 101 C 101 C

Suppose we have geographic data as described above. Thus, searching for “101 C, NEW YORK” will not only return the “101 C” nodes to “NEW YORK”, but also “INDIA”. This is due to the fact that it uses only name , i.e. '101 C'. Later we can evaluate the quality of the result by measuring the difference of the node address at the requested address. We do not use exact matching, as the user is allowed to provide an incomplete address, as in this case.

Classification Search Result

To evaluate the quality of the result, we can use the longest common subsequence . In this algon, cases of "omission" and "excess" are well taken into account.

Better if I let the code talk. The following is a Python implementation designed for this purpose.

 def _lcs_diff_cent(s1, s2): """ Calculates Longest Common Subsequence Count Difference in percentage between two strings or lists. LCS reference: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem. Returns an integer from 0-100. 0 means that `s1` and `s2` have 0% difference, ie they are same. """ m = len(s1) n = len(s2) if s1 == s2: return 0 if m == 0: # When user given query is empty then that is like '*'' (match all) return 0 if n == 0: return 100 matrix = [[0] * (n + 1)] * (m + 1) for i in range(1, m+1): for j in range(1, n+1): if s1[i-1] == s2[j-1]: matrix[i][j] = matrix[i-1][j-1] + 1 else: matrix[i][j] = max(matrix[i][j-1], matrix[i-1][j]) return int( ( 1 - float(matrix[m][n]) / m ) * 100 )

Optimized approach

I followed the aforementioned (basic) approach, because it forced redundancy, and it could not reduce the use of the fact that if the user provided "USA" in his request, we do not need to search for nodes in "INDIA".

This optimized approach greatly addresses both of these issues. The solution is not to have one large search tree. We can break up the search space into the words "USA" and "India." Later we can further redo these search spaces on a state-by-principle basis. This is what I call cutting.

In the diagram below - SearchSlice is a “slice” and SearchPool is a search tree.

  SearchSlice() | --------------------------------------------- | | SearchSlice(USA) SearchSlice(INDIA) | | --------------------------- SearchPool(WEST BENGAL) | | | SearchPool(NEW YORK) SearchPool(CALIFORNIA) |- KOLKATA | | |- BARA BAZAR, KOLKATA |- PEARL STREET |- PEARL STREET |- 101 C, BARA BAZAR, KOLKATA |- TIME SQUARE |- 101 C, PEARL STREET |- 101 C, TIME SQUARE

A few key points to note above ...

Each slice represents only one level of depth. Well, that is not so obvious.
The name of the sliced level does not appear in the address of its children. For example, SearchSlice(USA) supports state slicing in 'USA'. So, the nodes in the "NEW YORK" section do not include the name "NEW YORK" or "USA" in their address . The same goes for other regions. The hierarchy relation implicitly determines the full address.
The '101 C' address also includes its parent name , since they are not chopped.

Scaling options

If there is a bucket (pool), there is the possibility of implicit scaling. We (say) divide geodata for "USA" into two groups. Both can be in different systems. So, great if the pool "NEW YORk" is in system A, but the CALIFORNIA pool is in system B, since they do not share any data, except for the parents, of course.

Here is a warning. We must duplicate parents who will always be a slice. Since slices are designed for a limited number, therefore, the hierarchy will not be too deep, so it should not be too redundant to duplicate them.

Working code

Please refer to my GitHub for a working Python demo code .

What data structure should I use for geocoding?

Basic approach

Classification Search Result

Optimized approach

Scaling options

Working code

More articles: