Thanks to everyone for their answers, their answers were useful, but did not touch everything that I needed. Finally, I found an approach that took care of all my affairs. The approach is a modified version of what I suggested in the question.
Basic approach
Here I will refer to something called "node", this is an object of the class that will contain geo-information, such as the latitude of the object, longitude, and maybe also the measurement, and the full address.
If the address of the object is "101 C, Pearl Street, New York, USA", then this means that our data structure will contain at least four nodes for - "101 C", "Pearl Street", "New York" 'and' USA '. Each node will have a name and one part address . For “101 C,” the name will be “101 C,” and the address will be “Pearl Street, New York, USA.”
The main idea is to have a search tree for these nodes, where the node name will be used as the key for the search. We can get some matches, so later we need to rank the results on how well the node address matches with the requested one.
EARTH | --------------------------------------------- | | USA INDIA | | --------------------------- WEST BENGAL | | | NEW YORK CALIFORNIA KOLKATA | | | --------------- PEARL STREET BARA BAZAR | | | PEARL STREET TIME SQUARE 101 C | | 101 C 101 C
Suppose we have geographic data as described above. Thus, searching for “101 C, NEW YORK” will not only return the “101 C” nodes to “NEW YORK”, but also “INDIA”. This is due to the fact that it uses only name , i.e. '101 C'. Later we can evaluate the quality of the result by measuring the difference of the node address at the requested address. We do not use exact matching, as the user is allowed to provide an incomplete address, as in this case.
Classification Search Result
To evaluate the quality of the result, we can use the longest common subsequence . In this algon, cases of "omission" and "excess" are well taken into account.
Better if I let the code talk. The following is a Python implementation designed for this purpose.
def _lcs_diff_cent(s1, s2): """ Calculates Longest Common Subsequence Count Difference in percentage between two strings or lists. LCS reference: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem. Returns an integer from 0-100. 0 means that `s1` and `s2` have 0% difference, ie they are same. """ m = len(s1) n = len(s2) if s1 == s2: return 0 if m == 0:
Optimized approach
I followed the aforementioned (basic) approach, because it forced redundancy, and it could not reduce the use of the fact that if the user provided "USA" in his request, we do not need to search for nodes in "INDIA".
This optimized approach greatly addresses both of these issues. The solution is not to have one large search tree. We can break up the search space into the words "USA" and "India." Later we can further redo these search spaces on a state-by-principle basis. This is what I call cutting.
In the diagram below - SearchSlice is a “slice” and SearchPool is a search tree.
SearchSlice() | --------------------------------------------- | | SearchSlice(USA) SearchSlice(INDIA) | | --------------------------- SearchPool(WEST BENGAL) | | | SearchPool(NEW YORK) SearchPool(CALIFORNIA) |- KOLKATA | | |- BARA BAZAR, KOLKATA |- PEARL STREET |- PEARL STREET |- 101 C, BARA BAZAR, KOLKATA |- TIME SQUARE |- 101 C, PEARL STREET |- 101 C, TIME SQUARE
A few key points to note above ...
- Each slice represents only one level of depth. Well, that is not so obvious.
- The name of the sliced level does not appear in the address of its children. For example,
SearchSlice(USA) supports state slicing in 'USA'. So, the nodes in the "NEW YORK" section do not include the name "NEW YORK" or "USA" in their address . The same goes for other regions. The hierarchy relation implicitly determines the full address. - The '101 C'
address also includes its parent name , since they are not chopped.
Scaling options
If there is a bucket (pool), there is the possibility of implicit scaling. We (say) divide geodata for "USA" into two groups. Both can be in different systems. So, great if the pool "NEW YORk" is in system A, but the CALIFORNIA pool is in system B, since they do not share any data, except for the parents, of course.
Here is a warning. We must duplicate parents who will always be a slice. Since slices are designed for a limited number, therefore, the hierarchy will not be too deep, so it should not be too redundant to duplicate them.
Working code
Please refer to my GitHub for a working Python demo code .