A good baseline, though probably impractical because of its relatively high computational cost and, more importantly, its tendency to produce many false positives, would be a generic string-distance algorithm such as Levenshtein (edit) distance.
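For a concrete point of reference, such a baseline can be as little as scoring every candidate pair with a generic similarity ratio; the sketch below uses Python's standard difflib, and the sample addresses are purely illustrative.

```python
from difflib import SequenceMatcher

def baseline_similarity(addr_a: str, addr_b: str) -> float:
    """Generic string-distance baseline: 0.0 (unrelated) .. 1.0 (identical)."""
    return SequenceMatcher(None, addr_a.lower(), addr_b.lower()).ratio()

# The same address written two ways scores well below 1.0, while two
# different addresses on the same street can score deceptively high --
# hence the false positives mentioned above.
print(baseline_similarity("123 W. Main Dr, Springfield", "123 West Main Drive, Springfield"))
print(baseline_similarity("123 W. Main Dr, Springfield", "125 W. Main Dr, Springfield"))
```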
Depending on the required level of accuracy (which, BTW, should be specified in terms of recall and precision, i.e. expressing, in general terms, whether it is more important to miss a correlation or to falsely report one), a homegrown process based on [some] of the following heuristics and ideas may do the trick:
- Tokenize the input, i.e. treat the input as an array of words rather than a single string (see the sketch after this list).
- The tokenization should also retain line-number information.
- Normalize the input using a short dictionary of common substitutions (for example, "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.).
- Identify (a bit like tagging, as in POS tagging) the nature of some tokens (for example, ZIP code and extended ZIP+4 code, as well as city).
- Look up some of these tokens (for example, a relatively short database table may contain all the cities/towns in the target area).
- Identify (look up) some domain-specific entities (if all/many of the addresses relate to people in the legal profession, a lookup of law-firm names or of federal buildings may help).
- As a rule, add more weight to the tokens that come from the last line of the address.
- Put more (or less) weight on tokens of a particular type (for example, "Drive", "Street", "Court" should count much less than the tokens that precede them).
- Consider a modified SOUNDEX algorithm to help with normalization.
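To make the first few bullets concrete, here is a minimal Python sketch of tokenization with line numbers, dictionary-based normalization, and rough token-type tagging. The dictionary entries, regular expressions, and function names are illustrative assumptions rather than a prescribed vocabulary (SOUNDEX itself is sketched further below).

```python
import re

# Illustrative substitution dictionary; a real one would be larger and
# position-aware (e.g. "dr" expands to "drive" only at the end of a line).
NORMALIZE = {
    "dr": "drive", "st": "street", "ct": "court",
    "w": "west", "e": "east",
    "jack": "john", "bill": "william",
}

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")   # 5-digit ZIP or extended ZIP+4

def tokenize(address: str):
    """Split a multi-line address into (line_number, token) pairs."""
    tokens = []
    for line_no, line in enumerate(address.splitlines()):
        for word in re.findall(r"[A-Za-z0-9-]+\.?", line):
            tokens.append((line_no, word))
    return tokens

def normalize(token: str) -> str:
    """Lowercase, strip trailing punctuation, expand common abbreviations."""
    t = token.lower().rstrip(".,")
    return NORMALIZE.get(t, t)

def tag(token: str) -> str:
    """Very rough POS-like tagging of the token type."""
    if ZIP_RE.match(token):
        return "ZIP"
    if token.isdigit():
        return "NUMBER"
    return "WORD"

for line_no, tok in tokenize("123 W. Main Dr\nSpringfield, IL 62704"):
    print(line_no, normalize(tok), tag(tok))
```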
Based on the above, implement a rule-based evaluator. Roughly, the rules can be implemented as visitors over the tree/array structure into which the input is initially parsed (the visitor design pattern).
The advantage of a rule-based framework is that each heuristic lives in its own function, and the rules can be prioritized, i.e. some rules placed early in the chain, which allows the evaluation to be aborted early on a few strong heuristics (for example, different City => correlation = 0, confidence level = 95%, etc.).
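As a rough illustration of such a prioritized chain (written as a plain list of rule functions here, leaving out the visitor machinery for brevity), assuming the addresses have already been parsed into simple dicts; the rule names, field names, and score/confidence values are made up for the example:

```python
def different_city(a, b):
    """Strong early-exit heuristic: different cities => no correlation."""
    if a.get("city") and b.get("city") and a["city"] != b["city"]:
        return (0.0, 0.95)        # (correlation, confidence)
    return None                   # no verdict: fall through to the next rule

def same_zip_and_house_number(a, b):
    """Weaker, illustrative heuristic placed further down the chain."""
    if a.get("zip") == b.get("zip") and a.get("number") == b.get("number"):
        return (0.9, 0.70)
    return None

RULES = [different_city, same_zip_and_house_number]   # ordered by priority

def evaluate(a, b, default=(0.5, 0.30)):
    """Run the rules in priority order; the first verdict short-circuits."""
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict
    return default

print(evaluate({"city": "springfield", "zip": "62704", "number": "123"},
               {"city": "shelbyville", "zip": "62565", "number": "123"}))
# -> (0.0, 0.95): the different-city rule aborts the evaluation early
```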
An important consideration when searching for correlations is the need, a priori, to compare each individual item (here an address) against every other item, which amounts to roughly 1/2 n^2 comparisons over the whole set. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized ...), and possibly also with a digest/sort key that can serve as a [very rough] indicator of possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
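A sketch of that digest/sort-key idea follows, with a compact (unmodified) Soundex and a hypothetical `blocking_key` helper; the field names and the bucketing approach are assumptions for illustration. Only pairs that share a key need the full rule-based comparison, which cuts the 1/2 n^2 work down to within-bucket comparisons.

```python
from collections import defaultdict

def soundex(word: str) -> str:
    """Classic 4-character Soundex code (a modified variant could be swapped in)."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    if not word:
        return "0000"
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":        # vowels reset the previous code, 'h'/'w' do not
            prev = code
    return (result + "000")[:4]

def blocking_key(zip5: str, primary_name: str) -> str:
    """Rough pre-sort key: 5-digit ZIP + Soundex of the 'primary' name."""
    return f"{zip5}-{soundex(primary_name)}"

def build_buckets(addresses):
    """Group pre-parsed addresses so full comparisons run only within a bucket."""
    buckets = defaultdict(list)
    for addr in addresses:                     # addr is a pre-parsed dict
        buckets[blocking_key(addr["zip"], addr["street"])].append(addr)
    return buckets

print(blocking_key("62704", "Main"))           # -> "62704-M500"
```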