A good baseline, though probably impractical because of its relatively high computational cost and, more importantly, its tendency to produce many false positives, would be a generic string-distance algorithm such as Levenshtein (edit) distance.
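For a concrete point of reference, such a baseline can be as little as scoring every candidate pair with a generic similarity ratio; the sketch below uses Python's standard difflib, and the sample addresses are purely illustrative.

```python
from difflib import SequenceMatcher

def baseline_similarity(addr_a: str, addr_b: str) -> float:
    """Generic string-distance baseline: 0.0 (unrelated) .. 1.0 (identical)."""
    return SequenceMatcher(None, addr_a.lower(), addr_b.lower()).ratio()

# The same address written two ways scores well below 1.0, while two
# different addresses on the same street can score deceptively high --
# hence the false positives mentioned above.
print(baseline_similarity("123 W. Main Dr, Springfield", "123 West Main Drive, Springfield"))
print(baseline_similarity("123 W. Main Dr, Springfield", "125 W. Main Dr, Springfield"))
```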
Depending on the required level of accuracy (which, BTW, should be specified in terms of recall and precision, i.e. expressing, in general terms, whether it is more important to miss a correlation or to falsely report one), a homegrown process based on [some] of the following heuristics and ideas may do the trick:
- Tokenize the input, i.e. treat the input as an array of words rather than a single string (see the sketch after this list).
- The tokenization should also retain line-number information.
- Normalize the input using a short dictionary of common substitutions (for example, "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.).
- Identify (a bit like tagging, as in POS tagging) the nature of some tokens (for example, ZIP code and extended ZIP+4 code, as well as city).
- Look up some of these tokens (for example, a relatively short database table may contain all the cities/towns in the target area).
- Identify (look up) some domain-specific entities (if all/many of the addresses relate to people in the legal profession, a lookup of law-firm names or of federal buildings may help).
- As a rule, add more weight to the tokens that come from the last line of the address.
- Put more (or less) weight on tokens of a particular type (for example, "Drive", "Street", "Court" should count much less than the tokens that precede them).
- Consider a modified SOUNDEX algorithm to help with normalization.
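To make the first few bullets concrete, here is a minimal Python sketch of tokenization with line numbers, dictionary-based normalization, and rough token-type tagging. The dictionary entries, regular expressions, and function names are illustrative assumptions rather than a prescribed vocabulary (SOUNDEX itself is sketched further below).

```python
import re

# Illustrative substitution dictionary; a real one would be larger and
# position-aware (e.g. "dr" expands to "drive" only at the end of a line).
NORMALIZE = {
    "dr": "drive", "st": "street", "ct": "court",
    "w": "west", "e": "east",
    "jack": "john", "bill": "william",
}

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")   # 5-digit ZIP or extended ZIP+4

def tokenize(address: str):
    """Split a multi-line address into (line_number, token) pairs."""
    tokens = []
    for line_no, line in enumerate(address.splitlines()):
        for word in re.findall(r"[A-Za-z0-9-]+\.?", line):
            tokens.append((line_no, word))
    return tokens

def normalize(token: str) -> str:
    """Lowercase, strip trailing punctuation, expand common abbreviations."""
    t = token.lower().rstrip(".,")
    return NORMALIZE.get(t, t)

def tag(token: str) -> str:
    """Very rough POS-like tagging of the token type."""
    if ZIP_RE.match(token):
        return "ZIP"
    if token.isdigit():
        return "NUMBER"
    return "WORD"

for line_no, tok in tokenize("123 W. Main Dr\nSpringfield, IL 62704"):
    print(line_no, normalize(tok), tag(tok))
```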
Based on the above, implement a rule-based evaluator. Roughly, the rules can be implemented as visitors over the tree/array structure into which the input is initially parsed (the visitor design pattern).
The advantage of a rule-based framework is that each heuristic lives in its own function, and the rules can be prioritized, i.e. some rules placed early in the chain, which allows the evaluation to be aborted early on a few strong heuristics (for example, different City => correlation = 0, confidence level = 95%, etc.).
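As a rough illustration of such a prioritized chain (written as a plain list of rule functions here, leaving out the visitor machinery for brevity), assuming the addresses have already been parsed into simple dicts; the rule names, field names, and score/confidence values are made up for the example:

```python
def different_city(a, b):
    """Strong early-exit heuristic: different cities => no correlation."""
    if a.get("city") and b.get("city") and a["city"] != b["city"]:
        return (0.0, 0.95)        # (correlation, confidence)
    return None                   # no verdict: fall through to the next rule

def same_zip_and_house_number(a, b):
    """Weaker, illustrative heuristic placed further down the chain."""
    if a.get("zip") == b.get("zip") and a.get("number") == b.get("number"):
        return (0.9, 0.70)
    return None

RULES = [different_city, same_zip_and_house_number]   # ordered by priority

def evaluate(a, b, default=(0.5, 0.30)):
    """Run the rules in priority order; the first verdict short-circuits."""
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict
    return default

print(evaluate({"city": "springfield", "zip": "62704", "number": "123"},
               {"city": "shelbyville", "zip": "62565", "number": "123"}))
# -> (0.0, 0.95): the different-city rule aborts the evaluation early
```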
An important consideration when searching for correlations is the need, a priori, to compare each individual item (here an address) against every other item, which amounts to roughly 1/2 n^2 comparisons over the whole set. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized ...), and possibly also with a digest/sort key that can serve as a [very rough] indicator of possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
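A sketch of that digest/sort-key idea follows, with a compact (unmodified) Soundex and a hypothetical `blocking_key` helper; the field names and the bucketing approach are assumptions for illustration. Only pairs that share a key need the full rule-based comparison, which cuts the 1/2 n^2 work down to within-bucket comparisons.

```python
from collections import defaultdict

def soundex(word: str) -> str:
    """Classic 4-character Soundex code (a modified variant could be swapped in)."""
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
             "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    if not word:
        return "0000"
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":        # vowels reset the previous code, 'h'/'w' do not
            prev = code
    return (result + "000")[:4]

def blocking_key(zip5: str, primary_name: str) -> str:
    """Rough pre-sort key: 5-digit ZIP + Soundex of the 'primary' name."""
    return f"{zip5}-{soundex(primary_name)}"

def build_buckets(addresses):
    """Group pre-parsed addresses so full comparisons run only within a bucket."""
    buckets = defaultdict(list)
    for addr in addresses:                     # addr is a pre-parsed dict
        buckets[blocking_key(addr["zip"], addr["street"])].append(addr)
    return buckets

print(blocking_key("62704", "Main"))           # -> "62704-M500"
```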