Compare two addresses that are not in the standard format

I need to compare addresses from two tables and get an identifier if the address matches. Each table has three columns: house number, street, and state. The addresses are not in a standard format in either table. There are approximately 50,000 rows that I need to scan through.

The same street type is written in different ways: Ave, Ave., Aven, Avenue; St, Str., Street; Ln, Lane; Pl, Place; Cir, Circle. Any combination with a period, comma, spaces, or hyphen can occur. I was thinking about combining all three columns. What would be the best way to do this in SQL or PL/SQL? For example:

table1

HNO    STR           State
-----  ------------  -----
12     6th Ave       NY
10     3rd Aven      SD
12-11  Fouth St      NJ
11     sixth Lane    NY
A23    Main Parkway  NY
A-21   124 th Str.   VA

table2

id  HNO   STR           state
--  ----  ------------  -----
1   12    6 Ave.        NY
13  10    3 Avenue      SD
15  1121  Fouth Street  NJ
33  23    9th Lane      NY
24  X23   Main Cir.     NY
34  A1    124th Street  VA
+4
4 answers

There is no easy way to achieve what you want. There is expensive software (google for "address standardization software") that can do this, but rarely 100% automatically.

What this type of software does is take your data, use complex heuristics to try to figure out the "official" address, and then return it (sometimes with a confidence score for the result, sometimes as a list of candidates sorted by confidence).

For a small percentage of the data, the software simply does not work, and you will have to fix those records by hand.
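To make the "list of candidates sorted by confidence" idea concrete, here is a minimal Python sketch. It is only an illustration of the output shape: commercial tools use far richer heuristics and reference data. The function name `rank_candidates` and the reference list are made up for this example, and the score is simply the similarity ratio from the standard-library `difflib` module.

```python
from difflib import SequenceMatcher


def rank_candidates(raw, candidates):
    """Return (candidate, score) pairs, best match first; score is in [0, 1]."""
    scored = [
        (cand, SequenceMatcher(None, raw.upper(), cand.upper()).ratio())
        for cand in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up reference list standing in for an "official" address database.
official = ["MAIN PARKWAY", "6TH AVENUE", "9TH LANE"]
for candidate, score in rank_candidates("6th Ave", official):
    print(f"{candidate:15s} {score:.2f}")
```

Low-scoring or near-tied candidates are the ones that would end up in the manual pile.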

+1

Oracle has a built-in UTL_Match package that has an edit_distance function (based on the Levenshtein algorithm; it tells you how many changes are needed to turn one string into the other). More information about this package / function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm

You will need to decide whether to compare each column separately or to concatenate the columns and compare the result, and what a reasonable threshold is. For example, you could manually review any pair with an edit distance of less than 8 on the concatenated values.

Let me know if you want any help with the syntax. The edit_distance function just takes two varchar2 arguments (the strings you want to compare) and returns a number.

This is not an ideal solution. If you set the threshold high, you will have a lot of pairs to check manually and reject; if you set it too low, you will miss some matches. But it may be about the best you can do if you want a relatively simple solution.
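To show what the distance and the threshold do, here is a small Python sketch of the Levenshtein calculation. It is not Oracle code: `edit_distance` below is a plain re-implementation of the algorithm for illustration, and `classify` with its `review_below` parameter is a hypothetical helper that reads "less than 8" as "send to manual review".

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def classify(addr1, addr2, review_below=8):
    """Return 'match', 'review' or 'no match' for two concatenated addresses."""
    distance = edit_distance(addr1.upper(), addr2.upper())
    if distance == 0:
        return "match"
    return "review" if distance < review_below else "no match"


# Concatenated house number, street and state from the question's sample rows.
print(classify("10 3rd Aven SD", "10 3 Avenue SD"))
```

With short strings like these, a threshold of 8 lets through many pairs that are clearly different addresses, which is the trade-off described above.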

+1

The way we did this for one of our applications was to use a third-party address normalization API (for example, Pitney Bowes), normalize each address (the address being a combination of street address, city, state and zip code), and create a hash code in T-SQL for the normalized address. To compare an incoming address, do the same and compare the two hashes; if they are equal, you have a match.
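The normalize-then-hash idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: `normalize` is a trivial stand-in for the third-party API call (it only uppercases and strips punctuation), SHA-256 from `hashlib` stands in for whatever hash function the database offers, and the sample address is made up.

```python
import hashlib
import re


def normalize(address):
    """Hypothetical stand-in for a third-party normalization API call."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())  # punctuation -> space
    return " ".join(cleaned.split())                        # collapse whitespace


def address_hash(address):
    """SHA-256 hex digest of the normalized address."""
    return hashlib.sha256(normalize(address).encode("utf-8")).hexdigest()


# Two spellings of the same made-up address produce the same hash.
stored = address_hash("12 Main St., Albany, NY 12207")
incoming = address_hash("12  main st albany ny 12207")
print(stored == incoming)
```

Note that a hash comparison is all-or-nothing: it only works as well as the normalization step, since any remaining difference gives a completely different hash.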

+1

You can create a cursor that first groups the candidate rows where the house number and city are equal.

In a loop, you can split the street string into words using instr and substr, with chr(32) (the space character) as the separator.

After that, you can compare the substrings one by one, treating equivalent forms as equal: for example, the number 6th = 6, and street = str.
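The token-by-token comparison could look roughly like this, sketched in Python rather than PL/SQL to keep it short. The lookup tables `SUFFIXES` and `ORDINAL_WORDS` and both function names are made up for this example and cover only the variants from the question; a real list would be much longer.

```python
import re

# Illustrative equivalence tables (assumptions, not a complete list).
SUFFIXES = {
    "AVE": "AVENUE", "AVEN": "AVENUE",
    "ST": "STREET", "STR": "STREET",
    "LN": "LANE", "PL": "PLACE", "CIR": "CIRCLE",
}
ORDINAL_WORDS = {
    "FIRST": "1", "SECOND": "2", "THIRD": "3", "FOURTH": "4", "FIFTH": "5",
    "SIXTH": "6", "SEVENTH": "7", "EIGHTH": "8", "NINTH": "9", "TENTH": "10",
}


def canonical_tokens(street):
    """Split a street string on spaces and map each token to a canonical form."""
    text = re.sub(r"[.,\-]", " ", street.upper())             # punctuation -> space
    text = re.sub(r"(\d+)(?:\s*TH|ST|ND|RD)\b", r"\1", text)  # 6TH, 124 TH, 3RD -> digits
    tokens = []
    for token in text.split():
        token = ORDINAL_WORDS.get(token, token)   # SIXTH -> 6
        tokens.append(SUFFIXES.get(token, token))  # STR -> STREET
    return tokens


def same_street(a, b):
    """True when both strings reduce to the same canonical token list."""
    return canonical_tokens(a) == canonical_tokens(b)


# Street values taken from the question's two sample tables.
for left, right in [("6th Ave", "6 Ave."), ("124 th Str.", "124th Street"),
                    ("Main Parkway", "Main Cir.")]:
    print(left, "|", right, "|", same_street(left, right))
```

House numbers such as 12-11 versus 1121 would need their own rule; this sketch only handles the street column.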

Good luck

0
