Extract address from string

Let's say I have this line:

<div>john doe is nice guy btw 8240 E. Marblehead Way 92808 is also</div> 

or this line:

 <div>sky being blue? in the world is true? 024 Brea Mall Brea, California 92821 jackfroast nipping on the firehead</div> 

How can I extract the address from one of these lines? I assume it will take some kind of regex, right?

I tried searching the Internet for a solution using JavaScript or PHP, but to no avail. And no other post here on Stack Overflow (as far as I know) gives a solution that uses jQuery and/or JavaScript and/or PHP. (The closest is "Parse usable Street Address, City, State, Zip from a string", which does NOT have any code in the thread for extracting the zip code from the string.)

Can someone point me in the right direction? How can I do this in jQuery or JavaScript or PHP?

+4
6 answers

Tried this on twelve different lines that looked like yours, and it worked just fine:

 function str_to_address($context) {
     $context_parts = array_reverse(explode(" ", $context));
     $zipKey = "";
     foreach($context_parts as $key=>$str) {
         if(strlen($str)===5 && is_numeric($str)) {
             $zipKey = $key;
             break;
         }
     }
     $context_parts_cleaned = array_slice($context_parts, $zipKey);
     $context_parts_normalized = array_reverse($context_parts_cleaned);
     $houseNumberKey = "";
     foreach($context_parts_normalized as $key=>$str) {
         if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) {
             $houseNumberKey = $key;
             break;
         }
     }
     $address_parts = array_slice($context_parts_normalized, $houseNumberKey);
     $string = implode(' ', $address_parts);
     return $string;
 }

This assumes a house number of at least two and at most five digits (the check is strlen($str)>1 && strlen($str)<6). It also assumes that the zip code is not in the "extended" form (for example, 12345-6789). However, this can easily be changed to accept that format (a regular expression would be a good option here, something like (\d{5}-\d{4})).

That said, relying on a regex to parse user-entered data is not a good idea here, because we simply do not know what the user will enter; there was (as one can assume) no validation on the input.
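As a rough illustration of that change, here is a minimal sketch of a token check that accepts both the five-digit and the extended form. It is written in JavaScript rather than PHP, since the question allows either, and the names ZIP_PATTERN and isZipToken are made up for this example.

```javascript
// Hypothetical helper: matches "12345" and the extended "12345-6789" form.
const ZIP_PATTERN = /^\d{5}(?:-\d{4})?$/;

function isZipToken(token) {
  return ZIP_PATTERN.test(token);
}

console.log(isZipToken("92808"));      // true
console.log(isZipToken("12345-6789")); // true
console.log(isZipToken("8240"));       // false
```

The anchors (^ and $) matter here: without them a ten-digit phone number would also pass, because it contains five consecutive digits.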

Walking through the code and logic, starting with building an array from the context and grabbing the zip:

 // split the context (for example, a sentence) into an array,
 // so we can loop through it.
 // we reverse the array, as we're going to grab the zip first.
 // why? we KNOW the zip is 5 characters long*.
 $context_parts = array_reverse(explode(" ", $context));

 // we're going to store the array index of the zip code for later use
 $zipKey = "";

 // foreach iterates over an object given the params,
 // in this case it's like doing...
 // for each value of $context_parts ($str), and each index ($key)
 foreach($context_parts as $key=>$str) {
     // if $str is 5 chars long, and numeric...
     // an incredibly lazy check for a zip code...
     if(strlen($str)===5 && is_numeric($str)) {
         $zipKey = $key;
         // we have what we want, so we can leave the loop with break
         break;
     }
 }

Do some housekeeping so that we have a cleaner array to grab the house number from:

 // remove junk from $context_parts, since we don't
 // need stuff after the zip
 $context_parts_cleaned = array_slice($context_parts, $zipKey);

 // since the house number comes first, let's go back to the start
 $context_parts_normalized = array_reverse($context_parts_cleaned);

And then let's grab the house number, using the same basic logic as for the zip code:

 $houseNumberKey = "";
 foreach($context_parts_normalized as $key=>$str) {
     if(strlen($str)>1 && strlen($str)<6 && is_numeric($str)) {
         $houseNumberKey = $key;
         break;
     }
 }

 // we probably have the parts we need for the address.
 // let's do some more cleaning
 $address_parts = array_slice($context_parts_normalized, $houseNumberKey);

 // and build the string again, from the address
 $string = implode(' ', $address_parts);

 // and return the string
 return $string;
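
Since the question also asks about JavaScript, the same steps can be sketched roughly as follows. This is an illustrative port, not the code above: the name strToAddress is made up, it uses digit-only regex checks where the PHP uses is_numeric (which also accepts strings such as "12.50"), and it returns null when no zip or house number is found, a case the PHP version does not handle explicitly.

```javascript
// Hypothetical JavaScript port of the walkthrough above.
function strToAddress(context) {
  const parts = context.split(" ");

  // Scan from the end for the last five-digit token: the zip.
  let zipIndex = -1;
  for (let i = parts.length - 1; i >= 0; i--) {
    if (/^\d{5}$/.test(parts[i])) {
      zipIndex = i;
      break;
    }
  }
  if (zipIndex === -1) return null; // assumption: no zip means no address

  // Drop everything after the zip.
  const upToZip = parts.slice(0, zipIndex + 1);

  // The first token of two to five digits is taken as the house number.
  const houseIndex = upToZip.findIndex((token) => /^\d{2,5}$/.test(token));
  if (houseIndex === -1) return null;

  return upToZip.slice(houseIndex).join(" ");
}

console.log(
  strToAddress("<div>john doe is nice guy btw 8240 E. Marblehead Way 92808 is also</div>")
); // "8240 E. Marblehead Way 92808"
```

Like the PHP version, this returns only the zip when the zip is the sole number in the string, because a five-digit zip also passes the house-number test.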
+19

Regular expressions are used to match patterns, so you need to know which pattern you are looking for. From the two examples you gave, I would look for a number, followed by text that ends with a five-digit number.

All addresses would have to be in this format. You cannot just magically extract addresses from an arbitrary string.
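
One way to write down the pattern just described (a number, then text ending in a five-digit number) is sketched below in JavaScript. The names ADDRESS_PATTERN and extractAddress are invented for this example, and the limit of one to five digits for the leading number is an assumption, not something stated in the question.

```javascript
// Assumed pattern: a 1-5 digit number, then as little text as possible,
// then a standalone five-digit number (the zip).
const ADDRESS_PATTERN = /\b\d{1,5}\b.*?\b\d{5}\b/;

function extractAddress(text) {
  const match = text.match(ADDRESS_PATTERN);
  return match ? match[0] : null;
}

console.log(
  extractAddress("<div>john doe is nice guy btw 8240 E. Marblehead Way 92808 is also</div>")
); // "8240 E. Marblehead Way 92808"
```

The word boundaries (\b) keep the pattern from starting or ending inside a longer run of digits such as a phone number. It still only works when every address really has this shape.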

+2

If all your addresses start and end with numbers, you can use this regular expression to extract the necessary data:

 /[0-9].+[0-9]/gi 

JavaScript example:

 "<div>john doe is nice guy btw 8240 E. Marblehead Way 92808 is also</div>".match(/[0-9].+[0-9]/gi)
 // ["8240 E. Marblehead Way 92808"]

 "<div>sky being blue? in the world is true? 024 Brea Mall Brea, California 92821 jackfroast nipping on the firehead</div>".match(/[0-9].+[0-9]/gi)
 // ["024 Brea Mall Brea, California 92821"]

For the new example, which contains a phone number, you can use:

 /[0-9].*[0-9]/gi 

Javascript example:

 "john doe 7143138656 is 8240 e marblehead way 92808".match(/[0-9].*[0-9]/gi) // ["7143138656 is 8240 e marblehead way 92808"] 

But this will only help if the string contains information that follows such a pattern. If you really need robust address extraction, you will have to go further and build a more powerful parser.

You can start with a text search for targeted keywords, then filter the paragraph to pull out the information you are looking for.

This is not an easy problem, but it can be done. You can use more than one regular expression for different kinds of matches, but if the address has no pattern, regexes will be useless, and at that point you will need to change your approach.

+2

It is a common "mistake" to try to parse everything with regular expressions because they are convenient. However, regular expressions are not the answer to everything. In this case, it does not look like you are searching for regular patterns in the text, but rather for "natural" expressions that someone would write as if they were talking to you. Such a natural expression will not necessarily follow any agreed pattern at all. Some people enter the apartment number first, then the building number; some people leave out the city and give only the zip code; some people put city, state, country, THEN zip. It is not possible to list all the regex patterns needed to cover every way someone might write an address.

For natural-language addresses, I would forget about defining addresses with regular expressions and move to a state-based parsing algorithm.

  • I would start by reading the text from left to right (at least in English), one word at a time. For each word you perform one logical test: "Can this word be the beginning of an address?" I would suggest that this is a number for the building, or an apartment/unit/box number (so "Box XXX", "PO BOX XXX", "PO XXX", "Unit XXX", "#XXX" or any number less than 6 digits in length). Although I don't know this for a fact, I have never seen a North American building with a 7-digit number, which is the minimum length for a telephone number. So I suspect you can easily tell phone numbers and building numbers apart. This "start of address" test may be a set of regular expression matches, but we are not matching the entire address, only checking for the words or phrases that start one. I would even say that it will be easier without regular expressions.

  • Once you find the start of an address, you create an "address parsing state object" (some class you use to store the address as parsing continues, keeping track of what you have so far and what you expect next). Now you can keep walking through the sentence and keep adding to the parser state object. After the building number, I would expect a street name or a direction indicator (N. E. W. S. NE. NW. SE. SW.). If neither follows, stop parsing this address, treat it as invalid or incomplete, and keep looking for new address-start words. Otherwise, add the street name and/or direction indicator to the parse state and keep going!

  • Everything that follows the street name can vary endlessly. Some users may simply stop after the building number and street name (assuming their local city/region/country). Otherwise, you are probably looking for either a city name or a zip/postal code. If one is found, add it to your address parse object; if not, either accept an incomplete address (filling in the user's default location information) or reject it as an invalid address (ignore it and continue searching for another address start?).
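
The steps above can be sketched in JavaScript roughly as follows. This is a heavily simplified illustration under stated assumptions, not a complete parser: the only start word it recognises is a bare number of one to five digits, an address is accepted only if it ends in a five-digit zip, and at most MAX_REST_TOKENS words may sit between the two. All names (bare, parseAddressAt, findAddresses) are invented for this example.

```javascript
// Assumption: no more than this many words between building number and zip.
const MAX_REST_TOKENS = 6;

// Drop trailing punctuation so "Brea," and "E." compare cleanly.
const bare = (token) => token.replace(/[.,;:]+$/, "");

// Try to parse an address whose first word is tokens[start].
// Returns { number, rest, zip, end } or null if the attempt fails.
function parseAddressAt(tokens, start) {
  const number = bare(tokens[start]);
  if (!/^\d{1,5}$/.test(number)) return null; // not an address start word

  const rest = []; // parser state: words collected after the building number
  for (let i = start + 1; i < tokens.length; i++) {
    const word = bare(tokens[i]);
    if (rest.length > 0 && /^\d{5}$/.test(word)) {
      return { number, rest: rest.join(" "), zip: word, end: i };
    }
    // Right after the number we expect a direction ("E.") or a street name;
    // both start with a letter, so one test covers them.
    if (rest.length === 0 && !/^[A-Za-z]/.test(word)) return null;
    if (rest.length === MAX_REST_TOKENS) return null; // too long, give up
    rest.push(tokens[i]);
  }
  return null; // ran out of words before seeing a zip
}

// Walk the text one word at a time, testing each word as an address start.
function findAddresses(text) {
  const tokens = text.replace(/<[^>]*>/g, " ").split(/\s+/).filter(Boolean);
  const found = [];
  let i = 0;
  while (i < tokens.length) {
    const parsed = parseAddressAt(tokens, i);
    if (parsed) {
      const { end, ...address } = parsed;
      found.push(address);
      i = end + 1; // continue after the zip
    } else {
      i += 1; // keep looking for a new start word
    }
  }
  return found;
}

console.log(findAddresses("john doe 7143138656 is 8240 e marblehead way 92808"));
// [ { number: '8240', rest: 'e marblehead way', zip: '92808' } ]
```

The ten-digit phone number is skipped because it fails the start-word test. Start words such as "PO Box" or "Unit", and incomplete addresses without a zip, are left out of this sketch.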

Ultimately, this approach could be one fairly simple JavaScript method, perhaps a few hundred lines of code (I'm not a PHP guy, but I guess it would be similar). If you tried instead to list every regex pattern someone could use to write an address, you would end up with a hundred of them, and it would still be unreliable! (And probably too slow, if you are trying to match hundreds of regex patterns.)

+1

My thinking is that you must have something that tells your code "this part is the address and the rest is plain text." To do this, either create an array of addresses or store the addresses in a database, so that you can compare them against the submitted values.
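
A minimal JavaScript sketch of that idea, assuming the known addresses live in a plain array (a database lookup would follow the same comparison). The names KNOWN_ADDRESSES, normalize and findKnownAddresses are hypothetical, and the list itself is just the two addresses from the question.

```javascript
// Hypothetical list of addresses we already know about.
const KNOWN_ADDRESSES = [
  "8240 E. Marblehead Way 92808",
  "024 Brea Mall Brea, California 92821",
];

// Lowercase, drop dots and commas, and collapse whitespace so that
// "8240 E. Marblehead Way" and "8240 e marblehead way" compare equal.
function normalize(text) {
  return text.toLowerCase().replace(/[.,]/g, "").replace(/\s+/g, " ").trim();
}

// Return every known address that appears somewhere in the text.
function findKnownAddresses(text, known) {
  const haystack = normalize(text);
  return known.filter((address) => haystack.includes(normalize(address)));
}

console.log(
  findKnownAddresses("john doe 7143138656 is 8240 e marblehead way 92808", KNOWN_ADDRESSES)
); // [ '8240 E. Marblehead Way 92808' ]
```

This only finds addresses that are already on the list, which is the trade-off of the approach.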

0

I have had good luck using the Google Geocoding API. It is hard to think of every possible way an address can be entered.

I recently had to extract the address parts from a single line for a real estate site, and I found that the best option was to use the Google Geocoding API. This allowed me to get the street, city, state, zip code, latitude, longitude, and much more for each address I entered.

I found an excellent guide to setting up the Google Geocoding API with PHP here: http://www.andrew-kirkpatrick.com/2011/10/google-geocoding-api-with-php/

The best part is, it even works with place names. Thus, a search for “UCLA” or “Apple Headquarters” will provide you with all parts of the address that you may need.
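
For the JavaScript side, a call to the Geocoding API's JSON endpoint might look roughly like the sketch below. The helper names (buildGeocodeUrl, pickComponents, geocode) are invented for this example, you need your own API key, and the code assumes a runtime with a global fetch (a current browser or Node 18+). It is not taken from the linked guide.

```javascript
const GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json";

// Build the request URL; URLSearchParams takes care of escaping the address.
function buildGeocodeUrl(address, apiKey) {
  const params = new URLSearchParams({ address, key: apiKey });
  return `${GEOCODE_ENDPOINT}?${params}`;
}

// Pull the commonly needed parts out of one entry of the "results" array.
function pickComponents(result) {
  const byType = (type) => {
    const component = result.address_components.find((c) => c.types.includes(type));
    return component ? component.long_name : null;
  };
  return {
    streetNumber: byType("street_number"),
    street: byType("route"),
    city: byType("locality"),
    state: byType("administrative_area_level_1"),
    zip: byType("postal_code"),
    lat: result.geometry.location.lat,
    lng: result.geometry.location.lng,
  };
}

// Geocode a free-form string; resolves to null when the API returns no match.
async function geocode(address, apiKey) {
  const response = await fetch(buildGeocodeUrl(address, apiKey));
  const data = await response.json();
  return data.status === "OK" ? pickComponents(data.results[0]) : null;
}
```

Calling geocode("8240 E. Marblehead Way 92808", "YOUR_API_KEY") would then resolve to an object with those fields. Note that this geocodes a string that is already an address or place name; it does not by itself cut the address out of a longer sentence.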

0
