Using Regex to Detect Dates in Many Formats

I am working on an application that scrapes local websites to build a database of upcoming events, and I'm trying to use regex to capture as many date formats as possible.

Consider the following sentence fragments:

  • "The focus of the workshop, on Saturday, February 2, 2013, will be [...]"
  • "Valentines Special @The Radisson February 14th"
  • "On Friday, February 15, a special Hollywood theme [...]"
  • "Symposium on children's play on Friday, February 8"
  • "Hosting a craft workshop on March 9 - 11 in the old [...]"

I want to be able to scan them and catch as many dates as possible. At the moment, I am doing this in what is probably the wrong way (I'm not very good at regex) by running several regex patterns one after another, like this:

    /([0-9]+?)(st|nd|rd|th) (of)? (Jan|Feb|Mar|etc)/i
    /([0-9]+?)(st|nd|rd|th) (of)? (January|February|March|Etcetera)/i
    /(Jan|Feb|Mar|etc) ([0-9]+?)(st|nd|rd|th)/i
    /(January|February|March|Etcetera) ([0-9]+?)(st|nd|rd|th)/i

I could combine all of these into one giant regex, but it seems like there should be a cleaner way to do this in PHP, perhaps with a third-party library or something similar?

EDIT: There may be errors in the regexes above; they are only meant as an example.

+2
2 answers

I wrote a function that extracts dates from text using strtotime():

    function parse_date_tokens($tokens) {
        # only try to extract a date if we have 2 or more tokens
        if(!is_array($tokens) || count($tokens) < 2) return false;
        return strtotime(implode(" ", $tokens));
    }

    function extract_dates($text) {
        static $patterns = Array(
            '/^[0-9]+(st|nd|rd|th)?$/i',                   # day
            '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i',  # month ("etc" stands in for the remaining months)
            '/^20[0-9]{2}$/',                              # year
            '/^of$/'                                       # words
        );
        # defines which of the above patterns aren't actually part of a date
        static $drop_patterns = Array(
            false,
            false,
            false,
            true
        );
        $tokens = Array();
        $result = Array();
        $text = str_word_count($text, 1, '0123456789'); # get all words in text

        # iterate words and search for matching patterns
        foreach($text as $word) {
            $found = false;
            foreach($patterns as $key => $pattern) {
                if(preg_match($pattern, $word)) {
                    if(!$drop_patterns[$key]) {
                        $tokens[] = $word;
                    }
                    $found = true;
                    break;
                }
            }
            # a non-date word ends the current run of date tokens
            if(!$found) {
                $result[] = parse_date_tokens($tokens);
                $tokens = Array();
            }
        }
        $result[] = parse_date_tokens($tokens);

        # drop the runs that could not be parsed (false)
        return array_filter($result);
    }

    # test
    $texts = Array(
        "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
        "Valentines Special @ The Radisson, Feb 14th",
        "On Friday the 15th of February, a special Hollywood themed [...]",
        "Symposium on Childhood Play on Friday, February 8th",
        "Hosting a craft workshop March 9th - 11th in the old [...]"
    );
    $dates = extract_dates(implode(" ", $texts));
    echo "Dates: \n";
    foreach($dates as $date) {
        echo " " . date('d.m.Y H:i:s', $date) . "\n";
    }

This outputs:

    Dates:
     02.02.2013 00:00:00
     14.02.2013 00:00:00
     15.02.2013 00:00:00
     08.02.2013 00:00:00
     09.03.2013 00:00:00

This solution may not be ideal and certainly has its drawbacks (for instance, strtotime() fills in the current year when the text gives none), but it is a fairly simple answer to your problem.

+4

For this kind of potentially complex regular expression, I try to break it into simple parts that can be individually tested, maintained and developed.

I use REL, a DSL (in Scala) that lets you assemble and reuse regex fragments. That way, you can build your date regex out of such fragments and unit test the matches of each part.

Additionally, your unit/spec tests can double as documentation for that piece of the regex, showing what it matches and what it doesn't (which tends to be important when using regular expressions).
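The same idea can be approximated in plain PHP without REL. The following is only a minimal sketch, not REL's API or output: the constants `DAY`, `MONTH`, `YEAR` and the functions `matches_fully()`, `date_pattern()` and `find_dates()` are hypothetical names made up for illustration, and the fragments cover far fewer cases than the generated regex. `PREG_UNMATCHED_AS_NULL` assumes PHP 7.2 or later.

```php
<?php
// Hypothetical, individually testable fragments (illustrative, not exhaustive).
const DAY   = '(?:[12][0-9]|3[01]|0?[1-9])(?:st|nd|rd|th)?';
const MONTH = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
            . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
const YEAR  = '(?:19|20)[0-9]{2}';

// Test helper: does a single fragment match the whole string?
function matches_fully(string $part, string $s): bool {
    return preg_match('/^' . $part . '$/i', $s) === 1;
}

// Assemble the fragments into one pattern: "day [of] month" or "month day",
// optionally followed by a year. The groups need distinct names (d1/d2, m1/m2)
// because PCRE rejects duplicate group names by default.
function date_pattern(): string {
    return sprintf(
        '/\b(?:(?<d1>%1$s)(?: of)? (?<m1>%2$s)|(?<m2>%2$s) (?<d2>%1$s))\b(?:,? (?<y>%3$s)\b)?/i',
        DAY, MONTH, YEAR
    );
}

// Return every date found in $text as day / month / year (year may be null).
function find_dates(string $text): array {
    preg_match_all(date_pattern(), $text, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);
    $found = [];
    foreach ($matches as $m) {
        $found[] = [
            'day'   => (int) ($m['d1'] ?? $m['d2']),   // "2nd" casts to 2
            'month' => strtolower($m['m1'] ?? $m['m2']),
            'year'  => isset($m['y']) ? (int) $m['y'] : null,
        ];
    }
    return $found;
}

// Each fragment can be checked on its own...
var_dump(matches_fully(DAY, '31st'), matches_fully(DAY, '32'));
// ...and so can the assembled pattern.
print_r(find_dates('On Friday the 15th of February, a special Hollywood themed [...]'));
```

Keeping each fragment in its own constant means a failing test points at the day, month or year part directly, rather than at one long pattern.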

In the next version of REL (0.3), you will be able to export a regex directly to, for example, the PCRE flavor (and thus PHP) and use it on its own. So far, only the JavaScript and .NET translations are implemented in the GitHub repository. Using the latest (not yet publicly released) snapshot, the PCRE flavor of the English alphanumeric date regex is as follows:

 /(?:(?:(?<!\d)(?<a_d1>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?: ?+(?:of )?+))(?>(?<a_m1>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?))|(?:\b(?>(?<a_m2>jan(?>uary|\.)?|feb(?>ruary|r?\.?)?|mar(?>ch|\.)?|apr(?>il|\.)?|may|jun(?>e|\.)?|jul(?>y|\.)?|aug(?>ust|\.)?|sep(?>tember|t?\.?)?|oct(?>ober|\.)?|nov(?>ember|\.)?|dec(?>ember|\.)?)))(?:(?:(?: ?+)(?<a_d2>(?>(?:(?:[23]?1)st|(?:2?2)nd|(?:2?3)rd|(?:[12]?[4-9]|[123]0)th)\b|0[1-9]|[12][0-9]|3[01]|[1-9]|[12][0-9]|3[01]))(?!\d))?))(?:(?:,?+)(?:(?:(?: ?)(?<a_y>(?:1[7-9]|20)\d\d|'?+\d\d))(?!\d))|(?<=\b|\.))/i 

This was obtained from the expression fr.splayce.rel.matchers.en.Date.ALPHA using PCREFlavor (not yet in the GitHub repository). It only matches when the month is written in alphabetical form (feb, feb. or february); the regex ….Date.ALL, which also matches numeric forms such as 2/21/2013, is more complicated.

Also, this particular regex matches your examples, but may be a bit limited for your needs:

  • It does not include days of the week.
  • It will not match date ranges (only March 9th is matched in your last example).
  • It does not match year-first dates, e.g. 2013, jan. 14th
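One possible way around the date-range limitation is to allow an optional second day after the first. The sketch below is an assumption-laden illustration in plain PHP, not part of REL: `find_day_ranges()` and its fragments are hypothetical, it only handles month-first ranges within a single month, and `PREG_UNMATCHED_AS_NULL` assumes PHP 7.2 or later.

```php
<?php
// Hypothetical helper: find "month day" dates with an optional "- day" or
// "to day" tail, e.g. "March 9 - 11". Single dates get from == to.
function find_day_ranges(string $text): array {
    $day   = '(?:[12][0-9]|3[01]|0?[1-9])(?:st|nd|rd|th)?';
    $month = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?'
           . '|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';

    // month, first day, then optionally a separator and a second day
    $pattern = sprintf(
        '/\b(?<month>%2$s) (?<from>%1$s)\b(?: ?(?:-|to) ?(?<to>%1$s)\b)?/i',
        $day, $month
    );

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $ranges = [];
    foreach ($matches as $m) {
        $from = (int) $m['from'];                        // "9th" casts to 9
        $to   = isset($m['to']) ? (int) $m['to'] : $from;
        $ranges[] = ['month' => strtolower($m['month']), 'from' => $from, 'to' => $to];
    }
    return $ranges;
}

print_r(find_day_ranges('Hosting a craft workshop on March 9 - 11 in the old [...]'));
```

Ranges that cross a month boundary and year-first dates would each need their own alternative in the pattern.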
+1