Extract inconsistently formatted date from string (date parsing, NLP)

I have a large list of files, some of which have dates embedded in the file name. The date format is inconsistent and often incomplete, for example: "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004", etc. In addition, some file names contain unrelated numbers that look somewhat like dates, such as "20202010".

In short, the dates are usually incomplete, sometimes absent, inconsistently formatted, and embedded in a string with other information, for example "Report Aug06.xls".

Are there any Perl modules available that will do a decent job of guessing the date from such a string? It does not need to be 100% correct, because a person will check the results manually, but I am trying to make things as easy as possible for that person, and there are thousands of entries to check :)

+6
date perl nlp
3 answers

Date::Parse will definitely be part of your answer: it is the bit that takes a randomly formatted date-like string and makes an actual usable date out of it.

The other part of your problem, the rest of the characters in your file names, is unusual enough that you are unlikely to find a module someone else has already packaged for it.

Without seeing more of your sample data it is only possible to guess, but I would start by identifying possible or likely candidates for the "date section".

Here's a nasty brute-force example using Date::Parse. (A smarter approach would be to use a list of regexes to try to identify the date part. I'm happy to burn processor cycles so that I don't have to think as hard, though!)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Date::Parse;

    my @files = (
        "Report Aug06.xls",
        "ReportAug2006",
        "Report 11th September 2006.xls",
        "Annual Report-08-06",
        "End-of-month Report01-08-06.xls",
        "Report2006",
    );

    # Assumption: the longest likely date string is something like
    # '11th September 2006' (19 chars); the shortest is "2006" (4 chars).
    # Brute force: try every suffix of the file name (less extension),
    # from 19 chars down to 4 chars long, and report the longest one
    # that Date::Parse recognises as a date.
    foreach my $file (@files) {
        # chop the extension if there is one (work on a copy of the name)
        (my $name = $file) =~ s/\.[^.]*$//;

        for my $len (-19 .. -4) {
            # skip suffix lengths longer than the name itself
            next if -$len > length($name);

            my $string = substr($name, $len);
            my $time   = str2time($string);
            next unless defined $time;

            print "$string is a date: $time = ", scalar(localtime($time)), "\n";
            last;
        }
    }
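The regex-list approach mentioned above could look roughly like the following. This is only a sketch using core Perl: the names `guess_date` and `format_guess`, the pattern list, the day-first reading of numeric dates, and the two-digit-year pivot of 70 are all assumptions made for illustration, and would need tuning against the real file names.

```perl
use strict;
use warnings;

my @NAMES = qw(january february march april may june
               july august september october november december);

# Build "jan(?:uary)?|feb(?:ruary)?|..." and a lookup from abbreviation to number.
my (%MONTH, @alts);
for my $i (0 .. $#NAMES) {
    my $abbr = substr($NAMES[$i], 0, 3);
    my $rest = substr($NAMES[$i], 3);
    $MONTH{$abbr} = $i + 1;
    push @alts, length($rest) ? "$abbr(?:$rest)?" : $abbr;
}
my $MON = join '|', @alts;
my $Y   = qr/(\d{4}|\d{2})(?!\d)/;   # 4- or 2-digit year, not followed by a digit

# Each entry: [ pattern, sub mapping the captures to (year, month, day) ].
# Earlier entries are more specific and are tried first.
my @PATTERNS = (
    # "11th September 2006"
    [ qr/(?<!\d)(\d{1,2})(?:st|nd|rd|th)?[ _-]*($MON)[ _-]*$Y/i,
      sub { ($_[2], $MONTH{ lc(substr($_[1], 0, 3)) }, $_[0]) } ],
    # "Aug06", "Aug2006", "August 2006"
    [ qr/($MON)[ _-]*$Y/i,
      sub { ($_[1], $MONTH{ lc(substr($_[0], 0, 3)) }) } ],
    # "01-08-06" (assumed day-month-year)
    [ qr/(?<!\d)(\d{1,2})[-_.](\d{1,2})[-_.]$Y/,
      sub { ($_[2], $_[1], $_[0]) } ],
    # "08-06" (assumed month-year)
    [ qr/(?<!\d)(\d{1,2})[-_]$Y/,
      sub { ($_[1], $_[0]) } ],
    # "011004" (assumed ddmmyy)
    [ qr/(?<!\d)(\d{2})(\d{2})(\d{2})(?!\d)/,
      sub { ($_[2], $_[1], $_[0]) } ],
    # bare four-digit year such as "2006"; longer digit runs are ignored
    [ qr/(?<!\d)((?:19|20)\d{2})(?!\d)/,
      sub { ($_[0]) } ],
);

# Returns (year, month, day); month and day may be undef.
# Returns an empty list when nothing plausible is found.
sub guess_date {
    my ($name) = @_;
    (my $base = $name) =~ s/\.[A-Za-z][A-Za-z0-9]*$//;   # drop an extension
    for my $p (@PATTERNS) {
        my ($re, $build) = @$p;
        my @caps = $base =~ $re or next;
        my ($y, $m, $d) = $build->(@caps);
        $y = $y < 100 ? ($y < 70 ? 2000 + $y : 1900 + $y) : $y + 0;
        next if defined $m && ($m < 1 || $m > 12);
        next if defined $d && ($d < 1 || $d > 31);
        return ($y, defined $m ? $m + 0 : undef, defined $d ? $d + 0 : undef);
    }
    return;
}

# Render a guess as "YYYY-MM-DD", with "??" for the unknown parts.
sub format_guess {
    my ($name) = @_;
    my ($y, $m, $d) = guess_date($name) or return 'no date found';
    my @parts = ($y);
    push @parts, defined $_ ? sprintf('%02d', $_) : '??' for $m, $d;
    return join '-', @parts;
}

printf "%-32s %s\n", $_, format_guess($_)
    for "Report Aug06.xls", "Report 11th September 2006.xls", "20202010";
```

Because the patterns are ordered from most to least specific, a name such as "01-08-06" is read as a full date before the shorter month-year pattern gets a chance. Ambiguous cases (day-first versus month-first, for instance) are still only guesses, which fits the plan of having a person review the results.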
+3

Date::Parse does what you want.

0

DateTime::Format::Natural looks like a candidate for this job. I can't vouch for it personally, but it has good reviews.
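A minimal sketch of trying that module on a candidate string, for anyone who wants to experiment. The `new`, `parse_datetime` and `success` calls follow the module's documented synopsis; the wrapper name `natural_date` is hypothetical, and which of the formats in the question the module actually accepts is not assumed here, so it would need to be tested against real data.

```perl
use strict;
use warnings;

# Returns "YYYY-MM-DD" if DateTime::Format::Natural accepts the string,
# or undef if it does not (or if the module is not installed).
sub natural_date {
    my ($text) = @_;

    # Load the module lazily so the script still runs without it.
    return unless eval { require DateTime::Format::Natural; 1 };

    my $parser = DateTime::Format::Natural->new;
    my $dt     = eval { $parser->parse_datetime($text) };
    return unless $dt && $parser->success;
    return $dt->ymd;
}

for my $candidate ("August 2006", "11th September 2006", "Report") {
    my $ymd = natural_date($candidate);
    printf "%-22s %s\n", $candidate, defined $ymd ? $ymd : 'not recognised';
}
```

Like Date::Parse, this only handles the date-like fragment itself, so the fragment still has to be isolated from the rest of the file name first.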

0