Retrieving dates from a web page

I want to extract dates in various formats from web pages. I use the Selenium 2 Java interface to interact with the browser, and I also use jQuery to interact with the document, so solutions for either layer are welcome.

Dates can have very different formats in different locales. In addition, months can be written as names or as numbers. I need to match as many dates as possible, and I know that there are many combinations.

For example, if I have an HTML element like this:

<div class="tag_view"> Last update: May,22,2011 View :40 </div> 

I want the relevant part of the date to be extracted and recognized:

 May,22,2011 

This should then be converted to a regular Java Date object.

Update

This should work with HTML from any web page; the date can be contained in any element, in any format. For example, here on Stack Overflow, the source code is as follows:

 <span class="relativetime" title="2011-05-13 14:45:06Z">May 13 at 14:45</span> 

I want this to be done in the most efficient way, and I assume that will be a jQuery selector or filter that returns a standardized representation of the date. But I am open to your suggestions.

3 answers

I will answer this myself, because I came up with a working solution. Comments are still appreciated.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Extract the first date found in the given text.
     *
     * @return Date object, or null if no date was found
     * @throws ParseException if a matched date cannot be parsed
     */
    public Date extractDate(String text) throws ParseException {
        Date date = null;
        boolean dateFound = false;

        String year = null;
        String month = null;
        String monthName = null;
        String day = null;
        String hour = null;
        String minute = null;
        String second = null;
        String ampm = null;

        String regexDelimiter = "[-:\\/.,]";
        String regexDay = "((?:[0-2]?\\d{1})|(?:[3][01]{1}))";
        // Two capture groups: numeric month, then month name
        String regexMonth = "(?:([0]?[1-9]|[1][012])|(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Sept|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?))";
        String regexYear = "((?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3}))";
        String regexHourMinuteSecond = "(?:(?:\\s)((?:[0-1][0-9])|(?:[2][0-3])|(?:[0-9])):([0-5][0-9])(?::([0-5][0-9]))?(?:\\s?(am|AM|pm|PM))?)?";
        String regexEndswith = "(?![\\d])";

        // DD/MM/YYYY
        String regexDateEuropean = regexDay + regexDelimiter + regexMonth + regexDelimiter + regexYear + regexHourMinuteSecond + regexEndswith;

        // MM/DD/YYYY
        String regexDateAmerican = regexMonth + regexDelimiter + regexDay + regexDelimiter + regexYear + regexHourMinuteSecond + regexEndswith;

        // YYYY/MM/DD
        String regexDateTechnical = regexYear + regexDelimiter + regexMonth + regexDelimiter + regexDay + regexHourMinuteSecond + regexEndswith;

        // see if there are any matches
        Matcher m = checkDatePattern(regexDateEuropean, text);
        if (m.find()) {
            day = m.group(1);
            month = m.group(2);
            monthName = m.group(3);
            year = m.group(4);
            hour = m.group(5);
            minute = m.group(6);
            second = m.group(7);
            ampm = m.group(8);
            dateFound = true;
        }

        if (!dateFound) {
            m = checkDatePattern(regexDateAmerican, text);
            if (m.find()) {
                month = m.group(1);
                monthName = m.group(2);
                day = m.group(3);
                year = m.group(4);
                hour = m.group(5);
                minute = m.group(6);
                second = m.group(7);
                ampm = m.group(8);
                dateFound = true;
            }
        }

        if (!dateFound) {
            m = checkDatePattern(regexDateTechnical, text);
            if (m.find()) {
                year = m.group(1);
                month = m.group(2);
                monthName = m.group(3);
                day = m.group(4); // was group(3), which is the month name
                hour = m.group(5);
                minute = m.group(6);
                second = m.group(7);
                ampm = m.group(8);
                dateFound = true;
            }
        }

        // construct date object if date was found
        if (dateFound) {
            String dateFormatPattern = "";
            String dayPattern = "";
            String dateString = "";

            if (day != null) {
                dayPattern = "d" + (day.length() == 2 ? "d" : "");
            }

            if (day != null && month != null && year != null) {
                dateFormatPattern = "yyyy MM " + dayPattern;
                dateString = year + " " + month + " " + day;
            } else if (monthName != null) {
                if (monthName.length() == 3) {
                    dateFormatPattern = "yyyy MMM " + dayPattern;
                } else {
                    dateFormatPattern = "yyyy MMMM " + dayPattern;
                }
                dateString = year + " " + monthName + " " + day;
            }

            if (hour != null && minute != null) {
                // 12-hour clock only when an am/pm marker was matched
                dateFormatPattern += (ampm != null) ? " hh:mm" : " HH:mm";
                dateString += " " + hour + ":" + minute;
                if (second != null) {
                    dateFormatPattern += ":ss";
                    dateString += ":" + second;
                }
                if (ampm != null) {
                    dateFormatPattern += " a";
                    dateString += " " + ampm.toUpperCase(Locale.US);
                }
            }

            if (!dateFormatPattern.equals("") && !dateString.equals("")) {
                //TODO support different locales
                SimpleDateFormat dateFormat = new SimpleDateFormat(dateFormatPattern.trim(), Locale.US);
                date = dateFormat.parse(dateString.trim());
            }
        }

        return date;
    }

    private Matcher checkDatePattern(String regex, String text) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(text);
    }

Since we cannot limit ourselves to any particular type of element or to the children of any element, you are basically talking about searching all of the page text for dates. The only way to do this with any efficiency is to use regular expressions. Since you are looking for dates in any format, you need a regular expression for each acceptable format. Once you have determined which formats those are, just compile the regular expressions and run something like:

    var datePatterns = [];
    datePatterns.push(/\d\d\/\d\d\/\d\d\d\d/g);
    datePatterns.push(/\d\d\d\d\/\d\d\/\d\d/g);
    ...

    // change this to be more specific if at all possible
    var stringToSearch = $('body').html();
    var allMatches = [];
    for (var i = 0; i < datePatterns.length; i++) {
        // match() returns null when a pattern finds nothing
        var found = stringToSearch.match(datePatterns[i]);
        if (found) {
            allMatches = allMatches.concat(found);
        }
    }

You can find more date regular expressions by following the link, or write them yourself; they are quite simple. One note: you could combine some of the regular expressions above to make the program more efficient. I would be very careful with this, because it can make your code difficult to read very quickly. Running one regular expression per date format seems a lot cleaner.
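For the Java side of the question, the same one-pattern-per-format idea might look roughly like the sketch below. The class name `DatePatternScan`, the method `findAll`, and the three patterns are illustrative assumptions, not part of the answer above; extend the list with whatever formats you decide to accept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DatePatternScan {
    // Hypothetical pattern list: one regular expression per accepted format.
    static final List<Pattern> DATE_PATTERNS = Arrays.asList(
        Pattern.compile("\\b\\d{2}/\\d{2}/\\d{4}\\b"),  // e.g. 05/22/2011
        Pattern.compile("\\b\\d{4}/\\d{2}/\\d{2}\\b"),  // e.g. 2011/05/22
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b")   // e.g. 2011-05-13
    );

    // Returns every substring of the text matched by any pattern,
    // grouped by pattern in list order.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        for (Pattern p : DATE_PATTERNS) {
            Matcher m = p.matcher(text);
            while (m.find()) {
                matches.add(m.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<span class=\"relativetime\" title=\"2011-05-13 14:45:06Z\">"
                + "May 13 at 14:45</span> posted 05/22/2011";
        System.out.println(findAll(html)); // prints [05/22/2011, 2011-05-13]
    }
}
```

Because the raw HTML is searched rather than the rendered text, dates stored in attributes such as `title` are found as well.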


You can use getText to get the text of the element and then split the String, for example:

    // The "Last update:" text is in the div.tag_view element from the question
    String s = selenium.getText("css=div.tag_view");
    String date = s.split("Last update:")[1].split("View :")[0].trim();
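To get from the extracted string to a Java Date object, as the question asks, one option is to hand it to SimpleDateFormat. The following is a minimal sketch that assumes the text always has the exact `May,22,2011` shape from the question; the class name `SplitAndParse`, the method `parseLastUpdate`, and the hard-coded input standing in for the getText result are illustrative only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class SplitAndParse {
    // Cuts the date out of the element text and parses it.
    // Assumes the "Last update: <date> View :" layout from the question.
    static Date parseLastUpdate(String elementText) throws ParseException {
        String raw = elementText.split("Last update:")[1].split("View :")[0].trim();
        // Locale.US so that English month abbreviations such as "May" are recognized
        SimpleDateFormat format = new SimpleDateFormat("MMM,d,yyyy", Locale.US);
        format.setLenient(false); // reject impossible dates instead of rolling them over
        return format.parse(raw);
    }

    public static void main(String[] args) throws ParseException {
        // Stand-in for the text returned by getText
        String text = " Last update: May,22,2011 View :40 ";
        Date date = parseLastUpdate(text);
        System.out.println(new SimpleDateFormat("yyyy-MM-dd", Locale.US).format(date)); // prints 2011-05-22
    }
}
```

This only works when the surrounding text is known in advance, so it suits a single known page better than arbitrary pages.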