I want to extract dates with various formats from web pages. I use the Selenium2 Java interface to interact with the browser. I also use jQuery to further interact with the document. Therefore, solutions for both layers are welcome.
Date formats vary widely across locales, and months can be written either as names or as numbers. I need to match as many dates as possible, and I know that there are many combinations.
For example, if I have an HTML element like this:
<div class="tag_view"> Last update: May,22,2011 View :40 </div>
I want the relevant part of the date to be extracted and recognized:
May,22,2011
Now this should be converted to a regular Java Date object.
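To make the goal concrete, here is a minimal Java sketch of what I mean for this one format. It assumes an English month name followed by day and year, separated by commas or spaces. The class name DateExtractor and the regular expression are my own placeholders, not part of Selenium or jQuery, and this handles only one of the many formats I need:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first "Month,day,year" style date in free text.
class DateExtractor {
    // Assumed pattern: month name (3-9 letters), day, four-digit year,
    // separated by commas and/or spaces, e.g. "May,22,2011" or "May 22, 2011".
    private static final Pattern MONTH_DAY_YEAR =
            Pattern.compile("\\b([A-Za-z]{3,9})[ ,]+(\\d{1,2})[ ,]+(\\d{4})\\b");

    /** Returns the first date found in the text, or null if nothing parses. */
    static Date extract(String text) {
        Matcher m = MONTH_DAY_YEAR.matcher(text);
        while (m.find()) {
            // Normalize separators so a single format string is enough.
            String normalized = m.group(1) + " " + m.group(2) + " " + m.group(3);
            SimpleDateFormat fmt = new SimpleDateFormat("MMM d yyyy", Locale.ENGLISH);
            fmt.setLenient(false); // reject impossible dates such as "May 40 2011"
            try {
                return fmt.parse(normalized);
            } catch (ParseException e) {
                // The word was not a month name or the day was invalid; keep scanning.
            }
        }
        return null;
    }
}
```

Calling DateExtractor.extract("Last update: May,22,2011 View :40") should give a Date for 22 May 2011 in the JVM's default time zone. Other locales and purely numeric formats would need more patterns, which is exactly the part I am asking about.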
Update
This should work with HTML from any web page; the date can be contained in any element, in any format. For example, here on Stack Overflow, the source looks like this:
<span class="relativetime" title="2011-05-13 14:45:06Z">May 13 at 14:45</span>
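For this particular case the title attribute is already machine-readable, so once I have the attribute value as a string (for example from Selenium's WebElement.getAttribute("title")), parsing is straightforward. A rough sketch, assuming the trailing Z means UTC and using a class name, TitleTimestampParser, that I made up for illustration:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for timestamps shaped like "2011-05-13 14:45:06Z".
class TitleTimestampParser {
    /** Parses the attribute value, treating the trailing 'Z' as UTC (an assumption). */
    static Date parse(String title) throws ParseException {
        // 'Z' is quoted so it is matched as a literal character.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false); // fail on out-of-range fields instead of rolling over
        return fmt.parse(title);
    }
}
```

This only covers the easy case where the page exposes a standardized timestamp; the visible text "May 13 at 14:45" has no year and would still need separate handling.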
I want this to be done in the most efficient way, and I assume it will be a jQuery selector or filter that returns a standardized representation of the date. But I am open to your suggestions.