How do I scrape web pages to find specific linked pages in Java on Google App Engine?

I need to get text from a remote website that does not provide an RSS feed.

I know that the data I need is always found on pages linked from the home page (http://www.example.com/) by a link whose text contains "Invoices Report".

For instance:

 <a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a> 

So, I need to find all the links on the home page that match this pattern, and then get all the text inside the <div class="invoice-body"> on each linked page.

Are there Java tools that help with this, and is there anything specific to Google App Engine for Java that can be used to do it?

1 answer

Check out http://code.google.com/appengine/docs/java/urlfetch/overview.html

You can use the UrlFetch service to read www.example.com/index.html line by line and use a regular expression to search for "Invoices Report".

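    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // On App Engine, URL.openStream() is served by the URL Fetch service.
    URL url = new URL("http://www.example.com/index.html");
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    String line;
    while ((line = reader.readLine()) != null) {
        // Placeholder: test the line for the link text and record the link.
        checkLineForTextAndAddLinkOrWhatever(line);
    }
    reader.close();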
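The regular expression search mentioned above could look roughly like the sketch below. The class name InvoiceLinkFinder, the method findInvoiceLinks, and the pattern itself are illustrative assumptions, not part of any App Engine API. The pattern assumes the href value is double-quoted and that the link text contains no nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the href of every link whose text
// contains "Invoices Report".
class InvoiceLinkFinder {

    // Group 1 captures the href value. Assumes double-quoted attributes
    // and plain link text with no nested tags.
    private static final Pattern INVOICE_LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>[^<]*Invoices Report[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    static List<String> findInvoiceLinks(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher matcher = INVOICE_LINK.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }
}
```

Calling findInvoiceLinks(line) from inside the read loop would work for links that fit on one line. A regular expression is a fragile way to read HTML, so treat this as a starting point only.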

You may need a different kind of reader if a link can span multiple lines, because reading line by line would then miss it.
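One way to cope with markup that spans several lines is to read the whole page into a single string and match with Pattern.DOTALL. The sketch below applies that idea to the <div class="invoice-body"> from the question. The names InvoiceBodyExtractor, readAll, and extractInvoiceText are hypothetical, and the pattern assumes the div contains no nested <div> elements.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads a whole page, then pulls the text out of
// the first <div class="invoice-body">.
class InvoiceBodyExtractor {

    // DOTALL lets '.' match line breaks, so the div may span many lines.
    // The reluctant (.*?) stops at the first </div>, which is only
    // correct when the div has no nested <div> inside it.
    private static final Pattern INVOICE_BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"[^>]*>(.*?)</div>",
            Pattern.DOTALL);

    // Reads everything from the reader into one string.
    static String readAll(Reader in) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = in.read(buffer)) != -1) {
            page.append(buffer, 0, count);
        }
        return page.toString();
    }

    // Returns the div's text with tags removed and whitespace collapsed,
    // or null when the page has no matching div.
    static String extractInvoiceText(String html) {
        Matcher matcher = INVOICE_BODY.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(1)
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

The string returned by readAll for each report page can be passed straight to extractInvoiceText. If the real pages nest elements more deeply, an HTML parser would be more reliable than this pattern.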
