What is the best way to scrape malformed XHTML pages from a Java application

I want to be able to capture content from web pages, especially tags and the content inside them. I tried XQuery and XPath, but they don't seem to work on malformed XHTML, and regex is just a pain.

Is there a better solution? Ideally, I would like to be able to request all the links and get back an array of URLs, or request the text of the links and get back an array of strings with that text, or request all the bold text, and so on.

4 answers

Run the XHTML through something like JTidy, which should hand you back valid XML that you can then query normally.
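
A minimal sketch of that approach, assuming the JTidy jar is on the classpath (the URL and class name are placeholders): Tidy repairs the markup into a W3C DOM, after which plain XPath queries work.

    import java.io.InputStream;
    import java.net.URL;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class JTidyLinkExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute the page you actually want to scrape.
            try (InputStream in = new URL("http://example.com/").openStream()) {
                Tidy tidy = new Tidy();
                tidy.setXHTML(true);         // emit well-formed XHTML
                tidy.setQuiet(true);         // suppress the usual console chatter
                tidy.setShowWarnings(false);

                // Repair the markup and get back a standard W3C DOM Document.
                Document doc = tidy.parseDOM(in, null);

                // Ordinary XPath now works; here we pull the href of every link.
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList hrefs = (NodeList) xpath.evaluate(
                        "//a/@href", doc, XPathConstants.NODESET);
                for (int i = 0; i < hrefs.getLength(); i++) {
                    System.out.println(hrefs.item(i).getNodeValue());
                }
            }
        }
    }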


You might take a look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a web page and request all of the page's URLs exactly as you describe.

It was very easy to work with: it literally drives a real web browser and hands the information back to you in a convenient form. IE support seemed the most mature, but Firefox was supported as well, at least with Watir.


I had some problems with JTidy back in the day. I think it failed on tags that were not closed, or something along those lines; I don't know whether that has been fixed by now. I ended up using a wrapper around TagSoup, although I don't remember the exact name of the project. There is also HtmlCleaner.
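
For what it's worth, TagSoup can also be used directly rather than through a wrapper. A minimal sketch, assuming the TagSoup jar is on the classpath (the URL and class name are placeholders): TagSoup exposes a SAX parser that never rejects broken markup, so you can collect whatever you need, here the href of every link, from the event stream.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class TagSoupLinkExample {
        public static void main(String[] args) throws Exception {
            final List<String> hrefs = new ArrayList<String>();

            // TagSoup's Parser is a SAX XMLReader that tolerates broken markup,
            // including unclosed tags, and always emits well-formed events.
            XMLReader reader = new Parser();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("a".equalsIgnoreCase(localName)) {
                        String href = atts.getValue("href");
                        if (href != null) {
                            hrefs.add(href);
                        }
                    }
                }
            });

            // Placeholder URL; substitute the page you actually want to scrape.
            try (InputStream in = new URL("http://example.com/").openStream()) {
                reader.parse(new InputSource(in));
            }
            for (String href : hrefs) {
                System.out.println(href);
            }
        }
    }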


I have used http://htmlparser.sourceforge.net/ . It can parse badly formed HTML and makes it easy to extract the data you want.
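
A minimal sketch of pulling out all the links with it, assuming the htmlparser jar is on the classpath (the URL and class name are placeholders):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class HtmlParserLinkExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the Parser fetches the page itself and copes
            // with badly formed markup.
            Parser parser = new Parser("http://example.com/");

            // Collect every <a> tag, then print each target URL with its text.
            NodeList links = parser.extractAllNodesThatMatch(
                    new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " -> " + link.getLinkText());
            }
        }
    }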

