What is the best way to scrape malformed XHTML pages from a Java application

I want to be able to capture content from web pages, especially tags and the content inside them. I tried XQuery and XPath, but they don't seem to work on malformed XHTML, and regex is just a pain.

Is there a better solution? Ideally, I would like to be able to request all the links and get back an array of URLs, or request the text of the links and get back an array of strings with that text, or request all the bold text, and so on.

4 answers

Run the XHTML through something like JTidy, which should hand you back valid XML that you can then query normally.
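
A minimal sketch of that approach, assuming the JTidy jar is on the classpath (the URL and class name are placeholders): Tidy repairs the markup into a W3C DOM, after which plain XPath queries work.

    import java.io.InputStream;
    import java.net.URL;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class JTidyLinkExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute the page you actually want to scrape.
            try (InputStream in = new URL("http://example.com/").openStream()) {
                Tidy tidy = new Tidy();
                tidy.setXHTML(true);         // emit well-formed XHTML
                tidy.setQuiet(true);         // suppress the usual console chatter
                tidy.setShowWarnings(false);

                // Repair the markup and get back a standard W3C DOM Document.
                Document doc = tidy.parseDOM(in, null);

                // Ordinary XPath now works; here we pull the href of every link.
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList hrefs = (NodeList) xpath.evaluate(
                        "//a/@href", doc, XPathConstants.NODESET);
                for (int i = 0; i < hrefs.getLength(); i++) {
                    System.out.println(hrefs.item(i).getNodeValue());
                }
            }
        }
    }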


You might take a look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a web page and request all of the page's URLs exactly as you describe.

It was very easy to work with: it literally drives a real web browser and hands the information back to you in a convenient form. IE support seemed the most mature, but Firefox was supported as well, at least with Watir.


I had some problems with JTidy back in the day. I think it failed on tags that were not closed, or something along those lines; I don't know whether that has been fixed by now. I ended up using a wrapper around TagSoup, although I don't remember the exact name of the project. There is also HtmlCleaner.
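
For what it's worth, TagSoup can also be used directly rather than through a wrapper. A minimal sketch, assuming the TagSoup jar is on the classpath (the URL and class name are placeholders): TagSoup exposes a SAX parser that never rejects broken markup, so you can collect whatever you need, here the href of every link, from the event stream.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class TagSoupLinkExample {
        public static void main(String[] args) throws Exception {
            final List<String> hrefs = new ArrayList<String>();

            // TagSoup's Parser is a SAX XMLReader that tolerates broken markup,
            // including unclosed tags, and always emits well-formed events.
            XMLReader reader = new Parser();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("a".equalsIgnoreCase(localName)) {
                        String href = atts.getValue("href");
                        if (href != null) {
                            hrefs.add(href);
                        }
                    }
                }
            });

            // Placeholder URL; substitute the page you actually want to scrape.
            try (InputStream in = new URL("http://example.com/").openStream()) {
                reader.parse(new InputSource(in));
            }
            for (String href : hrefs) {
                System.out.println(href);
            }
        }
    }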


I have used http://htmlparser.sourceforge.net/ . It can parse badly formed HTML and makes it easy to extract the data you want.
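
A minimal sketch of pulling out all the links with it, assuming the htmlparser jar is on the classpath (the URL and class name are placeholders):

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class HtmlParserLinkExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the Parser fetches the page itself and copes
            // with badly formed markup.
            Parser parser = new Parser("http://example.com/");

            // Collect every <a> tag, then print each target URL with its text.
            NodeList links = parser.extractAllNodesThatMatch(
                    new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLink() + " -> " + link.getLinkText());
            }
        }
    }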

