Parsing HTML Web Pages in Java

I need to parse/read many (100+) HTML web pages for specific content: multiple lines of text that are almost identical across pages.

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error on multiple computers with different connections: `java.net.SocketTimeoutException: Read timed out`.

Is there anything better?

EDIT:

Now that I got jsoup to work, I think the best question is how do I speed it up?
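Since the dominant cost here is network I/O rather than parsing, fetching the pages concurrently usually gives the biggest speedup. Below is a minimal sketch using a fixed-size thread pool; the `fetch` method is a hypothetical stand-in for the real call (e.g. `Jsoup.connect(url).timeout(10_000).get()`), simulated here so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchDemo {

    // Hypothetical stand-in for the real fetch, e.g.
    // Jsoup.connect(url).timeout(10_000).get() -- here it just
    // sleeps briefly to simulate network latency.
    static String fetch(String url) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content of " + url;
    }

    // Fetches all URLs concurrently on a fixed-size thread pool.
    static List<String> fetchAll(List<String> urls, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetch(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that fetch finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("https://example.com/page" + i);
        }
        List<String> pages = fetchAll(urls, 10);
        System.out.println(pages.size());
    }
}
```

With 100+ pages and a pool of 10 threads, total time drops to roughly the slowest tenth of the fetches instead of their sum; tune the pool size to what the target server tolerates.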

3 answers

Have you tried extending the timeout in jsoup? By default it is only 3 seconds; see `Connection.timeout(int millis)`.
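In jsoup the limit is raised with `Jsoup.connect(url).timeout(10000).get()` (milliseconds). A longer timeout combined with a simple retry loop is often enough to get past intermittent `SocketTimeoutException`s. The sketch below uses a simulated flaky fetch as a hypothetical stand-in for the jsoup call, so it runs without a network:

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDemo {

    static AtomicInteger attempts = new AtomicInteger();

    // Hypothetical stand-in for Jsoup.connect(url).timeout(10000).get():
    // times out the first two times, then succeeds.
    static String fetch(String url) throws SocketTimeoutException {
        if (attempts.incrementAndGet() < 3) {
            throw new SocketTimeoutException("Read timed out");
        }
        return "page content";
    }

    // Retries the fetch a bounded number of times before giving up.
    static String fetchWithRetry(String url, int maxTries) throws SocketTimeoutException {
        SocketTimeoutException last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return fetch(url);
            } catch (SocketTimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchWithRetry("https://example.com", 5));
    }
}
```

Bounding the retries matters: without a cap, a genuinely dead host would loop forever.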


I suggest Apache Nutch, an open-source search solution that includes support for HTML parsing. It is a very mature project: it uses Lucene under the hood, and its crawler is very reliable.


XPath is a great thing to learn, and it would be well suited to this job! I just started learning it myself for automated testing. If you have any questions, send me a message; I would be happy to help, although I am not an expert.

Here's a good link, since you're interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also useful to know when you are not using Java, so I would go this route.
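To show what this looks like with the JDK's built-in `javax.xml.xpath` API (the one the IBM article covers), here is a small sketch. Note that this API parses XML, so messy real-world HTML generally needs to be cleaned into well-formed XHTML first (e.g. with jsoup or JTidy); the sample markup and `price` class below are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Evaluates an XPath expression against a well-formed (X)HTML string
    // and returns the string value of the result.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"price\">19.99</div>"
                + "<div class=\"price\">24.99</div>"
                + "</body></html>";
        // Select the first element whose class attribute is "price".
        System.out.println(extract(page, "(//div[@class='price'])[1]"));
    }
}
```

The nice part for this use case is that one expression like `//div[@class='price']` pulls the same "almost identical" lines out of every page, instead of hand-written regex per layout.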

