Parsing HTML Web Pages in Java

I need to parse/read many (100+) HTML web pages for specific content: multiple lines of text that are almost identical across pages.

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error on multiple computers with different connections: `java.net.SocketTimeoutException: Read timed out`.

Is there anything better?

EDIT:

Now that I got jsoup to work, I think the best question is how do I speed it up?
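Since the dominant cost here is network I/O rather than parsing, fetching the pages concurrently usually gives the biggest speedup. Below is a minimal sketch using a fixed-size thread pool; the `fetch` method is a hypothetical stand-in for the real call (e.g. `Jsoup.connect(url).timeout(10_000).get()`), simulated here so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetchDemo {

    // Hypothetical stand-in for the real fetch, e.g.
    // Jsoup.connect(url).timeout(10_000).get() -- here it just
    // sleeps briefly to simulate network latency.
    static String fetch(String url) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content of " + url;
    }

    // Fetches all URLs concurrently on a fixed-size thread pool.
    static List<String> fetchAll(List<String> urls, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetch(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that fetch finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("https://example.com/page" + i);
        }
        List<String> pages = fetchAll(urls, 10);
        System.out.println(pages.size());
    }
}
```

With 100+ pages and a pool of 10 threads, total time drops to roughly the slowest tenth of the fetches instead of their sum; tune the pool size to what the target server tolerates.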

3 answers

Have you tried extending the timeout in jsoup? By default it is only 3 seconds; see `Connection.timeout(int millis)`.
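In jsoup the limit is raised with `Jsoup.connect(url).timeout(10000).get()` (milliseconds). A longer timeout combined with a simple retry loop is often enough to get past intermittent `SocketTimeoutException`s. The sketch below uses a simulated flaky fetch as a hypothetical stand-in for the jsoup call, so it runs without a network:

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryDemo {

    static AtomicInteger attempts = new AtomicInteger();

    // Hypothetical stand-in for Jsoup.connect(url).timeout(10000).get():
    // times out the first two times, then succeeds.
    static String fetch(String url) throws SocketTimeoutException {
        if (attempts.incrementAndGet() < 3) {
            throw new SocketTimeoutException("Read timed out");
        }
        return "page content";
    }

    // Retries the fetch a bounded number of times before giving up.
    static String fetchWithRetry(String url, int maxTries) throws SocketTimeoutException {
        SocketTimeoutException last = null;
        for (int i = 0; i < maxTries; i++) {
            try {
                return fetch(url);
            } catch (SocketTimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchWithRetry("https://example.com", 5));
    }
}
```

Bounding the retries matters: without a cap, a genuinely dead host would loop forever.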


I suggest Apache Nutch, an open-source search solution that includes support for HTML parsing. It is a very mature project: it uses Lucene under the hood, and its crawler is very reliable.


XPath is a great thing to learn, and it would be well suited to this job! I just started learning it myself for automated testing. If you have any questions, send me a message; I would be happy to help, although I am not an expert.

Here's a good link, since you're interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also useful to know when you are not using Java, so I would go this route.
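To show what this looks like with the JDK's built-in `javax.xml.xpath` API (the one the IBM article covers), here is a small sketch. Note that this API parses XML, so messy real-world HTML generally needs to be cleaned into well-formed XHTML first (e.g. with jsoup or JTidy); the sample markup and `price` class below are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Evaluates an XPath expression against a well-formed (X)HTML string
    // and returns the string value of the result.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"price\">19.99</div>"
                + "<div class=\"price\">24.99</div>"
                + "</body></html>";
        // Select the first element whose class attribute is "price".
        System.out.println(extract(page, "(//div[@class='price'])[1]"));
    }
}
```

The nice part for this use case is that one expression like `//div[@class='price']` pulls the same "almost identical" lines out of every page, instead of hand-written regex per layout.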

