Web scraping, screen scraping, data mining tips?

I am working on a project where I need to do a lot of screen scraping to gather as much data as possible. Does anyone know of a good API or other resources that could help?

I am using Java, by the way.

Here is what my workflow has been up to now:

  • Connect to the website (using HttpComponents from Apache)
  • The website contains a section with a bunch of links that I need to visit (using the built-in Java HTML parser to work out which links those are leads to annoying, dirty code)
  • Visit each of the links found.
  • For every link I visit, the data I need to extract is spread across several pages, so I need to follow still more links.
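The workflow above can be sketched with only the JDK: `java.net.http.HttpClient` (Java 11+) for the fetch step, plus a deliberately naive regex for pulling out `href` values. The URL and the `extractHrefs` helper are illustrative, not part of any library; a real HTML parser copes with malformed markup far better than this regex does.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchSketch {
    // Naive href extraction; fine for a sketch, brittle on real-world HTML.
    static final Pattern HREF = Pattern.compile("href=[\"']([^\"']+)[\"']");

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical start page; substitute the real one.
        String html = fetch("https://example.com/");
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```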

Thoughts:

  • Does anyone know of an HTML parser that is higher-level than the one built into Java?
  • This is essentially a depth-first search. I would like to make it multi-threaded at some point, so I can visit several of these links in parallel.
  • Perhaps what I am really looking for is a multi-threaded web crawling library.
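The multi-threaded crawl in the thoughts above can be skeletonized with plain `java.util.concurrent`, no crawling library required. This is a sketch under one assumption: the per-page work (fetch + parse) is hidden behind an injected `linkExtractor` function, so the threading logic stays independent of whichever HTTP client and parser end up being used. A `Phaser` tracks outstanding tasks and a concurrent set deduplicates URLs.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

public class ParallelCrawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Phaser phaser = new Phaser(1); // party 1 is the caller
    private final Function<String, List<String>> linkExtractor;

    public ParallelCrawler(Function<String, List<String>> linkExtractor) {
        this.linkExtractor = linkExtractor;
    }

    public Set<String> crawl(String start) throws InterruptedException {
        submit(start);
        phaser.arriveAndAwaitAdvance(); // block until every task has finished
        pool.shutdown();
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) return; // already seen, skip
        phaser.register();             // one party per in-flight task
        pool.submit(() -> {
            try {
                for (String next : linkExtractor.apply(url)) {
                    submit(next);      // tasks may spawn further tasks
                }
            } finally {
                phaser.arriveAndDeregister();
            }
        });
    }
}
```

Because tasks register with the phaser before their parent deregisters, the caller cannot unblock until the whole link graph has been explored.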

If this sounds unclear, it is because this is the first time I have come across this problem, so I am having a hard time articulating my needs. I would really appreciate any input from those of you who have done this before.

+6
java html-parsing web-scraping screen-scraping data-mining
4 answers

I found JSoup really good for parsing HTML.
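A minimal jsoup sketch of the link-extraction step, parsing from a string here so it is self-contained (in practice you would use `Jsoup.connect(url).get()` against the real site). Passing a base URI lets `attr("abs:href")` resolve relative links to absolute ones.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<html><body><a href='https://example.com/one'>one</a>"
                    + "<a href='/two'>two</a></body></html>";
        // Base URI makes relative hrefs resolvable to absolute URLs.
        Document doc = Jsoup.parse(html, "https://example.com/");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```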

For more pointers, check out this article: How to write a multi-threaded web crawler

+9

I used Bixo to extract links and images in a depth-first crawl. It is built on Hadoop and Cascading, so there is a learning curve, but the sample code is good enough to adapt to your needs...

+2

Try using Web-Harvest.

+1

Check out JSR-237 for managing your units of work; it is a good approach to multi-threading.

As for the scraping itself, there are several alternatives. If ease of use is important, I would advise HtmlUnit. Beyond that, you will have to roll your own.

0
