Web scraping, screen scraping, data mining tips?

I am working on a project where I need to do a lot of screen scraping to gather as much data as possible. Does anyone know of a good API or other resources that could help?

I am using Java, by the way.

Here is what my workflow has been up to now:

  • Connect to the website (using HttpComponents from Apache)
  • The website contains a section with a bunch of links that I need to visit (using the built-in Java HTML parser to work out which links those are leads to annoying, dirty code)
  • Visit each of the links found.
  • For every link I visit, the data I need to extract is spread across several pages, so I need to follow still more links.
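The workflow above can be sketched with only the JDK: `java.net.http.HttpClient` (Java 11+) for the fetch step, plus a deliberately naive regex for pulling out `href` values. The URL and the `extractHrefs` helper are illustrative, not part of any library; a real HTML parser copes with malformed markup far better than this regex does.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchSketch {
    // Naive href extraction; fine for a sketch, brittle on real-world HTML.
    static final Pattern HREF = Pattern.compile("href=[\"']([^\"']+)[\"']");

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical start page; substitute the real one.
        String html = fetch("https://example.com/");
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```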

Thoughts:

  • Does anyone know of an HTML parser that is higher-level than the one built into Java?
  • This is essentially a depth-first search. I would like to make it multi-threaded at some point, so I can visit several of these links in parallel.
  • Perhaps what I am really looking for is a multi-threaded web crawling library.
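The multi-threaded crawl in the thoughts above can be skeletonized with plain `java.util.concurrent`, no crawling library required. This is a sketch under one assumption: the per-page work (fetch + parse) is hidden behind an injected `linkExtractor` function, so the threading logic stays independent of whichever HTTP client and parser end up being used. A `Phaser` tracks outstanding tasks and a concurrent set deduplicates URLs.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

public class ParallelCrawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Phaser phaser = new Phaser(1); // party 1 is the caller
    private final Function<String, List<String>> linkExtractor;

    public ParallelCrawler(Function<String, List<String>> linkExtractor) {
        this.linkExtractor = linkExtractor;
    }

    public Set<String> crawl(String start) throws InterruptedException {
        submit(start);
        phaser.arriveAndAwaitAdvance(); // block until every task has finished
        pool.shutdown();
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) return; // already seen, skip
        phaser.register();             // one party per in-flight task
        pool.submit(() -> {
            try {
                for (String next : linkExtractor.apply(url)) {
                    submit(next);      // tasks may spawn further tasks
                }
            } finally {
                phaser.arriveAndDeregister();
            }
        });
    }
}
```

Because tasks register with the phaser before their parent deregisters, the caller cannot unblock until the whole link graph has been explored.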

If this sounds unclear, it is because this is the first time I have come across this problem, so I am having a hard time articulating my needs. I would really appreciate any input from those of you who have done this before.

+6
java html-parsing web-scraping screen-scraping data-mining
4 answers

I found JSoup really good for parsing HTML.
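A minimal jsoup sketch of the link-extraction step, parsing from a string here so it is self-contained (in practice you would use `Jsoup.connect(url).get()` against the real site). Passing a base URI lets `attr("abs:href")` resolve relative links to absolute ones.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<html><body><a href='https://example.com/one'>one</a>"
                    + "<a href='/two'>two</a></body></html>";
        // Base URI makes relative hrefs resolvable to absolute URLs.
        Document doc = Jsoup.parse(html, "https://example.com/");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```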

For more pointers, check out this article: How to write a multi-threaded web crawler

+9

I used Bixo to extract links and images in a depth-first crawl. It is built on Hadoop and Cascading, so there is a learning curve, but the sample code is good enough to adapt to your needs...

+2

Try using Web-Harvest.

+1

Check out JSR-237 for managing your units of work; it is a good approach to multi-threading.

As for the scraping itself, there are several alternatives. If ease of use is important, I would advise HtmlUnit. Beyond that, you will have to roll your own.

0
