I am working on a project where I need to do a lot of screen scraping to get as much data as possible. I am wondering if anyone knows of a good API or other resources to help me.
I am using Java, by the way.
Here is what my workflow has been up to now:
- Connect to the website (using Apache HttpComponents)
- The website contains a section with a bunch of links that I need to visit (using the built-in Java HTML parser to figure out which links I need to visit results in annoying, dirty code; see the sketch after this list)
- Visit all the links I found.
- For every link that I visit, there is more data I need to extract, spread out over several pages, so I need to visit more links.
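To give a concrete idea of what I mean by "annoying and dirty", here is roughly what the fetch-and-link-extraction step looks like (a simplified sketch, assuming HttpClient 4.x and the javax.swing HTML parser; the class name and URL are just placeholders):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class LinkExtractor {
    public static List<String> extractLinks(String url) throws Exception {
        // Fetch the page with Apache HttpComponents.
        String html;
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            html = EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());
        }

        // Collect the href of every <a> tag using the built-in Swing HTML parser.
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }
}
```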
Thoughts:
- Does anyone know of any higher-level HTML parsers than the one built into Java?
- This is mainly a depth-first search. I would like to make it multi-threaded at some point, so I can visit some of these links in parallel.
- Perhaps what I'm really looking for is a multi-threaded web crawling library; a rough sketch of what I have in mind follows this list.
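For the multi-threading, I was imagining something along these lines (just a rough sketch with a plain ExecutorService; extractLinks refers to the sketch above, and the pool size is arbitrary):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Crawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8); // arbitrary size
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public void crawl(String url) {
        // Submit each URL only once, even if several pages link to it.
        if (!visited.add(url)) {
            return;
        }
        pool.submit(() -> {
            try {
                // extractLinks is the fetch-and-parse method from the sketch above.
                for (String link : LinkExtractor.extractLinks(url)) {
                    crawl(link); // every discovered link becomes its own task
                }
            } catch (Exception e) {
                e.printStackTrace(); // a real crawler would handle/retry failures properly
            }
        });
    }
}
```

Once tasks run in parallel this no longer preserves a strict depth-first order, but for just collecting data I don't think the order matters.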
In case it isn't clear: this is the first time I've come across this kind of problem, so I'm having a hard time articulating my needs. I would really appreciate any input from those of you who have done this before.
java html-parsing web-scraping screen-scraping data-mining
Jpc