I am working on a small project to analyze the content of some sites that interest me. This is strictly a DIY project that I'm doing for my own entertainment/enlightenment, so I would like to code as much of it as possible myself.
Obviously, I will need data to feed my application, and I was thinking I would write a little crawler that would fetch maybe 20 thousand pages of HTML and write them to text files on my hard drive. However, when I searched SO and other sites, I couldn't find any information on how to do this. Is it feasible? There seem to be open-source options available (WebSPHINX?), but I would like to write this myself if I can.
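
For concreteness, here's roughly what I imagine a single fetch-and-save step looking like in Java (a sketch only, assuming Java 11+ and its built-in `java.net.http` client; the URL and output file name are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; a real run would use pages from the sites I care about
        String url = "https://example.com/";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();

        // Fetch the page body as a string
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Save the raw HTML to a text file on disk
        Files.writeString(Path.of("page.txt"), response.body());
        System.out.println("Saved " + url + " (status " + response.statusCode() + ")");
    }
}
```

Is this the right general approach, or is there a standard library/idiom I'm missing?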
Racket is the only language I know well, but I thought I would use this project to learn some Java, so I would be interested in any Racket or Java libraries that would be useful for this.
So, to summarize my question: what are some good resources for getting started with this? How do I get my crawler to request information from other servers? Should I write a simple parser for this, or is that overkill given that I just want to take each whole HTML file and save it as txt? (A sketch of the kind of loop I have in mind follows.)
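
To make the "no parser, just dump everything" idea concrete, this is the kind of loop I'm imagining (again a sketch, not a finished design: the seed URL list, output file naming, and the one-second politeness delay are all placeholder assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Crawler {
    public static void main(String[] args) throws Exception {
        // Placeholder seed list; a real run would load ~20k URLs from a file
        List<String> urls = List.of(
                "https://example.com/page1",
                "https://example.com/page2");

        HttpClient client = HttpClient.newHttpClient();
        int i = 0;
        for (String url : urls) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // Write the whole response body untouched -- no parsing step
            Files.writeString(Path.of("page-" + (i++) + ".txt"), response.body());

            Thread.sleep(1000); // crude delay so I don't hammer the server
        }
    }
}
```

If I'm just archiving pages wholesale like this, do I need any HTML parsing at all, or only if I later want to extract links to discover new pages to crawl?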