I am working on a web crawler (please do not suggest an existing one; that is not an option). It works as expected. My only problem is that I currently use a kind of client/server model where the server crawls and processes the data, which is then placed in a central store.
That store is an object of a class I wrote. Internally, the class holds a hash map declared as HashMap<String, HashMap<String, String>>.
I key the outer map by URL (kept unique), and the inner hash map stores the corresponding data fields for that URL, such as name, value, etc.
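To make the layout concrete, here is a minimal sketch of such a store. The class and method names (`CrawlStore`, `put`, `get`) are illustrative, not from the original code; since the spider is multithreaded, this sketch uses `ConcurrentHashMap` for both levels:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a URL -> field-map store, as described above.
public class CrawlStore {
    // Outer key: unique URL. Inner map: metadata fields for that URL.
    private final Map<String, Map<String, String>> data = new ConcurrentHashMap<>();

    // Record one metadata field for a URL, creating the inner map on first use.
    public void put(String url, String field, String value) {
        data.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put(field, value);
    }

    // Look up one field for a URL; returns null if either level is missing.
    public String get(String url, String field) {
        Map<String, String> fields = data.get(url);
        return fields == null ? null : fields.get(field);
    }
}
```

`computeIfAbsent` keeps the "create inner map on first write" step atomic, which a plain `HashMap` would not.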
I sometimes serialize the internally used objects, but the spider is multithreaded, and as soon as, say, 5 threads are scanning, memory requirements grow exponentially.
So far, performance with the hash map has been excellent: traversing 15K URLs in 2.x minutes with 30 seconds of CPU time, so I really don't need to be pointed in the direction of an existing spider, as most forum users have suggested.
Can anyone suggest a fast disk-backed solution that supports concurrent reads and writes? The data structure does not have to be the same; it just needs to be able to store the associated meta tag values together for each URL.
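One possible direction, using only the JDK, is a file-per-URL layout: each URL's metadata lives in its own properties file named by the SHA-1 digest of the URL, so readers and writers of different URLs never contend, and only writes to the same URL are serialized. Everything below (`DiskStore`, the directory layout, the locking scheme) is a hedged sketch of this idea, not a production design; embeddable key-value stores such as MapDB or Berkeley DB JE are the more robust off-the-shelf route.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Properties;

// Sketch: one properties file per URL, named by the URL's SHA-1 hex digest.
public class DiskStore {
    private final Path dir;

    public DiskStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Map a URL to a stable on-disk file name via its SHA-1 digest.
    private Path fileFor(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return dir.resolve(hex + ".properties");
    }

    // Writes to the same URL are serialized on the interned file name;
    // writes to different URLs proceed in parallel.
    public void put(String url, String field, String value) throws Exception {
        Path file = fileFor(url);
        synchronized (file.toString().intern()) {
            Properties props = load(file);
            props.setProperty(field, value);
            try (OutputStream out = Files.newOutputStream(file)) {
                props.store(out, url);
            }
        }
    }

    public String get(String url, String field) throws Exception {
        return load(fileFor(url)).getProperty(field);
    }

    private Properties load(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }
}
```

The trade-off: every `put` rewrites the whole file for that URL, which is fine for small per-URL field maps but argues for a real embedded store once values grow.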
Thanks in advance.