I am working on a web crawler (please do not suggest an existing one; that is not an option). It works as expected. My only problem is that I currently use a kind of client/server model where the server crawls and processes the data, which is then placed in a central store.
That store is an object of a class I wrote. Internally, the class holds a hash map declared as HashMap<String, HashMap<String, String>>.
I key the outer map by URL (kept unique), and the inner hash map stores the corresponding data fields for that URL, such as name, value, etc.
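To make the layout concrete, here is a minimal sketch of such a store. The class and method names (`CrawlStore`, `put`, `get`) are illustrative, not from the original code; since the spider is multithreaded, this sketch uses `ConcurrentHashMap` for both levels:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a URL -> field-map store, as described above.
public class CrawlStore {
    // Outer key: unique URL. Inner map: metadata fields for that URL.
    private final Map<String, Map<String, String>> data = new ConcurrentHashMap<>();

    // Record one metadata field for a URL, creating the inner map on first use.
    public void put(String url, String field, String value) {
        data.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put(field, value);
    }

    // Look up one field for a URL; returns null if either level is missing.
    public String get(String url, String field) {
        Map<String, String> fields = data.get(url);
        return fields == null ? null : fields.get(field);
    }
}
```

`computeIfAbsent` keeps the "create inner map on first write" step atomic, which a plain `HashMap` would not.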
I sometimes serialize the internally used objects, but the spider is multithreaded, and as soon as, say, 5 threads are scanning, memory requirements grow exponentially.
So far, performance with the hash map has been excellent: traversing 15K URLs in 2.x minutes with 30 seconds of CPU time, so I really don't need to be pointed in the direction of an existing spider, as most forum users have suggested.
Can anyone suggest a fast disk-backed solution that supports concurrent reads and writes? The data structure does not have to be the same; it just needs to be able to store the associated meta tag values together for each URL.
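One possible direction, using only the JDK, is a file-per-URL layout: each URL's metadata lives in its own properties file named by the SHA-1 digest of the URL, so readers and writers of different URLs never contend, and only writes to the same URL are serialized. Everything below (`DiskStore`, the directory layout, the locking scheme) is a hedged sketch of this idea, not a production design; embeddable key-value stores such as MapDB or Berkeley DB JE are the more robust off-the-shelf route.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Properties;

// Sketch: one properties file per URL, named by the URL's SHA-1 hex digest.
public class DiskStore {
    private final Path dir;

    public DiskStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Map a URL to a stable on-disk file name via its SHA-1 digest.
    private Path fileFor(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return dir.resolve(hex + ".properties");
    }

    // Writes to the same URL are serialized on the interned file name;
    // writes to different URLs proceed in parallel.
    public void put(String url, String field, String value) throws Exception {
        Path file = fileFor(url);
        synchronized (file.toString().intern()) {
            Properties props = load(file);
            props.setProperty(field, value);
            try (OutputStream out = Files.newOutputStream(file)) {
                props.store(out, url);
            }
        }
    }

    public String get(String url, String field) throws Exception {
        return load(fileFor(url)).getProperty(field);
    }

    private Properties load(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }
}
```

The trade-off: every `put` rewrites the whole file for that URL, which is fine for small per-URL field maps but argues for a real embedded store once values grow.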
Thanks in advance.