Java disk-based hash map

I am working on a web crawler (please do not suggest an existing one, this is not an option). It works as expected. My only problem is that I am currently using a kind of server / client model where the server crawls and processes the data, which is then placed in a central location.

This location is an object created from a class that I wrote. Internally, the class maintains a hash map defined as HashMap<String, HashMap<String, String>>

I store the data in the map using the URL as the key (I keep it unique), and the inner HashMap value stores the corresponding data fields for that URL, such as name, value, etc.

I occasionally serialize the internally used objects, but the spider is multithreaded, and as soon as I have, say, 5 threads crawling, the memory requirements grow exponentially.

So far, performance has been excellent with the HashMap, crawling 15K URLs in 2.r minutes with about 30 seconds of CPU time, so I really don't need to be pointed in the direction of an existing spider, as most forum users have suggested.

Can anyone suggest a fast disk-based solution that supports concurrent reading and writing? The data structure does not have to be the same; it just needs to be able to store the related meta tag values together, etc.

Thanks in advance.

+4
5 answers

I suggest using EhCache, although what you are building is not really a cache. EhCache allows you to configure a cache instance so that it overflows to disk storage while retaining the most recently used elements in memory. It can also be configured to be disk-persistent, i.e. data is flushed to disk on shutdown and read back into memory at startup. On top of all that, it is key-value based, so it already fits your model. It supports concurrent access, and since the disk storage is managed in a separate thread, you should not need to worry about concurrent disk access.
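To make the overflow idea concrete, here is a minimal sketch using only the JDK. It is not EhCache's API and not how EhCache is implemented; it only illustrates "keep the most recently used entries in memory, spill the rest to disk". The class name `OverflowToDiskMap`, the one-file-per-key layout, and the coarse `synchronized` locking are all assumptions made for the illustration (Java 11+ assumed).

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: LRU entries stay in memory, evicted entries are serialized to disk. */
class OverflowToDiskMap {
    private final Path spillDir;
    private final Map<String, HashMap<String, String>> hot;

    OverflowToDiskMap(Path spillDir, int maxInMemory) {
        try {
            Files.createDirectories(spillDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.spillDir = spillDir;
        // accessOrder = true: iteration order is least recently used first.
        this.hot = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, HashMap<String, String>> eldest) {
                if (size() <= maxInMemory) return false;
                spill(eldest.getKey(), eldest.getValue()); // write to disk before dropping it
                return true;
            }
        };
    }

    // One file per key; the URL is percent-encoded so it is usable as a file name.
    private Path fileFor(String url) {
        return spillDir.resolve(URLEncoder.encode(url, StandardCharsets.UTF_8));
    }

    private void spill(String url, HashMap<String, String> fields) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(url)))) {
            out.writeObject(fields);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void put(String url, HashMap<String, String> fields) {
        try {
            Files.deleteIfExists(fileFor(url)); // the in-memory copy is now the current one
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        hot.put(url, fields); // may trigger removeEldestEntry and spill another entry
    }

    @SuppressWarnings("unchecked")
    synchronized HashMap<String, String> get(String url) {
        HashMap<String, String> fields = hot.get(url);
        if (fields != null) return fields;
        Path file = fileFor(url);
        if (!Files.exists(file)) return null;
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            fields = (HashMap<String, String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        put(url, fields); // promote back into memory
        return fields;
    }

    synchronized int inMemorySize() {
        return hot.size();
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "crawl-spill-" + System.nanoTime());
        OverflowToDiskMap store = new OverflowToDiskMap(dir, 2);
        for (String page : new String[] {"a", "b", "c"}) {
            HashMap<String, String> fields = new HashMap<>();
            fields.put("title", "Page " + page);
            store.put("http://example.com/" + page, fields);
        }
        System.out.println("in memory: " + store.inMemorySize());
        System.out.println("read back: " + store.get("http://example.com/a"));
    }
}
```

A real library does this far more efficiently (batched writes, a single data file, finer-grained locking), which is the main reason to use one rather than rolling your own.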

Alternatively, you could consider a proper embedded database such as Hypersonic (or many others of a similar style), but that would probably be more work.

+3

There is Tokyo Cabinet, which is a fast implementation of a disk-based hash table.

In your case, I believe the best way to store values in such a setup would be to prefix the metadata keys with the URL:

 [url]_[name] => [value]
 [url]_[name2] => [value2]

Unfortunately, I'm not sure whether you can enumerate the metadata for a given URL using this approach.
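On the listing question: if the store keeps its keys sorted, all fields of one URL can be found with a prefix range scan. The sketch below shows the idea with the JDK's in-memory `ConcurrentSkipListMap` standing in for an ordered on-disk store; whether a particular disk-based store offers ordered iteration is something to check in its documentation. The class name `PrefixKeyStore` is made up for the example, and it assumes a NUL character as separator instead of `_`, since underscores can occur in URLs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch: flat "url + separator + name" keys with prefix range scans. */
class PrefixKeyStore {
    // Assumption: this character never appears in a URL or a field name.
    private static final char SEP = '\u0000';

    // Sorted and thread-safe; stands in for an ordered disk-based store.
    private final NavigableMap<String, String> flat = new ConcurrentSkipListMap<>();

    void put(String url, String name, String value) {
        flat.put(url + SEP + name, value);
    }

    String get(String url, String name) {
        return flat.get(url + SEP + name);
    }

    /** Returns all fields stored for one URL, in field-name order. */
    Map<String, String> fieldsFor(String url) {
        String from = url + SEP;              // smallest possible key for this URL
        String to = url + (char) (SEP + 1);   // first key that sorts after every key for this URL
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : flat.subMap(from, true, to, false).entrySet()) {
            fields.put(e.getKey().substring(from.length()), e.getValue());
        }
        return fields;
    }

    public static void main(String[] args) {
        PrefixKeyStore store = new PrefixKeyStore();
        store.put("http://example.com/a", "title", "Page A");
        store.put("http://example.com/a", "description", "First page");
        store.put("http://example.com/ab", "title", "Page AB");
        System.out.println(store.fieldsFor("http://example.com/a"));
    }
}
```

With a plain hash table (no key ordering) this scan is not possible, so you would have to keep a separate per-URL list of field names.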

If you want to use a more structured data store, there are also MongoDB and SQLite, which I would recommend.

+1

The JDBM2 library provides persistent maps for Java. It's fast and thread-safe.

UPDATE: It has since evolved into the MapDB project.

+1

How about using JPA in your class and storing the data in a database (which can be a simple file-based one like SQLite)? http://en.wikipedia.org/wiki/Java_Persistence_API

0

Chronicle Map is a hash-based embedded data store that keeps its data on disk (in a single file) and aims to be a replacement for ConcurrentHashMap (it provides the same ConcurrentMap interface). Chronicle Map is the fastest store among similar solutions and has excellent read / write concurrency, scaling almost linearly with the number of available cores in the machine.
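Because the interface is plain `ConcurrentMap`, the crawler's storage class can be written against that interface and the backing map swapped later. The following is only a sketch under that assumption: `CrawlStore` is a made-up name, and the demo uses the JDK's `ConcurrentHashMap` so it runs without any extra library. Updates go through `compute` and write back a fresh copy of the field map, because a disk-backed map typically hands out deserialized copies, so mutating a returned value in place would not be persisted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: crawler storage that depends only on the ConcurrentMap interface. */
class CrawlStore {
    private final ConcurrentMap<String, Map<String, String>> data;

    // Any ConcurrentMap works here; a disk-backed implementation could be passed instead.
    CrawlStore(ConcurrentMap<String, Map<String, String>> backing) {
        this.data = backing;
    }

    /** Adds or replaces one field for a URL by writing back a new copy of its field map. */
    void putField(String url, String name, String value) {
        data.compute(url, (key, fields) -> {
            Map<String, String> copy = (fields == null) ? new HashMap<>() : new HashMap<>(fields);
            copy.put(name, value);
            return copy;
        });
    }

    Map<String, String> fields(String url) {
        return data.getOrDefault(url, Collections.emptyMap());
    }

    int size() {
        return data.size();
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlStore store = new CrawlStore(new ConcurrentHashMap<>());
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (int t = 0; t < 5; t++) {
            final int worker = t;
            // Each worker records one distinct field on each of 100 URLs.
            pool.submit(() -> {
                for (int i = 0; i < 100; i++) {
                    store.putField("http://example.com/" + i, "worker" + worker, "seen");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("urls: " + store.size());
        System.out.println("fields on first url: " + store.fields("http://example.com/0").size());
    }
}
```

Whether `compute` is atomic per key for a given disk-backed implementation is worth confirming in that library's documentation before relying on it.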

Disclaimer: I am a Chronicle Map developer.

0

Source: https://habr.com/ru/post/1316632/

