Creating a key-value store on disk using concurrency in Java

I need to read a set of files, split them into key-value pairs, and save them on disk as (key, list of values for that key), much like the map-reduce paradigm, except everything runs on a single machine. I could, for example, write the different lists to different files and name each file after its key, but that seems like a very bad approach: for starters, with a billion keys you would end up with a billion files. So obviously that won't work, and I will need some kind of memory-mapped scheme. I will also have several threads performing the map task, so if they write to the same buffer there has to be some synchronization between them. If I have a key-to-buffer mapping and synchronization on the buffers, the threads shouldn't step on each other's toes, so I think that part should work. The question is how to do the mapping of values onto disk. How do I write buffers corresponding to different keys into a single file? If someone could point me in the right direction, I would be very grateful. My knowledge in this area is rather pathetic. Thanks again.

+4
4 answers

From a practical point of view, it would be easy to do this with BerkeleyDB, as Lyric suggested.

If you are more interested in theory than practice, I would suggest you approach it as an "external sort" operation. That is, read as much of the input as you can into memory, then sort it by key. Write the sorted chunk out as a single file. The sorted files can then easily be merged into one file.
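A rough sketch of that chunk-and-merge idea in plain Java (class and method names here are made up for illustration, and the final k-way merge of the chunk files is only described in a comment):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExternalSortSketch {

    // Read records until the in-memory buffer is "full", sort them by key, spill to a temp file.
    static List<Path> spillSortedChunks(BufferedReader in, int maxRecordsInMemory) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);                                  // assume "key<TAB>value" lines
            if (buffer.size() >= maxRecordsInMemory) {
                chunks.add(writeSortedChunk(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> buffer) throws IOException {
        buffer.sort(Comparator.comparing((String s) -> s.substring(0, s.indexOf('\t'))));
        Path chunk = Files.createTempFile("sorted-chunk", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        return chunk;
    }

    // Merging the sorted chunk files afterwards is a standard k-way merge:
    // open a reader per chunk, keep each reader's current line in a PriorityQueue
    // ordered by key, and repeatedly write out the smallest one.
}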

Among other applications, this is the approach Lucene uses to build "inverted indexes" for searching text. The "keys" are the words in the documents, and the "values" are a list of the documents in which each word appears. Lucene reads the documents and, for each word, creates an entry in an in-memory index. When memory fills up, it writes an index segment to disk. When there are many index segments on disk, they are merged into a single segment. In fact, you may even be able to adapt Lucene's index writer to your task.
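Applied to your key → list-of-values case, the same idea might look roughly like this (a sketch only, with a made-up flush threshold and file naming; this is not Lucene code):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentBuffer {
    private final TreeMap<String, List<String>> postings = new TreeMap<>(); // key -> values, kept sorted by key
    private final int flushThreshold;                                       // how many values to buffer before flushing
    private int buffered = 0;
    private int segmentNo = 0;

    SegmentBuffer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void add(String key, String value) throws IOException {
        postings.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (++buffered >= flushThreshold) {
            flushSegment();
        }
    }

    // Write one sorted "segment" file: key<TAB>value,value,... per line, then start over.
    void flushSegment() throws IOException {
        Path segment = Paths.get("segment-" + (segmentNo++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(segment, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<String>> e : postings.entrySet()) {
                out.write(e.getKey() + "\t" + String.join(",", e.getValue()));
                out.newLine();
            }
        }
        postings.clear();
        buffered = 0;
    }
}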

The work can be divided among several threads. However, you must be sensitive to disk contention: seeking back and forth to read and write many files at the same time will slow a traditional drive down considerably. There may still be opportunities to schedule some activities concurrently; you could probably read new data from one file while writing the previous sorted chunk to disk, especially if the machine has two disks. Of course, using an SSD for temporary storage of some of the sorted segments would help a lot.
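For example, a minimal sketch of overlapping reading and writing with one reader thread and one sorter/writer thread, assuming the producer/consumer pattern with a bounded queue (file names and chunk size are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.stream.Stream;

public class ReadWriteOverlap {
    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(2); // at most 2 chunks in flight
        List<String> poison = Collections.emptyList();                   // end-of-input sentinel

        // Reader: pulls new data from the input file and hands chunks to the writer.
        Thread reader = new Thread(() -> {
            try (Stream<String> lines = Files.lines(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>();
                for (Iterator<String> it = lines.iterator(); it.hasNext(); ) {
                    chunk.add(it.next());
                    if (chunk.size() == 1_000_000) {                     // hypothetical chunk size
                        queue.put(chunk);
                        chunk = new ArrayList<>();
                    }
                }
                if (!chunk.isEmpty()) {
                    queue.put(chunk);
                }
                queue.put(poison);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        // Writer: sorts the previous chunk and writes it while the reader fetches the next one.
        Thread writer = new Thread(() -> {
            try {
                int n = 0;
                List<String> chunk;
                while ((chunk = queue.take()) != poison) {               // identity check on the sentinel
                    chunk.sort(null);                                    // natural (whole-line) ordering
                    Files.write(Paths.get("chunk-" + (n++) + ".txt"), chunk, StandardCharsets.UTF_8);
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}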

+5

I think Oracle Berkeley DB could be a good fit for you:

Berkeley DB

Berkeley DB is designed to store data as opaque byte arrays in key/value pairs, indexed by one of the available access methods, as shown above.

Berkeley DB is very reliable, mature and fast, but if you want to go with a more lightweight approach, use SQLite.
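For a sense of what the key/value API looks like, here is a minimal sketch assuming Berkeley DB Java Edition (the com.sleepycat.je package); the directory, database name and data are placeholders:

import com.sleepycat.je.*;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class BdbExample {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig); // directory must exist

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "kvStore", dbConfig);

        // Keys and values are opaque byte arrays.
        DatabaseEntry key = new DatabaseEntry("someKey".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("v1,v2,v3".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.UTF_8));
        }

        db.close();
        env.close();
    }
}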

Another option is to use Google's LevelDB; it is written in C++, but there are Java wrappers around it. LevelDB is staggeringly fast and very lightweight!
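As a rough sketch of what one of those wrappers looks like in use, assuming the pure-Java iq80 port (org.iq80.leveldb); the factory class and paths depend on the wrapper you pick:

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.impl.Iq80DBFactory;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class LevelDbExample {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.createIfMissing(true);
        DB db = Iq80DBFactory.factory.open(new File("leveldb-data"), options);
        try {
            // Keys and values are byte arrays, just like in Berkeley DB.
            db.put("someKey".getBytes(StandardCharsets.UTF_8),
                   "v1,v2,v3".getBytes(StandardCharsets.UTF_8));
            byte[] value = db.get("someKey".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(value, StandardCharsets.UTF_8));
        } finally {
            db.close();
        }
    }
}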

Without additional information about your project, I can only say:

  • With all of these solutions, the key/value pairs will be stored in a single file (multiple instances can store separate files if necessary, but I don't see why that would be needed).
  • BerkeleyDB and LevelDB both have really good caching capabilities.
  • BDB and LDB also support compression (I'm not sure whether SQLite does).
  • Depending on your key distribution (i.e., if you use a good hash function such as Google's CityHash, for example), you can achieve a really good data distribution and reduce table scans.
  • You should probably write your own thread-safe buffer, and you should avoid having multiple threads write to BDB/LDB, because these solutions are disk-based and you generally don't want multi-threaded disk I/O (a sketch follows below).
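One simple way to get that single-writer discipline is to have the map threads hand their pairs to a queue that only one writer thread drains; a minimal sketch, where the store() call is a placeholder for whichever disk-backed store you choose (BDB, LevelDB, ...):

import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SingleWriter {
    private static final Map.Entry<String, String> POISON = new AbstractMap.SimpleEntry<>("", "");

    public static void main(String[] args) throws Exception {
        BlockingQueue<Map.Entry<String, String>> queue = new LinkedBlockingQueue<>(10_000);
        ExecutorService mappers = Executors.newFixedThreadPool(4);

        // Many mapper threads produce key/value pairs...
        for (int i = 0; i < 4; i++) {
            mappers.submit(() -> {
                for (int j = 0; j < 1000; j++) {
                    queue.put(new AbstractMap.SimpleEntry<>("key" + j, Thread.currentThread().getName()));
                }
                return null;
            });
        }

        // ...but only this one thread ever touches the disk-backed store.
        Thread writer = new Thread(() -> {
            try {
                Map.Entry<String, String> e;
                while ((e = queue.take()) != POISON) {
                    store(e.getKey(), e.getValue()); // e.g. db.put(...) for BDB or LevelDB
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        mappers.shutdown();
        mappers.awaitTermination(1, TimeUnit.MINUTES);
        queue.put(POISON); // tell the writer there is nothing more to come
        writer.join();
    }

    static void store(String key, String value) {
        // placeholder: write to the single on-disk store here
    }
}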

One criticism: I'm not sure what you mean by "mapping a buffer to a key"... do you map a buffer to each key? Why do you need that?

+4

Have you looked at using Hadoop ?

0

Chronicle Map should be a good solution to this problem.

In general, it is very efficient in terms of both speed and memory consumption, i.e. it is much faster than the previously suggested BerkeleyDB.
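For context, a persisted Chronicle Map is created through its builder; a minimal sketch, where the sizes, sample entries and file name are placeholders you would tune for your data:

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class ChronicleExample {
    public static void main(String[] args) throws Exception {
        ChronicleMap<CharSequence, CharSequence> map = ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .name("kv-store")
                .entries(1_000_000)          // expected number of keys
                .averageKey("someKey")       // sample key used to size entries
                .averageValue("v1,v2,v3")    // sample value used to size entries
                .createPersistedTo(new File("kv.dat"));

        map.put("someKey", "v1,v2,v3");
        System.out.println(map.get("someKey"));
        map.close();
    }
}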

Chronicle Map is segmented storage and allows parallel processing of the segments, e.g.:

for (int i = 0; i < chronicleMap.segments(); i++) {
    int segmentIndex = i;
    executor.submit(() -> {
        chronicleMap.segmentContext(segmentIndex).forEachSegmentEntry(entry -> {
            // do processing with entry.key() and entry.value(),
            // value() could be a List or some Iterator-like abstraction
        });
    });
}

See the MapSegmentContext Javadocs.

However, having (logically) multiple values per key is not something Chronicle Map can always handle efficiently. But in your case, if you only need to process a static set of values for each key rather than add/remove values, it may work well.

0
