RAM caching using HDFS

I need to process some large files (~ 2 TB) with a small cluster (~ 10 servers) in order to create a relatively small report (several GB).

I’m mostly interested in the final report, not the intermediate results, and the machines have plenty of RAM, so it would be fantastic to use it to minimize disk access (and therefore increase speed), ideally keeping data blocks in volatile memory and touching the disk only when necessary.

Looking at the configuration files and at a previous question, it seems that Hadoop does not offer this feature. The Spark website mentions the MEMORY_AND_DISK storage level, but I would rather not ask the company to deploy new software based on a new language.

The only "solution" I found was to set dfs.datanode.data.dir as /dev/shm/ in hdfs-default.xml to trick it into using volatile memory instead of a file system to store data, but in this case it will behave badly, I suppose, when the RAM is filled and uses a swap.

Is there a trick to make Hadoop keep as much data as possible in RAM and write to disk only when necessary?

+4
2 answers

You can play with mapred.job.reduce.input.buffer.percent (the default is 0; try something closer to 1.0, see for example this blog post) and also set mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation.
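
As a rough sketch of where these settings would go (the values below are only starting points for experimentation, not recommendations):

    <!-- mapred-site.xml (or per-job configuration): keep reduce-side
         segments in memory instead of spilling them to disk. -->
    <property>
      <name>mapred.job.reduce.input.buffer.percent</name>
      <value>0.9</value>
    </property>
    <property>
      <name>mapred.inmem.merge.threshold</name>
      <value>0</value>
    </property>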

+1

Since the release of Hadoop 2.3, you can use HDFS in-memory caching (Centralized Cache Management).
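
A minimal sketch of how that feature is typically used, assuming a hypothetical cache pool name and input path:

    # hdfs-site.xml must allow the DataNode to lock memory for cached blocks,
    # e.g. dfs.datanode.max.locked.memory (bounded by the "ulimit -l" limit).

    # Create a cache pool and pin the input data into DataNode memory
    hdfs cacheadmin -addPool reportPool
    hdfs cacheadmin -addDirective -path /data/input -pool reportPool

    # Check what is actually cached
    hdfs cacheadmin -listDirectives -stats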

+1
