I need to process some large files (~ 2 TB) with a small cluster (~ 10 servers) in order to create a relatively small report (several GB).
I’m mostly interested in the final report, not the intermediate results, and the machines have a large amount of RAM, so it would be great to use it to minimize disk access (and therefore speed things up), ideally keeping data blocks in volatile memory and touching the disk only when necessary.
Looking at the configuration files and the previous question, it seems that Hadoop does not offer this feature. The Spark website talks about the MEMORY_AND_DISK storage level, but I would rather not ask the company to deploy new software based on a new language.
The only "solution" I found was to set dfs.datanode.data.dir as /dev/shm/ in hdfs-default.xml to trick it into using volatile memory instead of a file system to store data, but in this case it will behave badly, I suppose, when the RAM is filled and uses a swap.
Is there a trick to make Hadoop store as much as possible in RAM and write to disk only when necessary?