I’m running some experiments to compare the time taken by a MapReduce job to read and process data stored on HDFS under various parameters. I use a Pig script to launch the MapReduce job. Since I often work with the same set of files, my results may be skewed by file and block caching.
I want to understand the various caching mechanisms at play in a MapReduce environment.
Suppose a file foo (containing some data to be processed) stored on HDFS occupies 1 HDFS block and resides on machine STORE. During the map task, machine COMPUTE reads that block over the network and processes it. Caching can occur at two levels:
- Caching in STORE machine's memory (the OS in-memory buffer cache)
- Caching in the memory / disk of the COMPUTE machine
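For what it's worth, the way I am thinking of neutralizing the OS page cache between runs is something like the sketch below (assuming Linux nodes, password-less SSH, and sudo rights; the host names are placeholders, not my actual setup):

```java
import java.io.IOException;

/**
 * Sketch: drop the OS page cache on each node between experiment runs so
 * that repeated reads of the same HDFS blocks are not served from memory.
 */
public class DropPageCache {
    public static void main(String[] args) throws IOException, InterruptedException {
        String[] nodes = {"store-node", "compute-node"};   // hypothetical hosts
        for (String node : nodes) {
            // sync flushes dirty pages; writing 3 to drop_caches evicts the
            // page cache plus dentries/inodes (requires root privileges).
            ProcessBuilder pb = new ProcessBuilder(
                "ssh", node,
                "sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'");
            pb.inheritIO();
            int exit = pb.start().waitFor();
            System.out.println(node + " -> exit code " + exit);
        }
    }
}
```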
I am fairly sure that caching #1 happens. What I want to find out is whether something like #2 happens as well. From the post here, it seems there is no client-level caching in HDFS, since a block cached on COMPUTE is very unlikely to be needed again on the same machine before it would be evicted from the cache.
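To probe #2 empirically, I could time two back-to-back reads of foo from the COMPUTE machine, roughly as in the sketch below (the path is just an example). A much faster second read only shows that *some* cache is warm, so I would combine it with dropping the page cache on one node at a time, as in the earlier snippet, to attribute the effect to STORE or COMPUTE:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: read the same HDFS file twice and compare wall-clock times. */
public class ReadTwice {
    static long timedRead(FileSystem fs, Path p) throws Exception {
        long start = System.currentTimeMillis();
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) > 0) { /* discard data, only the timing matters */ }
        in.close();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path foo = new Path("/data/foo");   // illustrative path
        System.out.println("first read:  " + timedRead(fs, foo) + " ms");
        System.out.println("second read: " + timedRead(fs, foo) + " ms");
    }
}
```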
In addition, is the Hadoop DistributedCache used only to distribute application-specific files (and not task-specific input data files) to all TaskTracker nodes? Or is input-file data (for example, a block of the file foo) also cached in the distributed cache? I assume that local.cache.size and related parameters control only the distributed cache.
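My current understanding, sketched with the old mapred API (the paths and job details below are purely illustrative), is that the DistributedCache only ships job-level side files to the TaskTrackers' local disks, and never the input blocks themselves:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        conf.setJobName("cache-example");

        // The input file foo is read as ordinary HDFS splits; its blocks
        // are NOT placed in the distributed cache.
        FileInputFormat.setInputPaths(conf, new Path("/data/foo"));
        FileOutputFormat.setOutputPath(conf, new Path("/out"));

        // A job-level side file (e.g. a small lookup table) IS copied to
        // the local disk of every TaskTracker that runs tasks for this job.
        DistributedCache.addCacheFile(new URI("/apps/lookup.txt#lookup"), conf);

        // local.cache.size (in bytes) only bounds the total size of these
        // local copies; it has no effect on HDFS block reads.
        JobClient.runJob(conf);
    }
}
```

Is that understanding correct?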
Please clarify.