I’m running some experiments to compare the time taken by a MapReduce job to read and process data stored on HDFS under various parameters. I use a Pig script to launch the MapReduce job. Since I often work with the same set of files, my results may be skewed by file and block caching.
I want to understand the various caching mechanisms at play in a MapReduce environment.
Suppose a file foo (containing some data to be processed) stored on HDFS occupies 1 HDFS block and resides on machine STORE. During the map task, machine COMPUTE reads that block over the network and processes it. Caching can occur at two levels:
- Caching in STORE machine's memory (the OS in-memory buffer cache)
- Caching in the memory / disk of the COMPUTE machine
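For what it's worth, the way I am thinking of neutralizing the OS page cache between runs is something like the sketch below (assuming Linux nodes, password-less SSH, and sudo rights; the host names are placeholders, not my actual setup):

```java
import java.io.IOException;

/**
 * Sketch: drop the OS page cache on each node between experiment runs so
 * that repeated reads of the same HDFS blocks are not served from memory.
 */
public class DropPageCache {
    public static void main(String[] args) throws IOException, InterruptedException {
        String[] nodes = {"store-node", "compute-node"};   // hypothetical hosts
        for (String node : nodes) {
            // sync flushes dirty pages; writing 3 to drop_caches evicts the
            // page cache plus dentries/inodes (requires root privileges).
            ProcessBuilder pb = new ProcessBuilder(
                "ssh", node,
                "sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'");
            pb.inheritIO();
            int exit = pb.start().waitFor();
            System.out.println(node + " -> exit code " + exit);
        }
    }
}
```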
I am fairly sure that caching #1 happens. What I want to find out is whether something like #2 happens as well. From the post here, it seems there is no client-level caching in HDFS, since a block cached on COMPUTE is very unlikely to be needed again on the same machine before it would be evicted from the cache.
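To probe #2 empirically, I could time two back-to-back reads of foo from the COMPUTE machine, roughly as in the sketch below (the path is just an example). A much faster second read only shows that *some* cache is warm, so I would combine it with dropping the page cache on one node at a time, as in the earlier snippet, to attribute the effect to STORE or COMPUTE:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: read the same HDFS file twice and compare wall-clock times. */
public class ReadTwice {
    static long timedRead(FileSystem fs, Path p) throws Exception {
        long start = System.currentTimeMillis();
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) > 0) { /* discard data, only the timing matters */ }
        in.close();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path foo = new Path("/data/foo");   // illustrative path
        System.out.println("first read:  " + timedRead(fs, foo) + " ms");
        System.out.println("second read: " + timedRead(fs, foo) + " ms");
    }
}
```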
In addition, is the Hadoop DistributedCache used only to distribute application-specific files (and not task-specific input data files) to all TaskTracker nodes? Or is input-file data (for example, a block of the file foo) also cached in the distributed cache? I assume that local.cache.size and related parameters control only the distributed cache.
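My current understanding, sketched with the old mapred API (the paths and job details below are purely illustrative), is that the DistributedCache only ships job-level side files to the TaskTrackers' local disks, and never the input blocks themselves:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        conf.setJobName("cache-example");

        // The input file foo is read as ordinary HDFS splits; its blocks
        // are NOT placed in the distributed cache.
        FileInputFormat.setInputPaths(conf, new Path("/data/foo"));
        FileOutputFormat.setOutputPath(conf, new Path("/out"));

        // A job-level side file (e.g. a small lookup table) IS copied to
        // the local disk of every TaskTracker that runs tasks for this job.
        DistributedCache.addCacheFile(new URI("/apps/lookup.txt#lookup"), conf);

        // local.cache.size (in bytes) only bounds the total size of these
        // local copies; it has no effect on HDFS block reads.
        JobClient.runJob(conf);
    }
}
```

Is that understanding correct?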
Please clarify.