Lifetime of files in the Hadoop distributed cache

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job completes? If they are deleted, is there a way to keep them cached across several jobs? Does this work the same way on Amazon Elastic MapReduce?
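For context, this is roughly how files are shipped to task nodes with the distributed cache in a streaming job; the file and script names below are illustrative, not from the original question:

```shell
# Ship lookup.txt to every task node via the distributed cache.
# Hadoop symlinks the cached copy into the task's working directory,
# so mapper.py can open it by its bare name. (All paths here are
# hypothetical examples.)
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -files hdfs:///data/lookup.txt \
    -input  /input \
    -output /output \
    -mapper mapper.py \
    -reducer reducer.py
```

The question is about what happens to that node-local copy of `lookup.txt` once the job finishes.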

amazon-web-services elastic-map-reduce hadoop
2 answers

I dug into the source code, and it looks like files are deleted by the TrackerDistributedCacheManager about once a minute, once their reference count drops to zero. TaskRunner explicitly releases all of its files at the end of the task. Maybe you could patch TaskRunner not to do this and manage the cache yourself through more explicit means?


I cross-posted this question on the AWS forum and received a good recommendation to use hadoop fs -get to transfer files in a way that persists across jobs.
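One way to apply that suggestion, sketched under the assumption that the shared file lives at a known HDFS path (all paths and file names here are hypothetical):

```shell
# Instead of relying on the distributed cache (whose node-local copies
# the TaskTracker may clean up between jobs), copy the file to a
# well-known local path once and reuse it across jobs.
LOCAL_DIR=/mnt/shared-cache    # illustrative location
mkdir -p "$LOCAL_DIR"
if [ ! -f "$LOCAL_DIR/lookup.txt" ]; then
    hadoop fs -get hdfs:///data/lookup.txt "$LOCAL_DIR/lookup.txt"
fi
# Subsequent jobs on this node can read $LOCAL_DIR/lookup.txt directly.
```

On Elastic MapReduce, a step or bootstrap script like this runs on each node, so the file survives as long as the node does rather than being tied to a single job's cache.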

