Lifetime of files in the Hadoop distributed cache

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job completes? If they are deleted, is there a way to keep them cached across several jobs? Does this work the same way on Amazon Elastic MapReduce?
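For context, this is roughly how files are shipped to task nodes with the distributed cache in a streaming job; the file and script names below are illustrative, not from the original question:

```shell
# Ship lookup.txt to every task node via the distributed cache.
# Hadoop symlinks the cached copy into the task's working directory,
# so mapper.py can open it by its bare name. (All paths here are
# hypothetical examples.)
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -files hdfs:///data/lookup.txt \
    -input  /input \
    -output /output \
    -mapper mapper.py \
    -reducer reducer.py
```

The question is about what happens to that node-local copy of `lookup.txt` once the job finishes.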

amazon-web-services elastic-map-reduce hadoop
2 answers

I dug into the source code, and it looks like files are deleted by the TrackerDistributedCacheManager about once a minute, once their reference count drops to zero. TaskRunner explicitly releases all of its files at the end of the task. Maybe you could patch TaskRunner not to do this and manage the cache yourself through more explicit means?


I cross-posted this question on the AWS forum and received a good recommendation to use hadoop fs -get to transfer files in a way that persists across jobs.
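One way to apply that suggestion, sketched under the assumption that the shared file lives at a known HDFS path (all paths and file names here are hypothetical):

```shell
# Instead of relying on the distributed cache (whose node-local copies
# the TaskTracker may clean up between jobs), copy the file to a
# well-known local path once and reuse it across jobs.
LOCAL_DIR=/mnt/shared-cache    # illustrative location
mkdir -p "$LOCAL_DIR"
if [ ! -f "$LOCAL_DIR/lookup.txt" ]; then
    hadoop fs -get hdfs:///data/lookup.txt "$LOCAL_DIR/lookup.txt"
fi
# Subsequent jobs on this node can read $LOCAL_DIR/lookup.txt directly.
```

On Elastic MapReduce, a step or bootstrap script like this runs on each node, so the file survives as long as the node does rather than being tied to a single job's cache.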

