Using Distributed Cache with Pig on Amazon Elastic MapReduce

I am trying to run a Pig script that uses a UDF on Amazon Elastic MapReduce, and I need to read some static files from within the UDF.

I am doing something like this in my UDF:

import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) throws IOException {
        ...
        // Read the cached file via its symlink name in the task's working directory.
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }

    // The part after "#" is the local symlink name for the cached file.
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}

I saved the file in my S3 bucket at path/to/myfile.txt.

However, when starting my Pig job, I see an exception:

Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)

So my question is: how do I use distributed cache files when running a Pig script on Amazon EMR?

EDIT: I realized that Pig 0.6, unlike Pig 0.9, does not have a method called getCacheFiles(). Amazon does not yet support Pig 0.9, so I need to figure out another way to get the distributed cache working in 0.6.
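In the meantime, one possible workaround that does not rely on getCacheFiles() is to open the file directly from S3 inside the UDF using the Hadoop FileSystem API. This is only a sketch under my own assumptions: the bucket and key names are placeholders, and every task reads the file from S3 rather than from a local cached copy, which should be acceptable for small files:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    // Loaded once per task and reused across calls to exec().
    private List<String> cachedLines;

    public DataBag exec(Tuple input) throws IOException {
        if (cachedLines == null) {
            // "bucket_name" is a placeholder; s3n:// is the Hadoop-native S3 scheme.
            Path path = new Path("s3n://bucket_name/path/to/myfile.txt");
            FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)));
            cachedLines = new ArrayList<String>();
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    cachedLines.add(line);
                }
            } finally {
                reader.close();
            }
        }
        // ... build and return the result bag using cachedLines ...
        return null;
    }
}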

Answer:

Pass the file to the distributed cache as an extra argument when starting Pig (note that you need the s3n scheme here, not s3):

-cacheFile s3n://bucket_name/file_name#cache_file_name

The file is then available in the task's working directory under the name given after the "#".
