I am trying to run my Pig script (which uses a UDF) on Amazon Elastic MapReduce, and I need to use some static files from within my UDF.
I do something like this in my UDF:
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        ...
        // expects the cached file as a symlink in the task's working directory
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }
    @Override
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}
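As far as I understand it (this is my reading of the docs, not something I have verified), the #myfile.txt fragment is supposed to make Hadoop create a symlink named myfile.txt in each task's working directory, so the equivalent raw Hadoop 0.20 setup would look roughly like this:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the S3 file; the #myfile.txt fragment names the symlink.
        DistributedCache.addCacheFile(
                new URI("s3://path/to/myfile.txt#myfile.txt"), conf);
        // Without this call, Hadoop 0.20 does not create the ./myfile.txt link.
        DistributedCache.createSymlink(conf);
        // ... conf would then go into the JobConf used to submit the job
    }
}

If that is right, the same thing should also be expressible through the mapred.cache.files and mapred.create.symlink job properties.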
I saved the file in my S3 bucket at s3://path/to/myfile.txt.
However, when starting my Pig job, I see an exception:
Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)
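To debug this, I am thinking of dumping the contents of the task's working directory from inside exec() when the exception fires, with a hypothetical helper along these lines (names are mine):

import java.io.File;

public class CwdDebug {
    // Prints the task's working directory and its contents to stderr,
    // so I can see whether the myfile.txt symlink was created at all.
    public static void dumpWorkingDir() {
        File cwd = new File(".");
        System.err.println("cwd = " + cwd.getAbsolutePath());
        File[] files = cwd.listFiles();
        if (files != null) {
            for (File f : files) {
                System.err.println("  " + f.getName());
            }
        }
    }
}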
So my question is: how do I use distributed cache files when running a Pig script on Amazon EMR?
EDIT: I realized that pig-0.6, unlike pig-0.9, does not have a getCacheFiles() method. Amazon does not support pig-0.9 yet, so I need to figure out another way to get the distributed cache working in 0.6.
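The workaround I am currently considering for pig-0.6 is to bypass the distributed cache entirely and read the file straight from S3 inside the UDF with Hadoop's FileSystem API, caching it in a field so each task only fetches it once. A rough sketch (untested on EMR; I am assuming the s3:// scheme is wired up there the same way it is for load paths):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {
    private List<String> lines;  // loaded once per task, then reused

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (lines == null) {
            lines = loadFromS3("s3://path/to/myfile.txt");
        }
        // ... use lines to build and return the DataBag ...
        return null;  // placeholder for the elided logic above
    }

    // Opens the file through Hadoop's FileSystem abstraction, which should
    // resolve the s3:// scheme on EMR, and reads it line by line.
    private static List<String> loadFromS3(String uri) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(uri))));
        List<String> result = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            result.add(line);
        }
        in.close();
        return result;
    }
}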