I am trying to analyze a Wikipedia pageview dataset using Amazon EMR. The dataset contains pageview statistics for a three-month period (January 1, 2011 - March 31, 2011), and I want to find the article with the most views during that time. Here is the code I'm using:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class mostViews {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable views = new IntWritable(1);
        private Text article = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            String[] words = line.split(" ");
            article.set(words[1]);
            views.set(Integer.parseInt(words[2]));
            output.collect(article, views);
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(mostViews.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
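For context, each line of the pageview files looks something like `en Main_Page 42 1234567` (project, article title, view count, bytes transferred), which is why my mapper reads `words[1]` and `words[2]`. Here is a quick stand-alone sketch of just that parsing step, using a made-up sample record (the sample line is hypothetical, not from the actual dump):

```java
// Stand-alone sketch of the mapper's parsing logic, using a made-up sample record.
// Pageview lines have the form: <project> <article title> <view count> <bytes>.
public class ParseSketch {
    public static void main(String[] args) {
        String line = "en Main_Page 42 1234567"; // hypothetical sample record
        String[] words = line.split(" ");
        String article = words[1];               // goes into the Text key in map()
        int views = Integer.parseInt(words[2]);  // goes into the IntWritable value in map()
        System.out.println(article + " -> " + views);
    }
}
```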
The code itself works, but when I create a cluster and add it as a custom JAR step, it sometimes fails and sometimes succeeds. Using the entire dataset as input, it fails; using a single month, such as January, it completes. After running it on the entire dataset, I looked at the "controller" log file and found this, which I think is relevant:
2015-03-10T11:50:12.437Z INFO Synchronously wait child process to complete : hadoop jar /mnt/var/lib/hadoop/steps/s-22ZUAWNM...
2015-03-10T12:05:10.505Z INFO Process still running
2015-03-10T12:20:12.573Z INFO Process still running
2015-03-10T12:35:14.642Z INFO Process still running
2015-03-10T12:50:16.711Z INFO Process still running
2015-03-10T13:05:18.779Z INFO Process still running
2015-03-10T13:20:20.848Z INFO Process still running
2015-03-10T13:35:22.916Z INFO Process still running
2015-03-10T13:50:24.986Z INFO Process still running
2015-03-10T14:05:27.056Z INFO Process still running
2015-03-10T14:20:29.126Z INFO Process still running
2015-03-10T14:35:31.196Z INFO Process still running
2015-03-10T14:50:33.266Z INFO Process still running
2015-03-10T15:05:35.337Z INFO Process still running
2015-03-10T15:11:37.366Z INFO waitProcessCompletion ended with exit code 1 : hadoop jar /mnt/var/lib/hadoop/steps/s-22ZUAWNM...
2015-03-10T15:11:40.064Z INFO Step created jobs: job_1425988140328_0001
2015-03-10T15:11:50.072Z WARN Step failed as jobs it created failed. Ids:job_1425988140328_0001
Can someone tell me what is going wrong and what I can do to fix it? The fact that it works for one month but not for two or three makes me think the dataset may be too large, but I'm not sure. I'm still new to Hadoop and EMR, so if I've left out any information, just let me know. Any help or advice would be greatly appreciated.
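One thing I've been wondering, in case it's relevant: since the full dataset has many more lines, could a single malformed record be killing a task via `ArrayIndexOutOfBoundsException` or `NumberFormatException` in my `map()`? This is just a guess, not a confirmed cause, but here is a minimal stand-alone sketch of the defensive parsing I'm considering (the `SafeParse` class and sample lines are my own invention for illustration):

```java
// Sketch (my guess, not a confirmed fix): skip records that don't have the expected
// fields or a numeric view count, rather than letting ArrayIndexOutOfBoundsException
// or NumberFormatException fail the whole task.
public class SafeParse {
    // Returns the view count, or -1 if the line is malformed and should be skipped.
    static int parseViews(String line) {
        String[] words = line.split(" ");
        if (words.length < 3) {
            return -1; // too few fields: skip this record
        }
        try {
            return Integer.parseInt(words[2]);
        } catch (NumberFormatException e) {
            return -1; // non-numeric count: skip this record
        }
    }

    public static void main(String[] args) {
        System.out.println(parseViews("en Main_Page 42 1234567")); // well-formed line
        System.out.println(parseViews("garbage-line"));            // malformed line
    }
}
```

In the real job, `map()` would simply `return` without calling `output.collect(...)` when the record is malformed, instead of returning a sentinel value.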
Thanks in advance!