Reusing writable objects in mapreduce

Question


I am trying to measure the performance benefit of reusing Writable objects instead of creating new ones in the wordcount MapReduce program. However, the two versions take almost the same time to complete, even on a large input.

I also tried to give the task less heap space by changing

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx120m</value>
</property>

But both versions just ran a little slower than with the larger heap. I could never get the version that reuses Writable objects to perform better than the one that creates new ones. Did I miss something?

The part of wordcount that I modified is

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Allocates a new Text and a new IntWritable for every token.
        context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
}
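For comparison, the reusing variant that the question refers to can be sketched without Hadoop on the classpath. In the stock WordCount example this is done by keeping a `Text` and an `IntWritable` as fields of the mapper and calling `word.set(itr.nextToken())` before `context.write(word, one)`. The sketch below only mirrors that pattern: `Word`, `Count` and `Sink` are hypothetical stand-ins for `Text`, `IntWritable` and `Context`, and it assumes the sink copies the data at write time, which is what makes overwriting the holders safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Stdlib-only sketch. Word, Count and Sink are hypothetical stand-ins for
// Hadoop's Text, IntWritable and Context; they are not Hadoop classes.
class ReuseSketch {
    static final class Word {
        private String value;
        Word(String value) { this.value = value; }
        void set(String value) { this.value = value; }
        String get() { return value; }
    }

    static final class Count {
        private final int value;
        Count(int value) { this.value = value; }
        int get() { return value; }
    }

    // Assumption: the sink copies key and value at write time, so the
    // caller is free to overwrite the holders afterwards.
    static final class Sink {
        final List<String> records = new ArrayList<>();
        void write(Word key, Count value) {
            records.add(key.get() + "\t" + value.get());
        }
    }

    // Holders for the reusing version: created once, overwritten per token.
    private static final Word WORD = new Word("");
    private static final Count ONE = new Count(1);

    // Allocating version, as in the question: two new objects per token.
    static void mapAllocating(String line, Sink out) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.write(new Word(itr.nextToken()), new Count(1));
        }
    }

    // Reusing version: no per-token allocation of holder objects.
    static void mapReusing(String line, Sink out) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            WORD.set(itr.nextToken());
            out.write(WORD, ONE);
        }
    }

    public static void main(String[] args) {
        Sink a = new Sink();
        Sink b = new Sink();
        mapAllocating("to be or not to be", a);
        mapReusing("to be or not to be", b);
        // Both versions should emit identical records.
        System.out.println(a.records.equals(b.records));
    }
}
```

Both methods emit the same records; the only difference is how many short-lived holder objects are created along the way.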


mapreduce hadoop

Chitra Dec 28 '12 at 0:47


2 answers

Thomas jungblut · Answer 1 · 2012-12-28T09:27:34+0000

This does not really matter for two reasons:

1. You are doing IO, which is slow, so creating a few new objects per input line and letting them be garbage collected costs very little by comparison.
2. You most likely use very little memory anyway. Objects you create stay on the heap until a certain memory threshold is exceeded, so the version that creates new objects probably occupies more heap than the one that reuses them. If you now lower the heap size, the garbage collector has to run more often because the threshold is exceeded more frequently. You would see this in the GC logs if you enabled them.
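Besides GC logs, collector activity can also be read from inside a JVM through the standard `java.lang.management` API. The following is a minimal sketch, not something from the original answer: the class name `GcProbe` and the allocation loop are illustrative assumptions, and whether the counters actually move depends on heap size and JIT behaviour.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper that sums GC counters across all collectors of this JVM.
class GcProbe {
    // Total number of collections so far; collectors reporting -1 (undefined) are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds, same skipping rule.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        // Create many short-lived objects, roughly like allocating per token.
        long checksum = 0;
        for (int i = 0; i < 5_000_000; i++) {
            checksum += Integer.toString(i).length();
        }

        System.out.println("checksum: " + checksum);
        System.out.println("collections during loop: " + (totalCollections() - countBefore));
        System.out.println("GC millis during loop: " + (totalCollectionMillis() - millisBefore));
    }
}
```

Running it with different `-Xmx` values gives a rough feel for how a smaller heap makes the collector run more often for the same amount of allocation.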

Another factor may be how you measure the time: a map task involves a lot of RPC communication in the background, so you can never be 100% sure that your measurements are not skewed by network congestion or other environmental effects.

Qiang jin · Answer 2 · 2012-12-28T01:01:35+0000

The question is where the performance bottleneck lies, that is, which affects performance more: reusing the IntWritable or the IO.

Reusing the variable is theoretically better, but according to Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law), the improvement may not even be noticeable.
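To make the Amdahl's law argument concrete, here is a small sketch of the formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of the runtime that is improved and s is how much faster that fraction becomes. The 5% figure in `main` is a purely hypothetical share of the runtime spent on object allocation, not a measurement.

```java
// Sketch of Amdahl's law; the input numbers in main are hypothetical.
class Amdahl {
    // p: fraction of total runtime affected (0..1), s: speedup of that fraction.
    static double speedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // Assume allocation accounts for 5% of the runtime and the rest is IO.
        // Even if reuse made allocation free (s -> infinity), the overall
        // speedup is bounded by 1 / 0.95.
        System.out.println(speedup(0.05, Double.POSITIVE_INFINITY));
        // A more modest improvement: the allocation part becomes twice as fast.
        System.out.println(speedup(0.05, 2.0));
    }
}
```

Under that assumption the whole job could get at most about 5% faster, which is easily hidden by measurement noise.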
