Hadoop: an easy way to use an object as an output value without implementing the Writable interface

I am trying to use Hadoop to train several models. My data is small enough to fit into memory, so I want one model to be trained in each map task.

My problem is that once I have finished training my model, I need to send it to the reducer. I use Weka to train the model. I don't want to start working out how to implement the Writable interface in Weka classes, because that would take a lot of effort. I am looking for an easy way to do this.

The Classifier class in Weka implements the Serializable interface. How can I send this object to the reducer?

Edit:

Here is a link that describes the serialization of Weka objects: http://weka.wikispaces.com/Serialization

Here's what my code looks like. Job setup (only part of the configuration is shown):

       conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization," + "org.apache.hadoop.io.serializer.WritableSerialization"); 
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Classifier.class);

Map function:

     // load the dataset into the data variable
     Classifier tree = new J48();
     tree.buildClassifier(data);
     context.write(new Text("whatever"), tree);

My Map class extends Mapper<Object, Text, Text, Classifier>.

But I get this error:

     java.lang.NullPointerException
         at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:416)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
         at org.apache.hadoop.mapred.Child.main(Child.java:253)

What am I doing wrong?

1 answer

You can define your own serialization engine.

I think it revolves around implementing the Serialization interface and registering your implementation in the io.serializations configuration property.

In your case, if you just want to use java serialization, set this property to:

  • org.apache.hadoop.io.serializer.JavaSerialization
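Before wiring this into the job, it may help to confirm outside Hadoop that the value class really survives plain Java serialization. As far as I know, JavaSerialization relies on the standard java.io object streams, so a round trip like the one below is a reasonable sanity check. This is only a minimal sketch: ToyModel is a hypothetical stand-in for the trained Weka Classifier, and the class and method names (SerializationRoundTrip, toBytes, fromBytes) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationRoundTrip {

    // Hypothetical stand-in for a trained model; a Weka Classifier
    // (which implements Serializable) would take its place.
    static final class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final double[] weights;

        ToyModel(String name, double[] weights) {
            this.name = name;
            this.weights = weights;
        }
    }

    // Serialize any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from the bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel original = new ToyModel("tree", new double[] {0.5, 1.5});
        ToyModel copy = (ToyModel) fromBytes(toBytes(original));
        System.out.println(copy.name + " " + copy.weights.length); // prints "tree 2"
    }
}
```

If the round trip fails with a NotSerializableException, some field inside the model is not serializable, and that would need fixing before Hadoop could ship the object from the mapper to the reducer.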