I would personally recommend Scalding. It is a Scala DSL on top of Cascading that abstracts away the low-level parts of Hadoop MapReduce. It was developed at Twitter and by now seems mature enough that you can start using it without any problems.
Here is an example of how to write WordCount in Scalding:
package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
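For reference, a job like this is typically launched through Scalding's Tool runner on Hadoop; the jar name and paths below are placeholders for your own assembled jar and data:

hadoop jar your-scalding-assembly.jar com.twitter.scalding.Tool \
  com.twitter.scalding.examples.WordCountJob --hdfs \
  --input /path/to/input --output /path/to/output

Passing --local instead of --hdfs runs the same job in local mode, which is handy for testing.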
I think this is a good candidate: since it uses Scala, it is not too far from the usual Java MapReduce programs, and even if you do not know Scala, it is not too difficult to pick up.