Best functional language for MapReduce?

I am working on a course assignment that requires me to implement a parallel MapReduce mechanism in a functional language and then use it to solve some simple problems.

What functional language do you think I should use?

Here are my requirements:

  • It should be relatively easy to learn, since I only have about 2 weeks for this assignment.
  • It should have existing implementations of MapReduce that can be found on the Internet - my course does not prohibit the use of open source code or Internet resources in general.
  • It should be a good fit for this problem and be a reasonably popular language, so that learning resources are easy to find.

I am currently looking at Haskell and Clojure, but both languages are new to me - I have no idea whether either of them is really suitable for this situation.

+6
5 answers

Both Clojure and Haskell are certainly worth exploring for various reasons. If you have a chance, I would try both. I also suggest adding Scala to your list.

If you need to choose one, I would choose Clojure for the following reasons:

  • It's a Lisp - everyone should learn a Lisp. See http://www.paulgraham.com/avg.html
  • It has a unique approach to concurrency - see http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
  • It's a JVM language, which makes it immediately useful from a practical point of view: the library and tool ecosystem on the JVM is extremely good, better than any other platform IMHO. If you want to do serious tech work in the enterprise or startup space, a good knowledge of the JVM is very useful. FWIW, Scala also falls into this category of "interesting JVM languages."

In addition, Clojure makes parallel map operations easy. Here is a one-liner to get you started:

(reduce + (pmap inc (range 1000))) => 500500 

Using pmap instead of map is enough to give you a parallel map operation. There are also parallel reducers if you are using Clojure 1.5 - see the reducers framework for more details.

+7

Cloud Haskell would be the right choice for a distributed engine on which to implement the map/reduce model. However, for a local multicore system, it is simple enough to implement it directly in GHC, using the existing parallelism support in the GHC runtime. Lightweight threads, work-stealing queues, and other useful primitives are provided out of the box.
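
For instance, here is a minimal sketch of a local parallel map-reduce built on parMap from the parallel package (the mapReduce helper and its signature are my own illustrative choices, not a standard API):

 import Control.DeepSeq (NFData)
 import Control.Parallel.Strategies (parMap, rdeepseq)

 -- Map f over the inputs in parallel, forcing each result to normal
 -- form, then combine the results sequentially with a binary operation.
 mapReduce :: NFData b => (a -> b) -> (b -> b -> b) -> b -> [a] -> b
 mapReduce f combine z = foldr combine z . parMap rdeepseq f

 main :: IO ()
 main = print (mapReduce (+ 1) (+) 0 [0 .. 999 :: Int]) -- prints 500500

Compile with ghc -O2 -threaded and run with +RTS -N to spread the work across all available cores.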

If I were implementing a new MapReduce engine, I would use GHC. Types, parallel debugging tools such as ThreadScope, and the optimizing compiler ensure that you can get the required performance out of the code, while the excellent multicore runtime lets you scale well.

+7

Cascalog and Clojure will give you a good enough head start. If you need to set up your own cluster, I recommend pallet-hadoop for deploying a Hadoop cluster, although Cascalog works fine locally for educational purposes.

+1

I personally would recommend Scalding, a Scala abstraction on top of Cascading that hides the low-level parts of Hadoop. It was developed at Twitter and seems mature enough today that you can start using it without any problems.

Here is an example of how you would write WordCount in Scalding:

 package com.twitter.scalding.examples

 import com.twitter.scalding._

 class WordCountJob(args : Args) extends Job(args) {
   TextLine( args("input") )
     .flatMap('line -> 'word) { line : String => tokenize(line) }
     .groupBy('word) { _.size }
     .write( Tsv( args("output") ) )

   // Split a piece of text into individual words.
   def tokenize(text : String) : Array[String] = {
     // Lowercase each word and remove punctuation.
     text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
   }
 }

I think it is a good candidate because, being Scala, it is not too far from the usual Java MapReduce programs, and even if you do not know Scala, it is not too difficult to pick up.

+1
