Cassandra and MapReduce - minimum setup requirements

I need to execute MapReduce jobs against my Cassandra cluster with data locality, i.e. each task should only read rows that live on the local Cassandra node where the task runs.

There are tutorials on how to configure Hadoop for MapReduce against an earlier version of Cassandra (0.7), but I cannot find one for the current version.

What has changed from 0.7 in this regard?

What software modules are needed for minimal configuration (Hadoop + HDFS + ...)?

Do I need DataStax Enterprise?

+4
cassandra mapreduce hadoop
2 answers

Cassandra ships with a few classes that are enough to integrate it with Hadoop:

  • ColumnFamilyInputFormat is the input for the map function. It can read all rows of a single column family when Cassandra uses the random partitioner, or a range of rows when Cassandra uses an order-preserving partitioner. The Cassandra cluster forms a ring in which each ring segment is responsible for a specific key range. The main job of an input format is to divide the map input into chunks that can be processed in parallel - these are called InputSplits. In the Cassandra case this is simple: each key range has a single master node, so the input format creates one InputSplit per ring segment, and each split results in one map task. Now we want the map task to run on the same host where its data is stored. Each InputSplit remembers the IP address of its ring segment - that is the IP of the Cassandra node responsible for that key range. The JobTracker creates map tasks from the InputSplits and assigns them to TaskTrackers for execution; it tries to find a TaskTracker with the same IP address as the InputSplit. So essentially we need to run a TaskTracker on every Cassandra host, and this gives us data locality. (A skeleton mapper and reducer for these two formats is sketched right after this list.)
  • ColumnFamilyOutputFormat - configures the context for the reduce function so that its results can be stored back in Cassandra.
  • The output of all map tasks has to be combined before it is passed to the reduce function - this is called the shuffle. It uses the local file system; from Cassandra's point of view nothing is needed here, we only have to configure a path to a local temp directory. There is also no need to replace this mechanism with something else (for example, storage in Cassandra): this data does not need to be replicated, since map tasks are idempotent and can simply be re-run.
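
To make the data types concrete, here is a rough sketch of a mapper and reducer wired to those two formats. It is modeled on the word_count example that ships with Cassandra; the exact class names and signatures (IColumn, Mutation, etc.) vary between Cassandra versions, so treat it as an illustration rather than a drop-in implementation.

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // ColumnFamilyInputFormat hands each mapper one row at a time:
    // the row key plus the columns selected by the configured slice predicate.
    class RowCountMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws java.io.IOException, InterruptedException {
            // A real job would inspect the columns here; this one just counts rows.
            context.write(new Text("rows"), ONE);
        }
    }

    // ColumnFamilyOutputFormat expects (row key, list of mutations) and applies
    // the mutations to the configured output column family.
    class SumReducer extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();

            Column column = new Column();
            column.setName(ByteBufferUtil.bytes("count"));
            column.setValue(ByteBufferUtil.bytes(Integer.toString(sum)));
            column.setTimestamp(System.currentTimeMillis());

            ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
            cosc.setColumn(column);
            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(cosc);

            List<Mutation> mutations = new ArrayList<Mutation>();
            mutations.add(mutation);
            // The reduce key becomes the row key of the output row in Cassandra.
            context.write(ByteBufferUtil.bytes(key.toString()), mutations);
        }
    }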

Basically, the provided Hadoop integration makes it possible to run map tasks on the hosts where the data is stored, and the reduce function can store its results back in Cassandra - which is all I need.

There are two options for running Map-Reduce:

  • org.apache.hadoop.mapreduce.Job - this class simulates Hadoop in a single process. It executes the Map-Reduce job and does not require any additional services or dependencies; it only needs access to a temp directory to store the map output for the shuffle. Basically we have to call a few setters on the Job class to provide things like the classes for the map task, the reduce task, the input format and the Cassandra connection, and then call job.waitForCompletion(true) - it starts the Map-Reduce job and waits for the result. This solution is a quick way into the Hadoop world and is good for testing. It will not scale (single process) and it fetches data over the network, but still - it is fine to start with. A minimal driver for this option is sketched after this list.
  • A real Hadoop cluster - I have not set one up yet, but as I understand it the Map-Reduce jobs from the previous option will work there unchanged. We additionally need HDFS, which is used to distribute the jars containing the Map-Reduce classes across the Hadoop cluster.
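
For the first, single-process option, a minimal driver could look roughly like the sketch below. The keyspace, column family names and addresses are placeholders, and the ConfigHelper setter names follow the word_count example bundled with Cassandra 1.x; older releases use slightly different names, so check the example shipped with your version. Running the same classes on a real cluster mainly changes how the job is submitted (hadoop jar ...) and requires the jar to be distributed, e.g. via HDFS.

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraMapReduceDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "cassandra-mr-example");
            job.setJarByClass(CassandraMapReduceDriver.class);

            // Wire in the mapper/reducer sketched above and the Cassandra formats.
            job.setMapperClass(RowCountMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(ByteBuffer.class);
            job.setOutputValueClass(List.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

            Configuration conf = job.getConfiguration();

            // Where to read from: keyspace/CF plus one node used to discover the ring.
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyInputCF");
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");

            // Which columns each mapper receives per row (here: all of them).
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBufferUtil.bytes(""), ByteBufferUtil.bytes(""), false, Integer.MAX_VALUE));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            // Where the reduce output goes.
            ConfigHelper.setOutputColumnFamily(conf, "MyKeyspace", "MyOutputCF");
            ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setOutputRpcPort(conf, "9160");
            ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");

            // Runs map, shuffle (local temp dir) and reduce in this process and blocks until done.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }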
+13

Yes, I searched for the same thing; it seems DataStax Enterprise has simplified Hadoop integration. Read this: http://wiki.apache.org/cassandra/HadoopSupport

+3
