You have a few other options too; a little googling would turn this up, but here it is for you.
Here is the text, pasted from http://blog.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/:
In addition to traditional JUnit and MRUnit, you have the following options:
Local job runner testing - running MR jobs on a single machine in a single JVM
Traditional unit tests and MRUnit should do a decent job of catching bugs early, but neither tests your MR jobs against Hadoop itself. The local job runner lets you run Hadoop on a local machine, in a single JVM, making MR jobs a little easier to debug when a job fails.
To enable the local job runner, set "mapred.job.tracker" to "local" and "fs.default.name" to "file:///some/local/path" (these are the default values).
Remember that you don't need to run any Hadoop daemons when using the local job runner; running bin/hadoop starts a JVM and runs your job for you. It probably makes sense to create a new hadoop-local.xml file (or mapred-local.xml and hdfs-local.xml if you're on 0.20). You can then use the --config option to tell bin/hadoop which configuration directory to use. If you'd rather avoid configuration files, you can create a class that implements Tool and uses ToolRunner, and then run that class with bin/hadoop jar foo.jar com.example.Bar -D mapred.job.tracker=local -D fs.default.name=file:/// (args), where Bar is the Tool implementation.
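For illustration, here is a minimal sketch of what such a Tool implementation could look like, using the old org.apache.hadoop.mapred API of the 0.18/0.20 era this article targets. The class name Bar matches the example invocation above; the job itself (an identity map/identity reduce pass over text input) is a hypothetical placeholder, not something from the original article:

    // Hypothetical Tool-based driver (sketch). The default identity mapper and
    // reducer are used, which is enough to exercise the local job runner end to end.
    package com.example;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Bar extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            // getConf() already contains any -D overrides parsed by ToolRunner,
            // e.g. -D mapred.job.tracker=local -D fs.default.name=file:///
            JobConf job = new JobConf(getConf(), Bar.class);
            job.setJobName("bar");

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // With TextInputFormat and the identity mapper/reducer, the output
            // is (byte offset, line) pairs.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            JobClient.runJob(job);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Bar(), args));
        }
    }

Because ToolRunner's GenericOptionsParser handles the -D and -conf flags, the same class can be pointed at the local job runner, a pseudo-distributed cluster, or a real cluster without code changes.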
To start using the local job runner to test your MR jobs in Hadoop, create a new configuration directory with the local job runner enabled and invoke your job as usual, remembering to include the --config option pointing to the directory containing your local configuration files.
The -conf option also works in 0.18.3 and lets you specify your hadoop-local.xml file directly instead of specifying a directory with --config; Hadoop will happily run with it. The difficulty with this form of testing is verifying that the job did the right thing. Note: you will need to make sure that the input files are set up correctly and that the output directories do not exist before the job runs.
Assuming you've managed to configure the local job runner and run a job, you need to verify that the job completed correctly. Relying on the exit code alone is not good enough. At the very least, you will want to verify that the job's output is correct. You may also want to scan the output of bin/hadoop for exceptions. You should create a script or unit test that sets up the preconditions, runs the job, diffs the actual output against the expected output, and scans for raised exceptions. This script or unit test can then exit with the appropriate status and print specific messages explaining how the job failed.
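As a rough sketch of what such a unit test could look like, here is a JUnit 4 test that reuses the hypothetical Bar driver sketched above; the input data, paths, and expected-output file are made-up examples, not anything from the original article:

    // Sketch of a local-job-runner test: set up preconditions, run the job,
    // and diff the actual output against an expected file.
    package com.example;

    import static org.junit.Assert.assertEquals;

    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.util.ToolRunner;
    import org.junit.Before;
    import org.junit.Test;

    public class BarLocalRunnerTest {

        private static final String INPUT = "target/test-input";
        private static final String OUTPUT = "target/test-output";

        @Before
        public void setUp() throws Exception {
            // Preconditions: the input exists and the output directory does not.
            FileUtil.fullyDelete(new File(OUTPUT));
            new File(INPUT).mkdirs();
            Files.write(Paths.get(INPUT, "input.txt"), "hello\tworld\n".getBytes("UTF-8"));
        }

        @Test
        public void runsInLocalJobRunner() throws Exception {
            String[] args = {
                "-D", "mapred.job.tracker=local",
                "-D", "fs.default.name=file:///",
                INPUT, OUTPUT
            };
            // Any exception raised by the job propagates up and fails the test.
            int exitCode = ToolRunner.run(new Bar(), args);
            assertEquals(0, exitCode);

            // Diff the actual output against the expected result.
            List<String> actual = Files.readAllLines(Paths.get(OUTPUT, "part-00000"));
            List<String> expected = Files.readAllLines(Paths.get("src/test/resources/expected-output.txt"));
            assertEquals(expected, actual);
        }
    }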
Please note that the local job runner has a couple of limitations: only one reducer is supported, and the DistributedCache does not work (a fix is in progress).
Pseudo-distributed testing - running MR jobs on a single machine with daemons
The local job runner runs your job in a single thread. Running an MR job in a single thread is useful for debugging, but it does not simulate a real cluster, where several Hadoop daemons are running (e.g. NameNode, DataNode, TaskTracker, JobTracker, SecondaryNameNode). A pseudo-distributed cluster is a single machine running all of the Hadoop daemons. It is still relatively easy to manage (although harder than the local job runner) and tests integration with Hadoop better than the local job runner does.
To start using a pseudo-distributed cluster to test your MR jobs in Hadoop, follow the advice above for the local job runner, but as part of your precondition setup, configure and start all of the Hadoop daemons. Then, to kick off your job, just use bin/hadoop as usual.
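For example, reusing the hypothetical Bar driver from the local job runner sketch above, the only thing that changes is the pair of configuration values, which now point at the running daemons; the host and port values below are common defaults of that era and are assumptions that may not match your setup:

    // Sketch: pointing the hypothetical Bar driver at a pseudo-distributed cluster.
    package com.example;

    import org.apache.hadoop.util.ToolRunner;

    public class RunBarPseudoDistributed {
        public static void main(String[] args) throws Exception {
            String[] jobArgs = {
                "-D", "fs.default.name=hdfs://localhost:8020",   // NameNode (assumed port)
                "-D", "mapred.job.tracker=localhost:8021",       // JobTracker (assumed port)
                "/user/you/test-input",                          // hypothetical HDFS paths
                "/user/you/test-output"
            };
            System.exit(ToolRunner.run(new Bar(), jobArgs));
        }
    }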
Full integration testing - running MR jobs on a QA cluster
Probably the most thorough, but most cumbersome, way to test your MR jobs is to run them on a QA cluster of at least a few machines. By running your MR jobs on a QA cluster, you test all aspects of your job as well as its integration with Hadoop.
Running your jobs on a QA cluster shares many of the same issues as the local job runner: namely, you need to check the job's output for correctness. You may also want to scan the stdout and stderr produced by each task attempt, which requires collecting those logs in a central location and grepping them. Scribe is a useful tool for log collection, although it may be overkill depending on your QA cluster.
We find that most of our customers have some kind of QA or development cluster where they can deploy and test new jobs, try out newer versions of Hadoop, and practice upgrading clusters from one version of Hadoop to another. If Hadoop is a major part of your production pipeline, then creating a QA or development cluster makes a lot of sense, and repeatedly running jobs on it ensures that changes to your jobs continue to be tested thoroughly. EC2 can be a good host for your QA cluster, since you can bring it up and tear it down on demand. Take a look at our beta EC2 EBS Hadoop scripts if you are interested in creating a QA cluster in EC2.
The QA practices you choose should depend on how important QA is to your organization and how many resources you have. Simply using a traditional unit-testing framework, MRUnit, and the local job runner can test your MR jobs thoroughly without using many resources. However, running your jobs on a QA or development cluster is naturally the best way to fully test them, at the expense of the cost and operational overhead of a Hadoop cluster.