What does "implementing an advanced job control framework that helps chain multiple Map-Reduce jobs" mean?

I am new to Hadoop and have been assigned the following project:

"Implement an advanced job control framework that helps chain multiple Map-Reduce jobs, i.e. explore/improve the existing org.apache.hadoop.mapred.jobcontrol package."

This project is listed on the Project Suggestions page, in the Random Ideas section, at http://wiki.apache.org/hadoop/ProjectSuggestions#research_projects

My confusion is whether I need to create a preliminary version of Oozie (which, I think, is a job control framework for chaining multiple jobs) or something similar to it, or whether the project means something completely different.

What am I missing?

1 answer

It looks like the project you are referring to may be associated with this Jira ticket.

Currently, the JobControl class is pretty bare, and it lacks a few features that would make life easier for the user. For example:

  • Notifications when a job changes state: right now you call JobControl.run and that's it, but in practice it would be useful to get a notification when something changes in one of your jobs.
  • Retrying failed jobs: you could implement a mechanism that resubmits a job if it fails. For example, the ControlledJob class could have a maximum number of retries, and the job would be retried up to that limit before a failure notification is sent.
  • Recurring jobs: many jobs run on a regular basis (weekly, daily, hourly, ...). This is usually done through crontab, so it would be interesting to build it into Hadoop. For example, users could define a recurring job by specifying a period, and JobControl would run it at those regular intervals.
  • Perhaps a user interface to visualize your workflow and the dependencies of each job, showing which steps have already completed and which have not.
  • Support for more than Map/Reduce jobs: it would be interesting to also run Hive or Pig jobs, for example, so that users get a common interface through which they can submit any job and easily track it.
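
To make the first two ideas concrete, here is a minimal sketch in plain Java of retry-with-notification. It deliberately does not use any Hadoop classes so that it runs on its own. The names RetryingJobSketch, RetryingJob, JobListener, onStateChange and the state strings are all hypothetical and made up for illustration; nothing like them exists in JobControl or ControlledJob today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class RetryingJobSketch {

    /** Hypothetical callback, invoked every time the job changes state. */
    public interface JobListener {
        void onStateChange(String jobName, String state, int attempt);
    }

    /** Hypothetical wrapper that runs a job body and retries it on failure. */
    public static final class RetryingJob {
        private final String name;
        private final Callable<Boolean> body; // returns true when the job succeeds
        private final int maxRetries;         // extra attempts after the first one
        private final List<JobListener> listeners = new ArrayList<>();

        public RetryingJob(String name, Callable<Boolean> body, int maxRetries) {
            if (maxRetries < 0) {
                throw new IllegalArgumentException("maxRetries must be >= 0");
            }
            this.name = name;
            this.body = body;
            this.maxRetries = maxRetries;
        }

        public void addListener(JobListener listener) {
            listeners.add(listener);
        }

        /** Runs the job; returns true if any attempt succeeded. */
        public boolean run() {
            int maxAttempts = maxRetries + 1;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                notifyListeners("RUNNING", attempt);
                boolean ok;
                try {
                    ok = body.call();
                } catch (Exception e) {
                    ok = false; // an exception counts as a failed attempt
                }
                if (ok) {
                    notifyListeners("SUCCEEDED", attempt);
                    return true;
                }
                if (attempt < maxAttempts) {
                    notifyListeners("RETRYING", attempt);
                }
            }
            // Only report failure once every allowed attempt has been used.
            notifyListeners("FAILED", maxAttempts);
            return false;
        }

        private void notifyListeners(String state, int attempt) {
            for (JobListener listener : listeners) {
                listener.onStateChange(name, state, attempt);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in job body that fails twice and succeeds on the third call.
        RetryingJob job = new RetryingJob("wordcount", () -> ++calls[0] >= 3, 3);
        job.addListener((n, s, a) -> System.out.println(n + " attempt " + a + ": " + s));
        System.out.println("succeeded = " + job.run());
    }
}
```

In a real extension the job body would presumably wrap the actual job submission (for instance a call such as Job.waitForCompletion), and the retry limit and listener list would live on ControlledJob itself. That wiring is an assumption about one possible design, not something the package offers.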

In the end, I don't think you need to invent a completely new framework; the JobControl class already provides a good starting point. Try to think from a user's perspective about what would simplify and streamline submitting and managing jobs. The ideas here and on the ticket are just examples, and you can come up with your own.

As for Oozie, it provides a higher-level abstraction for controlling the flow of jobs, but it is also more complex to configure and should be reserved for more complex workflows. I know that some people hesitate to use Oozie because it adds extra overhead to their applications. Another big difference is that Oozie is a server, which is an additional cost, whereas JobControl runs only on the client machine. Although some of the features mentioned above are present in Oozie in one form or another, the key to your project, in my opinion, is keeping things simple and running on the client machine without an additional service such as Oozie.
