Configure a map-side join for multiple mappers in Hadoop Map/Reduce

I have a question about setting up a map-side join for several mappers in Hadoop. Suppose I have two very large datasets A and B, and I use the same partitioning and sorting algorithm to split them into smaller parts. For A the parts are a(1) - a(10), and for B they are b(1) - b(10). I am sure that a(1) and b(1) contain the same keys, a(2) and b(2) contain the same keys, and so on. I would like to set up 10 mappers, namely mapper(1) to mapper(10). As I understand it, a map-side join is a preprocessing task performed before the mapper, so I would like to join a(1) and b(1) for mapper(1), join a(2) and b(2) for mapper(2), and so on.

After reading some reference material, it is still not clear to me how to set up these ten mappers. I understand that with CompositeInputFormat I could join two files, but that seems to configure only one mapper and join the 20 files pair after pair (in 10 sequential tasks). How do I configure all ten mappers and join the ten pairs simultaneously, in true Map/Reduce fashion (10 tasks in parallel)? As I understand it, ten mappers would require ten CompositeInputFormat expressions, because the files to join are all different. I firmly believe this is practical and doable, but I cannot figure out which exact commands I should use.

Any hints and suggestions are most welcome.

Shi


Thanks so much for the answers, David and Thomas!

I appreciate your pointing out the prerequisites of a map-side join. Yes, I know about the sorting, the API, etc. After reading your comments, I think my actual problem is: what is the correct expression for joining multiple partitions of two files in CompositeInputFormat? For example, I have dataA and dataB sorted and partitioned into 2 files each:

/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001

Now I use the following join expression:

inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, "/A/dataA-r-00000"), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, "/B/dataB-r-00000"))

It works, but, as you mentioned, it only launches two mappers (because the inner join prevents splitting), which can be very inefficient if the files are large. If I want to use more mappers (say, 2 more mappers joining dataA-r-00001 and dataB-r-00001), how should I build the expression? Is it something like:

String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/A/dataA-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/B/dataB-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/A/dataA-r-00001'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/B/dataB-r-00001'))";

But I think this may be wrong, because the expression above actually performs an inner join over all four files (which would yield nothing in my case, since the *r-00000 and *r-00001 files have non-overlapping keys).

Or could I just use the two directories as inputs, for example:

String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/A/'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, '/B/'))";

Will the inner join then automatically match the pairs by file suffix, e.g. "00000" with "00000" and "00001" with "00001"? I am stuck at this point because I need to build the expression and pass it to:

conf.set ("mapred.join.expr", joinexpression);

In short: how do I build the correct expression if I want to use more mappers to join multiple pairs of files at once?
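For what it's worth, here is a minimal sketch of how the expression can be built programmatically instead of by string concatenation (assuming the old org.apache.hadoop.mapred API; CompositeInputFormat.compose() emits the same "inner(tbl(...), tbl(...))" syntax). Whether handing it the two directories pairs the part files up the way I want is exactly my open question:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    JobConf conf = new JobConf();
    // compose() builds the "inner(tbl(...), tbl(...))" string for us;
    // here the two sorted directories are given as the join sources.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/A"), new Path("/B")));
    conf.setInputFormat(CompositeInputFormat.class);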

2 answers

There are map-side and reduce-side joins. You are suggesting a map-side join, which is performed inside the mapper, not before it. Both sides must have the same key and value types, so you cannot join a LongWritable and a Text, even if they hold the same value.

A few things are worth noting:

  • The input files must be sorted, so they will most likely be reducer output.
  • You can control the number of mappers in the map-join phase by setting the number of reducers in the jobs that sorted the datasets (a sketch of such a job follows this list).
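A minimal sketch of such a preparation job (assuming tab-separated text data and the old mapred API; the input path /rawA is made up for illustration). The identity mapper and reducer leave records untouched, so the job only partitions and sorts:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SortDataset {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortDataset.class);
        conf.setJobName("sort-A"); // run the same job again for dataset B
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        // Both sort jobs must use the same reducer count so that the
        // partitions (and thus the part files) line up pairwise.
        conf.setNumReduceTasks(2);
        FileInputFormat.setInputPaths(conf, new Path("/rawA"));
        FileOutputFormat.setOutputPath(conf, new Path("/A"));
        JobClient.runJob(conf);
      }
    }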

The whole procedure basically works as follows: you have dataset A and dataset B, both sharing the same key type, say LongWritable.

  • Run two jobs that sort the two datasets by their keys; both jobs must set the number of reducers to the same value, say 2.
  • This produces 2 sorted files for each dataset.
  • Now set up the job that joins the datasets; it will be launched with 2 mappers. It could be more if you had set the reducer count higher in the previous jobs (see the join-job sketch after this list).
  • Do whatever you like in the reduce phase.
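And a sketch of the join job itself (again the old mapred API; the class names and output path are made up for illustration). CompositeInputFormat hands the mapper one TupleWritable per key, holding the matching record from each source:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;
    import org.apache.hadoop.mapred.join.TupleWritable;

    public class MapSideJoin {
      public static class JoinMapper extends MapReduceBase
          implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text key, TupleWritable value,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          // value.get(0) is the record from /A, value.get(1) the one from /B
          out.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoin.class);
        conf.setJobName("map-side-join");
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/A"), new Path("/B")));
        conf.setMapperClass(JoinMapper.class);
        conf.setNumReduceTasks(0); // map-only here; add a reduce phase as needed
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path("/joined"));
        JobClient.runJob(conf);
      }
    }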

If the numbers of files to join are not equal, this will lead to an exception during job setup.

Setting up the join is rather painful, mainly because you have to use the old API for the mapper and reducer if your Hadoop version is below 0.21.x.

This document describes very well how it works; scroll to the bottom. Unfortunately this documentation is missing from the latest Hadoop docs.

Another good reference is "Hadoop: The Definitive Guide", which explains all of this in more detail and with examples.


I think you are missing the point. You do not control the number of mappers; it is the number of reducers that you control. Just emit the correct keys from your mapper, then run 10 reducers.
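For illustration, a minimal reduce-side join along those lines (a sketch, assuming Text keys and values and the old mapred API; all class names are hypothetical). Each mapper tags its output with the source dataset; the reducer receives all values for a key together and emits the combinations:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ReduceSideJoin {
      // Tagging mapper for dataset A; a twin that emits "B\t" is attached to
      // the second input, e.g. via org.apache.hadoop.mapred.lib.MultipleInputs.
      public static class TagAMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value, OutputCollector<Text, Text> out,
            Reporter reporter) throws IOException {
          out.collect(key, new Text("A\t" + value));
        }
      }

      public static class JoinReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          List<String> a = new ArrayList<String>();
          List<String> b = new ArrayList<String>();
          while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("A\t")) a.add(v.substring(2));
            else b.add(v.substring(2));
          }
          // cross product of the two sides for this key
          for (String av : a)
            for (String bv : b)
              out.collect(key, new Text(av + "\t" + bv));
        }
      }
      // in the driver: conf.setNumReduceTasks(10); // "then run 10 reducers"
    }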

