I have a question about setting up a map-side join across several mappers in Hadoop. Suppose I have two very large datasets A and B, and I use the same partitioning and sorting algorithm to break both into smaller parts: for A, a(1) through a(10), and for B, b(1) through b(10). I am sure that a(1) and b(1) contain the same keys, a(2) and b(2) contain the same keys, and so on. I would like to set up 10 mappers, mapper(1) through mapper(10). As I understand it, a map-side join is a preprocessing step that runs before the mapper, so I would like to join a(1) and b(1) for mapper(1), join a(2) and b(2) for mapper(2), and so on.
After reading some reference material, it is still not clear to me how to set up these ten mappers. I understand that with CompositeInputFormat I could join two files, but that seems to configure only one mapper and join the 20 files pair after pair (in 10 sequential tasks). How do I configure all ten mappers and join the ten pairs simultaneously, in genuine map/reduce fashion (10 tasks in parallel)? As I understand it, ten mappers would require ten CompositeInputFormat configurations, because the files to join are all different. I firmly believe this is practical and doable, but I could not figure out which commands I should use.
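To make the intended per-mapper work concrete, here is a plain-Java sketch (no Hadoop involved; the class, method names, and sample data are all mine, purely for illustration) of the inner join each mapper(i) would effectively perform on its pair a(i)/b(i):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: the record-level inner join that one mapper(i)
// should see over its pair a(i)/b(i) of identically partitioned,
// sorted inputs.
public class PerPartitionJoin {

    // Inner-join two sorted key->value maps; keys present in both
    // sides produce one joined record holding both values.
    static Map<String, String[]> innerJoin(SortedMap<String, String> a,
                                           SortedMap<String, String> b) {
        Map<String, String[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            String bVal = b.get(e.getKey());
            if (bVal != null) {
                out.put(e.getKey(), new String[] { e.getValue(), bVal });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stand-ins for partition a(1) and b(1), which share keys.
        SortedMap<String, String> a1 = new TreeMap<>();
        a1.put("k1", "a-val1");
        a1.put("k2", "a-val2");
        SortedMap<String, String> b1 = new TreeMap<>();
        b1.put("k1", "b-val1");
        b1.put("k2", "b-val2");

        Map<String, String[]> joined = innerJoin(a1, b1);
        System.out.println(joined.size() + " joined records");
    }
}
```

The point of the question is how to get Hadoop to run ten of these joins in parallel, one per partition pair, rather than how to do the join itself.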
Any hints and suggestions are welcome.
Shi
Thanks so much for the answers, David and Thomas!
I appreciate your pointers on the prerequisites for a map-side join. Yes, I know about the sorting, the API, etc. After reading your comments, I think my actual problem is: what is the correct expression for joining multiple partitions of two files with CompositeInputFormat? For example, I have dataA and dataB, each sorted and reduced into 2 files:
/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001
Now I use this join expression:
inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/A/dataA-r-00000"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/B/dataB-r-00000"))
It works, but, as you mentioned, it only launches two mappers (because the inner join prevents splitting), and it can be very inefficient if the files are large. If I want to use more mappers (say, 2 more mappers to join dataA-r-00001 and dataB-r-00001), how should I build the expression? Is it something like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00001'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00001'))";
But I suspect this is wrong, because the expression above actually performs an inner join of all four files (which would produce nothing in my case, since the *r-00000 and *r-00001 files have non-overlapping keys).
Or I could just use the two directories as inputs, for example:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/'))";
Will the inner join automatically match the pairs according to the file name suffix, e.g. "00000" with "00000" and "00001" with "00001"? I am stuck at this point because I need to build an expression and pass it to:
conf.set("mapred.join.expr", joinexpression);
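To compare the two candidate expressions side by side, here is a small sketch that builds them as plain strings. The tbl() and inner() methods here are my own helpers that merely mimic the expression syntax, not Hadoop API (only the input-format class name is the real one):

```java
// Sketch: the two candidate mapred.join.expr strings from above,
// built programmatically so the syntax is easy to compare.
public class JoinExpr {

    static final String FMT =
        "org.apache.hadoop.mapred.KeyValueTextInputFormat";

    // Wrap one path as a tbl(...) source in the join expression.
    static String tbl(String path) {
        return "tbl(" + FMT + ",\"" + path + "\")";
    }

    // Combine sources under an inner(...) join.
    static String inner(String... tables) {
        return "inner(" + String.join(",", tables) + ")";
    }

    public static void main(String[] args) {
        // Candidate 1: one pair of part files -- joins only that pair.
        String pair0 = inner(tbl("/A/dataA-r-00000"),
                             tbl("/B/dataB-r-00000"));

        // Candidate 2: whole directories -- whether the part files get
        // paired by suffix is exactly the open question.
        String dirs = inner(tbl("/A/"), tbl("/B/"));

        System.out.println(pair0);
        System.out.println(dirs);
        // Either string would then go into:
        // conf.set("mapred.join.expr", ...);
    }
}
```

(For what it's worth, Hadoop's CompositeInputFormat also ships static compose() helpers that generate expressions of this form from an operation name, an input format class, and paths, rather than concatenating the string by hand.)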
So, in short: how do I build the correct expression if I want to use more mappers to join multiple pairs of files at once?