How to implement a self-join / cross product with Hadoop?

A common task is to evaluate pairs of records. Examples: deduplication, collaborative filtering, finding similar items, etc. This is basically a self-join or cross product over a single data source.


To do a self-join, you can follow the reduce-side join pattern: the mapper emits the join/foreign key as the key and the record as the value.

So, say we wanted to do a self-join on "city" (the middle column) over the following data:

don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2

The mapper would emit these key -> value pairs:

(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)

In the reducer, we get the following:

(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])

From here you can implement whatever inner-join logic you want. To do that, you simply pair every item with every other item: load the values into an ArrayList, then run an N x N loop over the elements to compare them with each other.
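
To make that concrete, here is a minimal sketch of such a mapper and reducer. This is my own illustration, not code from the answer: it assumes the three-field name,city,value records from the example above, and the class names and output format are placeholders.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CitySelfJoin {

    // Emits city -> "name,value" for each input line like "don,baltimore,12".
    public static class CityMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            context.write(new Text(fields[1]), new Text(fields[0] + "," + fields[2]));
        }
    }

    // Buffers every record for a city, then runs the N x N loop to pair them up.
    public static class CityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> records = new ArrayList<String>();
            for (Text v : values) {
                records.add(v.toString()); // careful: this buffering is the heap risk noted below
            }
            for (int i = 0; i < records.size(); i++) {
                for (int j = 0; j < records.size(); j++) {
                    if (i != j) {
                        // Your inner-join logic goes here; this just emits each pairing.
                        context.write(key, new Text(records.get(i) + " | " + records.get(j)));
                    }
                }
            }
        }
    }
}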

Understand that reduce-side joins are expensive. They shuffle almost all of the data to the reducers if you are not filtering anything out. Also, be careful about loading data into memory in the reducers - you can blow out the heap on a hot key by buffering all of its values in an ArrayList.


The above is only slightly different from the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values can come from two or more data sets. You can use MultipleInputs so that different mappers parse the different input sets, and the reducer then collects data from both.
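
For example, a two-data-set reduce-side join wired up with MultipleInputs could look roughly like the sketch below. The record layouts (users as "id,name", orders as "userId,amount"), the "U:"/"O:" tag prefixes, and all class names are my own assumptions for illustration only.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Parses user lines "id,name" and emits id -> "U:name".
    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);
            ctx.write(new Text(f[0]), new Text("U:" + f[1]));
        }
    }

    // Parses order lines "userId,amount" and emits userId -> "O:amount".
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);
            ctx.write(new Text(f[0]), new Text("O:" + f[1]));
        }
    }

    // Separates the tagged values back into their two data sets and pairs them up.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> users = new ArrayList<String>();
            List<String> orders = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("U:")) { users.add(s.substring(2)); }
                else { orders.add(s.substring(2)); }
            }
            for (String u : users) {
                for (String o : orders) {
                    ctx.write(key, new Text(u + "\t" + o));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        // Each data set gets its own mapper; both emit the foreign key plus a tagged record.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}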


A cross product, in the case where you do not have any constraints at all, is a nightmare. I.e.

 select * from tablea, tableb; 

There are a few ways to do it, but none of them are particularly efficient. If you want this type of behavior, leave me a comment and I will spend more time explaining how to do it.

If you can find some sort of join key that is a fundamental key to similarity, you are much better off.


Plug for my book: MapReduce Design Patterns. It should be out in a few months, but if you are really interested, I can send you the chapter on joins.


A reducer is typically used to carry out whatever logic the join requires. The trick is to map the data set twice, possibly adding a tag to each value indicating which copy it came from. Then a self-join is no different from any other kind of join.
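
A minimal sketch of that idea might look like the following, reusing the name,city,value records from the first answer. The "L"/"R" tags and the class names are placeholders of my own, and interpreting "map the data set twice" as the mapper emitting each record once per side is an assumption.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TaggedSelfJoin {

    // Emits each record twice, once per "side" of the join, tagged accordingly.
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split(",");
            ctx.write(new Text(f[1]), new Text("L:" + f[0] + "," + f[2]));
            ctx.write(new Text(f[1]), new Text("R:" + f[0] + "," + f[2]));
        }
    }

    // Splits the tagged values into two "sides", then joins them like any other join.
    public static class TaggedJoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<String>();
            List<String> right = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("L:")) { left.add(s.substring(2)); }
                else { right.add(s.substring(2)); }
            }
            for (String l : left) {
                for (String r : right) {
                    ctx.write(key, new Text(l + " | " + r));
                }
            }
        }
    }
}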

