I just watched the Batch Data Processing with App Engine session from Google I/O 2010, read some parts of the MapReduce article from Google Research, and now I'm thinking of using MapReduce on Google App Engine to implement a recommender system in Python.
I prefer using appengine-mapreduce over the Task Queue API directly, because the former offers easy iteration over all instances of a kind, automatic batching, automatic task chaining, etc. The problem is that my recommender system needs to calculate the correlation between instances of two different Models, i.e. instances of two distinct kinds.
Example: I have two Models: User and Item. Each one has a list of tags as an attribute. Below are the functions that calculate the correlation between users and items. Note that calculateCorrelation should be called for every combination of users and items:
def calculateCorrelation(user, item):
    # Correlate a user and an item through their tag lists.
    return calculateCorrelationAverage(user.tags, item.tags)

def calculateCorrelationAverage(tags1, tags2):
    # Sum the correlation of every tag pair, then normalize.
    correlationSum = 0.0
    for (tag1, tag2) in allCombinations(tags1, tags2):
        correlationSum += correlation(tag1, tag2)
    return correlationSum / (len(tags1) + len(tags2))

def allCombinations(list1, list2):
    # Every (x, y) pair with x from list1 and y from list2.
    combinations = []
    for x in list1:
        for y in list2:
            combinations.append((x, y))
    return combinations
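For reference, allCombinations is just the Cartesian product of the two tag lists, so the same average can be sketched with itertools.product. The correlation function below is only a hypothetical placeholder (1.0 for identical tags, 0.0 otherwise), not my real one, and the snake_case name is used here just to keep the sketch separate from the code above.

```python
import itertools


def correlation(tag1, tag2):
    # Hypothetical placeholder: 1.0 when the tags match, 0.0 otherwise.
    return 1.0 if tag1 == tag2 else 0.0


def calculate_correlation_average(tags1, tags2):
    # Same formula as calculateCorrelationAverage: sum over all tag
    # pairs, divided by the combined number of tags.
    total = sum(
        correlation(t1, t2) for t1, t2 in itertools.product(tags1, tags2)
    )
    return total / (len(tags1) + len(tags2))
```

With this placeholder, the tag lists ["a", "b"] and ["b", "c"] share one tag out of four, so the result is 0.25.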
But calculateCorrelation is not a valid Mapper in appengine-mapreduce, and maybe this function is not even compatible with the MapReduce computation concept. Still, I need to be sure ... it would be really useful for me to have the advantages of automatic batching and task chaining.
Is there any solution for this?
Should I define my own InputReader? Is a new InputReader that reads all instances of two different kinds compatible with the current appengine-mapreduce implementation?
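What I imagine such a reader doing, sketched in plain Python: split the keys of one kind into shards, and let each shard emit every pair of its keys with all keys of the other kind. This does not implement the real appengine-mapreduce InputReader interface. The function names and the string keys are made up for illustration.

```python
def split_into_shards(keys, shard_count):
    # Round-robin split so every key lands in exactly one shard.
    return [keys[i::shard_count] for i in range(shard_count)]


def read_pairs(user_key_shard, item_keys):
    # Yield every (user_key, item_key) pair for this shard of users.
    for user_key in user_key_shard:
        for item_key in item_keys:
            yield (user_key, item_key)


# Hypothetical keys, split across two shards.
shards = split_into_shards(["u1", "u2", "u3"], 2)
pairs = [p for shard in shards for p in read_pairs(shard, ["i1", "i2"])]
```

Each pair shows up in exactly one shard, so the shards could in principle be processed independently.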
Or should I try the following:
- Combine all keys of all entities of these two kinds, two by two, into instances of a new Model (possibly using MapReduce)
- Iterate over the instances of this new Model using mappers
- For each instance, use the keys inside it to get the two entities of different kinds and calculate the correlation between them.
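To make these three steps concrete, here is a rough sketch in plain Python. The USERS and ITEMS dicts are in-memory stand-ins for datastore entities, the keys are made up, and correlation is a hypothetical placeholder. None of this uses the actual appengine-mapreduce API.

```python
import itertools

# In-memory stand-ins for datastore entities: hypothetical key -> tags.
USERS = {"u1": ["python", "gae"], "u2": ["java"]}
ITEMS = {"i1": ["python"], "i2": ["gae", "java"]}


def correlation(tag1, tag2):
    # Hypothetical placeholder: 1.0 when the tags match, 0.0 otherwise.
    return 1.0 if tag1 == tag2 else 0.0


def build_pairs(user_keys, item_keys):
    # Step 1: one "pair" record per (user key, item key) combination.
    return list(itertools.product(user_keys, item_keys))


def map_pair(pair):
    # Steps 2 and 3: called once per pair record; look up both entities
    # by key and correlate their tag lists.
    user_key, item_key = pair
    user_tags = USERS[user_key]
    item_tags = ITEMS[item_key]
    total = sum(
        correlation(t1, t2)
        for t1, t2 in itertools.product(user_tags, item_tags)
    )
    return (user_key, item_key, total / (len(user_tags) + len(item_tags)))


results = [map_pair(p) for p in build_pairs(sorted(USERS), sorted(ITEMS))]
```

One concern with this approach is that the number of pair records is the number of users times the number of items, so it grows quickly.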
python google-app-engine mapreduce google-cloud-datastore task-queue
fjsj