Implementing PageRank using MapReduce

I am trying to figure out a problem with PageRank implementation theory with MapReduce.

I have the following simple scenario with three nodes: AB C.

The adjacency matrix is ​​here:

A { B, C } B { A } 

For example, PageRank for B is:

 (1-d)/N + d ( PR(A) / C(A) ) N = number of incoming links to B PR(A) = PageRank of incoming link A C(A) = number of outgoing links from page A 

I have everything in order with all the circuits and how the handler and gearbox will work, but I can’t understand how C (A) will be known at the time of the calculation by the gearbox. As a reducer, when calculating PageRank of B by combining incoming links to B, he will know the number of outgoing links from each page. Is it required to search in some external data source?

+8
algorithm pagerank mapreduce
source share
3 answers

Here is the pseudo code:

 map( key: [url, pagerank], value: outlink_list ) for each outlink in outlink_list emit( key: outlink, value: pagerank/size(outlink_list) ) emit( key: url, value: outlink_list ) reducer( key: url, value: list_pr_or_urls ) outlink_list = [] pagerank = 0 for each pr_or_urls in list_pr_or_urls if is_list( pr_or_urls ) outlink_list = pr_or_urls else pagerank += pr_or_urls pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank ) emit( key: [url, pagerank], value: outlink_list ) 

It is important that in the abbreviation you should display outgoing links, not inlinks, as some articles on intenret suggest. Thus, sequential iterations will also have outgoing links as input to the converter.

Please note that multiple outbound links with the same address from the same page are considered one. Also, ignore loops (linking the page to yourself).

The attenuation coefficient is usually 0.85, although you can play with other values.

+13
source share
+3
source share

Iteratively evaluate PR. PR (x) = Sum (PR (a) * weight (a), a in in_links) on

 map ((url,PR), out_links) //PR = random at start for link in out_links emit(link, ((PR/size(out_links)), url)) reduce(url, List[(weight, url)): PR =0 for v in weights PR = PR + v Set urls = all urls from list emit((url, PR), urls) 

therefore, the output is equal to the input value, and we can do this before coverage.

+1
source share

All Articles