PIG: Calculate the maximum monthly increase in the number of wiki pagecount data requests per article

I have some wiki dump data from https://dumps.wikimedia.org/other/pagecounts-raw/2015/. I want to calculate the monthly growth in requests for each wiki article in 2015, and then find out in which month an article had its biggest growth in requests and how large that growth was. For context, the wiki data has the format "wikiproject" "article-url" "number of requests" "page size in bytes", for example: fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624 and en Main_Page 242332 4737756101

The configuration of our cluster is still "incomplete", so I have to try it on a Cloudera Quickstart VM with a smaller data set. I used only the page dumps of 1 hour for 3 months. However, when I try to illustrate this, it ends with a Java heap space error, or I get a "GC overhead limit exceeded" message.

This is my code:

m1  = LOAD '/user/cloudera/2015/2015-01' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);
m2  = LOAD '/user/cloudera/2015/2015-02' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);
m3  = LOAD '/user/cloudera/2015/2015-03' USING PigStorage(' ') as(proj:chararray, url:chararray, req:long, size:long);

m11 = SAMPLE m1 0.1;
m22 = SAMPLE m2 0.1;
m33 = SAMPLE m3 0.1;

a = COGROUP m11 by url, m22 by  url, m33 by  url;
b = FOREACH a generate group, SUM(m11.req) as s1, SUM(m22.req) as s2, SUM(m33.req) as s3;
c = FOREACH b generate group, ((s2-s1) > 0 ? (s2-s1): 0) as dm2, ((s3-s2)> 0 ? (s3-s2): 0) as dm3 parallel 10;
d = FOREACH c generate group as Artikel, MAX(TOBAG(dm2,dm3)) as maxZugriffe;
e = order d by maxZugriffe desc;
f = limit e 10;

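So I load the data for the three months, take a 10% sample of each, cogroup the samples by article (= url), and sum the requests per month. Then I calculate the difference between consecutive months, keeping it only if it is > 0 (growth, not decline), otherwise 0. Finally I order my relation by maxRequests (= maxZugriffe) and limit it to the top 10...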

... ? ... Quickstart VM ...

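Another question: is it possible to define an alias inside a bincondition? Something like: c = FOREACH b generate group, ((s2-s1) as 'diff' > 0 ? diff : 0) as dm2; so that I can reuse "diff" and do not have to write (s2-s1) twice ...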

Edit: ... ?


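Regarding your question "is it possible to define an alias inside a bincondition?": no, you cannot define an alias inside a bincondition. It is the same as in SQL, where an alias cannot be used in the same projection that defines it.
Instead, compute the difference in a separate step first, then use it: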

b = FOREACH a generate group, SUM(m11.req) as s1, SUM(m22.req) as s2, SUM(m33.req) as s3;  
x = FOREACH b generate group,s1,s2,s3,(s2-s1) as diff;  
c = FOREACH x generate group, (diff > 0 ? diff: 0) as dm2;

This way, (s2-s1) is projected once under the alias diff, and the next statement can refer to diff in the bincondition.
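
As a rough sketch only, the same two-step pattern could be extended to both month pairs from the question. It assumes relation b has the fields group, s1, s2, s3 as generated above. The aliases diff2 and diff3 are names invented for this illustration, not part of the original code.

```pig
-- Sketch under assumptions: b has the schema (group, s1, s2, s3).
-- diff2 and diff3 are hypothetical alias names chosen for this example.

-- Step 1: project each month-over-month difference once, under its own alias.
x = FOREACH b GENERATE group, (s2 - s1) AS diff2, (s3 - s2) AS diff3;

-- Step 2: reuse the aliases in the binconditions; a decline is clamped to 0.
c = FOREACH x GENERATE group,
        (diff2 > 0 ? diff2 : 0L) AS dm2,
        (diff3 > 0 ? diff3 : 0L) AS dm3;

-- Step 3: take the larger of the two monthly increases per article,
-- as in the question's original statement d.
d = FOREACH c GENERATE group AS Artikel, MAX(TOBAG(dm2, dm3)) AS maxZugriffe;
```

The extra FOREACH only adds a projection, so it should not introduce another reduce phase. The 0L literal is used here simply to keep both branches of each bincondition of type long.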
