Gini coefficient in Julia: efficient and accurate code

Question

Gini coefficient in Julia: efficient and accurate code

I am trying to implement the following formula in Julia to calculate the Gini coefficient of wage distribution:

enter image description here

Where enter image description here

Here is a simplified version of the code I'm using for this:

# Takes a array where first column is value of wages # (y_i in formula), and second column is probability # of wage value (f(y_i) in formula). function gini(wagedistarray) # First calculate S values in formula for i in 1:length(wagedistarray[:,1]) for j in 1:i Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1] end end # Now calculate value to subtract from 1 in gini formula Gwages = Swages[1]*wagedistarray[1,2] for i in 2:length(Swages) Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1]) end # Final step of gini calculation return giniwages=1-(Gwages/Swages[length(Swages)]) end wagedistarray=zeros(10000,2) Swages=zeros(length(wagedistarray[:,1])) for i in 1:length(wagedistarray[:,1]) wagedistarray[i,1]=1 wagedistarray[i,2]=1/10000 end @time result=gini(wagedistarray)

It gives an almost zero value that you expect from a completely equal pay distribution. However, this takes a rather long time: 6.796 seconds.

Any ideas for improvement?

+6

statistics inequality distribution julia-lang

David Zentler-Munro Jul 9 '15 at 15:24

source share

2 answers

IainDunning has already provided a good answer with code that is fast enough for practical purposes ( gini2 function). If you use performance tuning, you can get an additional increase in speed by 20 times, avoiding temporary arrays ( gini3 ). See the following code that compares the performance of two implementations:

 using TimeIt wagedistarray=zeros(10000,2) for i in 1:size(wagedistarray,1) wagedistarray[i,1]=1 wagedistarray[i,2]=1/10000 end wages = wagedistarray[:,1] wagefrequencies = wagedistarray[:,2]; # original code function gini2(wagedistarray) Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2]) Gwages = Swages[1]*wagedistarray[1,2] + sum(wagedistarray[2:end,2] .* (Swages[2:end]+Swages[1:end-1])) return 1 - Gwages/Swages[end] end # new code function gini3(wages, wagefrequencies) Swages_previous = wages[1]*wagefrequencies[1] Gwages = Swages_previous*wagefrequencies[1] @inbounds for i = 2:length(wages) freq = wagefrequencies[i] Swages_current = Swages_previous + wages[i]*freq Gwages += freq * (Swages_current+Swages_previous) Swages_previous = Swages_current end return 1.0 - Gwages/Swages_previous end result=gini2(wagedistarray) # warming up JIT println("result with gini2: $result, time:") @timeit result=gini2(wagedistarray) result=gini3(wages, wagefrequencies) # warming up JIT println("result with gini3: $result, time:") @timeit result=gini3(wages, wagefrequencies)

Output:

 result with gini2: 0.0, time: 1000 loops, best of 3: 321.57 µs per loop result with gini3: -1.4210854715202004e-14, time: 10000 loops, best of 3: 16.24 µs per loop

gini3 slightly less accurate than gini2 due to sequential summation; one would have to use the pairing option to increase accuracy.

+4

Andras niedermayer Jul 30 '15 at 20:23

source share

Iaindunning · Accepted Answer · 2015-07-09T15:56:31+0000

Try the following:

 function gini(wagedistarray) nrows = size(wagedistarray,1) Swages = zeros(nrows) for i in 1:nrows for j in 1:i Swages[i] += wagedistarray[j,2]*wagedistarray[j,1] end end Gwages=Swages[1]*wagedistarray[1,2] for i in 2:nrows Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1]) end return 1-(Gwages/Swages[length(Swages)]) end wagedistarray=zeros(10000,2) for i in 1:size(wagedistarray,1) wagedistarray[i,1]=1 wagedistarray[i,2]=1/10000 end @time result=gini(wagedistarray)

Time to: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
Time after: 0.134799301 seconds (507260 bytes allocated)
Time after (second start): elapsed time: 0.123665107 seconds (80112 bytes allocated)

The main problems are that Swages was a global variable (not a function), which is not good coding practice, but more importantly, it is a performance killer . Another thing I noticed is length(wagedistarray[:,1]) , which makes a copy of this column and then asks for its length - this created additional “garbage”. The second run is faster because the first time the function is run, compilation time is executed.

You increase productivity even higher using @inbounds , i.e.

 function gini(wagedistarray) nrows = size(wagedistarray,1) Swages = zeros(nrows) @inbounds for i in 1:nrows for j in 1:i Swages[i] += wagedistarray[j,2]*wagedistarray[j,1] end end Gwages=Swages[1]*wagedistarray[1,2] @inbounds for i in 2:nrows Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1]) end return 1-(Gwages/Swages[length(Swages)]) end

which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)

Finally, check out this version, which is actually faster than everything, and also the most accurate, I think:

 function gini2(wagedistarray) Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2]) Gwages = Swages[1]*wagedistarray[1,2] + sum(wagedistarray[2:end,2] .* (Swages[2:end]+Swages[1:end-1])) return 1 - Gwages/Swages[end] end

What has elapsed time: 0.00041119 seconds (721664 bytes allocated) . The main advantage was that the transition from O (n ^ 2) to the cycle is O (n) cumsum .

Gini coefficient in Julia: efficient and accurate code

More articles: