Parallelizing the gradient calculation in Julia

Some time ago I was persuaded to move away from my comfortable Matlab programming and start programming in Julia. I have been working with neural networks for a long time, and I thought that with Julia I could now speed things up by parallelizing the gradient calculation.

The gradient does not need to be computed over the entire data set at once; instead, the calculation can be broken down. For example, by splitting the data set into parts, we can compute a partial gradient on each part. The total gradient is then obtained by summing up the partial gradients.
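
To make this concrete, here is a tiny self-contained check of that identity (the helper name partial_grad is just for illustration; the sketch uses plain vectors and current Julia syntax rather than the 0.3-era constructs of the code further below):

    using LinearAlgebra

    # Hypothetical helper, for illustration only: the least-squares gradient
    # accumulated over a subset of the rows.
    function partial_grad(Y::Vector, X::Matrix, W::Vector, rows)
        g = zeros(length(W))
        for n in rows
            g += X[n, :] * (Y[n] - dot(X[n, :], W))   # contribution of data item n
        end
        return g
    end

    N, D = 1_000, 5
    X, Wtrue = randn(N, D), randn(D)
    Y = X * Wtrue
    W = randn(D)                      # evaluate the gradient away from the optimum
    half = N ÷ 2
    full  = partial_grad(Y, X, W, 1:N)
    parts = partial_grad(Y, X, W, 1:half) + partial_grad(Y, X, W, half+1:N)
    @assert isapprox(full, parts)     # the partial gradients sum to the full gradient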

Although the principle is simple, when I parallelize with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked on this forum, but I still could not piece together an answer. I suspect my problem is that a lot of unnecessary data movement is going on, but I cannot figure out how to fix it properly.

To avoid posting messy neural network code, I am posting a simpler example below that reproduces my problem in a linear regression setting.

The code block below generates some data for the linear regression task. The comments explain the constants; X is the matrix containing the input data, one data item per row. We arbitrarily create a weight vector w which, when multiplied by X, produces the targets Y.

    ######################################
    ## CREATE LINEAR REGRESSION PROBLEM ##
    ######################################
    # This code implements a simple linear regression problem

    MAXITER = 100       # number of iterations for simple gradient descent
    N = 10000           # number of data items
    D = 50              # dimension of data items

    X = randn(N, D)     # create random matrix of data, data items appear row-wise
    Wtrue = randn(D, 1) # create arbitrary weight matrix to generate targets
    Y = X * Wtrue       # generate targets

The following code block defines functions for measuring the fitness of our regression (i.e., the negative log-likelihood) and the gradient with respect to the weight vector w:

    ####################################
    ##        DEFINE FUNCTIONS        ##
    ####################################

    @everywhere begin

    #-------------------------------------------------------------------
    function negative_loglikelihood(Y, X, W)
    #-------------------------------------------------------------------
        N = size(X, 1)   # number of data items
        ll = 0           # accumulate the log-likelihood here
        for nn = 1:N
            ll = ll - 0.5 * sum((Y[nn,:] - X[nn,:]*W).^2)
        end
        return ll
    end

    #-------------------------------------------------------------------
    function negative_loglikelihood_grad(Y, X, W, first_index, last_index)
    #-------------------------------------------------------------------
        grad = zeros(size(W))   # accumulate gradient contributions of each data item
        for nn = first_index:last_index
            grad = grad + X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
        end
        return grad
    end

    end

Please note that the functions above are deliberately not vectorized! I prefer not to vectorize, because the final code (the neural network case) will not admit any vectorization either (let us not go into the details of that here).

Finally, the code block below shows a very simple gradient descent that attempts to recover the weight parameter vector w from the data Y and X:

    ####################################
    ##    SOLVE LINEAR REGRESSION     ##
    ####################################

    # start from random initial solution
    W = randn(D, 1)

    # learning rate, set here to some arbitrary small constant
    eta = 0.000001

    # the following for-loop implements simple gradient descent
    for iter = 1:MAXITER

        # get gradient
        ref_array = Array(RemoteRef, nworkers())

        # let each worker process part of matrix X
        for index = 1:length(workers())
            # first index of the subset of X that this worker should work on
            first_index = (index - 1) * int(ceil(N / nworkers())) + 1
            # last index of the subset of X that this worker should work on
            last_index = min(index * int(ceil(N / nworkers())), N)

            ref_array[index] = @spawn negative_loglikelihood_grad(Y, X, W, first_index, last_index)
        end

        # gather the gradients calculated on the parts of matrix X
        grad = zeros(size(W))
        for index = 1:length(workers())
            grad = grad + fetch(ref_array[index])
        end

        # now that we have the gradient we can update the parameters W
        W = W + eta * grad

        # report progress, monitor optimisation
        @printf("Iter %d neg_loglikel=%.4f\n", iter, negative_loglikelihood(Y, X, W))
    end

As you can hopefully see, I have tried to parallelize the gradient calculation in the easiest possible way here. My strategy is to break the computation of the gradient into as many parts as there are available workers. Each worker only has to work on a part of the matrix X, the part indicated by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For example, for 4 workers and N = 10000, the work should be divided as follows (a standalone sketch of this chunking arithmetic follows the list):

  • worker 1 => first_index = 1, last_index = 2500
  • worker 2 => first_index = 2501, last_index = 5000
  • worker 3 => first_index = 5001, last_index = 7500
  • worker 4 => first_index = 7501, last_index = 10000
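
The chunking arithmetic can be checked in isolation, with no workers involved; the numbers below are the ones from the example above:

    # Standalone check of the chunk boundaries for N = 10000 and 4 workers.
    N, nw = 10_000, 4
    chunk = ceil(Int, N / nw)                 # 2500 rows per worker
    for index in 1:nw
        first_index = (index - 1) * chunk + 1
        last_index  = min(index * chunk, N)
        println("worker $index => first_index = $first_index, last_index = $last_index")
    end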

Unfortunately, all of this code runs faster when I have only one worker. Adding more workers via addprocs() makes the code run slower. You can aggravate the problem by creating more data items, for example using N = 20,000 instead of N = 10,000: with more data items, the degradation is even more pronounced. In my computing environment, with N = 20,000 the code finishes in ~9 seconds on one core, but needs ~18 seconds on 4 cores!
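
For completeness, this is how I time the two configurations (the file name linreg.jl is just a placeholder for the code blocks above; on Julia 0.7 and later you would additionally need using Distributed):

    # Hypothetical timing harness; assumes the blocks above are saved as linreg.jl.
    # Workers must exist *before* the @everywhere block is evaluated, so add them first.
    addprocs(3)                  # 1 master + 3 workers => nworkers() == 4
    @time include("linreg.jl")   # compare against a fresh session without addprocs()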

I have tried many different things, inspired by the questions and answers on this forum, but unfortunately to no avail. I understand that my parallelization is naive and that data movement must be the problem, but I have no idea how to do it properly. The documentation seems a bit sparse on this topic (as is the nice book by Ivo Balbaert).

I would appreciate your help, as I have been stuck with this for some time and I really need it for my work. For anyone who wants to run the code, to save you the trouble of copying and pasting, you can get the code here.

Thanks for taking the time to read this very long question! Help me turn this into a model answer that anyone new to Julia can then consult!

parallel-processing gradient julia-lang linear-regression
2 answers

I would say that GD is not a good candidate for parallelization with any of the suggested approaches: either SharedArray or DistributedArray, or your own implementation that distributes chunks of the data.

The problem is not in Julia, but in the GD algorithm itself. Consider the code:

The main process:

    for iter = 1:iterations   # iterations: "the more the better"
        δ = _gradient_descent_shared(X, y, θ)
        θ = θ - α * (δ / N)
    end

The problem is the for-loop above, which is inherently sequential: each iteration needs the θ produced by the previous one. No matter how good _gradient_descent_shared is, the total number of iterations destroys the noble concept of parallelization, because only the work inside a single iteration can be spread over the workers.

After reading the question and the suggestion above, I started implementing GD using SharedArray. Please note that I am not a SharedArrays specialist.

The main parts of the process (a simple implementation, without regularization):

    run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
        N = length(y)
        for iter = 1:iterations
            δ = _gradient_descent_shared(X, y, θ)
            θ = θ - α * (δ / N)
        end
        θ
    end

    _gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin
        if size(X, 1) <= length(procs(X))
            return _gradient_descent_serial(X, y, θ)
        else
            rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
            return mapreduce(r -> fetch(r), op, rrefs)
        end
    end

Code common to all workers:

    #= Returns the range of indices of the chunk that each worker can work on.
       The function splits the data examples (the N rows) into chunks; it does not
       split individual examples (the feature dimensionality remains intact). =#
    @everywhere function _worker_range(S::SharedArray)
        idx = indexpids(S)
        if idx == 0
            return 1:size(S, 1), 1:size(S, 2)
        end
        nchunks = length(procs(S))
        splits = [round(Int, s) for s in linspace(0, size(S, 1), nchunks + 1)]
        splits[idx]+1:splits[idx+1], 1:size(S, 2)
    end

    # Computation on one chunk of the data.
    @everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
        prange = _worker_range(X)
        pX = sdata(X[prange[1], prange[2]])
        py = sdata(y[prange[1], :])
        tempδ = pX' * (pX * sdata(θ) .- py)
    end

Loading the data and training. Let us assume that we have:

  • the data in X :: Array of size (N, D), where N is the number of examples and D is the feature dimensionality
  • the labels in y :: Array of size (N, 1)

The main code might look like this:

    X = [ones(size(X, 1)) X]   # adding the artificial coordinate
    N, D = size(X)
    MAXITER = 500
    α = 0.01

    initialθ = SharedArray(Float64, (D, 1))
    sX = convert(SharedArray, X)
    sy = convert(SharedArray, y)

    X = nothing
    y = nothing
    gc()

    finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);

After implementing this and running it (on the 8 cores of my Intel Core i7), I got a very small speedup over the serial GD (1 core) on my multiclass (19 classes) training data: 715 s for the serial GD versus 665 s for the shared GD.

If my implementation is correct (please check this, I am counting on it), then parallelizing the GD algorithm is not worth it. You can definitely get better speedups using stochastic GD on a single core.
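
For reference, a minimal single-core minibatch variant for the linear-regression setup of the question; the hyperparameters are arbitrary, and a proper SGD would also shuffle the rows each epoch (e.g. with randperm(N)). This is a sketch, not a tuned implementation:

    # Minimal single-core minibatch gradient ascent on the log-likelihood,
    # matching the question's update W = W + eta * grad. Sketch only.
    function sgd(Y, X, W, eta, epochs, batchsize)
        N = size(X, 1)
        for epoch in 1:epochs
            for i in 1:batchsize:N
                rows = i:min(i + batchsize - 1, N)
                Xb, Yb = X[rows, :], Y[rows, :]
                grad = Xb' * (Yb - Xb * W)   # minibatch gradient
                W = W + eta * grad           # noisy but much more frequent updates
            end
        end
        return W
    end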


If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector and pass it as an argument to each worker. Each worker then fills in only the part of it that it is responsible for.
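
A minimal sketch of what that could look like on current Julia (single machine only, since SharedArrays need shared memory; the function name fill_partial_grad! and the column-per-worker layout are my own choices, not an established API): each worker writes its partial gradient into its own column of one preallocated SharedArray, and the master merely sums the columns.

    using Distributed, SharedArrays
    addprocs(3)                            # workers must exist before @everywhere runs
    @everywhere using SharedArrays, LinearAlgebra

    # Each worker writes its partial gradient into its own column of G,
    # so no gradient data is serialized back to the master.
    @everywhere function fill_partial_grad!(G, Y, X, W, col, first_index, last_index)
        g = zeros(size(X, 2))
        for nn in first_index:last_index
            g += X[nn, :] * (Y[nn] - dot(X[nn, :], W))   # per-row contribution
        end
        G[:, col] = g                      # only this worker touches column col
        return nothing
    end

    N, D = 10_000, 50
    Xs = SharedArray{Float64}((N, D)); Xs .= randn(N, D)
    Ys = SharedArray{Float64}(N);      Ys .= Xs * randn(D)
    W  = randn(D)

    G     = SharedArray{Float64}((D, nworkers()))  # allocated once, reused every iteration
    chunk = ceil(Int, N / nworkers())
    @sync for (col, p) in enumerate(workers())
        fi, li = (col - 1) * chunk + 1, min(col * chunk, N)
        @async remotecall_wait(fill_partial_grad!, p, G, Ys, Xs, W, col, fi, li)
    end
    grad = vec(sum(G, dims = 2))           # the master only sums nworkers() small columns

Since G is allocated once and reused across gradient-descent iterations, the only per-iteration traffic is the small weight vector W.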

