Some time ago I was persuaded to abandon my comfortable Matlab programming and start programming in Julia. I have been working with neural networks for a long time, and I thought that Julia would let me speed things up by parallelizing the gradient computation.
The gradient need not be computed over the entire data set at once; instead, the calculation can be split. For example, by breaking the data set into parts, we can compute a partial gradient on each part. The total gradient is then obtained by summing the partial gradients.
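In symbols: since the loss is a sum over data points, splitting the data into $K$ disjoint parts gives

$$\nabla_w L(w) = \sum_{k=1}^{K} \nabla_w L_k(w),$$

where $L_k$ is the loss restricted to the $k$-th part.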
Although the principle is simple, when I parallelize with Julia I see a performance degradation: one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions posed on this forum, but I still could not piece together an answer. I suspect my problem is a lot of unnecessary data movement, but I cannot figure out how to fix it properly.
To avoid posting messy neural network code, I give a simpler example below that replicates my problem in a linear regression setting.
In the code block below, some data is generated for the linear regression task. The constants are explained in the code; in short, `X` is a matrix containing the input data. We arbitrarily create a weight vector `w` which, when multiplied by `X`, produces the targets `Y`.
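The original code block got lost when I copied the post; here is a minimal sketch of the setup (the exact constants and seed may differ from what I actually run):

```julia
using Random

const N = 10_000        # number of data points
const D = 50            # number of features per data point

Random.seed!(1)
const X = randn(N, D)   # input data: one row per data point
const w = randn(D)      # arbitrary "true" weight vector
const Y = X * w         # targets implied by the true weights
```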
The following code block defines functions for measuring the fit of our regression (i.e., the negative log-likelihood) and the gradient with respect to the weight vector `w`:
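Again the block itself did not survive the copy; below is a minimal reconstruction, assuming Gaussian noise so the negative log-likelihood reduces (up to constants) to half the sum of squared residuals. The names `negative_loglikelihood`, `gradient`, `first_index`, and `last_index` are the ones I use for the rest of the question:

```julia
# Fit measure: negative log-likelihood, which for Gaussian noise is
# (up to constants) half the sum of squared residuals.
function negative_loglikelihood(Y, X, w)
    N, D = size(X)
    ll = 0.0
    for n in 1:N
        y_hat = 0.0
        for d in 1:D
            y_hat += X[n, d] * w[d]   # prediction for data point n
        end
        ll += 0.5 * (y_hat - Y[n])^2  # squared residual
    end
    return ll
end

# Gradient of the above w.r.t. w, accumulated over the data points
# first_index:last_index so a worker can handle just its own slice.
function gradient(Y, X, w, first_index, last_index)
    D = length(w)
    grad = zeros(D)
    for n in first_index:last_index
        y_hat = 0.0
        for d in 1:D
            y_hat += X[n, d] * w[d]
        end
        for d in 1:D
            grad[d] += (y_hat - Y[n]) * X[n, d]
        end
    end
    return grad
end
```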
Please note that the functions above are deliberately not vectorized! I prefer not to vectorize, since the final code (the neural network case) does not admit vectorization either (we will not go into the details of that here).
Finally, the code block below shows a very simple gradient descent that tries to recover the weight vector `w` from the data `Y` and `X`:
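A sketch of the serial descent, reusing the `gradient` function above; the step size and iteration count here are placeholders, not my actual settings:

```julia
function gradient_descent_serial(Y, X; iterations = 100, stepsize = 1e-4)
    N, D = size(X)
    w_est = zeros(D)                        # start from all-zero weights
    for it in 1:iterations
        grad = gradient(Y, X, w_est, 1, N)  # gradient over the full data set
        for d in 1:D
            w_est[d] -= stepsize * grad[d]  # descend along the gradient
        end
    end
    return w_est
end

w_est = gradient_descent_serial(Y, X)
```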
Hopefully, as is apparent, I tried to parallelize the gradient computation in the simplest possible way. My strategy is to split the gradient computation into as many parts as there are available workers. Each worker works only on a part of the matrix `X`, indicated by `first_index` and `last_index`; that is, each worker operates on `X[first_index:last_index, :]`. For example, for 4 workers and N = 10000, the work should be divided as follows (a sketch of my attempt follows the list):
- worker 1 => first_index = 1, last_index = 2500
- worker 2 => first_index = 2501, last_index = 5000
- worker 3 => first_index = 5001, last_index = 7500
- worker 4 => first_index = 7501, last_index = 10000
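The parallel code also got lost in the copy; below is a faithful sketch of what I am doing (the function name `gradient_parallel` is mine, and I reuse the loop-based `gradient` from above, redefined on every worker):

```julia
using Distributed
addprocs(4)   # add 4 worker processes; adjust to the number of cores

# every worker needs its own copy of the gradient function
@everywhere function gradient(Y, X, w, first_index, last_index)
    D = length(w)
    grad = zeros(D)
    for n in first_index:last_index
        y_hat = 0.0
        for d in 1:D
            y_hat += X[n, d] * w[d]
        end
        for d in 1:D
            grad[d] += (y_hat - Y[n]) * X[n, d]
        end
    end
    return grad
end

# Naive parallel gradient: give each worker one contiguous chunk of rows
# and sum the partial gradients. Note that Y, X, and w are serialized and
# shipped to every worker on every single call, which I suspect is the
# data movement that is killing performance.
function gradient_parallel(Y, X, w)
    N = length(Y)
    chunk = ceil(Int, N / nworkers())
    futures = map(enumerate(workers())) do (i, p)
        first_index = (i - 1) * chunk + 1
        last_index = min(i * chunk, N)
        @spawnat p gradient(Y, X, w, first_index, last_index)
    end
    return sum(fetch.(futures))
end
```

Swapping `gradient_parallel(Y, X, w_est)` in place of the serial `gradient` call inside the descent loop is the only change I make for the parallel run.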
Unfortunately, all of this code runs fastest when I have only one worker. Adding more workers via `addprocs()` makes the code run slower. You can aggravate the problem by creating more data points, e.g. N = 20,000 instead of N = 10,000; with more data points the slowdown is even more pronounced. In my computing environment, with N = 20,000 and one core the code runs in ~9 seconds, while with N = 20,000 and 4 cores it needs ~18 seconds!
I have tried a lot of different things, inspired by questions and answers on this forum, but unfortunately to no avail. I understand that my parallelization is naive and that data movement must be the problem, but I have no idea how to do it properly. The documentation seems a bit sparse on this issue (as is the good book by Ivo Balbaert).
I would appreciate your help, as I have been stuck on this for some time and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copying and pasting, you can get the code here.
Thanks for taking the time to read this very long question! Help me turn this into a model answer that anyone starting out in Julia can consult!