Is there a faster way to make this pseudocode efficient in Python/NumPy?

I have three NumPy arrays called RowIndex, ColIndex and Entry. Together they hold a subset of the entries of a matrix: the row index, the column index and the value of each known entry, respectively. I also have two NumPy 2D arrays (matrices) U and M. Let alpha and beta be two given constants. I need to iterate over the subset of matrix entries, which I can do by stepping through RowIndex, ColIndex and Entry. Let's say

    i, j, value = RowIndex[0], ColIndex[0], Entry[0]

then I need to update the i'th row of U and the j'th column of M according to some equation. Then I do

    i, j, value = RowIndex[1], ColIndex[1], Entry[1]

etc. Details below.

    for it in np.arange(len(RowIndex)):
        i = RowIndex[it]
        j = ColIndex[it]
        value = Entry[it]
        # prediction error for this known entry
        e = value - np.dot(U[i, :], M[:, j])
        # copy the old row/column so that both updates use the pre-update values
        OldUi = U[i, :].copy()
        OldMj = M[:, j].copy()
        U[i, :] = OldUi + beta * (e * OldMj - alpha * OldUi)
        M[:, j] = OldMj + beta * (e * OldUi - alpha * OldMj)

The problem is that this code is very slow. Is there any way I can speed it up?

PS: For the curious, this is a variant of one of the winning solutions to the famous Netflix Prize problem. RowIndex corresponds to users, ColIndex corresponds to movies, and the values are their ratings. Most ratings are missing. The known ratings are collected in RowIndex, ColIndex and Entry. You are trying to find matrices U and M such that the rating of user i for movie j is given by np.dot(U[i,:], M[:,j]). Based on the available ratings, you try to find the matrices U and M (or their rows and columns) using the update equations shown in the code above.

1 answer

I think, if I did not misunderstand, your code can be vectorized as follows:

    import numpy as np

    U = ...          # 2D matrix
    M = ...          # 2D matrix
    rows_idx = ...   # list of indices
    cols_idx = ...   # list of indices
    values = ...     # np.array() of values

    e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
    Uo = U.copy()
    Mo = M.copy()
    U[rows_idx, :] += beta * ((e * Mo[:, cols_idx]).T - alpha * Uo[rows_idx, :])
    M[:, cols_idx] += beta * ((e * Uo[rows_idx, :].T) - alpha * Mo[:, cols_idx])

Here

    e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()

calculates your

    e = value - np.dot(U[i,:], M[:,j])

Note that the result you want is on the diagonal of the dot product between the two matrices.
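As a side note (this is my addition, not part of the original answer): forming the full product only to take its diagonal does more work than necessary; the same per-pair dot products can be computed directly, for example with np.einsum:

    # Equivalent to np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal(),
    # but without building the full len(rows_idx) x len(cols_idx) matrix
    e = values - np.einsum('ij,ji->i', U[rows_idx, :], M[:, cols_idx])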

This will not handle consecutive updates (since there is no vectorization for those), but it does let you perform a batch of independent updates in a vectorized, and therefore faster, way.


As stated above, the code I proposed cannot handle sequential updates, because by definition a sequential update scheme cannot be vectorized. Anything of the form

    A(t) = A(t-1) (+ or *) something

where t denotes time, cannot be updated in parallel.
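As a small illustration (my example of standard NumPy behaviour, not part of the original answer): an in-place update through fancy indexing is applied as one batch, so when indices repeat it cannot reproduce a sequential dependence:

    import numpy as np

    a = np.zeros(3)
    idx = np.array([0, 0, 1])
    a[idx] += 1
    print(a)   # [1. 1. 0.] -- index 0 was incremented once, not twice,
               # because the updates are not applied one after another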

So what I suggested is a vectorized scheme for batches of independent updates.

Imagine you have U and M, say 10x10 each, and you have the following row and column indices:

    rows_idx = [1, 1, 3, 4, 5, 0]
    cols_idx = [7, 1, 7, 5, 6, 5]

From these you can define two independent sets (keeping the indices in their original order):

    rows_idx = [1, 4, 5], [1, 3, 0]
    cols_idx = [7, 5, 6], [1, 7, 5]

Note that an independent set is formed by indices that are unique in both the rows and the columns. With this definition, you can reduce the number of loop iterations you need from 6 (in this case) to 2:

    for i in range(len(rows_idx)):
        ridx = rows_idx[i]
        cidx = cols_idx[i]
        vals = values[i]   # the rating values split into the same independent sets
        # Use the vectorized scheme proposed above the edit
        e = vals - np.dot(U[ridx, :], M[:, cidx]).diagonal()
        Uo = U.copy()
        Mo = M.copy()
        U[ridx, :] += beta * ((e * Mo[:, cidx]).T - alpha * Uo[ridx, :])
        M[:, cidx] += beta * ((e * Uo[ridx, :].T) - alpha * Mo[:, cidx])
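For concreteness, here is a minimal runnable sketch of this batch loop with the example indices above (the matrix sizes, constants and rating values are illustrative assumptions, not taken from the question):

    import numpy as np

    alpha, beta = 0.02, 0.01                     # illustrative constants
    rng = np.random.default_rng(0)
    U = rng.standard_normal((10, 4))             # 10 users, 4 latent factors
    M = rng.standard_normal((4, 10))             # 4 latent factors, 10 movies

    # The two independent sets from the example, plus matching rating values
    rows_idx = [np.array([1, 4, 5]), np.array([1, 3, 0])]
    cols_idx = [np.array([7, 5, 6]), np.array([1, 7, 5])]
    values   = [np.array([4.0, 3.0, 5.0]), np.array([2.0, 4.0, 1.0])]

    for ridx, cidx, vals in zip(rows_idx, cols_idx, values):
        e = vals - np.dot(U[ridx, :], M[:, cidx]).diagonal()
        Uo = U.copy()
        Mo = M.copy()
        U[ridx, :] += beta * ((e * Mo[:, cidx]).T - alpha * Uo[ridx, :])
        M[:, cidx] += beta * ((e * Uo[ridx, :].T) - alpha * Mo[:, cidx])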

So, if you have a way to extract the independent updates manually (or easily), or to compute the list of sets with a search algorithm, the code above will vectorize those independent updates.


For clarification, just in case, in the above example:

    rows_idx = [1, 1, 3, 4, 5, 0]
    cols_idx = [7, 1, 7, 5, 6, 5]

The second entry cannot be parallelized with the first, because row 1 appeared before, and the third and last entries cannot be parallelized for the same reason (columns 7 and 5 repeat). Since both rows and columns must be unique within a set, we end up with 2 sets of tuples:

    rows_idx = [1, 4, 5], [1, 3, 0]
    cols_idx = [7, 5, 6], [1, 7, 5]

From here, the way to proceed depends on your data. The problem of finding the independent sets can be very expensive, especially if most updates depend on some previous update.

If your data gives you a way to extract the independent sets (say, the data is ordered in time), then batch updating will help you. If, on the other hand, all the data arrives together (which is common), it depends on one factor:

If you can guarantee that the length N of the independent sets is much greater than the number of independent sets M (which more or less means you end up with a few independent sets, M = {2,3,4}, for N = 100000 row/col indices, i.e. N >> M), then it might be worth searching for the independent sets.

In other words, if you are going to update 30 users and 30 movies in 10,000 different combinations, your data likely depends on previous updates; however, if you are going to update 100,000 users and 100,000 movies in 30 combinations, your data is probably independent.

Some pseudocode to find the independent sets, if you have no way of extracting them beforehand, would look something like this:

    independent_sets = []   # list of sets
    for row, col in zip(rows_idx, cols_idx):
        for iset in independent_sets:
            if row and col DONT exist in iset:
                insert row and col into iset
                break
        if nothing was inserted:
            add a new set to independent_sets
            add the current (row, col) to the new set
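In case it helps, here is one possible runnable version of that pseudocode (a sketch; the function and variable names are mine, not from the original answer). It returns, for each independent set, the positions of the corresponding updates in the original lists, which can then be fed to the batch loop shown earlier:

    def find_independent_sets(rows_idx, cols_idx):
        # Each set is stored as (used_rows, used_cols, positions_in_original_lists)
        sets = []
        for pos, (row, col) in enumerate(zip(rows_idx, cols_idx)):
            for used_rows, used_cols, positions in sets:
                if row not in used_rows and col not in used_cols:
                    used_rows.add(row)
                    used_cols.add(col)
                    positions.append(pos)
                    break
            else:
                # No existing set can take this (row, col): start a new one
                sets.append(({row}, {col}, [pos]))
        return [positions for _, _, positions in sets]

    # With the example indices above:
    # find_independent_sets([1, 1, 3, 4, 5, 0], [7, 1, 7, 5, 6, 5])
    # -> [[0, 3, 4], [1, 2, 5]]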

As you can see, just to find the independent sets you already need to iterate over the entire list of row/column indices. The pseudocode above is not the most efficient, and I am sure there are dedicated algorithms for this. But the cost of finding the independent sets may well be higher than performing all of your sequential updates, if your updates are likely to depend on previous ones.

To finish: after this whole post, it all depends on your data.

  • If you can extract the independent sets in advance, from the way you obtain the rows/columns you want to update, then you can batch-update them easily.

  • If you can guarantee that most of your updates will be independent (say, 990 out of 10000), it might be worth trying to find that set of 990. One way to approximate such a set is to use np.unique:

        # Just get the indices of the unique rows and columns
        _, idx_rows = np.unique(rows_idx, return_index=True)
        _, idx_cols = np.unique(cols_idx, return_index=True)
        # Get the positions where both the row and the column are unique
        idx = np.intersect1d(idx_rows, idx_cols)

    Now idx contains the positions in rows_idx and cols_idx that are unique; hopefully this can significantly reduce your computational cost. You can then use my batch update to quickly update the rows and columns corresponding to those indices, and afterwards use your original approach to update the (hopefully few) remaining entries, iterating over the positions not covered by idx (a small sketch of how to obtain those positions follows this list).

  • If you have many updates for the same users or movies, then... keep your sequential update scheme, since finding the independent sets will be harder than just iterating through the updates.
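As a follow-up to the np.unique suggestion above (a sketch; the rest variable is my naming, not from the original answer), the positions that are not covered by idx, and therefore still need the original sequential loop, can be obtained with np.setdiff1d:

    # Positions NOT covered by the unique-row/unique-column intersection;
    # process these with the original sequential loop afterwards
    rest = np.setdiff1d(np.arange(len(rows_idx)), idx)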
