Convert dictionary to sparse matrix

I have a dictionary with user_ids as keys and, as values, lists of the movie_ids each user liked, with # unique_users = 573000 and # unique_movies = 16000.

{1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131,567,897,923], ..........}

Now I want to convert this to a matrix with user_ids as rows and movie_ids as columns, with a value of 1 for each movie the user liked. Its shape will be 573000 x 16000.

Ultimately, I need to multiply this matrix by its transpose to get a co-occurrence matrix with dimensions (# unique_movies, # unique_movies).

Also, what is the time complexity of the operation X' * X, where X has a shape of roughly (500000, 12000)?

+5
3 answers

I think you can create an empty dok_matrix and fill in the values. Then transpose it and convert it to csr_matrix for efficient matrix multiplication.

    import numpy as np
    import scipy.sparse as sp

    d = {1: [51, 379, 552, 2333, 2335, 4089, 4484],
         2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484],
         5: [783, 909, 1052, 1138, 1147, 2676],
         7: [171, 321, 959],
         9: [3193],
         10: [959],
         11: [131, 567, 897, 923]}

    # DOK format allows cheap incremental assignment.
    mat = sp.dok_matrix((573000, 16000), dtype=np.int8)
    for user_id, movie_ids in d.items():
        mat[user_id, movie_ids] = 1

    # Transpose to (movies, users) and convert to CSR for multiplication.
    mat = mat.transpose().tocsr()
    print(mat.shape)
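To connect this with the co-occurrence matrix the question asks for, here is a minimal sketch at a reduced size. The shape (6, 1000) and the names movies_by_users and cooc are illustrative assumptions, not part of the answer. As far as I know, the sparse product keeps the operands' dtype, so with int8 any count above 127 would overflow; the sketch casts to int32 before multiplying for that reason.

```python
import numpy as np
import scipy.sparse as sp

d = {1: [51, 379, 552], 2: [51, 379], 5: [783]}

# Illustrative small shape; the question's real shape is (573000, 16000).
n_users, n_movies = 6, 1000
mat = sp.dok_matrix((n_users, n_movies), dtype=np.int8)
for user_id, movie_ids in d.items():
    for movie_id in movie_ids:
        mat[user_id, movie_id] = 1

# (movies, users) in CSR, widened so co-occurrence counts cannot wrap at 127.
movies_by_users = mat.transpose().tocsr().astype(np.int32)

# Entry (i, j) counts the users who liked both movie i and movie j.
cooc = movies_by_users @ movies_by_users.T
print(cooc.shape)
```

The diagonal of cooc holds the number of users who liked each movie, which is a quick sanity check on the result.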
+1
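    import pandas as pd

    df = {1: [51, 379, 552, 2333, 2335, 4089, 4484],
          2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484],
          5: [783, 909, 1052, 1138, 1147, 2676],
          7: [171, 321, 959],
          9: [3193],
          10: [959],
          11: [131, 567, 897, 923],
          # .......... (remaining entries elided)
          }

    # One row per user, one column per position in that user's list.
    df2 = pd.DataFrame.from_dict(df, orient='index')
    # Long form: level_0 = user ID, level_1 = list position, 0 = movie ID.
    df2 = df2.stack().reset_index()
    df2['level_1'] = 1
    df2.pivot(index='level_0', columns=0, values='level_1').fillna(0)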

This converts the dict into a data frame, then stacks it so that the user IDs and movie IDs end up in separate columns. All values of the otherwise unused column level_1 are then set to 1. The last statement creates a pivot table, filling in nonexistent combinations with zeros.
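Note that a dense pivot table at the question's size would have 573000 x 16000, or roughly 9.2 billion, cells. A possible variation, sketched here under assumptions, keeps the long-form frame but skips the pivot and builds a sparse matrix instead. It uses pd.factorize to map arbitrary IDs to contiguous indices; the column names user_id and movie_id and the variable names are illustrative, not from the answer.

```python
import pandas as pd
import scipy.sparse as sp

d = {1: [51, 379], 2: [51, 552], 5: [783]}

# Long form: one (user_id, movie_id) row per like.
pairs = pd.DataFrame(
    [(user, movie) for user, movies in d.items() for movie in movies],
    columns=["user_id", "movie_id"],
)

# factorize assigns codes 0..n-1 in order of first appearance and
# returns the original IDs so rows and columns can be mapped back.
user_codes, user_index = pd.factorize(pairs["user_id"])
movie_codes, movie_index = pd.factorize(pairs["movie_id"])

X = sp.csr_matrix(
    ([1] * len(pairs), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)
print(X.shape)
```

Here movie_index[j] gives the movie ID behind column j, so no rows or columns are wasted on IDs that never occur.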

+2

You can create the csr_matrix directly, for example with the constructor form csr_matrix((data, (row_ind, col_ind))). Here is a snippet showing how to do this.

    import scipy.sparse as sp

    d = {0: [0, 1], 1: [1, 2, 3], 2: [3, 4, 5], 3: [4, 5, 6],
         4: [5, 6, 7], 5: [7], 6: [7, 8, 9]}

    # Repeat each key once per item in its list, so rows line up with columns.
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]

    X = sp.csr_matrix(([1] * len(row_ind), (row_ind, col_ind)))  # sparse csr matrix

You can then use the matrix X to compute the co-occurrence matrix (i.e. X.T * X) (credit: github @daniel-acuna). There is probably a faster way to convert the dictionary of lists to row_ind, col_ind.
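One candidate for building the index arrays with less Python-level looping is np.repeat together with np.fromiter. This is a sketch only: it has not been benchmarked against the list comprehensions above, and the names keys, lengths, data and cooc are illustrative assumptions.

```python
from itertools import chain

import numpy as np
import scipy.sparse as sp

d = {0: [0, 1], 1: [1, 2, 3], 2: [3, 4, 5], 3: [4, 5, 6],
     4: [5, 6, 7], 5: [7], 6: [7, 8, 9]}

keys = np.fromiter(d.keys(), dtype=np.int64, count=len(d))
lengths = np.fromiter((len(v) for v in d.values()), dtype=np.int64, count=len(d))

# Repeat each key once per item in its list; flatten the lists in the same order.
row_ind = np.repeat(keys, lengths)
col_ind = np.fromiter(chain.from_iterable(d.values()),
                      dtype=np.int64, count=int(lengths.sum()))

data = np.ones(len(row_ind), dtype=np.int32)
X = sp.csr_matrix((data, (row_ind, col_ind)))

# Item-by-item co-occurrence counts.
cooc = X.T @ X
print(X.shape, cooc.shape)
```

Passing an explicit shape=(n_users, n_movies) to csr_matrix may be worth considering on real data, since the inferred shape only extends to the largest index that actually occurs.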

0