Computing a pairwise distance matrix: is there a scalable, big-data-ready approach available in Python?

I have a CSV file of feature values for items: each line is a triple (id_item, id_feature, value) giving the value of a particular feature for a particular item. The data are very sparse.

I need to compute two item-distance matrices: one using the Pearson correlation coefficient as the metric, the other using the Jaccard index.

At the moment I have implemented an in-memory solution, and I am doing something like this:

import numpy as np
from numpy import genfromtxt
from scipy.sparse import coo_matrix
import scipy.spatial.distance as ds

# read the data: each row is (id_item, id_feature, value)
my_data = genfromtxt('file.csv', delimiter=',')
i, j, value = my_data.T

# create a sparse item-by-feature matrix (indices must be integers)
m = coo_matrix((value, (i.astype(int), j.astype(int))))

# convert to a dense numpy array for pdist
m = m.toarray()

# condensed distance matrix between items (rows) using Pearson correlation
d = ds.pdist(m, 'correlation')

# expand to a square distance matrix
d = ds.squareform(d)
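The Jaccard matrix can be computed the same way, on a binarized copy of the matrix. A minimal sketch, assuming that presence/absence of a feature is what the Jaccard index should compare (the toy matrix here is a hypothetical stand-in for the dense m above):

```python
import numpy as np
import scipy.spatial.distance as ds

# toy item-by-feature matrix (hypothetical data)
m = np.array([[1.0, 0.0, 2.0],
              [1.0, 3.0, 0.0],
              [0.0, 3.0, 2.0]])

# binarize: Jaccard compares presence/absence of features
b = m > 0

# condensed, then square, Jaccard distance matrix between items (rows)
d_jac = ds.squareform(ds.pdist(b, 'jaccard'))
```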

This works, but it does not scale: the dense matrix has to fit in memory, and pdist runs single-threaded, so with a large number of items the computation becomes impractical.

What scalable approaches could I use? I have considered the following:

1) Sklearn's pairwise_distances has an n_jobs parameter for parallel computation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html), but as far as I know scikit-learn does not target HPC clusters: the Joblib parallelism is limited to a single machine.
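A minimal sketch of option 1) on a single machine, assuming the dense item-by-feature matrix from above (the toy matrix is a hypothetical stand-in):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# toy item-by-feature matrix (hypothetical stand-in for m)
m = np.array([[1.0, 0.0, 2.0],
              [1.0, 3.0, 0.0],
              [0.0, 3.0, 2.0]])

# correlation distance between items, parallelized across all cores
d_corr = pairwise_distances(m, metric='correlation', n_jobs=-1)

# the Jaccard metric expects boolean input with this API
d_jac = pairwise_distances(m > 0, metric='jaccard', n_jobs=-1)
```

Both calls return the full square matrix directly, at the cost of computing each pair twice compared to pdist.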

Moreover, this reads the whole CSV from local disk; if the CSV lives on HDFS, I would have to stream it, for example with:

import subprocess
cat = subprocess.Popen(["hadoop", "fs", "-cat", "data.csv"], stdout=subprocess.PIPE)

and then iterate over cat.stdout:

for line in cat.stdout:
    ....

but I am not sure this is a good approach.
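Streaming the file line by line, the triples can be accumulated and the sparse matrix built incrementally instead of loading everything with genfromtxt. A sketch, reading from an in-memory string in place of the hadoop fs -cat pipe:

```python
import io
from scipy.sparse import coo_matrix

# stand-in for cat.stdout: three (id_item, id_feature, value) lines
stream = io.StringIO("0,0,1.5\n0,2,2.0\n1,1,3.0\n")

rows, cols, vals = [], [], []
for line in stream:
    i, j, v = line.strip().split(',')
    rows.append(int(i))
    cols.append(int(j))
    vals.append(float(v))

# sparse item-by-feature matrix built from the streamed triples
m = coo_matrix((vals, (rows, cols)))
```

With the actual subprocess pipe the lines arrive as bytes, so each line would need a .decode() first.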

2) Store the data on HDFS, implement the computation in a map-reduce fashion, and run it with mrjob.
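The map-reduce formulation behind option 2) can be sketched without mrjob itself: map each triple to its feature, reduce by collecting the items that share that feature, and emit the dot-product contribution of every co-occurring item pair. A pure-Python sketch of those two phases (mrjob would split them into mapper/reducer methods; the triples are hypothetical sample data):

```python
from collections import defaultdict
from itertools import combinations

# sample (id_item, id_feature, value) triples
triples = [(0, 0, 1.0), (1, 0, 2.0), (0, 1, 3.0), (2, 1, 1.0)]

# "map" phase: key each (item, value) by feature
by_feature = defaultdict(list)
for item, feature, value in triples:
    by_feature[feature].append((item, value))

# "reduce" phase: for each feature, sum the dot-product
# contribution of every pair of items sharing that feature
dot = defaultdict(float)
for feature, entries in by_feature.items():
    for (a, va), (b, vb) in combinations(sorted(entries), 2):
        dot[(a, b)] += va * vb
```

From these pairwise dot products, together with per-item sums and norms, both the Pearson correlation and the Jaccard index can be assembled without ever materializing the dense matrix.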

3) Store the data on HDFS, express the computation in a SQL-like manner (I don't know whether this is easy or even feasible), and run it with PyHive.

Of course, I would like to keep as much of the current code as possible, so a variant of solution 1) would be best for me.


Answer:

As a prototype, I would use Pyro4 with a divide-and-conquer approach: a master node splits the work among the worker nodes and collects the results.

With n items there are n(n-1)/2 pairs to evaluate; split these pairs into chunks and have each node compute its chunk with sklearn (using n_jobs).

Finally, combine the partial results on the master node.
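Splitting the n(n-1)/2 index pairs into roughly equal chunks for the worker nodes can be sketched as follows (the function name pair_chunks is an illustration, not part of any library):

```python
from itertools import combinations

def pair_chunks(n, n_nodes):
    """Split the n*(n-1)//2 index pairs into n_nodes roughly equal chunks."""
    pairs = list(combinations(range(n), 2))
    size = -(-len(pairs) // n_nodes)  # ceiling division
    return [pairs[k:k + size] for k in range(0, len(pairs), size)]

# 5 items -> 10 pairs, distributed over 3 nodes
chunks = pair_chunks(5, 3)
```

Each worker then looks up the rows of the feature matrix for its pairs and computes the two distances locally.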

Update:

I ended up testing this with PySpark 2.1.1.
