I have a CSV file containing feature values for items: each line is a triple (id_item, id_feature, value) representing the value of a particular feature for a particular item. The data is very sparse.
I need to compute two item distance matrices, one using Pearson correlation as the metric and the other using the Jaccard index.
At the moment I have an in-memory implementation that does something like this:
import numpy as np
from scipy.sparse import coo_matrix
import scipy.spatial.distance as ds

# Each row of file.csv is a triple: id_item, id_feature, value
my_data = np.genfromtxt('file.csv', delimiter=',')
i, j, value = my_data.T
# genfromtxt returns floats; the row and column indices must be integers
m = coo_matrix((value, (i.astype(int), j.astype(int))))
# pdist needs a dense array
m = m.toarray()
# Pairwise correlation distance between the rows of m.T (the columns of m)
d = ds.pdist(m.T, 'correlation')
# Expand the condensed distance vector into a square matrix
d = ds.squareform(d)
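The snippet above only covers the correlation distance, and it densifies the matrix first. A rough sketch of how both metrics might be computed straight from the sparse matrix follows. The function names are hypothetical, Jaccard is assumed to treat any stored non-zero as "feature present", and the result is still a dense n x n array, so this only helps with the input side of the memory problem.

```python
import numpy as np
from scipy.sparse import csr_matrix


def jaccard_distance_sparse(m):
    """Pairwise Jaccard distance between the rows of m.

    Assumption: a feature counts as present when its value is non-zero.
    """
    b = csr_matrix(m, dtype=np.float64, copy=True)
    b.eliminate_zeros()
    b.data[:] = 1.0                              # binarize the stored entries
    inter = (b @ b.T).toarray()                  # sizes of the intersections
    sizes = np.asarray(b.sum(axis=1)).ravel()    # features present per row
    union = sizes[:, None] + sizes[None, :] - inter
    dist = np.zeros_like(union)                  # two empty rows get distance 0
    mask = union > 0
    dist[mask] = 1.0 - inter[mask] / union[mask]
    return dist


def correlation_distance_sparse(m):
    """Pairwise correlation distance (1 - Pearson r) between the rows of m."""
    x = csr_matrix(m, dtype=np.float64)
    n = x.shape[1]
    mean = np.asarray(x.mean(axis=1)).ravel()
    gram = (x @ x.T).toarray()                   # raw dot products
    cov = gram / n - np.outer(mean, mean)        # population covariance
    std = np.sqrt(np.diag(cov))
    # Constant rows have zero variance and produce NaN, as pdist does
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - cov / np.outer(std, std)
```

Both functions compare rows, so to reproduce the `pdist(m.T, ...)` call above they would be given the transposed sparse matrix.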
Which of the following approaches would be best?
1) Sklearn's pairwise_distances has an n_jobs parameter (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) that runs the computation in parallel.
On HPC, scikit-learn's parallelism is handled by Joblib.
Then there is the CSV itself: I could store the CSV in HDFS and read it with something like this:
import subprocess
# Stream the file out of HDFS through the hadoop CLI
cat = subprocess.Popen(["hadoop", "fs", "-cat", "data.csv"], stdout=subprocess.PIPE)
for line in cat.stdout:
    ...  # handle one CSV line
2) Store the data in HDFS and run the computation as a MapReduce job with mrjob.
3) Store the data in HDFS and express the computation in SQL, run through PyHive.
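To make option 1) concrete, here is a minimal sketch of what the n_jobs route might look like. The wrapper name `item_distances` is hypothetical, and it assumes the dense matrix fits in memory on one machine; as far as I understand, n_jobs spreads the work over local CPU cores via Joblib rather than over cluster nodes.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def item_distances(m, metric="correlation", n_jobs=-1):
    """Dense pairwise distances between the rows of a dense array m.

    n_jobs=-1 asks scikit-learn to use all available local cores.
    """
    x = np.asarray(m)
    if metric == "jaccard":
        # Assumption: a feature counts as present when its value is non-zero
        x = x != 0
    return pairwise_distances(x, metric=metric, n_jobs=n_jobs)
```

For example, `item_distances(m.T, "jaccard")` would give the Jaccard counterpart of the correlation matrix computed earlier.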
Ideally, I would go with some variant of 1).
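For the HDFS read in option 1), the per-line loop could feed a sparse matrix directly instead of going through genfromtxt. This is only a sketch: `triples_to_sparse` is a hypothetical helper, and it assumes every line is `id_item,id_feature,value` with non-negative integer ids and no header row.

```python
import io
from scipy.sparse import coo_matrix


def triples_to_sparse(lines):
    """Build a sparse item x feature matrix from an iterable of CSV lines.

    Accepts bytes (as produced by a subprocess pipe) or str lines.
    """
    rows, cols, vals = [], [], []
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        i, j, v = line.split(",")
        rows.append(int(i))
        cols.append(int(j))
        vals.append(float(v))
    # Duplicate (i, j) pairs are summed when converting to CSR
    return coo_matrix((vals, (rows, cols))).tocsr()


# Stand-in for cat.stdout so the sketch runs without a Hadoop installation
fake_stdout = io.BytesIO(b"0,0,1.5\n1,2,3.0\n\n2,1,4.0\n")
m = triples_to_sparse(fake_stdout)
```

With the Popen object from the snippet above, the call would be `triples_to_sparse(cat.stdout)`.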