Fast Information Gain calculation

I need to calculate Information Gain scores for >100k features across >10k documents for text classification. The code below works, but for the full dataset it is very slow: it takes more than an hour on a laptop. The dataset is 20newsgroups and I use scikit-learn; the chi2 function provided in scikit-learn is very fast.

Any idea how to calculate Information Gain faster for such a dataset?

import numpy as np

def information_gain(x, y):

    def _entropy(values):
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        feature_set_indices = np.nonzero(feature)[1]
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])
        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []
    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []
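For context, this is roughly how I call it on the vectorized 20newsgroups data (the vectorizer settings below are only illustrative, not my exact pipeline); chi2 is called the same way and finishes almost instantly:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

newsgroups = fetch_20newsgroups(subset='train')
vectorizer = CountVectorizer(binary=True)        # sparse CSR document-term matrix
x = vectorizer.fit_transform(newsgroups.data)    # >10k documents, >100k features
y = np.asarray(newsgroups.target)

chi2_scores, _ = chi2(x, y)                  # fast
ig_scores, _ = information_gain(x, y)        # the slow part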

EDIT:

I inlined the internal functions and ran cProfile as shown below (on a dataset limited to ~15k features and ~1k documents):

cProfile.runctx(
    """for feature in x.T:
    feature_set_indices = np.nonzero(feature)[1]
    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
    values = y[feature_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_set = - np.sum(probs * np.log(probs))
    values = y[feature_not_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_not_set = - np.sum(probs * np.log(probs))
    result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                               + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
    information_gain_scores.append(result)""",
    globals(), locals())

Top 20 results sorted by tottime:

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1   60.27    60.27    65.48    65.48   <string>:1(<module>)
    16171    1.362        0    2.801        0  csr.py:313(_get_row_slice)
    16171    0.523        0    0.892        0  coo.py:201(_check)
    16173    0.394        0    0.89         0  compressed.py:101(check_format)
   210235    0.297        0    0.297        0  {numpy.core.multiarray.array}
    16173    0.287        0    0.331        0  compressed.py:631(prune)
    16171    0.197        0    1.529        0  compressed.py:534(tocoo)
    16173    0.165        0    1.263        0  compressed.py:20(__init__)
    16171    0.139        0    1.669        0  base.py:415(nonzero)
    16171    0.124        0    1.201        0  coo.py:111(__init__)
    32342    0.123        0    0.123        0  {method 'max' of 'numpy.ndarray' objects}
    48513    0.117        0    0.218        0  sputils.py:93(isintlike)
    32342    0.114        0    0.114        0  {method 'sum' of 'numpy.ndarray' objects}
    16171    0.106        0    3.081        0  csr.py:186(__getitem__)
    32342    0.105        0    0.105        0  {numpy.lib._compiled_base.bincount}
    32344    0.09         0    0.094        0  base.py:59(set_shape)
   210227    0.088        0    0.088        0  {isinstance}
    48513    0.081        0    1.777        0  fromnumeric.py:1129(nonzero)
    32342    0.078        0    0.078        0  {method 'min' of 'numpy.ndarray' objects}
    97032    0.066        0    0.153        0  numeric.py:167(asarray)

It seems that most of the time is spent in _get_row_slice. I'm not quite sure about the first row; it appears to cover the whole block I passed to cProfile.runctx, but I don't understand why there is such a big gap between the first row (tottime=60.27) and the second (tottime=1.362). Where does the difference go? Is there a way to check that in cProfile?
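For what it's worth, one way I could probably dig into that row is to dump the stats to a file and query them with pstats (a sketch; the file name is arbitrary):

import cProfile, pstats

cProfile.runctx("""...same block as above...""", globals(), locals(), filename="ig.prof")

stats = pstats.Stats("ig.prof")
stats.sort_stats("cumulative").print_stats(20)   # cumtime includes time spent in sub-calls
stats.print_callees("<module>")                  # where the top-level time flows to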

Basically, it looks like the problem is with sparse matrix operations (slicing, getting elements); the solution would probably be to compute Information Gain using matrix algebra (for example, chi2 is implemented that way in scikit-learn). But I have no idea how to express this calculation in terms of matrix operations... Does anyone have an idea?

+7
performance python scikit-learn machine-learning feature-selection
3 answers

I don't know if this still helps, since a year has passed. But I am now facing the same text classification task. I rewrote your code using the nonzero() function provided by the sparse matrix: I just scan nz, count the corresponding y values, and calculate the entropy.

The following code takes only seconds to run on the news20 dataset (loaded in libsvm sparse matrix format).

import numpy as np

def information_gain(X, y):

    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre + 1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)
    return np.asarray(information_gain)
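A rough usage sketch (the file name below is just a placeholder for the libsvm-format news20 file):

import numpy as np
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("news20.scale")    # placeholder path; sparse CSR X, numeric labels y
scores = information_gain(X, y)              # one IG score per feature (column of X)
top20 = np.argsort(scores)[::-1][:20]        # indices of the 20 highest-scoring features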
+6

Here is a version that uses matrix operations. The IG of a feature is its score averaged over the classes.

import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import check_array
from sklearn.utils.extmath import safe_sparse_dot

def ig(X, y):

    def get_t1(fc, c, f):
        t = np.log2(fc / (c * f))
        t[~np.isfinite(t)] = 0
        return np.multiply(fc, t)

    def get_t2(fc, c, f):
        t = np.log2((1 - f - c + fc) / ((1 - c) * (1 - f)))
        t[~np.isfinite(t)] = 0
        return np.multiply((1 - f - c + fc), t)

    def get_t3(c, f, class_count, observed, total):
        nfc = (class_count - observed) / total
        t = np.log2(nfc / (c * (1 - f)))
        t[~np.isfinite(t)] = 0
        return np.multiply(nfc, t)

    def get_t4(c, f, feature_count, observed, total):
        fnc = (feature_count - observed) / total
        t = np.log2(fnc / ((1 - c) * f))
        t[~np.isfinite(t)] = 0
        return np.multiply(fnc, t)

    X = check_array(X, accept_sparse='csr')
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    # counts
    observed = safe_sparse_dot(Y.T, X)          # n_classes * n_features
    total = observed.sum(axis=0).reshape(1, -1).sum()
    feature_count = X.sum(axis=0).reshape(1, -1)
    class_count = (X.sum(axis=1).reshape(1, -1) * Y).T

    # probs
    f = feature_count / feature_count.sum()
    c = class_count / float(class_count.sum())
    fc = observed / total

    # the feature score is averaged over classes
    scores = (get_t1(fc, c, f) + get_t2(fc, c, f)
              + get_t3(c, f, class_count, observed, total)
              + get_t4(c, f, feature_count, observed, total)).mean(axis=0)

    scores = np.asarray(scores).reshape(-1)

    return scores, []
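Usage mirrors chi2; a small sketch (the vectorizer settings are only an example), and since it returns (scores, []) like chi2 does, it should also plug into SelectKBest:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest

newsgroups = fetch_20newsgroups(subset='train')
X = CountVectorizer(binary=True).fit_transform(newsgroups.data)
y = newsgroups.target

scores, _ = ig(X, y)                                   # one averaged IG score per feature
X_top = SelectKBest(ig, k=1000).fit_transform(X, y)    # keep the 1000 best features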

On a dataset with 1000 instances and 1000 unique features, this implementation is >100x faster than the one without matrix operations.

0

This line, feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices], takes 90% of the time; try replacing it with a set operation, as sketched below.
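Something along these lines should remove the bottleneck (a sketch, not tested against the rest of the function):

import numpy as np

# Avoid `in` against a NumPy array (a linear scan per membership test).
feature_not_set_indices = np.setdiff1d(feature_range, feature_set_indices)

# Or keep the comprehension but test membership against a set:
feature_set = set(feature_set_indices.tolist())
feature_not_set_indices = [i for i in feature_range if i not in feature_set]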

0
