I need to calculate Information Gain scores for >100k features across >10k documents for text classification. The code below works, but on the full dataset it is very slow: it takes more than an hour on a laptop. The dataset is 20newsgroups and I use scikit-learn; the chi2 function provided in scikit-learn is very fast.
Any idea how to calculate information gain faster for a dataset like this?
    import numpy as np

    def information_gain(x, y):

        def _entropy(values):
            counts = np.bincount(values)
            probs = counts[np.nonzero(counts)] / float(len(values))
            return - np.sum(probs * np.log(probs))

        def _information_gain(feature, y):
            feature_set_indices = np.nonzero(feature)[1]
            feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
            entropy_x_set = _entropy(y[feature_set_indices])
            entropy_x_not_set = _entropy(y[feature_not_set_indices])
            return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                     + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

        feature_size = x.shape[0]
        feature_range = range(0, feature_size)
        entropy_before = _entropy(y)
        information_gain_scores = []
        for feature in x.T:
            information_gain_scores.append(_information_gain(feature, y))
        return information_gain_scores, []
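For context, the function is called on the usual 20newsgroups bag-of-words pipeline, roughly like the sketch below (the category list and vectorizer settings are just placeholders, not my actual configuration):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    # Placeholder subset; the real run uses all categories and >10k documents
    newsgroups = fetch_20newsgroups(subset='train',
                                    categories=['sci.space', 'rec.autos'])
    vectorizer = CountVectorizer()
    x = vectorizer.fit_transform(newsgroups.data)   # sparse docs x features matrix
    y = newsgroups.target                           # integer class labels

    scores, _ = information_gain(x, y)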
EDIT:
I inlined the internal functions and ran cProfile as shown below (on a dataset limited to ~15k features and ~1k documents):
    cProfile.runctx(
        """for feature in x.T:
        feature_set_indices = np.nonzero(feature)[1]
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

        values = y[feature_set_indices]
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        entropy_x_set = - np.sum(probs * np.log(probs))

        values = y[feature_not_set_indices]
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        entropy_x_not_set = - np.sum(probs * np.log(probs))

        result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                   + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
        information_gain_scores.append(result)""",
        globals(), locals())
Top 20 results sorted by tottime:
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1  60.27    60.27    65.48    65.48    <string>:1(<module>)
     16171  1.362    0        2.801    0        csr.py:313(_get_row_slice)
     16171  0.523    0        0.892    0        coo.py:201(_check)
     16173  0.394    0        0.89     0        compressed.py:101(check_format)
    210235  0.297    0        0.297    0        {numpy.core.multiarray.array}
     16173  0.287    0        0.331    0        compressed.py:631(prune)
     16171  0.197    0        1.529    0        compressed.py:534(tocoo)
     16173  0.165    0        1.263    0        compressed.py:20(__init__)
     16171  0.139    0        1.669    0        base.py:415(nonzero)
     16171  0.124    0        1.201    0        coo.py:111(__init__)
     32342  0.123    0        0.123    0        {method 'max' of 'numpy.ndarray' objects}
     48513  0.117    0        0.218    0        sputils.py:93(isintlike)
     32342  0.114    0        0.114    0        {method 'sum' of 'numpy.ndarray' objects}
     16171  0.106    0        3.081    0        csr.py:186(__getitem__)
     32342  0.105    0        0.105    0        {numpy.lib._compiled_base.bincount}
     32344  0.09     0        0.094    0        base.py:59(set_shape)
    210227  0.088    0        0.088    0        {isinstance}
     48513  0.081    0        1.777    0        fromnumeric.py:1129(nonzero)
     32342  0.078    0        0.078    0        {method 'min' of 'numpy.ndarray' objects}
     97032  0.066    0        0.153    0        numeric.py:167(asarray)
It looks like most of the time is spent in _get_row_slice. I'm not entirely sure about the first row: it seems to cover the whole block I passed to cProfile.runctx, but I don't understand why there is such a big gap between its tottime=60.27 and the second row's tottime=1.362. Where does that difference go? Is there a way to dig into it with cProfile?
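For reference, the same run can also be dumped to a stats file and explored with pstats, e.g. sorted by cumulative time or via caller/callee views. This is a generic cProfile/pstats sketch, where profiled_code stands in for the code string above:

    import cProfile
    import pstats

    # Write raw stats to a file instead of printing them immediately
    cProfile.runctx(profiled_code, globals(), locals(), filename='ig.prof')

    stats = pstats.Stats('ig.prof')
    stats.sort_stats('cumulative').print_stats(20)   # attribute time to callers via cumtime
    stats.print_callees('_get_row_slice')            # what _get_row_slice spends its time in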
Basically, it looks like the problem is the sparse matrix operations (slicing, indexing individual elements). The solution would probably be to compute information gain with matrix algebra, the way chi2 is implemented in scikit-learn, but I'm not sure how to express this calculation correctly in terms of matrix operations; the unverified sketch below is the general direction I have in mind. Does anyone have an idea?
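Rough, untested sketch of what I mean (the helper name and the binarization step are my own guesses): the per-class document counts for "feature set" and "feature not set" could come from a single sparse product against a one-hot label matrix, with the entropies then computed column-wise instead of per feature.

    import numpy as np
    from scipy.sparse import csr_matrix

    def information_gain_vectorized(x, y):
        """Unverified sketch. x is the (n_docs, n_features) sparse term matrix,
        y the integer class labels. A feature counts as 'set' in a document if
        its entry is non-zero, matching the loop version above."""
        x = csr_matrix(x)
        n_docs, n_features = x.shape

        # One-hot encode the labels: (n_docs, n_classes)
        classes, y_idx = np.unique(y, return_inverse=True)
        Y = np.zeros((n_docs, len(classes)))
        Y[np.arange(n_docs), y_idx] = 1.0

        # Binarize the term matrix so repeated terms do not inflate the counts
        xb = x.copy()
        xb.data = np.ones_like(xb.data)

        # Per-feature class counts among documents where the feature is set / not set
        counts_set = np.asarray(xb.T.dot(Y))           # (n_features, n_classes)
        counts_not_set = Y.sum(axis=0) - counts_set    # broadcast over features

        def entropy_rows(counts):
            totals = counts.sum(axis=1, keepdims=True)
            # Zero-count cells contribute zero entropy; suppress 0/0 and log(0) warnings
            with np.errstate(divide='ignore', invalid='ignore'):
                probs = np.where(totals > 0, counts / totals, 0.0)
                terms = np.where(probs > 0, probs * np.log(probs), 0.0)
            return -terms.sum(axis=1), totals.ravel()

        entropy_set, n_set = entropy_rows(counts_set)
        entropy_not_set, n_not_set = entropy_rows(counts_not_set)
        entropy_before, _ = entropy_rows(Y.sum(axis=0, keepdims=True))

        return (entropy_before
                - (n_set / float(n_docs)) * entropy_set
                - (n_not_set / float(n_docs)) * entropy_not_set)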
performance python scikit-learn machine-learning feature-selection
p_b_garcia