How should I train a machine learning algorithm on data with a large class imbalance?

I am trying to train an SVM on click and conversion data from people who see banner ads. The main problem is that clicks account for only about 0.2% of all the data, so there is a big class imbalance. When I use a plain SVM, at testing time it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% correct answers (because of the imbalance), but it gives 0% correct predictions for the "click" and "conversion" classes. How can I tune the SVM (or choose another algorithm) to take the imbalance into account?

scikit-learn supervised-learning machine-learning svm

2 answers

The most basic approach here is to use a so-called "class weighting scheme". In the classic SVM formulation there is a parameter C that controls the misclassification penalty. It can be split into parameters C1 and C2, used for classes 1 and 2 respectively. For a given C, the most common choice of C1 and C2 is to set

    C1 = C / n1
    C2 = C / n2

where n1 and n2 are the sizes of classes 1 and 2 respectively. This way you "punish" the SVM much more heavily for misclassifying the less frequent class than for misclassifying the common one.
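As an illustration, here is a minimal sketch of this choice in scikit-learn (the toy labels below are made up for the example; SVC sets the effective C of class k to class_weight[k] * C, so a weight of 1/n_k reproduces C_k = C/n_k):

    import numpy as np
    from sklearn import svm

    # toy labels with a small minority class, just for illustration
    rng = np.random.RandomState(0)
    y = np.array([0] * 998 + [1] * 2)
    X = rng.randn(len(y), 2)

    classes, counts = np.unique(y, return_counts=True)

    # weight_k = 1 / n_k gives each class an effective penalty C_k = C / n_k
    weights = {int(k): 1.0 / n for k, n in zip(classes, counts)}

    clf = svm.SVC(kernel='linear', C=1.0, class_weight=weights)
    clf.fit(X, y)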

Many existing libraries (such as libSVM) support this mechanism through class weight parameters.

An example using Python and sklearn:

    import numpy as np
    import pylab as pl
    from sklearn import svm

    # create two clusters of random points:
    # 1000 samples in class 0 and 100 samples in class 1
    rng = np.random.RandomState(0)
    n_samples_1 = 1000
    n_samples_2 = 100
    X = np.r_[1.5 * rng.randn(n_samples_1, 2),
              0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
    y = [0] * n_samples_1 + [1] * n_samples_2

    # fit the model and get the separating hyperplane
    clf = svm.SVC(kernel='linear', C=1.0)
    clf.fit(X, y)

    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(-5, 5)
    yy = a * xx - clf.intercept_[0] / w[1]

    # get the separating hyperplane using weighted classes
    wclf = svm.SVC(kernel='linear', class_weight={1: 10})
    wclf.fit(X, y)

    ww = wclf.coef_[0]
    wa = -ww[0] / ww[1]
    wyy = wa * xx - wclf.intercept_[0] / ww[1]

    # plot both separating hyperplanes and the samples
    h0 = pl.plot(xx, yy, 'k-', label='no weights')
    h1 = pl.plot(xx, wyy, 'k--', label='with weights')
    pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
    pl.legend()
    pl.axis('tight')
    pl.show()

In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='auto' (renamed to class_weight='balanced' in newer versions).
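For instance, a minimal sketch of the automatic setting, reusing X and y from the example above (with 'balanced', scikit-learn weights each class by n_samples / (n_classes * n_k), i.e. inversely proportional to its frequency):

    # automatic reweighting, inversely proportional to class frequencies
    wclf = svm.SVC(kernel='linear', class_weight='balanced')
    wclf.fit(X, y)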

[Figure: visualization of the above code, from the sklearn documentation]


This paper surveys many methods for dealing with imbalanced data. One simple approach (but a very poor one for SVM) is random oversampling: copy the minority class examples until you reach balance:

http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf
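For concreteness, here is a minimal numpy sketch of that oversampling idea (the arrays are toy data standing in for the banner data from the question):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 2)
    y = np.array([0] * 998 + [1] * 2)  # toy 0.2% minority class

    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]

    # duplicate minority samples (with replacement) until class sizes match
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    X_balanced, y_balanced = X[idx], y[idx]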

