[scikit-learn]: anomaly detection - an alternative to OneClassSVM

I implemented LinearSVC and SVC from the sklearn framework for text classification. I use a TfidfVectorizer to get a sparse representation of the input data, which consists of two classes (benign data and malicious data). This part works very well, but now I want to add anomaly detection by training an OneClassSVM model on only one class (novelty/outlier detection). Unfortunately, OneClassSVM does not work with sparse data. Some developers are working on a patch (https://github.com/scikit-learn/scikit-learn/pull/1586), but it still has errors, so there is currently no way to use the OneClassSVM implementation with sparse input.
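For context, here is a minimal sketch of my setup (the corpus and parameter values are illustrative, not my actual code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Illustrative corpus: in the one-class setting I train on benign data only
benign_texts = ["GET /index.html", "GET /about.html", "POST /login"]

vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(benign_texts)  # scipy.sparse matrix

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
# This is the step that fails for sparse input; densifying with
# .toarray() is a workaround, but only feasible for small vocabularies:
clf.fit(X_sparse.toarray())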

Are there any other methods in the sklearn framework for this kind of anomaly detection? I have looked through the examples, but nothing seems to fit.

Thanks!

+7
python scikit-learn machine-learning svm
2 answers

A little late, but in case anyone else is looking for information on this: there is a third-party anomaly detection module for sklearn, based on least-squares methods: http://www.cit.mak.ac.ug/staff/jquinn/software/lsanomaly.html. It is intended as a drop-in replacement for OneClassSVM.
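A sketch of how it might be used, assuming the module follows the usual scikit-learn estimator interface as the project page suggests (the class name, defaults, and output format below are assumptions and may differ between versions):

import numpy as np
import lsanomaly  # third-party module, installed separately from sklearn

# Train on inlier data only, as with OneClassSVM (data is illustrative)
X_train = np.array([[1.1], [1.3], [1.2], [1.05]])
X_test = np.array([[1.15], [3.5]])

model = lsanomaly.LSAnomaly()
model.fit(X_train)
print(model.predict(X_test))  # should flag the second test point as anomalous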

+5

Unfortunately, scikit-learn currently implements only one-class SVM and robust covariance estimation (EllipticEnvelope) for outlier detection.

You can compare these methods (as shown in the scikit-learn documentation example) by looking at how they differ on 2D data:

import numpy as np
import pylab as pl
import matplotlib.font_manager
from scipy import stats

from sklearn import svm
from sklearn.covariance import EllipticEnvelope

# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]

# define two outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "robust covariance estimator": EllipticEnvelope(contamination=.1)}

# Compare the given classifiers under the given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = 0

# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation: two Gaussian inlier clusters
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add uniformly distributed outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # Fit each model and plot its decision function
    pl.figure(figsize=(10, 5))
    for j, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        clf.fit(X)
        y_pred = clf.decision_function(X).ravel()
        threshold = stats.scoreatpercentile(y_pred,
                                            100 * outliers_fraction)
        y_pred = y_pred > threshold
        n_errors = (y_pred != ground_truth).sum()
        # plot the level lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = pl.subplot(1, 2, j + 1)
        subplot.set_title("Outlier detection")
        subplot.contourf(xx, yy, Z,
                         levels=np.linspace(Z.min(), threshold, 7),
                         cmap=pl.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
        subplot.set_xlabel("%d. %s (errors: %d)" % (j + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
pl.show()
+1
