Decision Boundary Plot for High-Dimensional Data

I am building a model for a binary classification problem where each of my data points has 300 dimensions (300 features). I am using PassiveAggressiveClassifier from sklearn. The model works very well.

I want to plot the decision boundary of the model. How can I do this?

To get an idea of the data, I plot it in 2D using t-SNE. I reduced the dimensionality of the data in two stages: from 300 to 50 with TruncatedSVD, then from 50 to 2 with t-SNE (this is the generally recommended approach). Below is the code snippet for it:

    from sklearn.manifold import TSNE
    from sklearn.decomposition import TruncatedSVD

    X_Train_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_train)
    X_Train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_Train_reduced)

    # convert lists of lists to two dataframes (df_train_neg, df_train_pos) depending on the label,
    # then plot the negative points and positive points
    scatter(df_train_neg.val1, df_train_neg.val2, marker='o', c='red')
    scatter(df_train_pos.val1, df_train_pos.val2, marker='x', c='green')
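For completeness, here is one hedged sketch of the elided split-and-plot step, using NumPy boolean masks in place of the two dataframes. The names X_Train_embedded and y_train are assumptions standing in for the question's t-SNE output and binary labels; random data is used so the sketch is self-contained:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-ins for the quantities assumed from the question:
# X_Train_embedded -- the (n, 2) t-SNE embedding, y_train -- the 0/1 labels
rng = np.random.default_rng(0)
X_Train_embedded = rng.normal(size=(100, 2))
y_train = rng.integers(0, 2, size=100)

# Boolean masks split the embedding by label, skipping the
# intermediate lists-of-lists -> dataframe conversion entirely
neg = X_Train_embedded[y_train == 0]
pos = X_Train_embedded[y_train == 1]

plt.scatter(neg[:, 0], neg[:, 1], marker='o', c='red', label='negative')
plt.scatter(pos[:, 0], pos[:, 1], marker='x', c='green', label='positive')
plt.legend()
```

Whether you use masks or dataframes is a matter of taste; the masks just keep everything as NumPy arrays.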

Data plot

I get a decent plot.

Is there a way to add a decision boundary to this plot that represents the actual decision boundary of my model in the 300-dimensional space?

1 answer

One way is to apply a Voronoi tessellation to your 2D plot, i.e. color the background based on proximity to the 2D data points (a different color for each predicted class label). See the recent paper by Migut et al., 2015.

This is much simpler than it sounds using meshgrid and scikit-learn's KNeighborsClassifier (here is an end-to-end example with the Iris dataset; replace the first few lines with your model and data):

    import numpy as np, matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import load_iris
    from sklearn.manifold import TSNE
    from sklearn.linear_model import LogisticRegression

    # replace the lines below with your data and model
    iris = load_iris()
    X, y = iris.data, iris.target
    X_Train_embedded = TSNE(n_components=2).fit_transform(X)
    print(X_Train_embedded.shape)
    model = LogisticRegression().fit(X, y)
    y_predicted = model.predict(X)
    # replace the lines above with your data and model

    # create meshgrid
    resolution = 100  # 100x100 background pixels
    X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:, 0]), np.max(X_Train_embedded[:, 0])
    X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:, 1]), np.max(X_Train_embedded[:, 1])
    xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution),
                         np.linspace(X2d_ymin, X2d_ymax, resolution))

    # approximate Voronoi tessellation on a resolution x resolution grid using 1-NN
    background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted)
    voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
    voronoiBackground = voronoiBackground.reshape((resolution, resolution))

    # plot
    plt.contourf(xx, yy, voronoiBackground)
    plt.scatter(X_Train_embedded[:, 0], X_Train_embedded[:, 1], c=y)
    plt.show()

Note that instead of plotting your model's decision boundary exactly, this just gives you a rough estimate of where the boundary should lie (especially in regions with few data points, the true boundary can deviate from it). It will draw a line between two data points belonging to different classes, but will place it in the middle (a decision boundary is indeed guaranteed to lie between those points in this case, but it does not have to be exactly in the middle).
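The midpoint behaviour is easy to verify with a minimal 1-D sketch (hypothetical toy data, not from the question): with two oppositely labelled training points, 1-NN flips its prediction halfway between them, regardless of where the model's true boundary actually sits:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two 1-D training points with different labels
X = np.array([[0.0], [1.0]])
y = np.array([0, 1])
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Probe a fine grid between the two points
grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
pred = knn.predict(grid)
# the predicted label switches from 0 to 1 around x = 0.5, the midpoint,
# even though the "true" boundary could lie anywhere between the two points
```

This is exactly what happens on each edge of the Voronoi background above, just in 2D.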

There are also some experimental approaches to better approximate the true decision boundary, e.g. this one on GitHub.
