Dimensionality reduction with t-SNE

I have two datasets, a train set and a test set. They have 30,213 and 30,235 elements, respectively, each with 66 features.

I am trying to apply scikit-learn's t-SNE to reduce the dimensionality to 2. Since the datasets are large and I get a MemoryError if I try to process all the data in one shot, I try to break them into chunks and transform one chunk at a time, like this:

import numpy as np
from sklearn import manifold

tsne = manifold.TSNE(n_components=2, perplexity=30, init='pca', random_state=0)

# Pre-allocate the 2-D outputs for both sets
X_tsne_train = np.zeros((X_train.shape[0], 2))
X_tsne_test = np.zeros((X_test.shape[0], 2))

d = ((X_train, X_tsne_train), (X_test, X_tsne_test))
chunk = 5000

for x, x_tsne in d:
    pstart, pend = 0, 0
    while pend < x.shape[0]:
        if pend + chunk < x.shape[0]:
            pend = pstart + chunk
        else:
            pend = x.shape[0]
        print('pstart = %d, pend = %d' % (pstart, pend))
        x_part = x[pstart:pend]
        # fit and embed the current chunk only
        x_tsne[pstart:pend] += tsne.fit_transform(x_part)
        pstart = pend

It works without a MemoryError, but I found that different runs of the script produce different outputs for the same data items. This may be because the fit and transform operations are performed together on each chunk separately. But if I try to fit on the training data with tsne.fit(X_train), I get a MemoryError. How can I properly reduce all elements of the train and test sets to 2 dimensions without any inconsistencies between the chunks?

python scikit-learn
1 answer

I'm not quite sure what you mean by "different outputs for the same data items", but here are some comments that may help you.

First, t-SNE is not really a "dimensionality reduction" method in the same sense as PCA or other methods. It is not possible to take a fitted t-SNE model and apply it to new data. (Note that the TSNE class has no transform() method, only fit() and fit_transform().) So you cannot fit on a "train" set and then transform a "test" set.
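To make the API difference concrete, here is a minimal sketch (with small random placeholder arrays, not your data) contrasting PCA, which can transform new data after fitting, with TSNE, which cannot:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.rand(100, 66)   # placeholder data, not your real sets
X_test = rng.rand(50, 66)

# PCA learns a projection that can be reused on unseen data
pca = PCA(n_components=2).fit(X_train)
X_test_2d = pca.transform(X_test)            # works

# TSNE only embeds the data it is fitted on; there is no transform()
tsne = TSNE(n_components=2, random_state=0)
X_train_2d = tsne.fit_transform(X_train)     # works
# tsne.transform(X_test)                     # AttributeError: no such method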

Second, every time you call fit_transform() you get a completely different model, so the values of your reduced dimensions are not comparable from chunk to chunk. Each chunk is embedded into its own low-dimensional space; the model is different every time, so the data are not projected into the same space.
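As a quick illustration (again with synthetic data), embedding the same points on their own and together with other points gives different coordinates, because every fit_transform() call learns a fresh embedding:

from sklearn.manifold import TSNE
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(200, 66)
B = rng.rand(200, 66)

emb_a = TSNE(n_components=2, random_state=0).fit_transform(A)
emb_ab = TSNE(n_components=2, random_state=0).fit_transform(np.vstack([A, B]))

# The coordinates of A's points differ between the two runs, so chunk-wise
# embeddings cannot be stitched together into one consistent space.
print(np.allclose(emb_a, emb_ab[:200]))  # False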

Third, you did not include the code where you split your data into "train" and "test". It is possible that, although you set a random seed for t-SNE, you do not set a random seed for your train/test split, which leads to different splits and therefore to different results on subsequent runs.
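For example, if the split is made with something like train_test_split (a guess on my part, since that code is not shown), fixing its random_state makes the split, and everything downstream of it, reproducible:

from sklearn.model_selection import train_test_split

# Hypothetical split of your full data matrix X; without random_state the
# split changes on every run, and so do the resulting embeddings.
X_train, X_test = train_test_split(X, test_size=0.5, random_state=42)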

Finally, if you want to use t-SNE to visualize your data, consider following the recommendation on the documentation page and applying PCA first to reduce the input dimensionality from 66 to, say, 15. This will significantly reduce the memory footprint of t-SNE.
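Here is a sketch of that approach, fitting a single model on the stacked train and test data so that every point ends up in the same 2-D space (assuming the combined set fits in memory after the PCA step; X_train and X_test are your original arrays):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

# Stack train and test so one t-SNE model embeds all points together
X_all = np.vstack([X_train, X_test])                  # shape (~60000, 66)
X_reduced = PCA(n_components=15).fit_transform(X_all)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)

# Split the joint embedding back into train and test parts
X_tsne_train = X_2d[:X_train.shape[0]]
X_tsne_test = X_2d[X_train.shape[0]:]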

TSNE in the scikit-learn documentation

