Scikit-Learn Logistic Regression Memory Error

I am trying to use the sklearn 0.11 LogisticRegression object to fit a model on about 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.

When I try to fit the classifier, pythonw.exe gives me:

Application Error: "The instruction at ... referenced memory at 0x00000000. The memory could not be written."

The features are extremely sparse, about 10 per observation, and are binary (1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't seem to be the case. The models only fit when I use fewer observations and/or fewer features.
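Roughly, my back-of-the-envelope numbers looked like this (a sketch assuming a CSR layout with 8-byte values and 4-byte indices; the exact dtypes may differ):

 # hypothetical estimate of the sparse input size for my data
 n_obs = 200000        # observations
 nnz_per_obs = 10      # non-zero binary features per observation
 nnz = n_obs * nnz_per_obs
 # CSR keeps one value and one column index per non-zero, plus one row pointer per row
 size_mb = (nnz * (8 + 4) + (n_obs + 1) * 4) / (1024.0 ** 2)
 print(size_mb)        # ~24 MB, which should fit comfortably in 4 GB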

In any case, I would like to use even more observations and features. My naive understanding is that the liblinear library working behind the scenes is able to support this. Any ideas on how I could squeeze in a few more observations?

My code is as follows:

 y_vectorizer = LabelVectorizer(y)  # my custom vectorizer for labels
 y = y_vectorizer.fit_transform(y)

 x_vectorizer = CountVectorizer(binary=True, analyzer=features)
 x = x_vectorizer.fit_transform(x)

 clf = LogisticRegression()
 clf.fit(x, y)

The features() function passed as the analyzer simply returns a list of strings indicating the features found in each case.
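For illustration, a minimal analyzer of the kind I mean could look like this (hypothetical; my real features() is more involved):

 def features(description):
     # hypothetical sketch: return the list of feature strings found in one case,
     # e.g. the lowercase whitespace-separated tokens of the short text description
     return description.lower().split()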

I am using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.

1 answer

liblinear (the library backing sklearn.linear_model.LogisticRegression ) will hold its own copy of the data, because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy, such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix .

In your case, I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it already uses the scipy.sparse.csr_matrix memory layout.

Expect it to allocate a dense model taking 800 * (80000 + 1) * 8 / (1024 ** 2) = 488 MB of memory (in addition to the size of your input dataset).
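A minimal sketch of that approach, reusing the x and y from your question (it assumes y is a plain 1-D array of class labels after your custom label vectorizer):

 from sklearn.linear_model import SGDClassifier

 x_csr = x.tocsr()                # ensure the CSR memory layout so SGDClassifier does not copy the data
 clf = SGDClassifier(loss='log')  # log loss gives a logistic regression model and enables predict_proba
 clf.fit(x_csr, y)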

Edit: how to optimize memory access for your dataset

To free the memory after feature extraction, you can:

 x_vectorizer = CountVectorizer(binary=True, analyzer=features)
 x = x_vectorizer.fit_transform(x)

 from sklearn.externals import joblib
 joblib.dump(x.tocsr(), 'dataset.joblib')

Then exit this python process (to completely free the memory) and, in a new process:

 x_csr = joblib.load('dataset.joblib') 

On Linux / OS X, you could memory map that data even more efficiently with:

 x_csr = joblib.load('dataset.joblib', mmap_mode='c') 
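Putting it together in the new process (a sketch; it assumes you also dumped the labels to a hypothetical 'labels.joblib' file in the same way):

 from sklearn.externals import joblib
 from sklearn.linear_model import SGDClassifier

 x_csr = joblib.load('dataset.joblib', mmap_mode='c')  # copy-on-write memory map of the CSR arrays
 y = joblib.load('labels.joblib')                      # hypothetical dump of the label vector
 clf = SGDClassifier(loss='log')
 clf.fit(x_csr, y)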