I am trying to use the sklearn 0.11 LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.
When I try to fit the classifier, pythonw.exe gives me an application error:
"The instruction at ... referenced memory at 0x00000000. The memory could not be written."
The features are extremely sparse, around 10 per observation, and are binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't seem to be the case. The fits only succeed when I use fewer observations and/or fewer features.
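For reference, here is a rough sketch of that back-of-the-envelope calculation. The input matrix really is tiny; my guess (and it is only a guess) is that the memory goes into dense per-class coefficient arrays during one-vs-rest training:

```python
# Rough memory estimate for the sparse input matrix in CSR format:
# each stored entry costs ~8 bytes (float64 value) + 4 bytes (int32 column index).
n_obs, n_features, n_classes = 200000, 80000, 800
nnz = n_obs * 10  # ~10 active features per observation
x_bytes = nnz * (8 + 4)  # ignoring the row-pointer array, which is tiny
print(x_bytes / 1e6)  # ~24 MB: the input data itself is small

# A dense coefficient matrix for one-vs-rest over 800 classes is the
# likely culprit (assumption, not verified): 800 * 80,000 float64 weights,
# before counting liblinear's own internal working copies.
coef_bytes = n_classes * n_features * 8
print(coef_bytes / 1e6)  # ~512 MB
```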
In any case, I would like to use even more observations and features. My naive understanding is that the liblinear library working behind the scenes is capable of supporting that. Any ideas on how I could squeeze in some more observations?
My code is as follows:
    y_vectorizer = LabelVectorizer(y)  # my custom vectorizer for labels
    y = y_vectorizer.fit_transform(y)
    x_vectorizer = CountVectorizer(binary=True, analyzer=features)
    x = x_vectorizer.fit_transform(x)
    clf = LogisticRegression()
    clf.fit(x, y)
The features() function passed to the analyzer simply returns a list of strings indicating the features found in each observation.
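For context, the analyzer is just a callable that maps one raw document to its list of feature strings. A minimal sketch of what mine looks like (the whitespace tokenization here is a hypothetical stand-in; my real function is more involved):

```python
def features(description):
    # Illustrative stand-in for my real analyzer: return the list of
    # feature strings found in one short text description.
    return description.lower().split()

# CountVectorizer(binary=True, analyzer=features) calls this on every
# document and builds the sparse 0/1 matrix from the returned tokens.
print(features("Stainless Steel Bolt"))  # ['stainless', 'steel', 'bolt']
```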
I am using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.
Alexander Measure