This is a continuation of my previous question. As Tim Peters suggested, using a Manager may not always be the best approach. Unfortunately, I have too much code to create an SSCCE. Instead, I will try to give a detailed explanation of my problem. Please feel free to browse the entire code base on GitHub, but it's a bit of a mess right now.
Background
I am doing research in natural language processing, and I would like to do (something like) dictionary-based smoothing to classify documents. The idea is to train a classifier to associate words and phrases with the correct answer. For example, documents containing the word socialist are most likely about politics, and those containing the phrase lava temperature are most likely about geology. The system learns by looking at a small number of pre-labelled examples. Because language is so varied, the classifier will never “know” all the possible phrases it may encounter in production.
This is where the dictionary comes in. Suppose we have a cheap and easy way to get synonyms for almost any phrase (I will cite myself here, poor taste though that is). When the poor classifier comes across a phrase it does not know, we can look the phrase up in that dictionary and tell the classifier "Look, you don’t know about communism, but it's kind of like socialist, and you know about that!". If the dictionary is reasonable, the classifier will generally perform better.
Pseudo code
    data = Load training and testing documents (300MB on disk)
    dictionary = Load dictionary (200MB - 2GB on disk) and place into a `dict` for fast look-ups
    Repeat 25 times:
        do_work(data, dictionary)

    def do_work(data, dictionary):
        X = Select a random sample of data
        Train a classifier on X
        Y = Select a random sample of data
        Using dictionary, classify all documents in Y
        Write results to disk
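To make the pseudo code above concrete, here is a minimal serial sketch in Python. It is an illustration only, not my real code: the helper names `train` and `classify`, the word-voting "classifier", the `sample_size` parameter and the tiny in-memory data set are all assumptions made up for this example, and the "write results to disk" step is left out.

```python
import random
from collections import Counter


def train(documents):
    """Toy 'classifier' (hypothetical): remember the first label seen for each word."""
    known = {}
    for words, label in documents:
        for word in words:
            known.setdefault(word, label)
    return known


def classify(words, known, dictionary):
    """Vote over known words; for unknown words, fall back to dictionary synonyms."""
    votes = []
    for word in words:
        if word in known:
            votes.append(known[word])
            continue
        # Unknown word: use the first synonym the classifier does know about.
        for synonym in dictionary.get(word, ()):
            if synonym in known:
                votes.append(known[synonym])
                break
    return Counter(votes).most_common(1)[0][0] if votes else None


def do_work(data, dictionary, sample_size, rng):
    x = rng.sample(data, sample_size)   # X = random sample for training
    known = train(x)
    y = rng.sample(data, sample_size)   # Y = random sample for testing
    # Only reads from `dictionary`; the real code would also write results to disk.
    return [classify(words, known, dictionary) for words, _ in y]


if __name__ == "__main__":
    # Made-up stand-ins for the 300MB of documents and the 200MB-2GB dictionary.
    data = [
        (["socialist", "vote"], "politics"),
        (["lava", "temperature"], "geology"),
    ]
    dictionary = {"communism": ["socialist"]}
    rng = random.Random(0)
    for _ in range(25):
        results = do_work(data, dictionary, sample_size=2, rng=rng)
```

The point of the sketch is that `do_work` only ever reads `data` and `dictionary`, so the 25 repetitions are independent of one another.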
Problem
- . Python 2.7 multiprocessing.Pool ( joblib.Parallel, , ). . , - , , , , .
. , Y, , . - .
, ( ) . data dictionary . I've multiprocessing.managers.BaseManager, , .
? , , :
- MongoDB/CouchDB/memcached , I'm . zeromq , .
sqlite , . , -, .
SO + , , , dict, fork() copy-on-write, .
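For reference, this is roughly what I understand by the dict plus fork() copy-on-write idea: load the dictionary into a module-level global in the parent before the pool is created, so forked workers inherit it instead of receiving a pickled copy. This is a sketch under assumptions, not something I have validated at scale. The names `DICTIONARY`, `lookup` and `parallel_lookup` are made up for the example, and `multiprocessing.get_context("fork")` needs Python 3.4+ on a POSIX system (on Python 2.7 under Linux, `multiprocessing.Pool` forks anyway).

```python
import multiprocessing as mp

# Module-level global (hypothetical name); set by the parent before forking.
DICTIONARY = None


def lookup(phrase):
    # Runs in a worker. Reads the dictionary inherited from the parent at fork
    # time; the dictionary itself is never pickled or sent over a pipe.
    return DICTIONARY.get(phrase, [])


def parallel_lookup(dictionary, phrases, workers=2):
    global DICTIONARY
    DICTIONARY = dictionary          # must happen BEFORE the pool is created
    ctx = mp.get_context("fork")     # force the fork start method (POSIX only)
    pool = ctx.Pool(workers)
    try:
        # Only the small phrases and results cross the process boundary.
        return pool.map(lookup, phrases)
    finally:
        pool.close()
        pool.join()
```

Calling `parallel_lookup({"communism": ["socialist"]}, ["communism", "lava"])` returns one synonym list per phrase, in input order. One caveat I am aware of: CPython updates reference counts inside the objects it touches, so pages holding the dictionary can still end up being copied in the workers even though the code only reads from it.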