I have a set of 2,000 trained regression trees (from scikit-learn's RandomForestRegressor with n_estimators=1). Training the trees in parallel (50 cores) on a large data set (~100,000 x 700,000, i.e. 70 GB at 8 bit) using multiprocessing and shared memory works like a charm. Note that I do not use the built-in multi-core support of RF, since I do feature selection in advance.
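For reference, the training setup follows the usual fork/copy-on-write pattern for sharing a large read-only array with a Pool on Linux. A minimal sketch with placeholder names and toy sizes (train_one_tree, X_train, y_train are illustrative, not my exact code):

    import numpy as np
    from multiprocessing import Pool
    from sklearn.ensemble import RandomForestRegressor

    #module-level globals: on Linux, forked workers inherit these pages
    #copy-on-write, so the 8-bit matrix is never duplicated as long as
    #nobody writes to it
    X_train = np.zeros((1000, 5000), dtype=np.uint8)  #stand-in for the real 70 GB matrix
    y_train = np.random.rand(1000)

    def train_one_tree(feat_idx):
        #each worker fits a single-tree "forest" on its own feature subset
        rf = RandomForestRegressor(n_estimators=1, bootstrap=False)
        rf.fit(X_train[:, feat_idx], y_train)
        return rf

    if __name__ == '__main__':
        subsets = [np.random.choice(5000, 700, replace=False) for _ in range(16)]
        p = Pool(4)
        models = p.map(train_one_tree, subsets)
        p.close()
        p.join()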
Problem: when predicting on the large test matrix (~20,000 x 700,000) in parallel, I always run out of memory (I have access to a server with 500 GB of RAM).
My strategy is to keep the test matrix in memory and share it among all processes. According to a statement by one of the developers, the memory requirement for testing is 2 * n_jobs * sizeof(X), and in my case another factor of 4 comes in, because the 8-bit matrix entries are upcast to float32 inside RF.
In numbers, I think testing should need:
14 GB to hold the test matrix in memory (20,000 x 700,000 bytes) + 50 (= n_jobs) * 20,000 (= n_samples) * 700 (= n_features per tree) * 4 (upcast to float32) * 2 bytes = 14 GB + 5.6 GB = ~21 GB of memory.
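The same estimate as a quick back-of-the-envelope check in Python (sizes as above; the factor 2 is from the developers' formula, the factor 4 from the float32 upcast):

    n_test, n_feat_total = 20000, 700000
    n_feat_tree, n_jobs = 700, 50

    shared = n_test * n_feat_total * 1                #uint8 test matrix in shared memory
    per_job = n_jobs * n_test * n_feat_tree * 4 * 2   #float32 copies across all workers
    print shared / 1e9, per_job / 1e9                 #14.0 GB and 5.6 GB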
But it always blows up to several hundred GB. What am I missing here? (I am on the latest version of scikit-learn, so the old memory problems should be fixed.)
Observation:
When using only one core, memory usage for testing ranges from 30 to 100 GB (as measured by free).
My code is:
    import itertools
    import math
    import numpy as np
    from multiprocessing import Pool

    import settings  #local module defining the missing-value encoding nan_enc

    #----------------
    #helper functions
    def initializeRFtest(*args):
        #initialize test data and test labels as globals in shared memory
        global df_test, pt_test
        df_test, pt_test = args

    def star_testTree(model_featidx):
        return predTree(*model_featidx)
    #end of helper functions
    #-------------------

    def RFtest(models, df_test, pt_test, features_idx, no_trees):
        #test trees in parallel
        ncores = 50
        p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
        args = itertools.izip(models, features_idx)
        out_list = p.map(star_testTree, args)
        p.close()
        p.join()
        return out_list

    def predTree(model, feat_idx):
        #get all indices of samples that meet the feature subset requirement
        nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
        all_rows = np.arange(df_test.shape[0])
        #discard rows with missing values in the given features
        rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]

        #predict
        pred = model.predict(df_test.iloc[rows, feat_idx])
        return pred

    #main program
    out = RFtest(models, df_test, pt_test, features_idx, no_trees)
Edit: another observation: when the test data is split into chunks, the program runs smoothly with greatly reduced memory usage. This is what I now use to run the program.
Code snippet for the updated predTree function:
    def predTree(model, feat_idx):
        #get all indices of samples that meet the feature subset requirement
        nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
        all_rows = np.arange(df_test.shape[0])
        #discard rows with missing values in the given features
        rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]

        #predict in chunks of at most 500 valid samples
        chunksize = 500
        n_chunks = int(math.ceil(float(rows.shape[0]) / chunksize))
        pred = []
        for i in range(n_chunks):
            if i == n_chunks - 1:
                #last chunk: take all remaining rows
                pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:], feat_idx])
            else:
                pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:(i+1)*chunksize], feat_idx])
            print pred_chunked.shape
            pred.append(pred_chunked)
        pred = np.concatenate(pred)

        #populate output vector: NaN for skipped rows, predictions elsewhere
        predicted = np.empty(df_test.shape[0])
        predicted.fill(np.nan)
        predicted[rows] = pred
        return predicted
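As a footnote, the explicit chunk-index arithmetic above could also be replaced by slicing rows in steps of chunksize. A sketch of a drop-in replacement for the loop (same chunks of at most 500 valid samples, minus the debug print):

    #same chunking via stepped slices; rows[i:i + chunksize] is simply
    #shorter on the final step, so no special case is needed
    pred = np.concatenate([model.predict(df_test.iloc[rows[i:i + chunksize], feat_idx])
                           for i in range(0, rows.shape[0], chunksize)])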