I have a set of 2,000 trained regression trees (from scikit-learn's RandomForestRegressor with n_estimators=1). Training the trees in parallel (50 cores) on a large data set (~100,000 x 700,000, i.e. 70 GB at 8 bit) using multiprocessing and shared memory works like a charm. Note that I do not use the built-in multi-core support of RF, since I do feature selection in advance.
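For reference, the training setup follows the usual fork/copy-on-write pattern for sharing a large read-only array with a Pool on Linux. A minimal sketch with placeholder names and toy sizes (train_one_tree, X_train, y_train are illustrative, not my exact code):

    import numpy as np
    from multiprocessing import Pool
    from sklearn.ensemble import RandomForestRegressor

    #module-level globals: on Linux, forked workers inherit these pages
    #copy-on-write, so the 8-bit matrix is never duplicated as long as
    #nobody writes to it
    X_train = np.zeros((1000, 5000), dtype=np.uint8)  #stand-in for the real 70 GB matrix
    y_train = np.random.rand(1000)

    def train_one_tree(feat_idx):
        #each worker fits a single-tree "forest" on its own feature subset
        rf = RandomForestRegressor(n_estimators=1, bootstrap=False)
        rf.fit(X_train[:, feat_idx], y_train)
        return rf

    if __name__ == '__main__':
        subsets = [np.random.choice(5000, 700, replace=False) for _ in range(16)]
        p = Pool(4)
        models = p.map(train_one_tree, subsets)
        p.close()
        p.join()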
Problem: when predicting on the large test matrix (~20,000 x 700,000) in parallel, I always run out of memory (I have access to a server with 500 GB of RAM).
My strategy is to keep the test matrix in memory and share it among all processes. According to a statement by one of the developers, the memory requirement for testing is 2 * n_jobs * sizeof(X), and in my case another factor of 4 comes in, because the 8-bit matrix entries are upcast to float32 inside RF.
In numbers, I think testing should need:
14 GB to hold the test matrix in memory (20,000 x 700,000 bytes) + 50 (= n_jobs) * 20,000 (= n_samples) * 700 (= n_features per tree) * 4 (upcast to float32) * 2 bytes = 14 GB + 5.6 GB = ~21 GB of memory.
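The same estimate as a quick back-of-the-envelope check in Python (sizes as above; the factor 2 is from the developers' formula, the factor 4 from the float32 upcast):

    n_test, n_feat_total = 20000, 700000
    n_feat_tree, n_jobs = 700, 50

    shared = n_test * n_feat_total * 1                #uint8 test matrix in shared memory
    per_job = n_jobs * n_test * n_feat_tree * 4 * 2   #float32 copies across all workers
    print shared / 1e9, per_job / 1e9                 #14.0 GB and 5.6 GB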
But it always blows up to several hundred GB. What am I missing here? (I am on the latest version of scikit-learn, so the old memory problems should be fixed.)
Observation:
When using only one core, memory usage for testing ranges from 30 to 100 GB (as measured by free).
My code is:
    import itertools
    import math
    import numpy as np
    from multiprocessing import Pool

    import settings  #local module defining the missing-value encoding nan_enc

    #----------------
    #helper functions
    def initializeRFtest(*args):
        #initialize test data and test labels as globals in shared memory
        global df_test, pt_test
        df_test, pt_test = args

    def star_testTree(model_featidx):
        return predTree(*model_featidx)
    #end of helper functions
    #-------------------

    def RFtest(models, df_test, pt_test, features_idx, no_trees):
        #test trees in parallel
        ncores = 50
        p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
        args = itertools.izip(models, features_idx)
        out_list = p.map(star_testTree, args)
        p.close()
        p.join()
        return out_list

    def predTree(model, feat_idx):
        #get all indices of samples that meet the feature subset requirement
        nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
        all_rows = np.arange(df_test.shape[0])
        #discard rows with missing values in the given features
        rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]

        #predict
        pred = model.predict(df_test.iloc[rows, feat_idx])
        return pred

    #main program
    out = RFtest(models, df_test, pt_test, features_idx, no_trees)
Edit: another observation: when the test data is split into chunks, the program runs smoothly with greatly reduced memory usage. This is what I now use to run the program.
Code snippet for the updated predTree function:
    def predTree(model, feat_idx):
        #get all indices of samples that meet the feature subset requirement
        nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
        all_rows = np.arange(df_test.shape[0])
        #discard rows with missing values in the given features
        rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]

        #predict in chunks of at most 500 valid samples
        chunksize = 500
        n_chunks = int(math.ceil(float(rows.shape[0]) / chunksize))
        pred = []
        for i in range(n_chunks):
            if i == n_chunks - 1:
                #last chunk: take all remaining rows
                pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:], feat_idx])
            else:
                pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:(i+1)*chunksize], feat_idx])
            print pred_chunked.shape
            pred.append(pred_chunked)
        pred = np.concatenate(pred)

        #populate output vector: NaN for skipped rows, predictions elsewhere
        predicted = np.empty(df_test.shape[0])
        predicted.fill(np.nan)
        predicted[rows] = pred
        return predicted
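As a footnote, the explicit chunk-index arithmetic above could also be replaced by slicing rows in steps of chunksize. A sketch of a drop-in replacement for the loop (same chunks of at most 500 valid samples, minus the debug print):

    #same chunking via stepped slices; rows[i:i + chunksize] is simply
    #shorter on the final step, so no special case is needed
    pred = np.concatenate([model.predict(df_test.iloc[rows[i:i + chunksize], feat_idx])
                           for i in range(0, rows.shape[0], chunksize)])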