Sklearn model dumped with joblib produces multiple files. Which one is the right model?

I wrote a program that trains an SVM with sklearn. Here is the code:

    from sklearn import svm
    from sklearn import datasets
    from sklearn.externals import joblib

    clf = svm.SVC()
    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    clf.fit(X, y)
    print(clf.predict(X))
    joblib.dump(clf, 'clf.pkl')

After dumping the model, I find this many files in the directory:

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I am confused: did I do something wrong, or is this normal? What are the *.npy files, and why are there 11 of them?

python scikit-learn machine-learning joblib
1 answer

To save everything in one file, set the `compress` argument of `joblib.dump` to True or to any integer (for example, 1).
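A minimal sketch of the suggestion above, using the standalone `joblib` package (the old `sklearn.externals.joblib` alias has since been removed from scikit-learn) and a dict of numpy arrays as a stand-in for a fitted model:

```python
import glob
import os
import tempfile

import joblib  # standalone package; replaces sklearn.externals.joblib
import numpy as np

# Stand-in for a fitted estimator: an object containing numpy arrays.
obj = {"weights": np.arange(10**6), "bias": np.ones(3)}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "clf.pkl")

# Older joblib versions wrote companion *.npy files next to the pickle
# when compress was not set; with compress=1 everything lands in one file.
joblib.dump(obj, path, compress=1)

files = glob.glob(os.path.join(tmpdir, "clf.pkl*"))
print(files)  # a single 'clf.pkl' entry, no *.npy companions
```

Note that recent joblib versions write a single file even without `compress`, but compression also shrinks the file considerably for large arrays.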

But you should know that the separated representation of numpy arrays is what enables joblib's main features: thanks to it, joblib can save and load objects containing large numpy arrays faster than pickle, and, unlike pickle, joblib can correctly save and load objects containing memmapped numpy arrays. If you want single-file serialization of the whole object (and do not need to save memmapped arrays), it is probably better to use pickle; AFAIK, in that case joblib's dump/load works at about the same speed as pickle.
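For single-file serialization with plain pickle, as suggested above, the question's own SVC example can be saved like this (the file name `clf_pickle.pkl` is just an illustrative choice):

```python
import pickle

from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

# Plain pickle always produces exactly one file, with no *.npy companions.
with open("clf_pickle.pkl", "wb") as f:
    pickle.dump(clf, f)

with open("clf_pickle.pkl", "rb") as f:
    clf_loaded = pickle.load(f)

# The reloaded model makes the same predictions as the original.
print((clf_loaded.predict(iris.data) == clf.predict(iris.data)).all())
```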

    import pickle

    import numpy as np
    from sklearn.externals import joblib

    vector = np.arange(0, 10**7)

    %timeit joblib.dump(vector, 'vector.pkl')
    # 1 loops, best of 3: 818 ms per loop
    # file size ~ 80 MB

    %timeit vector_load = joblib.load('vector.pkl')
    # 10 loops, best of 3: 47.6 ms per loop

    # Compressed
    %timeit joblib.dump(vector, 'vector.pkl', compress=1)
    # 1 loops, best of 3: 1.58 s per loop
    # file size ~ 15.1 MB

    %timeit vector_load = joblib.load('vector.pkl')
    # 1 loops, best of 3: 442 ms per loop

    # Pickle
    %%timeit
    with open('vector.pkl', 'wb') as f:
        pickle.dump(vector, f)
    # 1 loops, best of 3: 927 ms per loop

    %%timeit
    with open('vector.pkl', 'rb') as f:
        vector_load = pickle.load(f)
    # 10 loops, best of 3: 94.1 ms per loop
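The memmap support mentioned above can be demonstrated with `joblib.load`'s `mmap_mode` parameter, which maps an uncompressed dump from disk instead of reading it fully into RAM. This sketch uses the standalone `joblib` package and a temporary file:

```python
import os
import tempfile

import joblib
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "vector.pkl")
vector = np.arange(10**6)

# The dump must be uncompressed: compressed files cannot be memmapped.
joblib.dump(vector, path)

# mmap_mode='r' returns a read-only memory-mapped view of the array data.
loaded = joblib.load(path, mmap_mode="r")
print(type(loaded))        # numpy.memmap, not a plain ndarray
print(np.array_equal(loaded, vector))
```

This is why an object containing large memmapped arrays round-trips correctly through joblib but not through plain pickle.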
