My dataset has grown too large to fit in memory, so I'm looking at the online learning solutions in sklearn, which they call out-of-core learning.
They offer several classes that implement the `partial_fit` API, which basically lets you keep only a subset of your data in memory and work on it. However, many preprocessing steps (for example, scaling data) store the parameters they learn when fitted on the training data, and then use those parameters for transformations.
For example, if you use the min-max scaler to bound features to [-1, 1], or standardize your data, the parameters they learn and ultimately use to transform the data are learned from whichever subset of the training data they happen to be working on in that iteration.
This means that the parameters learned during the fit stage on one subset of the training data may differ from those learned on another subset, since they are specific to the data they were fitted on. And therein lies the heart of my question:
How do you combine the parameters learned during the fit stage of a preprocessing step when using online/out-of-core learning, given that the learned parameters are a function of the training data?
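To make the concern concrete, here is a minimal sketch of what I mean. The data is synthetic and the two-chunk split is just an assumption standing in for batches read from disk; the only library pieces used are `MinMaxScaler` and its fitted `data_min_` / `data_max_` attributes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a dataset that would not fit in memory (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=5.0, size=(1000, 3))

# Two chunks, as they might arrive one at a time in an out-of-core loop.
chunk_a, chunk_b = X[:500], X[500:]

# Fit an independent scaler on each chunk.
scaler_a = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_a)
scaler_b = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_b)

# The learned per-feature minima and maxima depend on the chunk.
print("chunk A min:", scaler_a.data_min_, "max:", scaler_a.data_max_)
print("chunk B min:", scaler_b.data_min_, "max:", scaler_b.data_max_)

# So the same row is mapped to different values depending on
# which chunk's parameters are used for the transform.
row = X[:1]
scaled_with_a = scaler_a.transform(row)
scaled_with_b = scaler_b.transform(row)
print("row scaled with A's parameters:", scaled_with_a)
print("row scaled with B's parameters:", scaled_with_b)
```

Each scaler only ever sees its own chunk, so neither one holds the minimum and maximum of the full dataset, which is exactly the inconsistency I'm asking how to resolve.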