How can I combine preprocessing steps that have learned parameters when using sklearn online (out-of-core) learning?

My dataset is getting too big, and I'm looking for online learning solutions in sklearn, which it calls out-of-core learning.

They offer several classes that use the partial_fit API, which basically lets you keep a subset of your data in memory and work on it. However, many preprocessing steps (for example, scaling the data) keep the parameters learned while fitting on the training data, and those parameters are then used for the transformations.

For example, if you use the min-max scaler to bound features to [-1, 1], or standardize your data, the parameters the scaler learns and ultimately uses to transform the data come only from the subset of the training data it is operating on in that iteration.
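A small sketch of that effect, assuming synthetic data and an arbitrary split into two chunks (neither comes from the question itself): fitting the same scaler on different chunks yields different learned parameters.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic single-feature data, split into two chunks (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(2000, 1))
chunk_a, chunk_b = X[:1000], X[1000:]

# Each fit learns its parameters from the chunk it sees, nothing else.
mean_a = StandardScaler().fit(chunk_a).mean_
mean_b = StandardScaler().fit(chunk_b).mean_
min_a = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_a).data_min_
min_b = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_b).data_min_

print("per-chunk means:", mean_a, mean_b)
print("per-chunk minima:", min_a, min_b)
```

The printed values differ between chunks, so the same raw value would be mapped to a different scaled value depending on which chunk the scaler was fitted on.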

This means that the parameters learned during the fit phase on one subset of the training data may differ from those learned on another subset, since they are specific to the data they were fitted on. And therein lies the heart of my question:

How can you combine the parameters learned during the fit phase of a preprocessing step when using online / out-of-core learning, given that the learned parameters are a function of the training data?

1 answer

You can fit a StandardScaler instance on a sufficiently large subset that fits in RAM at once (say, a few GB of data), and then reuse that same fixed scaler instance to transform the rest of the data one batch at a time. You should be able to get a good estimate of the mean and std of each feature from a few thousand samples, so there is no need to compute the fit on the full data just for scaling.
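A minimal sketch of this approach, assuming synthetic data, a hypothetical make_batch helper standing in for reading a batch from disk, and SGDClassifier as the incremental model (none of these are from the original answer):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_batch(n):
    """Hypothetical stand-in for loading one batch of data from disk."""
    X = rng.normal(loc=[5.0, -3.0, 100.0], scale=[2.0, 0.5, 30.0], size=(n, 3))
    y = (X[:, 0] - 5.0 + (X[:, 2] - 100.0) / 30.0 > 0).astype(int)
    return X, y

# Fit the scaler once, on a sample small enough to hold in memory.
X_sample, _ = make_batch(5000)
scaler = StandardScaler().fit(X_sample)

# Reuse the same fixed scaler for every later batch.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full label set up front
for _ in range(20):
    X_batch, y_batch = make_batch(1000)
    model.partial_fit(scaler.transform(X_batch), y_batch, classes=classes)

# Evaluate on fresh data, scaled with the same fixed parameters.
X_test, y_test = make_batch(2000)
accuracy = model.score(scaler.transform(X_test), y_test)
print("estimated means:", scaler.mean_)
print("held-out accuracy:", accuracy)
```

The scaler's parameters never change after the initial fit, so every batch is transformed consistently. This only works well if the sample used for fitting is representative of the rest of the data.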

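Where StandardScaler provides a partial_fit method, the mean and variance can instead be estimated incrementally, one batch at a time.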

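Even with partial_fit on StandardScaler, you still need several passes over the data:

  • First pass: call standard_scaler.partial_fit() on each chunk of the original data.
  • Second pass: call standard_scaler.transform on each chunk of the original data, then pass the result to model.partial_fit.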
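The two passes can be sketched as follows. This assumes a scikit-learn version in which StandardScaler exposes partial_fit, a hypothetical iter_chunks generator that replays the same chunks on each call, and SGDClassifier as the model. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

def iter_chunks(seed=0, n_chunks=10, chunk_size=500):
    """Hypothetical chunk source; the fixed seed replays identical chunks."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(loc=[10.0, -2.0], scale=[4.0, 0.1], size=(chunk_size, 2))
        y = (X[:, 0] > 10.0).astype(int)
        yield X, y

# First pass: accumulate running mean and variance over all chunks.
standard_scaler = StandardScaler()
for X_chunk, _ in iter_chunks():
    standard_scaler.partial_fit(X_chunk)

# Second pass: transform each chunk with the final statistics, then train.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])
for X_chunk, y_chunk in iter_chunks():
    model.partial_fit(standard_scaler.transform(X_chunk), y_chunk, classes=classes)

# Sanity check (only feasible here because the toy data fits in memory).
X_all = np.vstack([X for X, _ in iter_chunks()])
print("incremental mean:", standard_scaler.mean_)
print("full-data mean:  ", X_all.mean(axis=0))
```

Keeping the passes separate means the model only ever sees data scaled with the final statistics, rather than with estimates that drift while the scaler is still being fitted.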