Gradient boosting predictions in low-latency environments?

Can anyone recommend a strategy for making predictions with a gradient boosting model in under 10-15 ms (the faster the better)?

I have been using the R gbm package, but the first prediction takes ~50 ms (subsequent vectorized predictions average 1 ms, so the overhead seems to be in the call into the C++ library). As a rough guideline, there will be ~10-50 inputs and ~50-500 trees. The task is classification, and I need access to the predicted probabilities.

I know there are many libraries out there, but I have had little luck finding even rough prediction timings for them. Training will be done offline, so only prediction needs to be fast. Prediction could also be handled by a piece of code/library that is entirely separate from whatever does the training (as long as there is a common format for representing the trees).

1 answer

I'm the author of the scikit-learn gradient boosting module, an implementation of gradient boosted regression trees in Python. I put some effort into optimizing prediction time, since the method was targeted at low-latency environments (in particular, ranking problems); the prediction routine is written in C, but there is still some overhead due to Python function calls. That said: prediction time for single data points with ~50 features and about 250 trees should be <1 ms.
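
If you want to check that number on your own hardware, a minimal timing sketch along these lines should do (the synthetic data and hyperparameters below are made up to match the sizes mentioned above; absolute numbers will depend on your machine and scikit-learn version):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data roughly matching the sizes above: 50 features, 250 trees.
rng = np.random.RandomState(0)
X = rng.rand(10000, 50)
y = (X[:, :5].sum(axis=1) > 2.5).astype(int)
clf = GradientBoostingClassifier(n_estimators=250, max_depth=3).fit(X, y)

x = X[:1]                 # a single data point, shape (1, n_features)
clf.predict_proba(x)      # warm-up call, excluded from the timing

n_calls = 1000
t0 = time.perf_counter()
for _ in range(n_calls):
    clf.predict_proba(x)
print("predict_proba, single point: %.3f ms/call"
      % ((time.perf_counter() - t0) * 1000 / n_calls))
```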

In my use cases, prediction time is often dominated by the cost of feature extraction. I strongly recommend profiling to pinpoint the source of the overhead (if you use Python, I can recommend line_profiler).
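
For example, a rough sketch of how you could split the per-request time between feature extraction and the model call using line_profiler (the model, the data, and the extract_features stub are placeholders, not part of any real pipeline):

```python
import numpy as np
from line_profiler import LineProfiler   # pip install line_profiler
from sklearn.ensemble import GradientBoostingClassifier

# Dummy model, just so there is something to profile.
rng = np.random.RandomState(0)
X = rng.rand(2000, 50)
y = (X[:, 0] > 0.5).astype(int)
model = GradientBoostingClassifier(n_estimators=100).fit(X, y)

def extract_features(record):
    # Placeholder: in a real system this might parse a request or hit a feature store.
    return np.asarray(record, dtype=np.float64)

def predict_one(record):
    features = extract_features(record)
    return model.predict_proba(features.reshape(1, -1))

profiler = LineProfiler()
wrapped = profiler(predict_one)   # wrap the function to be profiled
for _ in range(100):
    wrapped(X[0])
profiler.print_stats()            # line-by-line split: extraction vs. prediction
```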

If the source of the overhead is prediction rather than feature extraction, check whether you can do batch predictions instead of predicting single data points, thereby limiting the overhead of the Python function calls (e.g. in ranking you often need to score the top-K documents, so you can do the feature extraction first and then run prediction on the K x n_features matrix).
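
A small sketch of the difference (the sizes are made up; the point is only that the single batched call amortizes the per-call overhead over all K documents):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(10000, 50)
y = (X[:, 0] > 0.5).astype(int)
clf = GradientBoostingClassifier(n_estimators=250).fit(X, y)

K = 100                  # e.g. the top-K documents to re-rank
batch = X[:K]            # shape (K, n_features), built after feature extraction

# One call per point: pays the Python-call overhead K times.
t0 = time.perf_counter()
for i in range(K):
    clf.predict_proba(batch[i:i + 1])
print("one-by-one: %.2f ms" % ((time.perf_counter() - t0) * 1000))

# One batched call on the K x n_features matrix: pays the overhead once.
t0 = time.perf_counter()
clf.predict_proba(batch)
print("batched:    %.2f ms" % ((time.perf_counter() - t0) * 1000))
```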

If this does not help, try to limit the number of trees, because the runtime cost of prediction is basically linear in the number of trees. There are a number of ways to limit the number of trees without hurting the model's accuracy:

  • Proper tuning of the learning rate: the smaller the learning rate, the more trees are needed and therefore the slower the prediction.

  • Post-process the GBM with L1 regularization (Lasso); see The Elements of Statistical Learning, Section 16.3.1: use the predictions of each tree as new features and run this representation through an L1-regularized linear model, then remove the trees that receive no weight (see the sketch after this list).

  • Fully corrective weight updates: instead of doing the line search / weight update only for the most recent tree, update all trees (see [Warmuth2006] and [Johnson2012]). Better convergence means fewer trees.
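
To make the Lasso point concrete, here is a sketch of the post-processing idea for the binary case (this is only an illustration of the recipe from ESL, not a ready-made pruning API; the dataset and the regularization strength C are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Fit a deliberately over-sized ensemble on synthetic data.
rng = np.random.RandomState(0)
X = rng.rand(5000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05).fit(X, y)

# Use each tree's raw prediction as a feature (binary case: estimators_[:, 0]).
tree_preds = np.column_stack([tree.predict(X) for tree in gbm.estimators_[:, 0]])

# L1-regularized linear model on the per-tree predictions; C controls sparsity.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(tree_preds, y)

kept = np.flatnonzero(lasso.coef_[0])
print("trees kept: %d of %d" % (len(kept), gbm.n_estimators))
# A pruned predictor would then evaluate only the kept trees, weighted by lasso.coef_.
```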

If none of the above does the trick, you could investigate cascades or early-exit strategies (see [Chen2012]).
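
As a very rough illustration of the early-exit idea, you could build something on top of scikit-learn's staged prediction API (the threshold, the check interval, and the use of the default log-loss link are my assumptions, and the per-stage Python overhead means a production cascade would be implemented quite differently):

```python
import numpy as np
from scipy.special import expit
from sklearn.ensemble import GradientBoostingClassifier

# Toy setup; sizes, threshold, and check interval are made up.
rng = np.random.RandomState(0)
X = rng.rand(5000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05).fit(X, y)

def early_exit_proba(x, threshold=0.95, check_every=25):
    """Stop evaluating further trees once the partial probability is already confident."""
    x = x.reshape(1, -1)
    for i, score in enumerate(gbm.staged_decision_function(x), start=1):
        if i % check_every == 0:
            p = expit(np.ravel(score)[0])   # partial probability after i trees (log-loss link)
            if p > threshold or p < 1 - threshold:
                return p, i                 # confident enough: exit early
    return expit(np.ravel(score)[0]), i     # fell through: used all trees

p, trees_used = early_exit_proba(X[0])
print("p=%.3f after %d of %d trees" % (p, trees_used, gbm.n_estimators))
```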

References:

[Warmuth2006] M. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[Johnson2012] Rie Johnson and Tong Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. arXiv, 2012.

[Chen2012] Minmin Chen, Zhixiang Xu, Kilian Weinberger, Olivier Chapelle, and Dor Kedem. Classifier Cascade for Minimizing Feature Evaluation Cost. JMLR W&CP 22:218-226, 2012.
