VowpalWabbit: Differences and Scalability

Hi guys: I'm trying to figure out how Vowpal Wabbit's state is maintained as the size of our input set grows. In a typical machine learning environment, if I have 1000 input vectors, I would expect to send them all at once, wait for the model-building phase to complete, and then use the model to create new predictions.

In VW, it seems the "online" nature of the algorithm changes this paradigm to something more adaptive, able to adjust in real time.

1) How is this real-time modification of the model implemented?

2) Does VW take increasing resources as the total input size grows over time? That is, as I add more data to my VW model, do the real-time adjustment calculations start to take longer once the cumulative number of input vectors grows to 1000, 10000, or millions?

+8
performance machine-learning scalability vowpalwabbit
2 answers

Just to add to carlosdc's good answer.

Some of the features that set vowpal wabbit apart and allow it to scale to tera-feature (10^12) data sizes:

Online weight vector: vowpal wabbit maintains an in-memory weight vector, which is essentially the vector of weights for the model it is building. This is what you call the "state" in your question.
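
To make this concrete, here is a minimal sketch in Python of what that "state" means (my own illustration, not vw's actual code; squared loss and plain SGD are chosen for simplicity). The only thing kept between examples is the weight vector, and each example nudges it slightly:

    # The only persistent "state" is the weight vector; each example
    # updates it a little and is then thrown away.
    weights = {}                    # feature name -> weight
    learning_rate = 0.5

    def predict(w, features):
        return sum(w.get(name, 0.0) * value for name, value in features.items())

    def update(w, label, features):
        # gradient step for squared loss on a linear model
        error = predict(w, features) - label
        for name, value in features.items():
            w[name] = w.get(name, 0.0) - learning_rate * error * value

This is also the answer to question (1): the "real-time modification" is just one such small update per incoming example.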

Unlimited data size: the size of the weight vector is proportional to the number of features (independent input variables), not to the number of examples (instances). This is what makes vowpal wabbit, unlike many other (non-online) learners, scale in space. Since it doesn't need to load all the data into memory the way a typical batch learner does, it can learn from data sets that are too big to fit in memory.
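
A sketch of the streaming side (the "label name:value ..." text format here is a simplification loosely modeled on vw's input format): examples are read one at a time and discarded after the update, so memory never grows with the number of examples.

    # Stream examples from disk one at a time; only the current line and
    # the weight vector ever live in memory.
    def stream_examples(path):
        with open(path) as f:
            for line in f:
                label, *pairs = line.split()
                features = {}
                for pair in pairs:
                    name, _, value = pair.partition(":")
                    features[name] = float(value) if value else 1.0
                yield float(label), features

    for label, features in stream_examples("train.txt"):
        update(weights, label, features)    # one SGD step per example (see above)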

Cluster mode: vowpal wabbit supports running on multiple hosts in a cluster, imposing a binary-tree graph structure on the nodes and using an all-reduce reduction from the leaves to the root.
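
A toy sketch of that reduction pattern (node names and tree layout are made up; real vw cluster mode communicates over the network): partial sums flow from the leaves to the root, then the root's total is broadcast back down so every node ends up with the same aggregate.

    # Reduce phase: each node adds its children's vectors to its own.
    def reduce_up(node, children, vectors):
        total = vectors[node][:]
        for child in children.get(node, []):
            partial = reduce_up(child, children, vectors)
            total = [a + b for a, b in zip(total, partial)]
        return total

    # Broadcast phase: push the aggregate back down the tree.
    def broadcast_down(node, children, vectors, total):
        vectors[node] = total
        for child in children.get(node, []):
            broadcast_down(child, children, vectors, total)

    children = {"root": ["a", "b"]}                  # hypothetical 3-node tree
    vectors = {"root": [1.0, 2.0], "a": [3.0, 4.0], "b": [5.0, 6.0]}
    total = reduce_up("root", children, vectors)     # [9.0, 12.0]
    broadcast_down("root", children, vectors, total)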

Hash trick: vowpal wabbit employs what is called the hashing trick. All feature names are hashed into an integer using murmurhash-32. This has several advantages: it is very simple and time-efficient, avoiding hash-table management and collision handling, while allowing features to occasionally collide. It turns out (in practice) that a small number of feature collisions in a training set with thousands of distinct features acts like an implicit regularization term. Counter-intuitively, this often improves model accuracy rather than reducing it. It is also agnostic to the sparseness (or density) of the feature space. Finally, it allows input feature names to be arbitrary strings, unlike most conventional learners, which require feature names/IDs to be both a) numeric and b) unique.
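
A sketch of the hashing trick, with Python's zlib.crc32 standing in for murmurhash-32 (and a fixed array replacing the name-keyed dict from the first sketch). The 18-bit table size mirrors vw's default -b setting:

    import zlib

    NUM_BITS = 18                   # vw's default table size is 2^18 (-b option)
    SIZE = 1 << NUM_BITS
    weights = [0.0] * SIZE          # fixed-size weight vector, no hash table

    def feature_index(name):
        # stand-in hash; vw itself uses murmurhash-32
        return zlib.crc32(name.encode("utf-8")) & (SIZE - 1)

    # Arbitrary string names are fine; rare collisions are simply tolerated.
    weights[feature_index("user_id=alice")] += 0.1
    weights[feature_index("color=red")] += 0.1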

Parallelism: vowpal wabbit exploits multi-core CPUs by running parsing and learning in two separate threads, adding further to its speed. This is what makes vw able to learn as fast as it can read data. It turns out that most of the supported algorithms in vw are, counter-intuitively, bottlenecked by I/O speed rather than by learning speed.
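
A rough sketch of that two-thread split, reusing stream_examples and update from the sketches above (the queue size is arbitrary): one thread parses input while the other runs the gradient updates, so the two overlap.

    import threading, queue

    q = queue.Queue(maxsize=1024)   # bounded buffer between the two threads

    def parse_thread(path):
        for example in stream_examples(path):
            q.put(example)
        q.put(None)                 # sentinel: no more data

    def learn_thread():
        while True:
            example = q.get()
            if example is None:
                break
            label, features = example
            update(weights, label, features)

    parser = threading.Thread(target=parse_thread, args=("train.txt",))
    learner = threading.Thread(target=learn_thread)
    parser.start(); learner.start()
    parser.join(); learner.join()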

Checkpointing and incremental learning: vowpal wabbit lets you save your model to disk while training, then load the model and continue learning where you left off, via the --save_resume option.
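
In vw itself this looks something like training with vw -d day1.data -f model.vw --save_resume and later continuing with vw -d day2.data -i model.vw -f model2.vw. Conceptually (a hand-rolled analogy, not vw's actual model format) it amounts to persisting the learner's state:

    import json

    # Hand-rolled checkpointing analogy: persist the weight vector
    # (vw's --save_resume additionally stores learning-rate bookkeeping
    # so training can continue exactly where it stopped).
    def save_checkpoint(path, weights):
        with open(path, "w") as f:
            json.dump(weights, f)

    def load_checkpoint(path):
        with open(path) as f:
            return json.load(f)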

Test-like error estimate: the average loss that vowpal wabbit calculates "as it goes" is always on unseen (out-of-sample) data (*). This eliminates the need to bother with pre-planned hold-outs or cross-validation. The error rate you see during training is "test-like".
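
This scheme is known as progressive validation. A sketch, again reusing the helpers above: predict on each example before learning from it, so the running average loss is always measured on data the model has not yet seen.

    total_loss, n = 0.0, 0
    for label, features in stream_examples("train.txt"):
        pred = predict(weights, features)   # prediction BEFORE the update
        total_loss += (pred - label) ** 2
        n += 1
        update(weights, label, features)    # learn from the example afterwards
    print("progressive (test-like) average loss:", total_loss / n)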

Beyond linear models: vowpal wabbit supports several algorithms, including matrix factorization (roughly, sparse-matrix SVD), Latent Dirichlet Allocation (LDA), and more. It also supports on-the-fly generation of term interactions (bilinear, quadratic, cubic, and feed-forward sigmoid neural nets with a user-specified number of units), multi-class classification (in addition to basic regression and binary classification), and more.
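
As one concrete example of on-the-fly interaction generation, here is a sketch of quadratic feature crossing, roughly what vw's -q option does between two namespaces (the feature names here are hypothetical):

    from itertools import product

    # Cross every feature in namespace a with every feature in namespace b,
    # multiplying their values to form new interaction features.
    def quadratic(ns_a, ns_b):
        return {na + "^" + nb: va * vb
                for (na, va), (nb, vb) in product(ns_a.items(), ns_b.items())}

    user = {"user=alice": 1.0}
    movie = {"movie=matrix": 1.0, "genre=scifi": 1.0}
    print(quadratic(user, movie))
    # {'user=alice^movie=matrix': 1.0, 'user=alice^genre=scifi': 1.0}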

There are official tutorials and many examples on the official vw wiki on github.

(*) An exception is the use of multiple passes with the --passes N option.

+19

VW is a (very) sophisticated implementation of stochastic gradient descent. You can read more about stochastic gradient descent here.

It turns out that a good implementation of stochastic gradient descent is basically I/O bound: it goes as fast as you can feed it the data, so VW has some sophisticated data structures to "precompile" the data.

Therefore, the answer to question (1) is: by doing stochastic gradient descent, and the answer to question (2) is: definitely not.

+8
