Is there any hope of using Lasagne's Adam implementation for probabilistic matrix factorization?

I am implementing probabilistic matrix factorization (PMF) models in Theano and would like to use the Adam optimizer for gradient descent.

My goal is to keep the code as clean as possible, which means I don't want to explicitly track the "m" and "v" quantities from Adam's algorithm.

It seems this should be possible, especially after seeing how Lasagne's Adam is implemented: it hides the "m" and "v" values inside the theano.function update rules.

This works when the negative log-likelihood is built from terms that each involve a different quantity. But in probabilistic matrix factorization, each term contains the dot product of one latent user vector and one latent item vector. So if I create a Lasagne Adam instance per term, I end up with several "m" and "v" values for the same latent vector, which is not how Adam is supposed to work.
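For reference, here is a minimal sketch of what I mean (a toy objective of my own, not code from the Lasagne docs): as far as I can tell, adam() creates its own shared variables for "m", "v" and the step counter and simply folds them into the updates dictionary handed to theano.function.

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

# toy parameter and objective, just to inspect what adam() returns
W = theano.shared(np.zeros((5, 3), dtype=theano.config.floatX), name='W')
x = T.matrix('x')
loss = T.sum(T.dot(x, W) ** 2)

updates = lasagne.updates.adam(loss, [W], learning_rate=0.001)

# Besides W itself, the OrderedDict holds the hidden Adam state:
# an "m" and a "v" per parameter, plus a shared step counter.
print(len(updates))  # 4 entries for this single parameter

train = theano.function([x], loss, updates=updates)
```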

I also posted on the Lasagne group, twice actually, with a few more details and some examples.

I thought of two possible implementations:

  • each observed rating (i.e., each term of the global NLL objective function) gets its own Adam instance, updated by a dedicated theano.function call. Unfortunately, this misuses Adam, since the same latent vector would be associated with different "m" and "v" values, which is not how Adam is supposed to work.
  • a single Adam call over the entire NLL objective, which turns the update mechanism into plain batch gradient descent instead of SGD, with all its known shortcomings (high computation time, getting stuck in local minima, etc.); see the sketch just after this list.
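To make option 2 concrete, here is a hedged sketch of what I have in mind (the sizes are made up, a squared-error term stands in for the full PMF NLL, and the names U, V, u_idx, i_idx are my own):

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

n_users, n_items, rank = 1000, 500, 20   # hypothetical sizes
floatX = theano.config.floatX
U = theano.shared(0.1 * np.random.randn(n_users, rank).astype(floatX))
V = theano.shared(0.1 * np.random.randn(n_items, rank).astype(floatX))

u_idx = T.ivector('u_idx')   # user index of every observed rating
i_idx = T.ivector('i_idx')   # item index of every observed rating
r = T.vector('r')            # the observed ratings themselves
lam = 0.01                   # made-up regularization weight

# one dot product per observed rating, all inside a single objective
pred = T.sum(U[u_idx] * V[i_idx], axis=1)
nll = T.sum((r - pred) ** 2) + lam * (T.sum(U ** 2) + T.sum(V ** 2))

# a single adam() call: exactly one "m"/"v" pair per shared variable
updates = lasagne.updates.adam(nll, [U, V], learning_rate=0.005)
train_fn = theano.function([u_idx, i_idx, r], nll, updates=updates)
# calling train_fn on *all* observed ratings at once gives the
# full-batch behaviour described above
```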

My questions:

  • is there something I misunderstood about how Lasagne's Adam works?

  • would option 2 actually behave like SGD, in the sense that every update to a latent vector affects other updates (within the same Adam call) that use this updated vector?

  • Do you have any other suggestions on how to implement it?

Any idea how to solve this while avoiding manually keeping replicated vectors and matrices for the "m" and "v" values?

1 answer

It seems that in the paper they assume you are optimizing the entire objective at once using gradient descent:

Then we can perform a gradient descent in Y, V, and W to minimize the objective function given by equation 10.

So, I would say that your option 2 is the right way to implement what they did.

There aren't many non-linearities (besides the sigmoid), so you are unlikely to hit the typical optimization problems of neural networks that create the need for something like Adam. As long as it all fits in memory, I guess this approach will work.

If it doesn't fit in memory, maybe you could devise a minibatch version of the loss to optimize over. It would also be interesting to see whether you could add one or more neural networks to replace some of these conditional probabilities. But that's a bit off topic...
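For what it's worth, a minibatch loop over the observed triples could look roughly like this (my sketch, assuming a train_fn compiled from the whole-objective graph as in the question; lasagne.updates.adam keeps its optimizer state in shared variables, so it persists across batches):

```python
import numpy as np
import theano

def train_minibatches(train_fn, users, items, ratings,
                      batch_size=256, n_epochs=10):
    """Feed shuffled minibatches of observed (user, item, rating) triples."""
    n = len(ratings)
    for epoch in range(n_epochs):
        order = np.random.permutation(n)   # reshuffle every epoch
        epoch_loss = 0.0
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            epoch_loss += train_fn(users[batch].astype('int32'),
                                   items[batch].astype('int32'),
                                   ratings[batch].astype(theano.config.floatX))
        print('epoch %d, loss %.3f' % (epoch, epoch_loss))
```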
