LSTM model equations #9

Open
mheilman opened this issue Mar 12, 2015 · 5 comments
@mheilman
Contributor

The code says it implements the version of the LSTM from Graves et al. (2013), which I assume is this http://www.cs.toronto.edu/~graves/icassp_2013.pdf or http://www.cs.toronto.edu/~graves/asru_2013.pdf. However, it looks like the LSTM equations in those papers have both the output layer values and memory cell values from the previous time step as input to the gates.

E.g., in equation 3 of http://www.cs.toronto.edu/~graves/icassp_2013.pdf:

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)

However, it looks like the code is doing the following:

i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)

Am I missing something here? Is there another LSTM paper this is based on?

I doubt there's much of a practical difference between these two formulations, but it would be good if the documentation were accurate. Sorry if I'm misunderstanding something here (also sorry for the messy equations above).
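
For concreteness, here is a rough NumPy sketch of the two input-gate computations above. The shapes and variable names are just illustrative, not the actual code in this repo (and note that in Graves' papers the cell-to-gate weights are diagonal; a full matrix is used here only to keep the sketch short):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes only.
n_in, n_hid = 4, 8
rng = np.random.RandomState(0)
W_xi = rng.randn(n_hid, n_in)   # input-to-gate weights
W_hi = rng.randn(n_hid, n_hid)  # hidden-to-gate weights
W_ci = rng.randn(n_hid, n_hid)  # cell-to-gate ("peephole") weights
b_i  = np.zeros(n_hid)

x_t    = rng.randn(n_in)   # current input
h_prev = rng.randn(n_hid)  # previous hidden state
c_prev = rng.randn(n_hid)  # previous cell state

# Graves (2013), eq. 3: the gate also sees the previous cell state.
i_t_peephole = sigmoid(W_xi @ x_t + W_hi @ h_prev + W_ci @ c_prev + b_i)

# Formulation the code appears to use: no cell-to-gate term.
i_t_plain = sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)
```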

@JonathanRaiman
Owner

Michael, you read those diagrams closely :)

Yes, indeed, it appears that referencing 2013 and Alex Graves is not as precise as I'd hoped.
Some LSTM networks do use the cell activations (memory) as inputs to their gates, while other 2013-era work that references Alex Graves' architectures (namely Learning to Execute (http://arxiv.org/pdf/1410.4615v3.pdf), Grammar as a Foreign Language, and other good LSTM papers) reserves the memory cells solely for internal LSTM purposes (I guess this lets the cells "focus" on one thing during learning). The papers you mention do use the cell memories, so I should make that apparent in the documentation.

In any case, it is informative to see how Andrej Karpathy describes LSTMs (in JavaScript) in https://github.com/karpathy/recurrentjs, and how Zaremba describes LSTMs (in Lua) in https://github.com/wojciechz/learning_to_execute.

To be fair, the most common implementation is the one present here, but the one you describe is potentially better. If you cross-validate one against the other, I'd be very interested in hearing whether there's a major difference.
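
For reference, here is a rough NumPy sketch of a single LSTM step in the formulation used here (gates see x_t and h_{t−1} only, never c_{t−1}). This is not the actual Theano code in the repo, just an illustration with made-up names and sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step without cell-to-gate connections
    (the formulation discussed above)."""
    W_x, W_h, b = params                 # stacked weights for i, f, o, g
    z = W_x @ x_t + W_h @ h_prev + b     # single affine map, then split
    n = h_prev.shape[0]
    i = sigmoid(z[0*n:1*n])              # input gate
    f = sigmoid(z[1*n:2*n])              # forget gate
    o = sigmoid(z[2*n:3*n])              # output gate
    g = np.tanh(z[3*n:4*n])              # candidate cell update
    c_t = f * c_prev + i * g             # gates never see c_prev directly
    h_t = o * np.tanh(c_t)               # hidden output
    return h_t, c_t

# Toy usage with illustrative sizes.
n_in, n_hid = 4, 8
rng = np.random.RandomState(0)
params = (rng.randn(4 * n_hid, n_in),
          rng.randn(4 * n_hid, n_hid),
          np.zeros(4 * n_hid))
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.randn(n_in), h, c, params)
```

Adding the Graves-style peephole terms would amount to adding diagonal cell-to-gate contributions to the i, f, and o pre-activations before the sigmoids.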

@JonathanRaiman
Owner

Michael,

Quick follow-up. I ran a couple of models with the two different versions, and with the version you describe most models hit a local minimum much sooner in their training; in most cases training time is doubled or tripled to exit it. The version implemented here (where memory does not feed back into the gates) reaches a lower local minimum and exits it more quickly. There may be some coupling with the type of gradient descent used (Adadelta vs Adam vs RMSProp or something else).

If you find a way of training them easily, or some combination that works well, I'd be curious to hear about it, but for now it appears that these cannot be used interchangeably without understanding where the optimisation troubles come from.

@mheilman
Contributor Author

Thanks for your very detailed reply! I'll let you know if I find anything else useful related to this.

@JonathanRaiman
Owner

You might be interested in a more thorough discussion in last week's arXiv paper.

@mheilman
Contributor Author

That's a very useful reference. Thanks!
