In this entry, I’d like to discuss the basics of recurrent neural networks (RNNs). The topic has been coming up frequently in my circles, and there is a lot of interest in this family of models for time series forecasting.
Why sequence models?
Sequence models provide a way to model data that has a temporal dimension (or some other sequential structure).
This special class of neural networks reuses most of the ideas of traditional neural networks, but takes into account the inter-temporal dependency of the data: something that has already happened can impact something in the future (or, in bidirectional RNNs, something in the future can impact the past), and it does so efficiently.
In the same way that convolutional neural networks share parameters learned across different parts of an image, RNNs share parameters learned across different points in time.
Types of RNN in terms of architecture
There are different variations of RNNs depending on the nature of the data and the prediction coming from it.
These are some examples of applications of RNNs:
- Speech recognition: audio to text.
- Music generation: none to audio.
- Sentiment classification: text to score.
- Machine translation: text to text.
- Video recognition: sequence of images to class.
- Named entity recognition: text to vectors with the start and end of entities.
In terms of the mapping, here are some examples of applications of RNNs:
- Many to many (machine translation with encoder-decoder)
- Many to one (sentiment)
- One to many (music generation)
A one-to-one mapping would not take an activation from a previous time step and thus does not constitute an RNN.
The RNN model
The simple case of the RNN model looks like this:
Image from Andrew Ng
Where \(X\) is a vector whose dimension equals the number of features or predictors, usually normalized.
\(Y\) is a vector of dimension 1 when we predict a single probability of something happening (e.g. ‘hey google’ trigger detection), or of dimension equal to the number of words in the corpus, in which case the output appears as probabilities through a softmax, for example.
\(X^{<1>}\) affects \(Y^{<2>}, Y^{<3>}, ...,\) and \(Y^{<T>}\) through the activations in each time step.
There are 2 main values to compute at each step / layer of the RNN:
- \(a^t\) (the activation at layer t): \(a^t=g_1(W_{aa}*a^{t-1}+ W_{ax}*X^t+ b_a)\). Usually \(g_1\) is tanh or relu.
- \(y^t\) (the prediction at layer t): \(y^t=g_2(W_{ya}*a^t + b_y)\). Usually \(g_2\) is sigmoid or softmax.
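To make these two equations concrete, here is a minimal NumPy sketch of a single forward step of this simple RNN cell. The dimensions, parameter names, and the choice of tanh/softmax are illustrative assumptions, not something prescribed by the model itself.

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step of a simple RNN cell (illustrative dimensions)."""
    # a^t = g1(W_aa a^{t-1} + W_ax x^t + b_a), with g1 = tanh
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # y^t = g2(W_ya a^t + b_y), with g2 = softmax over the output classes
    z = W_ya @ a_t + b_y
    e = np.exp(z - z.max())
    y_t = e / e.sum()
    return a_t, y_t

# Toy sizes: 3 input features, 5 hidden units, 4 output classes (all assumed)
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 4
params = dict(
    W_aa=rng.normal(size=(n_a, n_a)) * 0.01,
    W_ax=rng.normal(size=(n_a, n_x)) * 0.01,
    W_ya=rng.normal(size=(n_y, n_a)) * 0.01,
    b_a=np.zeros(n_a),
    b_y=np.zeros(n_y),
)

# Unroll over a short sequence, carrying the activation forward in time
a = np.zeros(n_a)
for x in rng.normal(size=(6, n_x)):  # a sequence of length 6
    a, y = rnn_step(x, a, **params)
print(y)  # probabilities that sum to 1
```

Note that the same weight matrices are reused at every time step; this is the parameter sharing across time mentioned above.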
The gated recurrent unit (GRU)
We would like to remember long term relationships in the data.
For example, in language models we need to remember which subject we are talking about so that we know whether to use the singular or the plural form:
“Walter White, commonly known as Heisenberg, {was?}{were?}”.
When modelling financial data, one would like to remember that sometime in the past there was a recession, and that this might impact our forecast in the present (because of a structural break).
Unfortunately, as usually happens in deep traditional neural networks, the signal from a past event gradually vanishes as we move forward in time, and the model cannot remember something relevant that happened a while ago. This is known as the vanishing gradient problem.
GRUs are similar to residual networks in the sense that they can bypass some layers and keep information from the activations of previous layers with relative ease.
Keeping only the \(y^t\) computation from the simple RNN case, the GRU replaces the activation computation with the following memory cell and update gate computations:
- \(\widetilde {c^t}\) (the preliminary memory cell at layer t): \(\widetilde {c^t}=g_3(W_{cc}*c^{t-1} + W_{cx}*X^t+ b_c)\). Usually \(g_3\) is tanh.
- \(\Gamma_u\) (the update gate at layer t): \(\Gamma_u=g_4(W_{uc}*c^{t-1} + W_{ux}*X^t+ b_u)\). Usually \(g_4\) is sigmoid.
And finally we combine the two to form the final memory cell, which is the activation vector that passes to the next time step:
- \(c^t\) (the final memory cell at layer t): \(c^t=\Gamma_u * \widetilde {c^t}+(1-\Gamma_u) *c^{t-1}\)
Since \(c^t\) (the activation) can remain essentially the same as it was many time steps before, this memory cell helps a lot with the vanishing gradient problem.
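A minimal NumPy sketch of one step of this simplified GRU (the version described above, without a relevance gate) may help; the dimensions and parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, W_cc, W_cx, b_c, W_uc, W_ux, b_u):
    """One forward step of the simplified GRU memory cell described above."""
    # Preliminary memory cell: c~^t = tanh(W_cc c^{t-1} + W_cx x^t + b_c)
    c_tilde = np.tanh(W_cc @ c_prev + W_cx @ x_t + b_c)
    # Update gate: Gamma_u = sigmoid(W_uc c^{t-1} + W_ux x^t + b_u)
    gamma_u = sigmoid(W_uc @ c_prev + W_ux @ x_t + b_u)
    # Keep the old cell where the gate is ~0, overwrite it where the gate is ~1
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t

# Toy sizes: 3 input features, 5 memory units (assumed for the example)
rng = np.random.default_rng(1)
n_x, n_c = 3, 5
W_cc, W_cx = rng.normal(size=(n_c, n_c)) * 0.01, rng.normal(size=(n_c, n_x)) * 0.01
W_uc, W_ux = rng.normal(size=(n_c, n_c)) * 0.01, rng.normal(size=(n_c, n_x)) * 0.01
b_c, b_u = np.zeros(n_c), np.zeros(n_c)

c = np.zeros(n_c)
for x in rng.normal(size=(6, n_x)):
    c = gru_step(x, c, W_cc, W_cx, b_c, W_uc, W_ux, b_u)
print(c)
```

When the update gate stays close to 0 over many steps, \(c^t\) is copied almost unchanged through time, which is exactly the behaviour that mitigates vanishing gradients.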
Long short-term memory cell (LSTM)
LSTMs are a more general and powerful case of GRUs; in fact, the GRU is the more recent invention, proposed as a simplification of the LSTM. LSTMs share the preliminary memory cell computation and update gate with GRUs and add:
- a forget gate,
- an output gate,
- and activation and memory cells that are now different vectors.
The typical LSTM cell looks like this:
Image from Andrew Ng
I find these diagrams more difficult to understand than the actual computation formulas, so here are the complete LSTM computations:
- \(\widetilde {c^t}\) (the preliminary memory cell at layer t): \(\widetilde {c^t}=g_3(W_{cc}*a^{t-1} + W_{cx}*X^t+ b_c)\). Usually \(g_3\) is tanh.
- \(\Gamma_u\) (the update gate): \(\Gamma_u=g_4(W_{uc}*a^{t-1} + W_{ux}*X^t+ b_u)\). Usually \(g_4\) is sigmoid.
- \(\Gamma_f\) (the forget gate): \(\Gamma_f=g_5(W_{fc}*a^{t-1} + W_{fx}*X^t+ b_f)\). Usually \(g_5\) is sigmoid.
- \(\Gamma_o\) (the output gate): \(\Gamma_o=g_6(W_{oc}*a^{t-1} + W_{ox}*X^t+ b_o)\). Usually \(g_6\) is sigmoid.
With all these previous intermediate values we compute the memory cell, the activation and the prediction:
- \(y^t\) (the prediction at layer t): \(y^t=g_2(W_{ya}*a^t + b_y)\). Usually \(g_2\) is sigmoid or softmax.
- \(c^t\) (the final memory cell at layer t): \(c^t=\Gamma_u * \widetilde {c^t}+\Gamma_f *c^{t-1}\)
- \(a^t\) (the activation at layer t): \(a^t=\Gamma_o * g_7(c^t)\). Usually \(g_7\) is tanh.
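As before, a minimal NumPy sketch of a single LSTM step may make the formulas easier to follow; the dictionary of weights and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, W, b):
    """One forward step of an LSTM cell, following the equations above."""
    # Preliminary memory cell (tanh) and the three gates (sigmoid)
    c_tilde = np.tanh(W["c"] @ a_prev + W["cx"] @ x_t + b["c"])
    gamma_u = sigmoid(W["u"] @ a_prev + W["ux"] @ x_t + b["u"])  # update gate
    gamma_f = sigmoid(W["f"] @ a_prev + W["fx"] @ x_t + b["f"])  # forget gate
    gamma_o = sigmoid(W["o"] @ a_prev + W["ox"] @ x_t + b["o"])  # output gate
    # The memory cell and the activation are now different vectors
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

# Toy sizes: 3 input features, 5 hidden units (assumed for the example)
rng = np.random.default_rng(2)
n_x, n_a = 3, 5
W = {k: rng.normal(size=(n_a, n_a)) * 0.01 for k in ["c", "u", "f", "o"]}
W.update({k: rng.normal(size=(n_a, n_x)) * 0.01 for k in ["cx", "ux", "fx", "ox"]})
b = {k: np.zeros(n_a) for k in ["c", "u", "f", "o"]}

a, c = np.zeros(n_a), np.zeros(n_a)
for x in rng.normal(size=(6, n_x)):
    a, c = lstm_step(x, a, c, W, b)
print(a, c)
```

The prediction \(y^t\) would then be computed from \(a^t\) exactly as in the simple RNN case.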
Deep RNNs
It is possible to build deep neural networks using RNNs, but because of the temporal dependence of a simple RNN, even a model with 3 hidden recurrent layers can be quite computationally expensive.
Image from Andrew Ng
In this diagram, each layer receives an activation from a previous time step and from the previous layer in the same time step.
To introduce deeper logic into the RNN model, it is more common to attach deep dense layers on top of each time step to produce the prediction when needed, as sketched below.
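As a rough illustration of that pattern, here is a small tf.keras sketch; the layer sizes, input dimension, and loss are assumptions made for the example, not something taken from the text above.

```python
import tensorflow as tf

# Two stacked recurrent layers, then a dense head applied at every time step
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),                   # (time steps, features)
    tf.keras.layers.LSTM(32, return_sequences=True),   # passes a^t up and forward
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(                   # dense layers on top of each step
        tf.keras.layers.Dense(1, activation="sigmoid")
    ),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```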