\chapter{Recurrent Neural Networks}
\label{chap-rnn}
So far, we have limited our attention to domains in which each output
$y$ is assumed to have been generated as a function of an associated
input $x$, and our hypotheses have been ``pure'' functions, in which
the output depends only on the input (and the parameters we have
learned that govern the function's behavior). In the next few
chapters, we are going to consider cases in which our models need to
go beyond functions. In particular, behavior as a function of {\em
time} will be an important concept:
\begin{itemize}
\item In {\em recurrent neural networks}\index{recurrent neural
network}, the hypothesis that we learn is not a function of a single
input, but of the whole sequence of inputs that the predictor has
received.
\item In {\em reinforcement learning}\index{reinforcement learning},
the hypothesis is either a {\em model} of a domain (such as a game)
as a recurrent system or a {\em policy} which is a pure function,
but whose loss is determined by the ways in which the policy
interacts with the domain over time.
\end{itemize}
In this chapter, we introduce {\em state machines}\index{state machine}.
We start with deterministic state machines, and then consider
recurrent neural network (RNN) architectures to model their behavior.
Later, in Chapter~\ref{chap-mdps}, we will study {\em Markov decision
  processes} (MDPs), which extend state machines to probabilistic
(rather than deterministic) transitions.
RNNs and MDPs let us describe and model temporally
sequential patterns of behavior that are important in many domains.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{State machines}
\label{sec-state_machines}
A {\em state machine} \note{This is such a pervasive idea that it has
  been given many names in many subareas of computer science, control
  theory, and physics, including {\em automaton}, {\em transducer}, and
  {\em dynamical system}.} is a description of a
process (computational, physical, economic) in terms of its potential
sequences of {\em states}.
The {\em state} of a system is defined to be all you would need to
know about the system to predict its future trajectories as well as
possible.  It could be the position and velocity of an object, the
locations of your pieces on a game board, or the current traffic
densities on a highway network.
Formally, we define a {\em state machine} as
$(\mathcal{S}, \mathcal{X}, \mathcal{Y}, s_0, f_s, f_o)$
where
\begin{itemize}
\item $\mathcal{S}$ is a finite or infinite set of possible states;
\item $\mathcal{X}$ is a finite or infinite set of possible inputs;
\item $\mathcal{Y}$ is a finite or infinite set of possible outputs;
\item $s_0 \in \mathcal{S}$ is the initial state of the machine;
\item $f_s: \mathcal{S} \times \mathcal{X} \rightarrow \mathcal{S}$ is a
{\em transition function}\index{transition function}, which takes an
input and a previous state and produces a next state;
\item $f_o: \mathcal{S} \rightarrow \mathcal{Y}$ is an {\em output
function}\index{output function}, which takes a state and produces
an output.
\end{itemize}
The basic operation of the state machine is to \note{In some cases, we
will pick a starting state from a set or distribution.} start with
state $s_0$, then iteratively compute for $t \geq 1$:
\begin{align}
s_t & = f_s(s_{t - 1}, x_t) \\
y_t & = f_o(s_t)
\end{align}
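To make this concrete, here is a minimal Python sketch of the same
loop (the names here are ours, purely for illustration):
\begin{verbatim}
def transduce(f_s, f_o, s0, xs):
    """Run a state machine on the input sequence xs; return outputs."""
    s, ys = s0, []
    for x in xs:
        s = f_s(s, x)      # state update: s_t = f_s(s_{t-1}, x_t)
        ys.append(f_o(s))  # output:       y_t = f_o(s_t)
    return ys
\end{verbatim}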
\begin{examplebox}
The diagram below illustrates this process.  Note that the
``feedback'' connection of $s_t$ back into $f_s$ has to be buffered or
delayed by one time step---otherwise, what it is computing would not
generally be well defined.
\begin{center}
\tikzstyle{block} = [draw, fill=blue!20, rectangle, minimum height=3em, minimum width=3em]
\tikzstyle{sum} = [draw, fill=blue!20, circle, node distance=1cm]
\tikzstyle{input} = [coordinate]
\tikzstyle{output} = [coordinate]
\tikzstyle{pinstyle} = [pin edge={to-,thin,black}]
% The block diagram code is probably more verbose than necessary
\begin{tikzpicture}[auto, node distance=2cm,>=latex']
% We start by placing the blocks
\node [input, name=input] {};
\node [block, right of=input, node distance=2cm] (f) {$f_s$};
\node [block, right of=f,
node distance=4cm] (g) {$f_o$};
% We draw an edge between the controller and system block to
% calculate the coordinate u. We need it to place the measurement block.
\draw [->] (f) -- node[name=u] {$s_t$} (g);
\node [output, right of=g, node distance=2cm] (output) {};
\node [coordinate, name=measurements, below of=f, node distance=2cm] (measurements) {Measurements};
% Once the nodes are placed, connecting them is easy.
\draw [draw,->] (input) -- node {$x_t$} (f);
\draw [->] (g) -- node [name=y_t] {$y_t$}(output);
\draw [->] (u) |- (measurements);
    \draw [->] (measurements) -| node [near end] {$s_{t - 1}$} (f);
\end{tikzpicture}
\end{center}
\end{examplebox}
So, given a sequence of inputs $x_1, x_2, \dots$ the machine generates a
sequence of outputs
$$ \underbrace{f_o(f_s(s_0, x_1))}_{y_1}, \underbrace{f_o(f_s(f_s(s_0, x_1), x_2))}_{y_2}, \dots \;\;.$$
We sometimes say that the machine {\em transduces} sequence $x$ into
sequence $y$.
The output at time $t$ can have dependence on inputs from steps $1$ to
$t$.\note{There are a huge
number of major and minor variations on the idea of a state machine.
We'll just work with one specific one in this section and another
one in the next, but don't worry if you see other variations out in
the world!}
One common subclass is {\em finite state machines}\index{finite state
  machine}, in which $\mathcal S$, $\mathcal X$, and $\mathcal Y$ are
all finite sets.  They are often described using {\em state transition
  diagrams}\index{state transition diagram} such as the one below, in
which nodes stand for states and arcs indicate transitions.  Nodes are
labeled by the output they generate and arcs are labeled by the
input that causes the transition. \note{All computers can be described, at
  the digital level, as finite state machines.  Big, but finite!}
\begin{examplebox}
One can verify that the state machine below reads binary strings and
determines the parity of the number of zeros in the given string.
Check for yourself that all input binary strings end in state $S_1$ if
and only if they contain an even number of zeros.
\begin{center}
\includegraphics[scale=1.0]{figures/FSM.png}
\end{center}
\end{examplebox}
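This machine is small enough to simulate directly.  Here is a sketch
in Python, reusing \verb|transduce| from above (the state encoding is
our own illustrative choice):
\begin{verbatim}
# States: 'S1' = even number of zeros seen so far, 'S2' = odd;
# the labels match the figure.
f_s = lambda s, x: s if x == 1 else ('S2' if s == 'S1' else 'S1')
f_o = lambda s: s   # output the state label itself

print(transduce(f_s, f_o, 'S1', [0, 1, 0, 0]))
# ['S2', 'S2', 'S1', 'S2'] -- three zeros (odd), so we end in S2
\end{verbatim}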
Another common structure, which is simple but powerful and widely used
in signal processing and control, is the {\em linear time-invariant
  (LTI) system}\index{linear time-invariant systems}.  In this case, all
the quantities are real-valued vectors: $\mathcal S = \R^m$, $\mathcal
X = \R^\ell$, and $\mathcal Y = \R^n$, and the functions $f_s$ and $f_o$ are
linear functions of their inputs.
The transition function is described by the state matrix $A$ and the input matrix $B$;
the output function is defined by the output matrix $C$, each with compatible dimensions.
In discrete time, the system is defined by the linear difference equations
\begin{align}
    s_t & = f_s(s_{t-1}, x_t) = A s_{t-1} + B x_t, \\
    y_t & = f_o(s_t) = C s_t \;\;,
\end{align}
where the state stores the relevant information about previous
inputs and outputs.
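As a quick illustration, here is the LTI update in Python with
made-up matrices (none of these values come from the text):
\begin{verbatim}
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # state matrix  (m x m)
B = np.array([[1.0], [0.5]])             # input matrix  (m x 1)
C = np.array([[1.0, -1.0]])              # output matrix (n x m)

s = np.zeros((2, 1))                     # s_0
for x in [1.0, 0.0, 0.0]:                # inputs x_1, x_2, x_3
    s = A @ s + B * x                    # s_t = A s_{t-1} + B x_t
    print((C @ s).ravel())               # y_t = C s_t
\end{verbatim}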
We will study {\em recurrent neural networks}, which are a lot like a
non-linear version of an LTI system.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Recurrent neural networks}
\label{sec-rnn_model}
In Chapter~\ref{chap-neural_networks}, we studied neural networks and
how the weights of a network can be obtained by training on data, so
that the neural network will model a function that approximates the
relationship between the $(x, y)$ pairs in a supervised-learning
training set. In Section~\ref{sec-state_machines} above, we
introduced state machines to describe sequential temporal
behavior. Here in Section~\ref{sec-rnn_model}, we explore recurrent
neural networks by defining the architecture and weight matrices in a
neural network to enable modeling of such state machines. Then, in
Section~\ref{sec-seq2seq_rnn}, we present a loss function that may be
employed for training {\em sequence-to-sequence} {\sc rnn}s, and then
consider an application to language modeling in
Section~\ref{sec-language}.  In Section~\ref{sec-bptt}, we'll see how
to use gradient-descent methods to train the weights of an {\sc rnn}
so that it performs a {\em transduction} that matches as closely as
possible a training set of input-output {\em sequences}.
\bigskip
A {\it recurrent neural network}\index{recurrent neural network} is a state machine with neural
networks constituting functions $f_s$ and $f_o$:
\begin{align}
s_t & = f_s\left(W^{sx}x_t + W^{ss}s_{t - 1} + W^{ss}_0\right) \\
y_t & = f_o\left(W^o s_t + W_0^o\right) \;\;.
\end{align}
The inputs, states, and outputs are all vector-valued:
\begin{align}
x_t & : \ell \times 1 \\
s_t & : m \times 1 \\
y_t & : v \times 1 \;\;.
\end{align}
The weights in the network, then, are
\begin{align}
W^{sx} & : m \times \ell \\
W^{ss} & : m \times m \\
W^{ss}_0 & : m \times 1 \\
W^{o} & : v \times m \\
W^{o}_0 & : v \times 1
\end{align}
with activation functions $f_s$ and $f_o$.
\question{Check dimensions here to be sure it all works out. Remember
that we apply $f_s$ and $f_o$ elementwise, unless $f_o$ is a
softmax activation.}
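As a sketch of how these pieces fit together, here is one forward step
in NumPy, with illustrative dimensions, $f_s = \tanh$, and (as one
simple choice) the identity for $f_o$:
\begin{verbatim}
import numpy as np

l, m, v = 3, 4, 2                        # illustrative dimensions
rng = np.random.default_rng(0)
W_sx = rng.normal(size=(m, l))           # W^{sx}: m x l
W_ss = rng.normal(size=(m, m))           # W^{ss}: m x m
W_ss0 = np.zeros((m, 1))                 # W^{ss}_0: m x 1
W_o = rng.normal(size=(v, m))            # W^o: v x m
W_o0 = np.zeros((v, 1))                  # W^o_0: v x 1

def rnn_step(s_prev, x):
    s = np.tanh(W_sx @ x + W_ss @ s_prev + W_ss0)  # f_s, elementwise
    y = W_o @ s + W_o0                             # f_o = identity here
    return s, y
\end{verbatim}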
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Sequence-to-sequence RNN}
\label{sec-seq2seq_rnn}
Now, how can we set up an {\sc rnn} to model and be trained to produce
a transduction\index{transduction} of one sequence to another? This problem is sometimes
called {\em sequence-to-sequence} mapping\index{sequence-to-sequence mapping}. You can think of it as a
kind of regression problem: given an input sequence, learn to generate
the corresponding output sequence. \note{One way to think of training
a sequence {\bf classifier} is to reduce it to a transduction
problem, where $y_t = 1$ if the sequence $x_1, \ldots, x_t$ is a
{\em positive} example of the class of sequences and $-1$
otherwise.}
A training set has the form
$\left[\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(q)},
y^{(q)}\right)\right]$, where
\begin{itemize}
\item
$x^{(i)}$ and $y^{(i)}$ are sequences of length $n^{(i)}$;
\item
  sequences in the {\em same pair} are the same length, while
  sequences in different pairs may have different lengths.
\end{itemize}
Next, we need a loss function. We start by defining a loss function
on sequences. There are many possible choices, but usually it makes
sense just to sum up a per-element loss function on each of the output
values, where $g$ is the predicted sequence and $y$ is the actual one:
\begin{equation}
\mathcal{L}_{\text{seq}}\left(g^{(i)}, y^{(i)}\right) = \sum_{t =
1}^{n^{(i)}}\mathcal{L}_\text{elt}\left(g_t^{(i)},
y_t^{(i)}\right) \;\;.
\end{equation}
The per-element loss function $\mathcal{L}_\text{elt}$\note{So it could be {\sc nll},
squared loss, etc.}\ will depend on
the type of $y_t$
and what information it is encoding, in the same way as for a
supervised network.
Then, letting $W =\left(W^{sx}, W^{ss}, W^o, W^{ss}_0,
W_0^o\right)$, our overall goal is to minimize the objective
\begin{equation}
J(W) = \frac{1}{q} \sum_{i = 1}^q\mathcal{L}_{\text{seq}}\left(
\text{RNN}(x^{(i)};W), y^{(i)}\right) \;\;,
\end{equation}
where $\text{RNN}(x; W)$ is the output sequence generated, given
input sequence $x$.
It is typical to choose $f_s$ to be $\tanh$ \note{Remember that it
  looks like a sigmoid but ranges from $-1$ to $+1$.} but any non-linear
activation function is usable. We choose $f_o$ to align with the
types of our outputs and the loss function, just as we would do in
regular supervised learning.
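For instance, with a squared per-element loss, the sequence loss is a
direct sum (a sketch; other choices of $\mathcal{L}_\text{elt}$ drop
in the same way):
\begin{verbatim}
import numpy as np

def L_elt(g_t, y_t):             # squared loss, one possible choice
    return 0.5 * np.sum((g_t - y_t) ** 2)

def L_seq(g, y):                 # g, y: equal-length lists of vectors
    return sum(L_elt(g_t, y_t) for g_t, y_t in zip(g, y))
\end{verbatim}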
\section{RNN as a language model}
\label{sec-language}
A {\em language model}\index{language model} is a sequence-to-sequence {\sc rnn} which is
trained on a token\note{A ``token'' is generally a
  character, a common word fragment, or a word.} sequence of the form $c = (c_1, c_2,
\ldots, c_k)$, and is used to predict the next token $c_t$, for $t \leq k$,
given the sequence of the previous $(t-1)$ tokens:
\begin{equation}
    c_t = \text{RNN}\left((c_1, c_2, \dots, c_{t - 1}) ; W \right)\;\;.
\end{equation}
We can convert this to a sequence-to-sequence training problem by
constructing a data set of $q$ different $(x, y)$ sequence pairs, where we make up
new special tokens, $\langle\text{start}\rangle$ and $\langle\text{end}\rangle$, to signal the
beginning and end of the sequence:
\begin{align}
    x & = (\langle\text{start}\rangle, c_1, c_2, \ldots, c_k) \\
    y & = (c_1, c_2, \dots, c_k, \langle\text{end}\rangle)
\end{align}
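Constructing such a pair is mechanical; a sketch (the token strings
here are made up for illustration):
\begin{verbatim}
def make_pair(tokens):
    x = ['<start>'] + tokens
    y = tokens + ['<end>']
    return x, y

make_pair(['the', 'cat', 'sat'])
# (['<start>', 'the', 'cat', 'sat'], ['the', 'cat', 'sat', '<end>'])
\end{verbatim}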
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Back-propagation through time}
\label{sec-bptt}
Now the fun begins! We can now try to find a $W$ to minimize $J$
using gradient descent. We will work through the simplest method,
{\em back-propagation through time}\index{backpropagation through
time} ({\sc bptt}), in detail. This is generally not the best
method to use, but it's relatively easy to understand. In
Section~\ref{lstm} we will sketch alternative methods that are in much
more common use.
\bigskip
\begin{noticebox}
What we want you to take away from this section is that, by
``unrolling'' a recurrent network out to model a particular
sequence, we can treat the whole thing as a feed-forward network
with a lot of parameter sharing. Thus, we can tune the parameters
using stochastic gradient descent, and learn to model sequential
mappings. The concepts here are very important. While the details
are important to get right if you need to implement something, we
present the mathematical details below primarily to convey or
explain the larger concepts.
\end{noticebox}
\begin{examplebox} {\bf Calculus reminder: total derivative} Most of
us are not very careful about the difference between the {\em
partial derivative} and the {\em total derivative}\index{total derivative}. We are going
to use a nice example from the Wikipedia article on partial
derivatives to illustrate the difference.
The volume of a circular cone depends on its height and radius:
\begin{equation}
V(r, h) = \frac{\pi r^2 h}{3}\;\;.
\end{equation}
The partial derivatives of volume with respect to height and radius
are
\begin{equation}
\frac{\partial V}{\partial r} = \frac{2\pi r
h}{3}\;\;\;\text{and}\;\;\;
\frac{\partial V}{\partial h} = \frac{\pi r^2}{3}\;\;.
\end{equation}
They measure the change in $V$ assuming everything is held constant
except the single variable we are changing.
Now assume that we want to preserve the cone's proportions in the sense that the ratio of radius to height stays constant.
Then we can't really change one without changing the other.
In this case, we really have to think about the {\em total derivative}.
If we're interested in the total derivative with respect to $r$, we sum the ``paths'' along which $r$ might influence $V$:
\begin{align}
\frac{dV}{dr} & = \frac{\partial V}{\partial r} + \frac{\partial
V}{\partial h} \frac{dh}{dr} \\
& = \frac{2 \pi r h}{3} + \frac{\pi r^2}{3} \frac{dh}{dr}
\end{align}
Or if we're interested in the total derivative with respect to $h$, we consider how $h$ might influence $V$, either directly or via $r$:
\begin{align}
\frac{dV}{dh} & = \frac{\partial V}{\partial h} + \frac{\partial
V}{\partial r} \frac{dr}{dh} \\
& = \frac{\pi r^2}{3} + \frac{2 \pi r h}{3} \frac{dr}{dh}
\end{align}
Just to be completely concrete, let's think of a right circular cone
with a fixed apex angle $\alpha$ satisfying $\tan \alpha = r / h$, so that if we change $r$ or
$h$ then $\alpha$ remains constant.  So we have $r = h \tan
\alpha$; let constant $c = \tan \alpha$, so now $r = c h$.
Thus, we finally have
\begin{align}
\frac{dV}{dr} & = \frac{2 \pi r h}{3} + \frac{\pi r^2}{3} \frac{1}{c} \\
\frac{dV}{dh} & = \frac{\pi r^2}{3} + \frac{2 \pi r h}{3} c \; .
\end{align}
\end{examplebox}
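The final expressions in the box are easy to check symbolically; here
is one such check in Python with \verb|sympy|:
\begin{verbatim}
import sympy as sp

h, c = sp.symbols('h c', positive=True)
r = c * h                          # the proportionality constraint
V = sp.pi * r**2 * h / 3           # volume with r eliminated
dV_dh = sp.diff(V, h)              # total derivative dV/dh
expected = sp.pi * r**2 / 3 + (2 * sp.pi * r * h / 3) * c
print(sp.simplify(dV_dh - expected))   # 0
\end{verbatim}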
\noindent The {\sc bptt} process goes like this:
\begin{enumerate}[(1)]
\item
Sample a training pair of sequences $(x, y)$; let their length be $n$.
\item
``Unroll'' the RNN to be length $n$ (picture for $n = 3$ below), and
initialize $s_0$:
\centerline{\includegraphics[width=\textwidth]{figures/rnn_unrolled_with_offsets.png}}
Now, we can see our problem as one of performing what is almost an
ordinary back-propagation training procedure in a feed-forward neural
network, but with the difference that the weight matrices are shared
among the layers. In many ways, this is similar to what ends up
happening in a convolutional network, except in the conv-net, the
weights are re-used spatially, and here, they are re-used temporally.
\item
Do the {\it forward pass}, to compute the predicted output sequence $g$:
\begin{align}
z_t^1 & = W^{sx}x_t + W^{ss}s_{t - 1} + W^{ss}_0 \\
s_t & = f_s(z_t^1) \\
z_t^2 & = W^os_t + W_0^o \\
g_t & = f_o(z_t^2)
\end{align}
\item
Do the {\em backward pass} to compute the gradients.  For both $W^{ss}$ and
$W^{sx}$ we need to find
\begin{align}
  \frac{d \mathcal{L}_\text{seq}(g,y)}{d W} & = \sum_{u = 1}^n\frac{d \mathcal{L}_\text{elt}(g_u, y_u)}{d W} \;\;. \nonumber
\end{align}
Letting $\mathcal{L}_u = \mathcal{L}_\text{elt}(g_u, y_u)$ and using the {\em total derivative}, which is a sum over all
the ways in which $W$ affects $\mathcal{L}_u$, we have
\begin{align}
  \frac{d \mathcal{L}_\text{seq}(g,y)}{d W} & = \sum_{u = 1}^n\sum_{t = 1}^n \frac{\partial s_t}{\partial W} \frac{\partial \mathcal{L}_u}{\partial s_t} \;\;. \nonumber
\end{align}
Re-organizing, we have
\begin{align}
  \frac{d \mathcal{L}_\text{seq}(g,y)}{d W} & = \sum_{t = 1}^n\frac{\partial s_t}{\partial W} \sum_{u =
  1}^n\frac{\partial \mathcal{L}_u}{\partial s_t} \;\;. \nonumber
\end{align}
Because $s_t$ only affects $\mathcal{L}_t, \mathcal{L}_{t + 1}, \dots, \mathcal{L}_n$,
\begin{align}
  \frac{d \mathcal{L}_\text{seq}(g,y)}{d W} & = \sum_{t = 1}^n\frac{\partial s_t}{\partial W} \sum_{u = t}^n\frac{\partial \mathcal{L}_u}{\partial s_t} \nonumber \\
  & = \sum_{t = 1}^n\frac{\partial s_t}{\partial W}
      \left(\frac{\partial \mathcal{L}_t}{\partial s_t} + \underbrace{\sum_{u = t +
      1}^n\frac{\partial \mathcal{L}_u}{\partial s_t}}_{\delta^{s_t}}\right) \;, \label{sumeq}
\end{align}
where $\delta^{s_t}$ is the dependence of the future loss (incurred after step $t$) on the
state $s_t$.\note{That is, $\delta^{s_t}$ is how much we can
  blame state $s_t$ for all the future element losses.}
We can compute this backwards, with $t$ going from $n$ down to $1$.
The trickiest part is figuring out how early states contribute to later
losses. We define the {\it{future loss}} after step $t$ to be
\begin{equation}
F_t = \sum_{u = t + 1}^{n}\mathcal{L}_\text{elt}(g_u, y_u) \;\;,
\end{equation}
so
\begin{equation}
\delta^{s_t} = \frac{\partial F_t}{\partial s_t}\;\;.
\end{equation}
At the last stage, $F_n = 0$ so $\delta^{s_n} = 0$.
Now, working backwards,
\begin{align}
\delta^{s_{t -1}} & = \frac{\partial}{\partial s_{t - 1}}\sum_{u = t}^n\mathcal{L}_\text{elt}(g_u, y_u) \\
& = \frac{\partial s_t}{\partial s_{t - 1}} \frac{\partial}{\partial s_t}\sum_{u = t}^n\mathcal{L}_\text{elt}(g_u, y_u) \\
& = \frac{\partial s_t}{\partial s_{t - 1}} \frac{\partial}{\partial s_t}\left[\mathcal{L}_\text{elt}(g_t, y_t) + \sum_{u = t + 1}^n\mathcal{L}_\text{elt}(g_u, y_u)\right] \\
  & = \frac{\partial s_t}{\partial s_{t - 1}} \left[\frac{\partial \mathcal{L}_\text{elt}(g_t, y_t)}{\partial s_t} + \delta^{s_t}\right] \;\;.
\end{align}
Now, we can use the chain rule again to find the dependence of the
element loss at time $t$ on the state at that same time,
\begin{equation}
\underbrace{\frac{\partial \mathcal{L}_\text{elt}(g_t,
y_t)}{\partial s_t}}_{(m \times 1)} = \underbrace{\frac{\partial
z_t^2}{\partial s_t}}_{(m \times v)} ~
\underbrace{\frac{\partial \mathcal{L}_\text{elt}(g_t, y_t)}{\partial
z_t^2}}_{(v \times 1)}\;\;,
\end{equation}
and the dependence of the state at time $t$ on the state at the
previous time,
\begin{equation}
\underbrace{\frac{\partial s_t}{\partial s_{t - 1}}}_{(m \times m)}
= \underbrace{\frac{\partial z_t^1}{\partial s_{t - 1}}}_{(m \times
m)} \underbrace{\frac{\partial s_t}{\partial z_t^1}}_{(m
\times m)} = {W^{ss}}^T \frac{\partial s_t}{\partial z_t^1}
\end{equation}
\begin{examplebox} Note that $\partial s_t /\partial z^1_t$
  is formally an $m \times m$ diagonal matrix, with the diagonal values
  $f_{s}'(z_{t,i}^1)$, $1 \leq i \leq m$.  Since it is diagonal, we can
  represent it as the $m \times 1$ vector $f_{s}'(z_t^1)$.  In that case,
  the product of the matrix ${W^{ss}}^T$ with the vector $f_{s}'(z_t^1)$,
  denoted ${W^{ss}}^T * f_{s}'(z_t^1)$, should be interpreted columnwise:
  multiply each element of the first column of ${W^{ss}}^T$ by the first
  element of the vector, each element of the second column by the second
  element of the vector, and so on.
\end{examplebox}
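A quick numerical check of this columnwise interpretation (the values
are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
W_T = rng.normal(size=(3, 3))      # stands in for (W^{ss})^T
d = rng.normal(size=3)             # the diagonal of ds_t / dz^1_t
assert np.allclose(W_T @ np.diag(d), W_T * d)   # same result
\end{verbatim}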
Putting this all together, we end up with
\begin{equation}
\delta^{s_{t - 1}} = \underbrace{{W^{ss}}^T \frac{\partial s_t}{\partial z_t^1}}_{\frac{\partial s_t}{\partial s_{t - 1}}} ~ \underbrace{\left({W^o}^T\frac{\partial{\mathcal{L}_t}}{\partial z_t^2} + \delta^{s_t}\right)}_{\frac{\partial F_{t - 1}}{\partial s_t}}
\end{equation}
We're almost there! Now, we can describe the actual weight updates.
Using Eq.~\ref{sumeq} and recalling the definition of
$\delta^{s_t} = \partial F_t / \partial s_t$,
as we iterate backwards, we can accumulate the terms in Eq.~\ref{sumeq}
to get the gradient for the whole loss.
\begin{align}
\frac{d \mathcal{L}_\text{seq}}{d W^{ss}} & = \sum_{t=1}^n
\frac{d \mathcal{L}_\text{elt}(g_t, y_t)}{d W^{ss}} = \sum_{t=1}^n \frac{\partial z_t^1}{\partial W^{ss}} \frac{\partial s_t}{\partial z_t^1} \frac{\partial F_{t - 1}}{\partial s_t} \\
\frac{d \mathcal{L}_\text{seq}}{d W^{sx}} & = \sum_{t=1}^n
\frac{d \mathcal{L}_\text{elt}(g_t, y_t)}{d W^{sx}} = \sum_{t=1}^n
\frac{\partial z_t^1}{\partial W^{sx}} \frac{\partial s_t}{\partial z_t^1} \frac{\partial F_{t-1}}{\partial s_t}
\end{align}
We can handle $W^o$ separately; it's easier because it does not
affect future losses in the way that the other weight matrices do:
\begin{equation}
\frac{d\mathcal{L}_\text{seq}}{d W^o} = \sum_{t = 1}^n\frac{d \mathcal{L}_t}{d W^o} = \sum_{t = 1}^n\frac{\partial \mathcal{L}_t}{\partial
z_t^2} \frac{\partial z_t^2}{\partial W^o}
\end{equation}
Assuming $\frac{\partial \mathcal{L}_t}{\partial z_t^2} = (g_t - y_t)$
(which ends up being true for squared loss, softmax-{\sc nll}, etc.), then
\begin{equation}
\underbrace{\frac{d \mathcal{L}_\text{seq}}{d W^o}}_{v \times m} = \sum_{t=1}^n \underbrace{(g_t - y_t)}_{v \times 1} ~ \underbrace{s_t^T}_{1 \times m} \; .
\end{equation}
Whew!
\end{enumerate}
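Putting steps (3) and (4) together, here is a hedged NumPy sketch of
{\sc bptt} for this network, assuming $f_s = \tanh$, a linear output,
and a loss with $\partial \mathcal{L}_t / \partial z^2_t = g_t - y_t$
as above.  The offset gradients are left out here; deriving them is
the point of the question below.
\begin{verbatim}
import numpy as np

def bptt(xs, ys, W_sx, W_ss, W_ss0, W_o, W_o0):
    n, m = len(xs), W_ss.shape[0]
    ss, gs = [np.zeros((m, 1))], []          # s_0 and predictions
    for x in xs:                             # forward pass
        s = np.tanh(W_sx @ x + W_ss @ ss[-1] + W_ss0)
        ss.append(s)
        gs.append(W_o @ s + W_o0)
    dW_sx, dW_ss, dW_o = (np.zeros_like(W_sx),
                          np.zeros_like(W_ss), np.zeros_like(W_o))
    delta = np.zeros((m, 1))                 # delta^{s_n} = 0
    for t in range(n, 0, -1):                # backward pass
        dz2 = gs[t-1] - ys[t-1]              # dL_t / dz^2_t
        dW_o += dz2 @ ss[t].T                # (g_t - y_t) s_t^T
        dF = W_o.T @ dz2 + delta             # dL_t/ds_t + delta^{s_t}
        a = (1 - ss[t]**2) * dF              # through tanh: ds_t/dz^1_t
        dW_ss += a @ ss[t-1].T
        dW_sx += a @ xs[t-1].T
        delta = W_ss.T @ a                   # delta^{s_{t-1}}
    return dW_sx, dW_ss, dW_o
\end{verbatim}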
\question{Derive the updates for the offsets $W^{ss}_0$ and $W^o_0$.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Vanishing gradients and gating mechanisms}
\label{lstm}
\label{sec-rnn_lstm}
Let's take a careful look at the backward propagation of the gradient
along the sequence:
\begin{equation}
\delta^{s_{t -1}} = \frac{\partial s_t}{\partial s_{t - 1}} ~
\left[\frac{\partial \mathcal{L}_\text{elt}(g_t, y_t)}{\partial s_t}
+ \delta^{s_t}\right]\;\;.
\end{equation}
Consider a case where only the output at the end of the sequence is
incorrect, but it depends critically, via the weights, on the input
at time 1.  In this case, the gradient of the loss at step $n$ will include the product
\begin{equation}
\frac{\partial s_2}{\partial s_1} \frac{\partial s_3}{\partial
s_2} \cdots \frac{\partial s_n}{\partial s_{n-1}}\;\;.
\end{equation}
In general, this quantity will either grow or shrink exponentially
with the length of the sequence, and make it very difficult to train.
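A two-line illustration of the effect: even a mild per-step
contraction or expansion compounds over many steps.
\begin{verbatim}
import numpy as np

# Stand-in Jacobians c*I: after 50 steps, the norm of the product
# has either collapsed or blown up.
for c in (0.9, 1.1):
    print(c, np.linalg.norm(np.linalg.matrix_power(c * np.eye(4), 50)))
    # 0.9 -> ~0.01,  1.1 -> ~235
\end{verbatim}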
\question{The last time we talked about exploding and vanishing
gradients, it was to justify per-weight adaptive step sizes. Why is
that not a solution to the problem this time?}
An important insight that really made recurrent networks work well on
long sequences is the idea of {\em gating}\index{recurrent neural network!gating}.
\subsection{Simple gated recurrent networks}
A computer only ever updates some parts of its memory on each
computation cycle. We can take this idea and use it to make our
networks more able to retain state values over time and to make the
gradients better-behaved. We will add a new component to our network,
called a {\em gating network}\index{gating network}.  Let $g_t$ be an $m \times 1$ vector of
values and let $W^{gx}$ and $W^{gs}$ be $m \times \ell$ and $m \times m$
weight matrices, respectively.  We will compute $g_t$ as
\note{It can have an offset, too, but we are omitting it for simplicity.}
\begin{equation}
g_t = \text{sigmoid}(W^{gx} x_t + W^{gs} s_{t-1})
\end{equation}
and then change the computation of $s_t$ to be
\begin{equation}
s_t = (1 - g_t) * s_{t-1} + g_t * f_s(W^{sx}x_t + W^{ss}
s_{t-1} + W^{ss}_0)\;\;,
\end{equation}
where $*$ is component-wise multiplication. We can see, here, that
the output of the gating network is deciding, for each dimension of
the state, how much it should be updated now. This mechanism makes it
much easier for the network to learn to, for example, ``store'' some
information in some dimension of the state, and then not change it
during future state updates, or change it only under certain
conditions on the input or other aspects of the state.
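A sketch of this gated update in NumPy (offsets on the gate omitted,
as in the text; the weights are taken as given):
\begin{verbatim}
import numpy as np

def gated_step(s_prev, x, W_gx, W_gs, W_sx, W_ss, W_ss0):
    g = 1 / (1 + np.exp(-(W_gx @ x + W_gs @ s_prev)))   # gate in (0, 1)
    cand = np.tanh(W_sx @ x + W_ss @ s_prev + W_ss0)    # candidate state
    return (1 - g) * s_prev + g * cand                  # per-dim blend
\end{verbatim}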
\question{Why is it important that the activation function for $g$ be
a sigmoid?}
\subsection{Long short-term memory}
The idea of gating networks can be applied to make a state machine
that is even more like a computer memory, resulting in a type of
network called an {\sc lstm}\index{long short-term memory} for ``long short-term memory.''
\note{Yet another awesome name for a neural network!} We won't go
into the details here, but the basic idea is that there is a memory
cell (really, our state vector) and three (!) gating networks. The
{\em input} gate selects (using a ``soft'' selection as in the gated
network above) which dimensions of the state will be updated with new
values; the {\em forget} gate decides which dimensions of the state
will have their old values moved toward 0; and the {\em output} gate
decides which dimensions of the state will be used to compute the
output value. These networks have been used in applications like
language translation with really amazing results. A diagram of the
architecture is shown below:
\begin{center}
\includegraphics[scale = 0.7]{figures/lstm.png}
\end{center}
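For concreteness, here is a simplified sketch of one {\sc lstm} step in
the spirit of the description above.  Offsets are omitted, and the
memory cell $c_t$ is kept distinct from the gated output $h_t$:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, W):
    zx = np.concatenate([x, h_prev])           # shared gate input
    i = sigmoid(W['i'] @ zx)                   # input gate
    f = sigmoid(W['f'] @ zx)                   # forget gate
    o = sigmoid(W['o'] @ zx)                   # output gate
    c = f * c_prev + i * np.tanh(W['c'] @ zx)  # memory cell update
    h = o * np.tanh(c)                         # gated output
    return c, h
\end{verbatim}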
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: