Adadelta_12.9 #83

Open
wants to merge 2 commits into
base: master
170 changes: 170 additions & 0 deletions Ch12_Optimization_Algorithms/Adadelta.ipynb
@@ -0,0 +1,170 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Adadelta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like RMSProp, Adadelta is an optimization algorithm that improves on Adagrad,\n",
"which has difficulty finding useful solutions at later stages of iteration\n",
"because its accumulated squared gradients keep shrinking the effective\n",
"learning rate :cite:`Zeiler.2012`. Notably, the Adadelta algorithm has no learning rate\n",
"hyperparameter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Algorithm\n",
"\n",
"Like RMSProp, the Adadelta algorithm uses the state variable $\\boldsymbol{s}_t$, an exponentially weighted moving average (EWMA) of the element-wise squares of the mini-batch stochastic gradient $\\boldsymbol{g}_t$. At time step 0, all of its elements are initialized to 0.\n",
"Given the hyperparameter $0 \\leq \\rho < 1$ (the counterpart of $\\gamma$ in RMSProp), at time step $t>0$ we compute $\\boldsymbol{s}_t$ in the same way as RMSProp:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\boldsymbol{s}_t \\leftarrow \\rho \\boldsymbol{s}_{t-1} + (1 - \\rho) \\boldsymbol{g}_t \\odot \\boldsymbol{g}_t. $$\n",
"\n",
"Unlike RMSProp, Adadelta maintains an additional state variable, $\\Delta\\boldsymbol{x}_t$, the elements of which are also initialized to 0 at time step 0. We use $\\Delta\\boldsymbol{x}_{t-1}$ to compute the rescaled gradient $\\boldsymbol{g}_t'$, which is the change applied to the independent variable:\n",
"\n",
"$$ \\boldsymbol{g}_t' \\leftarrow \\sqrt{\\frac{\\Delta\\boldsymbol{x}_{t-1} + \\epsilon}{\\boldsymbol{s}_t + \\epsilon}} \\odot \\boldsymbol{g}_t. $$\n",
"\n",
"Here, $\\epsilon$ is a small constant, such as $10^{-5}$, added to maintain numerical stability. Next, we update the independent variable:\n",
"\n",
"$$\\boldsymbol{x}_t \\leftarrow \\boldsymbol{x}_{t-1} - \\boldsymbol{g}'_t. $$\n",
"\n",
"Finally, we use $\\Delta\\boldsymbol{x}_t$ to record the EWMA of the element-wise squares of $\\boldsymbol{g}'_t$, the change in the independent variable:\n",
"\n",
"$$\\Delta\\boldsymbol{x}_t \\leftarrow \\rho \\Delta\\boldsymbol{x}_{t-1} + (1 - \\rho) \\boldsymbol{g}'_t \\odot \\boldsymbol{g}'_t. $$\n",
"\n",
"As we can see, if the impact of $\\epsilon$ is ignored, Adadelta differs from RMSProp only in replacing the learning rate hyperparameter $\\eta$ with $\\sqrt{\\Delta\\boldsymbol{x}_{t-1}}$.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation from Scratch\n",
"\n",
"Adadelta needs to maintain two state variables for each independent variable, $\\boldsymbol{s}_t$ and $\\Delta\\boldsymbol{x}_t$. We implement Adadelta directly from the formulas in the algorithm."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import d2l\n",
"from d2l import load_array\n",
"\n",
"def init_adadelta_states(feature_dim):\n",
"    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)\n",
"    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)\n",
"    return ((s_w, delta_w), (s_b, delta_b))\n",
"\n",
"def adadelta(params, states, hyperparams):\n",
"    rho, eps = hyperparams['rho'], 1e-5\n",
"    for p, (s, delta) in zip(params, states):\n",
"        with torch.no_grad():\n",
"            # EWMA of the squared gradient (p.grad, not the parameter p itself)\n",
"            s[:] = rho * s + (1 - rho) * p.grad * p.grad\n",
"            # Rescale the gradient by sqrt(delta + eps) / sqrt(s + eps)\n",
"            g = ((delta + eps).sqrt() / (s + eps).sqrt()) * p.grad\n",
"            p[:] -= g\n",
"            # EWMA of the squared update\n",
"            delta[:] = rho * delta + (1 - rho) * g * g\n",
"        # Reset the gradient so it does not accumulate into the next mini-batch\n",
"        p.grad.zero_()"
]
},
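{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we train the model with the hyperparameter $\\rho=0.9$. Note that the cell below passes PyTorch's built-in `torch.optim.Adadelta` optimizer to the training function rather than the from-scratch `adadelta` function defined above."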
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.259, 0.075 sec/epoch\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 252x180 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"data_iter, feature_dim = d2l.get_data_ch10(batch_size=10)\n",
"d2l.train_ch10(torch.optim.Adadelta, {'rho': 0.9}, data_iter, feature_dim);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"* Adadelta has no learning rate hyperparameter. Instead, it uses the square root of an EWMA of the element-wise squares of the change in the independent variable in place of the learning rate.\n",
"\n",
"## Exercises\n",
"\n",
"* Adjust the value of $\\rho$ and observe the experimental results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
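
For reference, the update rule described in the notebook's "The Algorithm" section can be checked in isolation with a minimal sketch like the one below. It assumes plain PyTorch tensors and a toy quadratic objective with an analytic gradient. The names `adadelta_step` and `run_demo` are hypothetical and are not part of `d2l` or of this PR.

```python
import torch


def adadelta_step(x, grad, s, delta, rho=0.9, eps=1e-5):
    """Apply one in-place Adadelta update to x (hypothetical helper)."""
    # s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    s.mul_(rho).add_((1 - rho) * grad * grad)
    # g'_t = sqrt((delta_{t-1} + eps) / (s_t + eps)) * g_t
    g_prime = torch.sqrt((delta + eps) / (s + eps)) * grad
    # x_t = x_{t-1} - g'_t
    x.sub_(g_prime)
    # delta_t = rho * delta_{t-1} + (1 - rho) * g'_t^2
    delta.mul_(rho).add_((1 - rho) * g_prime * g_prime)


def run_demo(num_steps=500):
    """Run Adadelta on the toy objective f(x) = 0.1 * x1^2 + 2 * x2^2."""
    x = torch.tensor([-5.0, -2.0])
    s, delta = torch.zeros_like(x), torch.zeros_like(x)
    coeffs = torch.tensor([0.1, 2.0])
    for _ in range(num_steps):
        grad = 2 * coeffs * x  # analytic gradient of the toy objective
        adadelta_step(x, grad, s, delta)
    return x, s, delta
```

With both states initialized to zero, the first step has a magnitude of roughly $\sqrt{\epsilon / (1 - \rho)}$ in every coordinate whenever the gradient is not tiny, so the early progress of this sketch depends on `eps` and `rho` rather than on the scale of the gradient.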