Deep Learning Tuning Playbook

This is not an officially supported Google product.

Varun Godbole<sup>†</sup>, George E. Dahl<sup>†</sup>, Justin Gilmer<sup>†</sup>, Christopher J. Shallue<sup>‡</sup>, Zachary Nado<sup>†</sup>

† Google Research, Brain Team

‡ Harvard University

Who is this document for?

This document is for engineers and researchers (both individuals and teams) interested in maximizing the performance of deep learning models. We assume basic knowledge of machine learning and deep learning concepts.

Our emphasis is on the process of hyperparameter tuning. We touch on other aspects of deep learning training, such as pipeline implementation and optimization, but our treatment of those aspects is not intended to be complete.

We assume the machine learning problem is a supervised learning problem or something that looks a lot like one (e.g. self-supervised). That said, some of the prescriptions in this document may also apply to other types of problems.

Why a tuning playbook?

Currently, there is an astonishing amount of toil and guesswork involved in actually getting deep neural networks to work well in practice. Even worse, the actual recipes people use to get good results with deep learning are rarely documented. Papers gloss over the process that led to their final results in order to present a cleaner story, and machine learning engineers working on commercial problems rarely have time to take a step back and generalize their process. Textbooks tend to eschew practical guidance and prioritize fundamental principles, even if their authors have the necessary experience in applied work to provide useful advice. When preparing to create this document, we couldn't find any comprehensive attempt to actually explain how to get good results with deep learning. Instead, we found snippets of advice in blog posts and on social media, tricks peeking out of the appendix of research papers, occasional case studies about one particular project or pipeline, and a lot of confusion. There is a vast gulf between the results achieved by deep learning experts and less skilled practitioners using superficially similar methods. At the same time, these very experts readily admit some of what they do might not be well-justified. As deep learning matures and has a larger impact on the world, the community needs more resources covering useful recipes, including all the practical details that can be so critical for obtaining good results.

We are a team of five researchers and engineers who have worked in deep learning for many years, some of us since as early as 2006. We have applied deep learning to problems in everything from speech recognition to astronomy, and learned a lot along the way. This document grew out of our own experience training neural networks, teaching new machine learning engineers, and advising our colleagues on the practice of deep learning. Although it has been gratifying to see deep learning go from a machine learning approach practiced by a handful of academic labs to a technology powering products used by billions of people, deep learning is still in its infancy as an engineering discipline and we hope this document encourages others to help systematize the field's experimental protocols.

This document came about as we tried to crystalize our own approach to deep learning and thus it represents the opinions of the authors at the time of writing, not any sort of objective truth. Our own struggles with hyperparameter tuning made it a particular focus of our guidance, but we also cover other important issues we have encountered in our work (or seen go wrong). Our intention is for this work to be a living document that grows and evolves as our beliefs change. For example, the material on debugging and mitigating training failures would not have been possible for us to write two years ago since it is based on recent results and ongoing investigations. Inevitably, some of our advice will need to be updated to account for new results and improved workflows. We do not know the optimal deep learning recipe, but until the community starts writing down and debating different procedures, we cannot hope to find it. To that end, we would encourage readers who find issues with our advice to produce alternative recommendations, along with convincing evidence, so we can update the playbook. We would also love to see alternative guides and playbooks that might have different recommendations so we can work towards best practices as a community. Finally, any sections marked with a 🤖 emoji are places we would like to do more research. Only after trying to write this playbook did it become completely clear how many interesting and neglected research questions can be found in the deep learning practitioner's workflow.

Guide for starting a new project

Many of the decisions we make over the course of tuning can be made once at the beginning of a project and only occasionally revisited when circumstances change.

Our guidance below makes the following assumptions:

Choosing the model architecture

Summary: When starting a new project, try to reuse a model that already works.

Choosing the optimizer

Summary: Start with the most popular optimizer for the type of problem at hand.

Choosing the batch size

Summary: The batch size governs the training speed and shouldn't be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.

Determining the feasible batch sizes and estimating training throughput

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center">training throughput = (# examples processed per second)</p> <p align="center">or, equivalently, the <em>time per step</em>.</p> <p align="center">time per step = (batch size) / (training throughput)</p> </details>
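As a rough sketch of how these quantities can be measured in practice (the `train_step` function and the fixed-size `batch` below are hypothetical stand-ins for your own pipeline):

```python
import time

def measure_time_per_step(train_step, batch, num_steps=100, warmup_steps=10):
    """Sketch: time an already-compiled train_step on a fixed batch size.

    `train_step` and `batch` are hypothetical stand-ins for your own training
    function and a batch of the candidate size.
    """
    for _ in range(warmup_steps):  # exclude compilation / warm-up overhead
        train_step(batch)
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step(batch)
    elapsed = time.perf_counter() - start

    time_per_step = elapsed / num_steps
    training_throughput = len(batch) / time_per_step  # examples per second
    return time_per_step, training_throughput
```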

Choosing the batch size to minimize training time

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center">Training time = (time per step) x (total number of steps)</p> </details>

Choosing the batch size to minimize resource consumption

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center">Resource consumption = (resource consumption per step) x (total number of steps)</p> </details>
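Putting the identities above together, a hedged sketch of how candidate batch sizes might be compared; every input here is a hypothetical measurement, keyed by batch size:

```python
def estimate_costs(batch_sizes, time_per_step, steps_needed, resource_per_step):
    """Sketch: compare candidate batch sizes using the identities above.

    All arguments are hypothetical dicts keyed by batch size: measured time per
    step, the number of steps needed to reach the target performance, and
    resource consumption (e.g. accelerator-hours) per step.
    """
    for b in batch_sizes:
        training_time = time_per_step[b] * steps_needed[b]
        resource_consumption = resource_per_step[b] * steps_needed[b]
        print(f"batch size {b}: training time {training_time:.0f}s, "
              f"resource consumption {resource_consumption:.1f}")
```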

Changing the batch size requires re-tuning most hyperparameters


How batch norm interacts with the batch size


Choosing the initial configuration

A scientific approach to improving model performance

For the purposes of this document, the ultimate goal of machine learning development is to maximize the utility of the deployed model. Even though many aspects of the development process differ between applications (e.g. length of time, available computing resources, type of model), we can typically use the same basic steps and principles on any problem.

Our guidance below makes the following assumptions:

The incremental tuning strategy

Summary: Start with a simple configuration and incrementally make improvements while building up insight into the problem. Make sure that any improvement is based on strong evidence to avoid adding unnecessary complexity.

At a high level, our incremental tuning strategy involves repeating the following four steps:

  1. Identify an appropriately-scoped goal for the next round of experiments.
  2. Design and run a set of experiments that makes progress towards this goal.
  3. Learn what we can from the results.
  4. Consider whether to launch the new best configuration.

The remainder of this section will consider this strategy in much greater detail.

Exploration vs exploitation

Summary: Most of the time, our primary goal is to gain insight into the problem.

Choosing the goal for the next round of experiments

Summary: Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal.

Designing the next round of experiments

Summary: Identify which hyperparameters are scientific, nuisance, and fixed hyperparameters for the experimental goal. Create a sequence of studies to compare different values of the scientific hyperparameters while optimizing over the nuisance hyperparameters. Choose the search space of nuisance hyperparameters to balance resource costs with scientific value.

Identifying scientific, nuisance, and fixed hyperparameters

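As a purely illustrative example of this three-way categorization (the specific hyperparameters and values below are hypothetical, loosely modeled on the weight decay study in Figure 2):

```python
# Hypothetical categorization for one round of experiments whose goal is to
# ask: "does adding weight decay improve our ResNet-50 ImageNet baseline?"
hyperparameters = {
    "scientific": {          # what we want to draw conclusions about
        "weight_decay": [0.0, 1e-4, 1e-3],
    },
    "nuisance": {            # must be (re-)optimized for each scientific value
        "learning_rate": "log-uniform(1e-3, 1e0)",
        "warmup_steps": "uniform(0, 2000)",
    },
    "fixed": {               # held constant; conclusions are conditional on these
        "optimizer": "nesterov_momentum",
        "batch_size": 1024,  # chosen for hardware throughput, not tuned here
        "architecture": "resnet50",
    },
}
```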

Creating a set of studies


Striking a balance between informative and affordable experiments


Extracting insight from experimental results

Summary: In addition to trying to achieve the original scientific goal of each group of experiments, go through a checklist of additional questions and, if issues are discovered, revise the experiments and rerun them.

Identifying bad search space boundaries

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center" id="figure-1"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/bad_search_space.png" width="49%" alt="Example of bad search space boundaries"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/good_search_space.png" width="49%" alt="Example of good search space boundaries"> </p> <p align="center"><b>Figure 1:</b> Examples of bad search space boundaries and acceptable search space boundaries.</p> </details>

Not sampling enough points in the search space


Examining the training curves

<details><summary><em>[Click to expand]</em></summary> <br>

Summary: Examining the training curves is an easy way to identify common failure modes and can help us prioritize what actions to take next.

</details>

Detecting whether a change is useful with isolation plots

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center" id="figure-2"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/isolation_plot.png" width="49%" alt="Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet."> </p> <p align="center"><b>Figure 2:</b> Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet.</p> </details>
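One way an isolation plot like Figure 2 can be produced is sketched below, assuming each completed trial is stored as a flat record of hyperparameters and validation error (the field names are hypothetical): for each value of the scientific hyperparameter, keep the best result after "optimizing away" the nuisance hyperparameters.

```python
from collections import defaultdict

def isolation_plot_points(trials, scientific_key="weight_decay",
                          metric_key="validation_error"):
    """Sketch: best validation error per value of the scientific hyperparameter.

    `trials` is assumed to be a list of dicts such as
    {"weight_decay": 1e-4, "learning_rate": 0.1, "validation_error": 0.24}.
    """
    best = defaultdict(lambda: float("inf"))
    for trial in trials:
        value = trial[scientific_key]
        best[value] = min(best[value], trial[metric_key])
    # Sorted (scientific value, best validation error) pairs, ready to plot.
    return sorted(best.items())
```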

Automate generically useful plots


Determining whether to adopt a training pipeline change or hyperparameter configuration

Summary: When deciding whether to make a change to our model or training procedure or adopt a new hyperparameter configuration going forward, we need to be aware of the different sources of variation in our results.

After exploration concludes

Summary: Bayesian optimization tools are a compelling option once we’re done exploring for good search spaces and have decided which hyperparameters should even be tuned at all.

Determining the number of steps for each training run

Deciding how long to train when training is not compute-bound

Algorithm for picking an initial candidate for max_train_steps using a learning rate sweep


Deciding how long to train when training is compute-bound

Round 1


Round 2


Additional guidance for the training pipeline

Optimizing the input pipeline

Summary: The causes of input-bound pipelines, and the appropriate interventions, are highly task-dependent; use a profiler and look out for common issues.

Evaluating model performance

Summary: Run evaluation at larger batch sizes than training. Run evaluations at regular step intervals, not regular time intervals.
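A minimal sketch of step-based evaluation scheduling; the `train_step`, `run_eval`, and `batches` arguments are hypothetical stand-ins for your own pipeline:

```python
def train_with_periodic_eval(train_step, run_eval, batches, max_train_steps,
                             eval_every_n_steps=1000):
    """Sketch: evaluate on a fixed step schedule rather than wall-clock time,
    so runs on different hardware or with different interruptions remain
    comparable."""
    for step in range(1, max_train_steps + 1):
        train_step(next(batches))
        if step % eval_every_n_steps == 0 or step == max_train_steps:
            run_eval(step)
```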

Evaluation settings


Setting up periodic evaluations


Choosing a sample for periodic evaluation


Saving checkpoints and retrospectively selecting the best checkpoint

Summary: Run training for a fixed number of steps and retrospectively choose the best checkpoint from the run.
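For example, if each periodic evaluation records the step, the validation metric, and the corresponding checkpoint path, retrospective selection reduces to a one-liner (a sketch, assuming a lower metric is better):

```python
def best_checkpoint(eval_history):
    """Sketch: retrospectively select the best checkpoint after a run of a
    fixed number of steps. `eval_history` is assumed to be a list of
    (step, validation_error, checkpoint_path) tuples saved during training."""
    step, error, path = min(eval_history, key=lambda record: record[1])
    return path, step, error
```

In practice it is usually enough to keep only the few best checkpoints seen so far rather than every checkpoint ever saved.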

Setting up experiment tracking

Summary: When tracking different experiments, make sure to note a number of essentials like the best performance of a checkpoint in the study, and a short description of the study.
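For instance, one hypothetical tracking record might look like the following; the exact fields and values are illustrative, not prescriptive:

```python
# Hypothetical row in an experiment-tracking spreadsheet or database.
study_record = {
    "study_name": "resnet50_weight_decay_sweep_v3",
    "description": "Does weight decay help the ResNet-50 baseline?",
    "num_trials": 64,
    "best_checkpoint_validation_error": 0.231,
    "config_link": "<link to the exact config / commit used>",
}
```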

Batch normalization implementation details

Summary: Nowadays batch norm can often be replaced with LayerNorm, but in cases where it cannot, there are tricky details when changing the batch size or number of hosts.

Considerations for multi-host pipelines

Summary: For logging, evals, RNGs, checkpointing, and data sharding, multi-host training can make it very easy to introduce bugs!

FAQs

What is the best learning rate decay schedule family?


Which learning rate decay should I use as a default?


Why do some papers have complicated learning rate schedules?


How should Adam’s hyperparameters be tuned?


Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?


Where can I find an implementation of quasi-random search?


How many trials are needed to get good results with quasi-random search?

<details><summary><em>[Click to expand]</em></summary> <br> <p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/have_we_sampled_enough.png" width="49%" alt="A box plot showing the importance of sampling enough"> </p> <p align="center"><b>Figure 3:</b> A ResNet-50 was tuned on ImageNet with 100 trials. Via bootstrapping, different amounts of tuning budget were simulated. Box plots of the best performances for each trial budget are plotted above.</p> </details>
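A sketch of the kind of bootstrap used for Figure 3, assuming the validation errors of all completed trials are available as a simple list:

```python
import random

def simulate_tuning_budgets(trial_errors, budgets=(5, 10, 20, 50),
                            num_resamples=1000):
    """Sketch: given the validation errors of (say) 100 completed quasi-random
    trials, estimate what the best result would look like under a smaller
    tuning budget by resampling trials with replacement."""
    results = {}
    for budget in budgets:
        best_per_resample = [
            min(random.choices(trial_errors, k=budget))
            for _ in range(num_resamples)
        ]
        results[budget] = best_per_resample  # e.g. feed these into a box plot
    return results
```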

How can optimization failures be debugged and mitigated?

<details><summary><em>[Click to expand]</em></summary> <br>

Summary: If the model is experiencing optimization difficulties, it’s important to fix them before trying other things. Diagnosing and correcting training failures is an active area of research.

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/stride_instability.png" width="80%" alt="Changing the strides in a single residual block in a WideResnet results in training instability."> </p> <p align="center"><b>Figure 4:</b> Changing the strides in a single residual block (2x2 -> 1x1) in a WideResnet results in training instability. This does not degrade performance at low learning rates, but high learning rates no longer train well due to the instability. Applying 1000 steps of learning rate warmup resolves this particular instance of instability, allowing stable training at max learning rate of .1.</p>

Identifying unstable workloads

NOTE: Some models show very early instability followed by a recovery that results in slow but stable training. Common evaluation schedules can miss these issues by not evaluating frequently enough!

To check for this, we can train for an abbreviated run of just ~500 steps using lr = 2 * current best, but evaluate every step.

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/more_frequent_evals.png" width="80%" alt="Illustration of the value of more frequent evaluations at the start of training."> </p> <p align="center"><b>Figure 5:</b> Illustration of the value of more frequent evaluations at the start of training. Useful if there’s a suspicion that the model suffers from early training instability.</p>
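A hypothetical configuration for such an abbreviated run might look like this (the names and values are purely illustrative):

```python
best_learning_rate = 0.1  # hypothetical learning rate of the current best trial

# Abbreviated instability check: a short run at twice the current best learning
# rate, evaluated at every step so very early instabilities are not missed.
instability_check = {
    "max_train_steps": 500,
    "learning_rate": 2 * best_learning_rate,
    "eval_every_n_steps": 1,
}
```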

Potential fixes for common instability patterns

Learning rate warmup

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/instability_during_warmup.png" width="80%" alt="An example of instability during a warmup period (note the horizontal axis log scale)."> </p> <p align="center"><b>Figure 6:</b> An example of instability during a warmup period (note the horizontal axis log scale). 40k steps of warmup was needed for successful training in this case.</p>
When to apply learning rate warmup

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/axis_model_with_instability.png" width="49%" alt="Axis plot for model with instability"> </p> <p align="center"><b>Figure 7a:</b> An example of a hyperparameter axis plot for a model exhibiting training instability. The best learning rate is at the edge of what is feasible. An "infeasible" trial is defined as one that either produces NaNs or uncharacteristically high values of the loss.</p> <p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/loss_model_with_instability.png" width="49%" alt="Loss curve for model with instability"> </p> <p align="center"><b>Figure 7b:</b> The training loss of a model trained with a learning rate where we see instability.</p>

How to apply learning rate warmup

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/beneficial_effect_warmup.png" width="80%" alt="Beneficial effect of warmup on training instabilities"> </p> <p align="center"><b>Figure 8:</b> Beneficial effect of learning rate warmup on addressing training instabilities.</p>
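A minimal sketch of linear learning rate warmup, assuming a hypothetical `base_schedule(step)` that returns whatever post-warmup learning rate would otherwise be used:

```python
def warmup_schedule(step, peak_lr, warmup_steps, base_schedule):
    """Sketch: ramp the learning rate linearly from 0 to `peak_lr` over
    `warmup_steps`, then hand off to the usual decay schedule."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return base_schedule(step)
```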

Gradient clipping

<p align="center"> <img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/gradient_clipping.png" width="80%" alt="Gradient clipping on early training instabilities"> </p> <p align="center"><b>Figure 9:</b> Illustration of gradient clipping correcting early training instability.</p> </details>
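A minimal sketch of gradient clipping by global norm (one common variant); in practice the threshold would be set with reference to the gradient norms observed during stable training:

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Sketch: if the combined norm of all gradient arrays exceeds `max_norm`,
    rescale them so that their global norm equals `max_norm`."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in gradients))
    if global_norm > max_norm:
        gradients = [g * (max_norm / global_norm) for g in gradients]
    return gradients
```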

Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.


Why shouldn't the batch size be tuned to directly improve validation set performance?


What are the update rules for all the popular optimization algorithms?

<details><summary><em>[Click to expand]</em></summary> <br>

Stochastic gradient descent (SGD)

$$\theta_{t+1} = \theta_{t} - \eta_t \nabla \mathcal{l}(\theta_t)$$

Momentum

$$v_0 = 0$$

$$v_{t+1} = \gamma v_{t} + \nabla \mathcal{l}(\theta_t)$$

$$\theta_{t+1} = \theta_{t} - \eta_t v_{t+1}$$

Nesterov

$$v_0 = 0$$

$$v_{t+1} = \gamma v_{t} + \nabla \mathcal{l}(\theta_t)$$

$$\theta_{t+1} = \theta_{t} - \eta_t( \gamma v_{t+1} + \nabla \mathcal{l}(\theta_{t}))$$

RMSProp

$$v_0 = 1 \text{,} m_0 = 0$$

$$v_{t+1} = \rho v_{t} + (1 - \rho) \nabla \mathcal{l}(\theta_t)^2$$

$$m_{t+1} = \gamma m_{t} + \frac{\eta_t}{\sqrt{v_{t+1} + \epsilon}}\nabla \mathcal{l}(\theta_t)$$

$$\theta_{t+1} = \theta_{t} - m_{t+1}$$

ADAM

$$m_0 = 0 \text{,} v_0 = 0$$

$$m_{t+1} = \beta_1 m_{t} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)$$

$$v_{t+1} = \beta_2 v_{t} + (1 - \beta_2) \nabla \mathcal{l}(\theta_t)^2$$

$$b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$$

$$\theta_{t+1} = \theta_{t} - \alpha_t \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$$

NADAM

$$m_0 = 0 \text{,} v_0 = 0$$

$$m_{t+1} = \beta_1 m_{t} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)$$

$$v_{t+1} = \beta_2 v_{t} + (1 - \beta_2) \nabla \mathcal{l} (\theta_t)^2$$

$$b_{t+1} = \frac{\sqrt{1 - \beta_2^{t+1}}}{1 - \beta_1^{t+1}}$$

$$\theta_{t+1} = \theta_{t} - \alpha_t \frac{\beta_1 m_{t+1} + (1 - \beta_1) \nabla \mathcal{l} (\theta_t)}{\sqrt{v_{t+1}} + \epsilon} b_{t+1}$$
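For concreteness, here is a direct NumPy transcription of two of the update rules above (momentum and ADAM). It is meant only as an illustration of the equations, not as a replacement for a framework's optimizer implementation:

```python
import numpy as np

def momentum_step(theta, v, grad, eta, gamma=0.9):
    """One step of momentum, matching the update rule above."""
    v = gamma * v + grad
    theta = theta - eta * v
    return theta, v

def adam_step(theta, m, v, grad, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of ADAM, matching the update rule above. `t` counts previously
    completed steps starting from 0, so the bias correction uses beta^(t+1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    bias_correction = np.sqrt(1 - beta2**(t + 1)) / (1 - beta1**(t + 1))
    theta = theta - alpha * m / (np.sqrt(v) + eps) * bias_correction
    return theta, m, v
```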

</details>

Acknowledgments

Citing

```bibtex
@misc{tuningplaybookgithub,
  author = {Varun Godbole and George E. Dahl and Justin Gilmer and Christopher J. Shallue and Zachary Nado},
  title = {Deep Learning Tuning Playbook},
  url = {http://github.com/google-research/tuning_playbook},
  year = {2023},
  note = {Version 1.0}
}
```

Contributing

Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License Agreement (CLA). You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to https://cla.developers.google.com/ to see your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one (even if it was for a different project), you probably don't need to do it again.

Code Reviews

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.

Community Guidelines

This project follows Google's Open Source Community Guidelines.

Footnotes

  1. Ben Recht and Kevin Jamieson pointed out how strong 2X-budget random search is as a baseline (the Hyperband paper makes similar arguments), but it is certainly possible to find search spaces and problems where state-of-the-art Bayesian optimization techniques crush random search that has 2X the budget. However, in our experience beating 2X-budget random search gets much harder in the high-parallelism regime since Bayesian optimization has no opportunity to observe the results of previous trials.