Fix `checkpointable_layers` Logic #6881

Quentin-Anthony · 2024-12-17T00:11:21Z

Problem

There is an edge case in DeepSpeed. If all three of the following are true:

1. DeepSpeed activation checkpointing is applied
2. The user passes `checkpointable_layers` (e.g. https://github.com/EleutherAI/gpt-neox/blob/f5325805678c2b9e35aae4528283e0132c5f5bbc/megatron/model/gpt2_model.py#L175)
3. The user's model class is named `GPT2ModelPipe` or `GPTModelPipe`

then the layers listed in `checkpointable_layers` will not be activation checkpointed.

Reason

This is because, in the current logic, `_is_checkpointable` short-circuits whenever `self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe')` and treats only layers whose class name contains `ParallelTransformerLayerPipe` as checkpointable, so `checkpointable_layers` is never consulted. See DeepSpeed/deepspeed/runtime/pipe/module.py, line 653 in da771ed:
    
           return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)
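The short-circuit can be reproduced outside DeepSpeed with a small standalone sketch. Everything below is illustrative: `is_checkpointable_before`, `CustomLayerPipe`, and `MyModelPipe` are hypothetical names, the fallback is a placeholder, and any other conditions the real `_is_checkpointable` applies are omitted.

```python
class ParallelTransformerLayerPipe:
    """Stand-in for the layer class the short-circuit looks for."""


class CustomLayerPipe:
    """Hypothetical user layer named in checkpointable_layers."""


def is_checkpointable_before(model_class_name, checkpointable_layers, funcs):
    # Assumed simplification of the behavior described above: for these two
    # model class names the user-supplied list is never consulted.
    if model_class_name in ("GPTModelPipe", "GPT2ModelPipe"):
        return all("ParallelTransformerLayerPipe" in f.__class__.__name__ for f in funcs)
    if checkpointable_layers is not None:
        return all(f.__class__.__name__ in checkpointable_layers for f in funcs)
    return False  # placeholder fallback for this sketch only


layers = [CustomLayerPipe()]
# Same layers, same checkpointable_layers; only the model class name differs.
skipped = is_checkpointable_before("GPT2ModelPipe", ["CustomLayerPipe"], layers)
honored = is_checkpointable_before("MyModelPipe", ["CustomLayerPipe"], layers)
print(skipped, honored)
```

In this sketch the user's request is honored for an arbitrary model class name but silently dropped for `GPT2ModelPipe`, which is the edge case reported here.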

Proposed Fixes

I think `checkpointable_layers` should always be checked, so I added logic to that effect. I also found the documentation for `checkpointable_layers` confusing and contradictory, so I updated the docstring. Lastly, I added a unit test for `checkpointable_layers`.
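One way to picture "always check `checkpointable_layers`" is the standalone sketch below. It is not the merged DeepSpeed code: `is_checkpointable_after` and `CustomLayerPipe` are hypothetical names, the ordering of the two checks is an assumption, and the final fallback is a placeholder.

```python
class ParallelTransformerLayerPipe:
    """Stand-in for the default checkpointed layer class."""


class CustomLayerPipe:
    """Hypothetical user layer named in checkpointable_layers."""


def is_checkpointable_after(model_class_name, checkpointable_layers, funcs):
    # Assumption: the user-supplied list is consulted first, for every model class.
    if checkpointable_layers is not None:
        return all(f.__class__.__name__ in checkpointable_layers for f in funcs)
    # With no user list, keep the name-based default for the GPT pipe classes.
    if model_class_name in ("GPTModelPipe", "GPT2ModelPipe"):
        return all("ParallelTransformerLayerPipe" in f.__class__.__name__ for f in funcs)
    return False  # placeholder fallback for this sketch only


# Assertions in the spirit of a unit test for checkpointable_layers.
user_listed = is_checkpointable_after("GPT2ModelPipe", ["CustomLayerPipe"], [CustomLayerPipe()])
default_kept = is_checkpointable_after("GPT2ModelPipe", None, [ParallelTransformerLayerPipe()])
not_listed = is_checkpointable_after("GPT2ModelPipe", None, [CustomLayerPipe()])
assert user_listed and default_kept and not not_listed
```

Checking the user list before the class-name test keeps the existing default for models that pass no list, while making an explicit list take effect regardless of the model class name.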

tests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py

fix checkpointable layers logic and docstring. Add unit test.

6334229

Quentin-Anthony requested review from loadams, tjruwase and tohtana as code owners December 17, 2024 00:11

tjruwase approved these changes Dec 17, 2024


loadams reviewed Dec 17, 2024


tests/unit/runtime/activation_checkpointing/test_activation_checkpointing.py (resolved)

Quentin-Anthony and others added 6 commits December 18, 2024 12:24

fix act recomp test

b212177

Merge branch 'master' into qanthony/fix-act-recomp

d6e6f80

Merge branch 'master' into qanthony/fix-act-recomp

02141d3

Merge branch 'master' into qanthony/fix-act-recomp

b46b5aa

Merge branch 'master' into qanthony/fix-act-recomp

cd6fa6d

Merge branch 'master' into qanthony/fix-act-recomp

be5392e

loadams enabled auto-merge January 3, 2025 00:59

Merge branch 'master' into qanthony/fix-act-recomp

30fca29

loadams added this pull request to the merge queue Jan 4, 2025

Merged via the queue into microsoft:master with commit 0dbbb70 Jan 4, 2025
10 checks passed
