Resume at the end of the last trained epoch #547
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%
=======================================
  Files          46       46
  Lines        3130     3130
  Branches      510      510
=======================================
  Hits         2336     2336
  Misses        693      693
  Partials      101      101
```
Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before. I will open a new ticket for the 50-steps-ahead offset, but will close this since it is now resolved. :-)
Looks good, Samuel.
PR Goal?
Fix proper resuming of text-to-spec training. The state at the end of the last epoch wasn't saved, so resuming was performed from the last saved checkpoint, which was the last checkpoint used for validation. This was producing staggered runs, as shown in `tensorboard`. (A sketch of the checkpoint configuration this implies follows below.)

Fixes?
#534
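The checkpoint directory name used in the test command below (`save_on_train_epoch_end`) suggests the fix revolves around PyTorch Lightning's checkpointing options. A minimal sketch of that kind of configuration, assuming Lightning's `ModelCheckpoint` callback; this is illustrative, not necessarily the PR's exact diff:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep an always-current "last.ckpt", and write it at the end of every
# *training* epoch rather than only when validation runs. Resuming from
# last.ckpt then restarts exactly where the previous run stopped,
# instead of at the last validation checkpoint.
checkpoint_callback = ModelCheckpoint(
    save_last=True,
    save_on_train_epoch_end=True,
)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=2)
# A later run resumes from the saved state, e.g.:
# trainer.fit(model, ckpt_path=".../checkpoints/last.ckpt")
```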
Feedback sought?
merge approval
Priority?
low
Tests added?
None
How to test?
Check the state of the loops:

```bash
python -c 'import torch; import json; m = torch.load("logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt", map_location=torch.device("cpu")); print(json.dumps(m["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))'
```
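This will yield something like the following (a sketch: the keys come from Lightning's loop-progress tracking, and the counts are illustrative for the run described below):

```json
{
  "total": {
    "ready": 737,
    "started": 737,
    "processed": 737,
    "completed": 737
  },
  "current": {
    "ready": 737,
    "started": 737,
    "processed": 737,
    "completed": 737
  },
  "is_last_batch": true
}
```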
You want to look at `current`'s values. This run used 11790 training examples split into batches of 16, so one epoch is 11790/16 ≈ 737 batches (736 full batches plus one partial batch). If, instead, we see 500, the default `val_check_interval`, this would mean that we didn't save at the end of the epoch.

Try resuming for a second epoch:
```bash
srun everyvoice train text-to-spec \
    config/everyvoice-text-to-spec.yaml \
    --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
    --config-args training.max_epochs=2
```
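Then use `tensorboard` to inspect the runs; a typical invocation, assuming the log directory matching the checkpoint paths above:

```bash
tensorboard --logdir logs_and_checkpoints
```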
In TensorBoard, check that the second run's training is NOT staggered with your first run.

Confidence?
Good
Version change?
No
Related PRs?
None