Resume at the end of the last trained epoch #547
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%
=======================================
  Files          46       46
  Lines        3130     3130
  Branches      510      510
=======================================
  Hits         2336     2336
  Misses        693      693
  Partials      101      101
```
Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before. I will open a new ticket for the 50-steps-ahead offset, but will close this since it is now resolved. :-)
Looks good, Samuel.
PR Goal?
Fix proper resuming of text-to-spec training. The state at the end of the last epoch wasn't saved, so resuming was performed from the last saved checkpoint, which was the last checkpoint used for validation. This was producing staggered runs, as shown in `tensorboard`. (A sketch of the checkpoint configuration this implies follows below.)

Fixes?
#534
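The checkpoint directory name used in the test command below (`save_on_train_epoch_end`) suggests the fix revolves around PyTorch Lightning's checkpointing options. A minimal sketch of that kind of configuration, assuming Lightning's `ModelCheckpoint` callback; this is illustrative, not necessarily the PR's exact diff:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep an always-current "last.ckpt", and write it at the end of every
# *training* epoch rather than only when validation runs. Resuming from
# last.ckpt then restarts exactly where the previous run stopped,
# instead of at the last validation checkpoint.
checkpoint_callback = ModelCheckpoint(
    save_last=True,
    save_on_train_epoch_end=True,
)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=2)
# A later run resumes from the saved state, e.g.:
# trainer.fit(model, ckpt_path=".../checkpoints/last.ckpt")
```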
Feedback sought?
merge approval
Priority?
low
Tests added?
None
How to test?
Check the state of the loops:

```bash
python -c 'import torch; import json; m = torch.load("logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt", map_location=torch.device("cpu")); print(json.dumps(m["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))'
```
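This will yield something like the following (a sketch: the keys come from Lightning's loop-progress tracking, and the counts are illustrative for the run described below):

```json
{
  "total": {
    "ready": 737,
    "started": 737,
    "processed": 737,
    "completed": 737
  },
  "current": {
    "ready": 737,
    "started": 737,
    "processed": 737,
    "completed": 737
  },
  "is_last_batch": true
}
```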
You want to look at `current`'s values. This run used 11790 training examples split into batches of 16, so one epoch is 11790/16 ≈ 737 batches (736 full batches plus one partial batch). If, instead, we see 500, the default `val_check_interval`, this would mean that we didn't save at the end of the epoch.

Try resuming for a second epoch:
```bash
srun everyvoice train text-to-spec \
    config/everyvoice-text-to-spec.yaml \
    --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
    --config-args training.max_epochs=2
```
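Then use `tensorboard` to inspect the runs; a typical invocation, assuming the log directory matching the checkpoint paths above:

```bash
tensorboard --logdir logs_and_checkpoints
```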
In TensorBoard, check that the second run's training is NOT staggered with your first run.

Confidence?
Good
Version change?
No
Related PRs?
None