I was thinking about this issue. In order to get a training loss value, we need to run at least one batch, but once we consume a batch we are no longer at step 0 of the resumed run. If we want to remove the gap, we could save the last losses and, when we reload the model, send those saved values to TensorBoard to bridge the gap.
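For reference, a minimal sketch of that idea, assuming a TensorBoardLogger and a hypothetical `_compute_loss` helper on the model (the rest of the LightningModule, layers and optimizers, is assumed to exist elsewhere). The last logged loss and its step are stashed in the checkpoint and re-emitted once training restarts:

```python
import pytorch_lightning as pl


class BridgedLoggingModule(pl.LightningModule):
    """Sketch of the hooks needed to replay the last training loss after a resume."""

    def training_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)  # hypothetical helper on the model
        self.log("train_loss", loss)
        # Remember the most recent loss so it can be written into the checkpoint.
        self._last_train_loss = loss.detach().item()
        return loss

    def on_save_checkpoint(self, checkpoint):
        # Stash the last training loss and the global step it belongs to.
        checkpoint["bridge/train_loss"] = getattr(self, "_last_train_loss", None)
        checkpoint["bridge/global_step"] = self.global_step

    def on_load_checkpoint(self, checkpoint):
        # Keep the stashed values; the logger may not be set up yet at this point.
        self._bridge_loss = checkpoint.get("bridge/train_loss")
        self._bridge_step = checkpoint.get("bridge/global_step")

    def on_train_start(self):
        # Re-emit the saved point so the resumed curve connects to the previous run.
        if getattr(self, "_bridge_loss", None) is not None:
            self.logger.experiment.add_scalar(
                "train_loss", self._bridge_loss, self._bridge_step
            )
```

This only replays the scalar for plotting; it does not change training itself, so the curves should join without affecting the resumed run.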
Note that, when we resume, PyTorch Lightning first performs one epoch of evaluation, records it to TensorBoard, and only then resumes training. If we used the losses calculated during that first evaluation phase, we could get loss values at step 0, but they would most likely not align with the training losses calculated at the end of the previous run, i.e. the run prior to resuming, whose last checkpoint we are resuming from.
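If we did want to try that, a rough sketch could look like the callback below. It assumes a TensorBoardLogger and that the model logs a metric named "val_loss"; it takes the value from the evaluation pass that runs right after resuming and writes it at the restored global step:

```python
import pytorch_lightning as pl


class LogResumeEvalCallback(pl.Callback):
    """Write the resume-time validation loss at the checkpoint's global step."""

    def __init__(self):
        self._logged_once = False

    def on_validation_end(self, trainer, pl_module):
        # Skip the sanity check and only act on the first real evaluation after resume.
        if self._logged_once or trainer.sanity_checking:
            return
        val_loss = trainer.callback_metrics.get("val_loss")  # assumed metric name
        if val_loss is not None:
            trainer.logger.experiment.add_scalar(
                "val_loss_at_resume", float(val_loss), trainer.global_step
            )
        self._logged_once = True
```

As noted above, this would give a point at the checkpoint boundary, but it is a validation loss, so it would not line up with the training losses from the previous run.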
After today's meeting, we agreed that the right thing to do is probably to document the fact that, when resuming, we should expect this gap.
Before documenting this, let's give a quick try: run training until the first training losses are logged to TensorBoard, then reset the training iterator. This may not be a great solution, because we might log the same losses twice and make the graph even more confusing.
Bug description
When doing an FP fine-tune, it looks in TensorBoard like the next round starts 50 steps ahead.
See image
How to reproduce the bug
Error messages and logs
No error message.
Environment
Standard environment, nothing special. This applies after PR #547 is merged (Resume at the end of the last trained epoch, #547; see also issue #534).
More info
none