fix: handle cases in which losses are nan during the training #535

gcroci2 · 2023-12-19T10:07:50Z

During training (in the trainer.Trainer.train() method), if the loss is nan (for example when using very few data points, as in #528), checkpoint_model is never created, so the training crashes with the following error:

UnboundLocalError: local variable 'checkpoint_model' referenced before assignment
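
A minimal sketch of this failure mode and one way to guard against it. It assumes the checkpoint is only assigned when a loss comparison succeeds, which never happens for nan. The function `train_sketch` and the dictionary standing in for a checkpoint are hypothetical and are not the project's actual Trainer code.

```python
import math
import warnings


def train_sketch(epoch_losses):
    """Hypothetical stand-in for a training loop, not the real Trainer.train()."""
    best_loss = math.inf
    # Bind the name before the loop so it exists even if no epoch assigns it.
    checkpoint_model = None
    for epoch, loss in enumerate(epoch_losses):
        if math.isnan(loss):
            # Any comparison with nan is False, so without this guard the
            # branch below would silently never run.
            warnings.warn(f"Loss is nan at epoch {epoch}; skipping checkpoint.", UserWarning)
            continue
        if loss < best_loss:
            best_loss = loss
            checkpoint_model = {"epoch": epoch, "loss": loss}
    if checkpoint_model is None:
        warnings.warn("No valid loss was computed; no checkpoint is available.", UserWarning)
    return checkpoint_model
```

With only nan losses the sketch returns None and emits warnings instead of raising UnboundLocalError; with a mix of values it keeps the checkpoint for the lowest valid loss.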

This PR adds checks for handling such cases and integration tests.

I used the tmpdir_factory fixture to generate the hdf5 files needed for the checks only once per test session.
I used pytest.mark.parametrize to pass multiple inputs to the test function.
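
The combination of the two pytest features can be sketched as follows. This is an illustration only: the helper `write_placeholder_file`, the fixture name `data_file`, and the test body are hypothetical, and a small text file stands in for the hdf5 files so the example has no extra dependencies.

```python
import math
from pathlib import Path

import pytest


def write_placeholder_file(directory):
    """Hypothetical helper: write a small stand-in data file and return its path."""
    path = Path(directory) / "data.txt"
    path.write_text("0.1\n0.2\n")
    return path


@pytest.fixture(scope="session")
def data_file(tmpdir_factory):
    # scope="session" means the file is created once and shared by all tests.
    directory = Path(str(tmpdir_factory.mktemp("data")))
    return write_placeholder_file(directory)


@pytest.mark.parametrize("loss, expected", [(0.3, False), (float("nan"), True)])
def test_nan_detection(data_file, loss, expected):
    # The same test body runs once per parameter tuple.
    assert data_file.exists()
    assert math.isnan(loss) is expected
```

Run under pytest, this collects two test cases that share a single generated file.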

handle cases in which losses are nan

bbdf64c

gcroci2 self-assigned this Dec 19, 2023

gcroci2 linked an issue Dec 19, 2023 that may be closed by this pull request

Fix UnboundLocalError: local variable 'checkpoint_model' referenced before assignment #533

Closed

4 tasks

gcroci2 mentioned this pull request Dec 19, 2023

Fix UnboundLocalError: local variable 'checkpoint_model' referenced before assignment #533

Closed

4 tasks

gcroci2 added 4 commits December 19, 2023 18:18

improve logic for warning about nan losses

992bc48

user warnings for nan loss cases instead of logging

50bbe82

add integration test for nan losses cases

b6b6be7

fix prospector error

1b3b6b6

gcroci2 merged commit f9681aa into main Dec 20, 2023
6 checks passed

gcroci2 deleted the hotfix_533_unboundlocalerror_gcroci2 branch December 20, 2023 15:39

gcroci2 mentioned this pull request Dec 21, 2023

docs: add Dockerfile and .yml for building conda env #528

Merged

gcroci2 added the JOSS label Jan 10, 2024
