Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add save and export to training servicer #228

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

thodkatz
Copy link
Collaborator

@thodkatz thodkatz commented Dec 12, 2024

It builds upon #227 .

Adding the options to save and export a model for the training service.

  • Save: pytorch state dict, including the optimizer
  • Export: save it as a bioimageio package (.zip)

When we save, the training is paused, and then we resume. For the export, we just pause.

- Supported operations: start, resume, pause, shutdown
- pytorch-3dunet package is used as the framework to create the models
… failed

I caught an edge case, where events are blocked, because we have exited the training, and the tasks of the queue would remain unprocessed.
Creating and closing processes and threads can be quite time consuming
resulting to test timeouts if the tests performs a lot of actions.
Applying monkeypatch to a parent process won't propagated to a child process if the start method is spawn (macos) instead of fork (linux)
- To fix test on windows, convert label data to float64
The should stop callbacks are boolean, so we need to aggregate their return value. Previously the return value wasn't taken into account, and the callbacks were returning none
The enum is used for validation check before triggering one of them. Previously I was checking if the queue was alive, but that won't be enough, for example if you want to perform resume, while you are resumed, the queue is operational, but the action shouldn't be valid.
@thodkatz thodkatz force-pushed the add-save-export-to-training-servicer branch 3 times, most recently from 0e7735a to b303405 Compare December 19, 2024 10:58
Move NamedInt and Tensor proto to a separate file so training proto can
use as well
- The inference servicer had a procedure to list the available devices.
  This is needed or the training servicer as well. So list devices is
  decoupled to be shared.
If the training is running or paused, the forward, will retain the state
after completion. But it requires to pause so we can release memory and
do the forward pass.
@thodkatz thodkatz force-pushed the add-save-export-to-training-servicer branch from b303405 to fc70f78 Compare December 20, 2024 21:54
Copy link

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 62.05074% with 359 lines in your changes missing coverage. Please review.

Project coverage is 62.86%. Comparing base (5ea5d3a) to head (e4bc4d6).

Files with missing lines Patch % Lines
tiktorch/trainer.py 43.51% 122 Missing ⚠️
tiktorch/server/session/backend/supervisor.py 69.37% 49 Missing ⚠️
tiktorch/proto/training_pb2_grpc.py 55.14% 48 Missing ⚠️
tiktorch/proto/training_pb2.py 27.02% 27 Missing ⚠️
tiktorch/proto/utils_pb2.py 30.00% 21 Missing ⚠️
tiktorch/proto/inference_pb2.py 20.00% 20 Missing ⚠️
tiktorch/server/session/backend/commands.py 78.40% 19 Missing ⚠️
tiktorch/server/session/backend/base.py 67.34% 16 Missing ⚠️
tiktorch/server/session/process.py 68.88% 14 Missing ⚠️
tiktorch/server/session/rpc_interface.py 71.87% 9 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   64.60%   62.86%   -1.75%     
==========================================
  Files          40       47       +7     
  Lines        2195     2838     +643     
==========================================
+ Hits         1418     1784     +366     
- Misses        777     1054     +277     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Since both inference and training servicers have common the concept of
id, the training session id was replaced with the model session one used
for inference. This model session protobuf interfaced moved to a
separate utils proto file.

The PredictRequest being common, can be leveraged for abstraction.
@thodkatz thodkatz force-pushed the add-save-export-to-training-servicer branch 2 times, most recently from f45ebcb to 5d7f533 Compare December 21, 2024 10:52
- If the model was initial paused or running, save after completion
  retain the state, while temporarily pausing to perform the save.
- The export will pause the training if not paused before.
@thodkatz thodkatz force-pushed the add-save-export-to-training-servicer branch from 5d7f533 to e4bc4d6 Compare December 21, 2024 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant