
Why does the ASR model go to train mode in the training loop? #72

Open
amitaie opened this issue Jan 18, 2023 · 11 comments
Labels
bug Something isn't working

Comments

amitaie commented Jan 18, 2023

Hey, I saw that the ASR model that is under "model" is also switched to train mode at the beginning of the training loop. Why is that?
I tried to leave it in eval mode, as it is at initialization, but I got an error.

yl4579 (Owner) commented Jan 18, 2023

It is in eval mode the whole time: https://github.com/yl4579/StarGANv2-VC/blob/main/train.py#L85

yl4579 closed this as completed Jan 18, 2023

amitaie (Author) commented Jan 18, 2023

But if I understood correctly, the ASR model is part of the model that comes back from the build_model method, and in the training loop it goes back to train mode: https://github.com/yl4579/StarGANv2-VC/blob/main/trainer.py#L156
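
For reference, a minimal sketch of the pattern in question (not the exact repo code): the loop flips every sub-module in the model collection to train mode, so keeping the pretrained ASR/F0 models frozen would look roughly like this, assuming `model` is a dict-like mapping of names to modules and using hypothetical key names:

```python
def set_train_mode(model, frozen_keys=("asr_model", "f0_model")):
    """Put trainable sub-modules in train mode and keep pretrained helpers in eval.

    `model` is assumed to be a dict-like mapping of names to nn.Module
    (as returned by build_model); the key names here are hypothetical.
    """
    for key, module in model.items():
        if key in frozen_keys:
            module.eval()    # pretrained ASR / F0 stay frozen
        else:
            module.train()   # only the trainable components go to train mode
```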

yl4579 (Owner) commented Jan 19, 2023

I think you are right. That's probably a mistake. What was the error you got?

yl4579 reopened this Jan 19, 2023

amitaie (Author) commented Jan 19, 2023

I'm not using the same code; I made a lot of changes to integrate it into my repo and workflow. I'll try to reproduce it on the original code, but I think it will be the same error, which is:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [9, 256, 96]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
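
For context, a minimal standalone example (not from the repo) of the kind of in-place operation that triggers this error, together with the usual fix:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, requires_grad=True)
y = F.relu(x)

# ReLU's backward needs its own output; the in-place update below bumps the
# tensor's version counter, so .backward() raises the "modified by an
# inplace operation" RuntimeError.
y += 1
# y = y + 1   # the out-of-place version is the usual fix

y.sum().backward()
```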

On the same subject, shouldn't the F0 model be in eval mode as well?

yl4579 (Owner) commented Jan 19, 2023

I believe there is no difference between train and eval mode for the ASR model, at least for the part we are using here. The part we use (the CNN part) has no batch norm or dropout. For the F0 model it does make a difference, so does this problem also happen when you set the F0 model to eval mode?
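
One quick way to check which parts are actually mode-sensitive is a small helper like the one below (standard PyTorch, not part of the repo):

```python
import torch.nn as nn

def mode_sensitive_layers(module: nn.Module):
    """List the sub-layers whose behaviour changes between train and eval mode."""
    kinds = (nn.Dropout, nn.Dropout2d, nn.BatchNorm1d, nn.BatchNorm2d)
    return [(name, type(m).__name__)
            for name, m in module.named_modules()
            if isinstance(m, kinds)]

# e.g. mode_sensitive_layers(asr_model) and mode_sensitive_layers(f0_model)
```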

amitaie (Author) commented Jan 22, 2023

I will run a few checks on the F0 model and the ASR model and report my findings.
But the ASR model does use dropout, and also normalization (group norm, not batch norm); the CNN uses ConvBlock, which has both of them: https://github.com/yl4579/StarGANv2-VC/blob/main/Utils/ASR/layers.py#L105

yl4579 (Owner) commented Jan 22, 2023

I think you are right, though train/eval mode does not affect group norm. It does affect dropout, so you can set the dropout probability to 0 without changing the train/eval mode. The F0 model might be more difficult to fix: you will have to set its batch norm layers specifically to eval mode if setting the entire model to eval mode doesn't work. Let me know if it works.
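
Something like the following standard PyTorch pattern (a sketch, not tested against this repo):

```python
import torch.nn as nn

def freeze_batchnorm(module: nn.Module):
    """Put only the BatchNorm layers into eval mode so their running statistics
    are used (and not updated), leaving the rest of the model untouched."""
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.eval()

# e.g. call freeze_batchnorm(f0_model) after the loop sets everything to train
```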

amitaie (Author) commented Jan 29, 2023

Took me some time, but I have some results.
I managed to fix the bug and switch the ASR model to eval mode; I needed to fix a small in-place operation in the ASR code.
I trained a few models to examine the differences between running in eval mode or not. There are mainly two factors: the first is the change in dropout/batch-norm behaviour (eval mode), and the second is whether gradients are computed (see the sketch after the list below).

  1. Changing the F0 and ASR models to eval mode with no_grad saved me about 10% of running time and CUDA memory, which I think is mainly due to no_grad.
  2. Changing the ASR model to eval mode caused the loss to converge to ~6, while without eval it converged to ~10 (!). Listening to the results after 150 epochs I didn't hear any differences, but this needs to be explored on cases that are more on the edge of the model's capability.
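
Roughly the pattern of the change (a sketch with placeholder call signatures, not a copy of my code):

```python
import torch
import torch.nn as nn

def extract_frozen_targets(asr_model: nn.Module, f0_model: nn.Module, mel: torch.Tensor):
    """Run the pretrained ASR / F0 models in eval mode and without building a
    graph; the calls below are placeholders, not the repo's exact API."""
    asr_model.eval()
    f0_model.eval()
    with torch.no_grad():              # no activations stored -> the memory/time saving
        asr_features = asr_model(mel)  # placeholder call
        f0_curve = f0_model(mel)       # placeholder call
    return asr_features, f0_curve
```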

Here are some TensorBoard results:
[image: TensorBoard results]

yl4579 (Owner) commented Jan 29, 2023

Thanks for letting me know. Can you make a pull request to modify these things for this repo? Or maybe indicate where the problem is, and I can make the fix.

mayank-git-hub commented
I have created a pull request addressing the issues with the ASR model. Putting the JDC network under eval mode is not so trivial and requires setting each individual layer to eval mode, as mentioned by @yl4579.

mayank-git-hub commented
Setting the dropouts to 0 does not produce audible changes when working with speech signals (except for the changes in the loss values), but it does yield improvements when working with other modalities.
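
For anyone who wants to try the dropout-to-zero route without touching train/eval mode, a minimal helper (standard PyTorch, not taken from the PR) would be:

```python
import torch.nn as nn

def zero_dropout(module: nn.Module):
    """Set p = 0 for every Dropout layer so train mode no longer drops units."""
    for m in module.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.p = 0.0
```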
