cross-entropy loss with negative reward/advantage resulting in nan values #90
Comments
Hi @nutpen85, I've run into the same problem and eventually found a (partial) solution. It didn't make it into the book because I only learned about it recently. Your diagnosis of the root cause is right: the exp in the softmax function is prone to extreme values. Here's what I ended up doing:
lmk if this helps. Here's a couple examples from an unrelated project: Policy output with regularization and no activation: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/model.py#L76
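For readers wondering why leaving the activation off the policy output can help: the sketch below is not taken from the linked model and may differ from what it does. It assumes the policy head emits raw logits and shows, in plain Python, how log-probabilities can be computed with the log-sum-exp trick so that exp is never evaluated on a large value. The function names `naive_softmax` and `stable_log_softmax` are made up for this illustration.

```python
import math

def naive_softmax(logits):
    # Direct softmax: math.exp overflows once a logit gets large enough.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_log_softmax(logits):
    # Log-sum-exp trick: subtract the max logit before exponentiating,
    # so every exp argument is <= 0 and cannot overflow.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

logits = [1000.0, 0.0, -1000.0]  # illustrative extreme values
try:
    naive_softmax(logits)
except OverflowError as err:
    print("naive softmax failed:", err)

# Log-probabilities stay finite even for extreme logits.
print(stable_log_softmax(logits))
```

Working with log-probabilities directly also avoids taking the log of a probability that has already underflowed to zero.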
Hi @macfergus
Hi again. I finally found some time to continue with your book. This time I ran into a problem in chapters 10 and 12, where you have the policy and the actor-critic agents (same problem for both). After calling the train function the model fit starts as expected. However, after some steps the loss becomes negative, more negative, extremely negative and then nan. So, why does that happen?
I think it's because of the cross-entropy loss in combination with negative rewards/advantages. Those were introduced to punish bad moves and lower their probabilities. Now, when the predicted probability of such a move is really small (e.g. 1e-10), the log in the cross-entropy turns it into a large-magnitude value, and that magnitude keeps growing without bound as the probability approaches zero. This is then multiplied by the negative label, resulting in a large negative loss. I mean, technically, the direction is fine. However, as soon as the loss reaches nan, the model becomes useless, because you can't optimize any further.
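To make the effect concrete, here is a minimal sketch in plain Python. It assumes the loss for the chosen move is simply the advantage-scaled one-hot label times the negative log of the predicted probability; `weighted_cross_entropy` is a made-up helper for illustration, not a function from the book's code.

```python
import math

def weighted_cross_entropy(prob, advantage):
    # Cross-entropy term for the chosen move, with the one-hot label
    # scaled by the advantage (hypothetical helper for illustration).
    return -advantage * math.log(prob)

# With a negative advantage the loss has no lower bound: pushing the
# probability toward zero keeps making the loss more negative.
for prob in (1e-1, 1e-10, 1e-100, 1e-300):
    print(prob, weighted_cross_entropy(prob, advantage=-1.0))

# If the probability underflows to exactly 0.0, the log is undefined:
# math.log raises ValueError, while array libraries such as NumPy
# return -inf, which then spreads through later arithmetic.
```

With a positive advantage the same term is bounded below by zero, so the runaway only shows up for punished moves.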
I don't know much about the theory of using softmax and cross-entropy loss and negative rewards. So, probably I'm simply missing something. Does anyone have an idea?