-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node: Modify error handling for CheckPending method in the Governor to prevent process restarts #4011
node: Modify error handling for CheckPending method in the Governor to prevent process restarts #4011
Conversation
Previous rollouts of the Flow Cancel feature contained issues when calculating the Governor usage when usage was near the daily limit. This caused an invariant to be violated. However, this was propagated to the processor code and resulted in the processor restarting the entire process. Instead, the Governor should simply fail-closed and report that there is no remaining capacity, causing further VAAs to be queued until the usage diminishes over time. The circumstances leading to the invariant violations are not addressed in this commit. Instead this commit reworks the way errors are handled by the CheckPending, making careful choices about when the process should or should not be killed. - Change "invariant" error handling: instead of causing the process to die, log an error and skip further for a single chain while allowing processing for other chains to continue - Change other less severe error cases to log warnings instead of returning errors. - Generally prevent flow-cancel related issues from affecting normal Governor operations. Instead the flow cancel transfers should simply not be populated and thus result in "GovernorV1" behavior. - Add documentation to CheckPendingForTime to explain the dangers of returning an error - Reword error messages to be more precise and include more relevant fields. Add documentation explaining when the process should and should not die
node/pkg/governor/governor.go
Outdated
// An error here might mean that the sum of emitted transfers for a chain has exceeded | ||
// the Governor's daily limit for this chain, or that the previous total value could | ||
// not be calculated. Although this is a serious issue, we do not want to return an error | ||
// here because it will cause the Governor to be restarted | ||
// by the processor. Instead, skip processing pending transfers for this chain while | ||
// continuing to process other chains as normal. | ||
// Note that the sliding 24h window and/or incoming flow canceling transfers will | ||
// reduce `prevTotalValue` over time. Therefore we expect this state to be temporary. | ||
gov.logger.Error("error when attempting to trim and sum transfers", zap.Error(err)) | ||
gov.logger.Error("refusing to release transfers for this chain until Governor usage is reduced", | ||
zap.Stringer("chainId", chainId), | ||
zap.Uint64("prevTotalValue", prevTotalValue), | ||
zap.Error(err)) | ||
gov.msgsToPublish = msgsToPublish | ||
return nil, err | ||
// Skip further processing for this chain entry | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure about this handling, I'm wondering whether we'd be better off changing TrimAndSumValueForChain
?
The invariant there isn't actually too hard to hit. If we peg the limit and a flow cancelling transfer is older than 24 hours and falls off the list, then the sum calculated will be greater than the daily limit for that chain. I don't think we can consider that an invariant in this case.
I think we should move that invariant check to before the trimming of old transfers happens? I believe it should hold in that case?
I think the idea of skipping processing for a chain in the event of an error is the right thing to do (rather than crashing the process), but ideally we don't hit this code path very often
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, another thought about the invariant, is the token entry price during this processing cached or is the computed value calculated on the fly? If we're close to the daily transfer limit and the token price is increased with a coingecko price update then we could hit the invariant also even though we were under the limit at the original processing time. Something to think about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I was also wondering how best to handle this.
I don't think we can consider that an invariant in this case.
I think you're right. Maybe this should be thought of more as a Warning with the idea that it's a temporary state that should be resolved by 24h going by or by incoming flow.
I think we should move that invariant check to before the trimming of old transfers happens?
As written it's one process - the sum happens at the same time as trim so we can't get prevTotalValue
until after we sum. In theory we could store this in the node or split trimming and summing but it's more of a significant refactor.
is the token entry price during this processing cached or is the computed value calculated on the fly
Both, technically. I think the token Entry is refreshed when the process is restarted by reading from the mainnet token list. But then the computed value of a transfer is re-calculated on each iteration.
If we're close to the daily transfer limit and the token price is increased with a coingecko price update then we could hit the invariant also even though we were under the limit at the original processing time
I think that's correct, nice catch. With this additional case I'm in favour of not considering it as an invariant. I can refactor here to basically just prevent new transfers from going through when the Governor is at its limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Dirk on this. I don't think it can be considered an invariant anymore with the changes we've made with flow canceling (and even prior to that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tested this locally in Tilt with a violated invariant. It appears to log the error message and continue on, as opposed to just crash looping like before.
- Remove 'invariant error' in TrimAndSumValueForChain as it can occur somewhat regularly with the addition of the flow cancel feature - Return dailyLimit in error condition rather than 0 so that future transfers will be queued - Do not cap the sum returned from TrimAndSumValueForChain: instead allow it to exceed the daily limit. - Modify unit tests to reflect this - Add unit tests for overflow/underflow scenarios in the TrimAndSumValue functions
@@ -686,16 +686,8 @@ func (gov *ChainGovernor) CheckPendingForTime(now time.Time) ([]*common.MessageP | |||
foundOne := false | |||
prevTotalValue, err := gov.TrimAndSumValueForChain(ce, startTime) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error statement is now impossible to hit, right? TrimAndSumValueForChain
will always return a value instead of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible, there's another error case for that function relating to overflow/underflow for int64 calculations. Much, much less likely though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems sane to me! Just changing where the handling of the overflow of the dailylimit is done.
These changes have been incorporated into #4016 |
Note: this PR is based on the
v2.24.8
tag.Previous rollouts of the Flow Cancel feature contained issues when calculating the Governor usage when usage was near the daily limit. This caused an invariant to be violated. However, this was propagated to the processor code and resulted in the processor restarting the entire process. Instead, the Governor should simply fail-closed and report that there is no remaining capacity, causing further VAAs to be queued until the usage diminishes over time.
The circumstances leading to the invariant violations are not addressed in this commit. Instead this commit reworks the way errors are handled by the CheckPending, making careful choices about when the process should or should not be killed.