Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node: Modify error handling for CheckPending method in the Governor to prevent process restarts #4011

Closed

Conversation

johnsaigle
Copy link
Contributor

Note: this PR is based on the v2.24.8 tag.


Previous rollouts of the Flow Cancel feature contained issues when calculating the Governor usage when usage was near the daily limit. This caused an invariant to be violated. However, this was propagated to the processor code and resulted in the processor restarting the entire process. Instead, the Governor should simply fail-closed and report that there is no remaining capacity, causing further VAAs to be queued until the usage diminishes over time.
The circumstances leading to the invariant violations are not addressed in this commit. Instead this commit reworks the way errors are handled by the CheckPending, making careful choices about when the process should or should not be killed.

  • Change "invariant" error handling: instead of causing the process to die, log an error and skip further for a single chain while allowing processing for other chains to continue
  • Change other less severe error cases to log warnings instead of returning errors.
  • Generally prevent flow-cancel related issues from affecting normal Governor operations. Instead the flow cancel transfers should simply not be populated and thus result in "GovernorV1" behavior.
  • Add documentation to CheckPendingForTime to explain the dangers of returning an error
  • Reword error messages to be more precise and include more relevant fields. Add documentation explaining when the process should and should not die

Previous rollouts of the Flow Cancel feature contained issues when
calculating the Governor usage when usage was near the daily limit. This
caused an invariant to be violated. However, this was propagated to the
processor code and resulted in the processor restarting the entire
process. Instead, the Governor should simply fail-closed and report that
there is no remaining capacity, causing further VAAs to be queued until
the usage diminishes over time.
The circumstances leading to the invariant violations are not addressed
in this commit. Instead this commit reworks the way errors are handled
by the CheckPending, making careful choices about when the process
should or should not be killed.

- Change "invariant" error handling: instead of causing the process to
  die, log an error and skip further for a single chain while allowing
  processing for other chains to continue
- Change other less severe error cases to log warnings instead of
  returning errors.
- Generally prevent flow-cancel related issues from affecting normal
  Governor operations. Instead the flow cancel transfers should simply
  not be populated and thus result in "GovernorV1" behavior.
- Add documentation to CheckPendingForTime to explain the dangers of
  returning an error
- Reword error messages to be more precise and include more relevant
  fields. Add documentation explaining when the process should and
  should not die
Comment on lines 689 to 704
// An error here might mean that the sum of emitted transfers for a chain has exceeded
// the Governor's daily limit for this chain, or that the previous total value could
// not be calculated. Although this is a serious issue, we do not want to return an error
// here because it will cause the Governor to be restarted
// by the processor. Instead, skip processing pending transfers for this chain while
// continuing to process other chains as normal.
// Note that the sliding 24h window and/or incoming flow canceling transfers will
// reduce `prevTotalValue` over time. Therefore we expect this state to be temporary.
gov.logger.Error("error when attempting to trim and sum transfers", zap.Error(err))
gov.logger.Error("refusing to release transfers for this chain until Governor usage is reduced",
zap.Stringer("chainId", chainId),
zap.Uint64("prevTotalValue", prevTotalValue),
zap.Error(err))
gov.msgsToPublish = msgsToPublish
return nil, err
// Skip further processing for this chain entry
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about this handling, I'm wondering whether we'd be better off changing TrimAndSumValueForChain?

The invariant there isn't actually too hard to hit. If we peg the limit and a flow cancelling transfer is older than 24 hours and falls off the list, then the sum calculated will be greater than the daily limit for that chain. I don't think we can consider that an invariant in this case.

I think we should move that invariant check to before the trimming of old transfers happens? I believe it should hold in that case?

I think the idea of skipping processing for a chain in the event of an error is the right thing to do (rather than crashing the process), but ideally we don't hit this code path very often

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, another thought about the invariant, is the token entry price during this processing cached or is the computed value calculated on the fly? If we're close to the daily transfer limit and the token price is increased with a coingecko price update then we could hit the invariant also even though we were under the limit at the original processing time. Something to think about

Copy link
Contributor Author

@johnsaigle johnsaigle Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I was also wondering how best to handle this.

I don't think we can consider that an invariant in this case.

I think you're right. Maybe this should be thought of more as a Warning with the idea that it's a temporary state that should be resolved by 24h going by or by incoming flow.

I think we should move that invariant check to before the trimming of old transfers happens?

As written it's one process - the sum happens at the same time as trim so we can't get prevTotalValue until after we sum. In theory we could store this in the node or split trimming and summing but it's more of a significant refactor.

is the token entry price during this processing cached or is the computed value calculated on the fly

Both, technically. I think the token Entry is refreshed when the process is restarted by reading from the mainnet token list. But then the computed value of a transfer is re-calculated on each iteration.

If we're close to the daily transfer limit and the token price is increased with a coingecko price update then we could hit the invariant also even though we were under the limit at the original processing time

I think that's correct, nice catch. With this additional case I'm in favour of not considering it as an invariant. I can refactor here to basically just prevent new transfers from going through when the Governor is at its limit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Dirk on this. I don't think it can be considered an invariant anymore with the changes we've made with flow canceling (and even prior to that).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested this locally in Tilt with a violated invariant. It appears to log the error message and continue on, as opposed to just crash looping like before.

- Remove 'invariant error' in TrimAndSumValueForChain as it can occur
  somewhat regularly with the addition of the flow cancel feature
- Return dailyLimit in error condition rather than 0 so that future
  transfers will be queued
- Do not cap the sum returned from TrimAndSumValueForChain: instead
  allow it to exceed the daily limit.
- Modify unit tests to reflect this
- Add unit tests for overflow/underflow scenarios in the TrimAndSumValue
  functions
@johnsaigle johnsaigle requested a review from djb15 July 12, 2024 14:54
@@ -686,16 +686,8 @@ func (gov *ChainGovernor) CheckPendingForTime(now time.Time) ([]*common.MessageP
foundOne := false
prevTotalValue, err := gov.TrimAndSumValueForChain(ce, startTime)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error statement is now impossible to hit, right? TrimAndSumValueForChain will always return a value instead of an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, there's another error case for that function relating to overflow/underflow for int64 calculations. Much, much less likely though.

Copy link
Contributor

@mdulin2 mdulin2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sane to me! Just changing where the handling of the overflow of the dailylimit is done.

@johnsaigle
Copy link
Contributor Author

These changes have been incorporated into #4016

@johnsaigle johnsaigle closed this Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants