SIMD-0165: Async Vote Execution #165
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,250 @@ | ||
--- | ||
simd: '0165' | ||
title: Async Vote Execution | ||
authors: | ||
- Wen Xu | ||
category: Standard | ||
type: Core | ||
status: Idea | ||
created: 2024-08-11 | ||
feature: null | ||
supersedes: null | ||
superseded-by: null | ||
extends: null | ||
--- | ||
|
||
## Summary | ||
|
||
Optimistically execute all vote transactions in a block to determine fork | ||
selection in consensus early on, before all the transactions in the block | ||
are fully executed and the actual fee payers for vote transactions are | ||
checked. | ||
|
||
This allows us to converge on one chain of blocks more quickly, so that | ||
validators don't have to execute any blocks not on the selected fork. This | ||
saves the CPU and memory resources needed for replay, and it ensures that the | ||
cluster will have fewer forks caused by slow transaction execution. | ||
|
||
## Motivation | ||
|
||
Currently, vote transactions and non-vote transactions are mixed together in | ||
a block, and a block is considered in consensus only after the whole block has | ||
been frozen and all transactions in the block have been verified and executed. | ||
This is a problem because slow-running non-vote transactions may affect the | ||
ability of consensus to pick the correct fork. It may also mean that the leader | ||
more often builds on a minority fork, so the blocks it packed will be | ||
discarded later. | ||
|
||
With different hardware and running environments, there will always be some | ||
difference in the speed of transaction execution between validators. Generally | ||
speaking, because vote transactions are simple and lock-free, they normally | ||
execute faster than non-vote transactions, and the variation in vote execution | ||
time should be smaller than that in non-vote execution time. Therefore, if we | ||
execute only the vote transactions in a block before voting on the block, it | ||
is more likely that validators can reach consensus faster. | ||
|
||
Even with async vote execution, forks can still happen because of | ||
various other situations, like network partitions or mis-configured validators. | ||
This work just reduces the chances of forks caused by variance in non-vote | ||
transaction executions. | ||
|
||
The non-vote transactions do need to be executed eventually. Even though it's | ||
hard to make sure everyone executes every block within 400ms, on average the | ||
majority of the cluster should be able to keep up. | ||
|
||
## Alternatives Considered | ||
|
||
### Separating vote and non-vote transactions into different domains | ||
|
||
An earlier Async Execution proposal suggested separating vote and | ||
non-vote transactions into different domains, so that we can execute them | ||
independently. The main concerns were: | ||
|
||
* We need to introduce one bit in AccountsDB for every account, which | ||
complicates the implementation | ||
|
||
* Topping off the vote fee payer accounts becomes difficult. We need to add a | ||
bounce account to move fees from user domain to vote domain, and the process | ||
may take one epoch | ||
|
||
## New Terminology | ||
|
||
* `Vote Only Bankhash`: The hash calculated after executing only vote | ||
transactions in a block without checking fee payers. The exact calculation | ||
algorithm is given in the Detailed Design section. | ||
* `Replay Tip Bankhash`: The bankhash as we know it today. It is calculated | ||
after executing all transactions in a block, checking fee payers for all. | ||
|
||
## Detailed Design | ||
|
||
### Allow leader to skip execution of transactions (Bankless Leader) | ||
|
||
There is already an ongoing effort to skip execution of all transactions | ||
entirely when the leader packs new blocks. See SIMD 82, SIMD 83, and related trackers: | ||
https://github.com/anza-xyz/agave/issues/2502 | ||
|
||
Theoretically we could reap some benefit without Bankless Leader: the leader | ||
packs as normal, while other validators replay only votes first, then later | ||
execute the other transactions and compare with the bankhash of the leader. | ||
Such a setup gains a smaller speedup, but it is a possible route during | ||
rollout. | ||
|
||
### Calculate vote only hash by executing only votes and vote on selected forks | ||
|
||
Two new fields will be added to the `TowerSync` vote transaction: | ||
|
||
* `replay_tip_hash`: This is the hash as we know it today. | ||
* `replay_tip_slot`: This is the slot where the replay tip hash is calculated. | ||
|
||
The `hash` and `slot` in the `TowerSync` transaction will be updated to | ||
the vote only hash. The vote only hash is calculated as follows: | ||
|
||
1. Sort all vote accounts with non-zero stake in the current or previous | ||
epoch by vote account pubkey. | ||
|
||
2. Calculate vote account hash by calculating sha256 hash of (vote account | ||
pubkey, serialized vote state) in the order given. | ||
|
||
3. Calculate vote only hash by calculating sha256 hash of the following in | ||
the given order: | ||
|
||
* vote only hash of the parent bank | ||
* vote account hash calculated above | ||
* block-id of the current bank | ||
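The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are assumed to be already filtered to vote accounts with non-zero stake in the current or previous epoch, and the byte-level encoding (plain concatenation into the hasher) is an assumption, since the proposal does not fix one.

```python
import hashlib
from typing import Iterable, Tuple


def vote_account_hash(vote_accounts: Iterable[Tuple[bytes, bytes]]) -> bytes:
    """sha256 over (pubkey, serialized vote state) pairs, sorted by pubkey.

    Assumption: each pair is fed to the hasher by plain concatenation.
    """
    hasher = hashlib.sha256()
    # Steps 1 and 2: sort by vote account pubkey, then hash in that order.
    for pubkey, serialized_state in sorted(vote_accounts, key=lambda pair: pair[0]):
        hasher.update(pubkey)
        hasher.update(serialized_state)
    return hasher.digest()


def vote_only_hash(
    parent_vote_only_hash: bytes,
    vote_accounts: Iterable[Tuple[bytes, bytes]],
    block_id: bytes,
) -> bytes:
    """Step 3: sha256 of (parent vote only hash, vote account hash, block-id)."""
    hasher = hashlib.sha256()
    hasher.update(parent_vote_only_hash)
    hasher.update(vote_account_hash(vote_accounts))
    hasher.update(block_id)
    return hasher.digest()
```

Because the accounts are sorted before hashing, two validators that hold the same vote states compute the same vote only hash regardless of the order in which they iterate their local copies.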
|
||
This step is optimistic in the sense that validators do not check the fee | ||
payers when executing the vote transactions in a block. They assume vote | ||
transactions will not fail due to insufficient fees, apply the execution | ||
results to select the correct fork, then immediately vote on the bank with | ||
only the hash result of the vote transactions. | ||
|
||
This is safe because the list of validators and their corresponding stakes | ||
comes from the leader scheduler epoch stakes, which are calculated at the | ||
beginning of the last epoch. Because full execution is never behind the | ||
optimistic execution by more than one epoch, the epoch stakes used are stable | ||
and correct. | ||
|
||
To make sure the vote cast would be the same as that after replaying the | ||
whole block, we need to be consistent on whether we mark the block dead, so | ||
that we don't vote, based on the vote only hash, on a block which will be | ||
marked dead later. Currently a block can be dead for the following reasons: | ||
|
||
1. Unable to load data from blockstore | ||
2. Invalid block (wrong number of ticks, duplicate block, bad last fec, etc) | ||
3. Invalid transaction | ||
|
||
For the first two, the same check can be performed when computing the vote | ||
only hash. We will add a new check that the new root must be both vote only | ||
replayed and fully replayed. This may mean the tower occasionally has more | ||
than 32 slots. | ||
|
||
The only condition we can't check is an invalid transaction: since we will | ||
skip all non-vote transaction execution, there is no way to check the validity | ||
of those transactions. The intention of this check was to prevent spam. We | ||
will remove this check and rely on economic incentives so that the leader | ||
performs appropriate checks. | ||
|
||
The vote only execution will operate exclusively on replicated vote states | ||
stored outside the accounts DB, so vote only execution and full execution can | ||
happen asynchronously in any order. The vote authority of each vote account | ||
will be copied from accounts DB at the beginning of each epoch. This means | ||
that in the future a vote authority change will take two epochs instead of | ||
one. | ||
|
||
### Replay the full block on selected forks asynchronously | ||
|
||
There is no protocol-enforced order of block replay across validator | ||
implementations. New vote transactions could be sent whenever the vote only | ||
hash or replay tip hash changes. | ||
|
||
Once a validator has determined the fork it will vote on, it can prioritize | ||
replaying blocks on the selected fork. The replay process is the same as today: | ||
all transactions (vote and non-vote) will be executed to determine the final | ||
bankhash. The computed bankhash will be attached to vote instructions, so we | ||
can still detect non-determinism (the same set of instructions leading to | ||
different results) like today, though possibly at a later time. | ||
|
||
To guarantee the blockchain will halt when full replay is having problems, we | ||
propose: | ||
|
||
1. If full replay is behind vote only replay by more than 1/2 epoch or vice | ||
versa, stop producing new blocks until the lagging replay catches up. Also set | ||
up monitoring for whether the distance between the two replays is growing. | ||

re: "vice versa" can full replay run before vote only replay? I think we should ensure that full replay cannot run on a block that has not been vote only replayed

There is no reason a full replay can't start before the bank is vote only frozen, they write into completely different sets of state.

makes sense, we should outline the interaction here: maybe point out that if we haven't completed vote replay we replay all available forks equally. also outline what happens in the case that we're full replaying a fork but end up vote replaying and voting on a different fork.

I think what you described is reasonable, but I prefer not to specify in what order the full replay should happen in the protocol. Anza and Jump may decide to use different fork weight algorithms to pick which block to send to full replay next, I think the protocol would still work. And from what I heard, playing multiple forks at the same time or playing into the distant future is also a possibility.
|
||
2. If more than 1/3 of the validators send a different final hash for a block | ||
with the same vote only hash, panic and prompt for further debugging. | ||
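The two halt conditions above can be sketched as simple predicates. This is a hedged illustration, not the proposed implementation: the function names are hypothetical, "1/3 of the validators" is interpreted here as stake-weighted (an assumption), and "different" is taken to mean different from the locally computed final hash (also an assumption).

```python
from typing import Iterable, Tuple


def replay_lag_exceeded(
    vote_only_slot: int, full_replay_slot: int, slots_per_epoch: int
) -> bool:
    """True when the two replays are more than 1/2 epoch apart, in either direction."""
    return abs(vote_only_slot - full_replay_slot) * 2 > slots_per_epoch


def final_hash_diverged(
    local_final_hash: bytes,
    observed: Iterable[Tuple[int, bytes]],
    total_stake: int,
) -> bool:
    """True when more than 1/3 of stake reports a different final hash.

    `observed` holds (stake, final_hash) pairs from votes that all carry the
    same vote only hash for the block in question.
    """
    disagreeing = sum(stake for stake, h in observed if h != local_final_hash)
    return disagreeing * 3 > total_stake
```

Integer arithmetic (multiplying instead of dividing) keeps the threshold comparisons exact at the boundary.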
|
||
In this step the validators will check the fee payers of vote transactions. So | ||
each vote transaction is executed twice: once in the optimistic voting stage | ||
*without* checking fee payer, and once in this stage *with* checking fee payer. | ||
If a staked validator does not have the vote fee covered for specific votes, | ||
the vote is not accepted today. In the future the vote will be accepted in | ||
fork selection but will not earn vote credits, because the transaction | ||
failed. | ||
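The difference between today's behavior and the proposed behavior for a vote whose fee payer cannot cover the fee can be summarized in a small sketch. The `VoteOutcome` type and `vote_outcome` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class VoteOutcome:
    counted_in_fork_choice: bool
    earns_vote_credits: bool


def vote_outcome(fee_payer_can_pay: bool, async_vote_execution: bool) -> VoteOutcome:
    """Effect of one vote transaction after both execution stages."""
    if fee_payer_can_pay:
        # The vote succeeds in full replay, so it counts and earns credits.
        return VoteOutcome(counted_in_fork_choice=True, earns_vote_credits=True)
    # The fee payer check fails in full replay. With async vote execution the
    # vote was already applied optimistically to fork selection, but the
    # failed transaction earns no credits. Today the vote is simply rejected.
    return VoteOutcome(
        counted_in_fork_choice=async_vote_execution,
        earns_vote_credits=False,
    )
```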
|
||
### Enable Async Vote Executions | ||
|
||
1. The leader will no longer execute any transactions before broadcasting | ||
the block it packed. We do have block-id (Merkle tree root) to ensure | ||
everyone receives the same block. | ||
2. Upon receiving a new block, the validator executes only the vote | ||
transactions without checking the fee payers. The result is immediately | ||
applied in consensus to select a fork. Then votes are sent out for the | ||
selected fork with the `Vote only bankhash` for the tip of the fork and the | ||
most recent `Replay tip bankhash`. Note that fork selection will be based | ||
only on the most recent `Vote only bankhash` and associated slot. | ||
`Replay tip bankhash` is used mostly for commitment aggregation and security | ||
checks described below. | ||
3. The blocks on the selected forks are scheduled to be replayed. When | ||
a block is replayed, all transactions are executed with fee payers checked. | ||
This is the same as the replay we use today. | ||
4. Optimistically confirmed or finalized status on `Vote only bankhash` and | ||
`Replay tip bankhash` will both be exposed through RPC so users can select | ||
accordingly. | ||

Might want to specify the listener changes here, we expect clients to track commitment statuses on both hashes? What does it mean to be OC/finalized on the replay tip bankhash, will invalid fee payer votes count towards this?

Also specify changes to the commitment service, do we still aggregate when we vote on the block / root a block? or is there a separate pathway taken only when we full replay the block.

Added to User visible changes.
5. Add an assertion that the confirmed `Replay tip bankhash` is not too far | ||
away from the confirmed `Vote only bankhash` (currently proposed at 1/2 of | ||
the epoch). | ||
6. Add alerts if `Replay tip bankhash` differs when the `Vote only bankhash` is | ||
the same. This is potentially an event worthy of a cluster restart. If more | ||
than 1/3 of the validators claim a different `Replay tip bankhash`, halt and | ||
exit. | ||
|
||
Specify how the feature program will look with APE:

Added to user visible changes.

Specify the modifications to fork choice. Fork choice will now only read vote only replayed votes. Needs to be keyed by block-id or vote only hash instead.

It's covered in "The result is immediately applied in consensus to select a fork." I think whether we key our internal data structure by block-id or hash is implementation details we don't need in SIMD. I've added that to point 2 in "Enable Async Vote Executions".

Specify the modifications to repair, we will now ingest vote only replayed votes in repair weighting. Also changes to repair peer selection, is EpochSlots updated after vote only replay now?

EpochSlots just specified which slots I have to serve repair right? I don't see why we can't update it after vote only replay.

I think repair weighting is also internal choice Anza and Jump may have different choices on, we surely should do what you suggested but we don't need it in the SIMD if it's not visible on the line.
||
### User visible changes | ||
|
||
Because we confirm or finalize blocks based on `Vote only bankhash`, the | ||
following changes will be visible to users: | ||
|
||
1. New RPC Commitment Levels: | ||
|
||
Right now we have 3 commitment levels users can specify in RPC: | ||
Processed/Confirmed/Finalized. These commitment levels will be calculated | ||
based on `Vote only bankhash`. There will be two additional commitment levels: | ||
|
||
* ReplayTipConfirmed: The highest slot that a supermajority of the cluster | ||
has voted on with the same `Replay Tip Bankhash`. Votes with invalid fee payers | ||
still count toward this confirmation level. | ||
* ReplayTipFinalized: The highest slot where the block is Finalized and | ||
ReplayTipConfirmed, recognized by a supermajority of the cluster. | ||
|
||
2. Feature activation: | ||
|
||
Feature activations where the vote program isn't affected still work as | ||
before. Feature activations where the vote program is affected will require | ||
two epochs to activate. When a feature affecting the vote program is activated | ||
across a block boundary, we can be sure the feature is activated only when | ||
the first block in the epoch is fully replayed and confirmed. Because the | ||
`Replay tip` block is never more than one epoch away from `Vote only tip`, | ||
it's safe to assume a vote program related feature is activated after one | ||
full epoch. | ||
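As an illustration of the `ReplayTipConfirmed` level described in this section, the sketch below aggregates the `replay_tip_slot` and `replay_tip_hash` fields carried by votes. It rests on several assumptions that the proposal does not state: the function name is hypothetical, supermajority is taken as more than 2/3 of total stake, and each vote is credited only to its own replay tip slot (a fuller aggregation would presumably also credit ancestors of that slot).

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def replay_tip_confirmed(
    votes: Iterable[Tuple[int, int, bytes]], total_stake: int
) -> Optional[int]:
    """Highest slot where more than 2/3 of stake reports the same replay tip hash.

    `votes` holds (stake, replay_tip_slot, replay_tip_hash), one per validator.
    Votes with invalid fee payers are still included by the caller, as the
    proposal says they count toward this level.
    """
    stake_by_tip: Dict[Tuple[int, bytes], int] = defaultdict(int)
    for stake, slot, tip_hash in votes:
        stake_by_tip[(slot, tip_hash)] += stake
    confirmed = [
        slot
        for (slot, _), stake in stake_by_tip.items()
        if stake * 3 > total_stake * 2
    ]
    return max(confirmed, default=None)
```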
|
||
## Impact | ||
|
||
Since we will eliminate the impact of non-vote transaction execution speed, | ||
we should expect to see less forking and fewer late blocks. | ||
|
||
## Security Considerations | ||
|
||
We do need to monitor and address the possibility of bankhash mismatches | ||
when the tip of the fork is far away from the slot where bankhash mismatch | ||
happened. | ||
|
||
## Backward Compatibility | ||
|
||
Most of the changes would require feature gates. |

Can you explain this part? We avoid setting the root if the block has not been fully replayed, so the tower can contain thousands of slots? Why can't we set the root like we do today and modify pruning to keep up to the latest fully replayed root? Essentially moving the smr to prune by to be the replayed-smr

We need to save the slots which should be kicked out of tower but haven't been fully replayed somewhere. We can only save snapshot on a fully replayed slot, and we need to keep ancestor relationships between blocks. It could be somewhere not on the tower, are you mostly worried about the space cost if we keep it on the tower?

we have all that in bank forks right? the tower is just a list of slot #s

We do. I think the main question is whether we want it in vote transactions. We do want replay tip in there, but maybe not the full list of slots.