SIMD-0165: Async Vote Execution #165
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
--- | ||
simd: '0165' | ||
title: Async Vote Execution | ||
authors: | ||
- Wen Xu | ||
category: Standard | ||
type: Core | ||
status: Idea | ||
created: 2024-08-11 | ||
feature: null | ||
supersedes: null | ||
superseded-by: null | ||
extends: null | ||
--- | ||
|
||
## Summary | ||
|
||
Optimistically execute all vote transactions in a block to determine fork | ||
selection in consensus early on, before all the transactions in the block | ||
are fully executed and the actual fee payers for vote transactions are | ||
checked. | ||
|
||
This allows us to converge on one chain of blocks more quickly, so that | ||
validators don't have to execute any blocks not on the selected fork. This | ||
saves CPU and memory resources needed in replay, and it also ensures that the | ||
cluster will have fewer forks caused by slow transaction execution. | ||
|
||
## Motivation | ||
|
||
Currently the vote transactions and non-vote transactions are mixed together in | ||
a block. The vote transactions are only processed in consensus when the whole | ||
block has been frozen and all transactions in the block have been verified and | ||
executed. This is a problem because slow-running non-vote transactions may | ||
delay vote processing and, in turn, affect the ability of consensus to pick | ||
the correct fork. It may also mean that the leader will more often build on a | ||
minority fork, so the blocks it packed will be discarded later. | ||
|
||
With different hardware and running environments, there will always be some | ||
difference in transaction execution speed between validators. Generally | ||
speaking, because vote transactions are so simple, the variation in vote | ||
execution time should be smaller than that in non-vote execution time. Vote | ||
transactions are also lock-free, so they normally execute faster than | ||
non-vote transactions. Therefore, if we only execute vote | ||
transactions in a block before voting on the block, it is more likely | ||
validators can reach consensus faster. | ||
|
||
Even with async vote execution, forks can still happen because of | ||
various other situations, like network partitions or mis-configured validators. | ||
This work just reduces the chances of forks caused by variance in non-vote | ||
transaction executions. | ||
|
||
The non-vote transactions do need to be executed eventually. Even though it's | ||
hard to make sure everyone executes every block within 400ms, on average | ||
the majority of the cluster should be able to keep up. | ||
|
||
## Alternatives Considered | ||
|
||
### Separating vote and non-vote transactions into different domains | ||
|
||
An earlier Async Execution proposal suggested that we separate vote and | ||
non-vote transactions into different domains, so that we can execute them | ||
independently. The main concerns were: | ||
|
||
* We need to introduce one bit in AccountsDB for every account, which | ||
complicates the implementation | ||
|
||
* Topping off the vote fee payer accounts becomes difficult. We need to add a | ||
bounce account to move fees from user domain to vote domain, and the process | ||
may take one epoch | ||
|
||
## New Terminology | ||
|
||
* `Vote Only Bankhash`: The hash calculated after executing only vote | ||
transactions in a block without checking fee payers. The exact calculation | ||
algorithm is given in the Detailed Design section. | ||
* `Replay Tip Bankhash`: The bankhash as we know it today. It is calculated | ||
after executing all transactions in a block, checking fee payers for all. | ||
|
||
## Detailed Design | ||
|
||
### Allow leader to skip execution of transactions (Bankless Leader) | ||
|
||
There is already an ongoing effort to skip execution of all transactions | ||
entirely when the leader packs new blocks. See SIMD 82, SIMD 83, and related trackers: | ||
https://github.com/anza-xyz/agave/issues/2502 | ||
|
||
Theoretically we could reap some benefit without Bankless Leader: the leader | ||
packs as normal, while other validators replay only votes first, then later | ||
execute the other transactions and compare with the bankhash of the leader. | ||
Such a setup yields a smaller speedup, but it is a possible route during | ||
rollout. | ||
|
||
### Calculate vote only hash executing votes only and vote on selected forks | ||
|
||
Two new fields will be added to the `TowerSync` vote transaction: | ||
|
||
* `replay_tip_hash`: This is the hash as we know it today. | ||
* `replay_tip_slot`: This is the slot where the replay tip hash is calculated. | ||
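
As a rough illustration of how the two new fields sit next to the existing ones, here is a minimal Python sketch. The class name `TowerSyncSketch` and the helper `replay_lag` are hypothetical, the field types are assumptions, and all other `TowerSync` fields are omitted; this is not the actual wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerSyncSketch:
    """Hypothetical, reduced view of a TowerSync vote (other fields omitted)."""

    # Existing fields; under this proposal they carry the vote only hash
    # and the slot it was computed for.
    slot: int
    hash: bytes
    # New fields proposed here: the bankhash from full replay and its slot.
    replay_tip_slot: int
    replay_tip_hash: bytes


def replay_lag(vote: TowerSyncSketch) -> int:
    """Slots by which full replay trails vote only replay in this vote."""
    return vote.slot - vote.replay_tip_slot


# Placeholder values for illustration only.
vote = TowerSyncSketch(
    slot=120,
    hash=b"\x01" * 32,
    replay_tip_slot=100,
    replay_tip_hash=b"\x02" * 32,
)
lag = replay_lag(vote)
```

Carrying both (slot, hash) pairs in one vote lets a receiver see how far a validator's full replay trails its vote only replay.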
|
||
The `hash` and `slot` in the `TowerSync` transaction will be updated to | ||
the vote only hash. The vote only hash is calculated as follows: | ||
|
||
1. Sort all vote accounts with non-zero stake in the current epoch by | ||
vote account pubkey. | ||
Review comment: non-zero in the previous epoch right? since stakes are offset by 1 epoch?
Review comment: I think I meant to say: has power to vote in the current or previous epoch. I think I need previous epoch as well for the epoch handoff?
Review comment: what I meant is that consensus uses the stakes from the previous epoch, so if we're only interested in processing consensus votes we should match the logic here. Basically only consider vote accounts with non-zero stake in the previous epoch. If we ever implement the handoff then we should consider the previous previous epoch as well
Review comment: We should invent a name for this thing, this is so confusing. I think you and me both mean "that epoch stakes we use for voting in this Epoch", right? Then let's say in the future we decide to use epoch stakes of Epoch X for voting in Epoch X+2 we don't need to chase down all documents again and change it from "previous epoch" to "previous previous epoch". We just stick to one name. I'm confused: why don't we need stakes for the previous epoch (previous previous epoch in your description) if we don't implement handoff?
|
||
2. Calculate vote account hash by hashing (vote account pubkey, | ||
serialized vote state) in the order given. | ||
Review comment: nit: specify the hash fn
Review comment: Specified sha256.
Review comment: do we have to hash the whole vote state? only the tower / root will change
Review comment: A bit worried the vote authority isn't copied correctly in some implementation, not a lot else other than the tower/root I guess.
||
|
||
3. Calculate vote only hash by hashing: | ||
|
||
* vote only hash of the parent bank | ||
* vote account hash calculated above | ||
* block-id of the current bank | ||
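
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the specified algorithm: SHA-256 is taken from the review thread, the vote account hash is assumed to be a single hash over all sorted (pubkey, serialized vote state) pairs, the byte layout is plain concatenation, and the function names and inputs are hypothetical. The caller is assumed to have already filtered the map down to the staked vote accounts.

```python
import hashlib
from typing import Dict


def vote_account_hash(vote_states: Dict[bytes, bytes]) -> bytes:
    """Steps 1 and 2: hash (pubkey, serialized vote state) pairs in pubkey order."""
    hasher = hashlib.sha256()
    for pubkey in sorted(vote_states):  # step 1: sort by vote account pubkey
        hasher.update(pubkey)
        hasher.update(vote_states[pubkey])
    return hasher.digest()


def vote_only_hash(
    parent_hash: bytes, vote_states: Dict[bytes, bytes], block_id: bytes
) -> bytes:
    """Step 3: combine parent vote only hash, vote account hash, and block-id."""
    hasher = hashlib.sha256()
    hasher.update(parent_hash)
    hasher.update(vote_account_hash(vote_states))
    hasher.update(block_id)
    return hasher.digest()


# Hypothetical inputs: 32-byte pubkeys mapped to placeholder vote states.
states = {b"\x02" * 32: b"state-b", b"\x01" * 32: b"state-a"}
genesis = bytes(32)  # placeholder parent hash for the first bank
h1 = vote_only_hash(genesis, states, b"\xaa" * 32)
h2 = vote_only_hash(h1, states, b"\xbb" * 32)  # child bank chains on h1
```

Because the parent's vote only hash is an input, each hash commits to the whole chain of vote states behind it, and including the block-id ties the hash to one specific block.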
|
||
This step is optimistic in the sense that validators do not check the fee | ||
payers when executing the vote transactions in a block. They assume vote | ||
transactions will not fail due to insufficient fees, apply the execution | ||
results to select the correct fork, then immediately vote on the bank with | ||
only the hash result of the vote transactions. | ||
|
||
This is safe because the list of validators and their corresponding stakes | ||
come from the leader schedule epoch stakes, which are calculated at the | ||
beginning of the last epoch. Because full execution is never behind the | ||
optimistic execution by more than one epoch, the epoch stakes used are stable | ||
and correct. | ||
|
||
To make sure the vote cast would be the same as that after replaying the | ||
whole block, we need to be consistent on whether we mark the block dead, so | ||
that the ephemeral hash vote doesn't vote on a block which will be marked | ||
dead later. Currently a block can be dead for the following reasons: | ||
|
||
1. Unable to load data from blockstore | ||
2. Invalid block (wrong number of ticks, duplicate block, bad last fec, etc) | ||
3. Error while setting root | ||
4. Invalid transaction | ||
|
||
For the first two, the same check can be performed when computing the ephemeral hash. | ||
We will set root on a bank only when it has full hash computed later, so the | ||
behavior will be the same as now. | ||
Review comment: what does "set root" mean in this context?
Review comment: This is SetRootError in BlockstoreProcessorError. I think we will keep this check in the new world.
Review comment: it doesn't mark the block dead though, just kills the replay thread.
Review comment: Hmm that's true, removed this check
|
||
The only check we can't perform is the invalid transaction check: since we | ||
will skip all non-vote transaction execution, there is no way to check the | ||
validity of those transactions. The intention of this check was to prevent | ||
spam. We will remove this check and rely on economic incentives so that the | ||
leader performs appropriate checks. | ||
|
||
The vote only execution will operate exclusively on replicated vote states | ||
stored outside the accounts DB, so vote only execution and full execution can | ||
happen asynchronously in any order. The vote authority of each vote account | ||
will be copied from accounts DB at the beginning of each epoch. This means | ||
that in the future a vote authority change will take two epochs instead of | ||
one. | ||
|
||
### Replay the full block on selected forks asynchronously | ||
|
||
There is no protocol-enforced order of block replay across validator | ||
implementations. New vote transactions could be sent whenever the vote only | ||
hash or replay tip hash changes. | ||
|
||
Once a validator has determined the fork it will vote on, it can prioritize | ||
replaying blocks on the selected fork. The replay process is the same as today: | ||
all transactions (vote and non-vote) will be executed to determine the final | ||
bankhash. The computed bankhash will be attached to vote instructions, so we | ||
can still detect non-determinism (the same set of instructions leading to | ||
different results) like today, only possibly at a later time. | ||
|
||
To guarantee the blockchain will halt when full replay is having problems, we | ||
propose: | ||
|
||
1. If full replay is behind vote only replay by more than 1/2 epoch or vice | ||
versa, stop producing new blocks until the lagging replay catches up. Also set | ||
up monitoring to detect whether the distance between the two replays is growing. | ||
Review comment: re: "vice versa" can full replay run before vote only replay? I think we should ensure that full replay cannot run on a block that has not been vote only replayed
Review comment: There is no reason a full replay can't start before the bank is vote only frozen, they write into completely different set of states.
Review comment: makes sense, we should outline the interaction here: maybe point out that if we haven't completed vote replay we replay all available forks equally. also outline what happens in the case that we're full replaying a fork but end up vote replaying and voting on a different fork.
Review comment: I think what you described is reasonable, but I prefer not to specify in what order the full replay should happen in the protocol. Anza and Jump may decide to use different fork weight algorithm to pick which block to send to full replay next, I think the protocol would still work. And from what I heard, playing multiple forks at the same time or playing into the distant future is also a possibility.
|
||
2. If more than 1/3 of the validators send a different final hash for a block | ||
with the same vote only hash, panic and prompt for further debugging. | ||
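
The two halting conditions above can be sketched as simple predicates. This is a hedged illustration only: the function names are hypothetical, the 1/3 threshold is assumed to be stake-weighted (the text says "validators"), and "different" is assumed to mean different from the locally computed final hash.

```python
from typing import List, Tuple


def replay_lag_exceeded(
    vote_only_slot: int, full_replay_slot: int, slots_per_epoch: int
) -> bool:
    """True when the two replays are more than 1/2 epoch apart, either way."""
    return 2 * abs(vote_only_slot - full_replay_slot) > slots_per_epoch


def final_hash_mismatch(
    local_final_hash: bytes,
    reports: List[Tuple[int, bytes]],
    total_stake: int,
) -> bool:
    """True when more than 1/3 of total stake reports a different final hash.

    `reports` holds (stake, final hash) pairs for one block, all sharing
    the same vote only hash. Integer math avoids rounding at the threshold.
    """
    differing = sum(stake for stake, h in reports if h != local_final_hash)
    return 3 * differing > total_stake


# Hypothetical numbers: an epoch of 432000 slots and 90 units of total stake.
lagging = replay_lag_exceeded(300_000, 50_000, 432_000)
mine = b"\x01" * 32
other = b"\x02" * 32
mismatch = final_hash_mismatch(mine, [(59, mine), (31, other)], 90)
```

Both checks use strict inequalities, matching the "more than" wording in the list.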
|
||
In this step the validators will check the fee payers of vote transactions. So | ||
each vote transaction is executed twice: once in the optimistic voting stage | ||
*without* checking fee payer, and once in this stage *with* checking fee payer. | ||
If a staked validator does not have the vote fee covered for specific votes, | ||
the vote is not accepted today. Under this proposal the vote is still accepted | ||
in fork selection, but it does not earn vote credits because the transaction | ||
failed. | ||
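
A small sketch of the difference in how an underfunded vote is treated, before and after this proposal. The names `VoteOutcome` and `process_vote` and the balance-versus-fee comparison are hypothetical simplifications of the fee payer check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoteOutcome:
    counted_in_fork_choice: bool
    earns_vote_credits: bool


def process_vote(
    fee_payer_balance: int, vote_fee: int, async_vote_execution: bool
) -> VoteOutcome:
    """Model the outcome of one vote from a staked validator."""
    # The fee payer check only happens during full replay.
    fee_covered = fee_payer_balance >= vote_fee
    if async_vote_execution:
        # Optimistic stage counts the vote regardless of the fee payer;
        # credits are granted only if the full replay succeeds.
        return VoteOutcome(counted_in_fork_choice=True, earns_vote_credits=fee_covered)
    # Today: a vote whose fee is not covered is not accepted at all.
    return VoteOutcome(counted_in_fork_choice=fee_covered, earns_vote_credits=fee_covered)


# Hypothetical underfunded fee payer: balance 0, fee 5000.
today = process_vote(0, 5000, async_vote_execution=False)
proposed = process_vote(0, 5000, async_vote_execution=True)
```

With the hypothetical underfunded fee payer, `today` is neither counted nor credited, while `proposed` is counted in fork selection but earns no credits.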
|
||
### Enable Async Vote Executions | ||
|
||
1. The leader will no longer execute any transactions before broadcasting | ||
the block it packed. We do have block-id (Merkle tree root) to ensure | ||
everyone receives the same block. | ||
2. Upon receiving a new block, the validator executes only the vote | ||
transactions without checking the fee payers. The result is immediately | ||
applied in consensus to select a fork. Then votes are sent out for the | ||
selected fork with the `Vote only bankhash` for the tip of the fork and the | ||
most recent `Replay tip bankhash`. | ||
3. The blocks on the selected forks are scheduled to be replayed. When | ||
a block is replayed, all transactions are executed with fee payers checked. | ||
This is the same as the replay we use today. | ||
4. Optimistic confirmation and finalization on both the `Vote only bankhash` | ||
and the `Replay tip bankhash` will be exposed through RPC so users can select | ||
accordingly. | ||
Review comment: Might want to specify the listener changes here, we expect clients to track commitment statuses on both hashes? What does it mean to be OC/finalized on the replay tip bankhash, will invalid fee payer votes count towards this?
Review comment: Also specify changes to the commitment service, do we still aggregate when we vote on the block / root a block? or is there a separate pathway taken only when we full replay the block.
Review comment: Added to User visible changes.
5. Add assertion that confirmed `Replay tip bankhash` is not too far away from | ||
the confirmed `Vote only bankhash` (currently proposed at 1/2 of the Epoch) | ||
6. Add alerts if `Replay tip bankhash` differs when the `Vote only bankhash` is | ||
the same. This is potentially an event worthy of cluster restart. If more than | ||
1/3 of the validators claim a different `Replay tip bankhash`, halt and exit. | ||
|
||
Review comment: Specify how the feature program will look with APE:
Review comment: Added to user visible changes.
Review comment: Specify the modifications to fork choice. Fork choice will now only read vote only replayed votes. Needs to be keyed by block-id or vote only hash instead.
Review comment: It's covered in "The result is immediately applied in consensus to select a fork." I think whether we key our internal data structure by block-id or hash is implementation details we don't need in SIMD. I've added that to point 2 in "Enable Async Vote Executions".
Review comment: Specify the modifications to repair, we will now ingest vote only replayed votes in repair weighting. Also changes to repair peer selection, is EpochSlots updated after vote only replay now?
Review comment: EpochSlots just specified which slots I have to serve repair right? I don't see why we can't update it after vote only replay.
Review comment: I think repair weighting is also internal choice Anza and Jump may have different choices on, we surely should do what you suggested but we don't need it in the SIMD if it's not visible on the line.
||
## Impact | ||
|
||
Since we will eliminate the impact of non-vote transaction execution speed, | ||
we should expect to see less forking and fewer late blocks. | ||
|
||
## Security Considerations | ||
|
||
We do need to monitor and address the possibility of bankhash mismatches | ||
when the tip of the fork is far away from the slot where bankhash mismatch | ||
happened. | ||
|
||
## Backward Compatibility | ||
|
||
Most of the changes would require feature gates. |
Review comment: nit: votes are processed per entry batch, doesn't have to wait for the bank to be frozen. as a result we still partially process votes even if the block later ends up dead.
e.g. agave's impl https://github.com/anza-xyz/agave/blob/1f06fbdbe3b72f330f50ad93a15c1116f5021392/ledger/src/blockstore_processor.rs#L177
Review comment: Yeah, now I think about it, it's less about processing votes as fast as possible, more about a lot of things don't happen until the block is frozen, changes this part.