SIMD-0165: Async Vote Execution #165

Open
wants to merge 24 commits into
base: main
8bce07a
Initial draft of Async Vote Execution SIMD.
wen-coding Aug 12, 2024
74f3a33
Change SIMD number.
wen-coding Aug 12, 2024
4b50153
Make linter happy.
wen-coding Aug 12, 2024
a8aad6c
Make linter happy.
wen-coding Aug 16, 2024
55fae1d
Update the plan for clock calculation.
wen-coding Aug 26, 2024
ca682d7
Update the proposal to reflect optimistic vote execution plan we disc…
wen-coding Sep 19, 2024
93b2506
Update dash to asterisk.
wen-coding Sep 19, 2024
a78f256
Rename VED and UED hash to Ephemeral and Final hash.
wen-coding Sep 20, 2024
67a1f45
Change title to reflect that we calculate Ephemeral hash.
wen-coding Sep 20, 2024
e38b645
Explain the checks we will perform during ephemeral hash computation.
wen-coding Sep 27, 2024
7dd9801
Add new fields in TowerSync.
wen-coding Sep 27, 2024
60a66af
Update that which block (slot, hash) on the vote transaction is.
wen-coding Sep 27, 2024
5432353
Clarify that new votes should be sent out when either hash changes.
wen-coding Sep 27, 2024
40d4be6
Propose to halt and exit if >1/3 disagrees on final bankhash.
wen-coding Oct 25, 2024
1b3ed81
Change status to Idea.
wen-coding Oct 25, 2024
4e83392
Update calculation of vote only hash and others.
wen-coding Nov 28, 2024
35ec617
Fix some small problems.
wen-coding Dec 11, 2024
93372ea
Address review comments.
wen-coding Dec 11, 2024
7daaf7a
Specify the hash function used.
wen-coding Dec 12, 2024
b7fc403
Clarify the set root error.
wen-coding Dec 12, 2024
f4278f4
Set root will not cause the block to be marked dead.
wen-coding Dec 12, 2024
7b3de13
Add the user visible changes section.
wen-coding Dec 12, 2024
8ac7180
Clarify we only use vote only hash in fork selection.
wen-coding Dec 12, 2024
437d66d
EpochSlots should be updated when banks are vote only frozen.
wen-coding Dec 12, 2024
251 changes: 251 additions & 0 deletions proposals/0165-async-vote-execution.md
@@ -0,0 +1,251 @@
---
simd: '0165'
title: Async Vote Execution
authors:
- Wen Xu
category: Standard
type: Core
status: Idea
created: 2024-08-11
feature: null
supersedes: null
superseded-by: null
extends: null
---

## Summary

Optimistically execute all vote transactions in a block to determine fork
selection in consensus early on, before all the transactions in the block
are fully executed and the actual fee payers for vote transactions are
checked.

This allows us to converge more quickly on one chain of blocks, so that
validators don't have to execute any blocks not on the selected fork. This
saves the CPU and memory resources needed in replay, and it also ensures that
the cluster will have fewer forks caused by slow transaction execution.

## Motivation

Currently vote transactions and non-vote transactions are mixed together in
a block, and a block is considered in consensus only after the whole block has
been frozen and all transactions in the block have been verified and executed.
This is a problem because slow running non-vote transactions may affect the
ability of consensus to pick the correct fork. It may also mean that the leader
will more often build on a minority fork, so the blocks it packed will be
discarded later.

With different hardware and running environments, there will always be some
difference in transaction execution speed between validators. Generally
speaking, because vote transactions are so simple, the variation in vote
execution time should be smaller than that in non-vote execution time. Vote
transactions are also lock-free, so they normally execute faster than
non-vote transactions. Therefore, if we only execute vote transactions in a
block before voting on the block, it is more likely that validators can reach
consensus faster.

Even with async vote execution, forks can still happen because of
various other situations, like network partitions or mis-configured validators.
This work just reduces the chances of forks caused by variance in non-vote
transaction executions.

The non-vote transactions do need to be executed eventually. Even though it's
hard to make sure everyone executes every block within 400ms, on average a
majority of the cluster should be able to keep up.

## Alternatives Considered

### Separating vote and non-vote transactions into different domains

An earlier Async Execution proposal suggested separating vote and non-vote
transactions into different domains, so that we can execute them
independently. The main concerns were:

* We need to introduce one bit in AccountsDB for every account, which
complicates the implementation

* Topping off the vote fee payer accounts becomes difficult. We need to add a
bounce account to move fees from the user domain to the vote domain, and the
process may take one epoch

## New Terminology

* `Vote Only Bankhash`: The hash calculated after executing only vote
transactions in a block without checking fee payers. The exact calculation
algorithm is listed in the next section.
* `Replay Tip Bankhash`: The bankhash as we know it today. It is calculated
after executing all transactions in a block, checking fee payers for all.

## Detailed Design

### Allow leader to skip execution of transactions (Bankless Leader)

There is already an ongoing effort to skip execution of all transactions
entirely when the leader packs new blocks. See SIMD 82, SIMD 83, and related
trackers:
https://github.com/anza-xyz/agave/issues/2502

Theoretically we could reap some benefit without Bankless Leader: the leader
packs as normal, while other validators replay only votes first, then later
execute the other transactions and compare the result with the leader's
bankhash. Such a setup yields a smaller speedup, but it is a possible route
during rollout.

### Calculate vote only hash by executing votes only and vote on selected forks

Two new fields will be added to the `TowerSync` vote transaction:

* `replay_tip_hash`: This is the hash as we know it today.
* `replay_tip_slot`: This is the slot where the replay tip hash is calculated.

The existing `hash` and `slot` fields in the `TowerSync` transaction will now
carry the vote only hash and its slot. The vote only hash is calculated as
follows:

1. Sort all vote accounts with non-zero stake in the current or previous
epoch by vote account pubkey.

2. Calculate vote account hash by calculating sha256 hash of (vote account
pubkey, serialized vote state) in the order given.

3. Calculate vote only hash by calculating sha256 hash of the following in
the given order:

* vote only hash of the parent bank
* vote account hash calculated above
* block-id of the current bank
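
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the specified wire format: it assumes pubkeys, serialized vote states, and block-ids are already available as byte strings, and that inputs are fed to sha256 by plain concatenation. The function names `vote_account_hash` and `vote_only_hash` are hypothetical.

```python
import hashlib
from typing import Dict


def vote_account_hash(vote_states: Dict[bytes, bytes]) -> bytes:
    """Steps 1-2: sha256 over (pubkey, serialized vote state) in pubkey order.

    Assumption: `vote_states` maps vote account pubkey -> serialized vote
    state and already contains only accounts with non-zero stake in the
    current or previous epoch.
    """
    hasher = hashlib.sha256()
    for pubkey in sorted(vote_states):  # step 1: sort by vote account pubkey
        hasher.update(pubkey)
        hasher.update(vote_states[pubkey])
    return hasher.digest()


def vote_only_hash(
    parent_vote_only_hash: bytes,
    vote_states: Dict[bytes, bytes],
    block_id: bytes,
) -> bytes:
    """Step 3: sha256 over parent hash, vote account hash, block-id, in order."""
    hasher = hashlib.sha256()
    hasher.update(parent_vote_only_hash)
    hasher.update(vote_account_hash(vote_states))
    hasher.update(block_id)
    return hasher.digest()
```

Because the accounts are sorted before hashing, two validators holding the same vote states produce the same hash regardless of the order in which they stored them.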

This step is optimistic in the sense that validators do not check the fee
payers when executing the vote transactions in a block. They assume vote
transactions will not fail due to insufficient fees, apply the execution
results to select the correct fork, then immediately vote on the bank with
only the hash result of the vote transactions.

This is safe because the list of validators and their corresponding stakes
comes from the leader schedule epoch stakes, which are calculated at the
beginning of the previous epoch. Because full execution is never behind the
optimistic execution by more than one epoch, the epoch stakes used are stable
and correct.

To make sure the vote cast would be the same as the one cast after replaying
the whole block, we need to be consistent on whether we mark the block dead, so
that a vote on the vote only hash doesn't land on a block which will be marked
dead later. Currently a block can be marked dead for the following reasons:

1. Unable to load data from blockstore
2. Invalid block (wrong number of ticks, duplicate block, bad last fec, etc)
3. Invalid transaction

For the first two, the same checks can be performed while computing the vote
only hash. We will add a new check that the new root must be both vote only
replayed and fully replayed; this may mean the tower occasionally has more
than 32 slots.

Contributor

Can you explain this part? We avoid setting the root if the block has not been fully replayed, so the tower can contain thousands of slots?
Why can't we set the root like we do today and modify pruning to keep up to the latest fully replayed root? Essentially moving the smr to prune by to be the replayed-smr

Contributor Author

We need to save the slots which should be kicked out of tower but haven't been fully replayed somewhere. We can only save snapshot on a fully replayed slot, and we need to keep ancestor relationships between blocks. It could be somewhere not on the tower, are you mostly worried about the space cost if we keep it on the tower?

Contributor

we have all that in bank forks right? the tower is just a list of slot #s

Contributor Author

We do. I think the main question is whether we want it in vote transactions. We do want replay tip in there, but maybe not the full list of slots.

The only condition we can't check is an invalid transaction: since we skip
all non-vote transaction execution, there is no way to check the validity of
those transactions. The intention of this check was to prevent spam. We will
remove this check and rely on economic incentives so that the leader performs
appropriate checks.

The vote only execution will operate exclusively on replicated vote states
stored outside the accounts DB, so vote only execution and full execution can
happen asynchronously in any order. The vote authority of each vote account
will be copied from the accounts DB at the beginning of each epoch. This means
that in the future a vote authority change will take two epochs instead of
one.

### Replay the full block on selected forks asynchronously

There is no protocol-enforced order of block replay across validator
implementations. New vote transactions could be sent whenever the vote only
hash or the replay tip hash changes.
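
As a rough sketch of that rule, a validator could remember the pair of tips it last voted with and send a new vote when either hash differs. The `VoteTips` type and `should_send_new_vote` function are hypothetical names for illustration; the field names follow the `TowerSync` fields described earlier, but this is a simplified stand-in and not the actual transaction layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoteTips:
    """Simplified stand-in for the tip fields carried in a vote (assumption)."""
    slot: int                # slot of the vote only hash
    hash: bytes              # vote only hash
    replay_tip_slot: int     # slot of the replay tip hash
    replay_tip_hash: bytes   # bankhash after full replay


def should_send_new_vote(last_sent: Optional[VoteTips], current: VoteTips) -> bool:
    """Send a new vote if nothing was sent yet or either hash has changed."""
    if last_sent is None:
        return True
    return (
        last_sent.hash != current.hash
        or last_sent.replay_tip_hash != current.replay_tip_hash
    )
```

With this shape, full replay advancing on its own (a new `replay_tip_hash` with an unchanged vote only tip) is enough to trigger a fresh vote.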

Once a validator has determined the fork it will vote on, it can prioritize
replaying blocks on the selected fork. The replay process is the same as today:
all transactions (vote and non-vote) will be executed to determine the final
bankhash (the `Replay Tip Bankhash`). The computed bankhash will be attached to
vote instructions, so we can still detect non-determinism (the same set of
instructions leading to different results) like today, though possibly at a
later time.

To guarantee the blockchain will halt when full replay is having problems, we
propose:

1. If full replay is behind vote only replay by more than 1/2 epoch or vice
versa, stop producing new blocks until the lagging replay catches up. Also set
up monitoring for whether the distance between the two replays is growing.

Contributor

re: "vice versa" can full replay run before vote only replay? I think we should ensure that full replay cannot run on a block that has not been vote only replayed

Contributor Author

There is no reason a full replay can't start before the bank is vote only frozen, they write into completely different set of states.

Contributor

makes sense, we should outline the interaction here:

> Once a validator has determined the fork it will vote on, it can prioritize
> replaying blocks on the selected fork

maybe point out that if we haven't completed vote replay we replay all available forks equally. also outline what happens in the case that we're full replaying a fork but end up vote replaying and voting on a different fork.

Contributor Author

I think what you described is reasonable, but I prefer not to specify in what order the full replay should happen in the protocol. Anza and Jump may decide to use different fork weight algorithm to pick which block to send to full replay next, I think the protocol would still work. And from what I heard, playing multiple forks at the same time or playing into the distant future is also a possibility.

2. If more than 1/3 of the validators send a different replay tip hash for a
block with the same vote only hash, panic and prompt for further debugging.
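
The two halt conditions can be sketched as simple predicates. This is an illustration under stated assumptions: the 1/3 threshold is measured by stake rather than by validator count, "different" means different from the local validator's own replay tip hash, and the epoch length is passed in as a parameter. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def replay_lag_exceeded(
    vote_only_slot: int, replay_tip_slot: int, slots_per_epoch: int
) -> bool:
    """Condition 1: the two replays are more than 1/2 epoch apart, either way."""
    return 2 * abs(vote_only_slot - replay_tip_slot) > slots_per_epoch


def replay_tip_mismatch_exceeded(
    our_vote_only_hash: bytes,
    our_replay_tip_hash: bytes,
    reports: Iterable[Tuple[int, bytes, bytes]],
    total_stake: int,
) -> bool:
    """Condition 2: more than 1/3 of stake reports a different replay tip hash.

    `reports` holds (stake, vote_only_hash, replay_tip_hash) per validator
    for one block. Only reports that agree with us on the vote only hash
    are compared, since the rule applies to blocks with the same vote only
    hash.
    """
    disagreeing = sum(
        stake
        for stake, vote_only, replay_tip in reports
        if vote_only == our_vote_only_hash and replay_tip != our_replay_tip_hash
    )
    return 3 * disagreeing > total_stake  # integer form of disagreeing > 1/3
```

Integer comparisons are used instead of division so the thresholds are exact at the boundary.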

In this step the validators will check the fee payers of vote transactions. So
each vote transaction is executed twice: once in the optimistic voting stage
*without* checking the fee payer, and once in this stage *with* checking the
fee payer. If a staked validator does not have the vote fee covered for
specific votes, today we do not accept the vote. In the future we will accept
the vote in fork selection but will not give vote credits, because the
transaction failed.
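
The paragraph above describes two passes over the same vote transactions. A minimal sketch, assuming a flat fee per vote and one credit per successful vote (both simplifications), with hypothetical names `VoteTx`, `optimistic_pass`, and `full_replay_pass`:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VoteTx:
    voter: str      # vote account identifier (illustrative)
    fee_payer: str  # account paying the transaction fee
    slot: int       # slot being voted on


def optimistic_pass(votes: List[VoteTx]) -> Dict[str, int]:
    """Vote only execution: every vote counts for fork selection, no fee check."""
    latest: Dict[str, int] = {}
    for vote in votes:
        latest[vote.voter] = max(latest.get(vote.voter, 0), vote.slot)
    return latest


def full_replay_pass(
    votes: List[VoteTx], balances: Dict[str, int], fee: int
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Full replay: charge the fee payer; only paid votes earn credits."""
    remaining = dict(balances)
    credits: Dict[str, int] = {}
    for vote in votes:
        balance = remaining.get(vote.fee_payer, 0)
        if balance >= fee:
            remaining[vote.fee_payer] = balance - fee
            credits[vote.voter] = credits.get(vote.voter, 0) + 1
        # Otherwise the transaction fails: no fee is charged and no credit
        # is given, but the vote was already counted in optimistic_pass.
    return credits, remaining
```

A vote whose fee payer is empty still appears in the `optimistic_pass` result, which matches the intent that fork selection does not depend on fee payer state.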

### Enable Async Vote Executions

1. The leader will no longer execute any transactions before broadcasting
the block it packed. We do have block-id (Merkle tree root) to ensure
everyone receives the same block.
2. Upon receiving a new block, the validator executes only the vote
transactions without checking the fee payers. The result is immediately
applied in consensus to select a fork. Then votes are sent out for the
selected fork with the `Vote only bankhash` for the tip of the fork and the
most recent `Replay tip bankhash`. Note that fork selection will be based
only on the most recent `Vote only bankhash` and its associated slot.
The EpochSlots will also be updated when the banks have completed vote
only replay. `Replay tip bankhash` is used mostly for commitment aggregation
and the security checks described below.
3. The blocks on the selected forks are scheduled to be replayed. When
a block is replayed, all transactions are executed with fee payers checked.
This is the same as the replay we use today.
4. Optimistic confirmation and finalization on both `Vote only bankhash` and
`Replay tip bankhash` will be exposed through RPC so users can select
accordingly.

Contributor

Might want to specify the listener changes here, we expect clients to track commitment statuses on both hashes? What does it mean to be OC/finalized on the replay tip bankhash, will invalid fee payer votes count towards this?

Contributor

Also specify changes to the commitment service, do we still aggregate when we vote on the block / root a block? or is there a separate pathway taken only when we full replay the block.

Contributor Author

Added to User visible changes.

5. Add an assertion that the confirmed `Replay tip bankhash` is not too far
from the confirmed `Vote only bankhash` (currently proposed at 1/2 of an epoch).
6. Add alerts if `Replay tip bankhash` differs when the `Vote only bankhash` is
the same. This is potentially an event worthy of a cluster restart. If more
than 1/3 of the validators claim a different `Replay tip bankhash`, halt and
exit.

Contributor

Specify how the feature program will look with APE:

  • Normal pathway features will work same as before
  • APE features will require 2 epochs of advance activation

Contributor Author

Added to user visible changes.

Contributor

Specify the modifications to fork choice. Fork choice will now only read vote only replayed votes. Needs to be keyed by block-id or vote only hash instead.

Contributor Author

It's covered in "The result is immediately applied in consensus to select a fork." I think whether we key our internal data structure by block-id or hash is implementation details we don't need in SIMD. I've added that to point 2 in "Enable Async Vote Executions".

Contributor

Specify the modifications to repair, we will now ingest vote only replayed votes in repair weighting. Also changes to repair peer selection, is EpochSlots updated after vote only replay now?

Contributor Author

EpochSlots just specified which slots I have to serve repair right? I don't see why we can't update it after vote only replay.

Contributor Author

I think repair weighting is also internal choice Anza and Jump may have different choices on, we surely should do what you suggested but we don't need it in the SIMD if it's not visible on the line.
Added that we will add banks to EpochSlots after vote only replay.

### User visible changes

Because we confirm or finalize blocks based on `Vote only bankhash`, the
following changes will be visible to users:

1. New RPC Commitment Levels:

Right now we have 3 commitment levels users can specify in RPC:
Processed/Confirmed/Finalized. These commitment levels will be calculated
based on `Vote only bankhash`. There will be two additional commitment levels:

* ReplayTipConfirmed: The highest slot that a supermajority of the cluster
has voted on with the same `Replay Tip Bankhash`. Votes with invalid fee
payers still count toward this confirmation level.
* ReplayTipFinalized: The highest slot where the block is Finalized and
ReplayTipConfirmed, recognized by a supermajority of the cluster.
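
A simplified sketch of how ReplayTipConfirmed could be derived from observed votes. It assumes a supermajority means strictly more than 2/3 of total stake, and it only counts votes whose replay tip is exactly the slot in question (a real aggregation would likely also credit ancestors of the voted replay tip). The function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def replay_tip_confirmed(
    votes: Iterable[Tuple[int, int, bytes]], total_stake: int
) -> Optional[int]:
    """Return the highest slot with > 2/3 stake on one replay tip hash.

    `votes` holds (stake, replay_tip_slot, replay_tip_hash), one entry per
    validator. Votes with invalid fee payers are expected to be included by
    the caller, since they still count toward this level.
    """
    stake_by_tip: Dict[Tuple[int, bytes], int] = defaultdict(int)
    for stake, slot, tip_hash in votes:
        stake_by_tip[(slot, tip_hash)] += stake
    confirmed = [
        slot
        for (slot, _tip_hash), stake in stake_by_tip.items()
        if 3 * stake > 2 * total_stake
    ]
    return max(confirmed, default=None)
```

Keying the tally by `(slot, hash)` rather than by slot alone keeps stake that disagrees on the hash from being pooled together.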

2. Feature activation:

Feature activations that don't affect the vote program still work as
before. Feature activations that affect the vote program will require
two epochs to activate. When a feature affecting the vote program is activated
across a block boundary, we can be sure the feature is activated only when
the first block in the epoch is fully replayed and confirmed. Because the
`Replay tip` block is never more than one epoch away from the `Vote only tip`,
it's safe to assume a vote program related feature is activated after one
full epoch.

## Impact

Since we will eliminate the impact of non-vote transaction execution speed,
we should expect to see less forking and fewer late blocks.

## Security Considerations

We do need to monitor and address the possibility of bankhash mismatches
when the tip of the fork is far away from the slot where the bankhash mismatch
happened.

## Backward Compatibility

Most of the changes would require feature gates.