Backport router health checks #501

Merged
merged 2 commits into from
Dec 3, 2024

Serpentian
Contributor

@Serpentian Serpentian commented Dec 3, 2024

Backport of https://github.com/tarantool/vshard-ee/pull/20. No changes in code. The commit message of the first commit was changed (commit hash), as was the reason for NO_DOC in the second one.

This commit fixes the flakiness of the failover/cluster_changes
test, which was caused by the commit 9fc976d ("router: calls
affect temporary prioritized replica"). That commit started to check
`replica.net_sequential_ok` during `up_replica_priority`. However,
during the configuration of the instance, `net_sequential_ok` is 0,
so when the failover fiber had not managed to ping the instance yet
(e.g. the connection had not been created yet), the router threw
a SUBOPTIMAL_REPLICA alert.

Let's disable the health checks during router configuration
(when the prioritized replica is not set at all).
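
The guard described above can be sketched as follows. This is an
illustration in Python, not vshard's Lua code: the `Replica` class and
the `passes_health_check` function are hypothetical, and treating
`net_sequential_ok > 0` as "healthy" is an assumption made for the
example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    # Assumed here to count consecutive successful requests;
    # it is 0 right after the instance is configured.
    net_sequential_ok: int = 0


def passes_health_check(replica: Replica,
                        prioritized: Optional[Replica]) -> bool:
    """Hypothetical health check with the configuration-time guard."""
    if prioritized is None:
        # No prioritized replica yet: the router is still being
        # configured, so health checks are skipped.
        return True
    # Assumption: at least one successful request means "healthy".
    return replica.net_sequential_ok > 0
```

With such a guard, a freshly configured replica whose counter is still
0 does not trigger an alert until a prioritized replica has been chosen.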

Closes tarantool#495

NO_DOC=bugfix
NO_TEST=<covered by failover/cluster_changes>
Before this patch, the router didn't take into account the state of
box.info.replication of the storage when routing requests to it.
From now on, the router automatically lowers the priority of a replica
when it supposes that the connection from the master to the replica
is dead (status or idle > 30) or too slow (lag is > 30 sec).
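
A minimal sketch of such a check, written in Python for illustration
only (vshard itself is written in Lua). The `Upstream` class and the
`replication_is_healthy` function are hypothetical. The `status`,
`idle` and `lag` fields mirror the upstream fields of
box.info.replication. Treating any status other than "follow" as dead,
and using one 30-second threshold for both idle and lag, are
assumptions of the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold in seconds, taken from the "> 30" in the text above.
THRESHOLD_SEC = 30.0


@dataclass
class Upstream:
    status: str   # e.g. "follow", "disconnected", "stopped"
    idle: float   # seconds since the last event from the master
    lag: float    # replication lag in seconds


def replication_is_healthy(upstream: Optional[Upstream],
                           threshold: float = THRESHOLD_SEC) -> bool:
    """Return False if the link from the master looks dead or too slow."""
    if upstream is None:
        # No upstream info for the master at all: treat as unhealthy.
        return False
    if upstream.status != "follow":
        # Assumption: only "follow" counts as a live connection.
        return False
    if upstream.idle > threshold:
        return False
    return upstream.lag <= threshold
```

A router could lower the priority of every replica for which such
a check returns False, so that reads prefer replicas that are keeping
up with the master.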

We also change REPLICA_NOACTIVITY_TIMEOUT from 5 minutes to 30 seconds.
This is needed to speed up how quickly a replica notices a change of
the master. Before the patch, a non-master never knew where the master
currently was. Now, since we check the status of the master's upstream,
we need to find the master in service_info via conn_manager. Since the
replica doesn't make any requests to the master after that, the
connection is collected by conn_manager in collect_idle_conns after
30 seconds. Then the router's failover calls service_info one more time
and the non-master locates the master, which may have already changed.
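
The idle-collection cycle described above can be sketched like this.
It is a Python illustration under stated assumptions, not the real
conn_manager: the `ConnManager` class, its `use` method and the
explicit `now` arguments are hypothetical, and only the
`collect_idle_conns` name and the 30-second timeout come from the
text.

```python
from typing import Dict, List

# 30 seconds, the new REPLICA_NOACTIVITY_TIMEOUT value from the text.
IDLE_TIMEOUT_SEC = 30.0


class ConnManager:
    """Hypothetical connection registry keyed by instance name."""

    def __init__(self, idle_timeout: float = IDLE_TIMEOUT_SEC) -> None:
        self.idle_timeout = idle_timeout
        self._last_used: Dict[str, float] = {}

    def use(self, name: str, now: float) -> None:
        # Record that a request was made over this connection.
        self._last_used[name] = now

    def collect_idle_conns(self, now: float) -> List[str]:
        # Drop every connection unused for at least idle_timeout.
        idle = [name for name, ts in self._last_used.items()
                if now - ts >= self.idle_timeout]
        for name in idle:
            del self._last_used[name]
        return idle
```

Once the connection to the old master is dropped, the next
service_info call has to look the master up again, which is how
a shorter timeout makes the replica notice a master change sooner.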

This patch improves the consistency of read requests and
decreases the probability of reading stale data.

Closes tarantool#453
Closes tarantool#487

NO_DOC=bugfix
Collaborator

@Gerold103 Gerold103 left a comment

Thanks for the port, and I am happy that you got this patch open-sourced!

@Gerold103 Gerold103 merged commit e1c806e into tarantool:master Dec 3, 2024
8 checks passed