[ARO-11484] Fix fixetcd GA #4034

bitoku · 2025-01-01T14:03:41Z

Which issue this PR addresses:

Fixes https://issues.redhat.com/browse/ARO-11484

What this PR does / why we need it:

This PR fixes fixetcd GA.

- Fix the label selector.
- Change the image to ubi9, because it hit a glibc error with ubi8.
- Change the job to a pod. It only runs one pod, so there's no reason to use a job. Also, the job's watcher returns a job object, not a pod object.
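
The job-versus-pod point can be illustrated with a minimal sketch. The types below (`Object`, `Pod`, `Job`, `Event`) and the `podPhase` helper are hypothetical stand-ins, not the client-go types or the code in pkg/frontend/fixetcd.go. They only show why code that asserts a watch event's object to a pod fails when the watch was opened on a job.

```go
package main

import "fmt"

// Object, Pod, Job, and Event are hypothetical stand-ins for Kubernetes API
// types. They are not the real client-go definitions.
type Object interface{ Kind() string }

type Pod struct{ Phase string }
type Job struct{ Succeeded int }

func (Pod) Kind() string { return "Pod" }
func (Job) Kind() string { return "Job" }

// Event mimics a watch event that carries an object of some kind.
type Event struct{ Object Object }

// podPhase extracts the pod phase from a watch event. It returns an error
// when the event carries any other kind, which is what happens if the
// watcher was created on a job instead of a pod.
func podPhase(ev Event) (string, error) {
	pod, ok := ev.Object.(*Pod)
	if !ok {
		return "", fmt.Errorf("unexpected object kind %q, want Pod", ev.Object.Kind())
	}
	return pod.Phase, nil
}

func main() {
	events := []Event{
		{Object: &Job{Succeeded: 1}},       // event from a job watcher
		{Object: &Pod{Phase: "Succeeded"}}, // event from a pod watcher
	}
	for _, ev := range events {
		phase, err := podPhase(ev)
		fmt.Println(phase, err)
	}
}
```

Watching the pod directly means the event object already has the type that the status check expects, so no conversion from a job is needed.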

I also added an e2e test, but it takes a long time and can't run in parallel, so I made it a regression test.
It doesn't run by default in CI.

The e2e test is intended to ensure that the master replacement SOP is valid, because I couldn't reproduce the etcd issue 100% of the time.
I think it's a sufficient solution until we find a reliably reproducible scenario.

Test plan for issue:

e2e

Is there any documentation that needs to be updated for this PR?

no

How do you know this will function as expected in production?

e2e, master replacement & fixetcd GA

kimorris27

Mostly LGTM, and it was thoughtful to add an E2E test. I made some small suggestions and have one other thing to point out.

There's a part of the user story that I don't see addressed in the PR: I want to add a conditional check for when the node IP address remains the same, and delete the existing etcd Pod if it's in a crashloop. Is that part of the story still needed?

pkg/frontend/fixetcd.go

test/e2e/etcd.go

bitoku · 2025-01-02T16:23:13Z

@kimorris27 Thank you for taking a look!

> There's a part of the user story that I don't see addressed in the PR: I want to add a conditional check for when the node IP address remains the same, and delete the existing etcd Pod if it's in a crashloop. Is that part of the story still needed?

I don't think it's actually needed.
During the test, I got the error many times when the pod's IP address was unchanged, and the fixetcd API fixed the issue.
Fixetcd just deletes the etcd member, but it seems that when there's a change in etcd, etcd-operator automatically redeploys all etcd pods.

You might be able to reproduce it by running the e2e test. When you get a 200 from the fixetcd API, the etcd pod should be in CrashLoopBackOff.
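
As a rough sketch of how a crash-looping pod can be recognized: Kubernetes reports the state as a container status whose waiting reason is `CrashLoopBackOff`. The types below are simplified, hypothetical stand-ins that mirror the shape of the core/v1 status fields, and `isCrashLooping` is an illustrative helper. The check that the e2e test actually performs is not shown in this conversation.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins that mirror the shape of the Kubernetes
// core/v1 pod status. They are not the real API types.
type ContainerStateWaiting struct{ Reason string }

type ContainerState struct{ Waiting *ContainerStateWaiting }

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct{ ContainerStatuses []ContainerStatus }

// crashLoopReason is the waiting reason Kubernetes sets on a container that
// keeps restarting.
const crashLoopReason = "CrashLoopBackOff"

// isCrashLooping reports whether any container in the pod is waiting with
// the CrashLoopBackOff reason.
func isCrashLooping(status PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == crashLoopReason {
			return true
		}
	}
	return false
}

func main() {
	status := PodStatus{ContainerStatuses: []ContainerStatus{
		{Name: "etcd", State: ContainerState{Waiting: &ContainerStateWaiting{Reason: crashLoopReason}}},
	}}
	fmt.Println(isCrashLooping(status))
}
```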

kimorris27

LGTM given the responses to my original comments. The only issue I see now is that it looks like some unit tests in pkg/frontend need to be updated to reflect the changes to fixetcd.go, but maybe someone else will need to pick up this work first?

Improve etcd error handling and add e2e test for master recovery

6435a57

bitoku requested review from jewzaam, bennerv, hawkowl, rogbas, petrkotas, jharrington22, cblecker, cadenmarchese, UlrichSchlueter, SudoBrendan, yjst2012, jaitaiwan, anshulvermapatel, hlipsig, tiguelu, mociarain, kimorris27, tsatam and fahlmant as code owners January 1, 2025 14:03

kimorris27 requested changes Jan 2, 2025

Clarify the logs

a1bdb19

kimorris27 requested changes Jan 3, 2025

fix unit tests

0753a4a

kimorris27 approved these changes Jan 3, 2025
