Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix audit/unenroll calls when agent runs fleet-server #6085

Merged
merged 26 commits into from
Dec 17, 2024

Conversation

michel-laterman
Copy link
Contributor

What does this PR do?

Call audit/unenroll (during uninstall) before components are stopped if the elastic-agent is running a fleet-server instance.

Why is it important?

Agents running fleet-server instances would always get an error because fleet-server would be stopped by the time the notify call is attempted.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Enroll an agent in a policy with the fleet-server integration and uninstall the agent

Related issues

@michel-laterman michel-laterman added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify labels Nov 19, 2024
@michel-laterman michel-laterman force-pushed the fix-audit-unenroll-fleet-server branch from dce24e2 to 9950ed5 Compare November 20, 2024 21:21
if cfg.Fleet.Client.Transport != nil {
cfg.Fleet.Client.Transport.TLS.VerificationMode = tlscommon.VerifyCertificate
}
cfg.Fleet.Client.Hosts = []string{cfg.Fleet.Client.Host}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's going on here; but Client.Hosts had the remote (cloud) fleet-server address, and Client.Host was the internal port.
I'm going to take some time tomorrow to verify that our normal agent operations still use the internal port when the agent is running fleet-server.

Copy link
Contributor Author

@michel-laterman michel-laterman Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agent does connect to localhost:8221; but strangely enough fleet-server does not populate the server.address attribute in HTTP logs; i'll open a fleet-server PR to fix this
pr here: elastic/fleet-server#4142

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfg.Fleet.Client.Hosts = []string{cfg.Fleet.Client.Host} is a bit weird. This is perhaps something else that could be encapsulated into the output of an RPC call to the running agent service since agent itself must know that these represent two different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid using an RPC call because we don't want to introduce the requirement that an agent is running for the uninstall to succeed.

Copy link
Member

@pchila pchila Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I think is happening here:

I guess that the merged result of that is not quite right in this scenario and we end up with the discrepancy between Host and Hosts

From discussion with @nchaulet at the time, it was determined that Fleet will send back only Hosts field while the separate Host, Protocol, Path is only handled by agent when initializing config to store command line args. This refactor has been done in context of mTLS from policy support PR as an attempt to clean up the policy change handler.

I filed an issue to have a look and possible remote Host, Protocol and Path fields from the agent config at the time but maybe this needs further evaluation/discussion taking into account fleet bootstrap scenarios

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we link to that issue and add this explanation to the code? This is going to be really not obvious to someone reading this code without the context of this PR.

@michel-laterman michel-laterman marked this pull request as ready for review November 21, 2024 19:55
@michel-laterman michel-laterman requested a review from a team as a code owner November 21, 2024 19:55
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

internal/pkg/agent/install/uninstall.go Outdated Show resolved Hide resolved
internal/pkg/agent/install/uninstall.go Outdated Show resolved Hide resolved
@pierrehilbert
Copy link
Contributor

It's a detail for now but we should probably also plan to backport this to 8.17

@cmacknz
Copy link
Member

cmacknz commented Nov 26, 2024

This needs a regression test :)

@michel-laterman michel-laterman added the backport-8.17 Automated backport with mergify label Nov 29, 2024
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on Linux with and without the fix and it solves the issue with audit/unenroll when running a local Fleet server

I left a small comment for the Host/Hosts discrepancy we have trying to give more context, not sure if we want to follow that up with a dedicated issue

Copy link
Contributor

mergify bot commented Dec 11, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-audit-unenroll-fleet-server upstream/fix-audit-unenroll-fleet-server
git merge upstream/main
git push upstream fix-audit-unenroll-fleet-server

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried it with an unprivileged agent on my Mac and it worked this time. Thanks for the follow up and additional test coverage!

❯ sudo elastic-development-agent uninstall
Elastic Agent will be uninstalled from your system at /Library/Elastic/Agent-Development. Do you want to continue? [Y/n]:y
[=== ] Done  [2s]
Elastic Agent has been uninstalled.

@michel-laterman michel-laterman enabled auto-merge (squash) December 11, 2024 23:25
Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michel-laterman
Copy link
Contributor Author

@cmacknz @ycombinator, can one of you please force this in once buildkite completes? we have integration test coverage but not unit tests

@ycombinator ycombinator disabled auto-merge December 13, 2024 00:23
@michel-laterman
Copy link
Contributor Author

buildkite test this

Copy link

Quality Gate failed Quality Gate failed

Failed conditions
3.1% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@cmacknz
Copy link
Member

cmacknz commented Dec 17, 2024

Force merging now that CI has passed.

@cmacknz cmacknz merged commit f321d8a into elastic:main Dec 17, 2024
12 of 13 checks passed
mergify bot pushed a commit that referenced this pull request Dec 17, 2024
* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)

# Conflicts:
#	internal/pkg/agent/install/uninstall.go
mergify bot pushed a commit that referenced this pull request Dec 17, 2024
* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)
mergify bot pushed a commit that referenced this pull request Dec 17, 2024
* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)

# Conflicts:
#	internal/pkg/agent/install/uninstall.go
michel-laterman added a commit that referenced this pull request Dec 19, 2024
…-server (#6365)

* Fix audit/unenroll calls when agent runs fleet-server (#6085)

* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)

# Conflicts:
#	internal/pkg/agent/install/uninstall.go

* Fix merge conflict

---------

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: michel-laterman <michel.laterman@elastic.co>
michel-laterman added a commit that referenced this pull request Dec 19, 2024
* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
michel-laterman added a commit that referenced this pull request Dec 26, 2024
…-server (#6363)

* Fix audit/unenroll calls when agent runs fleet-server (#6085)

* Fix audit/unenroll calls when agent runs fleet-server

* Fix integration test

* Increase output for failed test

* move notification attempt

* move notification attempt

* audit/uninstall will call localhost with certonly tls

* Fix linter

* remove tls setting

* Add PR link to changelog

* stop checking if we need to notify in an error is encountered

* Shorten non-fatal error messages to prevent output issues in terminals

* Don't load agentinfo, use ID value from config instead

* Add unprivilged tests to audit unenroll integration tests

* Add comments for setting hosts

(cherry picked from commit f321d8a)

# Conflicts:
#	internal/pkg/agent/install/uninstall.go

* Fix merge conflict

---------

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: michel-laterman <michel.laterman@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

elastic-agent installed as Fleet server remains stuck in uninstalling with retry errors for over 4-5 minutes.
7 participants