-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix audit/unenroll calls when agent runs fleet-server #6085
Fix audit/unenroll calls when agent runs fleet-server #6085
Conversation
dce24e2
to
9950ed5
Compare
if cfg.Fleet.Client.Transport != nil { | ||
cfg.Fleet.Client.Transport.TLS.VerificationMode = tlscommon.VerifyCertificate | ||
} | ||
cfg.Fleet.Client.Hosts = []string{cfg.Fleet.Client.Host} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what's going on here; but Client.Hosts
had the remote (cloud) fleet-server address, and Client.Host
was the internal port.
I'm going to take some time tomorrow to verify that our normal agent operations still use the internal port when the agent is running fleet-server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agent does connect to localhost:8221
; but strangely enough fleet-server does not populate the server.address
attribute in HTTP logs; i'll open a fleet-server PR to fix this
pr here: elastic/fleet-server#4142
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cfg.Fleet.Client.Hosts = []string{cfg.Fleet.Client.Host}
is a bit weird. This is perhaps something else that could be encapsulated into the output of an RPC call to the running agent service since agent itself must know that these represent two different things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid using an RPC call because we don't want to introduce the requirement that an agent is running for the uninstall to succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I think is happening here:
- elastic agent sets
Host
from the enroll command - in the policy_change handler elastic-agent uses only
Hosts
field coming from Fleet and tries to reset the other fields.
I guess that the merged result of that is not quite right in this scenario and we end up with the discrepancy between Host
and Hosts
From discussion with @nchaulet at the time, it was determined that Fleet will send back only Hosts
field while the separate Host
, Protocol
, Path
is only handled by agent when initializing config to store command line args. This refactor has been done in context of mTLS from policy support PR as an attempt to clean up the policy change handler.
I filed an issue to have a look and possible remote Host, Protocol and Path fields from the agent config at the time but maybe this needs further evaluation/discussion taking into account fleet bootstrap scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we link to that issue and add this explanation to the code? This is going to be really not obvious to someone reading this code without the context of this PR.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
It's a detail for now but we should probably also plan to backport this to 8.17 |
This needs a regression test :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested on Linux with and without the fix and it solves the issue with audit/unenroll when running a local Fleet server
I left a small comment for the Host
/Hosts
discrepancy we have trying to give more context, not sure if we want to follow that up with a dedicated issue
This pull request is now in conflicts. Could you fix it? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried it with an unprivileged agent on my Mac and it worked this time. Thanks for the follow up and additional test coverage!
❯ sudo elastic-development-agent uninstall
Elastic Agent will be uninstalled from your system at /Library/Elastic/Agent-Development. Do you want to continue? [Y/n]:y
[=== ] Done [2s]
Elastic Agent has been uninstalled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@cmacknz @ycombinator, can one of you please force this in once buildkite completes? we have integration test coverage but not unit tests |
buildkite test this |
Quality Gate failedFailed conditions |
Force merging now that CI has passed. |
* Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a) # Conflicts: # internal/pkg/agent/install/uninstall.go
* Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a)
* Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a) # Conflicts: # internal/pkg/agent/install/uninstall.go
…-server (#6365) * Fix audit/unenroll calls when agent runs fleet-server (#6085) * Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a) # Conflicts: # internal/pkg/agent/install/uninstall.go * Fix merge conflict --------- Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com> Co-authored-by: michel-laterman <michel.laterman@elastic.co>
* Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
…-server (#6363) * Fix audit/unenroll calls when agent runs fleet-server (#6085) * Fix audit/unenroll calls when agent runs fleet-server * Fix integration test * Increase output for failed test * move notification attempt * move notification attempt * audit/uninstall will call localhost with certonly tls * Fix linter * remove tls setting * Add PR link to changelog * stop checking if we need to notify in an error is encountered * Shorten non-fatal error messages to prevent output issues in terminals * Don't load agentinfo, use ID value from config instead * Add unprivilged tests to audit unenroll integration tests * Add comments for setting hosts (cherry picked from commit f321d8a) # Conflicts: # internal/pkg/agent/install/uninstall.go * Fix merge conflict --------- Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com> Co-authored-by: michel-laterman <michel.laterman@elastic.co>
What does this PR do?
Call audit/unenroll (during uninstall) before components are stopped if the elastic-agent is running a fleet-server instance.
Why is it important?
Agents running fleet-server instances would always get an error because fleet-server would be stopped by the time the notify call is attempted.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added tests that prove my fix is effective or that my feature works./changelog/fragments
using the changelog toolHow to test this PR locally
Enroll an agent in a policy with the fleet-server integration and uninstall the agent
Related issues