Add support for running EDOT inside of running Elastic Agent #5767
This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
This pull request is now in conflicts. Could you fix it? 🙏
This is ready for review. Don't be too scared by the size of this change; most of it comes from NOTICE.txt, the generated control protocol and the rename of To answer the incoming question of why rename
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
I think this looks good; the mentioned change makes this PR a bit uglier than it needs to be.
I imagined our first take on the hybrid approach pretty much like this. I have no major things to point out besides the missing keyword and verifying that Windows works.
Tested this locally, seems fine. Status is reported nicely at the component level.
@@ -2,8 +2,6 @@
// or more contributor license agreements. Licensed under the Elastic License 2.0;
// you may not use this file except in compliance with the Elastic License 2.0.

//go:build !windows
Have you verified that Agent installed as a Windows service starts properly without any timing issues? It was a bit difficult to verify last time.
linking issue #4976 cc @leehinman
I am not seeing those issues in this PR. This PR is running on Windows with the integration testing framework and all of those are passing just fine. Those would not be passing if this was not working correctly.
internal/pkg/config/config.go
Outdated
@@ -39,14 +51,20 @@ var DefaultOptions = []interface{}{
ucfg.VarExp,
VarSkipKeys("inputs"),
ucfg.IgnoreCommas,
OTelKeys("receivers", "processors", "exporters", "extensions", "service"),
missing connectors
refer to unmarshaller https://github.com/open-telemetry/opentelemetry-collector/blob/8e522ad950de6326a0841d7e1bef808bbc0d3537/otelcol/unmarshaler.go#L18
Added! Thanks for the catch.
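For readers following the thread, here is a minimal sketch of what separating the OTel top-level keys from the rest of an agent configuration could look like. It is not the code from this PR: `splitOTel` and the plain `map[string]any` representation are hypothetical, and the key list is the one from the diff plus the `connectors` key requested in the review.

```go
package main

import "fmt"

// otelKeys lists the top-level keys treated as collector configuration.
// The first five come from the diff under review; "connectors" is the key
// the reviewer pointed out as missing.
var otelKeys = []string{"receivers", "processors", "exporters", "extensions", "service", "connectors"}

// splitOTel is a hypothetical helper (not part of the PR). It copies cfg,
// moves every OTel top-level key into a separate map, and returns both
// halves. The input map is left untouched.
func splitOTel(cfg map[string]any) (agent, otel map[string]any) {
	agent = make(map[string]any, len(cfg))
	otel = make(map[string]any)
	for k, v := range cfg {
		agent[k] = v
	}
	for _, k := range otelKeys {
		if v, ok := agent[k]; ok {
			otel[k] = v
			delete(agent, k)
		}
	}
	return agent, otel
}

func main() {
	cfg := map[string]any{
		"inputs":     []any{},
		"receivers":  map[string]any{"otlp": nil},
		"connectors": map[string]any{"forward": nil},
	}
	agent, otel := splitOTel(cfg)
	fmt.Println("agent keys:", len(agent), "otel keys:", len(otel))
}
```

Keeping the key list in one place means a key that was forgotten, as `connectors` was here, only has to be added once.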
// Unpack unpacks a struct to Config.
func (c *Config) Unpack(to interface{}, opts ...interface{}) error {
// UnpackTo unpacks this config into to with the given options.
func (c *Config) UnpackTo(to interface{}, opts ...interface{}) error {
As said, the PR would be simpler without this one. If we could extract this into a refactoring PR, we could get reviews for this one faster.
Yeah, I probably should have split it out. Given that you already gave a +1, I would prefer to leave it as is. Splitting that out is going to take more work than just getting this merged.
return p.uri
}

func (p *Provider) replaceCanceller(replace context.CancelFunc) {
this looks wild
+1, why is this necessary? If there is no way to get rid of it, please put the explanation in the code as comments.
httpsprovider.NewFactory(),
},
ConverterFactories: []confmap.ConverterFactory{
expandconverter.NewFactory(),
👍 for dropping, we should pay more attention to upstream changelog
c212a8e to e787a19
@michalpristas For some reason GitHub is not saying your approval counted. Could you possibly approve again to see if that lets me have an approval?
You need more than @michalpristas's approval.
@pierrehilbert That is fine, but for some reason even @michalpristas is not showing as approved to GitHub.
I tried this locally and the first thing I noticed is that the format of the JSON output by In isolation in the context of this PR, this isn't a problem. It will be a problem for Beats receivers though: unless Fleet knows how to parse the new format, we are going to lose the input health reporting features. So we either have to:
I think I'd prefer 2, since the point of the Beats receivers project is to make the agent transparently use the OTel collector. Additionally, this has a better chance of getting the state reporting that happens directly in Filebeat inputs working; see elastic/beats#39209 for where this was added. I'll fork this into a separate tracking issue unless we can resolve it in this PR.
{
"info": {
"id": "8c884936-b6d7-4ff7-af1f-c15da3752384",
"version": "9.0.0",
"commit": "e787a19a6a6cb0f570cdf4f16c8839e20b0feb23",
"build_time": "2024-11-18 19:12:02 +0000 UTC",
"snapshot": true,
"pid": 22023,
"unprivileged": false,
"is_managed": false
},
"state": 2,
"message": "Running",
"components": [],
"FleetState": 6,
"FleetMessage": "Not enrolled into Fleet",
"collector": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483851-05:00",
"components": {
"extensions": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.482935-05:00",
"components": {
"extension:health_check": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.482525-05:00"
},
"extension:pprof": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.482935-05:00"
}
}
},
"pipeline:logs": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483683-05:00",
"components": {
"exporter:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483204-05:00"
},
"processor:batch": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483252-05:00"
},
"receiver:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483683-05:00"
}
}
},
"pipeline:metrics": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483851-05:00",
"components": {
"exporter:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483369-05:00"
},
"processor:batch": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483702-05:00"
},
"receiver:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483851-05:00"
}
}
},
"pipeline:traces": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483829-05:00",
"components": {
"exporter:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.4838-05:00"
},
"processor:batch": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483812-05:00"
},
"receiver:otlp": {
"status": 2,
"timestamp": "2024-11-18T14:52:56.483829-05:00"
}
}
}
}
}
}
I enrolled an agent running a collector configuration into Fleet, which was not interesting because the collector configuration was immediately replaced with the standard agent configuration from Fleet.
Looking in the diagnostics I see the otel configuration is separated out, and pre-config.yaml and computed-config.yaml are both mostly empty:
cat diag/computed-config.yaml
host:
  id: 48DA13D6-B83B-5C71-A4F3-494E674F9F37
path:
  config: /Library/Elastic/Agent-Development
  data: /Library/Elastic/Agent-Development/data
  home: /Library/Elastic/Agent-Development/data/elastic-agent-9.0.0-SNAPSHOT-e787a1
  logs: /Library/Elastic/Agent-Development
runtime:
  arch: arm64
  native_arch: arm64
  os: darwin
  osinfo:
    family: darwin
    major: 14
    minor: 7
    patch: 1
    type: macos
    version: 14.7.1
What would we expect computed-config to look like if we used variables in the otel configuration? Is that even supported here? If it is, how would we debug it? I also see
@@ -280,6 +287,36 @@ func (f *FleetGateway) convertToCheckinComponents(components []runtime.Component
checkinComponents = append(checkinComponents, checkinComponent)
}

// OTel status is placed as a component for each top-level component in OTel
// and each subcomponent is a unit.
if otelStatus != nil {
Can we do this in the status output as well? For the most part elastic-agent status --output=json mirrors what we would send to Fleet and right now it doesn't.
I am also worried that the mapping of inputs back to integrations might be broken by this remapping, thus breaking the health reporting UIs in Fleet. We'd probably need to let fleet send otel configurations. Perhaps we can do this with the override API to test this out since they are separate sections of the UI.
I don't understand how this would be broken in Fleet. This sends it in exactly the same format that Fleet understands today, so it works transparently without any changes to Fleet. That is what we discussed and what was implemented here.
Fleet won't reject it, I am worried about how Fleet maps inputs back to integrations, since Fleet is integration focused and agent is only inputs.
We would need to test an integration implemented with Beats receivers and make sure the status reporting still works the same way.
I didn't know there was a link between those two. I was under the impression that the UI just rendered the content of the components field that is shipped to Fleet. If it does some referencing between the two (which I don't think it does), then it would be better to just let Fleet understand the component field.
Since this is experimental, that is something that can easily change in the future.
I think it looks into the meta information in the inputs, possibly it only needs the input ID to be what was in the agent policy for the component and it could figure it out.
I agree this isn't blocking and is something we could check easily with a test.
I don't love the elastic-agent status --output=json format diverging from what gets sent to Fleet. I have used the fact that it is in the checkin format a few times for debugging, but I won't block the PR on it.
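The mapping discussed in this thread (each top-level OTel entry becomes a component, each of its children becomes a unit) can be sketched as below. The `component` and `unit` types are simplified, hypothetical stand-ins, not the real check-in structures, and `toComponents` is not the function from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// statusNode mirrors the nested collector status from the status output.
type statusNode struct {
	Status     int
	Components map[string]*statusNode
}

// unit and component are simplified, hypothetical stand-ins for the
// check-in types; the real structures carry more fields.
type unit struct {
	ID     string
	Status int
}

type component struct {
	ID     string
	Status int
	Units  []unit
}

// toComponents turns each top-level OTel entry into a component and each of
// its children into a unit. Results are sorted by ID so output is stable.
func toComponents(collector *statusNode) []component {
	if collector == nil {
		return nil
	}
	var out []component
	for id, top := range collector.Components {
		if top == nil {
			continue
		}
		c := component{ID: id, Status: top.Status}
		for uid, sub := range top.Components {
			if sub == nil {
				continue
			}
			c.Units = append(c.Units, unit{ID: uid, Status: sub.Status})
		}
		sort.Slice(c.Units, func(i, j int) bool { return c.Units[i].ID < c.Units[j].ID })
		out = append(out, c)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	collector := &statusNode{Status: 2, Components: map[string]*statusNode{
		"extensions": {Status: 2, Components: map[string]*statusNode{
			"extension:health_check": {Status: 2},
		}},
		"pipeline:logs": {Status: 2, Components: map[string]*statusNode{
			"receiver:otlp": {Status: 2},
			"exporter:otlp": {Status: 2},
		}},
	}}
	for _, c := range toComponents(collector) {
		fmt.Printf("%s status=%d units=%d\n", c.ID, c.Status, len(c.Units))
	}
}
```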
// prevent aggregate statuses from flapping between StatusStarting and StatusOK
// as components are started individually by the service.
Could we just not do this filtering? Was it just annoying or causing an actual problem?
It causes the reporting to not really be in sync and looks very strange. This is also following the same pattern of the healthcheckv2extension.
Ah, can you link to that code directly as a comment?
Done.
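To illustrate why the filtering matters, here is one possible way to avoid flapping: record "starting" events immediately, but hold back every other event until the whole service has reported ready. This is a simplified sketch with hypothetical types and string statuses. It does not use the collector's status API, and it is an assumption that the PR or healthcheckv2extension works exactly this way.

```go
package main

import "fmt"

// event is a hypothetical status event; statuses are "starting" or "ok".
type event struct {
	component string
	status    string
}

// recorder holds back non-starting events until markReady is called.
type recorder struct {
	ready  bool
	queue  []event
	latest map[string]string
}

func newRecorder() *recorder {
	return &recorder{latest: map[string]string{}}
}

// record applies "starting" events right away and queues everything else
// until the recorder has been marked ready.
func (r *recorder) record(e event) {
	if !r.ready && e.status != "starting" {
		r.queue = append(r.queue, e)
		return
	}
	r.latest[e.component] = e.status
}

// markReady flushes the queued events in arrival order.
func (r *recorder) markReady() {
	r.ready = true
	for _, e := range r.queue {
		r.latest[e.component] = e.status
	}
	r.queue = nil
}

// aggregate reports "starting" while any known component is starting.
func (r *recorder) aggregate() string {
	for _, s := range r.latest {
		if s == "starting" {
			return "starting"
		}
	}
	return "ok"
}

func main() {
	r := newRecorder()
	r.record(event{"receiver:otlp", "starting"})
	r.record(event{"receiver:otlp", "ok"}) // held back until ready
	fmt.Println(r.aggregate())
	r.record(event{"exporter:otlp", "starting"})
	r.record(event{"exporter:otlp", "ok"}) // held back until ready
	fmt.Println(r.aggregate())
	r.markReady()
	fmt.Println(r.aggregate())
}
```

Without the queue, the aggregate in this example would briefly report "ok" after the first component finished and then fall back to "starting" when the second one began.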
name string
}

func (fe *forceExtension) Convert(ctx context.Context, conf *confmap.Conf) error {
What is this doing and why is it necessary? Please add the explanation as a comment in the code directly.
I will add explanation in the code.
Done.
Can we start some hybrid agent documentation in https://github.com/elastic/elastic-agent/tree/main/docs? Probably just an example configuration is enough to start, and then when the time comes we can give that to our tech writers to produce official documentation.
It is only different as at the top-level there is a new field called
As you see later in your code review this is not sent to Fleet this way. The Fleet information is not in any way decoded from JSON marshalling.
Nope, this is already done in this PR.
We could do that, but I don't think that is a good idea long term. The reason I made a new
They should 100% only be reporting status using otel. The goal is to one day only run the collector, they should add status reporting through the collector.
I don't see an issue. Adding new fields is not considered breaking; APIs add new fields all the time.
Variables are not supported in otel configuration. That would be extending otel outside of the Elastic Agent which I don't think is a good idea. You will notice in this PR that otel configuration is extracted as early as possible. This is done to ensure that the Elastic Agent in no way affects the otel configuration.
Good catch. I will add that now.
Yes I can add that information.
Long term yes, my actual goal is to discover the point where we need to commit engineering time from the UI team to rewriting or updating the input and component health implementation to account for the collector health status. I don't think just Beats receivers is that point, it is probably the point at which there are integrations with OTel native configurations. That said, we could plan UI work to account for the status changes for Beats receivers, but it will block us shipping them.
Even short term they should be reporting the status through otel, because once something is running under the otel collector that is the only way we can get status information. They will not be connected over the control protocol and they will be in-process, so the only way to do that is to report the status through otel. If you mean transparently taking the otel status and translating it to a component status to report to Fleet, then that is something different and could be done. But that still requires the component running under the collector to be reporting its status through I think this highly depends on whether there is really a link between what a component status reports and an integration in Fleet. I was under the impression that in the Fleet UI the only place it is used is the status page for the Agent, and there is no connection between the two. The only connection I know of is with Endpoint, and that won't be changing in the short term.
Yes the interface to the Fleet and the eventual content of the .fleet-agents datastream there is all I care about. It's our API contract with the UI team. Everything happening inside the agent as far as getting status from the collector LGTM.
reapproving, probably force push did the magic
I need an approval from @elastic/ingest-eng-prod for the GitHub lint action change. It is required for lint to pass on Windows, as some Windows-only module uses new Go 1.22 features. We are using an older version of the golint action that used only Go 1.21 for lint; it needs to understand 1.22.
Reviewing .github/workflows/golangci-lint.yml
LGTM
(cherry picked from commit b07566b) # Conflicts: # NOTICE.txt # go.mod # go.sum
What does this PR do?
Adds the ability to run the EDOT alongside the running Elastic Agent.
This connects the EDOT into the coordinator of the Elastic Agent. If any of these top-level keys (receivers, processors, exporters, extensions, service) exists in the configuration or policy for the elastic-agent, the EDOT is started. If all of those keys are removed from the configuration or policy, the EDOT is automatically stopped. If any configuration change occurs, the updated configuration is passed along to the EDOT to handle.
Why is it important?
This allows EDOT configuration to exist inside the configuration or policy and work as expected.
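The start, update, and stop behaviour described under "What does this PR do?" can be sketched as a small decision function. This is an illustration only: `decide`, `hasOTel`, and the `action` type are hypothetical names, and the key list is the one given in the description.

```go
package main

import "fmt"

// action is a hypothetical label for what the coordinator would do next.
type action string

const (
	actionNone   action = "none"
	actionStart  action = "start"
	actionUpdate action = "update"
	actionStop   action = "stop"
)

// otelKeys are the top-level keys named in the PR description.
var otelKeys = []string{"receivers", "processors", "exporters", "extensions", "service"}

// hasOTel reports whether cfg contains at least one OTel top-level key.
func hasOTel(cfg map[string]any) bool {
	for _, k := range otelKeys {
		if _, ok := cfg[k]; ok {
			return true
		}
	}
	return false
}

// decide picks the next step from whether the collector is running and
// whether the new configuration contains any OTel key.
func decide(running bool, cfg map[string]any) action {
	present := hasOTel(cfg)
	switch {
	case present && !running:
		return actionStart
	case present && running:
		return actionUpdate // the updated configuration is handed to the collector
	case !present && running:
		return actionStop
	default:
		return actionNone
	}
}

func main() {
	withOTel := map[string]any{"receivers": map[string]any{"otlp": nil}}
	withoutOTel := map[string]any{"inputs": []any{}}

	fmt.Println(decide(false, withOTel))
	fmt.Println(decide(true, withOTel))
	fmt.Println(decide(true, withoutOTel))
	fmt.Println(decide(false, withoutOTel))
}
```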
Checklist
[ ] I have made corresponding changes to the documentation
./changelog/fragments using the changelog tool
Disruptive User Impact
This is an addition and doesn't affect the way the current Elastic Agent runs at all.
How to test this PR locally
Place OTel configuration into the elastic-agent.yml:
Run elastic-agent run -e.
Related issues
Closes #5796