Add support for running EDOT inside of running Elastic Agent #5767

Merged
merged 6 commits into from
Nov 20, 2024

Conversation

Contributor

@blakerouse blakerouse commented Oct 11, 2024

What does this PR do?

Adds the ability to run EDOT alongside the running Elastic Agent.

This connects EDOT to the coordinator of the Elastic Agent. If any of the top-level keys (receivers, processors, exporters, extensions, service) exists in the configuration or policy for the elastic-agent, EDOT is started. If all of those keys are removed from the configuration or policy, EDOT is automatically stopped. On any configuration change, the updated configuration is passed along to EDOT to handle.
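The start/stop rule described above can be sketched as follows. This is a minimal illustration only, not the coordinator code from this PR: `hasOTelConfig` and `nextAction` are hypothetical helpers, and the configuration is assumed to be a plain map.

```go
package main

import "fmt"

// otelKeys are the top-level keys named in the PR description.
var otelKeys = []string{"receivers", "processors", "exporters", "extensions", "service"}

// hasOTelConfig reports whether any OTel top-level key is present.
// Hypothetical helper for illustration.
func hasOTelConfig(cfg map[string]interface{}) bool {
	for _, k := range otelKeys {
		if _, ok := cfg[k]; ok {
			return true
		}
	}
	return false
}

// nextAction decides what to do with the collector given whether it is
// currently running and the latest configuration. Hypothetical helper.
func nextAction(running bool, cfg map[string]interface{}) string {
	switch present := hasOTelConfig(cfg); {
	case present && !running:
		return "start"
	case present && running:
		return "update" // pass the new configuration along
	case !present && running:
		return "stop"
	default:
		return "none"
	}
}

func main() {
	withOTel := map[string]interface{}{"receivers": map[string]interface{}{"otlp": nil}}
	agentOnly := map[string]interface{}{"inputs": []interface{}{}}

	fmt.Println(nextAction(false, withOTel)) // start
	fmt.Println(nextAction(true, withOTel))  // update
	fmt.Println(nextAction(true, agentOnly)) // stop
}
```

Checking only for key presence keeps the decision independent of what the keys contain, which matches the description that the collector itself handles the configuration.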

Why is it important?

This allows EDOT configuration to live inside the Elastic Agent configuration or policy and work as expected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This is an addition and doesn't affect the way the current Elastic Agent runs at all.

How to test this PR locally

Place OTel configuration into the elastic-agent.yml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:

exporters:
  otlp:
    endpoint: otelcol:4317

extensions:
  health_check:
  pprof:

service:
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Run elastic-agent run -e.

Related issues

Closes #5796

Contributor

mergify bot commented Oct 11, 2024

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Oct 11, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 11, 2024
Contributor

mergify bot commented Oct 15, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b otel-hybrid upstream/otel-hybrid
git merge upstream/main
git push upstream otel-hybrid

@blakerouse blakerouse marked this pull request as ready for review November 12, 2024 17:52
@blakerouse blakerouse requested review from a team as code owners November 12, 2024 17:52
@blakerouse
Contributor Author

This is ready for review. Don't be too scared by the size of this change; most of it comes from NOTICE.txt, the generated control protocol, and the rename of Unpack to UnpackTo.

To answer the inevitable question of why Unpack was renamed to UnpackTo: the Unpacker interface used by go-ucfg has a function named Unpack, so the names collide. The rename was required for the configuration to work correctly.
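One plausible reading of the collision is sketched below. Go has no method overloading, so a type that must expose an `Unpack` hook for the configuration library cannot also keep a differently shaped helper called `Unpack`. The `unpacker` interface and the toy `Config` type here are assumptions made for illustration; they are not the go-ucfg or Elastic Agent definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// unpacker is a local stand-in for the hook interface described above.
// Its exact shape is an assumption made for this sketch.
type unpacker interface {
	Unpack(from interface{}) error
}

// Config is a toy configuration wrapper, not the Elastic Agent type.
type Config struct {
	values map[string]interface{}
}

// Unpack satisfies the hook: it fills the Config from a raw value.
func (c *Config) Unpack(from interface{}) error {
	m, ok := from.(map[string]interface{})
	if !ok {
		return fmt.Errorf("expected a map, got %T", from)
	}
	c.values = m
	return nil
}

// UnpackTo copies the Config's values into the caller's map. It could not
// also be named Unpack: one type cannot declare two methods with the
// same name.
func (c *Config) UnpackTo(to map[string]interface{}) error {
	if to == nil {
		return errors.New("destination map is nil")
	}
	for k, v := range c.values {
		to[k] = v
	}
	return nil
}

// Compile-time check that *Config implements the hook.
var _ unpacker = (*Config)(nil)

func main() {
	c := &Config{}
	if err := c.Unpack(map[string]interface{}{"id": "agent-1"}); err != nil {
		panic(err)
	}
	out := map[string]interface{}{}
	if err := c.UnpackTo(out); err != nil {
		panic(err)
	}
	fmt.Println(out["id"]) // agent-1
}
```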

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Nov 13, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert pierrehilbert added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Nov 13, 2024
Contributor

@michalpristas michalpristas left a comment

I think this looks good; the mentioned change makes this PR a bit uglier than it needs to be.

I imagined our first take on the hybrid approach pretty much like this. I have no major things to point out besides the missing keyword and verifying that Windows works.

Tested this locally, seems fine. Status is reported nicely at the component level.

@@ -2,8 +2,6 @@
// or more contributor license agreements. Licensed under the Elastic License 2.0;
// you may not use this file except in compliance with the Elastic License 2.0.

//go:build !windows
Contributor

@michalpristas michalpristas Nov 18, 2024

Have you verified that Agent installed as a Windows service starts properly without any timing issues? It was a bit difficult to verify last time.

Contributor

@michalpristas michalpristas Nov 18, 2024

linking issue #4976 cc @leehinman

Contributor Author

I am not seeing those issues in this PR. This PR is running on Windows with the integration testing framework and all of those are passing just fine. Those would not be passing if this was not working correctly.

@@ -39,14 +51,20 @@ var DefaultOptions = []interface{}{
ucfg.VarExp,
VarSkipKeys("inputs"),
ucfg.IgnoreCommas,
OTelKeys("receivers", "processors", "exporters", "extensions", "service"),
Contributor

Contributor Author

Added! Thanks for the catch.

-// Unpack unpacks a struct to Config.
-func (c *Config) Unpack(to interface{}, opts ...interface{}) error {
+// UnpackTo unpacks this config into to with the given options.
+func (c *Config) UnpackTo(to interface{}, opts ...interface{}) error {
Contributor

As I said, the PR would be simpler without this change. If we could extract it into a refactoring PR, we could get reviews for this one faster.

Contributor Author

Yeah, I probably should have split it out. Since you already gave a +1, I would prefer to leave it as is. Splitting that out is going to take more work than just getting this merged.

return p.uri
}

func (p *Provider) replaceCanceller(replace context.CancelFunc) {
Contributor

this looks wild

Member

+1, why is this necessary? If there is no way to get rid of it, please put the explanation in the code as comments.

httpsprovider.NewFactory(),
},
ConverterFactories: []confmap.ConverterFactory{
expandconverter.NewFactory(),
Contributor

@michalpristas michalpristas Nov 18, 2024

👍 for dropping, we should pay more attention to upstream changelog

@blakerouse
Contributor Author

@michalpristas For some reason GitHub is not saying your approval counted. Could you approve again to see if that lets me have an approval?

@pierrehilbert
Contributor

You need more than @michalpristas's approval.
@MichaelKatsoulis / @gizas we will need your approval here please.

@blakerouse
Contributor Author

blakerouse commented Nov 18, 2024

@pierrehilbert That is fine, but for some reason even @michalpristas is not showing as approved to GitHub.

@cmacknz
Member

cmacknz commented Nov 18, 2024

I tried this locally and the first thing I noticed is that the format of the JSON output by elastic-agent status is different.

In isolation in the context of this PR, this isn't a problem. It will be a problem for Beats receivers though, as unless Fleet knows how to parse the new format we are going to lose the input health reporting features.

So we either have to:

  1. Have Fleet understand collector status reporting and have it integrate with the UI similar to what we have today.
  2. Have the agent translate the collector health into the existing components based health reporting structure.

I think I'd prefer 2 since the point of the Beats receivers project is to make the agent transparently use the OTel collector. Additionally, this has a better chance of getting the state reporting that is happening directly in Filebeat inputs working; see elastic/beats#39209 for where this was added.

I'll fork this into a separate tracking issue unless we can resolve it in this PR.

sudo elastic-development-agent status --output=json

{
    "info": {
        "id": "8c884936-b6d7-4ff7-af1f-c15da3752384",
        "version": "9.0.0",
        "commit": "e787a19a6a6cb0f570cdf4f16c8839e20b0feb23",
        "build_time": "2024-11-18 19:12:02 +0000 UTC",
        "snapshot": true,
        "pid": 22023,
        "unprivileged": false,
        "is_managed": false
    },
    "state": 2,
    "message": "Running",
    "components": [],
    "FleetState": 6,
    "FleetMessage": "Not enrolled into Fleet",
    "collector": {
        "status": 2,
        "timestamp": "2024-11-18T14:52:56.483851-05:00",
        "components": {
            "extensions": {
                "status": 2,
                "timestamp": "2024-11-18T14:52:56.482935-05:00",
                "components": {
                    "extension:health_check": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.482525-05:00"
                    },
                    "extension:pprof": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.482935-05:00"
                    }
                }
            },
            "pipeline:logs": {
                "status": 2,
                "timestamp": "2024-11-18T14:52:56.483683-05:00",
                "components": {
                    "exporter:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483204-05:00"
                    },
                    "processor:batch": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483252-05:00"
                    },
                    "receiver:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483683-05:00"
                    }
                }
            },
            "pipeline:metrics": {
                "status": 2,
                "timestamp": "2024-11-18T14:52:56.483851-05:00",
                "components": {
                    "exporter:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483369-05:00"
                    },
                    "processor:batch": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483702-05:00"
                    },
                    "receiver:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483851-05:00"
                    }
                }
            },
            "pipeline:traces": {
                "status": 2,
                "timestamp": "2024-11-18T14:52:56.483829-05:00",
                "components": {
                    "exporter:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.4838-05:00"
                    },
                    "processor:batch": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483812-05:00"
                    },
                    "receiver:otlp": {
                        "status": 2,
                        "timestamp": "2024-11-18T14:52:56.483829-05:00"
                    }
                }
            }
        }
    }
}

I enrolled an agent running a collector configuration into Fleet, which was not interesting because the collector configuration was immediately replaced with the standard agent configuration from Fleet.

@cmacknz
Member

cmacknz commented Nov 18, 2024

Looking in the diagnostics I see the otel configuration is separated out, and pre-config.yaml and computed-config.yaml are both mostly empty:

cat diag/computed-config.yaml
host:
    id: 48DA13D6-B83B-5C71-A4F3-494E674F9F37
path:
    config: /Library/Elastic/Agent-Development
    data: /Library/Elastic/Agent-Development/data
    home: /Library/Elastic/Agent-Development/data/elastic-agent-9.0.0-SNAPSHOT-e787a1
    logs: /Library/Elastic/Agent-Development
runtime:
    arch: arm64
    native_arch: arm64
    os: darwin
    osinfo:
        family: darwin
        major: 14
        minor: 7
        patch: 1
        type: macos
        version: 14.7.1

What would we expect computed-config to look like if we used variables in the otel configuration? Is that even supported here? If it is, how would we debug it?

I also see state.yaml is missing all of the collector status. This we need to address somehow for sure, the collector runtime state needs to be in diagnostics:

cat diag/state.yaml
components: []
fleet_message: Not enrolled into Fleet
fleet_state: 6
log_level: info
message: Running
state: 2

@@ -280,6 +287,36 @@ func (f *FleetGateway) convertToCheckinComponents(components []runtime.Component
checkinComponents = append(checkinComponents, checkinComponent)
}

// OTel status is placed as a component for each top-level component in OTel
// and each subcomponent is a unit.
if otelStatus != nil {
Member

Can we do this in the status output as well? For the most part elastic-agent status --output=json mirrors what we would send to Fleet and right now it doesn't.

I am also worried that the mapping of inputs back to integrations might be broken by this remapping, thus breaking the health reporting UIs in Fleet. We'd probably need to let fleet send otel configurations. Perhaps we can do this with the override API to test this out since they are separate sections of the UI.

Contributor Author

I don't understand how this would be broken in Fleet. This sends it in exactly the same format that Fleet understands today, so it just works transparently without any changes to Fleet. That is what we discussed and what was implemented here.
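The mapping described in the code comment under review (each top-level OTel entry becomes a component, each of its children becomes a unit) can be sketched as below. The `status`, `component`, and `unit` types are simplified stand-ins invented for this example; they are not the checkin types used by the Fleet gateway.

```go
package main

import (
	"fmt"
	"sort"
)

// status is a simplified stand-in for the collector's nested status tree.
type status struct {
	State    string
	Children map[string]status
}

// unit and component are simplified stand-ins for the existing
// component-based reporting structure.
type unit struct {
	ID    string
	State string
}

type component struct {
	ID    string
	State string
	Units []unit
}

// toComponents turns each top-level entry into a component and each of
// its direct children into a unit of that component.
func toComponents(root status) []component {
	comps := make([]component, 0, len(root.Children))
	for id, top := range root.Children {
		c := component{ID: id, State: top.State}
		for uid, sub := range top.Children {
			c.Units = append(c.Units, unit{ID: uid, State: sub.State})
		}
		sort.Slice(c.Units, func(i, j int) bool { return c.Units[i].ID < c.Units[j].ID })
		comps = append(comps, c)
	}
	sort.Slice(comps, func(i, j int) bool { return comps[i].ID < comps[j].ID })
	return comps
}

func main() {
	root := status{State: "ok", Children: map[string]status{
		"pipeline:logs": {State: "ok", Children: map[string]status{
			"receiver:otlp": {State: "ok"},
			"exporter:otlp": {State: "ok"},
		}},
		"extensions": {State: "ok", Children: map[string]status{
			"extension:pprof": {State: "ok"},
		}},
	}}
	for _, c := range toComponents(root) {
		fmt.Println(c.ID, c.State, len(c.Units))
	}
}
```

Sorting both levels makes the output deterministic, since Go map iteration order is not.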

Member

Fleet won't reject it, I am worried about how Fleet maps inputs back to integrations, since Fleet is integration focused and agent is only inputs.

We would need to test an integration implemented with Beats receivers and make sure the status reporting still works the same way.

Contributor Author

I didn't know there was a link between those two. I was under the impression that the UI simply rendered the content of the components field that is shipped to Fleet. If it does cross-reference the two (which I don't think it does), then it would be better to let Fleet understand the component field.

Since this is experimental, that is something that can easily change in the future.

Member

I think it looks into the meta information in the inputs, possibly it only needs the input ID to be what was in the agent policy for the component and it could figure it out.

I agree this isn't blocking and is something we could check easily with a test.

I don't love the elastic-agent status --output=json format diverging from what gets sent to Fleet. I have used the fact that it is in the checkin format a few times for debugging, but I won't block the PR on it.

Comment on lines +130 to +131
// prevent aggregate statuses from flapping between StatusStarting and StatusOK
// as components are started individually by the service.
Member

Could we just not do this filtering? Was it just annoying or causing an actual problem?

Contributor Author

It causes the reporting to not really be in sync and looks very strange. This is also following the same pattern of the healthcheckv2extension.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckv2extension/extension.go#L168
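The general idea behind this kind of filtering, as far as the quoted code comment describes it, is to apply "starting" events right away and hold everything else until the whole pipeline is ready, so the aggregate does not bounce between starting and OK while components come up one by one. The sketch below illustrates that idea with hypothetical `event` and `gate` types; it is not the code from this PR or from the linked extension.

```go
package main

import "fmt"

// event is a simplified status change for one collector component.
type event struct {
	Source string
	Status string // e.g. "starting", "ok"
}

// gate applies "starting" events immediately and queues all other
// events until Ready is called.
type gate struct {
	ready   bool
	queued  []event
	applied []event
}

// Handle either applies the event or queues it, depending on readiness.
func (g *gate) Handle(e event) {
	if g.ready || e.Status == "starting" {
		g.applied = append(g.applied, e)
		return
	}
	g.queued = append(g.queued, e)
}

// Ready marks the pipeline as fully started and flushes queued events
// in the order they arrived.
func (g *gate) Ready() {
	g.ready = true
	g.applied = append(g.applied, g.queued...)
	g.queued = nil
}

func main() {
	g := &gate{}
	g.Handle(event{Source: "receiver:otlp", Status: "starting"})
	g.Handle(event{Source: "receiver:otlp", Status: "ok"}) // held back
	g.Handle(event{Source: "exporter:otlp", Status: "starting"})
	fmt.Println(len(g.applied), len(g.queued)) // 2 1

	g.Ready()
	fmt.Println(len(g.applied), len(g.queued)) // 3 0
}
```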

Member

@cmacknz cmacknz Nov 18, 2024

Ah, can you link to that code directly as a comment?

Contributor Author

Done.

name string
}

func (fe *forceExtension) Convert(ctx context.Context, conf *confmap.Conf) error {
Member

What is this doing and why is it necessary? Please add the explanation as a comment in the code directly.

Contributor Author

I will add an explanation in the code.

Contributor Author

Done.

@cmacknz
Member

cmacknz commented Nov 18, 2024

Can we start some hybrid agent documentation in https://github.com/elastic/elastic-agent/tree/main/docs? Probably just an example configuration is enough to start, and then when the time comes we can give that to our tech writers to produce official documentation.

@blakerouse
Contributor Author

> I tried this locally and the first thing I noticed is that the format of the JSON output by elastic-agent status is different.

It is only different in that there is a new top-level field called collector that provides all the status information about the collector.

> In isolation in the context of this PR, this isn't a problem. It will be a problem for Beats receivers though, as unless Fleet knows how to parse the new format we are going to lose the input health reporting features.

As you see later in your code review, this is not sent to Fleet this way. The Fleet information is not in any way decoded from the JSON marshalling.

> So we either have to:
>
>   1. Have Fleet understand collector status reporting and have it integrate with the UI similar to what we have today.

Nope, this is already done in this PR.

>   2. Have the agent translate the collector health into the existing components based health reporting structure.

We could do that, but I don't think that is a good idea long term. The reason I made a new collector top-level key with the OTel status information is that in the long run we will drop the other format and switch to the new format.

> I think I'd prefer 2 since the point of the Beats receivers project is to make the agent transparently use the OTel collector. Additionally, this has a better chance of getting the state reporting that is happening directly in Filebeat inputs working; see elastic/beats#39209 for where this was added.

They should 100% only be reporting status using otel. The goal is to one day only run the collector, so they should add status reporting through the collector.

> I'll fork this into a separate tracking issue unless we can resolve it in this PR.
>
> sudo elastic-development-agent status --output=json

I don't see an issue. Adding new fields is not considered breaking. APIs add new fields all the time.

@blakerouse
Contributor Author

> Looking in the diagnostics I see the otel configuration is separated out, and pre-config.yaml and computed-config.yaml are both mostly empty:
>
> cat diag/computed-config.yaml
> host:
>     id: 48DA13D6-B83B-5C71-A4F3-494E674F9F37
> path:
>     config: /Library/Elastic/Agent-Development
>     data: /Library/Elastic/Agent-Development/data
>     home: /Library/Elastic/Agent-Development/data/elastic-agent-9.0.0-SNAPSHOT-e787a1
>     logs: /Library/Elastic/Agent-Development
> runtime:
>     arch: arm64
>     native_arch: arm64
>     os: darwin
>     osinfo:
>         family: darwin
>         major: 14
>         minor: 7
>         patch: 1
>         type: macos
>         version: 14.7.1
>
> What would we expect computed-config to look like if we used variables in the otel configuration? Is that even supported here? If it is, how would we debug it?

Variables are not supported in otel configuration. That would be extending otel outside of the Elastic Agent which I don't think is a good idea. You will notice in this PR that otel configuration is extracted as early as possible. This is done to ensure that the Elastic Agent in no way affects the otel configuration.
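To illustrate what "extracted as early as possible" could look like, here is a minimal sketch that splits a parsed configuration into its OTel part and the remainder before any other processing touches it. `splitOTel` and `otelTopLevel` are hypothetical names, and the configuration is assumed to be a plain map; the actual extraction in this PR happens inside the agent's config package.

```go
package main

import "fmt"

// otelTopLevel lists the top-level keys treated as OTel configuration,
// taken from the PR description.
var otelTopLevel = map[string]bool{
	"receivers":  true,
	"processors": true,
	"exporters":  true,
	"extensions": true,
	"service":    true,
}

// splitOTel separates OTel keys from everything else. The OTel part is
// returned untouched so that later agent processing (for example
// variable substitution) never sees it. Hypothetical helper.
func splitOTel(cfg map[string]interface{}) (otel, rest map[string]interface{}) {
	otel = map[string]interface{}{}
	rest = map[string]interface{}{}
	for k, v := range cfg {
		if otelTopLevel[k] {
			otel[k] = v
		} else {
			rest[k] = v
		}
	}
	return otel, rest
}

func main() {
	cfg := map[string]interface{}{
		"receivers": map[string]interface{}{"otlp": nil},
		"service":   map[string]interface{}{"pipelines": nil},
		"inputs":    []interface{}{},
	}
	otel, rest := splitOTel(cfg)
	fmt.Println(len(otel), len(rest)) // 2 1
}
```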

> I also see state.yaml is missing all of the collector status. This we need to address somehow for sure, the collector runtime state needs to be in diagnostics:
>
> cat diag/state.yaml
> components: []
> fleet_message: Not enrolled into Fleet
> fleet_state: 6
> log_level: info
> message: Running
> state: 2

Good catch. I will add that now.

@blakerouse
Contributor Author

> Can we start some hybrid agent documentation in https://github.com/elastic/elastic-agent/tree/main/docs? Probably just an example configuration is enough to start, and then when the time comes we can give that to our tech writers to produce official documentation.

Yes I can add that information.

@cmacknz
Member

cmacknz commented Nov 18, 2024

> They should 100% only be reporting status using otel. The goal is to one day only run the collector, they should add status reporting through the collector.

Long term yes, my actual goal is to discover the point where we need to commit engineering time from the UI team to rewriting or updating the input and component health implementation to account for the collector health status.

I don't think just Beats receivers is that point, it is probably the point at which there are integrations with OTel native configurations. That said, we could plan UI work to account for the status changes for Beats receivers, but it will block us shipping them.

@blakerouse
Contributor Author

> > They should 100% only be reporting status using otel. The goal is to one day only run the collector, they should add status reporting through the collector.
>
> Long term yes, my actual goal is to discover the point where we need to commit engineering time from the UI team to rewriting or updating the input and component health implementation to account for the collector health status.
>
> I don't think just Beats receivers is that point, it is probably the point at which there are integrations with OTel native configurations. That said, we could plan UI work to account for the status changes for Beats receivers, but it will block us shipping them.

Even short term they should be reporting status through otel, because once something is running under the otel collector that is the only way we can get status information. They will not be connected over the control protocol, and because they will be in-process, the only way to do that is to report status through otel.

If you mean transparently taking the otel status and translating it to a component status to report to Fleet, then that is something different and could be done. But that still requires the component running under the collector to report its status through componentstatus.

I think this highly depends on whether there is really a link between what a component status reports and an integration in Fleet. I was under the impression that the only place in the Fleet UI where it is used is the status page for the Agent, and that there is no connection between the two. The only connection I know of is with Endpoint, and that won't be changing in the short term.

@cmacknz
Member

cmacknz commented Nov 18, 2024

> If you mean transparently taking the otel status and translating it to a component status to report to Fleet, then that is something different and could be done.

Yes, the interface to Fleet and the eventual content of the .fleet-agents datastream there is all I care about. It's our API contract with the UI team.

Everything happening inside the agent as far as getting status from the collector LGTM.

Contributor

@michalpristas michalpristas left a comment

reapproving, probably force push did the magic

@blakerouse
Contributor Author

I need an approval from @elastic/ingest-eng-prod for the github lint action change.

It is required for lint to pass on Windows, as a Windows-only module uses new Go 1.22 features. We are using an older version of the golint action that only used Go 1.21 for linting; it needs to understand 1.22.

Contributor

@alexsapran alexsapran left a comment

Reviewing .github/workflows/golangci-lint.yml LGTM

@blakerouse blakerouse merged commit b07566b into elastic:main Nov 20, 2024
14 checks passed
mergify bot pushed a commit that referenced this pull request Nov 20, 2024
(cherry picked from commit b07566b)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
blakerouse added a commit that referenced this pull request Nov 27, 2024
…Elastic Agent (#6096)

* Add support for running EDOT inside of running Elastic Agent (#5767)

(cherry picked from commit b07566b)

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Labels
backport-8.x Automated backport to the 8.x branch with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

[EDOT Collector] Allow Elastic Agent to run in hybrid mode
7 participants