
[OTE-1506] Loadbalancer exporter: Add resource_keys routing for the traces #14208

Merged: 1 commit into main on Oct 24, 2024

Conversation


@foadnh foadnh commented Oct 23, 2024

Ticket

Context

Resource-based load balancing is required for Canva reliability metrics.
Currently, we maintain a copy of the load balancer in the otel-platform repo. That copy is based on an old version (v0.79.0) and has never been updated from upstream.
The upstream version has improvements and new features that can potentially resolve our public gateway migration issues. See this thread.

In this PR, we add resource_keys-based load balancing to the opentelemetry-collector-contrib fork.

Notes

  • Why didn't we simply copy the implementation from the old copy? This PR takes a different approach to adding resource_keys-based load balancing. It is simpler and more aligned with the rest of the load balancer, which makes it easy to keep our fork in sync with upstream. It also lets us contribute the change upstream and eventually eliminate this fork.
  • Will upstream accept this new feature? Yes: there is an open request for a limited version of this feature, and the last time we tried to contribute it upstream, the plugin owner liked the feature but asked for a more robust implementation.
  • Is this branch ready for a PR to upstream? No, not yet:
    • We will follow the OTel Collector contribution guideline: raise an issue describing the feature requirements first, and raise a PR once that gets an LGTM.
    • To minimise changes on our side, this PR keeps our old config naming (routing_key: resource), which conflicts with features in newer versions.
    • To keep this PR small, we only added resource_keys-based load balancing to traces. We must add it to metrics and logs before raising it upstream.
    • This is based on v0.102.0, not the upstream master (which is preparing for v0.112.0). We need to rebase onto the upstream master to prevent conflicts.
    • We skipped the README.md and CHANGELOG.md updates.
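As a rough illustration of the idea behind resource_keys routing, the routing identifier is derived from configured resource attributes rather than the trace ID or service name. The sketch below is a self-contained simplification: the key names, the plain-map attribute representation, and the join format are assumptions for illustration, not the fork's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resourceKeys stands in for the exporter's hypothetical resource_keys
// config field; the attribute names here are illustrative.
var resourceKeys = []string{"service.name", "deployment.environment"}

// routingIdentifierFromResource builds a routing identifier by joining the
// values of the configured resource attribute keys in a stable (sorted) order.
// It reports false when any configured key is missing on the resource.
func routingIdentifierFromResource(attrs map[string]string) (string, bool) {
	keys := append([]string(nil), resourceKeys...)
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		v, ok := attrs[k]
		if !ok {
			return "", false
		}
		parts = append(parts, v)
	}
	return strings.Join(parts, "/"), true
}

func main() {
	id, ok := routingIdentifierFromResource(map[string]string{
		"service.name":           "gateway",
		"deployment.environment": "prod",
	})
	fmt.Println(id, ok)
}
```

Sorting the keys keeps the identifier stable regardless of the order the keys appear in the config, so the same resource always hashes to the same backend.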

@zjanc zjanc left a comment

LGTM! Really neat way to do it, and it seems the refactoring has already addressed a few issues we previously had, e.g. batching of traces.

I would note the possible large metric cardinality here:

		if err == nil {
			_ = stats.RecordWithTags(
				ctx,
				[]tag.Mutator{tag.Upsert(endpointTagKey, endpoints[exp]), successTrueMutator},
				mBackendLatency.M(duration.Milliseconds()))
		} else {
			_ = stats.RecordWithTags(
				ctx,
				[]tag.Mutator{tag.Upsert(endpointTagKey, endpoints[exp]), successFalseMutator},
				mBackendLatency.M(duration.Milliseconds()))
		}

This creates a metric series for each endpoint, on each pod. I think we already do this anyway, so it might be worth checking the existing cardinality and whether the increase is significant.

@foadnh foadnh (Author) commented Oct 23, 2024

> Since this creates a metric for each endpoint, on each pod [...] might be worth checking existing cardinality and if it's significant.

It shouldn't be considered high cardinality, I guess, because Datadog calculates cardinality over moving 1-hour windows, not absolute cardinality. So, worst case, it will be something like (sources) x (destinations) x (load balancers) x (during-deployment factor) = 200 x 200 x 3 x 2 = 240,000?
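The back-of-envelope estimate above can be checked directly; note the 200/200/3/2 figures are the comment's assumptions, not measured values.

```go
package main

import "fmt"

func main() {
	// Worst-case tag-value combinations: sources x destinations x load
	// balancers x a during-deployment factor, per the estimate above.
	sources, destinations, loadBalancers, deployFactor := 200, 200, 3, 2
	combos := sources * destinations * loadBalancers * deployFactor
	fmt.Println(combos) // 240000
}
```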

 	}
-	return ids, nil
+	if !missingResourceKey {
+		return ids, nil
+	}

Can we replace missingResourceKey and resourceKeyFound with len(ids) > 0? We should return the ids map if it is not empty, right?
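A sketch of the simplification being suggested, with illustrative names rather than the fork's actual code: the separate boolean bookkeeping collapses into an emptiness check on the map.

```go
package main

import "fmt"

// pickRoutingIDs returns ids when at least one identifier was collected,
// replacing separate missingResourceKey/resourceKeyFound flags with a
// simple len check. The function and parameter names are illustrative.
func pickRoutingIDs(ids, fallback map[string]bool) map[string]bool {
	if len(ids) > 0 {
		return ids
	}
	return fallback
}

func main() {
	got := pickRoutingIDs(
		map[string]bool{"svc-a": true},
		map[string]bool{"default": true},
	)
	fmt.Println(len(got))
}
```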

foadnh (Author) replied:

There is an assumption in this function that there should be one identifier per resource. (Although in reality we will only have a single resource per call, which is something strange about the upstream implementation that should be addressed. Since I want to merge this upstream, I don't want to tackle that assumption here.)

@@ -149,15 +155,25 @@ func routingIdentifiersFromTraces(td ptrace.Traces, key routingKey) (map[string]
return nil, errors.New("empty spans")
}

-	if key == svcRouting {
+	if e.routingKey == resourceKeysRouting {
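For context on how routing-key dispatch could accommodate the new mode, here is a hedged sketch assuming the new mode is additive overall, i.e. service routing still exists elsewhere in the exporter. The constant names and the switch shape are assumptions for illustration, not the exporter's actual control flow.

```go
package main

import "fmt"

type routingKey int

const (
	traceIDRouting routingKey = iota
	svcRouting
	resourceKeysRouting // new mode added by this PR (constant name assumed)
)

// describe sketches a dispatch in which service routing and the new
// resource_keys routing coexist as separate routing_key config values.
func describe(k routingKey) string {
	switch k {
	case svcRouting:
		return "group spans by service.name"
	case resourceKeysRouting:
		return "group spans by the configured resource_keys attributes"
	default:
		return "group spans by trace ID"
	}
}

func main() {
	fmt.Println(describe(svcRouting))
	fmt.Println(describe(resourceKeysRouting))
}
```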

@foadnh does this change mean that we no longer have routing by service? It looks like you've switched from service routing to resource key routing.

Right, thanks!

@steves-canva steves-canva left a comment

LGTM now that I understand why it's been implemented the way it has :)

@foadnh foadnh merged commit 75eff33 into main Oct 24, 2024
129 of 169 checks passed
@foadnh foadnh deleted the foadnh-loadbalancer-resource-keys-routing branch October 24, 2024 02:00
foadnh added a commit that referenced this pull request Jan 6, 2025