
API to retrieve trace spans in bulk for multiple trace ids #4491

Open
nielsreijers opened this issue Dec 24, 2024 · 3 comments

Comments

@nielsreijers

Is your feature request related to a problem? Please describe.

We're building a system to analyse failures, where we need to take samples of sufficient size from failed and successful requests to find the common causes of failure.

As input we have a set of trace ids, which we currently retrieve one by one through the GET /api/traces/<traceID> endpoint.

This takes some time, even with the start and end parameters set. In our setup, older traces that have already moved to backend storage take anywhere from 400 to 1000ms each.

Since all traces are in the same time window, an endpoint that would allow us to get multiple traces at once could probably be much more efficient.

Describe the solution you'd like
An endpoint that accepts a list of trace ids instead of a single trace id and returns all found traces in one response.

Describe alternatives you've considered
We're currently retrieving them in parallel, but this doesn't seem to scale beyond a speedup of around 4-5x.
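
Roughly what that retrieval looks like today (a simplified Go sketch; the base URL, worker count, trace ids and the start/end window are placeholders, and error handling is trimmed):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchTraces pulls each trace id individually from GET /api/traces/<traceID>,
// limited to `workers` concurrent requests. start/end are Unix seconds.
func fetchTraces(baseURL string, traceIDs []string, start, end int64) map[string][]byte {
	const workers = 8 // in practice the speedup plateaus around 4-5x for us

	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string][]byte, len(traceIDs))
		jobs    = make(chan string)
	)

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				url := fmt.Sprintf("%s/api/traces/%s?start=%d&end=%d", baseURL, id, start, end)
				resp, err := http.Get(url)
				if err != nil {
					continue // error handling trimmed for the sketch
				}
				body, _ := io.ReadAll(resp.Body)
				resp.Body.Close()
				mu.Lock()
				results[id] = body
				mu.Unlock()
			}
		}()
	}

	for _, id := range traceIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	ids := []string{"0123456789abcdef0123456789abcdef"} // placeholder trace ids
	traces := fetchTraces("http://tempo:3200", ids, 1735000000, 1735003600)
	fmt.Printf("fetched %d traces\n", len(traces))
}
```

A bulk endpoint would let us replace the whole worker pool with a single request per set of trace ids.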

Additional context

@joe-elliott
Member

I'm not opposed to this feature. To be done well we'd need to push the trace ids all the way to the queriers. Given the amount of data potentially retrieved this could put pressure on queriers, ingesters and the frontend as well.

Other thoughts:

  • Are you using the start and end parameters? That should speed up your queries.
  • Depending on what you need, it's quite possible TraceQL can retrieve this for you directly and would be far more efficient than pulling and analyzing individual traces. If you explain your needs with an example trace, I may be able to give you a query that answers your questions directly.
  • It's also possible that writing a custom program to parse the parquet file would be the best answer. See this project for an example Go application.

@nielsreijers
Author

  • Yes, we're setting start and end. Strangely, it doesn't seem to help that much. I've done numerous tests, and any window of up to 100,000 seconds around the event time seems to perform the same (500-800ms per trace). Not setting them, or using a window of 1,000,000 s or larger, also performs uniformly, but a few hundred ms slower per trace. This is for traces older than 15-30 minutes or so. Fresh traces are retrieved in 100-200ms, but again the window size doesn't matter as long as it's <= 100,000 s.
  • I was looking at TraceQL but couldn't figure out how to do it. What we're doing is getting two sets from different time windows and reconstructing the call trees to find significant differences between the two, so we're using pretty much all of the output we get from the /api/traces endpoint. If there's a TraceQL way to get that same data in bulk, that would be great, but then would it be possible to wrap that in a user-friendly endpoint?

I'm not sure why it would increase pressure on queriers/ingesters/frontend?

I'm not hindered by any knowledge of Tempo's internals, so my assumption was that, instead of doing many calls where the code looks for a single id, it should be less pressure in total to do one slightly slower call where the code looks for matches against a set of trace ids.

For our case, we want to get two sets of around 50-100 traces each. If it were faster that would be nice to make the analysis more responsive and we might increase that number to make the averages more stable, but getting more than a few hundred is unlikely to reveal anything new.
But if having this bulk option encourages users to request huge numbers of traces in bulk while that's not the intended usage, I can see how that could be a problem, so maybe there should be a limit to how many ids can be requested at once.

@joe-elliott
Member

Yes, we're setting start and end ...

Trace by ID lookup breaks the block GUID range up based on the configured number of query shards, so regardless of the start/end the same number of jobs is created. It's possible that increasing this number would show some performance benefit if you're running a larger cluster. Even if trace by ID is not returning faster, Tempo is doing less work with a start/end passed.

I was looking at TraceQL but couldn't figure out the way to do it.

Without details it's hard to say, but there may be some shenanigans you can do. Let's say you wanted to compare the dependency graphs of traces with a root span name of "foo"; this way you could detect whether a new service was added or removed in the past week. This query:

{trace:rootName = "foo" } >> { kind = server } | select(nestedSetLeft, nestedSetRight, nestedSetParent)

will return all server spans (entry points to a service) along with their nested set values. The nested set values can be used to rebuild the tree and reconstruct the call graph. This is the kind of query the Explore Traces app performs to build service graphs or error trees.
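
For illustration, rebuilding parent/child edges from those values could look roughly like this (a sketch only, not Tempo code; the Span struct, field mapping and service names are placeholders, and it uses the standard nested set containment rule rather than relying on nestedSetParent directly):

```go
package main

import (
	"fmt"
	"sort"
)

// Span is an illustrative stand-in for a span returned by the query above,
// carrying only the fields needed to rebuild the tree.
type Span struct {
	Service string
	Left    int // nestedSetLeft
	Right   int // nestedSetRight
}

// buildEdges returns parent->child edges between the spans in the result set.
// A span contains another if its [Left, Right] interval encloses the other's,
// so with only server spans in the set these edges approximate the service graph.
func buildEdges(spans []Span) map[string][]string {
	sort.Slice(spans, func(i, j int) bool { return spans[i].Left < spans[j].Left })

	edges := make(map[string][]string)
	var stack []Span // ancestors of the current span, innermost last

	for _, s := range spans {
		// Pop ancestors whose interval does not contain this span.
		for len(stack) > 0 && stack[len(stack)-1].Right < s.Right {
			stack = stack[:len(stack)-1]
		}
		if len(stack) > 0 {
			parent := stack[len(stack)-1]
			edges[parent.Service] = append(edges[parent.Service], s.Service)
		}
		stack = append(stack, s)
	}
	return edges
}

func main() {
	// Placeholder spans with made-up nested set values.
	spans := []Span{
		{"frontend", 1, 10},
		{"checkout", 2, 5},
		{"payments", 3, 4},
		{"shipping", 6, 9},
	}
	fmt.Println(buildEdges(spans)) // map[checkout:[payments] frontend:[checkout shipping]]
}
```

Comparing the edge sets from two time windows then shows services that appeared or disappeared.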

I'm not sure why it would increase pressure on queriers/ingesters/frontend?

It is quite costly to translate an entire trace (depending on the size) from the parquet representation into proto to return to the client. The query pipeline naturally creates a backpressure that batch querying would sidestep. By asking a querier to simultaneously unmarshal/marshal 100 traces you would likely see elevated memory usage.

I'm not hindered by any knowledge of Tempo's internals, ...

I am quite hindered by knowledge of Tempo's internals and I'd do the same thing :)

I'd be open to a PR that returns traces in bulk, but it would be work that spans the entire query pipeline. If you (or anyone) would like to take this on I could detail where to get started.
