Spanmetrics: Root Span duration ~ DurationNanos #2310

jdfresser · 2023-04-07T15:57:29Z


Hello,

Do you plan to implement metrics on 'Root Span' duration? We already use spanmetrics but would love to see the whole duration, including inter-span delays (async delays, for example).

A bit more context on our use case might be useful:
We are tracing a stream pipeline with several steps, connected by an async message broker (RabbitMQ).
We'd like to know how long it takes to complete all the steps, as it can vary a lot (faster/slower processing, messages accumulating in RabbitMQ queues and causing delays, ...). Ideally the data would be fresh (less than ~5 minutes old) so we can detect delays and react quickly.

Examples of the traces we are looking at (screenshots not preserved):

1/ Trace with long duration processing

2/ Trace with long duration and message broker delays

I haven't found a similar discussion on this topic. Here are some things I was wondering about:

Maybe this feature already exists and I've missed it.
I've seen that you do store this info in Parquet (DurationNanos).
The 'additional dimensions' would be very interesting for this metric, as in spanmetrics.
I'm now wondering whether future improvements to TraceQL could address this, maybe by generating metrics, like:
avg_over_time({ .env = "production" && span.http.status_code >= 200})

PS: We're really happy with the TraceQL feature; it answers many questions we had. Kudos for the product evolution!

joe-elliott · 2023-04-10T18:17:14Z

Maintainer

There are a few ideas/questions in here. I'll try to address them all:

> Do you plan to implement metrics on 'Root Span' duration? We already use spanmetrics but would love to see the whole duration, including inter-span delays (async delays, for example).

We have discussed this and I'm 100% on board. It would be simple to add a label to a span metric when the span has no parent (parent = nil). Will ping @zalegrala and @ie-pham who are working in this area. Also @kovrus who is looking at otel spanmetrics. I will, however, note that in your case this will likely not help, since your root spans don't encompass the processing of the entire pipeline.
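
As a rough illustration of the idea (not Tempo's actual metrics-generator code), the sketch below tags a span-duration sample with a label when the span has no parent. The `Span` type, its field names, and the `is_root` label are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Minimal stand-in for a span; field names are illustrative only."""
    name: str
    span_id: str
    parent_span_id: Optional[str]  # None (or empty) for a root span
    start_ns: int
    end_ns: int


def span_metric_labels(span: Span) -> dict:
    """Build the label set for a span-duration sample.

    `is_root` is a hypothetical label name, not an existing Tempo label.
    """
    is_root = not span.parent_span_id  # treats None and "" as "no parent"
    return {"span_name": span.name, "is_root": str(is_root).lower()}


def duration_seconds(span: Span) -> float:
    """Span duration in seconds, derived from nanosecond timestamps."""
    return (span.end_ns - span.start_ns) / 1e9
```

Filtering the resulting histogram on `is_root="true"` would then give root-span durations only, which still excludes any time spent outside the root span.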

> I've seen that you do store this info in Parquet (DurationNanos)

We do intend to add support for the trace scope in #1989, which would allow you to search for traces whose duration exceeds a specific threshold, but this wouldn't allow for aggregate metrics (yet).

**Long Processing**
This could be handled using existing span metrics. You can use exemplars to jump directly to traces of long-running steps. However, this doesn't give a good holistic view of the entire pipeline. It can only alert you when there's a spike in processing time.

**Message Broker Delays**
This does not exist now, but service graph metrics come close to it. Currently we record metrics for the duration of the client span and the duration of the server span. We also support producer/consumer relationships, which is likely how your async queues are modeled. It wouldn't be too difficult to add a new histogram metric that indicates the "missing time" between the start of the client/producer span and the start of the server/consumer span.
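
To make the "missing time" idea concrete, here is a minimal sketch that assumes the producer and consumer spans have already been paired. The bucket boundaries and function names are made up for illustration; this is not how the service graph processor is implemented.

```python
import bisect

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = [0.1, 0.5, 1, 5, 30, 60, 300]


def missing_time_seconds(producer_start_ns: int, consumer_start_ns: int) -> float:
    """Delay between the start of the producer span and the start of the consumer span."""
    # Clamp at zero: clock skew between hosts can make the consumer appear to start first.
    return max(0, consumer_start_ns - producer_start_ns) / 1e9


def observe(counts: list, value: float) -> None:
    """Increment the first bucket whose upper bound is >= value.

    `counts` has one extra trailing slot that acts as the +Inf bucket.
    Counts are per-bucket here, not cumulative.
    """
    counts[bisect.bisect_left(BUCKETS, value)] += 1


counts = [0] * (len(BUCKETS) + 1)
observe(counts, missing_time_seconds(1_000_000_000, 3_500_000_000))  # 2.5 s delay
```

In this example the 2.5 s delay lands in the bucket with upper bound 5.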

**Holistic end to end metrics**
This would be more difficult for Tempo to produce metrics for. We have some upcoming live "metrics from traces" features that will move toward this, but they are likely a while away.

Let me know if you think any of this will help.

1 reply

jdfresser Jun 30, 2023
Author

Hello Joe, thanks for your detailed reply (and sorry for the delay).

**Long Processing**
> You can use exemplars to jump directly to traces of long running steps

Yes, we've seen this feature, but we haven't taken the time to try it yet because we have few use cases where it would be useful.

**Message Broker Delays**
> It wouldn't be too difficult to add a new histogram metric that indicated the "missing time" between the start of the client/producer span and the start of the server/consumer span.

This would be interesting. We managed to get this info by querying the 'rabbitmq_queue_head_message_timestamp' metric (the timestamp property of the first message in the queue, if present), which gives us an approximation of the time elapsed between producers and consumers.
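
The approximation can be sketched as follows: the age of the oldest message is the current time minus the head-message timestamp. This assumes the value is a Unix timestamp in seconds; the function name and the handling of an empty queue are illustrative, so check how your exporter reports a queue with no messages.

```python
import time
from typing import Optional


def head_message_age_seconds(head_timestamp: Optional[float],
                             now: Optional[float] = None) -> Optional[float]:
    """Approximate consumer lag from the timestamp of the queue's head message.

    `head_timestamp` is assumed to be a Unix timestamp in seconds, as scraped
    from rabbitmq_queue_head_message_timestamp. None means the queue is empty
    or the message carried no timestamp property.
    """
    if head_timestamp is None:
        return None
    now = time.time() if now is None else now
    # Clamp at zero in case the producer's clock is ahead of ours.
    return max(0.0, now - head_timestamp)
```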

**Holistic end to end metrics**
> We have some upcoming live "metrics from traces" features that will move toward this, but it's likely a while away.

We're eager to see this feature (even if it takes time!). TraceQL was definitely a great improvement.

woma1040 · 2023-11-28T15:00:42Z


Hi,
Are there any updates regarding the message broker delays?


joe-elliott · 2023-11-28T17:54:32Z

Maintainer

There is currently no way to get this info from Tempo. The metrics generator could be improved to watch for these kinds of parent/child combinations and produce histograms, but this would require holding the parent information for quite a while (hours? days?), depending on the queue.
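
A rough sketch of what "holding the parent information" could look like: producer spans are buffered by span ID until a matching consumer span arrives or a retention window expires. The class, its names, and the eviction policy are hypothetical and only meant to show why memory use grows with queue delay.

```python
from collections import OrderedDict
from typing import Optional


class PendingProducers:
    """Buffers producer span start times until the consumer span shows up."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        # span_id -> (producer_start, inserted_at), kept in insertion order.
        # Assumes span IDs are unique and `now` never goes backwards.
        self._pending = OrderedDict()

    def add_producer(self, span_id: str, start: float, now: float) -> None:
        self._pending[span_id] = (start, now)

    def match_consumer(self, parent_span_id: str, consumer_start: float) -> Optional[float]:
        """Return the queue delay in seconds, or None if the producer is unknown or expired."""
        entry = self._pending.pop(parent_span_id, None)
        if entry is None:
            return None
        return max(0.0, consumer_start - entry[0])

    def expire(self, now: float) -> int:
        """Drop entries older than the retention window; return how many were dropped."""
        dropped = 0
        while self._pending:
            span_id, (_, inserted_at) = next(iter(self._pending.items()))
            if now - inserted_at <= self.retention_seconds:
                break
            del self._pending[span_id]
            dropped += 1
        return dropped
```

Any consumer span that arrives after its producer entry has expired yields no measurement, so the retention window has to be at least as long as the worst queue delay you want to observe.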

Another option would be to record the delay in the child span as a custom attribute. This will allow direct searching and, soon, metrics via TraceQL.
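
One way to do this, sketched under assumptions: the producer stamps the message with its publish time, and the consumer computes the delay and records it on its own span. The attribute name `messaging.queue_delay_ms`, the `published_at_ms` header, and the `FakeSpan` stand-in are all hypothetical. With the OpenTelemetry Python API, a real span exposes a `set_attribute(key, value)` method that could be used the same way.

```python
import time
from typing import Optional


class FakeSpan:
    """Stand-in for a tracing span; only implements set_attribute."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def record_queue_delay(span, headers: dict, now_ms: Optional[int] = None) -> int:
    """Compute time spent in the queue and store it on the consumer span.

    `published_at_ms` is a hypothetical message header set by the producer
    (Unix time in milliseconds).
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    # Clamp at zero to absorb clock skew between producer and consumer hosts.
    delay_ms = max(0, now_ms - int(headers["published_at_ms"]))
    span.set_attribute("messaging.queue_delay_ms", delay_ms)
    return delay_ms
```

Spans carrying such an attribute could then be found with a TraceQL search along the lines of `{ span.messaging.queue_delay_ms > 5000 }`.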
