Capture low volume metrics in telemetry #25713
These are a little unusual because they're not gauges like memory usage or CPU; they're meant to represent rates (i.e. throughput). What we're trying to capture is the rate information in 1m intervals. However, we send a report every hour, and we don't want to send in all 60 recorded intervals.

The tricky bit is that the hour isn't exact, so a total isn't useful unless you have a start time and end time for the reporting interval (so you can calculate a rate), or you have the last report with some total and time that you can compare against this report's total and its own time.

So what would be useful here? We'd like to know if there are big spikes or dips in throughput over the course of the reporting interval (1 hour), and we'd like to know the average throughput per second over the course of the hour. Since many cases see fewer than 1 query (or write) per second, if we express the rate at that level we'd want to capture it as a float. Closing this out, maybe we want 3 rates expressed for each hour-long reporting interval:
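The hourly roll-up described above can be sketched as follows. This is a minimal illustration, not the actual influxdb telemetry code; the function and field names (`rollup_hour`, `total`, `min_1m`, `max_1m`, `avg_per_sec`) are hypothetical.

```python
# Hypothetical sketch: roll 60 one-minute query counts into an hourly
# telemetry report. Names are illustrative, not the real schema.

def rollup_hour(minute_counts, interval_seconds):
    """Summarize per-minute counts over one reporting interval."""
    total = sum(minute_counts)
    return {
        "total": total,                           # whole-interval volume
        "min_1m": min(minute_counts),             # dip detection
        "max_1m": max(minute_counts),             # spike detection
        "avg_per_sec": total / interval_seconds,  # float, so <1/s rates survive
    }

# Three queries spread over an hour: a rounded whole-number average would
# report 0, but the float per-second rate preserves the activity.
counts = [0] * 60
counts[5] = counts[20] = counts[44] = 1
report = rollup_hour(counts, 3600)
```

Because the reporting interval isn't exactly an hour, a real implementation would pass the measured interval length rather than a hard-coded 3600.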
This makes sense to me @pauldix. Maybe I'm misunderstanding, but what you described as min_count_minute and max_count_minute: is that not what the _min_1m and _max_1m metrics are already capturing? This request came after I was exploring the data and noticed fields like QUERY_REQUESTS_AVG_1M were mostly 0s, because the per-minute values are averaged as rounded whole numbers. So if the user is running a few queries every few minutes, it can easily average out to zero.

The total value would get us what we need to report on query/write volume, and then between the min/max/avg we could get at what you're interested in, which is looking for spikes/drops over the reporting interval by comparing min/max to total and avg. Understood that volume is inexact given the non-precise "hour" roll-up, but my assumption was that we would want to watch total query/write volume as we launch and see it going up and to the right, something that could be lost if we only capture min/max/avg.
@mona-influx yes, the existing _min_1m and _max_1m capture the counts I was talking about.
@pauldix - I think you'll be able to compare average throughput spikes/dips as long as there have been at least 60 of them in a minute; anything below that shows up as 0. If we need to compare low volume (<60 reads/writes per minute), then we can add the fields as you mentioned.
Telemetry already exposes `min`, `max`, and `avg` for some of the metrics. The `avg` field can be used to calculate the volume on the fly, but for certain metrics (e.g. query count) there could be just one or two queries in an hour, and when calculating `avg` it gets rounded down to 0 for the whole hour. This can be fixed either by using floats (at a given precision) for `avg`, or by sending the totals calculated for the hour as a separate field (this option is preferred).
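The rounding problem and both proposed fixes can be shown in a few lines. This is a minimal sketch of the arithmetic only, assuming two queries spread over a 60-minute interval; none of the variable names come from the actual telemetry code.

```python
# Sketch of the rounding problem: averaging low per-minute counts as a
# whole number loses the signal, while a float avg or a per-hour total
# keeps it.

minute_counts = [0] * 60
minute_counts[10] = minute_counts[42] = 1  # two queries in the hour

avg_rounded = sum(minute_counts) // len(minute_counts)  # current behavior
avg_float = sum(minute_counts) / len(minute_counts)     # fix 1: float avg
total = sum(minute_counts)                              # fix 2: send total

print(avg_rounded)  # 0 -- the activity disappears
print(avg_float)    # ~0.033 queries/minute
print(total)        # 2 -- volume survives the roll-up
```

Sending the total sidesteps precision choices entirely, which is presumably why it is the preferred option above.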