
Capture low volume metrics in telemetry #25713

Open
praveen-influx opened this issue Dec 27, 2024 · 4 comments
praveen-influx commented Dec 27, 2024

Telemetry already exposes min, max, and avg for some of the metrics. The avg field can be used to calculate volume on the fly, but for certain metrics (e.g. query count) there may be only one or two events in an hour, and the average gets rounded down to 0 for the whole hour. This could be fixed either by using floats (at a given precision) for avg, or by sending the total calculated for the hour as a separate field (the preferred option).
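To illustrate the rounding problem, here is a minimal sketch with made-up numbers (two queries spread over a 60-minute window); the integer average collapses to 0 while a float average or a separate total keeps the signal:

```rust
fn main() {
    // Hypothetical: two queries observed across a 60-minute reporting window.
    let mut per_minute_counts = vec![0u64; 60];
    per_minute_counts[10] = 1; // one query in minute 10
    per_minute_counts[42] = 1; // one query in minute 42

    let total: u64 = per_minute_counts.iter().sum();

    // Integer average per minute rounds down to 0, losing the signal.
    let avg_int = total / per_minute_counts.len() as u64;

    // Either a float average or a separate total field preserves it.
    let avg_f64 = total as f64 / per_minute_counts.len() as f64;

    println!("total={total} avg_int={avg_int} avg={avg_f64:.4}");
    assert_eq!(avg_int, 0); // the whole hour reports as "no queries"
    assert_eq!(total, 2);   // the total still shows the activity
}
```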


pauldix commented Dec 27, 2024

These ones are a little weird because they're not gauges like memory usage or CPU. They're meant to represent rates (i.e. throughput).

These are a little strange because what we're trying to capture is the rate information in 1m intervals. However, we send a report every hour and we don't want to send all 60 recorded intervals in. The tricky bit is that the hour isn't exact, so a total isn't useful unless you have a start time and end time for the total reporting interval (so you can calculate a rate), or you have the last report with some total and time that you can compare to this report with a total and its own time.
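As a sketch of that second option (hypothetical timestamps and totals): a cumulative total only becomes a rate once you pair it with the previous report's total and time.

```rust
fn main() {
    // Hypothetical consecutive reports: (unix seconds, cumulative event total).
    let prev = (1_735_300_000_u64, 1_000_u64);
    let curr = (1_735_303_700_u64, 1_090_u64); // ~61m40s later, not exactly an hour

    let elapsed_secs = (curr.0 - prev.0) as f64; // 3700.0, the real interval length
    let delta = (curr.1 - prev.1) as f64;        // 90 events
    let rate_per_sec = delta / elapsed_secs;

    println!("rate = {rate_per_sec:.5} events/s");
    assert!((rate_per_sec - 90.0 / 3700.0).abs() < 1e-12);
}
```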

So what would be useful here? We'd like to know if there are big spikes or dips in throughput over the course of the reporting interval (1 hour) and we'd like to know the average throughput in seconds over the course of the hour.

Since many cases have fewer than 1 query (or write) per second, if we express the rate at that level, we'd want to capture it as a float.

To close this out: maybe we want four values expressed for each hour-long reporting interval:

  • min_count_minute (the count of queries/writes in the minute with the fewest over the reporting interval)
  • max_count_minute (the count of queries/writes in the minute with the most over the reporting interval)
  • total (the total number over the reporting interval)
  • rate_seconds (the per-second rate over the reporting interval; should be a float. If the interval is exactly 1 hour, this is total / 3600)
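The four proposed fields could be derived from the per-minute counts roughly like this (a sketch with made-up numbers; the field names follow the list above, and the interval is deliberately a bit longer than an hour to show why the rate shouldn't assume 3600 seconds):

```rust
fn main() {
    // Hypothetical per-minute counts: 3 queries in every 7th minute of a
    // reporting interval that ran slightly long (61 one-minute samples).
    let counts: Vec<u64> = (0..61).map(|i| if i % 7 == 0 { 3 } else { 0 }).collect();
    let interval_secs = 61.0 * 60.0; // actual wall-clock length, not an assumed 3600

    let min_count_minute = *counts.iter().min().unwrap();
    let max_count_minute = *counts.iter().max().unwrap();
    let total: u64 = counts.iter().sum();
    // A float rate, so sub-1-per-second workloads don't round to zero.
    let rate_seconds = total as f64 / interval_secs;

    println!("min={min_count_minute} max={max_count_minute} total={total} rate={rate_seconds:.5}/s");
    assert_eq!((min_count_minute, max_count_minute, total), (0, 3, 27));
}
```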


mona-influx commented Dec 27, 2024

This makes sense to me @pauldix. Maybe I'm misunderstanding, but aren't what you described as min_count_minute and max_count_minute what the _min_1m and _max_1m metrics are already capturing? This request came after I was exploring the data and noticed that fields like QUERY_REQUESTS_AVG_1M were mostly 0s, because the average of the per-minute values is a rounded whole number. So if a user is running only a few queries every few minutes, it can easily average out to zero.

The total value would get us what we need to report on query/write volume, and between the min/max/avg we could get at what you're interested in: looking for spikes/drops over the reporting interval by comparing min/max to total and avg. Understood that volume is inexact given the non-precise "hour" roll-up, but my assumption was that we'd want to watch total query/write volume as we launch and see it going up and to the right, something that could be lost if we only capture min/max/avg.


pauldix commented Dec 27, 2024

@mona-influx yes, the existing _min_1m and _max_1m capture the counts I was talking about.

@praveen-influx
Contributor Author

@pauldix - I think you'll be able to compare the average throughput spikes/dips as long as it's done at least 60 of them in a minute. Anything below that will come up as 0. If we need to compare low "volume" (<60 reads/writes per minute), then we can add the fields you mentioned.
