Describe the bug
Despite setting low rate limits and using max_traces_per_user, our ingesters get OOM killed when trace size grows above 100 KiB.
To Reproduce
Steps to reproduce the behavior:
Start Tempo (2.5.0) in a Kubernetes cluster with 3 ingesters, each with a 10 GiB memory limit.
Start xk6-client-tracing with the average trace size set to 100 KiB (see param.js below).
Run the load test with ~3k to 5k active live traces.
Expected behavior
Ingesters get OOM killed and restart.
Environment:
Infrastructure: Kubernetes
Deployment tool: tanka
Tempo version: 2.5.0
Distributors to ingesters: 3 :: 3
Additional Context
We do not face this problem when the average trace size (p95) is below 50 KiB. Whenever the average trace size exceeds ~90 KiB, we cannot prevent OOM kills despite setting low burst_size_bytes, rate_limit_bytes, and max_traces_per_user values. The table below summarizes our test runs:
| OOM Kills | burst_size_bytes | rate_limit_bytes | Average Trace Size (Bytes) | Live Traces (30k) | Distributor bytes limit (burst + rate) | Distributor (N) x Ingester (N) | Ingester Memory (Max) | Rate Limit Strategy | Time Under Test | Average Trace Size * Live Traces (MiB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 MiB | 14 MiB | 57000 | 15000 | 29 MiB | 3 x 3 | 80% | Global | 25m | 815.39 |
| 0 | 17 MiB | 14 MiB | 48000 | 18000 | 29 MiB | 3 x 3 | 70% | Global | 25m | 823.97 |
| 0 | 17 MiB | 14 MiB | 38000 | 25000 | 28 MiB | 3 x 3 | 60% | Global | 25m | 905.99 |
| 1 | 17 MiB | 14 MiB | 187000 | 2000 | 18 MiB | 3 x 3 | N/A | Global | < 10m | 356.67 |
| 1 | 17 MiB | 14 MiB | 219000 | 1200 | 18.9 MiB | 3 x 3 | N/A | Global | < 10m | 250.63 |
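For reference, the limits in the table map onto Tempo's per-tenant overrides roughly like the sketch below. The key names are Tempo's standard overrides options; the byte values are converted from the MiB figures above, and the max_traces_per_user value is only a placeholder, since the exact value we used is not listed here:

```yaml
# Sketch of the overrides corresponding to the table above (legacy inline format).
overrides:
  ingestion_rate_strategy: global        # "Rate Limit Strategy" column
  ingestion_burst_size_bytes: 17825792   # 17 MiB (burst_size_bytes column)
  ingestion_rate_limit_bytes: 14680064   # 14 MiB (rate_limit_bytes column)
  max_traces_per_user: 30000             # placeholder; actual value not shown above
```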
param.js
```js
import { sleep } from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
  vus: 120,
  stages: [
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
    { duration: '10s', target: 120 },
    { duration: '2m', target: 120 },
  ]
};

const endpoint = __ENV.ENDPOINT || "https://<>:443"
const client = new tracing.Client({
  endpoint,
  exporter: tracing.EXPORTER_OTLP,
  tls: {
    insecure: true,
  }
});

export default function () {
  // 50 traces per iteration; each trace has 100 spans of ~900 bytes,
  // i.e. roughly 90 KB of span payload per trace.
  let pushSizeTraces = 50;
  let pushSizeSpans = 0;
  let t = [];
  for (let i = 0; i < pushSizeTraces; i++) {
    let c = 100
    pushSizeSpans += c;
    t.push({
      random_service_name: false,
      spans: {
        count: c,
        size: 900,
        random_name: true,
        fixed_attrs: {
          "test": "test",
        },
      }
    });
  }
  let gen = new tracing.ParameterizedGenerator(t)
  let traces = gen.traces()
  sleep(5)
  console.log(traces);
  client.push(traces);
}

export function teardown() {
  client.shutdown();
}
```
There are two things that drive memory usage in Tempo ingesters, compactors, and (depending on the query) queriers:

- Trace size
- Dictionary sizes in parquet
I'm not surprised you're seeing elevated memory usage as you bring up the trace size, but I am very surprised you are seeing such elevated usage at just ~100-200 KB. We run cells with tenants who push traces that are 50 MB+.
Some things to test:

`random_name: true` — this is likely creating a very large dictionary, which is probably part of the memory issue. Let's try removing it.
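For instance, a sketch of the spans block from param.js above with `random_name` dropped, everything else unchanged (an illustration of the suggestion, not a verified fix):

```js
// Same trace parameters as param.js above, but without random_name,
// so span names repeat and the parquet dictionaries stay small.
t.push({
    random_service_name: false,
    spans: {
        count: c,          // 100 spans per trace, as before
        size: 900,         // ~900 bytes per span
        fixed_attrs: {
            "test": "test",
        },
    }
});
```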
Tempo 2.7 will have some nice ingester memory improvements and will also contain the metric tempo_ingester_live_trace_bytes, which will help you see, per tenant, who is consuming live trace memory.
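Once on 2.7, something like the following could chart that metric per tenant or per pod (a sketch only; the `tenant` label name is an assumption based on the existing tempo_ingester_live_traces metric, and `pod` assumes standard Kubernetes relabeling):

```promql
# Live trace bytes per tenant (label name assumed)
sum by (tenant) (tempo_ingester_live_trace_bytes)

# Live trace bytes per ingester pod
sum by (pod) (tempo_ingester_live_trace_bytes)
```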
Another issue that we are looking at now is that an ingester that is CPU starved will experience lock contention, and the Go heap will balloon. This is harder to prove out, but it should be in the back of our minds while we are diagnosing this. A memory profile would be helpful for seeing whether this is the issue. Honestly, a memory profile would be great all around and would help me very quickly diagnose the issue if you could provide one.
This metric will show us roughly what Tempo thinks the bytes per trace are and would be useful to confirm what we believe the test is creating. We can view this metric per pod or per tenant to see if there's anything interesting.