When a large number of pods try to mount secrets concurrently, we run into client rate-limit issues with the error: unable to obtain workload identity auth: unable to fetch SA info: client rate limiter Wait returned an error: context canceled
I think the error actually comes down to rate limiting in the Go client used to query the kube-apiserver. The error seems to come from the part of the process where the secrets-store-gcp plugin queries the GCP service account annotation on the Kubernetes service account used by the pod:
return nil, fmt.Errorf("unable to fetch SA info: %w", err)
We have observed this error in two scenarios:
a large number of cron jobs starting at the same time
pods being migrated to new nodes during a Kubernetes cluster upgrade
Expected behavior
If rate limiting occurs, there should be some retry logic rather than an immediate mount failure
Observed behavior
The pods fail to start because they cannot mount the secret volume. Manually restarting a single pod fails with the same error, but restarting the secrets-store-csi-driver-provider-gcp daemonset resolves the issue; this had to be repeated several times until all pods had successfully mounted their secrets. The issue recurred the next time the cron jobs started, so we staggered their start times. During cluster upgrades we had to watch all the pods and repeatedly restart the daemonset to work through the issue.
Environment
provider version: v1.2.0
secret store version: v1.3.3
The failing call is at secrets-store-csi-driver-provider-gcp/auth/auth.go, line 136 (commit 3ba36fc).
Additional information
We are using Pod Workload Identity
This is also a known issue with the AWS provider; see aws/secrets-store-csi-driver-provider-aws#136 (comment)