Client side rate limiting #383

Open
rob-whittle opened this issue Feb 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments


rob-whittle commented Feb 14, 2024

TL;DR

When a large number of pods try to mount secrets concurrently, we run into client-side rate limiting with the following error:
unable to obtain workload identity auth: unable to fetch SA info: client rate limiter Wait returned an error: context canceled

I think the error is actually caused by client-side rate limiting in the Go client used to query the kube-apiserver. The error appears to come from the step where the secrets-store-gcp plugin looks up the GCP service account annotation on the Kubernetes service account used by the pod:

return nil, fmt.Errorf("unable to fetch SA info: %w", err)
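For context, here is a minimal sketch of what that lookup path likely looks like with standard client-go usage (an assumption on my part, not the provider's actual code; the package and function names are made up). Every API call goes through the client's token-bucket rate limiter, and its Wait call is what surfaces "client rate limiter Wait returned an error: context canceled" when the mount's context expires before a token becomes available:

```go
// Hypothetical sketch of the ServiceAccount lookup, not the provider's actual code.
package samount

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func gcpSAAnnotation(ctx context.Context, namespace, name string) (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	// With QPS/Burst left unset on the rest.Config, client-go falls back to its
	// defaults (QPS=5, Burst=10), shared by every mount handled by this process.
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	sa, err := clientset.CoreV1().ServiceAccounts(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("unable to fetch SA info: %w", err)
	}
	// Workload Identity annotation on the Kubernetes ServiceAccount.
	return sa.Annotations["iam.gke.io/gcp-service-account"], nil
}
```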

We have observed this error in two scenarios:

  1. a large number of cron jobs starting at the same time
  2. when upgrading the Kubernetes cluster, as pods are migrated to new nodes

Expected behavior
If client-side rate limiting occurs, there should be some retry logic rather than failing the mount outright.
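To illustrate the kind of retry logic I mean (a sketch only, not a proposed patch; the function name, backoff choice, and QPS/Burst values are assumptions), the SA lookup could be wrapped in client-go's retry helper and/or given more client-side headroom:

```go
// Sketch of possible mitigations, assuming the lookup shown above; values are illustrative only.
package samount

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func gcpSAAnnotationWithRetry(ctx context.Context, cfg *rest.Config, namespace, name string) (string, error) {
	// Mitigation 1: give the shared client more headroom than the defaults.
	cfg.QPS = 50
	cfg.Burst = 100

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}

	// Mitigation 2: retry the lookup with exponential backoff instead of failing
	// the mount on the first error. A real fix would likely only retry
	// rate-limit/timeout style errors rather than everything.
	var gcpSA string
	err = retry.OnError(retry.DefaultBackoff,
		func(err error) bool { return true }, // illustrative: treat all errors as retriable
		func() error {
			sa, err := clientset.CoreV1().ServiceAccounts(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return fmt.Errorf("unable to fetch SA info: %w", err)
			}
			gcpSA = sa.Annotations["iam.gke.io/gcp-service-account"]
			return nil
		})
	return gcpSA, err
}
```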

Observed behavior
The pods fail to start because they cannot mount the secret volume. Restarting a single pod manually still fails with the same error. Restarting the secrets-store-csi-driver-provider-gcp daemonset resolved the issue, but this had to be repeated several times until all pods successfully mounted their respective secrets. The issue reappeared the next time the cron jobs started, so we staggered their start times. During cluster upgrades we had to watch all the pods and repeatedly restart the secrets-store-csi-driver-provider-gcp daemonset to work through the issue.

Environment
provider version: v1.2.0
secret store version: v1.3.3

Additional information
We are using Pod Workload Identity.
This is also a known issue with the AWS provider; see aws/secrets-store-csi-driver-provider-aws#136 (comment).

rob-whittle added the bug label on Feb 14, 2024
@tuusberg commented

What was the total number of pods in your scenario? Trying to better understand the definition of "large" :)
