When a large number of pods try to mount secrets concurrently, we run into client rate-limit issues with the error: unable to obtain workload identity auth: unable to fetch SA info: client rate limiter Wait returned an error: context canceled
I think the error actually comes down to rate limiting in the Go client used to query the kube-apiserver. The error seems to come from the part of the process where the secrets-store-gcp plugin queries the GCP service account annotation on the Kubernetes service account used by the pod:
return nil, fmt.Errorf("unable to fetch SA info: %w", err)
We have observed this error in two scenarios:
a large number of cron jobs starting at the same time
pods being migrated to new nodes during a Kubernetes cluster upgrade
Expected behavior
If rate limiting occurs, there should be some retry logic rather than an immediate mount failure
Observed behavior
The pods fail to start because they cannot mount the secret volume. Manually restarting a single pod fails with the same error, but restarting the secrets-store-csi-driver-provider-gcp daemonset resolves the issue; this had to be repeated several times until all pods had successfully mounted their secrets. The issue recurred the next time the cron jobs started, so we staggered their start times. During cluster upgrades we had to watch all the pods and repeatedly restart the daemonset to work through the issue.
Environment
provider version: v1.2.0
secret store version: v1.3.3
The failing call is at secrets-store-csi-driver-provider-gcp/auth/auth.go, line 136 (commit 3ba36fc).
Additional information
We are using Pod Workload Identity
This is also a known issue with the AWS provider; see aws/secrets-store-csi-driver-provider-aws#136 (comment)