Investigate options to speed up reconciliation for a large number of GWs, Listeners, and policies #1085
Comments
PR with added sectionName: |
@trepel I have found an issue with the reconcile of the rate limit policies. I expect it was causing a lot of the slowdown that you were seeing. The testing I did was done using kind locally, and I could see a large slowdown with as few as 10 listeners. It would be nice if you could run #1100 through the load test, and maybe grab the logs from the Kuadrant operator as well. I saw there were a lot of noisy logs, many of which I hope have been removed now. |
@Boomatang great news! Thanks. I will look into this - I don't want to promise today but no later than tomorrow hopefully. |
One way to improve the time is to increase the vCPU requests/limits for the Kuadrant operator pod (big thanks to @Boomatang for pointing this out). Currently both are set to 200m. I tried the "2 Gateways, 32 Listeners each" scenario with requests/limits set to 1500m, and the time AuthPolicies/RLPs spent waiting for .status decreased from 76s-120s to 13s. The peak vCPU consumption was slightly over 900m during the scale test run. Q: The fact that the Kuadrant operator was throttled was not obvious to me; do we have any alerting in place, or plan to have any? Another thing to look at is the scale test itself. It might make sense to split it so that issuing Certificates and creating DNS records in the DNS cloud provider are done in a separate step. Performance of this is not directly related to Kuadrant. Furthermore, there is a separate DNSOperator scale test on the DNSOperator level under development: Yet another thing is that kube-burner creates resources one by one (in relatively quick succession, but not really at once). Given how our reconciliation flow logic works, we might get better results if everything is created at once. |
Currently the operator will not install any alerts for the user. There are example alerts that the user can create. I don't think there is any plan for the Kuadrant operator to install the alerts. Over the next while I hope to do a bit of research into where the operator was spending its CPU cycles. As it stands, we don't know if that is a large load. I agree that it might be good to split the tests out, but we also need to see how all the components play together. I think we need both. Again, kube-burner creating resources slowly could be the more common case as a user adds more services into the system, but on pod restarts and upgrades it would be a bulk reconcile. So it is hard to know which one to put more effort into. |
Overview
Recently a scale test has been implemented:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I ran the scale test a few times. It took quite some time for the Kuadrant operator to reconcile all the policies. Many AuthPolicies and RLPs did not get any status for quite some time. Once they did get a status, it was often:
'AuthPolicy waiting for the following components to sync: [AuthConfig (0cbc22a687a9ff2a57c54007e8ad9b6bc17de3744144196b9b8286fb1593f495)]'
and
'RateLimitPolicy waiting for the following components to sync: [Limitador]'
Everything got reconciled successfully and policies got enforced eventually, but it took quite some time:
1 GW 16 Listeners -> 16s to get status for all policies
1 GW 32 Listeners -> 120s to get status for all policies
1 GW 48 Listeners -> 7 min to get status for all policies
1 GW 63 Listeners -> 30 min to get status for all policies
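To make the growth in the numbers above easier to see, here is a small sketch that computes the average wait per Listener and a rough local growth exponent between consecutive runs. It only uses the four measurements listed above; the helper names are made up for illustration, and four single-run data points are far too few to draw firm conclusions.

```python
import math

# Measurements copied from the list above:
# (listeners on one Gateway, seconds until all policies had a status).
MEASUREMENTS = [(16, 16), (32, 120), (48, 7 * 60), (63, 30 * 60)]


def seconds_per_listener(points):
    """Average wait per Listener for each measurement."""
    return [(n, t / n) for n, t in points]


def growth_exponents(points):
    """Estimate k in t ~ n**k between each pair of consecutive measurements."""
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for n, per in seconds_per_listener(MEASUREMENTS):
        print(f"{n} listeners: {per:.2f} s per listener")
    for k in growth_exponents(MEASUREMENTS):
        print(f"local exponent: {k:.2f}")
```

An exponent of 1 would mean linear scaling. Every step here comes out above 2, which suggests the reconcile cost grows much faster than the number of Listeners.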
In the operator log there were a lot of entries complaining about invalid paths. So I made each HTTPRoute target a specific Listener rather than the whole Gateway. This made the results much nicer:
1 GW 32 Listeners -> 18s to get status for all policies
1 GW 48 Listeners -> 60s to get status for all policies
1 GW 63 Listeners -> 120s to get status for all policies
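For reference, targeting a specific Listener is done through the `sectionName` field of an HTTPRoute `parentRefs` entry in the Gateway API. The sketch below builds such a manifest as a plain Python dict. The function name and all resource names, hostnames, and ports are hypothetical and are not taken from the scale test; only the field layout follows the Gateway API.

```python
import json


def http_route(name, namespace, gateway, listener, hostname, backend, port=80):
    """Build an HTTPRoute manifest attached to a single Gateway Listener.

    All argument values are placeholders chosen by the caller.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [
                {
                    "name": gateway,
                    "namespace": namespace,
                    # Without sectionName the route attaches to the whole Gateway.
                    "sectionName": listener,
                }
            ],
            "hostnames": [hostname],
            "rules": [{"backendRefs": [{"name": backend, "port": port}]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    route = http_route(
        "route-1", "scale-test", "gw-1", "listener-1", "app1.example.com", "httpbin"
    )
    print(json.dumps(route, indent=2))
```

The resulting JSON can be applied as-is with `kubectl apply -f`, since Kubernetes accepts JSON manifests.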
I tried with 2 GWs as well:
2 GW 16 Listeners -> 18s to get status for all policies
2 GW 32 Listeners -> 76s to get status for all policies
However, this was still too much:
2 GW 63 Listeners -> 16 min to get status for all policies
Initial Investigation
Be aware that certificate generation and DNS record creation might affect the results. It takes some time for certificates to get created (the scale test uses a self-signed cluster issuer), and it also takes time for the cloud provider to issue that many DNS records.
It seems reasonable that the wasm config (only one per GW) is a contention point. That said, it should eventually get there.
There are repeated log entries of “failed to update the object has been modified; please apply your changes to the latest version” in Kuadrant operator pod.
Also entries like "failed to create SOMETHING, SOMETHING already exists" appeared in Kuadrant operator pod log (SOMETHING being Certificate/AuthConfig typically) - not sure if this indicates some issue or not.
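The "object has been modified" message is the Kubernetes API server rejecting an update that was made against a stale `resourceVersion`. The toy model below sketches that mechanism and the usual read-modify-write retry that recovers from it. It is an in-memory simulation with made-up class and function names, not Kuadrant operator code; in Go controllers the same pattern is commonly written with client-go's `retry.RetryOnConflict`.

```python
import copy


class Conflict(Exception):
    """Raised when an update is based on a stale version."""


class Store:
    """Toy object store with optimistic concurrency (hypothetical)."""

    def __init__(self):
        self._objects = {}  # name -> (version, data)

    def create(self, name, data):
        self._objects[name] = (1, copy.deepcopy(data))

    def get(self, name):
        version, data = self._objects[name]
        return version, copy.deepcopy(data)

    def update(self, name, version, data):
        current, _ = self._objects[name]
        if version != current:
            raise Conflict(f"{name}: the object has been modified")
        self._objects[name] = (current + 1, copy.deepcopy(data))


def update_with_retry(store, name, mutate, attempts=5):
    """Re-read the latest version and reapply the change after each conflict."""
    for _ in range(attempts):
        version, data = store.get(name)
        mutate(data)
        try:
            store.update(name, version, data)
            return
        except Conflict:
            continue
    raise Conflict(f"{name}: still conflicting after {attempts} attempts")


if __name__ == "__main__":
    store = Store()
    store.create("wasm-config", {"routes": []})
    stale_version, stale_data = store.get("wasm-config")
    # Another writer gets in first, so stale_version is now out of date.
    update_with_retry(store, "wasm-config", lambda d: d["routes"].append("a"))
    stale_data["routes"].append("b")
    try:
        store.update("wasm-config", stale_version, stale_data)
    except Conflict as err:
        print("conflict:", err)
    update_with_retry(store, "wasm-config", lambda d: d["routes"].append("b"))
    print(store.get("wasm-config"))
```

With many writers touching a single shared object, each conflict costs an extra read and write, which is one plausible reason a per-Gateway object could become a bottleneck.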
Questions / Investigation required
Does it make sense that having many invalid paths is so expensive?
What can be done to improve on that 16 minutes? 2 Gateways and 63 Listeners on each are not super high numbers.
Steps to reproduce
Basically follow the readme of the scale test:
https://github.com/Kuadrant/testsuite/tree/main/scale_test
I used OCP on AWS (6 worker nodes), with DNS set up on the same AWS account.
This was done against Kuadrant v1.0.1, OCP v4.17.7.
I believe there is enough detail here; for even more details see (Red Hat only, sorry):
https://docs.google.com/document/d/1ATH2aZJ7-qlYTV3jF_rZduMC1MTPKoWCD-N4LmmcaMA/edit?tab=t.0