[spike] Investigate DNSRecord events #1108

Boomatang · 2025-01-13T12:29:04Z

The DNSRecord status is updated frequently because of the "queuedAt" field. Let's investigate whether there is a way to remove this field without risking excessive communication with the DNS Provider API.

This was discovered during the investigation into #1085. After the load test had completed, we noticed a number of events triggered by a DNSRecord. There was nothing in the logs to suggest what actions the operator was performing.

Attached are the pod logs from two different load test runs. The pods were restarted with new resource limits between runs.

kuadrant-operator-controller-manager-8464cd4785-82lfs-manager.log
kuadrant-operator-controller-manager-db6784dfc-ghv9d-manager.log

mikenairn · 2025-01-14T09:08:58Z

With the current implementation of the DNSRecord status update, I'd say this is expected.

To let multiple records, and therefore multiple clusters, share a hostname, the DNSRecord reconcile includes a mechanism that enforces the consistency of the records in the provider. In short, the DNSRecord is reconciled periodically (every 15 minutes, I believe). During initial creation it is reconciled more frequently, slowly backing off until it reaches the max polling interval (15 mins). During this time, especially if multiple records are adding values to the same record set, multiple writes to the DNSRecord status are expected.

The following is an example diff between two snapshots of the same record, which has existed for a few hours:

31c31
<         "resourceVersion": "90148",
---
>         "resourceVersion": "121484",
219c219
<         "queuedAt": "2025-01-13T22:34:58Z",
---
>         "queuedAt": "2025-01-14T08:41:03Z",

Diff of a record that was created more recently (notice that validFor is also changing):

31c31
<         "resourceVersion": "35504",
---
>         "resourceVersion": "39965",
219c219
<         "queuedAt": "2025-01-13T17:12:40Z",
---
>         "queuedAt": "2025-01-13T17:23:20Z",
338c338
<         "validFor": "10m40s",
---
>         "validFor": "15m0s",

There is an issue where the endpoint list is not ordered, so it can sometimes be updated unnecessarily. But because we update these other values in the status on every reconcile, an update is going to happen regardless.
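
For the ordering part on its own, one possible approach is to compare endpoint lists in an order-independent way before deciding whether the status changed. The sketch below uses a hypothetical, simplified Endpoint type and helper names (sortedCopy, endpointsEqual). It is not the operator's actual type or logic, and it does not address the queuedAt and validFor writes:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Endpoint is a hypothetical, simplified stand-in for the endpoint
// entries held in the DNSRecord status.
type Endpoint struct {
	DNSName    string
	RecordType string
	Targets    []string
}

// sortedCopy returns a copy of eps ordered by DNSName, then RecordType.
// The input slice is left untouched.
func sortedCopy(eps []Endpoint) []Endpoint {
	out := make([]Endpoint, len(eps))
	copy(out, eps)
	sort.Slice(out, func(i, j int) bool {
		if out[i].DNSName != out[j].DNSName {
			return out[i].DNSName < out[j].DNSName
		}
		return out[i].RecordType < out[j].RecordType
	})
	return out
}

// endpointsEqual reports whether a and b hold the same endpoints,
// ignoring the order of the list itself.
func endpointsEqual(a, b []Endpoint) bool {
	return reflect.DeepEqual(sortedCopy(a), sortedCopy(b))
}

func main() {
	a := []Endpoint{
		{DNSName: "b.example.com", RecordType: "A", Targets: []string{"192.0.2.1"}},
		{DNSName: "a.example.com", RecordType: "CNAME", Targets: []string{"b.example.com"}},
	}
	b := []Endpoint{a[1], a[0]} // same endpoints, different order
	fmt.Println(endpointsEqual(a, b))
}
```

This sketch assumes that DNSName plus RecordType identifies an endpoint and that the order of Targets within an endpoint is already stable. If either assumption does not hold, the sort key and comparison would need to account for it.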

It might be possible to patch those fields instead of doing an update. I'm not totally sure that makes any difference; I would have thought it still needs to update the resourceVersion field in order to avoid conflicts.

The resourceVersion field forever increasing in value is a bit concerning.

You might find more info on it in here, although it's probably way out of date now tbh.

guicassolato added this to Kuadrant Jan 13, 2025

philbrookes moved this to Todo in Kuadrant Jan 16, 2025

philbrookes changed the title ~~Investigate DNSrecond events~~ [spike] Investigate DNSrecond events Jan 16, 2025
