
⚠️ types tightening #384

Merged (7 commits), Oct 2, 2024

Conversation

grokspawn (Contributor)

Solves #381

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 6, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 14, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2024

codecov bot commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Project coverage is 35.28%. Comparing base (db8617f) to head (7f0da5b).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
api/core/v1alpha1/zz_generated.deepcopy.go 0.00% 1 Missing ⚠️
...rnal/controllers/core/clustercatalog_controller.go 96.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #384      +/-   ##
==========================================
- Coverage   35.65%   35.28%   -0.37%     
==========================================
  Files          14       14              
  Lines         805      802       -3     
==========================================
- Hits          287      283       -4     
  Misses        472      472              
- Partials       46       47       +1     


@@ -164,6 +164,9 @@ type CatalogSource struct {
}

// ResolvedCatalogSource is a discriminated union of resolution information for a Catalog.
// ResolvedCatalogSource contains the information about a sourced Catalog
// +union
// +kubebuilder:validation:XValidation:rule="self.type != 'Image' || has(self.image)",message="sourceType 'Image' requires image field"
Collaborator

I think we may want something more like:

Suggested change
// +kubebuilder:validation:XValidation:rule="self.type != 'Image' || has(self.image)",message="sourceType 'Image' requires image field"
// +kubebuilder:validation:XValidation:rule="self.type == 'Image' && has(self.image)",message="sourceType 'Image' requires image field"

This ensures that we return "True" for the validation only if the type is Image and the image field is set.

IIUC, the current validation of self.type != 'Image' || has(self.image) returns "True" in scenarios that would be invalid:

  • type is set to Foo (not currently possible based on enums, but for sake of ensuring we have a solid CEL validation assume we can) and the image field is set
  • type is set to Image and the image field is unset

Contributor

I ran into some issues when using self.type == 'Image' && has(self.image), mostly when I tried validating with an unsupported image type and when I tried adding another supported sourceType to test extensibility.

the +kubebuilder:validation:XValidation tag requires all its validation checks to be true, so we were failing on the self.type=='Image' part of the check even when the source type was something else. I added the self.type != 'Image' so we wouldn't be doing that check on non-image sources.

I've added another validation to ensure we don't have the image field on non-image sources to cover the case you highlighted, thanks for pointing that out! We'll still fail on unsupported sources because of the enum check, but this won't throw the type 'Image' requires image field error when the source type isn't 'Image'.

Collaborator

I ran into some issues when using self.type == 'Image' && has(self.image), mostly when I tried validating with an unsupported image type and when I tried adding another supported sourceType to test extensibility

I think this is something that I would expect based on our allowed enum values.

the +kubebuilder:validation:XValidation tag requires all its validation checks to be true, so we were failing on the self.type=='Image' part of the check even when the source type was something else. I added the self.type != 'Image' so we wouldn't be doing that check on non-image sources.

I'm a bit confused here. Are you saying that the XValidation tag doesn't evaluate the expression holistically but instead evaluates each individual expression within a chain? This sounds off to me - do you have any documentation or references you've found that state that is the case?

If I'm not mistaken, what seems more likely here is that the validations are failing on the fact that we only allow the Image value as part of our enum checks so we don't even hit the CEL validations.

Holistically I would expect the following test cases to fail, but not hit our CEL validations until more than the Image value is allowed:

  • Attempting to set type to something other than Image
  • Attempting to set type to something other than Image AND setting image field

For ^ I would avoid checking for specific error messages for now and just ensure that we encounter an error.

One that I would expect us to fail from the CEL validation:

  • Attempting to set type to Image but nil image field

One that I would expect to succeed:

  • Attempting to set type to Image and set image field

I do think we should be able to satisfy all these cases with a singular CEL expression with the current state of our API. Future iterations to the API that add new source types will likely need to update the CEL validations so I would avoid trying to accommodate future use cases of different source type values in the CEL introduced as part of this PR

ankitathomas (Contributor) commented Sep 26, 2024

I think this is something that I would expect based on our allowed enum values.

I'd added another allowed value to our enums and regenerated manifests to test this case:
// +kubebuilder:validation:Enum:="Image";"someothertype"

Are you saying that it the XValidation tag doesn't evaluate the expression holistically but instead each individual expression within a chain?

It seems that way from https://book.kubebuilder.io/reference/markers/crd-validation

// +kubebuilder:validation:XValidation
    marks a field as requiring a value for which a given expression evaluates to true.
    This marker may be repeated to specify multiple expressions, all of which must evaluate to true.

I verified this with the test I mentioned earlier. A catalog with the type "someothertype" would fail with or without the image field with
The ClusterCatalog "testcatalogothertype" is invalid: spec.source: Invalid value: "object": source type 'Image' requires image field

joelanford (Member) commented Sep 26, 2024

When we add another sourceType in the future, we will have to modify this expression. e.g.

(self.type == 'Image' && has(self.image)) || (self.type == 'Foo' && has(self.foo))

But since there's just one enum value now, I don't think we need to worry about making the CEL expression future proof.

ankitathomas (Contributor) commented Sep 26, 2024

// +kubebuilder:validation:Enum:="Image,SomeOtherType"

That seems incorrect; the manifest generated that way treats the quoted value as a single string, so the only valid value is "Image,SomeOtherType" rather than "Image" and "SomeOtherType".

The ClusterCatalog "operatorhubio2" is invalid: spec.source.type: Unsupported value: "SomeOtherType": supported values: "Image,someothertype"

I also tried with // +kubebuilder:validation:Enum:="Image","someothertype" , that fails to generate the manifest with Error: not all generators ran successfully

// +kubebuilder:validation:Enum:="Image";"someothertype" works correctly though, if the CEL validation is removed, the catalog manifest with the "someothertype" applies without any issues

ankitathomas (Contributor) commented Sep 26, 2024

(self.type == 'Image' && has(self.image)) || (self.type == 'Foo' && has(self.foo))

I think we won't be able to tailor our validation errors if we make the CEL validation one long validation for all types. We'd also be throwing the CEL validation error when we encounter an unsupported type

apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterCatalog
metadata:
  name: operatorhubio2
spec:
  source:
    type: unsupported

The ClusterCatalog "operatorhubio2" is invalid: 
* spec.source.type: Unsupported value: "unsupported": supported values: "Image", "Foo"
* spec: Invalid value: "object": no such key: image evaluating rule: cannot specify PollInterval while using digest-based image
* spec.source: Invalid value: "object": source type 'Image' requires image field

Not sure why it's throwing an error with the PollInterval though.

This expression set preserves a little more info about validation errors:

// +kubebuilder:validation:XValidation:rule="self.type != 'Image' || has(self.image)",message="source type 'Image' requires image field"
// +kubebuilder:validation:XValidation:rule="self.type == 'Image' || !has(self.image)",message="image field must only be set for source type 'Image'"
// +kubebuilder:validation:XValidation:rule="self.type != 'Foo' || has(self.foo)",message="source type 'Foo' requires foo field"
// +kubebuilder:validation:XValidation:rule="self.type == 'Foo' || !has(self.foo)",message="foo field must only be set for source type 'Foo'"

an unsupported type gives the following error set:

The ClusterCatalog "operatorhubio2" is invalid: 
* spec.source.type: Unsupported value: "unsupported": supported values: "Image", "Foo"
* spec: Invalid value: "object": no such key: image evaluating rule: cannot specify PollInterval while using digest-based image

A catalog with an invalid field gives something like:

kind: ClusterCatalog
metadata:
  name: operatorhubio2
spec:
  source:
    type: Foo
    image:
      ref: ""
The ClusterCatalog "operatorhubio2" is invalid: 
* spec.source: Invalid value: "object": image field must only be set for source type 'Image'
* spec.source: Invalid value: "object": source type 'Foo' requires foo field

We can combine the validations for each type so we don't end up too verbose

// +kubebuilder:validation:XValidation:rule="(self.type != 'Image' || has(self.image)) && (self.type == 'Image' || !has(self.image))",message="source type 'Image' requires image field"
// +kubebuilder:validation:XValidation:rule="(self.type != 'Foo' && !has(self.foo)) || (self.type == 'Foo' && has(self.foo))",message="source type 'Foo' requires foo field"

Member

Ah I didn't realize we could have multiple validation rules like that, but I guess it makes sense. That's nicer because we can tailor the message for each one.

If we want to consolidate to a single rule, we can use messageExpression instead of message in order to use CEL to build a message. But I'm not sure that would give us the flexibility we could need to craft a helpful validation message.

Collaborator

I think having multiple validation rules like that is fine, but I'd be curious how that translates to the CEL cost budget.

I'm still a bit hung up on the OR comparisons we are doing here to verify the cases. Maybe a sync would be helpful here?

Contributor

follow up: #384 (comment)

@ankitathomas ankitathomas force-pushed the more-types-updates branch 4 times, most recently from 0d394ca to 1d2538c Compare September 26, 2024 14:57
ankitathomas (Contributor)

A summary of changes :

  • capitalized catalog type values ('image' -> 'Image')
  • removed omitempty from spec.Priority field so printcols prints a default value of 0
  • removed status.ObservedGeneration field
  • added CEL validation rules to spec.CatalogSource and status.ResolvedCatalogSource + tests, removed other kubebuilder validation for the same
  • updated status.ResolvedCatalogSource.ref to reference the resolved image digest instead of tracking the value of spec.ref that the resolving/unpacking was performed on, removed status.ResolvedCatalogSource.resolvedRef
    • updated check to stop unpack retries for unchanged digest-based images. Tag based images are retried if a polling interval is set.
  • removed status.ResolvedCatalogSource.LastUnpacked, switching to only use status.lastUnpacked to track when bundle was last unpacked successfully

@LalatenduMohanty LalatenduMohanty removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2024
@everettraven everettraven marked this pull request as ready for review September 26, 2024 16:06
@everettraven everettraven requested a review from a team as a code owner September 26, 2024 16:06
}
// time to unpack
return true
nextPoll := catalog.Status.ResolvedSource.Image.LastSuccessfulPollAttempt.Add(catalog.Spec.Source.Image.PollInterval.Duration)
Member

Another place that we are reading from status. We can probably fold this into the other bug issue I mentioned. Another thing that we should store/lookup from our cache.

Contributor

Is this something we can fold into our reconcile intervals? If we set the requeuing interval to the polling interval, would we effectively get the same behaviour?

Member

We're using nextPoll to set the RequeueAfter already I think. The problem is that we don't want to poll every time we reconcile and we also don't want to reset our timer and wait too long. So we need reconciles (R) and polls (P) that look like the following:

     poll1          now                      poll2
-----RP-------------R---------R--------------RP
0    1    2    3    4    5    6    7    8    9

If our interval is 8 and we reconcile and poll at time 1, we don't want to poll again until time 9, even if we reconcile in the meantime. So if we reconcile again at time 4, we need to calculate a RequeueAfter based on the current time, the last successful poll time and the interval.

In this case, we want:

(now - poll1) + requeueAfter = interval
requeueAfter = interval - now + poll1
requeueAfter = poll1 + interval - now
requeueAfter = 1 + 8 - 4
requeueAfter = 5

If we reconcile again at time 6, that's too early again, so we need to recalculate requeueAfter (this time, it would be 3) so that we're always timing for another reconcile to happen at the poll interval.

This particular case of reading from status isn't that bad because the worst case scenario is that we fail to update status after a successful poll attempt. If that happens, we'll end up re-polling the image registry every single reconcile until status update succeeds.

If we read from an in-memory map, our lastSuccessfulPoll map would not survive restarts, so we would end up re-polling every single ClusterCatalog at cluster startup. This concern is partially why we're using jitter to slowly spread poll times of different ClusterCatalogs around, even if they have the same interval.

Contributor

That makes sense, I was curious because it seemed like we don't do anything apart from setting finalizers on catalogs that aren't ready for polling. Once we move to an async reconcile, we'd want to reconcile more frequently than our polling interval.

I've created #420 to track this and the previous bug you pointed out

grokspawn and others added 3 commits October 1, 2024 16:25
…lidation for spec.source and status.resolvedSource

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
everettraven previously approved these changes Oct 1, 2024
everettraven (Collaborator) left a comment

Had a couple minor comments, but I don't think leaving them unaddressed causes any problems. Worst case I think we get a panic when running the tests instead of a "clean" test failure

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
everettraven previously approved these changes Oct 1, 2024
Comment on lines 240 to 244
imgRef, err := reference.ParseNamed(catalog.Spec.Source.Image.Ref)
if err != nil {
// don't retry unpacking an image ref that can't be parsed
return false
}
Member

If this happens on the very first reconcile of the object, do we still end up setting the status correctly with "Serving=False" and "Progressing=True,Retrying,failed to parse image ref..."?

Member

It looks like my above scenario is covered by other checks that return true earlier.

However, now I'm concerned about this scenario:

  1. We did previously unpack successfully (hence catalog.Status.ResolvedSource != nil, so we continue to line 233)
  2. We do have content from that last successful unpack (so we continue to line 236)
  3. We do have an image specified (so we continue to line 240)
  4. We have an invalid reference (so we return false at line 243)
  5. Back in reconcile and line 138, we return without touching status.

So it seems like status will never update when the spec has a bad image ref after a successful unpack. Am I understanding that correctly? If so, we need to fix this somehow.

I feel like we actually need a bunch of scrutiny on the needsUnpacking method. I'm worried there may be other super-nuanced bugs like this because this function is somewhat edge-based, not level-based.

Contributor

Good catch - as a quick fix for now, I can have it go to reconcile on invalid refs so the unpacker can add the error on the Progressing status. I think we'd need another issue to analyse and rework the needsUnpacking logic.

Comment on lines -255 to -257
if nextPoll.After(time.Now()) && catalog.Generation == catalog.Status.ObservedGeneration {
return false
}
Member

I know we dropped Status.ObservedGeneration, so this particular code is impossible now, but I'm trying to understand what this was trying to do. If our status is up-to-date with our spec wrt the generation and our nextPoll time is in the future, then we skip unpacking.

If the ObservedGeneration is lagging the current generation, then we're going to unpack regardless of the nextPoll time. That generally makes sense from the perspective of "user made change and wants to see controller act on that change immediately". Do we need to account for this?

(This goes back to my other comment about analyzing this function in depth and probably adding a whole bunch of comments to explain our reasoning)

Contributor

We check for the polling interval, but we don't check for any user-made changes apart from changes to the image ref. I don't know if we have any fields users could change that would warrant a reconcile as of now. Apart from the image ref, there's the PollInterval, the CatalogSource Type, the Priority, and maybe the labels and annotations, none of which require a reconcile with the current state of the reconciler.

I don't think this needs to be addressed immediately, but I agree it's something we need to fix before the API grows complex.

Member

For right now, I would imagine the following changes would cause an immediate re-unpack:

  1. Change the image ref
  2. Change the poll interval

In the near future, I'd also imagine unpacking when the Catalog flips from disabled to enabled (based on this design: https://docs.google.com/document/d/1VmncLojrZFAEjRBIz0KwZlB_7K4ENTpCLPZfzHJRVDA/edit)

Member

I don't see anything here that would force re-unpack for (1) and (2). We don't actually know when those change because we don't see the edge transitions. We are level-based.

If the image ref is tag based, it looks like we don't know if it has changed (that's good, that means we're level based), but we also don't re-unpack it (that's bad - we need to drive toward desired state as soon as the user's declared intent is updated).

Member

Maybe for now, we check the observed generation of all conditions, and if any of them don't match metadata.generation, we need to unpack.

ankitathomas (Contributor) commented Oct 2, 2024

In the near future, I'd also imagine unpacking when the Catalog flips from disabled to enabled (based on this design: https://docs.google.com/document/d/1VmncLojrZFAEjRBIz0KwZlB_7K4ENTpCLPZfzHJRVDA/edit)

That's definitely something that should force a re-unpack; sounds like it's a good idea to also re-evaluate needsUnpacking as part of that effort.

but we also don't re-unpack it

We should re-unpack it if it's a tag-based image with a pollInterval set; the only time we skip unpacks is if we have a digest-based image on spec that's the same as the resolvedImage on status (we need to rework that too as part of #420).

so far, we unpack if:

  • we're reconciling the catalog for the first time (i.e, there's no resolved image in the status)
  • we don't have the catalog cached (missing from storage, so if the controller is restarted)
  • if spec.source.image isn't a valid image, so we can have an error on the Progressing condition
  • if the pollInterval is non-zero and the lastSuccessfulPoll was further in the past than the pollInterval

Interestingly, we re-unpack both digest and tag based images when a pollInterval is set, so we may actually be re-unpacking the same digest over and over again.

joelanford (Member) commented Oct 2, 2024

if the pollInterval is non-zero and the lastSuccessfulPoll was further in the past than the pollInterval

I'm arguing that this is insufficient. If:

  • pollInterval is nil, but we have a different image ref, we need to unpack immediately.
  • pollInterval is non-nil, but we have a different image ref, we need to unpack immediately, not wait for the next poll interval
  • (less concerning, but still something I think I'd expect), if the poll interval is changed, we should unpack immediately.

It seems like a fairly simple approximation of the above is:

  • If I haven't updated my status to account for the current generation, then I need to unpack. (And I know this means reading status, but that's what it seems like we have to do until we refactor our image cache to provide more info).

ankitathomas (Contributor) commented Oct 2, 2024

pollInterval is non-nil, but we have a different image ref, we need to unpack immediately, not wait for the next poll interval

I know we'd do this if both images were digest based, but if the new image was tag based, we wouldn't unpack immediately. Considering we don't track the original ref we attempted to resolve anymore, we don't currently have a way of doing this. I think we should address this in a separate issue, possibly as part of #420 . If we also track the original ref that was resolved along with the digest based image and the last polling time in our cache, we should be able to account for this situation.

I can think of a dirty temporary fix for this in the interim - to resolve the image digest during needsUnpack and compare that with the resolved image on status, which pretty much amounts to implementing https://github.com/operator-framework/catalogd/blob/main/internal/source/containers_image.go#L204 in needsUnpacking.

If we either use the digest resolved from the spec for comparison or cache info about catalog resolution, we shouldn't need to touch the pollInterval in any way.

I think we're mostly ok for pollInterval changes that aren't paired with image ref changes though:

  • pollInterval nil -> non-zero : this is the only case we may need to unpack, but unless the pollInterval is very large or the catalog gets updated with the new pollInterval soon after creation, we should be unpacking pretty quickly since lastSuccessfulUnpackTime would be far enough in the past to warrant a retry.
  • pollInterval non-zero -> nil : no unpack required
  • pollInterval large value -> small value (e.g., 30m -> 1m): I'm not certain what the user-expected behaviour here would be; waiting till the new pollInterval duration seems like it should be understandable.
  • pollInterval small value -> large value: we'd explicitly want to avoid unpacking here, if the user specified they don't want to unpack often.

Member

No, I don't think we should make a network call to determine whether we need to unpack. That amounts to polling.

I think the correct fix for now is to check the observedGeneration of the conditions. That is the closest thing we have to what was there before when we actually had an observedGeneration field in the spec. So we'd basically be doing the same thing, just looking in a slightly different place than before (in the conditions rather than in the status root)

ankitathomas (Contributor) commented Oct 2, 2024

check the observedGeneration of the conditions

sounds good, that should also subsume the image ref comparisons.

allow bad image refs to pass through to unpacker to propagate error to Progressing status

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
…Unpack

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@joelanford joelanford added this pull request to the merge queue Oct 2, 2024
Merged via the queue into operator-framework:main with commit 29d21c7 Oct 2, 2024
11 of 13 checks passed
@grokspawn grokspawn deleted the more-types-updates branch October 4, 2024 13:18
6 participants