Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for kueue to work with VAP on OCP 4.16+ #1480

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

zdtsw
Copy link
Member

@zdtsw zdtsw commented Jan 8, 2025

  • only add VAP specific manifests into list if latest OCP version in history is over 16 in min version
  • add check OCP function to get OCP version during main startup, and value can be access by components also show in the dsc status

Description

https://issues.redhat.com/browse/RHOAIENG-16133

How Has This Been Tested?

manual test on quay.io/wenzhou/opendatahub-operator-catalog:v2.17.1613315
spin up two clusters

  • 4.17.9:
    create DSCI, DSC only enable kueue
    kueue CR ready, 2 new resource VAP and VAPB created
    change VAPB action from Deny to Warn, it keeps new value
    disable kueue
    kueue CR removed, VAP removed, VAPB remains
  • 4.15.6 then upgrade to 4.16.27: cretae DSCI, DSC only enable kueue
    kueue CR ready, no API registered in cluster for VAP/VAPB CRD, operator pod not throw error

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link

openshift-ci bot commented Jan 8, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from zdtsw. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lburgazzoli
Copy link
Contributor

  • need to uplift to go-client 0.29.2 whichi is the latest one has SwitchMetrics for leaderelection

What is this about ?

  • need to uplift k8s.io/api to 0.30.4 which start to have VAP included

Is this actually needed? we use unstructured objects so the current version should work too

@zdtsw
Copy link
Member Author

zdtsw commented Jan 8, 2025

  • need to uplift to go-client 0.29.2 whichi is the latest one has SwitchMetrics for leaderelection

What is this about ?

  • need to uplift k8s.io/api to 0.30.4 which start to have VAP included

Is this actually needed? we use unstructured objects so the current version should work too

i should revert the uplift for api part, at least was using &admissionregistrationv1.ValidatingAdmissionPolicyBinding which only contain these 2 from 0.30.4
but since now i changed to use watchsgvk the uplift is not needed any more
and go-client change was due to api uplift, can revert back as well.

).
WatchesGVK(
gvk.ValidatingAdmissionPolicyBinding,
reconciler.Dynamic(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -95,6 +95,16 @@ func GetDomain(ctx context.Context, c client.Client) (string, error) {
return domain, err
}

func GetOCPVersion(ctx context.Context, c client.Client) (semver.Version, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be computed at startup

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did some change, moved this call a bit up to the main

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 65 lines in your changes missing coverage. Please review.

Project coverage is 19.78%. Comparing base (ad20d38) to head (c8372c6).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/cluster/cluster_config.go 0.00% 37 Missing ⚠️
controllers/components/kueue/kueue_controller.go 0.00% 11 Missing ⚠️
...llers/components/kueue/kueue_controller_actions.go 0.00% 11 Missing ⚠️
controllers/components/kueue/kueue_support.go 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1480      +/-   ##
==========================================
- Coverage   19.88%   19.78%   -0.11%     
==========================================
  Files         160      160              
  Lines       10818    10874      +56     
==========================================
  Hits         2151     2151              
- Misses       8440     8496      +56     
  Partials      227      227              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zdtsw zdtsw marked this pull request as ready for review January 9, 2025 11:58
@openshift-ci openshift-ci bot requested review from asanzgom and ykaliuta January 9, 2025 11:58
@zdtsw zdtsw requested review from ChristianZaccaria and removed request for asanzgom January 9, 2025 13:51
Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks brilliant Wen, thank you!!

Just making sure. The ValidatingAdmissionPolicyBinding resource should be configurable, i.e., by the Cluster Admin. This resource should have a similar behaviour to how ConfigMaps are handled. - Let me know if this makes sense

@zdtsw
Copy link
Member Author

zdtsw commented Jan 9, 2025

This looks brilliant Wen, thank you!!

Just making sure. The ValidatingAdmissionPolicyBinding resource should be configurable, i.e., by the Cluster Admin. This resource should have a similar behaviour to how ConfigMaps are handled. - Let me know if this makes sense

i was not aware of this requirement.
so VAP is static but VAPB will be user configurable?
but why not instead user create another VAPB instance? e.g if the binding is to filter on more matching part or the action should be "warn"/"audit" ?

@ChristianZaccaria
Copy link
Contributor

so VAP is static but VAPB will be user configurable?

@zdtsw For reference after our async discussion:

  • VAPB CR, should be first deployed with default values from manifests. Then the Admin can decide on configuration, or simply keep the defaults.
  • VAPB CR, wont be reconcile if user updates its contents.
  • VAP CR is static.

@zdtsw zdtsw force-pushed the jira_16133 branch 2 times, most recently from e4a4d57 to ce77585 Compare January 10, 2025 12:10
Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm tested and works as expected. Thank you for the brilliant work!


// Add OCP 4.17+ specific manifests if Minor > 16 .
func vapPredicate(context.Context, *odhtypes.ReconciliationRequest) bool {
return cluster.GetClusterInfo().Version.Minor > 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make things a little but more clear, maybe we should have a constant for the minimum version and use SemVer.GTE

var V4_17 = semver.MustParse("4.17")

func vapPredicate(context.Context, *odhtypes.ReconciliationRequest) bool {
  return cluster.GetClusterInfo().Version.GTE(V4_17)
}

pkg/deploy/deploy.go Outdated Show resolved Hide resolved
@zdtsw zdtsw removed the request for review from ykaliuta January 11, 2025 17:31
@zdtsw zdtsw requested a review from lburgazzoli January 13, 2025 06:24
@zdtsw zdtsw force-pushed the jira_16133 branch 2 times, most recently from c2830a2 to 1ea36b9 Compare January 13, 2025 12:30
@@ -164,6 +172,36 @@ func (tc *KueueTestCtx) testOwnerReferences() error {
return nil
}

func (tc *KueueTestCtx) validateVAPReady() error {
// if ocp is 4.17+ then VAP and VAPB should be created
if cluster.GetClusterInfo().Version.GTE(semver.MustParse("4.17.0")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check shold probably me moved to the t.Run block, like:

t.Run("Validate Kueue Dynamically create VAP", func(t *testing.T) {
        if !cluster.GetClusterInfo().Version.GTE(semver.MustParse("4.17.0") {
            t.Skip("Disabled as requires OpenShift >= 4.17")
            return
        }
        
	err = kueueCtx.validateVAPReady()
	require.NoError(t, err, "Kueue instance is not Ready")
})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, doing rebase now, need local test before push the final change

vap := &unstructured.Unstructured{}
vap.SetGroupVersionKind(gvk.ValidatingAdmissionPolicy)

err := tc.testCtx.customClient.Get(tc.testCtx.ctx, keyvap, vap)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to use gomega's Eventually so that it won't fail in case the resource takes a little time to become available

vapb := &unstructured.Unstructured{}
vapb.SetGroupVersionKind(gvk.ValidatingAdmissionPolicyBinding)

err = tc.testCtx.customClient.Get(tc.testCtx.ctx, keyvapb, vapb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to use gomega's Eventually so that it won't fail in case the resource takes a little time to become available

@@ -164,6 +172,36 @@ func (tc *KueueTestCtx) testOwnerReferences() error {
return nil
}

func (tc *KueueTestCtx) validateVAPReady() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also probably need a negative test so that it validated that no VAP resources are created on an unsupported openshift verions

@@ -41,6 +44,7 @@ import (
)

func (s *componentHandler) NewComponentReconciler(ctx context.Context, mgr ctrl.Manager) error {
enableVAP = cluster.GetClusterInfo().Version.GTE(semver.MustParse("4.17.0"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since enableVAP is a package variable, it may be better to set it as part of the init() function

@zdtsw zdtsw force-pushed the jira_16133 branch 2 times, most recently from 7d74968 to 7f51b9e Compare January 14, 2025 12:03
@zdtsw zdtsw requested a review from lburgazzoli January 14, 2025 16:19
g.Get(gvk.ValidatingAdmissionPolicy, types.NamespacedName{Name: "kueue-validating-admission-policy"}).Eventually().Should(
jq.Match(`.metadata.ownerReferences == "%s"`, componentApi.KueueInstanceName),
)
vapb, err := g.Get(gvk.ValidatingAdmissionPolicyBinding, types.NamespacedName{Name: "kueue-validating-admission-policy-binding"}).Get()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably should use Eventually as the other one

v, err := tc.GetClusterVersion()
g.Expect(err).ToNot(HaveOccurred())
if v.GTE(semver.MustParse("4.17.0")) {
g.Get(gvk.ValidatingAdmissionPolicy, types.NamespacedName{Name: "kueue-validating-admission-policy"}).Eventually().WithTimeout(1 * time.Second).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any specific reason to wait only 1 second ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i moved this test case after "ValidateUpdateDeploymentsResources" test.
i think no need to wait DefaultTimeout 2mins here, if component is ready but VAP/VAPB not found

zdtsw added 4 commits January 15, 2025 16:48
- only add VAP specific manifests into list if latest OCP version in history is over 16 in min version
  update: explicit check 4.17.0 not just minor version
- now each component can get ocp version without calculate
- add test in kueue
- make VAPB object user configable (as no ownerreference set)
- user cannot update VAP object ( it get reconciled back to default value)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- fix order for kueue, sinde the deploy need to be called after we get all manifests
- testcase to explictly check error matching no CRD of VAP/VAPB found for < 4.17
- cannot use cluster.GetClusterInfo since that is init/set in the main, tc wont have that value

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@lburgazzoli
Copy link
Contributor

/test opendatahub-operator-e2e

@zdtsw
Copy link
Member Author

zdtsw commented Jan 15, 2025

/test opendatahub-operator-e2e

Copy link
Contributor

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested latest changes from this PR. This continues to work as expected, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

4 participants