[Feature] Kuberay RayJob MultiKueue adapter #3892
Conversation
Skipping CI for Draft Pull Request.
/ok-to-test
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
@@ -70,6 +70,7 @@ IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG)
 JOBSET_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" sigs.k8s.io/jobset)
 KUBEFLOW_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/training-operator)
 KUBEFLOW_MPI_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/mpi-operator)
+KUBERAY_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/ray-project/kuberay/ray-operator)
So far it's the version without managedBy. We don't have a ray-operator image that supports managedBy; that would have to be a custom one...
latest -> 1.2.2, which is the latest release.
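For context, a minimal sketch of the kind of ownership check MultiKueue relies on managedBy for. The ManagedBy field on rayv1.RayJobSpec and the controller name string are assumptions here, since no released ray-operator image supports managedBy yet:

```go
package rayjob

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// isManagedByMultiKueue sketches the check a MultiKueue adapter performs
// before dispatching a job to a remote cluster. It assumes a future
// rayv1.RayJobSpec with a ManagedBy *string field; the controller name
// string is also an assumption.
func isManagedByMultiKueue(job *rayv1.RayJob) bool {
	return job.Spec.ManagedBy != nil &&
		*job.Spec.ManagedBy == "kueue.x-k8s.io/multikueue"
}
```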
export KUBERAY_MANIFEST="${ROOT_DIR}/dep-crds/ray-operator/default/"
export KUBERAY_IMAGE=bitnami/kuberay-operator:${KUBERAY_VERSION/#v}
export KUBERAY_RAY_IMAGE=rayproject/ray:2.9.0
export KUBERAY_RAY_IMAGE_ARM=rayproject/ray:2.9.0-aarch64
This one is for those of us working on macOS; it's vital for development, not so much for prod. I think it should stay.
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mszadkow
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Although the PR for Kuberay RayJob was ready for review before New Year, I couldn't finish working on it due to a peculiar issue.
Namely, envtest failed on setup with a couple of different yet related errors:
Those errors were distributed over all cluster routines (clusters run in separate goroutines). During the investigation with Mykhailo we were able to boil the issue down to the simple addition of the "ray-operator" CRDs to the suite. We also checked:
Finally, what worked was to reduce INTEGRATION_NPROCS. Why is still not understood, but we assume the envtest apiserver may have been overwhelmed and that test-infra possibly requires a resource update. We don't have any proof that this will work, but some people have reportedly seen similar issues (not the exact ones):
Has anyone encountered a similar issue in the past?
Given that you had to reduce INTEGRATION_NPROCS, I suspect the cluster running the integration tests is resource constrained. However, I'm wondering if INTEGRATION_API_LOG_LEVEL=1 or --output-interceptor-mode=none could be making a difference here?
func (j *RayJob) IsSuspended() bool {
	return j.Spec.Suspend
}

func (j *RayJob) IsActive() bool {
-	return j.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusSuspended
+	return (j.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusSuspended) && (j.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusNew)
}
Why is this change needed? Maybe this is a fix for regular RayJobs too? In that case we should do it in a separate PR so that it can be cherry-picked.
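To make the behavioral difference concrete, a hypothetical table-test sketch; it assumes Kueue's RayJob wrapper type from the diff above and that rayv1 also defines JobDeploymentStatusRunning:

```go
package rayjob

import (
	"testing"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func TestIsActive(t *testing.T) {
	cases := map[rayv1.JobDeploymentStatus]bool{
		rayv1.JobDeploymentStatusNew:       false, // newly created, nothing deployed yet
		rayv1.JobDeploymentStatusSuspended: false, // explicitly suspended
		rayv1.JobDeploymentStatusRunning:   true,  // actively running
	}
	for status, want := range cases {
		j := &RayJob{} // assumes the wrapper type from the diff, e.g. type RayJob rayv1.RayJob
		j.Status.JobDeploymentStatus = status
		if got := j.IsActive(); got != want {
			t.Errorf("IsActive() with status %q = %v, want %v", status, got, want)
		}
	}
}
```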
E2eKuberayTestImage := "rayproject/ray:2.9.0"
if runtime.GOOS == "darwin" {
	E2eKuberayTestImage = "rayproject/ray:2.9.0-aarch64"
}
Can we read the image from the KUBERAY_RAY_IMAGE environment variable instead?
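Something like this sketch would do it; the helper name is hypothetical, and the fallbacks mirror the hardcoded values above:

```go
import (
	"os"
	"runtime"
)

// kuberayTestImage prefers the image exported by the e2e scripts
// (KUBERAY_RAY_IMAGE, or KUBERAY_RAY_IMAGE_ARM on macOS) and falls
// back to the previously hardcoded defaults.
func kuberayTestImage() string {
	envVar, fallback := "KUBERAY_RAY_IMAGE", "rayproject/ray:2.9.0"
	if runtime.GOOS == "darwin" {
		envVar, fallback = "KUBERAY_RAY_IMAGE_ARM", "rayproject/ray:2.9.0-aarch64"
	}
	if img := os.Getenv(envVar); img != "" {
		return img
	}
	return fallback
}
```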
}}
}

func (j *JobWrapper) RayJobSpecsDefault() *JobWrapper {
Why introduce it? Couldn't the defaults already be set in MakeJob, as before?
Adding this to most (or all) tests makes the diff unnecessarily larger.
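A sketch of the alternative being suggested: bake the defaults into MakeJob itself so tests don't need a separate RayJobSpecsDefault() call; the specific field values are illustrative assumptions:

```go
import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MakeJob with defaults applied up front. Assumes JobWrapper embeds
// rayv1.RayJob as in Kueue's testing helpers; the defaulted fields
// shown here are illustrative.
func MakeJob(name, ns string) *JobWrapper {
	return &JobWrapper{rayv1.RayJob{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec: rayv1.RayJobSpec{
			Suspend:                  true,
			ShutdownAfterJobFinishes: true,
		},
	}}
}
```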
To check if the tests are stable:
What type of PR is this?
/kind feature
What this PR does / why we need it:
We want to be able to run Kuberay RayJob workloads in MultiKueue.
RayCluster will come in a separate PR.
Which issue(s) this PR fixes:
Relates to #3822
Special notes for your reviewer:
Does this PR introduce a user-facing change?