codeflare-operator

Operator for installation and lifecycle management of CodeFlare distributed workload stack.

CodeFlare Stack Compatibility Matrix

Component            Version
CodeFlare Operator   v1.4.0
CodeFlare-SDK        v0.16.0
KubeRay              v1.1.0

Development

Requirements:

  • GNU sed - sed is used in several Makefile commands. The default macOS sed is incompatible, so GNU sed is needed for these commands to run correctly. If GNU sed is installed on macOS, you can specify its binary using
    # brew install gnu-sed
    make install -e SED=/usr/local/bin/gsed
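The gsed path above depends on where GNU sed was installed (Homebrew on Apple Silicon, for example, uses /opt/homebrew/bin rather than /usr/local/bin). A minimal sketch of resolving the binary instead of hard-coding it is shown below; the function name find_gnu_sed is hypothetical and not part of the project's Makefile.

```shell
# Hypothetical helper (not part of this repository): print the path of a GNU sed binary.
# Prefers gsed (the name Homebrew installs), falls back to sed if it is GNU sed.
find_gnu_sed() {
  if command -v gsed >/dev/null 2>&1; then
    command -v gsed
  elif sed --version 2>/dev/null | grep -q 'GNU sed'; then
    # BSD sed does not support --version, so this branch only matches GNU sed.
    command -v sed
  else
    echo "GNU sed not found" >&2
    return 1
  fi
}
```

Under these assumptions, the install command becomes make install -e SED="$(find_gnu_sed)".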

Testing

The e2e tests can be executed locally by running the following commands:

  1. Use an existing cluster, or set up a test cluster, e.g.:

    # Create a KinD cluster
    make kind-e2e
    # Install the CRDs
    make install

    [!NOTE] Some e2e tests cover access to services via Ingresses, as end users would do, which requires access to the Ingress controller load balancer by its IP. On macOS, this requires installing docker-mac-net-connect.

  2. Start the operator locally:

    NAMESPACE=default make run

    Alternatively, you can run the operator from your IDE / debugger.

  3. Set up the test CodeFlare stack:

    make setup-e2e

    [!NOTE] In OpenShift, the KubeRay operator pod is assigned a random user, which is then used to run the Ray cluster. This random user does not have permission to store the dataset downloaded during test execution, which causes the tests to fail. To prevent this failure on OpenShift, enforce user 1000 for KubeRay and the Ray cluster by creating this SCC in the KubeRay operator namespace (replace the namespace placeholder):

    kind: SecurityContextConstraints
    apiVersion: security.openshift.io/v1
    metadata:
      name: run-as-ray-user
    seLinuxContext:
      type: MustRunAs
    runAsUser:
      type: MustRunAs
      uid: 1000
    users:
      - 'system:serviceaccount:$(namespace):kuberay-operator'

  4. In a separate terminal, set your output directory for test files, and run the e2e suite:

    export CODEFLARE_TEST_OUTPUT_DIR=<your_output_directory>
    make test-e2e

Alternatively, you can run the e2e test(s) from your IDE / debugger.
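For step 4, the output directory must exist before the suite writes to it. The sketch below shows one way to prepare it, assuming a throwaway temporary directory is acceptable when no path is given; the function name prepare_output_dir is hypothetical.

```shell
# Hypothetical helper: create the given directory (or a fresh temporary one
# when no argument is passed) and print its path.
prepare_output_dir() {
  dir="${1:-$(mktemp -d)}"
  mkdir -p "$dir"
  printf '%s\n' "$dir"
}

# Point the e2e suite at the prepared directory.
CODEFLARE_TEST_OUTPUT_DIR="$(prepare_output_dir)"
export CODEFLARE_TEST_OUTPUT_DIR
# Next step, as in the instructions above: make test-e2e
```

Passing a path, for example prepare_output_dir "$HOME/codeflare-e2e-output", keeps the test files in a fixed location across runs.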

Testing on disconnected cluster

To run e2e tests on a disconnected cluster, provide the following additional environment variables to configure the testing environment:

  • CODEFLARE_TEST_PYTORCH_IMAGE - image tag for the image used to run the training job
  • CODEFLARE_TEST_RAY_IMAGE - image tag for Ray cluster image
  • MNIST_DATASET_URL - URL where MNIST dataset is available
  • PIP_INDEX_URL - URL where PyPI server with needed dependencies is running
  • PIP_TRUSTED_HOST - PyPI server hostname
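Because a missing variable tends to surface only as a test failure later on, it can help to check the environment before starting. The following is a minimal sketch that verifies the five variables listed above are set and non-empty; the function name check_disconnected_env is hypothetical and not part of the repository.

```shell
# Hypothetical pre-flight check: report any of the disconnected-cluster
# variables that are unset or empty. Returns 0 when all are set, 1 otherwise.
check_disconnected_env() {
  missing=0
  for name in CODEFLARE_TEST_PYTORCH_IMAGE CODEFLARE_TEST_RAY_IMAGE \
              MNIST_DATASET_URL PIP_INDEX_URL PIP_TRUSTED_HOST; do
    # Indirect lookup of the variable named by $name (POSIX sh compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be check_disconnected_env && make test-e2e, so the suite only starts when the environment is complete.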

For ODH tests, additional environment variables are needed:

  • NOTEBOOK_IMAGE_STREAM_NAME - name of the ODH Notebook ImageStream to be used
  • ODH_NAMESPACE - namespace where ODH is installed

Release

  1. Invoke project-codeflare-release.yaml.
  2. Once all jobs within the action are completed, verify that the compatibility matrix in the README was properly updated.
  3. Verify that the pull request opened against the OpenShift community operators repository has the proper content.
  4. Once the PR is merged, announce the new release on Slack and mailing lists, if any.
  5. Release automation should open a PR with changes in the ODH CodeFlare operator repo. Review the changes proposed by the automation. If all the changes are correct, manually cherry-pick all CARRY and PATCH commits from the current main branch, push the result to a dedicated branch, and ask on the Slack channel for a review of the resulting branch content. Once agreed, push the changes directly to the main branch (branch protection has to be temporarily disabled).
  6. Build the ODH/CFO image by triggering the Build and Push action.
  7. Create a release branch on the Red Hat CodeFlare operator repo for the next release if it doesn't exist yet.
  8. Create a dedicated branch containing the changes from the ODH CodeFlare operator repo. Cherry-pick all relevant changes from the latest release branch of the Red Hat CodeFlare operator repo that should also be available in the next release. Ask on the Slack channel for a review of the resulting branch content. Once agreed, push the changes directly to the release branch.
  9. Make sure that release automation created a PR updating the CodeFlare SDK version in the ODH Notebooks repository. Make sure the PR gets merged.

Releases involving part of the stack

There may be instances in which a new CodeFlare stack release requires releases of only a subset of the stack components. Examples could be hotfixes for a specific component. In these instances:

  1. Build updated components as needed:

  2. Invoke the tag-and-build.yml GitHub action. This action creates a repository tag, then builds and pushes the operator image.

  3. Check the result of the tag-and-build.yml GitHub action. It should pass.

  4. Verify that the compatibility matrix in the README was properly updated.

  5. Follow steps 3-6 from the previous section.
