Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: fix(backend): add retries to getOrInsertContext in MLMD client #74

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gregsheremeta
Copy link

Description of your changes:

Add retries to getOrInsertContext in MLMD client.

This function is known to be flaky and racy right after a server is initially created and pipeline runs are just starting, so we retry up to 3 times. The actual cause of the race isn't fully understood, but it's probably caused by deadlocks in MLMD SQL code. General MySQL advice is to retry on deadlock errors.

Checklist:

Copy link

openshift-ci bot commented Aug 20, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from gregsheremeta. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dsp-developers
Copy link

Commit Checker results:

**NOTE**: These are the results of the commit checker scans. 
If these are not commits from upstream kfp, then please ensure
you adhere to the commit checker formatting
commitchecker verson unknown
Validating 1 commits between a8fbbd2020d280f4cab31245a9639a350207b3f9...7fb8762303a8ed015d3aa77bda5440e136cba83c

UPSTREAM commit 7fb8762 has invalid summary fix(backend): add retries to getOrInsertContext in MLMD client.

UPSTREAM commits are validated against the following regular expression:
  ^UPSTREAM: (revert: )?(([\w.-]+/[\w-.-]+)?: )?(\d+:|<carry>:|<drop>:)

UPSTREAM commit summaries should look like:

  UPSTREAM: <PR number|carry|drop>: description

UPSTREAM commits which revert previous UPSTREAM commits should look like:

  UPSTREAM: revert: <normal upstream format>

Examples of valid summaries:

  UPSTREAM: 12345: A kube fix
  UPSTREAM: <carry>: A carried kube change
  UPSTREAM: <drop>: A dropped kube change
  UPSTREAM: revert: 12345: A kube revert


@dsp-developers
Copy link

A set of new images have been built to help with testing out this PR:
API Server: quay.io/opendatahub/ds-pipelines-api-server:pr-74
DSP DRIVER: quay.io/opendatahub/ds-pipelines-driver:pr-74
DSP LAUNCHER: quay.io/opendatahub/ds-pipelines-launcher:pr-74
Persistence Agent: quay.io/opendatahub/ds-pipelines-persistenceagent:pr-74
Scheduled Workflow Manager: quay.io/opendatahub/ds-pipelines-scheduledworkflow:pr-74
MLMD Server: quay.io/opendatahub/mlmd-grpc-server:latest
MLMD Envoy Proxy: registry.redhat.io/openshift-service-mesh/proxyv2-rhel8:2.3.9-2
UI: quay.io/opendatahub/ds-pipelines-frontend:pr-74

@dsp-developers
Copy link

An OCP cluster where you are logged in as cluster admin is required.

The Data Science Pipelines team recommends testing this using the Data Science Pipelines Operator. Check here for more information on using the DSPO.

To use and deploy a DSP stack with these images (assuming the DSPO is deployed), first save the following YAML to a file named dspa.pr-74.yaml:

apiVersion: datasciencepipelinesapplications.opendatahub.io/v1alpha1
kind: DataSciencePipelinesApplication
metadata:
  name: pr-74
spec:
  dspVersion: v2
  apiServer:
    image: "quay.io/opendatahub/ds-pipelines-api-server:pr-74"
    argoDriverImage: "quay.io/opendatahub/ds-pipelines-driver:pr-74"
    argoLauncherImage: "quay.io/opendatahub/ds-pipelines-launcher:pr-74"
  persistenceAgent:
    image: "quay.io/opendatahub/ds-pipelines-persistenceagent:pr-74"
  scheduledWorkflow:
    image: "quay.io/opendatahub/ds-pipelines-scheduledworkflow:pr-74"
  mlmd:  
    deploy: true  # Optional component
    grpc:
      image: "quay.io/opendatahub/mlmd-grpc-server:latest"
    envoy:
      image: "registry.redhat.io/openshift-service-mesh/proxyv2-rhel8:2.3.9-2"
  mlpipelineUI:
    deploy: true  # Optional component 
    image: "quay.io/opendatahub/ds-pipelines-frontend:pr-74"
  objectStorage:
    minio:
      deploy: true
      image: 'quay.io/opendatahub/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance'

Then run the following:

cd $(mktemp -d)
git clone git@github.com:opendatahub-io/data-science-pipelines.git
cd data-science-pipelines/
git fetch origin pull/74/head
git checkout -b pullrequest 7fb8762303a8ed015d3aa77bda5440e136cba83c
oc apply -f dspa.pr-74.yaml

More instructions here on how to deploy and test a Data Science Pipelines Application.

@gregsheremeta gregsheremeta marked this pull request as draft August 20, 2024 23:28
@dsp-developers
Copy link

Commit Checker results:

**NOTE**: These are the results of the commit checker scans. 
If these are not commits from upstream kfp, then please ensure
you adhere to the commit checker formatting
commitchecker verson unknown
Validating 2 commits between a8fbbd2020d280f4cab31245a9639a350207b3f9...fc9fdad126ef4c9bda2d1ed826545be3d5f15d6e

UPSTREAM commit fc9fdad has invalid summary more logging.

UPSTREAM commits are validated against the following regular expression:
  ^UPSTREAM: (revert: )?(([\w.-]+/[\w-.-]+)?: )?(\d+:|<carry>:|<drop>:)

UPSTREAM commit summaries should look like:

  UPSTREAM: <PR number|carry|drop>: description

UPSTREAM commits which revert previous UPSTREAM commits should look like:

  UPSTREAM: revert: <normal upstream format>

Examples of valid summaries:

  UPSTREAM: 12345: A kube fix
  UPSTREAM: <carry>: A carried kube change
  UPSTREAM: <drop>: A dropped kube change
  UPSTREAM: revert: 12345: A kube revert



UPSTREAM commit 7fb8762 has invalid summary fix(backend): add retries to getOrInsertContext in MLMD client.

UPSTREAM commits are validated against the following regular expression:
  ^UPSTREAM: (revert: )?(([\w.-]+/[\w-.-]+)?: )?(\d+:|<carry>:|<drop>:)

UPSTREAM commit summaries should look like:

  UPSTREAM: <PR number|carry|drop>: description

UPSTREAM commits which revert previous UPSTREAM commits should look like:

  UPSTREAM: revert: <normal upstream format>

Examples of valid summaries:

  UPSTREAM: 12345: A kube fix
  UPSTREAM: <carry>: A carried kube change
  UPSTREAM: <drop>: A dropped kube change
  UPSTREAM: revert: 12345: A kube revert


@dsp-developers
Copy link

Change to PR detected. A new PR build was completed.
A set of new images have been built to help with testing out this PR:
API Server: quay.io/opendatahub/ds-pipelines-api-server:pr-74
DSP DRIVER: quay.io/opendatahub/ds-pipelines-driver:pr-74
DSP LAUNCHER: quay.io/opendatahub/ds-pipelines-launcher:pr-74
Persistence Agent: quay.io/opendatahub/ds-pipelines-persistenceagent:pr-74
Scheduled Workflow Manager: quay.io/opendatahub/ds-pipelines-scheduledworkflow:pr-74
MLMD Server: quay.io/opendatahub/mlmd-grpc-server:latest
MLMD Envoy Proxy: registry.redhat.io/openshift-service-mesh/proxyv2-rhel8:2.3.9-2
UI: quay.io/opendatahub/ds-pipelines-frontend:pr-74

@github-actions github-actions bot added the Stale label Oct 23, 2024
This function is known to be flaky and racy right after a server is initially created
and pipeline runs are just starting, so we retry up to 3 times. The actual cause of the
race isn't fully understood, but it's probably caused by deadlocks in MLMD SQL code.
General MySQL advice is to retry on deadlock errors.

Signed-off-by: Greg Sheremeta <gshereme@redhat.com>
Signed-off-by: Greg Sheremeta <gshereme@redhat.com>
@dsp-developers
Copy link

Commit Checker results:

**NOTE**: These are the results of the commit checker scans. 
If these are not commits from upstream kfp, then please ensure
you adhere to the commit checker formatting
commitchecker verson unknown
Validating 2 commits between 57a28230fb3390fda3195efb99228bfc8d99c7f8...796ecbcb42dbce1c1dc012c58c82d1997b59dc7e

UPSTREAM commit 796ecbc has invalid summary more logging.

UPSTREAM commits are validated against the following regular expression:
  ^UPSTREAM: (revert: )?(([\w.-]+/[\w-.-]+)?: )?(\d+:|<carry>:|<drop>:)

UPSTREAM commit summaries should look like:

  UPSTREAM: <PR number|carry|drop>: description

UPSTREAM commits which revert previous UPSTREAM commits should look like:

  UPSTREAM: revert: <normal upstream format>

Examples of valid summaries:

  UPSTREAM: 12345: A kube fix
  UPSTREAM: <carry>: A carried kube change
  UPSTREAM: <drop>: A dropped kube change
  UPSTREAM: revert: 12345: A kube revert



UPSTREAM commit 927dae7 has invalid summary fix(backend): add retries to getOrInsertContext in MLMD client.

UPSTREAM commits are validated against the following regular expression:
  ^UPSTREAM: (revert: )?(([\w.-]+/[\w-.-]+)?: )?(\d+:|<carry>:|<drop>:)

UPSTREAM commit summaries should look like:

  UPSTREAM: <PR number|carry|drop>: description

UPSTREAM commits which revert previous UPSTREAM commits should look like:

  UPSTREAM: revert: <normal upstream format>

Examples of valid summaries:

  UPSTREAM: 12345: A kube fix
  UPSTREAM: <carry>: A carried kube change
  UPSTREAM: <drop>: A dropped kube change
  UPSTREAM: revert: 12345: A kube revert


@dsp-developers
Copy link

Change to PR detected. A new PR build was completed.
A set of new images have been built to help with testing out this PR:
API Server: quay.io/opendatahub/ds-pipelines-api-server:pr-74
DSP DRIVER: quay.io/opendatahub/ds-pipelines-driver:pr-74
DSP LAUNCHER: quay.io/opendatahub/ds-pipelines-launcher:pr-74
Persistence Agent: quay.io/opendatahub/ds-pipelines-persistenceagent:pr-74
Scheduled Workflow Manager: quay.io/opendatahub/ds-pipelines-scheduledworkflow:pr-74
MLMD Server: quay.io/opendatahub/mlmd-grpc-server:latest
MLMD Envoy Proxy: registry.redhat.io/openshift-service-mesh/proxyv2-rhel8:2.3.9-2
UI: quay.io/opendatahub/ds-pipelines-frontend:pr-74

@github-actions github-actions bot removed the Stale label Oct 31, 2024
@github-actions github-actions bot added the Stale label Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants