Adding support for IBM autopilot GPU metrics, alerts, dashboards (#57)
We wish to use IBM Autopilot, a Kubernetes-native daemon that
continuously monitors and evaluates GPU, network, and storage health.
It is designed to detect and report infrastructure-level issues during
the lifetime of AI workloads. We will deploy Autopilot to the test and
prod clusters, and the Grafana dashboard to the obs cluster.
computate authored Oct 22, 2024
1 parent 7df0656 commit adcd217
Showing 5 changed files with 46 additions and 0 deletions.
15 changes: 15 additions & 0 deletions clusters/lib/autopilot/application.yaml
@@ -0,0 +1,15 @@
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: autopilot
  labels:
    nerc.mghpcc.org/sync-policy: common
spec:
  project: default
  source:
    repoURL: https://github.com/ocp-on-nerc/nerc-ocp-config.git
    targetRevision: HEAD
    path: REPLACEME
  destination:
    name: REPLACEME
    namespace: autopilot
4 changes: 4 additions & 0 deletions clusters/lib/autopilot/kustomization.yaml
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- application.yaml
9 changes: 9 additions & 0 deletions clusters/nerc-ocp-obs/kustomization.yaml
@@ -3,6 +3,7 @@ kind: Kustomization
resources:
- ../lib/cluster-scope
- ../lib/logging
- ../lib/autopilot
- dex
- loki
- fake-metrics-server
@@ -47,3 +48,11 @@ patches:
    - op: replace
      path: /spec/source/path
      value: logging/overlays/nerc-ocp-obs
- target:
    kind: Application
    name: autopilot
  patch: |
    - op: replace
      path: /spec/source/path
      value: autopilot/overlays/nerc-ocp-obs
9 changes: 9 additions & 0 deletions clusters/nerc-ocp-prod/kustomization.yaml
@@ -9,6 +9,7 @@ resources:
- ../lib/nfd-operator
- ../lib/nvidia-gpu-operator
- ../lib/nerc-images
- ../lib/autopilot
- gatekeeper-policy
- acct-mgt
- curator
@@ -97,3 +98,11 @@ patches:
    - op: replace
      path: /spec/source/path
      value: nvidia-gpu-operator/overlays/nerc-ocp-prod
- target:
    kind: Application
    name: autopilot
  patch: |
    - op: replace
      path: /spec/source/path
      value: autopilot/overlays/nerc-ocp-prod
9 changes: 9 additions & 0 deletions clusters/nerc-ocp-test/kustomization.yaml
@@ -8,6 +8,7 @@ resources:
- ../lib/virt
- ../lib/nfs
- ../lib/csi-driver-nfs
- ../lib/autopilot

nameSuffix: -test

@@ -86,3 +87,11 @@ patches:
    - op: replace
      path: /spec/source/path
      value: csi-driver-nfs/overlays/nerc-ocp-test
- target:
    kind: Application
    name: autopilot
  patch: |
    - op: replace
      path: /spec/source/path
      value: autopilot/overlays/nerc-ocp-test
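
For illustration only, here is a sketch of what the autopilot Application might look like once kustomize has built the nerc-ocp-test cluster. It is not part of the commit. It assumes that the JSON patch above and `nameSuffix: -test` are the only transformations touching this resource. The hunks shown here do not change `destination.name`, so it is left as REPLACEME; it is presumably set elsewhere in the cluster kustomization.

```yaml
# Hypothetical rendered output for nerc-ocp-test (a sketch, not taken from the commit).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: autopilot-test   # base name "autopilot" plus nameSuffix "-test"
  labels:
    nerc.mghpcc.org/sync-policy: common
spec:
  project: default
  source:
    repoURL: https://github.com/ocp-on-nerc/nerc-ocp-config.git
    targetRevision: HEAD
    path: autopilot/overlays/nerc-ocp-test   # set by the "op: replace" patch
  destination:
    name: REPLACEME      # assumption: not modified by the hunks shown in this commit
    namespace: autopilot
```

The patch target uses the base name `autopilot` rather than `autopilot-test` because kustomize applies patches before the name suffix transformer runs.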
