[Bug]: ODH Operator Fails to create KfDef Instance and associated elements. #111

Open
donovat opened this issue Sep 22, 2023 · 10 comments
Labels
kind/bug Something isn't working

Comments

@donovat

donovat commented Sep 22, 2023

ODH Component

ODH Operator

Current Behavior

  1. ODH Operator is installed as per instructions in the quick start, via the OpenShift Console & Operator Hub tile.
  2. As per quick start, create a new namespace - in this case - odh.
  3. Change to new namespace odh and enter Installed Open Data Hub Operator's console.
  4. As per quick start, create an instance of KfDef, select default options.
  5. KfDef instance is created but with status of: Conditions: Degraded, Available and none of the other components are created.

Expected Behavior

As above, but for step 4 - the additional elements get created, i.e. pods, deployment resources etc.

Steps To Reproduce

See notes above.

Workaround (if any)

None yet.

What browsers are you seeing the problem on? (If applicable)

Firefox

Open Data Hub Version

Open Data Hub Operator
1.9.0 provided by Open Data Hub

Anything else

I note the following errors in the logs for the Operator Pod:
2023-09-22T12:07:11.171Z INFO controllers.KfDef Reconciling KfDef resources {"Request.Namespace": "odh", "Request.Name": "opendatahub"}
2023-09-22T12:07:11.171Z INFO controllers.KfDef Creating a new KubeFlow Deployment {"KubeFlow.Namespace": "odh"}
2023-09-22T12:07:11.172Z ERROR controllers.KfDef Failed to create the app directory {"error": "mkdir /tmp/odh: read-only file system"}
github.com/opendatahub-io/opendatahub-operator/controllers/kfdef%2eapps%2ekubeflow%2eorg.(*KfDefReconciler).Reconcile
/workspace/controllers/kfdef.apps.kubeflow.org/kfdef_controller.go:236
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.0/pkg/internal/controller/controller.go:227
2023-09-22T12:07:11.172Z ERROR controllers.KfDef failed to load KfApp {"error": "mkdir /tmp/odh: read-only file system"}

Which I note is the same / similar to issue opendatahub-operator#259
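For anyone wanting to pull the error payloads out of a longer operator log, here is a minimal Python sketch. It assumes each log record has the shape seen in the lines quoted above (timestamp, level, logger name, message, then a JSON object); the function names are hypothetical and stack-trace lines are simply skipped.

```python
import json
import re

# Assumed record shape, taken from the quoted lines:
# <timestamp> <LEVEL> <logger> <message> {json fields}
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"(?P<msg>.*?)\s*(?P<payload>\{.*\})\s*$"
)


def parse_operator_log_line(line):
    """Return a dict for a structured log record, or None for other lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # e.g. a stack-trace line
    try:
        fields = json.loads(match.group("payload"))
    except json.JSONDecodeError:
        return None
    return {
        "ts": match.group("ts"),
        "level": match.group("level"),
        "logger": match.group("logger"),
        "message": match.group("msg"),
        "fields": fields,
    }


def error_messages(lines):
    """Collect the "error" field of every ERROR-level record."""
    found = []
    for line in lines:
        record = parse_operator_log_line(line)
        if record and record["level"] == "ERROR" and "error" in record["fields"]:
            found.append(record["fields"]["error"])
    return found
```

Run against the excerpt above, this would surface the repeated "read-only file system" message without the surrounding stack trace.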

I also note that the Operator pod has been created with what look to be the wrong SCC values (taken from the installed YAML).
On a different install of ODH I have values:
openshift.io/scc: anyuid
but on my install I have value:
openshift.io/scc: ibo.std-scc

I also have values:
fsGroup: 1000
runAsUser: 101
readOnlyRootFilesystem: true

These look to be taken from a different Operator that is also installed in the openshift-operators namespace.
Could / should this other operator be affecting the ODH operator?

@donovat donovat added the kind/bug Something isn't working label Sep 22, 2023
@VaishnaviHire
Member

@donovat Given this thread, can we close this issue?

@donovat
Author

donovat commented Sep 22, 2023

Hi @VaishnaviHire - sorry, no. The Slack thread for the issue I have now resolved was about getting the alternative install of Open Data Hub working, which I then used to compare against the original install to see if I could find any reason it's failing to create the resources when the KfDef instance is created. That issue still needs to be resolved. Yes, I also have a Slack conversation going for this issue, but so far it has not had any helpful postings, and others reading this issue may wish to know how to solve the same problem.

@donovat
Author

donovat commented Sep 28, 2023

Taken from the slack channel discussion of this problem:
Hi
@tim Donovan
regarding the SCCs, those are not applied by the ODH operator, but can be updated using pod-security labels in a namespace.
A typical set of pod-security labels for anyuid looks like -

security.openshift.io/scc.podSecurityLabelSync: 'true'
pod-security.kubernetes.io/warn: privileged
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/enforce: privileged

Were there any changes in these labels for your openshift-operators namespace?

@donovat
Author

donovat commented Sep 28, 2023

Hi @VaishnaviHire
Thanks for the info. From what I can see, there have been no changes to either namespace that could be affected by this operator (or by any other operator installed in the same namespace).
In namespace openshift-operators for the install of Open Data Hub that's failing:

Labels:
openshift.io/scc        
pod-security.kubernetes.io/enforce-version=v1.24
kubernetes.io/metadata.name=openshift-operators
security.openshift.io/scc.podSecurityLabelSync=true
pod-security.kubernetes.io/warn=privileged
pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/warn-version=v1.24
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/audit-version=v1.24

And for the working version:

openshift.io/scc
pod-security.kubernetes.io/enforce-version=v1.24
kubernetes.io/metadata.name=openshift-operators
security.openshift.io/scc.podSecurityLabelSync=true
pod-security.kubernetes.io/warn=privileged
pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/warn-version=v1.24
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/audit-version=v1.24

These are the same, and almost the same is true for the odh namespace on both clusters:

kubernetes.io/metadata.name=odh
pod-security.kubernetes.io/audit=baseline
pod-security.kubernetes.io/audit-version=v1.24
pod-security.kubernetes.io/warn=baseline
pod-security.kubernetes.io/warn-version=v1.24

and

kubernetes.io/metadata.name=odh
katib-metricscollector-injection=enabled
openshift-pipelines.tekton.dev/namespace-reconcile-version=1.11.1
control-plane=kubeflow
pod-security.kubernetes.io/warn=baseline
pod-security.kubernetes.io/audit=baseline
pod-security.kubernetes.io/warn-version=v1.24
olm.operatorgroup.uid/543a0d02-4113-4a12-bb53-54128851b736
pod-security.kubernetes.io/audit-version=v1.24
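A comparison like this is easier to check mechanically than by eye. The following is a minimal Python sketch that diffs the two odh label listings above; the helper names are hypothetical, and the listings are labelled "first" and "second" in the order they appear here rather than by cluster.

```python
FIRST_LISTING = """\
kubernetes.io/metadata.name=odh
pod-security.kubernetes.io/audit=baseline
pod-security.kubernetes.io/audit-version=v1.24
pod-security.kubernetes.io/warn=baseline
pod-security.kubernetes.io/warn-version=v1.24
"""

SECOND_LISTING = """\
kubernetes.io/metadata.name=odh
katib-metricscollector-injection=enabled
openshift-pipelines.tekton.dev/namespace-reconcile-version=1.11.1
control-plane=kubeflow
pod-security.kubernetes.io/warn=baseline
pod-security.kubernetes.io/audit=baseline
pod-security.kubernetes.io/warn-version=v1.24
olm.operatorgroup.uid/543a0d02-4113-4a12-bb53-54128851b736
pod-security.kubernetes.io/audit-version=v1.24
"""


def parse_labels(text):
    """Turn 'key=value' lines into a dict; a bare key gets an empty value."""
    labels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        labels[key] = value if sep else ""
    return labels


def diff_labels(a, b):
    """Return (only in a, only in b, keys present in both with different values)."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed


only_first, only_second, changed = diff_labels(
    parse_labels(FIRST_LISTING), parse_labels(SECOND_LISTING)
)
print("only in second listing:", sorted(only_second))
```

For these two listings the pod-security labels match exactly; the second listing only adds the katib, tekton, control-plane and operatorgroup labels.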

I do note that on the cluster that's failing I have the following SCC, and I am not sure how or by what it was created:

allowHostPorts: false
priority: 100
requiredDropCapabilities: null
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAsRange
  uidRangeMax: 1883
  uidRangeMin: 101
users:
  - 'system:serviceaccount:ibo-helm:ibo-sa'
allowHostDirVolumePlugin: false
allowHostIPC: false
seLinuxContext:
  type: RunAsAny
readOnlyRootFilesystem: true
metadata:
  annotations:
    meta.helm.sh/release-name: ibo
    meta.helm.sh/release-namespace: ibo-helm
  creationTimestamp: '2022-04-05T15:59:13Z'
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ibo.std-scc
  resourceVersion: '319963784'
  uid: fe9136e7-919b-4881-bae3-823424cb86de
fsGroup:
  ranges:
    - max: 1000
      min: 1000
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
defaultAddCapabilities: null
supplementalGroups:
  type: RunAsAny
volumes:
  - awsElasticBlockStore
  - azureDisk
  - azureFile
  - cephFS
  - cinder
  - configMap
  - csi
  - downwardAPI
  - emptyDir
  - ephemeral
  - fc
  - flexVolume
  - flocker
  - gcePersistentDisk
  - gitRepo
  - glusterfs
  - iscsi
  - nfs
  - persistentVolumeClaim
  - photonPersistentDisk
  - portworxVolume
  - projected
  - quobyte
  - rbd
  - scaleIO
  - secret
  - storageOS
  - vsphere
allowHostPID: false
allowHostNetwork: false
allowPrivilegeEscalation: false
apiVersion: security.openshift.io/v1
allowedCapabilities: null

This is the SCC that's being picked up by the install of the Open Data Hub Operator and replacing anyuid on the failing cluster.

@LaVLaS
Contributor

LaVLaS commented Oct 17, 2023

Just to follow up, I attempted to reproduce this issue on two fresh OCP 4.12.35 clusters. The ODH operator install does not explicitly request the anyuid SCC.
Based on the error below in the odh-operator pod logs

2023-09-22T12:07:11.172Z ERROR controllers.KfDef Failed to create the app directory {"error": "mkdir /tmp/odh: read-only file system"}

It is safe to assume that the odh-operator can not create the /tmp/odh directory in the operator which means it cannot unpack the repo manifest(s) specified in the kfdef. From that assumption it is either

Judging by the content of ibo.std-scc, it has a higher priority (priority: 100) than anyuid (priority: 10), so it is applying readOnlyRootFilesystem: true to the odh-operator pod and denying write access to /tmp. I assume that another service created this SCC for system:serviceaccount:ibo-helm:ibo-sa? I would need to refresh my knowledge of SCC administration to provide a workaround, as the recommended solution is to mount an emptyDir volume at /tmp when readOnlyRootFilesystem: true is set.
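To make the priority argument concrete, here is a deliberately simplified Python sketch of how the two SCCs would be ordered. It assumes both SCCs are usable by the operator pod and only models "higher priority is tried first, ties broken by name"; the real admission logic also considers how restrictive each SCC is and whether the pod validates against it. The `Scc` class and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scc:
    name: str
    priority: Optional[int]  # an unset priority is treated as 0 in this model
    read_only_root_fs: bool


def evaluation_order(sccs):
    """Order SCCs by descending priority, then by name (simplified model)."""
    return sorted(sccs, key=lambda s: (-(s.priority or 0), s.name))


def first_candidate(sccs):
    """Return the SCC this simplified model would try first."""
    ordered = evaluation_order(sccs)
    if not ordered:
        raise ValueError("no SCCs to choose from")
    return ordered[0]


# Values taken from the thread: anyuid has priority 10, ibo.std-scc has 100.
candidates = [
    Scc("anyuid", 10, read_only_root_fs=False),
    Scc("ibo.std-scc", 100, read_only_root_fs=True),
]
chosen = first_candidate(candidates)
print(chosen.name, chosen.read_only_root_fs)
```

Under these assumptions ibo.std-scc is tried before anyuid, which is consistent with the operator pod ending up with a read-only root filesystem.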

@donovat
Author

donovat commented Oct 27, 2023

Hi @LaVLaS - I agree with your assessment, and that ibo.std-scc is setting readOnlyRootFilesystem: true.
It's my belief that ibo.std-scc comes from the [IBM ST4SD Operator](https://st4sd.github.io/overview/), which was installed on the cluster and, like ODH, gets installed into the same openshift-operators namespace.

Any suggestions?

@VassilisVassiliadis

Hello, ibo.std-scc is not part of ST4SD.

ST4SD does create an SCC, but it has the name st4sd-workflow-operator-scc-{$the namespace here}-do-not-modify.

Tim, one option would be to manually assign the anyuid SCC to the service-account that the odh pod uses.
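A minimal Python sketch of that option follows, building the `oc adm policy add-scc-to-user` command line rather than running it. The service account name below is a placeholder (the thread does not say which service account the odh pod uses), and the namespace is assumed to be openshift-operators, where the operator is installed.

```python
import shlex


def add_scc_to_user_cmd(scc, service_account, namespace):
    """Build the argv that grants an SCC to a service account (-z) in a namespace."""
    return [
        "oc", "adm", "policy", "add-scc-to-user",
        scc,
        "-z", service_account,
        "-n", namespace,
    ]


# "example-odh-operator-sa" is a hypothetical name; look up the real one on the pod.
cmd = add_scc_to_user_cmd("anyuid", "example-odh-operator-sa", "openshift-operators")
print(shlex.join(cmd))
```

The printed command can be reviewed and run by a cluster admin; the operator pod would need to be recreated afterwards for a different SCC to be selected.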

@donovat
Author

donovat commented Nov 3, 2023

I have now managed to install ODH. I reduced the priority of the odd SCC to 1, and this allowed the operator (after deleting the old instance and re-installing the operator) to create a working KfDef. I have not been able to trace where the odd SCC came from, other than that it was installed by Helm.
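For reference, one way the priority change could be expressed is as a merge patch via `oc patch`. This Python sketch only builds the command and is an assumption about how the edit might be made; the thread does not say whether the change was done this way or through the console. The helper name is hypothetical.

```python
import json
import shlex


def lower_scc_priority_cmd(scc_name, priority):
    """Build an `oc patch` argv that sets the SCC's top-level priority field."""
    patch = json.dumps({"priority": priority})
    return ["oc", "patch", "scc", scc_name, "--type=merge", "-p", patch]


patch_cmd = lower_scc_priority_cmd("ibo.std-scc", 1)
print(shlex.join(patch_cmd))
```

As described above, the operator and the KfDef instance still had to be removed and recreated before the change took effect.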

@VassilisVassiliadis

Judging by the ServiceAccount that uses the ibo SCC (from the definition of the SCC: system:serviceaccount:ibo-helm:ibo-sa), you'll find objects that use it in the ibo-helm namespace. You might be able to trace what creates the SCC in the same namespace.

@donovat
Author

donovat commented Nov 6, 2023

Thanks @VassilisVassiliadis - but there are no ibo-* resources on the cluster other than the ibo.std-scc resource, and no Helm charts within any of the projects. I would guess someone installed this resource during an install, but it never got deleted when the rest of the resources/Helm charts did. I am tempted to just delete it, but since I now have ODH working it's not as big an issue as it was.

4 participants