This issue is about discussing the scenario where the user workload (or any actor other than the sriov-config-daemon) unbinds the Virtual Function driver while a VF is assigned to a Pod. If this happens, the VF remains in an unusable state, and subsequent Pods using that device raise errors like:
```
Warning FailedCreatePodSandBox 148m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc =
failed to create pod network sandbox k8s_test-deployment-66b745fc5c-c64r8_ocpubgs-13574_833b6235-4d08-436f-8bb7-3f20a141748a_0(a3f026aa4229536eb5ebadd839500e44945407fd588f6e0ab202c554c2bd3088):
error adding pod ocpubgs-13574_test-deployment-66b745fc5c-c64r8 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network"
failed (add): [ocpubgs-13574/test-deployment-66b745fc5c-c64r8/833b6235-4d08-436f-8bb7-3f20a141748a:network-ocpubgs-13574]:
error adding container to network "network-ocpubgs-13574":
SRIOV-CNI failed to load netconf:
LoadConf(): failed to detect if VF 0000:19:02.2 has dpdk driver "lstat /sys/devices/pci0000:17/0000:17:02.0/0000:19:02.2/driver: no such file or directory"
```
Unbinding the driver is not a supported way to use the operator, so this problem can be addressed by covering it in the user documentation.
BTW, it is also tricky to detect, as the Pod that raises the error is innocent (well configured) and tracing the culprit can be hard.
a. Does it make sense to increase the sriov-config-daemon's resilience and rebind the VF driver when it diverges from the expected one?
b. If yes, would it be simpler to implement this behavior in sriov-cni, perhaps by adding a driver check before running a Pod?