This issue is about discussing the scenario where the user workload (or any actor other than the sriov-config-daemon) unbinds the Virtual Function driver while a VF is assigned to a Pod. If this happens, the VF remains in an unusable state, and subsequent Pods using that device raise errors like:
```
Warning FailedCreatePodSandBox 148m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc =
failed to create pod network sandbox k8s_test-deployment-66b745fc5c-c64r8_ocpubgs-13574_833b6235-4d08-436f-8bb7-3f20a141748a_0(a3f026aa4229536eb5ebadd839500e44945407fd588f6e0ab202c554c2bd3088):
error adding pod ocpubgs-13574_test-deployment-66b745fc5c-c64r8 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network"
failed (add): [ocpubgs-13574/test-deployment-66b745fc5c-c64r8/833b6235-4d08-436f-8bb7-3f20a141748a:network-ocpubgs-13574]:
error adding container to network "network-ocpubgs-13574":
SRIOV-CNI failed to load netconf:
LoadConf(): failed to detect if VF 0000:19:02.2 has dpdk driver "lstat /sys/devices/pci0000:17/0000:17:02.0/0000:19:02.2/driver: no such file or directory"
```
Unbinding the driver is not a supported way to use the operator, so this problem can be addressed by covering it in the user documentation.
BTW, it is also tricky to detect, as the Pod that raises the error is innocent (well configured) and tracing the culprit can be hard.
a. Does it make sense to increase the sriov-config-daemon's resilience and rebind the VF driver when it diverges from the expected one?
b. If yes, would it be simpler to implement this behavior in sriov-cni, perhaps by adding a driver check before running a Pod?