Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify status changes on managed interfaces #530

Merged
merged 4 commits into from
Apr 24, 2024

Conversation

mlguerrero12
Copy link
Contributor

@mlguerrero12 mlguerrero12 commented Oct 26, 2023

Users could modify the settings of VFs which have been configured by the sriov operator. This PR starts the reconciliation loop when these changes are detected in the generic plugin.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@SchSeba
Copy link
Collaborator

SchSeba commented Oct 29, 2023

Hi @mlguerrero12 thanks for the PR!
question this PR is to cover a bug? asking because now everytime we will go over a much heavier function to understand that we don't need todo anything then just check the the last apply configuration

@dmoessne
Copy link

Hi @mlguerrero12 thanks for the PR! question this PR is to cover a bug? asking because now everytime we will go over a much heavier function to understand that we don't need todo anything then just check the the last apply configuration

Hello @SchSeba indeed this is supposed to mitigate https://issues.redhat.com/browse/OCPBUGS-22337

@adrianchiris
Copy link
Collaborator

@dmoessne can you share the details publicly ?

currently the assumption is that config daemon owns sriov configuration (unless otherwise instructed e.g via externally managed feature) so no one is expected to change the configuration.

based on that assumption comparing the states is expected to be sufficient. that said, im not against checking the system for the actual state however we need to understand IF we actually need this as it brings with it overhead in the daemon reconcile loop.

@mlguerrero12
Copy link
Contributor Author

@SchSeba, @adrianchiris, we have a customer that manually changed the mtu of some VFs that were previously configured with a sriov policy. The sriov config daemon detects these changes (via pollNicStatus which runs every 30s) and updates the status of the sriov network node state object. However, the reconciliation loop doesn't trigger because it only does when the generation field changes (this is when the spec of the sriov-network-node-state changes). We already informed the customer that they shouldn't modify the mtu without using sriov policies. I'm currently working on a PR to also make the operator react when the status doesn't match what was set in the spec.

This customer was relying on this bug. The problem was that due to a recent issue with etcd encryption, the kube-api server pods were restarted (weekly). The resource version of objects were incremented due to this, including the one for the "device-plugin-config" config map. The resource version of this config map is part of the sriov-network-node-state spec. Due to this, the sriov config daemon started its reconciliation loop. The function OnNodeStateChange of the generic plugin detected changes in the MTU and returned "needDrain = true". The node was drained and this is the reason why the customer complained. They didn't understand why the node was repeatedly draining. The Apply function of the generic plugin never reconciled the status after the draining because it only does when the spec changes. This is what I'm fixing with this PR. Regarding the cost, yes, it has an extra cost but it's just some for loops to compare some parameters of the interfaces. Nothing to be concerned about I would think but I could speed this up with some routines.

To sum up, customer shouldn't be changing the mtu without using sriov policies of VFs that are managed by the sriov operator. We identified some issues in the logic of the sriov daemon. One of them is solved with this PR.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12 mlguerrero12 changed the title Remove LastState parameter of GenericPlugin Trigger daemon reconcilation loop when status changes Oct 30, 2023
@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12
Copy link
Contributor Author

I added a commit to start the reconciliation loop of the sriov config daemon when status of interfaces are externally modified (see my previous comment). Updated title and description of PR.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@github-actions
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

coveralls commented Oct 31, 2023

Pull Request Test Coverage Report for Build 8818177298

Details

  • 47 of 175 (26.86%) changed or added relevant lines in 8 files are covered.
  • 9 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.3%) to 38.617%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/plugins/fake/fake_plugin.go 0 3 0.0%
pkg/plugins/k8s/k8s_plugin.go 0 3 0.0%
pkg/plugins/mellanox/mellanox_plugin.go 0 3 0.0%
pkg/plugins/virtual/virtual_plugin.go 0 3 0.0%
pkg/plugins/intel/intel_plugin.go 0 5 0.0%
pkg/plugins/mock/mock_plugin.go 4 15 26.67%
pkg/plugins/generic/generic_plugin.go 30 63 47.62%
pkg/daemon/daemon.go 13 80 16.25%
Files with Coverage Reduction New Missed Lines %
pkg/plugins/generic/generic_plugin.go 1 52.85%
pkg/daemon/daemon.go 3 44.16%
controllers/generic_network_controller.go 5 74.53%
Totals Coverage Status
Change from base Build 8817158319: 0.3%
Covered Lines: 4870
Relevant Lines: 12611

💛 - Coveralls

@mlguerrero12 mlguerrero12 marked this pull request as draft November 1, 2023 10:05
Copy link

github-actions bot commented Nov 1, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12 mlguerrero12 changed the title Trigger daemon reconcilation loop when status changes Verify status changes on managed interfaces Nov 1, 2023
Copy link

github-actions bot commented Nov 1, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

github-actions bot commented Nov 1, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link

github-actions bot commented Nov 1, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12 mlguerrero12 marked this pull request as ready for review November 1, 2023 17:54
Copy link

github-actions bot commented Apr 2, 2024

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12
Copy link
Contributor Author

failing test is not related

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just small nits to add comments and we can merge this one.

nice work!

pkg/plugins/generic/generic_plugin.go Show resolved Hide resolved
pkg/plugins/k8s/k8s_plugin.go Show resolved Hide resolved
pkg/plugins/mellanox/mellanox_plugin.go Show resolved Hide resolved
pkg/plugins/virtual/virtual_plugin.go Show resolved Hide resolved
Copy link

github-actions bot commented Apr 4, 2024

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@mlguerrero12
Copy link
Contributor Author

LGTM just small nits to add comments and we can merge this one.

nice work!

Thanks @SchSeba. I added the TODO only for the generic plugin since other plugins have a different logic.

Please reopen #632. it was mistakenly closed.

Copy link

github-actions bot commented Apr 4, 2024

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

@mlguerrero12
Copy link
Contributor Author

@Eoghan1232 @e0ne, please have a look at this when you have time. Thanks!

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link
Collaborator

@ykulazhenkov ykulazhenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx

Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

The generic plugin was applying config changes only if
the desired spec of interfaces was different from the last
applied spec. This logic is different from the one in
OnNodeStateChange where the real status of the interfaces is
used to detect changes.

By removing the LastState parameter (and related code), the
generic plugin will also use the real status of interfaces
to decide whether to apply changes or not. The SyncNodeState
function has this logic.
Users could modify the settings of VFs which have been
configured by the sriov operator. This PR starts the
reconciliation loop when these changes are detected in the
generic plugin.

Signed-off-by: Marcelo Guerrero <marguerr@redhat.com>
Logic to check missing kernel arguments is placed in a method
to be used by both OnNodeStateChange and CheckStatusChanges.
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@SchSeba
Copy link
Collaborator

SchSeba commented Apr 24, 2024

merging this one thanks for all the help folks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants