
OCPBUGS-37955,SDN-4930,OCPBUGS-42616,SDN-5031,OCPBUGS-38753: [DownstreamMerge] 10-8-24 #2314

Merged: 101 commits into openshift:master on Oct 31, 2024

Conversation

@trozet (Contributor) commented Oct 8, 2024

No description provided.

numansiddique and others added 30 commits September 17, 2024 16:34
Signed-off-by: Numan Siddique <numans@ovn.org>
This is needed in order to get net-attach-def client v1.7.3 [0] that
can consume the UDN interfaces.

[0] k8snetworkplumbingwg/multus-cni#1336

Signed-off-by: Ram Lavi <ralavi@redhat.com>
Before this commit, this was the network status we reported for a UDN
pod:
```
kubectl get pods pod \
  -ojsonpath="{@.metadata.annotations.k8s\.v1\.cni\.cncf\.io\/network-status}" | jq
[
  {
    "name": "ovn-kubernetes",
    "interface": "eth0",
    "ips": [
      "10.244.1.9",
      "fd00:10:244:2::9"
    ],
    "mac": "0a:58:0a:f4:01:09",
    "default": true,
    "dns": {}
  }
]
```

With it, we now report:
```
[
 {
    "name": "ovn-kubernetes",
    "interface": "ovn-udn1",
    "ips": [
      "10.128.0.3"
    ],
    "mac": "0a:58:0a:80:00:03",
    "default": true,
    "dns": {}
  },
  {
    "name": "ovn-kubernetes",
    "interface": "eth0",
    "ips": [
      "10.244.1.6",
      "fd00:10:244:2::6"
    ],
    "mac": "0a:58:0a:f4:01:06",
    "default": false,
    "dns": {}
  }
 ]
```

This way, applications that comply with the k8snetworkplumbingwg de facto
standard can be aware of the UDN interface information.

We report the primary UDN first so the
network-attachment-definition-client can identify which of the two
interfaces to report as primary [0].

[0]
k8snetworkplumbingwg/network-attachment-definition-client#71

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>
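To make the ordering rule concrete, here is a minimal sketch (not the actual production code) that builds the annotation value with the primary UDN entry first, using the `NetworkStatus` type from the net-attach-def client; `buildNetworkStatus` is a hypothetical helper:
```go
package main

import (
	"encoding/json"
	"fmt"

	nadv1 "github.com/k8snetworkplumbingwg/network-attachment-definition-client/pkg/apis/k8s.cni.cncf.io/v1"
)

// buildNetworkStatus is a hypothetical helper illustrating the ordering rule:
// the primary UDN entry (default=true) comes first so that consumers of the
// net-attach-def client can identify it as the primary interface.
func buildNetworkStatus(udn, clusterDefault nadv1.NetworkStatus) []nadv1.NetworkStatus {
	udn.Default = true             // the UDN is the pod's primary network
	clusterDefault.Default = false // the cluster default network is demoted
	return []nadv1.NetworkStatus{udn, clusterDefault}
}

func main() {
	statuses := buildNetworkStatus(
		nadv1.NetworkStatus{Name: "ovn-kubernetes", Interface: "ovn-udn1", IPs: []string{"10.128.0.3"}, Mac: "0a:58:0a:80:00:03"},
		nadv1.NetworkStatus{Name: "ovn-kubernetes", Interface: "eth0", IPs: []string{"10.244.1.6"}, Mac: "0a:58:0a:f4:01:06"},
	)
	out, _ := json.MarshalIndent(statuses, "", "  ")
	fmt.Println(string(out)) // same shape as the annotation shown above
}
```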
This allows us to add a separate context in the future, to test pods
with primary UDN annotations set.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>
Whenever we invoke `cmdAddWithGetCNIResultFunc` it will return a result
with 2 interfaces: the first for the host side veth, and the second for
the pod side veth (featuring the sandbox attribute). Check the source
code at [0] to understand how it happens.

Now that we have changed the production code to *also* reply with the
result of adding the UDN interface, we in fact return four interfaces
in the result of `cmdAddWithGetCNIResultFunc`:

index: interface
----------------
0: host side primary UDN
1: pod side primary UDN
2: host side cluster default network
3: pod side cluster default network

[0] -
https://github.com/ovn-org/ovn-kubernetes/blob/a1e88dcd0b79b6c5eb033daea178708993f8664e/go-controller/pkg/cni/helper_linux.go#L650

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>
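As a hedged illustration of that ordering (the helper and labels are ours, not the test's), using the CNI 1.0 result types:
```go
package sketch

import (
	"fmt"

	current "github.com/containernetworking/cni/pkg/types/100"
)

// describeResult walks the four interfaces in the order listed above,
// assuming exactly the four entries described in the table. Pod-side
// entries carry the Sandbox attribute; host-side entries do not.
func describeResult(res *current.Result) {
	labels := []string{
		"host side primary UDN",
		"pod side primary UDN",
		"host side cluster default network",
		"pod side cluster default network",
	}
	for i, iface := range res.Interfaces {
		fmt.Printf("%d: %s (name=%s, sandbox=%q)\n", i, labels[i], iface.Name, iface.Sandbox)
	}
}
```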
After PR [0] was merged, we now need to specify which IP families are
available in the cluster whenever the `GetActiveNetworkForNamespace`
function is used [1]; otherwise we'll see errors like:
```
W0913 10:40:54.797673  126506 util.go:418] Skipping nad 'foo-ns/tenantred' as active network after failing parsing it with invalid subnet cnfiguration: error while parsing subnets: network tenantred is attempting to use ipv4 subnets but the cluster does not support ipv4
```

[0] - ovn-kubernetes/ovn-kubernetes#4625
[1] - https://github.com/ovn-org/ovn-kubernetes/blob/4cf0a2e9b317a71aa15faa76ced99c88f04868c3/go-controller/pkg/util/util.go#L418

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>
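For example, a unit test exercising `GetActiveNetworkForNamespace` might first declare the supported IP families. This sketch assumes the package-level `config.IPv4Mode`/`config.IPv6Mode` knobs are the relevant switches; `enableDualStack` is a hypothetical helper:
```go
package sketch

import (
	"github.com/ovn-org/ovn-kubernetes/go-controller/pkg/config"
)

// enableDualStack is a hypothetical test helper: it declares that both IP
// families are available so that subnet parsing inside
// GetActiveNetworkForNamespace does not reject the network with
// "attempting to use ipv4 subnets but the cluster does not support ipv4".
func enableDualStack() {
	config.IPv4Mode = true
	config.IPv6Mode = true
}
```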
Add a new e2e test to check the network-status annotation of UDN
primary pods.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
In Layer2 networks there is no join switch, only the cluster switch.
Ensure that the ports are named appropriately; this is important for
the logical router policies created for local node access.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
(cherry picked from commit 967923ec75b5ec50d80e158a33441ece2eb82fdb)
For some reason we were not adding the reroutes to the mpX interface,
required in local gateway (LGW) mode with the L2 topology, on the GR.
This commit fixes that.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Co-Authored-by: Patryk Diak <pdiak@redhat.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
This improves the error message. Before this PR:

 object not found

After this PR:

 unable to get logical router GR_l2.network_ovn-worker2: object not found

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
syncNodeManagementPort should be called after the gwManager
because we want to add the routes to the GR, which is only
created when the gwManager runs (see the ordering sketch below).
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
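A minimal sketch of that ordering constraint; the function name is illustrative and the two sync steps are passed in as closures rather than the real controller methods:
```go
package sketch

// syncNodeOrdered illustrates the constraint: the gateway sync must run
// first because it creates the gateway router (GR) that the management-port
// sync then adds routes to.
func syncNodeOrdered(syncGateway, syncManagementPort func() error) error {
	if err := syncGateway(); err != nil { // creates the GR for the node
		return err
	}
	// the GR now exists, so routes via the management port can be added to it
	return syncManagementPort()
}
```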
See https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/default_network_controller.go#L920
In the UDN L3/L2 controllers we were not checking for mgmtPortFailed
and accounting for it as part of node updates, so errors from
syncManagementPort were never retried (see the bookkeeping sketch below).

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
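A hedged sketch of the kind of bookkeeping this implies, mirroring the default network controller linked above; the type and method names are illustrative:
```go
package sketch

import "sync"

// mgmtPortTracker records, per node, whether syncManagementPort failed so
// that the next node update knows it must retry the management port setup.
type mgmtPortTracker struct {
	mu     sync.Mutex
	failed map[string]bool
}

func newMgmtPortTracker() *mgmtPortTracker {
	return &mgmtPortTracker{failed: map[string]bool{}}
}

func (t *mgmtPortTracker) recordResult(node string, err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failed[node] = err != nil
}

// needsRetry is consulted on every node update: a previously failed sync
// forces syncManagementPort to run again even if nothing else changed.
func (t *mgmtPortTracker) needsRetry(node string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.failed[node]
}
```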
fedora: Update Dockerfile to OVN release ovn-24.09.0-33.
UDN: Design routes and policies on L2's GR correctly.
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Leverages the NAD Controller to also track primary networks. This should
be faster than listing all NADs in a namespace and iterating through
them to find the namespace's primary network.

Refactors the network manager and NAD controller. The network cache is
now stored only in the network manager, rather than duplicated in the
NAD controller.

The NAD Controller is modified to hold only references to network names
for NADs, and to query the network manager for the networks. The NAD
Controller and network manager are now thread safe.

Tests are also modified:
 - NAD Controller is now implemented as an interface so that unit tests
   can implement a fake NAD controller for getting the active network
   (see the interface sketch below)
 - Fully functioning OVN controller unit tests still leverage a fully
   functioning NAD controller. The NCM may be faked out in these cases.
 - Fake entities moved to the testing/nad package

Co-authored-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
Signed-off-by: Tim Rozet <trozet@redhat.com>
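The interface extraction might look roughly like this; the method set and names are a guess from the description above, not the actual API, while `util.NetInfo` is the network description type ovn-kubernetes already uses:
```go
package nad

import "github.com/ovn-org/ovn-kubernetes/go-controller/pkg/util"

// NADController, sketched as an interface: unit tests can supply a fake
// that answers the active-network query without running a real controller.
type NADController interface {
	// GetActiveNetworkForNamespace returns the primary network for the
	// namespace, or the cluster default network when none is defined.
	GetActiveNetworkForNamespace(namespace string) (util.NetInfo, error)
}

// fakeNADController is the kind of test double the refactor enables.
type fakeNADController struct {
	networks map[string]util.NetInfo // namespace -> primary network
}

func (f *fakeNADController) GetActiveNetworkForNamespace(ns string) (util.NetInfo, error) {
	return f.networks[ns], nil
}
```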
Previously the function would check the informer cache, which provides
(in practical terms) real-time access to whether there is a primary
network for this namespace as a NAD. This did not cover the case where
a UDN could be generating the NAD, but the NAD has not been created
yet. This commit addresses that by checking for a UDN (only when a NAD
is not found for the namespace).

This should help in UDN cases (the case we really care about) to ensure
that the function returns an accurate result, especially during times
of delay where perhaps a pod handler and a UDN handler fire almost
simultaneously.

Right now the function returns an error indicating there is a
primary network that hasn't been processed yet, letting the caller know
to retry. We have seen in the past with time-sensitive handlers that
they may give up after 5 or so retries before the dependent controller
has finished. In those cases, we usually add a wait in the handler
itself, with around a 200-300ms timeout, while we wait for the
dependent handler (in this case the NAD handler) to finish. This commit
may be expanded on in the future to accommodate that where needed.
A sketch of the fallback follows below.

Signed-off-by: Tim Rozet <trozet@redhat.com>
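A sketch of the fallback; the sentinel error and the lookup closures are illustrative, not the real names:
```go
package sketch

import (
	"errors"
	"fmt"
)

// errPrimaryUDNNotProcessed signals that a primary UDN exists for the
// namespace but its NAD has not been created yet, so callers should retry.
var errPrimaryUDNNotProcessed = errors.New("primary UDN not yet processed")

// getActiveNetwork consults the NAD informer cache first; if no primary NAD
// is found, it checks whether a UDN would generate one and, if so, returns
// a retryable error instead of wrongly reporting the default network.
func getActiveNetwork(ns string, primaryNAD func(string) (string, bool), hasPrimaryUDN func(string) bool) (string, error) {
	if nad, ok := primaryNAD(ns); ok {
		return nad, nil
	}
	if hasPrimaryUDN(ns) {
		return "", fmt.Errorf("namespace %s: %w", ns, errPrimaryUDNNotProcessed)
	}
	return "default", nil
}
```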
On GRs we have an SNAT from the join port IP address to the external IP
address for the ICMP needs-frag packets sent by OVN to the external
network.

In L2 UDNs, even though there is no join network, the internal port has
two IPs, one from the pod subnet and another from the join subnet. The
ICMP source address will be the first one in lexicographical order,
which might be the pod subnet one.

The proper SNAT for the GR pod subnet address gets created in the gateway
initialization. But:
1. when DisableSNATMultipleGWs=false, it is redundant because there is
   already a SNAT for the whole pod subnet
2. when DisableSNATMultipleGWs=true, we end up with no SNAT because
   cleanupStalePodSNATs cleans it up, because it never expected that
   SNAT

This commit fixes case 2 above (a sketch of the condition follows below).

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
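In rough Go-flavored pseudocode (all names here are illustrative, not the actual ovn-kubernetes helpers), the resulting behavior is:
```go
package sketch

// ensureGRPodSubnetSNAT sketches the behavior described above: the per-GR
// SNAT for the pod-subnet address only matters when DisableSNATMultipleGWs
// is true; otherwise the subnet-wide SNAT already covers it.
func ensureGRPodSubnetSNAT(disableSNATMultipleGWs bool, addSNAT func(logicalIP, externalIP string) error, podSubnetIP, externalIP string) error {
	if !disableSNATMultipleGWs {
		// case 1: a SNAT for the whole pod subnet already exists, so the
		// per-address SNAT would be redundant
		return nil
	}
	// case 2: add the SNAT, and cleanupStalePodSNATs must now expect it
	// (previously it removed it as stale, leaving no SNAT at all)
	return addSNAT(podSubnetIP, externalIP)
}
```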
Signed-off-by: Patryk Diak <pdiak@redhat.com>
Signed-off-by: Patryk Diak <pdiak@redhat.com>
@tssurya (Contributor) commented Oct 29, 2024

/cherry-pick release-4.17

@openshift-cherrypick-robot

@tssurya: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.17


@tssurya (Contributor) commented Oct 29, 2024

/cherry-pick release-4.17

UGH wrong PR please ignore bot

@tssurya (Contributor) commented Oct 30, 2024

/retest

@jluhrsen (Contributor) commented:

GCP techpreview is green 🎉 https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2314/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-techpreview/1851163859293835264

@kyrtapz , I didn't notice anything, but was there any work done that helped make it green?

@tssurya , LMK if you want me to re-test anything to make sure.

@trozet (Contributor, Author) commented Oct 30, 2024

waiting on openshift/origin#29243

@trozet (Contributor, Author) commented Oct 30, 2024

/retest

openshift-ci bot (Contributor) commented Oct 31, 2024

@trozet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/e2e-aws-ovn-kubevirt | c6a0243 | false | /test e2e-aws-ovn-kubevirt
ci/prow/e2e-metal-ipi-ovn-techpreview | c6a0243 | false | /test e2e-metal-ipi-ovn-techpreview
ci/prow/e2e-metal-ipi-ovn-ipv6-techpreview | c6a0243 | false | /test e2e-metal-ipi-ovn-ipv6-techpreview
ci/prow/security | c6a0243 | false | /test security
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview | c6a0243 | false | /test e2e-aws-ovn-hypershift-conformance-techpreview
ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview | c6a0243 | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview


@maiqueb (Contributor) commented Oct 31, 2024

/lgtm

openshift-ci bot added the `lgtm` label (indicates that a PR is ready to be merged) on Oct 31, 2024.
openshift-ci bot (Contributor) commented Oct 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maiqueb, trozet


@tssurya (Contributor) commented Oct 31, 2024

/test e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview
/test e2e-metal-ipi-ovn-techpreview

@tssurya (Contributor) commented Oct 31, 2024

/payload-job periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance-serial
/payload-job periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance
/payload-job periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6
/payload-job periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial

openshift-ci bot (Contributor) commented Oct 31, 2024

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance-serial
  • periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/18a09ee0-9767-11ef-90a6-a9339fdf3790-0

@tssurya (Contributor) commented Oct 31, 2024

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-ovn-kubernetes-2314-periodics-e2e-aws-ovn-ocp-conformance-serial/1851913062232100864 and https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-ovn-kubernetes-2314-periodics-e2e-aws-ovn-ocp-conformance/1851913062798331904 are not even installing, so I am going to give up on those. @neisw @dgoodwin: maybe TRT knows why these nightly payloads are not running properly, given:

INFO[2024-10-31T09:04:24Z] ci-operator version v20241030-7705202fd
INFO[2024-10-31T09:04:24Z] Loading configuration from https://config.ci.openshift.org for openshift/ovn-kubernetes@master
ERRO[2024-10-31T09:04:24Z] Failed to load arguments. error=[tests[e2e-aws-ovn-ocp-conformance-serial].steps.pre[4].from: unknown image "root" (configuration is missing `build_root`), tests[e2e-aws-ovn-ocp-conformance-serial].steps.pre[5].from: unknown image "root" (configuration is missing `build_root`), tests[e2e-aws-ovn-ocp-conformance-serial].steps.pre[6].from: unknown image "root" (configuration is missing `build_root`), tests[e2e-aws-ovn-ocp-conformance-serial].steps.post[1].from: unknown image "root" (configuration is missing `build_root`)]
INFO[2024-10-31T09:04:24Z] Reporting job state 'failed' with reason 'loading_args:validating_config'

@tssurya (Contributor) commented Oct 31, 2024

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-ovn-kubernetes-2314-nightly-4.18-e2e-aws-ovn-serial/1851913063821742080 is not related to this PR:

: [bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info expand_less	0s
{  KubeAPIErrorBudgetBurn was at or above info for at least 58s on platformidentification.JobType{Release:"4.18", FromRelease:"", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 1h11m56s, firing for 58s:

Oct 31 10:30:44.051 - 58s   E namespace/openshift-kube-apiserver alert/KubeAPIErrorBudgetBurn alertstate/firing severity/critical ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}}

@trozet (Contributor, Author) commented Oct 31, 2024

/hold cancel

openshift-ci bot removed the `do-not-merge/hold` label (indicates that a PR should not merge because someone has issued a /hold command) on Oct 31, 2024.
openshift-merge-bot merged commit fdcbde9 into openshift:master on Oct 31, 2024. 32 of 38 checks passed.
@openshift-ci-robot (Contributor) commented:

@trozet: Jira Issue OCPBUGS-42616: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-42616 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-38753: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38753 has been moved to the MODIFIED state.


@openshift-cherrypick-robot commented:

@tssurya: Failed to get PR patch from GitHub. This PR will need to be manually cherry-picked.

Error message: status code 406 not one of [200], body: {"message":"Sorry, the diff exceeded the maximum number of files (300). Consider using 'List pull requests files' API or locally cloning the repository instead.","errors":[{"resource":"PullRequest","field":"diff","code":"too_large"}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#list-pull-requests-files","status":"406"}

In response to this:

/cherry-pick release-4.17


@openshift-bot (Contributor) commented:
[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-base
This PR has been included in build ose-ovn-kubernetes-base-container-v4.18.0-202410311939.p0.gfdcbde9.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot (Contributor) commented:
[ART PR BUILD NOTIFIER]

Distgit: ovn-kubernetes-microshift
This PR has been included in build ovn-kubernetes-microshift-container-v4.18.0-202410311939.p0.gfdcbde9.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot (Contributor) commented:
[ART PR BUILD NOTIFIER]

Distgit: ose-ovn-kubernetes
This PR has been included in build ose-ovn-kubernetes-container-v4.18.0-202410311939.p0.gfdcbde9.assembly.stream.el9.
All builds following this will include this PR.

@tssurya (Contributor) commented Nov 4, 2024

/retitle OCPBUGS-37955,SDN-4930,OCPBUGS-42616,SDN-5031,OCPBUGS-38753: [DownstreamMerge] 10-8-24

openshift-ci bot changed the title from "SDN-4930,OCPBUGS-42616,SDN-5031,OCPBUGS-38753: [DownstreamMerge] 10-8-24" to "OCPBUGS-37955,SDN-4930,OCPBUGS-42616,SDN-5031,OCPBUGS-38753: [DownstreamMerge] 10-8-24" on Nov 4, 2024.
@openshift-ci-robot (Contributor) commented:

@trozet: Jira Issue OCPBUGS-37955: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-37955 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-42616 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

Jira Issue OCPBUGS-38753 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.


Labels:
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
- jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
- lgtm: Indicates that a PR is ready to be merged.