Node not ready: container runtime is down #9984
Comments
I was having the same problem in #9980, but I applied the suggested fix of using the Talos discovery service and still have the same problem.
I am seeing this as well. I did not see it on the v1.9.0 betas.
Please supply a [...]. You should be looking into [...]. If you have any containerd/CRI config customizations, make sure they were updated for the containerd 2.0 configuration, although that should have failed on Talos 1.8 as well.
I seem to be experiencing the same issues, at the moment only affecting one of 4 machines, all installed on similar bare metal hardware. It seems to initially come online and report ready only to report not ready shortly thereafter. Have attached support zip if it helps.
Here is mine as well. It just happened after a reboot.
Not sure what's going on there, but in both support files it happens around the time
@smira Not sure, but I doubt rook-ceph blew this up. Something changed between the beta and v1.9.0. I have git history and nothing has changed except upgrading to v1.9.0. That's when the problem occurred.
You can see for yourself in the logs. We don't have any failures whatsoever in any of the tests, including Ceph. If there's a reproducer, happy to verify.
I read the same logs as you. It just so happens that cephfs is the last thing that comes up. Most likely a fluke, because again, nothing changed. Maybe there is a problem with upgrading from
None of our tests showed any issues, and I read the Kubernetes source code. You might try increasing the kubelet log verbosity with [...]. In your log there is no clear reason why the kubelet considers the CRI to be unhappy.
I changed kubelet to -v=9, here is the support.zip
I’m facing the same issue. On Talos 1.8.4, CRI worked fine, but after upgrading to 1.9.0, 2 of my 3 nodes go ‘Ready’ for 30-60 seconds before switching to ‘Not Ready.’
Which Kubernetes version is everyone running when they hit this issue?
@smira I was experiencing the issue on both
Also, is everybody using Ceph?
I have already reverted my cluster to 1.8.4 but here is a support bundle before I reverted
I can confirm this is due to
I don't use Multus on my cluster, but I have some nodes with the cephfs CSI driver and some nodes without. All nodes with the driver had the problem; the control planes without cephfs worked flawlessly. I switched all nodes back to 1.8.3, as it was the last version I had running successfully. (I had never tried 1.8.4.)
I filed this for containerd: containerd/containerd#11186
I have Cilium 1.16.5 and Talos 1.9.0, and I see this happening without Multus when going from
I think @buroa found the root cause (most probably): containerd/go-cni#123 (comment). We'll patch containerd for v1.9.1. This bug is hard to reproduce.
Fixes siderolabs/talos#9984
Patch with containerd/go-cni#126
See also:
* containerd/go-cni#125
* containerd/containerd#11186
* containerd/go-cni#123
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 0b00e86)
Bug Report
Description
After upgrading to Talos 1.9.0, some of my nodes never become ready.
Logs
Environment
talosctl version --nodes <problematic nodes>
Server Version: v1.32.0