CP shows Ready: false after reboot #9991

Closed
oliverl-21 opened this issue Dec 19, 2024 · 17 comments · Fixed by #10130

Comments

@oliverl-21

Bug Report

Description

I updated my nodes to Talos 1.9.0 and Kubernetes 1.32.0. After rebooting a control plane (CP) node with talosctl, the command gets stuck on:

◳ watching nodes: [rpi03]
    * rpi03: stage: RUNNING ready: false unmetCond: [name:"nodeReady" reason:"node \"rpi03\" status is not available yet"]

talosctl dashboard rpi03 also shows Ready: false, while everything is working fine.
kubectl get nodes shows the nodes with STATUS Ready.

Logs

k describe node rpi03

Events:
  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 6m9s                   kube-proxy
  Normal   RegisteredNode           45m                    node-controller  Node rpi03 event: Registered Node rpi03 in Controller
  Normal   RegisteredNode           21m                    node-controller  Node rpi03 event: Registered Node rpi03 in Controller
  Normal   RegisteredNode           21m                    node-controller  Node rpi03 event: Registered Node rpi03 in Controller
  Normal   Shutdown                 9m7s                   kubelet          Shutdown manager detected shutdown event
  Normal   NodeNotReady             9m7s                   kubelet          Node rpi03 status is now: NodeNotReady
  Warning  InvalidDiskCapacity      6m24s                  kubelet          invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  6m24s                  kubelet          Updated Node Allocatable limit across pods
  Normal   Starting                 6m24s                  kubelet          Starting kubelet.
  Warning  Rebooted                 6m23s (x3 over 6m24s)  kubelet          Node rpi03 has been rebooted, boot id: 295c8af1-a7eb-4b9e-b308-d5dab0fbe475
  Normal   NodeHasSufficientMemory  6m23s (x4 over 6m24s)  kubelet          Node rpi03 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6m23s (x4 over 6m24s)  kubelet          Node rpi03 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m23s (x4 over 6m24s)  kubelet          Node rpi03 status is now: NodeHasSufficientPID
  Normal   NodeNotReady             6m23s (x3 over 6m24s)  kubelet          Node rpi03 status is now: NodeNotReady
  Normal   NodeReady                6m23s                  kubelet          Node rpi03 status is now: NodeReady

Environment

  • Talos version:
Client:
        Tag:         v1.9.0
        SHA:         3cb25ceb
        Built:
        Go version:  go1.23.4
        OS/Arch:     darwin/arm64
Server:
        NODE:        rpi04
        Tag:         v1.9.0
        SHA:         3cb25ceb
        Built:
        Go version:  go1.23.4
        OS/Arch:     linux/arm64
        Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
  • Platform: RasPi 4
@smira
Member

smira commented Dec 19, 2024

This is not directly a Talos issue on its own; you need to look into the Node status to understand why it's not ready. This decision is made by the kubelet.

kubectl describe node gives you a detailed breakdown of all conditions.

See also #9984.

@oliverl-21
Author

The conditions look fine to me.

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 19 Dec 2024 09:27:02 +0100   Thu, 19 Dec 2024 09:27:02 +0100   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Thu, 19 Dec 2024 10:28:57 +0100   Thu, 19 Dec 2024 09:26:49 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 19 Dec 2024 10:28:57 +0100   Thu, 19 Dec 2024 09:26:49 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 19 Dec 2024 10:28:57 +0100   Thu, 19 Dec 2024 09:26:49 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 19 Dec 2024 10:28:57 +0100   Thu, 19 Dec 2024 09:26:50 +0100   KubeletReady                 kubelet is posting ready status
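
For anyone who wants to script the same check instead of reading the table by eye, here is a minimal sketch. It assumes the Node object has been fetched as JSON (for example with `kubectl get node rpi03 -o json`) and loaded into a dict; the helper name `node_is_ready` and the trimmed sample are illustrative only.

```python
def node_is_ready(node: dict) -> bool:
    """Return True if the Node's "Ready" condition has status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    # No Ready condition was reported at all.
    return False


# Trimmed sample mirroring the conditions table above (illustrative values).
sample = {
    "status": {
        "conditions": [
            {"type": "NetworkUnavailable", "status": "False", "reason": "FlannelIsUp"},
            {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
            {"type": "Ready", "status": "True", "reason": "KubeletReady"},
        ]
    }
}

print(node_is_ready(sample))
```

This only reflects what the kubelet reports to Kubernetes; it says nothing about what Talos itself shows in the dashboard, which is the discrepancy discussed in this thread.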

@smira
Member

smira commented Dec 19, 2024

If your node is ready, there's no problem?

If there's a problem, please grab a talosctl support bundle and attach it to this ticket.

@oliverl-21
Author

I guess it's more of a cosmetic issue: Ready: false is shown in the dashboard.

[screenshot]

@smira
Member

smira commented Dec 19, 2024

Please see what I posted above, if you need help, add a support bundle.

@oliverl-21
Author

It is somehow reporting as true now. Sorry for wasting your time.

@spagno

spagno commented Dec 21, 2024

I have the same problem (just a different architecture; I'm using x86-64). Attached is the support.zip from one node:
support.zip

@smira
Member

smira commented Dec 23, 2024

> I have the same problem (just different architecture, I'm using x86-64), In attach the support.zip from one node support.zip

I don't see why it doesn't report Node status, even though everything seems to be fine.

@spagno

spagno commented Dec 26, 2024

> > I have the same problem (just different architecture, I'm using x86-64), In attach the support.zip from one node support.zip
>
> I don't see why it doesn't report Node status, even though everything seems to be fine.

Well, the VIP disappeared because in Talos 1.8.4 the predictable interface name was enxMACADDRESS, while in 1.9 the predictable name is "ensX".

but even now that the VIP is working, READY is still false with this config:

  discovery:
    enabled: true
    registries:
      kubernetes:
        disabled: true
      service:
        disabled: false
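
Regarding the interface rename mentioned in this comment: the `enx` form is the MAC-based predictable name, i.e. `enx` followed by the hardware address as twelve lowercase hex digits. A minimal sketch for working out which old-style name a NIC would have had, so it can be searched for in a machine config after an upgrade. The helper name `enx_name` and the MAC address used here are made up for illustration.

```python
import re


def enx_name(mac: str) -> str:
    """Build the MAC-based predictable interface name (enx + 12 hex digits)."""
    digits = re.sub(r"[:-]", "", mac).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return "enx" + digits


# Hypothetical MAC address, for illustration only.
print(enx_name("78:E7:D1:EA:46:DA"))
```

If the machine config still references the name this produces while the node now reports an `ensX` name, the interface settings (including the VIP) will no longer match the interface.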

@smira
Member

smira commented Dec 27, 2024

> but even now that the VIP is working, READY is still false with this config:

Run talosctl -n <NODE> get machinestatus -o yaml to see the details, and work from there to find the root cause.

@spagno

spagno commented Dec 27, 2024

node: 192.168.68.11
metadata:
    namespace: runtime
    type: MachineStatuses.runtime.talos.dev
    id: machine
    version: 18
    owner: runtime.MachineStatusController
    phase: running
    created: 2024-12-27T00:08:21Z
    updated: 2024-12-27T00:08:29Z
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: nodeReady
              reason: node "k8s-control-1" status is not available yet

I'm really confused. Which condition has to be met for the node to be reported as ready?
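
To pull the unmet conditions out of this resource in a script, a minimal sketch follows. It assumes the resource was fetched with `-o json` instead of `-o yaml` and that the JSON uses the same field names as the YAML above; that mapping is an assumption, and the helper name `unmet_conditions` is illustrative.

```python
def unmet_conditions(machine_status: dict) -> list:
    """Return (name, reason) pairs for every unmet condition in a MachineStatus."""
    status = machine_status.get("spec", {}).get("status", {})
    return [
        (cond.get("name", ""), cond.get("reason", ""))
        for cond in status.get("unmetConditions", [])
    ]


# Sample mirroring the YAML output above (only the fields that are used).
sample = {
    "spec": {
        "stage": "running",
        "status": {
            "ready": False,
            "unmetConditions": [
                {
                    "name": "nodeReady",
                    "reason": 'node "k8s-control-1" status is not available yet',
                }
            ],
        },
    }
}

for name, reason in unmet_conditions(sample):
    print(f"{name}: {reason}")
```

An empty list corresponds to a machine with no unmet conditions listed; here the only entry is `nodeReady`, which is the condition the reboot command keeps waiting on.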

@smira
Member

smira commented Dec 27, 2024

I'm not sure why, but Talos can't pull the status of the node. That's what I posted above.

@spagno

spagno commented Dec 27, 2024

Seems so. I also reinstalled the cluster from scratch (directly on version 1.9.1) and have no idea how to troubleshoot it. It's not a big issue, because everything works, but I'm curious to understand what is going on.

@azkbn

azkbn commented Dec 29, 2024

Same story on my side. For some reason, the status of my control plane node is not available either:

metadata:
    namespace: runtime
    type: MachineStatuses.runtime.talos.dev
    id: machine
    version: 17
    owner: runtime.MachineStatusController
    phase: running
    created: 2024-12-29T13:29:09Z
    updated: 2024-12-29T13:30:42Z
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: nodeReady
              reason: node "cp01" status is not available yet

but cluster looks healthy:

NAME   STATUS   ROLES           AGE   VERSION
cp01   Ready    control-plane   18m   v1.32.0
w01    Ready    <none>          17m   v1.32.0

I'm using Talos v1.9.0 with the siderolabs/talos Terraform provider 0.7.0.

@azkbn

azkbn commented Dec 29, 2024

Got some updates. It's a very weird case. I'm running Talos in Proxmox. The issue started when I decided to use different names for the VM and the hostname of the nodes; before, they always matched. Once I switched to the shorter names cp01 and w01 for the k8s nodes (the Talos hostname), I ran into this issue. After rolling back those changes, I finally got my nodes in Ready status on the Talos dashboard. I guess it's very specific to my infra setup, but maybe it will be helpful to someone else.
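
One quick way to test the naming theory is to compare the node name that appears in the unmet condition's reason with the names Kubernetes actually lists. This is only a diagnostic sketch, not a confirmed root cause; the helper names and the `pve-cp01` value are hypothetical, and the reason format is taken from the outputs posted in this thread.

```python
import re


def node_name_from_reason(reason: str):
    """Extract the quoted node name from a reason like: node "cp01" status is ..."""
    match = re.search(r'node "([^"]+)"', reason)
    return match.group(1) if match else None


def is_known_node(reason: str, kubectl_node_names: list) -> bool:
    """True if the node named in the reason is among the names from kubectl get nodes."""
    name = node_name_from_reason(reason)
    return name is not None and name in kubectl_node_names


nodes = ["cp01", "w01"]  # names as listed by kubectl get nodes above
print(is_known_node('node "cp01" status is not available yet', nodes))
print(is_known_node('node "pve-cp01" status is not available yet', nodes))  # hypothetical mismatch
```

If the name in the reason is not among the registered nodes, a hostname mismatch is worth investigating; if it is, as in the outputs posted so far, the cause lies elsewhere.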

@spagno

spagno commented Dec 30, 2024

Could it be something related to DNS? Is there any readiness check that uses a PTR or A record to resolve the control plane's IP/hostname?

@RealKelsar

I have the same issue, but I already have pretty short names, for example tp-n1.
The cluster works fine so far, but talosctl commands for reboot and update never finish without an error, because they wait for the node to become ready.
support.zip
DNS resolves multiple variants of the names, such as tp-n1 or tp-n1.turing.

smira added a commit to smira/talos that referenced this issue Jan 14, 2025
Also use a constant everywhere in informers.

Add some debug logs.

Might fix siderolabs#9991

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this issue Jan 16, 2025
Also use a constant everywhere in informers.

Add some debug logs.

Might fix siderolabs#9991

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit da2e811)

5 participants