Add KB about how to shutdown a Harvester cluster
Signed-off-by: Jian Wang <jian.wang@suse.com>
w13915984028 committed Jul 22, 2024
1 parent 2ca5001 commit ad434a6
Showing 3 changed files with 311 additions and 0 deletions.
kb/2024-07-22/harvester_cluster_shutdown_and_restart.md

---
title: Shutdown and Restart a Harvester Cluster
description: Detailed steps to shut down a Harvester cluster and restart it.
slug: shutdown_and_restart_a_harvester_cluster
authors:
  - name: Jian Wang
    title: Staff Software Engineer
    url: https://github.com/w13915984028
    image_url: https://github.com/w13915984028.png
tags: [harvester, cluster, shutdown, rancher]
hide_table_of_contents: false
---

Scenarios:

1. The Harvester cluster is installed with 3+ nodes.

1. The Rancher server is deployed independently.

1. The Harvester cluster is imported to the Rancher server and works as a node driver.

1. Rancher deploys a couple of downstream Kubernetes clusters; the machines/nodes of those clusters are backed by Harvester VMs.

1. There are also some traditional VMs created on the Harvester cluster; they have no direct connection with the Rancher server.

You plan to move those Harvester nodes geographically, so it is essential to shut down the Harvester cluster and restart it later.

:::note

Scenarios 2, 3, and 4 are optional if your Harvester cluster mainly runs as IaaS. This instruction covers all the above scenarios.

:::

## 1. Precondition

### Generate a support-bundle file

Follow [this instruction](https://docs.harvesterhci.io/v1.3/troubleshooting/harvester#generate-a-support-bundle) to generate a support-bundle file.

### Network Stability

A Harvester cluster is built on top of Kubernetes. A general requirement is that the node/host IPs and the cluster VIP stay stable for the whole lifecycle; if the IPs change, the cluster will fail to work as expected.

If your VMs on Harvester are used as Rancher downstream cluster machines/nodes and their IPs are allocated from a DHCP server, also make sure those VMs will get the same IPs after the Harvester cluster is rebooted and the VMs are restarted.
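
Before the shutdown, it may help to record the current node IPs and the cluster VIP so they can be compared after the restart. A minimal sketch:

```
harvester$ kubectl get nodes -o wide
harvester$ kubectl get service -n kube-system ingress-expose
```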

## 2. Backup

### Backup all VMs if possible

It is always good practice to back up things before a whole-cluster shutdown.

### Backup downstream guest clusters if possible

Harvester doesn't touch the downstream clusters' workloads; you need to handle the related backups yourself.

## 3. Shutdown Workloads

### 3.1 Shutdown Traditional VMs

1. Shut down the VM from within the guest OS (e.g. the Linux `shutdown` command), so the OS itself saves data to disk.

2. Check the VM status from the [Harvester UI - VM page](https://docs.harvesterhci.io/v1.3/troubleshooting/vm#vm-general-operations); if it is not `Off`, click the `Stop` command.
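
If you prefer the CLI, the same can be done with `virtctl` (a sketch, reusing the example VM `vm1` in the `default` namespace; adjust the names to your VMs):

```
harvester$ kubectl get vm -A
harvester$ virtctl stop vm1 --namespace default
```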

### 3.2 Shutdown Rancher downstream cluster machines (VMs)

Your Harvester cluster was [imported to Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management) as a [node driver](https://docs.harvesterhci.io/v1.3/rancher/rancher-integration#creating-kubernetes-clusters-using-the-harvester-node-driver) before.

When Rancher deploys a downstream cluster on the Harvester node driver, it creates a couple of VMs on Harvester automatically. Directly stopping those VMs on Harvester is not good practice while Rancher is still managing the downstream cluster; for example, Rancher will create new VMs if you stop the existing ones from Harvester.

If you already have a way to **shut down** those downstream clusters, do so and check that the related VMs are `Off`; if there are no downstream clusters, jump to the step [disable all addons](#33-disable-all-addons).

Unless you have already deleted all the downstream clusters deployed on this Harvester, **DO NOT** [remove this imported Harvester from Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management#delete-imported-harvester-cluster). Harvester will get a different driver-id when it is imported again later, but the aforementioned downstream clusters are still bound to the original driver-id.

To safely shut down those VMs but still keep the Rancher downstream clusters `alive`, follow the steps below:

#### Disconnect Harvester from Rancher server

All of the following CLI commands are executed on Harvester.

1. Set the `management.cattle.io/scale-available` annotation of the `rancher` deployment to `""` instead of `"3"` or other values.

This change will stop the auto-scaling.

```
harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
    management.cattle.io/scale-available: "3" # record this value, then change it to ""
    ...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
    ...
  name: rancher
  namespace: cattle-system
```
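
If you prefer a non-interactive command instead of `kubectl edit`, the same change can be made with `kubectl annotate` (a sketch; note down the current value from the output above before overwriting it):

```
harvester$ kubectl annotate deployment -n cattle-system rancher management.cattle.io/scale-available="" --overwrite
deployment.apps/rancher annotated
```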

2. Scale down the Rancher deployment.

```
harvester$ kubectl scale deployment -n cattle-system rancher --replicas=0
deployment.apps/rancher scaled
harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d
```

3. Make sure the rancher pods are gone.

Check that the `rancher-*` pods in the `cattle-system` namespace are gone; if any of them is stuck at `Terminating`, use `kubectl delete pod -n cattle-system rancher-pod-name --force` to delete it.

```
harvester$ kubectl get pods -n cattle-system
NAME                       READY   STATUS        RESTARTS       AGE
...
rancher-856f674f7d-5dqb6   0/1     Terminating   0              3d22h
rancher-856f674f7d-h4vsw   1/1     Running       23 (68m ago)   33d
rancher-856f674f7d-m6s4r   0/1     Pending       0              3d19h
...
```

Please note:

1. From now on, this Harvester is `Unavailable` on the Rancher server.

![Unavailable](./imgs/harvester_unavailable_on_rancher.png)

2. The Harvester WebUI returns `503 Service Temporarily Unavailable`; all following operations can be done via `kubectl`.

![503 Service Temporarily Unavailable](./imgs/harvester_503_error.png)

#### Shutdown Rancher downstream cluster machines (VMs)

1. Shut down the VM from within the guest OS (e.g. the Linux `shutdown` command).

2. Check the `vmi` instances; if any is still `Running`, stop it.

```
harvester$ kubectl get vmi
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     True
harvester$ virtctl stop vm1 --namespace default
VM vm1 was scheduled to stop
harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     False
harvester$ kubectl get vmi -A
No resources found
harvester$ kubectl get vm -A
NAMESPACE   NAME   AGE   STATUS    READY
default     vm1    7d    Stopped   False
```

### 3.3 Disable all addons

From the Harvester UI [addon page](https://docs.harvesterhci.io/v1.3/advanced/addons), note down the addons that are not `Disabled`, click `Disable` to disable them, and wait until the state becomes `Disabled`.

From CLI:

```
$ kubectl get addons.harvesterhci.io -A
NAMESPACE                  NAME                    HELMREPO                                                  CHARTNAME                         ENABLED
cattle-logging-system      rancher-logging         http://harvester-cluster-repo.cattle-system.svc/charts   rancher-logging                   false
cattle-monitoring-system   rancher-monitoring      http://harvester-cluster-repo.cattle-system.svc/charts   rancher-monitoring                true
harvester-system           harvester-seeder        http://harvester-cluster-repo.cattle-system.svc/charts   harvester-seeder                  false
harvester-system           nvidia-driver-toolkit   http://harvester-cluster-repo.cattle-system.svc/charts   nvidia-driver-runtime             false
harvester-system           pcidevices-controller   http://harvester-cluster-repo.cattle-system.svc/charts   harvester-pcidevices-controller   false
harvester-system           vm-import-controller    http://harvester-cluster-repo.cattle-system.svc/charts   harvester-vm-import-controller    false

# Example: disable rancher-monitoring
$ kubectl edit addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring
...
spec:
  chart: rancher-monitoring
  enabled: false # set this field to false
...
```
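
The same change can also be applied non-interactively with `kubectl patch` (a sketch, using the `rancher-monitoring` addon from the example above):

```
$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":false}}'
```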

### 3.4 (Optional) Disable other workloads

If you have deployed some customized workloads on the Harvester cluster directly, it is better to disable or remove them before the shutdown.
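
For example, a customized deployment can be scaled down before the shutdown (a sketch; `my-app` and `my-namespace` are placeholder names):

```
harvester$ kubectl scale deployment -n my-namespace my-app --replicas=0
```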

## 4. Shutdown nodes

Get all nodes from the Harvester WebUI [Host Management](https://docs.harvesterhci.io/v1.3/host/) page.

From CLI:

```
harvester$ kubectl get nodes -A
NAME     STATUS     ROLES                       AGE    VERSION
harv2    NotReady   <none>                      4d6h   v1.27.10+rke2r1
harv41   Ready      control-plane,etcd,master   34d    v1.27.10+rke2r1
```

### 4.1 Shutdown worker nodes and witness node

1. SSH to each Harvester worker node and the witness node.

2. Run the command `sudo -i shutdown`.

```
$ sudo -i shutdown
Shutdown scheduled for Mon 2024-07-22 06:58:56 UTC, use 'shutdown -c' to cancel.
```

3. Wait until all those nodes are down.
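
If the nodes are reachable over SSH from a workstation, steps 1 and 2 can be scripted (a sketch; `harv2` and `harv3` are placeholder node names and `rancher` is assumed to be the default Harvester user):

```
$ for node in harv2 harv3; do ssh rancher@"$node" 'sudo -i shutdown'; done
```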

### 4.2 Shutdown the other control-plane nodes

Shut down the other control-plane nodes with the same `sudo -i shutdown` command, leaving one control-plane node for last.

### 4.3 Shutdown the last control-plane node

Finally, shut down the remaining control-plane node with `sudo -i shutdown`.

## 5. Restart

### 5.1 Restart the control-plane nodes


#### Restart the leader control-plane node

Power on the leader control-plane node first, then wait until the Harvester UI is accessible and the [leader node on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) is `Active`.

#### Restart the rest of the control-plane nodes

Power on the remaining control-plane nodes, then wait until all [control-plane nodes on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.
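
The node status can also be checked from the CLI once the cluster API is reachable again; all nodes should eventually report `Ready`:

```
harvester$ kubectl get nodes
```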

#### Check the VIP

The following `EXTERNAL-IP` should be the same as the VIP of the Harvester cluster.

```
harvester$ kubectl get service -n kube-system ingress-expose
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                      AGE
ingress-expose   LoadBalancer   10.53.50.107   192.168.122.144   443:32701/TCP,80:31480/TCP   34d
```

### 5.2 Restart the worker nodes and witness node

Power on the worker nodes and the witness node, then wait until all [nodes on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.

### 5.3 Enable addons

Enable the addons that were disabled in step 3.3 and wait until they are `DeploySuccessful`.
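
They can also be re-enabled from the CLI (a sketch, using the `rancher-monitoring` addon as in the earlier example):

```
harvester$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":true}}'
harvester$ kubectl get addons.harvesterhci.io -A
```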

### 5.4 Start VMs

Start the VMs that were stopped earlier and wait until they are `Running`.
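
A CLI sketch, reusing the example VM `vm1` in the `default` namespace:

```
harvester$ virtctl start vm1 --namespace default
harvester$ kubectl get vmi -A
```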

### 5.5 Restore the Connection to Rancher server

Run the following commands on the Harvester cluster.

1. Set the `management.cattle.io/scale-available` annotation of the `rancher` deployment back to the value recorded above.

This change will re-enable the auto-scaling.

```
harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
    management.cattle.io/scale-available: "3" # the value recorded in the earlier step
    ...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
    ...
  name: rancher
  namespace: cattle-system
```
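
As during the shutdown, the annotation can also be restored non-interactively (a sketch, assuming the recorded value was `"3"`):

```
harvester$ kubectl annotate deployment -n cattle-system rancher management.cattle.io/scale-available="3" --overwrite
deployment.apps/rancher annotated
```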

2. Scale up the Rancher deployment.

```
harvester$ kubectl scale deployment -n cattle-system rancher --replicas=3
deployment.apps/rancher scaled
harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d
```

3. Check the imported Harvester cluster on the Rancher server.

The Harvester cluster should be [active on Rancher Virtualization Management](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management) again.

4. Check the Harvester cluster WebUI

You should be able to access the Harvester WebUI.
Binary file added kb/2024-07-22/imgs/harvester_503_error.png
