---
title: Shutdown and Restart a Harvester Cluster
description: Details the steps to shut down a Harvester cluster and restart it.
slug: shutdown_and_restart_a_harvester_cluster
authors:
  - name: Jian Wang
    title: Staff Software Engineer
    url: https://github.com/w13915984028
    image_url: https://github.com/w13915984028.png
tags: [harvester, cluster, shutdown, rancher]
hide_table_of_contents: false
---

Scenarios:

1. The Harvester cluster is installed with 3+ nodes.

1. The Rancher server is deployed independently.

1. The Harvester cluster is imported into the Rancher server and works as a node driver.

1. Rancher deploys a couple of downstream k8s clusters; the machines/nodes of those clusters are backed by Harvester VMs.

1. There are also some traditional VMs created on the Harvester cluster; they have no direct connection with the Rancher server.

You have a plan to move those Harvester nodes geographically, so it is essential to shut down the Harvester cluster and restart it later.

:::note

Scenarios 2, 3, and 4 are optional if your Harvester cluster is mainly running as IaaS. This instruction covers all the above scenarios.

:::

## 1. Precondition

### Generate a support-bundle file

Follow [this instruction](https://docs.harvesterhci.io/v1.3/troubleshooting/harvester#generate-a-support-bundle) to generate a support-bundle file.

### Network Stability

The Harvester cluster is built on top of Kubernetes. A general requirement is that the node/host IPs and the cluster VIP stay stable over the whole lifecycle; if an IP changes, the cluster will fail to work as expected.

If your VMs on Harvester are used as Rancher downstream cluster machines/nodes and their IPs are allocated from a DHCP server, also make sure those VMs get the same IPs after the Harvester cluster is rebooted and the VMs are restarted.
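
Before the shutdown, you can record the current node IPs and the cluster VIP for comparison after the restart; the `ingress-expose` service queried later in this article exposes the VIP:

```
harvester$ kubectl get nodes -o wide
harvester$ kubectl get service -n kube-system ingress-expose
```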

## 2. Backup

### Backup all VMs if possible

It is always a good practice to back up things before a whole-cluster shutdown.
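
If you back up VMs with the Harvester backup feature, you can verify from the CLI that the backups are ready before proceeding; this sketch assumes a backup target is configured and that the `VirtualMachineBackup` resource name below matches your Harvester version:

```
harvester$ kubectl get virtualmachinebackups.harvesterhci.io -A
```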

### Backup downstream guest clusters if possible

Harvester doesn't touch the downstream clusters' workloads; you need to consider the related backups yourself.

## 3. Shutdown Workloads

### 3.1 Shutdown Traditional VMs

1. Shut down each VM from the VM shell (e.g. the Linux `shutdown` command), so the OS itself saves data to its disks.

2. Check the VM status from the [Harvester UI - VM page](https://docs.harvesterhci.io/v1.3/troubleshooting/vm#vm-general-operations); if it is not `Off`, click the `Stop` command.
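
The same can be done from the CLI with the commands used later in this article; `vm1` and `default` below are example names only:

```
harvester$ kubectl get vm -A
harvester$ virtctl stop vm1 --namespace default
```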

### 3.2 Shutdown Rancher downstream cluster machines (VMs)

Your Harvester cluster was [imported to Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management) as a [node driver](https://docs.harvesterhci.io/v1.3/rancher/rancher-integration#creating-kubernetes-clusters-using-the-harvester-node-driver) before.

When Rancher deploys a downstream cluster on the Harvester node driver, it creates a couple of VMs on Harvester automatically. Directly stopping those VMs on Harvester is not a good practice while Rancher is still managing the downstream cluster; for example, Rancher will create new VMs if you stop the existing ones from Harvester.

If you already have a solution to **shutdown** those downstream clusters, check that their VMs are `Off`; if there are no downstream clusters, jump to the step [disable all addons](#33-disable-all-addons).

Unless you have already deleted all the downstream clusters deployed on this Harvester, **DO NOT** [remove this imported Harvester from Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management#delete-imported-harvester-cluster). Harvester will get a different driver ID when it is imported again later, but the aforementioned downstream clusters are connected to the old driver ID.

To safely shut down those VMs while keeping the Rancher downstream clusters `alive`, follow the steps below:

#### Disconnect Harvester from Rancher server

All of the following CLI commands are executed on Harvester.

1. Set the `management.cattle.io/scale-available` annotation of `deployment rancher` to `""` instead of `"3"` or another value.

   This change will stop the auto-scaling.

   ```
   harvester$ kubectl edit deployment -n cattle-system rancher
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     annotations:
       ...
       management.cattle.io/scale-available: "3"  # record this value, and change it to ""
       ...
     generation: 16
     labels:
       app: rancher
       app.kubernetes.io/managed-by: Helm
       ...
     name: rancher
     namespace: cattle-system
   ```
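
   If you prefer a non-interactive command, the same annotation change can be applied with `kubectl annotate` (record the old value first):

   ```
   harvester$ kubectl annotate deployment -n cattle-system rancher management.cattle.io/scale-available="" --overwrite
   ```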

2. Scale down the Rancher deployment.

   ```
   harvester$ kubectl scale deployment -n cattle-system rancher --replicas=0
   deployment.apps/rancher scaled
   harvester$ kubectl get deployment -n cattle-system rancher
   NAME      READY   UP-TO-DATE   AVAILABLE   AGE
   rancher   0/0     0            0           33d
   ```

3. Make sure the Rancher pods are gone.

   Check that the `rancher-*` pods in `cattle-system` are gone; if any of them is stuck at `Terminating`, use `kubectl delete pod -n cattle-system rancher-pod-name --force` to delete it.

   ```
   harvester$ kubectl get pods -n cattle-system
   NAME                       READY   STATUS        RESTARTS       AGE
   ..
   rancher-856f674f7d-5dqb6   0/1     Terminating   0              3d22h
   rancher-856f674f7d-h4vsw   1/1     Running       23 (68m ago)   33d
   rancher-856f674f7d-m6s4r   0/1     Pending       0              3d19h
   ...
   ```
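
   To wait for the remaining pods to disappear, assuming the pods carry the `app: rancher` label shown in the deployment metadata above, you can use:

   ```
   harvester$ kubectl wait --for=delete pod -l app=rancher -n cattle-system --timeout=120s
   ```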

Please note:

1. From now on, this Harvester is `Unavailable` on the Rancher server.

   ![Unavailable](./imgs/harvester_unavailable_on_rancher.png)

2. The Harvester WebUI returns `503 Service Temporarily Unavailable`; all following operations can be done via `kubectl`.

   ![503 Service Temporarily Unavailable](./imgs/harvester_503_error.png)

#### Shutdown Rancher downstream cluster machines (VMs)

1. Shut down each VM from the VM shell (e.g. the Linux `shutdown` command).

2. Check the `vmi` instances; if any is still `Running`, stop it.

   ```
   harvester$ kubectl get vmi
   NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
   default     vm1    5m6s   Running   10.52.0.214   harv41     True
   harvester$ virtctl stop vm1 --namespace default
   VM vm1 was scheduled to stop
   harvester$ kubectl get vmi -A
   NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
   default     vm1    5m6s   Running   10.52.0.214   harv41     False
   harvester$ kubectl get vmi -A
   No resources found
   harvester$ kubectl get vm -A
   NAMESPACE   NAME   AGE   STATUS    READY
   default     vm1    7d    Stopped   False
   ```
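
If many VMs are still running, a minimal sketch to stop all remaining VM instances in one pass (assuming `virtctl` is available, as above) is:

```
harvester$ for vm in $(kubectl get vmi -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  virtctl stop "${vm##*/}" --namespace "${vm%%/*}"
done
```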

### 3.3 Disable all addons

From the Harvester UI [addon page](https://docs.harvesterhci.io/v1.3/advanced/addons), write down the addons that are not `Disabled`, click `Disable` on each of them, and wait until their state becomes `Disabled`.

From CLI:

```
$ kubectl get addons.harvesterhci.io -A
NAMESPACE                  NAME                    HELMREPO                                                 CHARTNAME                         ENABLED
cattle-logging-system      rancher-logging         http://harvester-cluster-repo.cattle-system.svc/charts   rancher-logging                   false
cattle-monitoring-system   rancher-monitoring      http://harvester-cluster-repo.cattle-system.svc/charts   rancher-monitoring                true
harvester-system           harvester-seeder        http://harvester-cluster-repo.cattle-system.svc/charts   harvester-seeder                  false
harvester-system           nvidia-driver-toolkit   http://harvester-cluster-repo.cattle-system.svc/charts   nvidia-driver-runtime             false
harvester-system           pcidevices-controller   http://harvester-cluster-repo.cattle-system.svc/charts   harvester-pcidevices-controller   false
harvester-system           vm-import-controller    http://harvester-cluster-repo.cattle-system.svc/charts   harvester-vm-import-controller    false
```

Example: disable rancher-monitoring.

```
$ kubectl edit addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring
...
spec:
  chart: rancher-monitoring
  enabled: false  # set this field to false
...
```
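
The same change can be made non-interactively with `kubectl patch`:

```
$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":false}}'
```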

### 3.4 (Optional) Disable other workloads

If you have deployed some customized workloads directly on the Harvester cluster, it is better to disable/remove them.

## 4. Shutdown nodes

Get all nodes from the Harvester WebUI [Host Management](https://docs.harvesterhci.io/v1.3/host/).

From CLI:

```
harvester$ kubectl get nodes -A
NAME     STATUS     ROLES                       AGE    VERSION
harv2    NotReady   <none>                      4d6h   v1.27.10+rke2r1
harv41   Ready      control-plane,etcd,master   34d    v1.27.10+rke2r1
```

### 4.1 Shutdown worker nodes, witness node

1. SSH to the Harvester worker nodes and the witness node.

2. Run the command `sudo -i shutdown`.

   ```
   $ sudo -i shutdown
   Shutdown scheduled for Mon 2024-07-22 06:58:56 UTC, use 'shutdown -c' to cancel.
   ```

3. Wait until all those nodes are down.

### 4.2 Shutdown control-plane nodes

SSH to the control-plane nodes and run `sudo -i shutdown`, leaving the leader node for last. Wait until they are down.

### 4.3 Shutdown the last control-plane node

Run `sudo -i shutdown` on the remaining control-plane node and wait until it is down.

## 5. Restart

### 5.1 Restart the control-plane nodes

#### Restart the leader control-plane node

Wait until the Harvester UI is accessible and the [leader node on Harvester UI](https://docs.harvesterhci.io/v1.3/host/) is `Active`.

#### Restart the rest of the control-plane nodes

Wait until all [control-plane nodes on Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.
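
You can also confirm from the CLI that the control-plane nodes report `Ready`:

```
harvester$ kubectl get nodes
NAME     STATUS   ROLES                       AGE   VERSION
harv41   Ready    control-plane,etcd,master   34d   v1.27.10+rke2r1
```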

#### Check the VIP

The following `EXTERNAL-IP` should be the same as the VIP of the Harvester cluster.

```
harvester$ kubectl get service -n kube-system ingress-expose
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                      AGE
ingress-expose   LoadBalancer   10.53.50.107   192.168.122.144   443:32701/TCP,80:31480/TCP   34d
```

### 5.2 Restart the worker nodes and witness node

Wait until all [nodes on Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.

### 5.3 Enable addons

Enable the addons you disabled in step 3.3 and wait until they are `DeploySuccessful`.
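
From the CLI, this can again be done with `kubectl patch`, mirroring the disable step above:

```
$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":true}}'
$ kubectl get addons.harvesterhci.io -A
```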

### 5.4 Start VMs

Start the VMs and wait until they are `Running`.
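
From the CLI, a VM can be started with `virtctl`; `vm1` and `default` are example names:

```
harvester$ virtctl start vm1 --namespace default
harvester$ kubectl get vmi -A
```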

### 5.5 Restore the Connection to Rancher server

Run the following commands 1 and 2 on the Harvester cluster.

1. Set the `management.cattle.io/scale-available` annotation of `deployment rancher` back to the value recorded above.

   This change will enable the auto-scaling.

   ```
   harvester$ kubectl edit deployment -n cattle-system rancher
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     annotations:
       ...
       management.cattle.io/scale-available: "3"  # the value recorded in the steps above
       ...
     generation: 16
     labels:
       app: rancher
       app.kubernetes.io/managed-by: Helm
       ...
     name: rancher
     namespace: cattle-system
   ```

2. Scale up the Rancher deployment.

   ```
   harvester$ kubectl scale deployment -n cattle-system rancher --replicas=3
   deployment.apps/rancher scaled
   harvester$ kubectl get deployment -n cattle-system rancher
   NAME      READY   UP-TO-DATE   AVAILABLE   AGE
   rancher   0/0     0            0           33d
   ```
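
   The scale-up takes a short while; you can wait for it to finish with:

   ```
   harvester$ kubectl rollout status deployment -n cattle-system rancher
   deployment "rancher" successfully rolled out
   ```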

3. Check the imported Harvester cluster on the Rancher server.

   The Harvester cluster continues to be [active on Rancher Virtualization Management](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management).

4. Check the Harvester cluster WebUI.

   You should be able to access the Harvester WebUI.