Add KB about how to shutdown a Harvester cluster
Signed-off-by: Jian Wang <jian.wang@suse.com>
w13915984028 committed Jul 22, 2024
1 parent 2ca5001 commit ad434a6
Showing 3 changed files with 311 additions and 0 deletions.
kb/2024-07-22/harvester_cluster_shutdown_and_restart.md

---
title: Shutdown and Restart a Harvester Cluster
description: Detailed steps to shut down a Harvester cluster and restart it.
slug: shutdown_and_restart_a_harvester_cluster
authors:
  - name: Jian Wang
    title: Staff Software Engineer
    url: https://github.com/w13915984028
    image_url: https://github.com/w13915984028.png
tags: [harvester, cluster, shutdown, rancher]
hide_table_of_contents: false
---

Scenarios:

1. The Harvester cluster is installed with 3+ nodes.

1. The Rancher server is deployed independently.

1. The Harvester cluster is imported to the Rancher server and works as a node driver.

1. Rancher deploys a couple of downstream Kubernetes clusters; the machines/nodes of those clusters are backed by Harvester VMs.

1. There are also some traditional VMs created on the Harvester cluster; they have no direct connection with the Rancher server.

You plan to move those Harvester nodes geographically, so it is essential to shut down the Harvester cluster and restart it later.

:::note

Scenarios 2, 3, and 4 are optional if your Harvester cluster mainly runs as IaaS. This instruction covers all the above scenarios.

:::

## 1. Precondition

### Generate a support-bundle file

Follow [this instruction](https://docs.harvesterhci.io/v1.3/troubleshooting/harvester#generate-a-support-bundle) to generate a support-bundle file.

### Network Stability

A Harvester cluster is built on top of Kubernetes. A general requirement is that the node/host IPs and the cluster VIP stay stable for the whole lifecycle; if the IPs change, the cluster will fail to work as expected.

If your VMs on Harvester are used as Rancher downstream cluster machines/nodes and their IPs are allocated from a DHCP server, also make sure those VMs will get the same IPs after the Harvester cluster is rebooted and the VMs are restarted.
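
Before the shutdown, it may help to record the current node IPs and the cluster VIP so they can be compared after the restart. A minimal sketch:

```
harvester$ kubectl get nodes -o wide
harvester$ kubectl get service -n kube-system ingress-expose
```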

## 2. Backup

### Backup all VMs if possible

It is always good practice to back up things before a whole-cluster shutdown.

### Backup downstream guest clusters if possible

Harvester doesn't touch the downstream clusters' workloads; you need to handle the related backups yourself.

## 3. Shutdown Workloads

### 3.1 Shutdown Traditional VMs

1. Shut down the VM from within the guest OS (e.g. the Linux `shutdown` command), so the OS itself saves data to disk.

2. Check the VM status from the [Harvester UI - VM page](https://docs.harvesterhci.io/v1.3/troubleshooting/vm#vm-general-operations); if it is not `Off`, click the `Stop` command.
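
If you prefer the CLI, the same can be done with `virtctl` (a sketch, reusing the example VM `vm1` in the `default` namespace; adjust the names to your VMs):

```
harvester$ kubectl get vm -A
harvester$ virtctl stop vm1 --namespace default
```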

### 3.2 Shutdown Rancher downstream cluster machines (VMs)

Your Harvester cluster was [imported to Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management) as a [node driver](https://docs.harvesterhci.io/v1.3/rancher/rancher-integration#creating-kubernetes-clusters-using-the-harvester-node-driver) before.

When Rancher deploys a downstream cluster on the Harvester node driver, it creates a couple of VMs on Harvester automatically. Directly stopping those VMs on Harvester is not good practice while Rancher is still managing the downstream cluster; for example, Rancher will create new VMs if you stop the existing ones from Harvester.

If you already have a way to **shut down** those downstream clusters, do so and check that the related VMs are `Off`; if there are no downstream clusters, jump to the step [disable all addons](#33-disable-all-addons).

Unless you have already deleted all the downstream clusters deployed on this Harvester, **DO NOT** [remove this imported Harvester from Rancher](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management#delete-imported-harvester-cluster). Harvester will get a different driver-id when it is imported again later, but the aforementioned downstream clusters are still bound to the original driver-id.

To safely shut down those VMs but still keep the Rancher downstream clusters `alive`, follow the steps below:

#### Disconnect Harvester from Rancher server

All of the following CLI commands are executed on Harvester.

1. Set the `management.cattle.io/scale-available` annotation of the `rancher` deployment to `""` instead of `"3"` or other values.

This change will stop the auto-scaling.

```
harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
    management.cattle.io/scale-available: "3" # record this value, then change it to ""
    ...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
    ...
  name: rancher
  namespace: cattle-system
```
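
If you prefer a non-interactive command instead of `kubectl edit`, the same change can be made with `kubectl annotate` (a sketch; note down the current value from the output above before overwriting it):

```
harvester$ kubectl annotate deployment -n cattle-system rancher management.cattle.io/scale-available="" --overwrite
deployment.apps/rancher annotated
```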

2. Scale down the Rancher deployment.

```
harvester$ kubectl scale deployment -n cattle-system rancher --replicas=0
deployment.apps/rancher scaled
harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d
```

3. Make sure the rancher pods are gone.

Check that the `rancher-*` pods in the `cattle-system` namespace are gone; if any of them is stuck at `Terminating`, use `kubectl delete pod -n cattle-system rancher-pod-name --force` to delete it.

```
harvester$ kubectl get pods -n cattle-system
NAME                       READY   STATUS        RESTARTS       AGE
...
rancher-856f674f7d-5dqb6   0/1     Terminating   0              3d22h
rancher-856f674f7d-h4vsw   1/1     Running       23 (68m ago)   33d
rancher-856f674f7d-m6s4r   0/1     Pending       0              3d19h
...
```

Please note:

1. From now on, this Harvester is `Unavailable` on the Rancher server.

![Unavailable](./imgs/harvester_unavailable_on_rancher.png)

2. The Harvester WebUI returns `503 Service Temporarily Unavailable`; all following operations can be done via `kubectl`.

![503 Service Temporarily Unavailable](./imgs/harvester_503_error.png)

#### Shutdown Rancher downstream cluster machines (VMs)

1. Shut down the VM from within the guest OS (e.g. the Linux `shutdown` command).

2. Check the `vmi` instances; if any is still `Running`, stop it.

```
harvester$ kubectl get vmi
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     True
harvester$ virtctl stop vm1 --namespace default
VM vm1 was scheduled to stop
harvester$ kubectl get vmi -A
NAMESPACE   NAME   AGE    PHASE     IP            NODENAME   READY
default     vm1    5m6s   Running   10.52.0.214   harv41     False
harvester$ kubectl get vmi -A
No resources found
harvester$ kubectl get vm -A
NAMESPACE   NAME   AGE   STATUS    READY
default     vm1    7d    Stopped   False
```

### 3.3 Disable all addons

From the Harvester UI [addon page](https://docs.harvesterhci.io/v1.3/advanced/addons), note down the addons that are not `Disabled`, click `Disable` to disable them, and wait until the state becomes `Disabled`.

From CLI:

```
$ kubectl get addons.harvesterhci.io -A
NAMESPACE                  NAME                    HELMREPO                                                  CHARTNAME                         ENABLED
cattle-logging-system      rancher-logging         http://harvester-cluster-repo.cattle-system.svc/charts   rancher-logging                   false
cattle-monitoring-system   rancher-monitoring      http://harvester-cluster-repo.cattle-system.svc/charts   rancher-monitoring                true
harvester-system           harvester-seeder        http://harvester-cluster-repo.cattle-system.svc/charts   harvester-seeder                  false
harvester-system           nvidia-driver-toolkit   http://harvester-cluster-repo.cattle-system.svc/charts   nvidia-driver-runtime             false
harvester-system           pcidevices-controller   http://harvester-cluster-repo.cattle-system.svc/charts   harvester-pcidevices-controller   false
harvester-system           vm-import-controller    http://harvester-cluster-repo.cattle-system.svc/charts   harvester-vm-import-controller    false

# Example: disable rancher-monitoring
$ kubectl edit addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring
...
spec:
  chart: rancher-monitoring
  enabled: false # set this field to false
...
```
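
The same change can also be applied non-interactively with `kubectl patch` (a sketch, using the `rancher-monitoring` addon from the example above):

```
$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":false}}'
```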

### 3.4 (Optional) Disable other workloads

If you have deployed some customized workloads on the Harvester cluster directly, it is better to disable or remove them before the shutdown.
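
For example, a customized deployment can be scaled down before the shutdown (a sketch; `my-app` and `my-namespace` are placeholder names):

```
harvester$ kubectl scale deployment -n my-namespace my-app --replicas=0
```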

## 4. Shutdown nodes

Get all nodes from the Harvester WebUI [Host Management](https://docs.harvesterhci.io/v1.3/host/) page.

From CLI:

```
harvester$ kubectl get nodes -A
NAME     STATUS     ROLES                       AGE    VERSION
harv2    NotReady   <none>                      4d6h   v1.27.10+rke2r1
harv41   Ready      control-plane,etcd,master   34d    v1.27.10+rke2r1
```

### 4.1 Shutdown worker nodes and witness node

1. SSH to each Harvester worker node and the witness node.

2. Run the command `sudo -i shutdown`.

```
$ sudo -i shutdown
Shutdown scheduled for Mon 2024-07-22 06:58:56 UTC, use 'shutdown -c' to cancel.
```

3. Wait until all those nodes are down.
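
If the nodes are reachable over SSH from a workstation, steps 1 and 2 can be scripted (a sketch; `harv2` and `harv3` are placeholder node names and `rancher` is assumed to be the default Harvester user):

```
$ for node in harv2 harv3; do ssh rancher@"$node" 'sudo -i shutdown'; done
```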

### 4.2 Shutdown the other control-plane nodes

Shut down the other control-plane nodes with the same `sudo -i shutdown` command, leaving one control-plane node for last.

### 4.3 Shutdown the last control-plane node

Finally, shut down the remaining control-plane node with `sudo -i shutdown`.

## 5. Restart

### 5.1 Restart the control-plane nodes


#### Restart the leader control-plane node

Power on the leader control-plane node first, then wait until the Harvester UI is accessible and the [leader node on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) is `Active`.

#### Restart the rest of the control-plane nodes

Power on the remaining control-plane nodes, then wait until all [control-plane nodes on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.
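
The node status can also be checked from the CLI once the cluster API is reachable again; all nodes should eventually report `Ready`:

```
harvester$ kubectl get nodes
```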

#### Check the VIP

The following `EXTERNAL-IP` should be the same as the VIP of the Harvester cluster.

```
harvester$ kubectl get service -n kube-system ingress-expose
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)                      AGE
ingress-expose   LoadBalancer   10.53.50.107   192.168.122.144   443:32701/TCP,80:31480/TCP   34d
```

### 5.2 Restart the worker nodes and witness node

Power on the worker nodes and the witness node, then wait until all [nodes on the Harvester UI](https://docs.harvesterhci.io/v1.3/host/) are `Active`.

### 5.3 Enable addons

Enable the addons that were disabled in step 3.3 and wait until they are `DeploySuccessful`.
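
They can also be re-enabled from the CLI (a sketch, using the `rancher-monitoring` addon as in the earlier example):

```
harvester$ kubectl patch addons.harvesterhci.io -n cattle-monitoring-system rancher-monitoring --type merge -p '{"spec":{"enabled":true}}'
harvester$ kubectl get addons.harvesterhci.io -A
```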

### 5.4 Start VMs

Start the VMs that were stopped earlier and wait until they are `Running`.
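
A CLI sketch, reusing the example VM `vm1` in the `default` namespace:

```
harvester$ virtctl start vm1 --namespace default
harvester$ kubectl get vmi -A
```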

### 5.5 Restore the Connection to Rancher server

Run the following commands on the Harvester cluster.

1. Set the `management.cattle.io/scale-available` annotation of the `rancher` deployment back to the value recorded above.

This change will re-enable the auto-scaling.

```
harvester$ kubectl edit deployment -n cattle-system rancher
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    ...
    management.cattle.io/scale-available: "3" # the value recorded in the earlier step
    ...
  generation: 16
  labels:
    app: rancher
    app.kubernetes.io/managed-by: Helm
    ...
  name: rancher
  namespace: cattle-system
```
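
As during the shutdown, the annotation can also be restored non-interactively (a sketch, assuming the recorded value was `"3"`):

```
harvester$ kubectl annotate deployment -n cattle-system rancher management.cattle.io/scale-available="3" --overwrite
deployment.apps/rancher annotated
```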

2. Scale up the Rancher deployment.

```
harvester$ kubectl scale deployment -n cattle-system rancher --replicas=3
deployment.apps/rancher scaled
harvester$ kubectl get deployment -n cattle-system rancher
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
rancher   0/0     0            0           33d
```

3. Check the imported Harvester cluster on the Rancher server.

The Harvester cluster should be [active on Rancher Virtualization Management](https://docs.harvesterhci.io/v1.3/rancher/virtualization-management) again.

4. Check the Harvester cluster WebUI

You should be able to access the Harvester WebUI.
Binary file added kb/2024-07-22/imgs/harvester_503_error.png
