From 9b644d04ebc3946e21005386cbd8230be495de81 Mon Sep 17 00:00:00 2001
From: Vicente Cheng
Date: Wed, 31 Jan 2024 14:50:08 +0800
Subject: [PATCH] KB: add the article for potential risk with fstrim

 - Also mentioned how to avoid this risk.

Signed-off-by: Vicente Cheng
Co-authored-by: Kiefer Chang
---
 .../the_potential_risk_with_fstrim.md | 44 +++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 kb/2024-01-30/the_potential_risk_with_fstrim.md

diff --git a/kb/2024-01-30/the_potential_risk_with_fstrim.md b/kb/2024-01-30/the_potential_risk_with_fstrim.md
new file mode 100644
index 00000000..c24337c3
--- /dev/null
+++ b/kb/2024-01-30/the_potential_risk_with_fstrim.md
@@ -0,0 +1,44 @@
+---
+title: The potential risk with fstrim
+description: The potential risk with fstrim and how to avoid it
+slug: the_potential_risk_with_fstrim
+authors:
+  - name: Vicente Cheng
+    title: Senior Software Engineer
+    url: https://github.com/Vicente-Cheng
+    image_url: https://github.com/Vicente-Cheng.png
+tags: [harvester, rancher integration, longhorn, fstrim]
+hide_table_of_contents: false
+---
+
+`fstrim` is a common way to release unused filesystem space. However, there is a known issue with running `fstrim` on a Longhorn volume. This article describes the potential risk and how to avoid it.
+
+The known issue is that running `fstrim` on a Longhorn volume may result in I/O errors if the volume is rebuilding. See the following issues for more details:
+ - https://github.com/harvester/harvester/issues/4739
+ - https://github.com/longhorn/longhorn/issues/7103
+
+## The potential risk and impact of fstrim
+
+If you hit the known issue described above, the volume returns I/O errors, and the VM that uses the volume gets stuck. If the VM is critical, the application running on it becomes unavailable. For example, Harvester usually uses Longhorn volumes as VM disks. After hitting this issue, the VM flaps between the paused and running states until the volume rebuild completes.
+
+This does not affect data integrity, but it can still alarm users: the VM hangs and the application becomes unavailable. Consider the guest Kubernetes cluster scenario. When a VM is unavailable, the etcd member running on it is unavailable too. If half or more of the etcd members are unavailable, etcd loses quorum and the Kubernetes cluster becomes unavailable, along with any services running on it.
+
+## How to avoid the potential risk
+
+The way to avoid the potential risk is to disable `fstrim` in VMs. `fstrim` is enabled by default on various modern Linux distributions.
+You can check the following items to find where `fstrim` may be triggered.
+
+:::note
+The following items apply to VMs that use Longhorn volumes, where `fstrim` can cause the issue described above.
+:::
+
+ - Check the `fstrim.timer` unit. You can **disable** it or **edit** the unit file so that `fstrim` does not run at almost the same time on every VM.
+
+   Check the following section and modify it to spread out the `fstrim` timing.
+   ```
+   [Timer]
+   OnCalendar=weekly
+   AccuracySec=1h
+   Persistent=true
+   RandomizedDelaySec=6000
+   ```
\ No newline at end of file
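
The timer change described in the article can also be applied as a systemd drop-in instead of editing the packaged unit file. The following is a minimal sketch, assuming the guest runs systemd and ships the stock `fstrim.timer`. The drop-in path is an assumption (it is the location `systemctl edit fstrim.timer` normally creates), and the delay value is simply the one used in the article, not a tuned recommendation.

```
# /etc/systemd/system/fstrim.timer.d/override.conf
# Assumed drop-in location; `systemctl edit fstrim.timer` creates it by default.
[Timer]
# Delay each run by a random amount between 0 and 6000 seconds (100 minutes),
# so VMs that share the weekly schedule do not all trim at the same moment.
RandomizedDelaySec=6000
```

If the file is created by hand, run `systemctl daemon-reload` so systemd picks it up. To turn the timer off entirely, `systemctl disable --now fstrim.timer` stops it and keeps it from starting at boot, and `systemctl list-timers fstrim.timer` shows when it is next scheduled to run.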