diff --git a/kb/2024-01-30/the_potential_risk_with_fstrim.md b/kb/2024-01-30/the_potential_risk_with_fstrim.md
index 5cbf1bfb..1150e1ab 100644
--- a/kb/2024-01-30/the_potential_risk_with_fstrim.md
+++ b/kb/2024-01-30/the_potential_risk_with_fstrim.md
@@ -11,42 +11,39 @@ tags: [harvester, rancher integration, longhorn, fstrim]
 hide_table_of_contents: false
 ---

-The `fstrim` is the common way to release the unused space of the filesystem. However, we encounter the known issue with `fstrim` on the Longhorn volume. This article shares the potential risk with `fstrim` and how to avoid it.
+Using fstrim is a common way to release unused space in a mounted filesystem. However, this utility is known to cause IO errors when used with Longhorn volumes that are rebuilding. For more information about the errors, see the following issues:

-The known issue is that executing the `fstrim` on the Longhorn volume may result in IOErrors if the volume is rebuilding. Related issue: (You can find more details in the issues)
- - https://github.com/harvester/harvester/issues/4739
- - https://github.com/longhorn/longhorn/issues/7103
+- Harvester: [Issue 4739](https://github.com/harvester/harvester/issues/4739)
+- Longhorn: [Issue 7103](https://github.com/longhorn/longhorn/issues/7103)

-## The potential risk and affection with fstrim
+## Risks Associated with fstrim Usage

-If you encounter the known issue on the above, that will result in the IOErrors. The IOErrors will cause the VM that uses this volume to be stuck. If the VM is critical, it will cause the application to be unavailable. For example, Harvester usually uses the Longhorn volume as the VM disk. After encountering this issue, the VM will flap in pause and running state until the volume rebuild is completed.
+A consequence of the IO errors caused by fstrim is that VMs using affected Longhorn volumes become stuck. If a VM is in critical condition, running applications become unavailable. This is significant because Harvester typically uses Longhorn volumes as VM disks. The IO errors will cause VMs to flap between running and paused states until volume rebuilding is completed.

-That does not affect the data integrity, but it will cause some panic issues for users. It caused the VM to hang, and the application will be unavailable. Consider the guest Kubernetes cluster scenario. When the VM is unavailable, it means the etcd service is not available. If half of the etcd service is unavailable, the Kubernetes cluster will be unavailable. Meanwhile, any services running on this Kubernetes cluster will be unavailable.
+Although the described system behavior does not affect data integrity, it might induce panic in some users. Consider the guest Kubernetes cluster scenario. In a stuck VM, the etcd service is unavailable. The effects of this failure cascade from the Kubernetes cluster becoming unavailable to services running on the cluster becoming unavailable.

-## How to avoid the potential risk
+## Risk Mitigation

-The way to avoid the potential risk is to disable the `fstrim` in VMs. The `fstrim` is enabled by default on various modern Linux distributions.
-You can check the following items for the potential `fstrim`.
+One way to mitigate the described risks is to disable fstrim in VMs. fstrim is enabled by default in many modern Linux distributions.
+You can determine if fstrim is enabled in VMs that use affected Longhorn volumes by checking the following:

-:::note
-The following items are for VMs that use the Longhorn volume, so `fstrim` will cause the above issue.
-:::
+ - `/etc/fstab`: Some root filesystems mount with the *discard* option.

- - Check the `/etc/fstab`. Some root filesystem will mount with `discard` option. Like following:
+    Example:

     ```
     /dev/mapper/rootvg-rootlv / xfs defaults,discard 0 0
     ```

-    You can remove the `discard` option to disable the `fstrim` on the root filesystem.
+    You can disable fstrim on the root filesystem by removing the *discard* option.

     ```
     /dev/mapper/rootvg-rootlv / xfs defaults 0 0 <-- remove the discard option
     ```

-    After removing the `discard` option, you can remount the root filesystem via `mount -o remount /` or just reboot the VM.
+    After removing the *discard* option, you can remount the root filesystem using the command `mount -o remount /` or by rebooting the VM.

- - Check the service `fstrim.timer`. You can **disable** it or **edit** the service file to make the `fstrim` does not execute almost simultaneously.
+ - `fstrim.timer`: When this service is enabled, fstrim executes weekly by default. You can either disable the service or edit the service file to prevent simultaneous fstrim execution on VMs.

-    Please check the following section and modify it to distribute the `fstrim` timing.
+    You can modify the values in the following section to force fstrim to execute at different times.

     ```
     [Timer]
     OnCalendar=weekly