From 7d2e60fc1719115762a5a47054e9eabaf4fa98de Mon Sep 17 00:00:00 2001
From: Sheogorath <sheogorath@shivering-isles.com>
Date: Mon, 18 Dec 2023 03:16:50 +0100
Subject: [PATCH] feat(longhorn): Add alert for share-manager problems

It seems upstream has some problems with RWX volumes in Longhorn 1.5.3,
where the share-manager status isn't properly assessed and, as a result,
share-managers are not restarted when individual ones go down.

The fix for now is to manually patch the share-manager objects with an
error state, which triggers Longhorn to repair itself.

To make the problem noticeable, this alert fires so that it can be fixed
manually until a new upstream release resolves it.

References:
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
https://github.com/longhorn/longhorn/wiki/Release-Known-Issues
---
 infrastructure/longhorn/monitoring.yaml | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/infrastructure/longhorn/monitoring.yaml b/infrastructure/longhorn/monitoring.yaml
index bf5627dc8..96eed4097 100644
--- a/infrastructure/longhorn/monitoring.yaml
+++ b/infrastructure/longhorn/monitoring.yaml
@@ -103,4 +103,12 @@ spec:
       labels:
         issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
         severity: warning
-
+    - alert: LonghornShareManagerOff
+      annotations:
+        description: Longhorn share manager count is off by {{$value}}. This is likely due to a recent bug in Longhorn. https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
+        summary: Longhorn share manager count is off by {{$value}} for 5m.
+      expr: count(sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_access_mode{access_mode="ReadWriteMany"}) * on (namespace, persistentvolumeclaim) group_right kube_persistentvolumeclaim_info{storageclass=~"longhorn.*"}) - sum(kube_pod_info{namespace="longhorn-system", pod=~"share-manager-.*"}) > 0
+      for: 5m
+      labels:
+        issue: Longhorn share manager count is off by {{$value}} for 5m.
+        severity: critical
-- 
GitLab