From 7d2e60fc1719115762a5a47054e9eabaf4fa98de Mon Sep 17 00:00:00 2001
From: Sheogorath <sheogorath@shivering-isles.com>
Date: Mon, 18 Dec 2023 03:16:50 +0100
Subject: [PATCH] feat(longhorn): Add alert for share-manager problems

Upstream Longhorn 1.5.3 appears to have a problem with RWX volumes: the
share-manager status is not assessed properly, so individual
share-managers are not restarted when they go down. The workaround for
now is to manually patch the affected share-manager objects into an
error state, which triggers Longhorn to repair itself.

This alert fires when that happens, so the problem can be noticed and
fixed manually until a new upstream release resolves it.
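
As a reading aid, the alert expression from the diff below can be split
into its two halves. This is only a reformatted, commented sketch of the
same PromQL, not a different rule; the "expected" and "actual" labels in
the comments are descriptive and are not part of the rule itself.

```promql
# Expected: number of ReadWriteMany PVCs backed by a Longhorn storage class.
count(
  sum by (namespace, persistentvolumeclaim) (
    kube_persistentvolumeclaim_access_mode{access_mode="ReadWriteMany"}
  )
  * on (namespace, persistentvolumeclaim) group_right
  kube_persistentvolumeclaim_info{storageclass=~"longhorn.*"}
)
# Actual: number of share-manager pods running in longhorn-system.
- sum(kube_pod_info{namespace="longhorn-system", pod=~"share-manager-.*"})
# Fire when there are more RWX PVCs than share-manager pods.
> 0
```

Note that if no share-manager pod exists at all, the right-hand sum()
returns an empty result and the whole expression yields no sample, so the
alert only covers the case where some share-managers are still present.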

References:
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
https://github.com/longhorn/longhorn/wiki/Release-Known-Issues
---
 infrastructure/longhorn/monitoring.yaml | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/infrastructure/longhorn/monitoring.yaml b/infrastructure/longhorn/monitoring.yaml
index bf5627dc8..96eed4097 100644
--- a/infrastructure/longhorn/monitoring.yaml
+++ b/infrastructure/longhorn/monitoring.yaml
@@ -103,4 +103,12 @@ spec:
       labels:
         issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
         severity: warning
-
+    - alert: LonghornShareManagerOff
+      annotations:
+        description: Longhorn share manager count is off by {{$value}}. This is likely due to a known bug in Longhorn 1.5.3. See https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
+        summary: Longhorn share manager count is off by {{$value}} for 5m.
+      expr: count(sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_access_mode{access_mode="ReadWriteMany"}) * on (namespace, persistentvolumeclaim) group_right kube_persistentvolumeclaim_info{storageclass=~"longhorn.*"}) - sum(kube_pod_info{namespace="longhorn-system", pod=~"share-manager-.*"}) > 0
+      for: 5m
+      labels:
+        issue: Longhorn share manager count is off by {{$value}} for 5m.
+        severity: critical
-- 
GitLab