feat(longhorn): Add alert for share-manager problems

Upstream Longhorn 1.5.3 appears to have a bug with RWX volumes: the share-manager status isn't assessed properly, so individual share-managers are not restarted when they go down. The current workaround is to manually patch the affected share-manager objects into an error state, which triggers Longhorn to repair itself. This alert fires when that situation occurs, so the problem can be fixed manually until a new upstream release resolves it.

References:
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
https://github.com/longhorn/longhorn/wiki/Release-Known-Issues

Verified Commit 7d2e60fc authored 1 year ago by Sheogorath
--- a/infrastructure/longhorn/monitoring.yaml
+++ b/infrastructure/longhorn/monitoring.yaml
@@ -103,4 +103,12 @@ spec:
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
+    - alert: LonghornShareManagerOff
+      annotations:
+        description: Longhorn share manager count is off by {{$value}}. This is likely due to a recent bug in Longhorn. https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
+        summary: Longhorn share manager count is off by {{$value}} for 5m.
+      expr: count(sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_access_mode{access_mode="ReadWriteMany"}) * on (namespace, persistentvolumeclaim) group_right kube_persistentvolumeclaim_info{storageclass=~"longhorn.*"}) - sum(kube_pod_info{namespace="longhorn-system", pod=~"share-manager-.*"}) > 0
+      for: 5m
+      labels:
+        issue: Longhorn share manager count is off by {{$value}} for 5m.
+        severity: critical
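
The arithmetic behind the alert expression can be sketched in Python. This is only an illustrative model, not part of the commit: the function name `share_manager_gap` and its list arguments are hypothetical stand-ins for the two sides of the PromQL subtraction (the count of ReadWriteMany PVCs on a `longhorn.*` storage class, and the sum over `share-manager-.*` pods in `longhorn-system`). It also models one PromQL property worth keeping in mind: an aggregation over an empty vector returns no sample, so the expression yields nothing when either side has no series.

```python
from typing import List, Optional


def share_manager_gap(rwx_longhorn_pvcs: List[str],
                      share_manager_pods: List[str]) -> Optional[int]:
    """Model of `count(<RWX Longhorn PVCs>) - sum(<share-manager pods>) > 0`.

    Returns the positive gap that would fire the alert, or None when the
    expression would produce no sample (hypothetical helper for illustration).
    """
    # count()/sum() over an empty vector return no sample in PromQL, and a
    # binary operation with an empty side is empty too; model that as None.
    if not rwx_longhorn_pvcs or not share_manager_pods:
        return None

    # kube_pod_info has the value 1 per pod, so the sum equals the pod count.
    gap = len(rwx_longhorn_pvcs) - len(share_manager_pods)

    # The trailing `> 0` filter drops the sample unless pods are missing.
    return gap if gap > 0 else None


if __name__ == "__main__":
    pvcs = ["default/data-a", "default/data-b", "default/data-c"]
    pods = ["share-manager-pvc-a", "share-manager-pvc-b"]
    print(share_manager_gap(pvcs, pods))
```

Under this model, a missing share-manager for one of three RWX volumes gives a gap of 1, which matches the `{{$value}}` shown in the alert text. The `None` branch suggests that the alert would stay silent if no share-manager pod exists at all; whether that case matters in this cluster is an open question rather than something the commit states.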