Verified commit 7d2e60fc, authored by Sheogorath

feat(longhorn): Add alert for share-manager problems

It seems upstream has a problem with RWX volumes in Longhorn 1.5.3:
share-manager status isn't assessed properly, so a share-manager that
goes down is not restarted. The workaround for now is to manually patch
the affected share-manager objects into an error state, which triggers
Longhorn to repair itself.

This alert fires when that happens, so the problem gets noticed and can
be fixed manually until an upstream release with a proper fix is
available.

References:
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
https://github.com/longhorn/longhorn/wiki/Release-Known-Issues
parent 60e7a1d9
@@ -103,4 +103,12 @@ spec:
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
    - alert: LonghornShareManagerOff
      annotations:
        description: Longhorn share manager count is off by {{$value}}. This is likely due to a recent bug in Longhorn. https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
        summary: Longhorn share manager count is off by {{$value}} for 5m.
      expr: count(sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_access_mode{access_mode="ReadWriteMany"}) * on (namespace, persistentvolumeclaim) group_right kube_persistentvolumeclaim_info{storageclass=~"longhorn.*"}) - sum(kube_pod_info{namespace="longhorn-system", pod=~"share-manager-.*"}) > 0
      for: 5m
      labels:
        issue: Longhorn share manager count is off by {{$value}} for 5m.
        severity: critical
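
The expression above compares the number of ReadWriteMany PVCs on a Longhorn storage class with the number of share-manager pods in `longhorn-system`, and fires when the PVC count has been higher for 5 minutes. One way to sanity-check that reading is a `promtool` rule unit test. The following is a minimal sketch, not part of the commit. It assumes the `groups:` list from the PrometheusRule `spec` has been copied into a plain rule file, here hypothetically named `longhorn-rules.yaml`. The namespace `demo`, the PVC names and the pod name are made up for illustration.

```yaml
# Hypothetical promtool unit test; file names and series labels are assumptions.
rule_files:
  - longhorn-rules.yaml   # plain Prometheus rule file extracted from spec.groups

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Two RWX PVCs backed by a Longhorn storage class ...
      - series: 'kube_persistentvolumeclaim_access_mode{namespace="demo", persistentvolumeclaim="data-a", access_mode="ReadWriteMany"}'
        values: '1x10'
      - series: 'kube_persistentvolumeclaim_info{namespace="demo", persistentvolumeclaim="data-a", storageclass="longhorn"}'
        values: '1x10'
      - series: 'kube_persistentvolumeclaim_access_mode{namespace="demo", persistentvolumeclaim="data-b", access_mode="ReadWriteMany"}'
        values: '1x10'
      - series: 'kube_persistentvolumeclaim_info{namespace="demo", persistentvolumeclaim="data-b", storageclass="longhorn"}'
        values: '1x10'
      # ... but only one share-manager pod is present.
      - series: 'kube_pod_info{namespace="longhorn-system", pod="share-manager-pvc-a"}'
        values: '1x10'
    alert_rule_test:
      # After 10 minutes the 5m "for" window has elapsed and the difference is 2 - 1 = 1.
      - eval_time: 10m
        alertname: LonghornShareManagerOff
        exp_alerts:
          - exp_labels:
              severity: critical
              issue: "Longhorn share manager count is off by 1 for 5m."
            exp_annotations:
              summary: "Longhorn share manager count is off by 1 for 5m."
              description: "Longhorn share manager count is off by 1. This is likely due to a recent bug in Longhorn. https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359"
```

Such a file would be run with `promtool test rules <test-file>`. Note that if no share-manager pod exists at all, `sum(kube_pod_info{...})` returns an empty result and the subtraction yields nothing, so the alert as written stays silent in that case.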