CephClusterWarningState
2. Overview
Storage cluster is in warning state for more than 10m.
The rook-ceph-mgr job (Prometheus) has been in a warning state for an unacceptable amount of time. Check for other alerts that would have triggered prior to this one and troubleshoot those alerts first.
3. Prerequisites
3.1. Verify cluster access
ocm list clusters
From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list above please contact: WIP
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
| jq -r .kubeconfig
Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
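Once the credentials have been retrieved, a quick sanity check of the active context might look like the following sketch. The local `kubeconfig` filename is an assumption for illustration, not mandated by this SOP:

```shell
# Save the retrieved kubeconfig locally (filename is illustrative)
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig > kubeconfig
export KUBECONFIG=$PWD/kubeconfig

# Confirm the active context matches the alerting cluster
oc config current-context
```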
3.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
WIP: Separate prometheus stack. Verify route.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
  https://${MYALERTMANAGER}/api/v1/alerts \
  | jq '.data[] | select(.labels.alertname) | {ALERT: .labels.alertname, STATE: .status.state}'
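The full alert list can be noisy. A variation of the jq filter above that keeps only actively firing alerts is sketched below; the sample JSON is illustrative, the real input is the curl response above:

```shell
# Sample shape of the /api/v1/alerts response (trimmed, illustrative)
ALERTS='{"data":[{"labels":{"alertname":"CephClusterWarningState"},"status":{"state":"active"}},{"labels":{"alertname":"Watchdog"},"status":{"state":"suppressed"}}]}'

# Keep only alerts whose state is "active"
echo "$ALERTS" | jq -r '.data[] | select(.status.state=="active") | .labels.alertname'
# → CephClusterWarningState
```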
3.3. (Optional) Document OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox. Check and document Ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
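The same checks can be run without an interactive rsh session by using `oc exec`. This is a sketch based on the toolbox commands above; the `${TOOLS_POD#pod/}` expansion strips the `pod/` prefix that `-o name` adds:

```shell
# Locate the toolbox pod (same selector as above); -o name yields "pod/<name>"
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)

# Run the health checks non-interactively, stripping the "pod/" prefix
oc exec -n openshift-storage "${TOOLS_POD#pod/}" -- ceph status
oc exec -n openshift-storage "${TOOLS_POD#pod/}" -- ceph osd status
```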
4. Alert
4.1. Make changes to solve alert
General troubleshooting will be required in order to determine the cause of this alert. This alert will trigger along with other (usually many other) alerts. Please view and troubleshoot the other alerts first.
oc project openshift-storage
oc get pod | grep rook-ceph-mgr
Examine the output for a rook-ceph-mgr pod that is in a pending, not running, or not ready state.
MYPOD=<pod identified as the problem pod>
oc describe pod/${MYPOD}
Look for resource limitations or pending PVCs. Otherwise, check for node assignment.
oc get pod/${MYPOD} -o wide
If a node was assigned, check kubelet on the node. Check the pod logs for further detail:
oc logs pod/${MYPOD}
If you are at this step, then the pod is ok. Proceed to check the service.
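The pod inspection steps above can be condensed into a small filter that surfaces only problem pods. This is a sketch, not part of the SOP: it assumes an authenticated `oc` session and the default `get pods` column layout (NAME, READY, STATUS, ...):

```shell
# List rook-ceph-mgr pods that are not both Running and fully Ready
oc -n openshift-storage get pods --no-headers \
  | awk '/rook-ceph-mgr/ {
      split($2, ready, "/")                          # READY column, e.g. "0/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
```

Any pod name printed is a candidate for the `MYPOD` variable used in the describe/logs steps above.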
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6