CephOSDDiskNotResponding
2. Overview
Disk device {{ $labels.device }} not responding on host {{ $labels.host }}.
Detailed Description: A disk device is not responding. Check alert details for disk and host. Replace disk if necessary.
3. Prerequsities
3.1. Verify cluster access
Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
List clusters you have permission to access:
ocm list clusters
From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list above please contact: WIP
Grab your credentials and place in kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
| jq -r .kubeconfig
3.2. Check Alerts
Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
WIP: Separate prometheus stack. Verify route.
Check all alerts
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
3.3. (Optional) Document OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox. .Check and document ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output.
ceph status
ceph osd status
exit