CephNodeDown

1. Check Description

Severity: Error

Potential Customer Impact: High

2. Overview

A storage node went down. Please check the node immediately. The alert should contain the node name.

3. Prerequisites

3.1. Verify cluster access

Make sure you are working in the correct context for the cluster named in the alert; if not, change context before proceeding.

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list, please contact: WIP

Grab your credentials and place them in a kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig
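
The command above only prints the kubeconfig; a minimal sketch for saving it locally and pointing oc at it (the file name kubeconfig-<cluster id> is just an example):
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig > kubeconfig-<cluster id>
export KUBECONFIG=$PWD/kubeconfig-<cluster id>
oc whoami --show-server

Confirm the server URL printed by oc whoami matches the alerting cluster before proceeding.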

3.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

WIP: Separate prometheus stack. Verify route.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
  https://${MYALERTMANAGER}/api/v1/alerts \
  | jq '.data[] | select(.labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state }'
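
To narrow the output to this alert specifically, a variation of the same query (assuming the alert exposes the node name via a label such as node; the exact label name may differ):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
  https://${MYALERTMANAGER}/api/v1/alerts \
  | jq '.data[] | select(.labels.alertname == "CephNodeDown") | { NODE: .labels.node, STATE: .status.state }'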

3.3. (Optional) Document OCS Ceph Cluster Health

You may directly check OCS Ceph cluster health by using the rook-ceph toolbox.

Check and document Ceph cluster health:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
From the rsh command prompt, run the following and capture the output:
ceph status
ceph osd status
exit
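
To capture the same output non-interactively for your records, you can also pass the commands straight to oc rsh (the output file name is just an example):
OUT=ceph-status-$(date +%Y%m%d-%H%M).txt
oc rsh -n openshift-storage $TOOLS_POD ceph status > $OUT
oc rsh -n openshift-storage $TOOLS_POD ceph osd status >> $OUT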

4. Alert

4.1. Make changes to solve alert

Document the current OCS pods (running and failing):
oc -n openshift-storage get pods
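
To keep a copy for the record and see which node each pod is scheduled on, a minimal variation (the output file name is just an example):
oc -n openshift-storage get pods -o wide | tee ocs-pods-before.txt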

WIP: Link to node replacement SOP

The OCS resource requirements must be met in order for the osd pods to be scheduled on the new node. This may take a few minutes while the Ceph cluster recovers data for the failing, now recovering, osd.

To watch this recovery in action, first ensure the osd pods were actually scheduled on the new worker node.

Check if the previously failing osd pods are now running:
oc -n openshift-storage get pods
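
To confirm the osd pods actually landed on the new worker node, you can filter by the osd label and show node placement (assuming the standard rook label app=rook-ceph-osd):
oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide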

If the previously failing osd pods have not been scheduled, use describe and check the events for reasons the pods were not rescheduled.

Find the failing osd pod(s):
oc -n openshift-storage get pods | grep osd

Describe the events for a failing osd pod:
oc -n openshift-storage describe pods/<osd podname from previous step>

In the Events section, look for failure reasons, such as resource requirements not being met.
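
If the relevant events have already aged out of describe, you can also query events for the pod directly (a sketch; substitute the actual pod name):
oc -n openshift-storage get events --field-selector involvedObject.name=<osd podname from previous step>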

In addition, you may use the rook-ceph-toolbox to watch the recovery. This step is optional but can be helpful for large Ceph clusters.

Deploy and Access the Toolbox

If the toolbox is not deployed, please follow the directions below.

Deploy Rook-Ceph toolbox if needed:
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
Run the following to rsh to the toolbox pod:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
From the rsh command prompt, run the following and watch for "recovery" under the io section:
ceph status

How long recovery takes will depend on the size of the Ceph cluster. WIP: Note about health status and no recovery for small quick clusters.
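
To follow the recovery without keeping an interactive rsh session open, one option (assuming watch is available on your workstation) is to poll ceph status from outside the pod:
watch -n 30 "oc rsh -n openshift-storage $TOOLS_POD ceph status"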

(Optional log gathering)

Document Ceph Cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
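
If you want the gathered data written to a specific local directory, oc adm must-gather accepts --dest-dir (the directory name here is just an example):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=./cephnodedown-must-gather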

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.