CephMdsMissingReplicas

Table of Contents

1. Check Description
2. Overview
3. Prerequisites
4. Alert
- 4.1. Make changes to solve alert
5. Troubleshooting

1. Check Description

Severity: Warning

Potential Customer Impact: High

2. Overview

The minimum required replicas for the storage metadata service are not available. This might affect the working of the storage cluster.

Detailed Description: Minimum required replicas for the storage metadata service (MDS) are not available. MDS is responsible for file metadata. Degradation of the MDS service can affect the working of the storage cluster and should be fixed as soon as possible.

3. Prerequisites

3.1. Verify cluster access

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.

List clusters you have permission to access:

ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list above, please contact: WIP

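Retrieve the cluster credentials and extract the kubeconfig (the command prints it to standard output; save it to a file to use it as your kubeconfig):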

ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig
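
The command above only prints the kubeconfig. A minimal sketch of saving it to a private file follows; the helper name `save_kubeconfig` and the target file name are hypothetical and not part of this SOP.

```shell
# Hypothetical helper: write a kubeconfig read from stdin to the given file.
# The subshell keeps the restrictive umask from leaking into the calling shell.
# Note: umask only applies when the file is newly created.
save_kubeconfig() {
  target=$1
  ( umask 077 && cat > "$target" )
}

# Usage (file name is an assumption; pick any private location):
#   ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
#     | jq -r .kubeconfig | save_kubeconfig "$HOME/.kube/alert-cluster.kubeconfig"
#   export KUBECONFIG="$HOME/.kube/alert-cluster.kubeconfig"
#   oc whoami --show-server   # confirm the API server is the alerting cluster
```

Exporting `KUBECONFIG` for the current shell only avoids overwriting a default kubeconfig that may hold contexts for other clusters.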

3.2. Check Alerts

Get the route to this cluster’s alertmanager:

MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

WIP: Separate prometheus stack. Verify route.

Check all alerts:

curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'

3.3. (Optional) Document OCS Ceph Cluster Health

You may directly check OCS Ceph cluster health by using the rook-ceph toolbox. Check and document the Ceph cluster health:

Step 1: Open a remote shell in the rook-ceph toolbox pod.

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Step 2: From the rsh command prompt, run the following and capture the output.

ceph status
ceph osd status
exit
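
To keep a record of the output without an interactive session, the same commands can be run through `oc rsh` and appended to a file. This is a sketch only; the helper name `document_cmd` and the output file name are hypothetical.

```shell
# Hypothetical helper: run a command and append the command line
# plus its combined stdout/stderr to the given file.
document_cmd() {
  outfile=$1
  shift
  {
    printf '### %s\n' "$*"
    "$@" 2>&1
    printf '\n'
  } >> "$outfile"
}

# Usage (assumes TOOLS_POD was set as shown earlier in this section):
#   document_cmd ceph-health.txt oc rsh -n openshift-storage "$TOOLS_POD" ceph status
#   document_cmd ceph-health.txt oc rsh -n openshift-storage "$TOOLS_POD" ceph osd status
```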

4. Alert

4.1. Make changes to solve alert

Follow the general pod debug workflow outlined below.

pod status: NOT pending, running, but NOT ready → Check readiness probe.

oc describe -n openshift-storage pod/${MYPOD}

pod status: NOT pending, but NOT running → Check for app or image issues.

oc logs -n openshift-storage pod/${MYPOD}
oc describe -n openshift-storage pod/${MYPOD}

pod status: NOT pending, running, ready, no access to app → Start debug workflow for service.

If you reach this step, the pod itself is healthy. Proceed to check the service.

pod status: pending → Check for resource issues, pending pvcs, node assignment, kubelet problems.

oc get pod -n openshift-storage | grep rook-ceph-mds
# Examine the output for a rook-ceph-mds pod that is pending, not running, or not ready
MYPOD=<pod identified as the problem pod>
oc describe -n openshift-storage pod/${MYPOD}

Look for resource limitations or pending pvcs. Otherwise, check for node assignment.

oc get -n openshift-storage pod/${MYPOD} -o wide

If a node was assigned, check kubelet on the node.
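
One way to check the kubelet is to read its journal through `oc adm node-logs`. The sketch below is not part of the original procedure. It assumes the pod is in the openshift-storage namespace, and the helper name `kubelet_logs_for_pod` is hypothetical.

```shell
# Hypothetical helper: print the last N kubelet log lines (default 50)
# from the node that the given pod was scheduled on.
kubelet_logs_for_pod() {
  pod=$1
  lines=${2:-50}
  # .spec.nodeName is empty while the pod has not been scheduled.
  node=$(oc get -n openshift-storage "pod/$pod" -o jsonpath='{.spec.nodeName}')
  if [ -z "$node" ]; then
    echo "no node assigned to $pod" >&2
    return 1
  fi
  oc adm node-logs "$node" -u kubelet | tail -n "$lines"
}

# Usage:
#   kubelet_logs_for_pod "${MYPOD}" 100
```

If no node is assigned, the function returns non-zero, which points back to scheduling causes such as resource limits or pending PVCs.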

Get pod status:

oc project openshift-storage
oc get pod | grep rook-ceph-mds
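
The pod debug workflow above can be applied to this output mechanically. The following is a rough sketch, assuming the default `oc get pod --no-headers` column order (NAME, READY, STATUS, ...); the function name `classify_mds_pods` is hypothetical.

```shell
# Hypothetical helper: read `oc get pod --no-headers` lines on stdin and
# print each rook-ceph-mds pod with the workflow branch that applies to it.
classify_mds_pods() {
  awk '$1 ~ /^rook-ceph-mds/ {
    split($2, ready, "/")   # READY column, e.g. "0/1"
    if ($3 == "Pending")
      action = "check resources, pending PVCs, node assignment, kubelet"
    else if ($3 != "Running")
      action = "check logs and describe output for app or image issues"
    else if (ready[1] != ready[2])
      action = "check readiness probe"
    else
      action = "pod ok; check the service"
    printf "%s\t%s\n", $1, action
  }'
}

# Usage:
#   oc get pod -n openshift-storage --no-headers | classify_mds_pods
```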

(Optional log gathering)

Gather OCS logs and diagnostic data with must-gather:

oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6

5. Troubleshooting

Issue encountered while following the SOP

Any issues while following the SOP should be documented here.