CephOSDNearFull

1. Check Description

Severity: Warning

Potential Customer Impact: High

2. Overview

Utilization of back-end storage device OSD has crossed 75% on host <hostname>. Free up some space or expand the storage cluster or contact support.

Detailed Description: One of the OSD storage devices is nearly full.

3. Prerequisites

3.1. Verify cluster access

Ensure you are working in the correct context for the cluster named in the alert; if not, switch to the correct context before proceeding.
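
A quick way to confirm which cluster your current kubeconfig points at (a suggested check, not part of the original SOP) is:
oc config current-context
oc whoami --show-server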

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, please contact: WIP
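
To narrow the list to the alerting cluster (assuming you know its display name; <cluster name> is a placeholder), you can filter the output, for example:
ocm list clusters | grep <cluster name>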

Retrieve the cluster credentials and extract the kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig
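
The command above prints the kubeconfig to stdout. One way to use it (a sketch, assuming you want to keep it in a separate local file rather than merge it into your existing kubeconfig; the file name is arbitrary) is:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig > kubeconfig-alerting-cluster
export KUBECONFIG=$(pwd)/kubeconfig-alerting-cluster
oc whoami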

3.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

WIP: Separate prometheus stack. Verify route.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
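
To check only for this alert (a suggested refinement of the command above; only the jq filter changes):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephOSDNearFull") | { ALERT: .labels.alertname, STATE: .status.state}'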

3.3. (Optional) Document OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox. Check and document the Ceph cluster health as follows.

Step 1: Open a shell in the rook-ceph toolbox pod:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Step 2: From the rsh command prompt, run the following and capture the output:
ceph status
ceph osd status
exit
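
Because this alert concerns OSD utilization, the following additional toolbox commands (a suggested addition, not part of the original SOP) help identify which OSD is nearing full and how much raw capacity remains:
ceph osd df
ceph df

ceph osd df reports per-OSD utilization percentages; ceph df summarizes raw and per-pool usage.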

4. Alert

4.1. Notify Customer: Make changes to resolve the alert

4.1.1. Delete Data

Precondition: the OCS Ceph cluster is NOT in read-only mode. The following instructions apply only to OCS clusters that are near full or full but still writable; read-only mode prevents any changes, including deleting data (i.e. PVC/PV deletions).
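
One way to confirm the cluster is still writable (a suggested check, not part of the original SOP) is to inspect Ceph health from the toolbox; a cluster that has reached its full ratio reports full OSDs and blocks writes:
ceph health detail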

The customer may delete data, and the cluster will resolve the alert through its self-healing processes.

4.1.2. Current size < 1 TB, Expand to 4 TB

WIP: Assess ability to expand. Note that if the OCS nodes are not dedicated (i.e. they are regular worker nodes), it is not possible to guarantee resources between the time the SRE assesses capacity and the time the customer responds.

The customer may increase capacity via the add-on, and the cluster will resolve the alert through its self-healing processes.
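
To see the currently provisioned capacity before advising expansion (a suggested check, not part of the original SOP; field names reflect the OCS StorageCluster CR and may differ between versions), inspect the storageDeviceSets in the StorageCluster resource:
oc get storagecluster -n openshift-storage -o yaml | grep -A 10 storageDeviceSets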

4.1.3. Current size = 4 TB

Please contact Dedicated Support.

(Optional log gathering)

Document Ceph Cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
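
If the output needs to be attached to a support case, it can be collected into a named directory (--dest-dir is a standard oc adm must-gather flag; the directory name here is only an example):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=ocs-must-gather-cephosdnearfull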

5. Troubleshooting

  • Issues encountered while following the SOP

Any issues encountered while following the SOP should be documented here.