CephOSDCriticallyFull
2. Overview
Utilization of a back-end storage device (OSD) has crossed 85%. Immediately free up space, expand the storage cluster, or contact support.
Detailed Description: One of the OSD storage devices is critically full. Expand the cluster immediately.
3. Prerequisites
3.1. Verify cluster access
List the clusters available to you and check the output to confirm you are in the correct context for the cluster named in the alert. If not, change context before proceeding.
ocm list clusters
From the list above, find the cluster id of the cluster named in the alert. If the alerting cluster is not in the list, contact: WIP. Otherwise, retrieve the kubeconfig for that cluster:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
| jq -r .kubeconfig
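The command above only prints the kubeconfig to standard output. The following is a minimal sketch of one way to store it in a private file and point `oc` at it. The helper name `save_kubeconfig` and the default path under `$HOME/.kube` are assumptions for illustration, not part of the documented procedure.

```shell
# Hypothetical helper: fetch the kubeconfig for one cluster and store it in a
# file readable only by the current user. Name and default path are assumptions.
save_kubeconfig() {
  local cluster_id="$1"
  local out="${2:-$HOME/.kube/${cluster_id}.kubeconfig}"
  mkdir -p "$(dirname "$out")"
  (
    # Restrictive umask inside a subshell so the credentials file is private
    # without changing the umask of the calling shell.
    umask 077
    ocm get "/api/clusters_mgmt/v1/clusters/${cluster_id}/credentials" \
      | jq -r .kubeconfig > "$out"
  )
  # Print the path only if something was actually written.
  [ -s "$out" ] && printf '%s\n' "$out"
}

# Usage (replace <cluster id> with the id found via `ocm list clusters`):
#   export KUBECONFIG="$(save_kubeconfig <cluster id>)"
#   oc whoami --show-server   # confirm the API server belongs to the alerting cluster
```

Writing to a dedicated file avoids overwriting an existing default kubeconfig and makes it easy to remove the credentials once the alert is handled.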
3.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
WIP: Separate prometheus stack. Verify route.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
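The query above lists every alert known to Alertmanager. As a sketch, the same response can be narrowed to the alert being handled. The helper name `alerts_named` is hypothetical, and it assumes the response has the same `.data[].labels.alertname` and `.data[].status.state` fields that the command above already relies on.

```shell
# Hypothetical helper: read an Alertmanager /api/v1/alerts response on stdin
# and print one "<alertname> <state>" line per alert whose name matches $1.
alerts_named() {
  jq -r --arg name "$1" \
    '.data[] | select(.labels.alertname == $name) | "\(.labels.alertname) \(.status.state)"'
}

# Usage, assuming MYALERTMANAGER is set as shown above:
#   curl -sk -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
#     "https://${MYALERTMANAGER}/api/v1/alerts" | alerts_named CephOSDCriticallyFull
```

No output means the alert is not currently present in Alertmanager, which is worth noting before contacting the customer.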
3.3. (Optional) Document OCS Ceph Cluster Health
You may check OCS Ceph cluster health directly by using the rook-ceph toolbox.
Check and document Ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
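For documentation purposes it can help to capture which OSDs Ceph itself reports as full, without an interactive session. The sketch below filters `ceph health detail` output. The helper name `full_osds` is hypothetical, and the matched wording ("osd.N is near full", "is backfill full", "is full") is an assumption about the health detail text that may differ between Ceph releases.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print only
# the phrases that name an OSD as near full, backfill full, or full.
# The matched wording is an assumption and may vary by Ceph release.
full_osds() {
  grep -oE 'osd\.[0-9]+ is (near full|backfill full|full)'
}

# Usage, assuming TOOLS_POD is set as in the steps above:
#   oc rsh -n openshift-storage "$TOOLS_POD" ceph health detail | full_osds
```

If the filter prints nothing while the alert is firing, fall back to reading the full `ceph status` and `ceph osd status` output collected above.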
4. Alert
4.1. Notify Customer: Make changes to solve the alert
4.1.1. Delete Data
OCS CEPH CLUSTER IS NOT IN READONLY MODE. The following instructions apply only to OCS clusters that are near full or full but NOT in readonly mode. Readonly mode would prevent any changes, including deleting data (i.e. PVC/PV deletions).
The customer may delete data, and the cluster will resolve the alert through its self-healing processes.
4.1.2. Current size < 1 TB, Expand to 4 TB
WIP: Assess ability to expand. Note: if OCS nodes are not dedicated (that is, they are regular worker nodes), resources cannot be guaranteed between the time SRE assesses capacity and the time the customer responds.
The customer may increase capacity via the addon, and the cluster will resolve the alert through its self-healing processes.