CephClusterReadOnly
2. Overview
Storage cluster utilization has crossed 85%.
Detailed Description: Storage cluster utilization has crossed 85%, and the cluster will now become read-only. Free up space or expand the storage cluster immediately. It is common to see alerts related to OSD devices being full or near-full prior to this alert.
3. Prerequisites
3.1. Verify cluster access
Check the output of the command below to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context before proceeding.
ocm list clusters
From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list, please contact: WIP. Then retrieve the kubeconfig for that cluster:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
| jq -r .kubeconfig
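For example, a minimal sketch of saving the kubeconfig to a local file and pointing oc at it (the /tmp path is an example; replace <cluster id> with the id found above):
CLUSTER_ID=<cluster id>
ocm get /api/clusters_mgmt/v1/clusters/${CLUSTER_ID}/credentials | jq -r .kubeconfig > /tmp/kubeconfig-${CLUSTER_ID}
export KUBECONFIG=/tmp/kubeconfig-${CLUSTER_ID}
oc cluster-info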
3.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
WIP: Separate Prometheus stack. Verify route.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state }'
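To check specifically for this alert, you can filter on the alert name (a sketch reusing the route and token from above):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.labels.alertname == "CephClusterReadOnly")'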
3.3. (Optional) Document OCS Ceph Cluster Health
You may directly check OCS Ceph cluster health by using the rook-ceph toolbox. Check and document the Ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
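To capture this for the case record without an interactive session, the same commands can be run non-interactively (a sketch; the output path is an example):
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD sh -c 'ceph status; ceph osd status' > /tmp/ceph_health.txt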
4. Alert
4.1. Notify Customer: Make changes to resolve the alert
Determine if you can proceed with scaling *UP* or *OUT* for expansion. Choose the appropriate section for your deployment below.
4.1.1. Delete Data
The customer may NOT delete data while the cluster is in read-only mode.
If the customer wants to delete data after the cluster has entered read-only mode, the resolution procedure is:
- Raise the threshold for read-only
- Allow the cluster to drop out of read-only
- Instruct the customer to delete data
- Restore the original thresholds
This is essentially the same procedure described in "Scaling up (adding new disks) OCS 4 fails due to cluster state being unhealthy because of lack of capacity".
Deploy and Access the Toolbox
If the toolbox is not deployed, please follow the directions below.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
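The toolbox pod may take a moment to start. A sketch of waiting for it to become ready:
oc wait --for=condition=Ready pod -n openshift-storage -l app=rook-ceph-tools --timeout=120s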
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
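Raise the full ratio threshold so the cluster can drop out of read-only: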
ceph osd set-full-ratio 0.9
ceph health
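To verify the new threshold took effect, you can inspect the OSD map while still inside the toolbox (a sketch):
ceph osd dump | grep full_ratio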
Notify the customer once the cluster is healthy. When the customer is done deleting data, return the threshold to the previous value.
ceph osd set-full-ratio 0.85
Do not forget to exit the toolbox pod:
exit
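After restoring the threshold, you can re-run the alert check from Section 3.2 to confirm the CephClusterReadOnly alert has cleared.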
4.1.2. Current size < 1 TB, Expand to 4 TB
WIP: Assess ability to expand. Note: if OCS nodes are not dedicated (i.e., they are regular worker nodes), it is not possible to guarantee resources between the time SRE assesses capacity and the customer responds.
The customer may increase capacity via the addon, and the cluster will resolve the alert through its self-healing processes.
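Once the customer has expanded capacity, a sketch of watching the cluster recover (the StorageCluster resource and the app=rook-ceph-osd label are OCS defaults and assumptions here):
oc get storagecluster -n openshift-storage
oc get pods -n openshift-storage -l app=rook-ceph-osd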