CephClusterReadOnly

1. Check Description

Severity: Error

Potential Customer Impact: High

2. Overview

Storage cluster utilization has crossed 85%.

Detailed Description: Storage cluster utilization has crossed 85%, and the cluster is becoming read-only. Free up some space or expand the storage cluster immediately. It is common to see OSD full or near-full alerts prior to this alert.
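
If you want to confirm the utilization figure before acting, the rook-ceph toolbox (section 3.3) can report capacity usage directly; this is a quick sanity check, not part of the alert itself:
ceph df        # cluster-wide and per-pool capacity usage
ceph osd df    # per-OSD utilization; shows which OSDs are full or near full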

3. Prerequisites

3.1. Verify cluster access

Verify that your current context points at the cluster named in the alert. If not, change context before proceeding.
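
For example (context names vary by environment):
oc config current-context                # show the context oc is currently using
oc config use-context <alert cluster>    # switch if it does not match the alerting cluster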

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, please contact: WIP

Retrieve your credentials, which include the cluster kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig
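
One way to put the credentials to use is to save the kubeconfig to a file and point oc at it; the path below is only an example:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig > /tmp/alert-cluster.kubeconfig    # example path
export KUBECONFIG=/tmp/alert-cluster.kubeconfig
oc whoami --show-server    # sanity check: should print the alerting cluster's API URL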

3.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
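
Confirm the variable was populated before using it:
echo ${MYALERTMANAGER}    # should print the alertmanager route hostname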

WIP: Separate prometheus stack. Verify route.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
  https://${MYALERTMANAGER}/api/v1/alerts \
  | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
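
To narrow the output to this specific alert, the same endpoint can be filtered on the alert name with a tighter jq expression:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" \
  https://${MYALERTMANAGER}/api/v1/alerts \
  | jq '.data[] | select( .labels.alertname == "CephClusterReadOnly") | { ALERT: .labels.alertname, STATE: .status.state}'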

3.3. (Optional) Document OCS Ceph Cluster Health

You may directly check OCS Ceph cluster health by using the rook-ceph toolbox. Check and document the Ceph cluster health:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
From the rsh command prompt, run the following and capture the output:
ceph status
ceph osd status
exit
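
If you prefer to capture the output without an interactive shell, oc rsh also accepts a command directly (the file names below are just examples):
oc rsh -n openshift-storage $TOOLS_POD ceph status | tee ceph-status.txt
oc rsh -n openshift-storage $TOOLS_POD ceph osd status | tee ceph-osd-status.txt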

4. Alert

4.1. Notify Customer: Make changes to solve the alert

Determine if you can proceed with scaling *UP* or *OUT* for expansion. Choose the appropriate subsection for your deployment below.

4.1.1. Delete Data

The customer may NOT delete data while the cluster is in read-only mode: once the full threshold is reached, Ceph blocks writes, and deletions themselves require write operations.

If the customer wants to delete data after the cluster has become read-only, the resolution procedure is as follows (the commands for each step are given below):

  • Raise the full-ratio threshold

  • Allow the cluster to drop out of read-only

  • Instruct the customer to delete data

  • Restore the original threshold

Deploy and Access the Toolbox

If the toolbox is not deployed, please follow the directions below.

Deploy Rook-Ceph toolbox if needed:
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
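Before rsh-ing in, you can wait for the toolbox pod to become ready:
oc wait -n openshift-storage --for=condition=Ready pod -l app=rook-ceph-tools --timeout=90s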
Run the following to rsh to the toolbox pod:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Raise the full threshold:
ceph osd set-full-ratio 0.9
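To confirm the new ratio took effect, inspect the OSD map from the toolbox shell (the same check can be repeated after the ratio is restored below):
ceph osd dump | grep full_ratio    # should now show full_ratio 0.9 (near/backfill ratios are also listed)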
Watch the Ceph cluster come out of read-only mode:
ceph health

Notify the customer once the cluster is healthy. When the customer is done deleting data, return the threshold to the previous value.

Return the threshold to its previous value (0.85 is the default):
ceph osd set-full-ratio 0.85

Do not forget to exit the toolbox shell:

exit

4.1.2. Current size < 1 TB, Expand to 4 TB

WIP: Assess ability to expand. Note: if the OCS nodes are not dedicated (i.e., they are regular worker nodes), it is not possible to guarantee that resources remain available between the time SRE assesses capacity and the time the customer responds.
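
To assess the current size, one option is to read the requested capacity from the StorageCluster resource (a sketch; the jsonpath assumes the default OCS device set layout):
oc get storagecluster -n openshift-storage \
  -o jsonpath='{.items[0].spec.storageDeviceSets[0].dataPVCTemplate.spec.resources.requests.storage}'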

The customer may increase capacity via the addon, and the cluster will resolve the alert through its self-healing processes.

4.1.3. Current size = 4 TB

Please contact Dedicated Support.

(Optional) Log gathering

Document Ceph Cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.