CephOSDVersionMisMatch

1. Check Description

Severity: Warning

Potential Customer Impact: High

2. Overview

There are {{ $value }} different versions of Ceph OSD components running.

Detailed Description: Typically this alert triggers during an upgrade that is taking a long time.

3. Prerequisites

3.1. Verify cluster access

Check the output of the commands below to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed.

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list above, please contact: WIP
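
If the list is long, one way to narrow it down (a hedged suggestion; <cluster name> is a placeholder for the name shown in the alert) is:
ocm list clusters | grep <cluster name>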

Grab the cluster's credentials and place them in your kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
  | jq -r .kubeconfig
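
The command above only prints the kubeconfig to stdout. A minimal sketch of using it, assuming you are willing to write it to a temporary file such as /tmp/kubeconfig:
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials | jq -r .kubeconfig > /tmp/kubeconfig
export KUBECONFIG=/tmp/kubeconfig
oc config current-context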

3.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

WIP: Separate prometheus stack. Verify route.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
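
To narrow the output to just this alert, a variation on the command above (adjust the alert name if it differs in your environment):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.labels.alertname == "CephOSDVersionMisMatch") | { ALERT: .labels.alertname, STATE: .status.state}'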

3.3. (Optional) Document OCS Ceph Cluster Health

You may directly check OCS Ceph cluster health by using the rook-ceph toolbox.

Check and document Ceph cluster health:

Step 1: Connect to the rook-ceph toolbox pod:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Step 2: From the rsh command prompt, run the following and capture the output:
ceph status
ceph osd status
exit
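
Because this alert concerns mismatched OSD versions, it may also help to record the version breakdown while still at the rsh prompt, before exiting (an optional extra step, not part of the original procedure). The output lists how many of each Ceph daemon, including OSDs, are running each version:
ceph versions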

4. Alert

4.1. Make changes to solve alert

Check if an operator upgrade is in progress.

Checking the OCS operator status involves checking both the operator subscription status and the operator pod health.

4.1.1. OCS Operator Subscription Health

Check the ocs-operator subscription status:
oc get sub ocs-operator -n openshift-storage  -o json | jq .status.conditions

Like all operators, the status condition types are:

CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, InstallPlanFailed

The status for each type should be False. For example:

[
  {
    "lastTransitionTime": "2021-01-26T19:21:37Z",
    "message": "all available catalogsources are healthy",
    "reason": "AllCatalogSourcesHealthy",
    "status": "False",
    "type": "CatalogSourcesUnhealthy"
  }
]

The output above shows a False status for the type CatalogSourcesUnhealthy, meaning the catalog sources are healthy.
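
To view the type and status of all four conditions at a glance, something like the following may help (a hedged variation on the command above):
oc get sub ocs-operator -n openshift-storage -o json | jq '.status.conditions[] | {type, status}'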

4.1.2. OCS Operator Pod Health

Check the OCS operator pod status to see whether an OCS operator upgrade is in progress.

WIP: Find specific status for upgrade (pending?)

To find and view the status of the OCS operator:
oc get pod -n openshift-storage | grep ocs-operator
OCSOP=$(oc get pod -n openshift-storage  -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
echo $OCSOP
oc get pod/${OCSOP} -n openshift-storage
oc describe pod/${OCSOP} -n openshift-storage
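
Another way to tell whether an operator upgrade is in flight (a hedged suggestion, not part of the original procedure) is to check the ClusterServiceVersion phase, which reports values such as Installing or Succeeded:
oc get csv -n openshift-storage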

If you determine that an OCS operator upgrade is in progress, please be patient; wait 5 minutes and this alert should resolve itself.

If you have waited and the alert persists, or you see a different error status condition, please continue troubleshooting.

(Optional log gathering)

Document the Ceph cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
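
To keep the collected data in a known location, you can optionally add --dest-dir (the path below is just an example):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=/tmp/ocs-must-gather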

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues encountered while following the SOP should be documented here.