CephOSDVersionMisMatch
2. Overview
There are {{ $value }} different versions of Ceph OSD components running.
Detailed Description: This alert typically triggers during an upgrade that is taking a long time.
3. Prerequisites
3.1. Verify cluster access
List the clusters and check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
ocm list clusters
From the list above, find the cluster id of the cluster named in the alert. If you do not see the alerting cluster in the list, please contact: WIP
ocm get /api/clusters_mgmt/v1/clusters/<cluster id>/credentials \
| jq -r .kubeconfig
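The credentials call above can be wrapped in a small helper that writes the kubeconfig to a file and switches context to it. This is a sketch: `get_kubeconfig` is an illustrative function name, not an ocm subcommand, and it assumes you have already run `ocm login` and have jq installed.

```shell
#!/usr/bin/env bash
# Illustrative wrapper around the ocm credentials call above.
# Assumes an active "ocm login" session and jq on the PATH.
get_kubeconfig() {
    local cluster_id="$1"
    ocm get "/api/clusters_mgmt/v1/clusters/${cluster_id}/credentials" \
        | jq -r .kubeconfig
}

# Usage:
#   get_kubeconfig <cluster id> > /tmp/alert-cluster.kubeconfig
#   export KUBECONFIG=/tmp/alert-cluster.kubeconfig
```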
3.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
WIP: Separate prometheus stack. Verify route.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
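To narrow that output to alerts that are actually firing, the jq filter can additionally select on the alert state. The payload below is a fabricated sample for illustration only; in practice, pipe the curl output from above into the same filter.

```shell
# Fabricated sample of an Alertmanager /api/v1/alerts response.
SAMPLE='{"data":[
  {"labels":{"alertname":"CephOSDVersionMisMatch"},"status":{"state":"active"}},
  {"labels":{"alertname":"Watchdog"},"status":{"state":"suppressed"}}]}'

# Keep only alerts in the "active" (firing) state.
echo "$SAMPLE" | jq '.data[] | select(.status.state == "active")
  | { ALERT: .labels.alertname, STATE: .status.state }'
```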
3.3. (Optional) Document OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox. Check and document the Ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
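Since this alert is about OSD version skew, the toolbox session can also confirm the mismatch itself with `ceph versions`, which reports how many daemons run each version; more than one entry under `osd` means the OSDs are split across versions. The JSON below is a fabricated, truncated sample of that command's output.

```shell
# Fabricated, truncated sample of "ceph versions -f json" output.
SAMPLE_VERSIONS='{"osd":{"ceph version 14.2.8 nautilus (stable)":2,
                         "ceph version 14.2.11 nautilus (stable)":1}}'

# In the toolbox pod you would run:
#   ceph versions -f json | jq '.osd | length'
# A result greater than 1 matches the condition this alert describes.
echo "$SAMPLE_VERSIONS" | jq '.osd | length'
```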
4. Alert
4.1. Make changes to solve alert
Check if an operator upgrade is in progress.
Checking the OCS operator status involves inspecting both the operator subscription status and the operator pod health.
4.1.1. OCS Operator Subscription Health
oc get sub ocs-operator -n openshift-storage -o json | jq .status.conditions
As with all operators, the status condition types are:
CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, InstallPlanFailed
The status for each type should be False. For example:
[
{
"lastTransitionTime": "2021-01-26T19:21:37Z",
"message": "all available catalogsources are healthy",
"reason": "AllCatalogSourcesHealthy",
"status": "False",
"type": "CatalogSourcesUnhealthy"
}
]
The output above shows a False status for the type CatalogSourcesUnhealthy, meaning the catalog sources are healthy.
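A quick way to verify all of the condition types at once is a jq predicate over the conditions array. It is shown here against a fabricated two-condition sample; in practice, pipe in the `oc get sub` output from above instead.

```shell
# Fabricated sample of .status.conditions; substitute the real
# "oc get sub ocs-operator ... | jq .status.conditions" output.
CONDITIONS='[{"status":"False","type":"CatalogSourcesUnhealthy"},
             {"status":"False","type":"InstallPlanFailed"}]'

# Prints true (and exits 0) only when every condition status is "False".
echo "$CONDITIONS" | jq -e 'all(.[]; .status == "False")'
```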
4.1.2. OCS Operator Pod Health
Check the OCS operator pod status to see if there is an OCS operator upgrade in progress.
WIP: Find specific status for upgrade (pending?)
oc get pod -n openshift-storage | grep ocs-operator
OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
echo $OCSOP
oc get pod/${OCSOP} -n openshift-storage
oc describe pod/${OCSOP} -n openshift-storage
If you determine that an OCS operator upgrade is in progress, please be patient: wait 5 minutes and this alert should resolve itself.
If you have waited or see a different error status condition, please continue troubleshooting.
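The wait-and-recheck step can be scripted as a simple poll with a timeout. This is a sketch only: `check_operator` and `wait_for_running` are illustrative helper names, and the `name=ocs-operator` label selector is an assumption about how the operator pod is labeled.

```shell
# Illustrative polling sketch; assumes KUBECONFIG points at the cluster
# and that the operator pod carries the label name=ocs-operator.
check_operator() {
    oc -n openshift-storage get pods -l name=ocs-operator \
        -o jsonpath='{.items[0].status.phase}'
}

# Retry up to $1 times, sleeping $2 seconds between attempts;
# returns 0 as soon as the pod phase is Running.
wait_for_running() {
    local tries="${1:-10}" delay="${2:-30}"
    for _ in $(seq "$tries"); do
        [ "$(check_operator)" = "Running" ] && return 0
        sleep "$delay"
    done
    return 1
}
```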
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6