In this guide, we explain how to get started troubleshooting issues with your Private Cloud’s Ceph cluster. A goal of this guide is to collect common troubleshooting scenarios and outline a method of addressing them.
Root Access to OpenStack Control Plane
Root access to your cloud’s control plane nodes is required to run the commands in this guide.
Get Ceph’s Status
In most troubleshooting cases, you can get an overview of your Ceph cluster by
checking its status. To check your Ceph cluster’s status, use ceph status. For example:
# ceph status
  cluster:
    id:     34fa49b3-fff8-4702-8b17-4e8d873c845f
    health: HEALTH_WARN
            clock skew detected on mon.focused-capybara, mon.lovely-ladybug
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum relaxed-flamingo,focused-capybara,lovely-ladybug (age 5d)
    mgr: relaxed-flamingo(active, since 5d), standbys: focused-capybara, lovely-ladybug
    osd: 4 osds: 4 up (since 5d), 4 in (since 13d)
    rgw: 3 daemons active (focused-capybara.rgw0, lovely-ladybug.rgw0, relaxed-flamingo.rgw0)

  task status:

  data:
    pools:   13 pools, 337 pgs
    objects: 110.16k objects, 388 GiB
    usage:   1.1 TiB used, 11 TiB / 12 TiB avail
    pgs:     337 active+clean

  io:
    client:   381 KiB/s rd, 1.2 MiB/s wr, 444 op/s rd, 214 op/s wr
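If you plan to script against this output, or want to watch it change while you troubleshoot, the same information is available in other forms. The following is a minimal sketch using standard tooling:

# Refresh the cluster status every five seconds while you work
watch -n 5 ceph status

# Emit the status as JSON for scripting
ceph status --format json-pretty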
Ceph Log Files
Ceph’s log files are stored in /var/log/ceph/ within each control plane node.
For example, listed are all log files for host focused-capybara:
# ls -1 /var/log/ceph/*.log
/var/log/ceph/ceph.audit.log
/var/log/ceph/ceph.log
/var/log/ceph/ceph-mgr.focused-capybara.log
/var/log/ceph/ceph-mon.focused-capybara.log
/var/log/ceph/ceph-osd.1.log
/var/log/ceph/ceph-rgw-focused-capybara.rgw0.log
/var/log/ceph/ceph-volume.log
An OpenMetal Ceph cluster is composed of several services: Ceph’s Manager, Monitor, OSD, and RADOSGW.
Ceph has a primary log file, log files for each service, and additional log files.
- Primary Log File: /var/log/ceph/ceph.log
- Ceph Monitor Log File: /var/log/ceph/ceph-mon.<hostname>.log
- Ceph RADOSGW Log File: /var/log/ceph/ceph-rgw-<hostname>.rgw0.log
If you are unsure which Ceph service’s log to look through, consider starting
with the primary log file, /var/log/ceph/ceph.log.
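For example, a quick way to surface recent problems is to follow or search the primary log file; a minimal sketch:

# Follow the cluster log in real time
tail -f /var/log/ceph/ceph.log

# Show the most recent warning and error entries
grep -E "WRN|ERR" /var/log/ceph/ceph.log | tail -n 20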
Clock Skew
Ceph has a number of health checks, including one for clock skew, called
MON_CLOCK_SKEW. For more, see Ceph’s Health Checks guide and
look for the text MON_CLOCK_SKEW. In our configuration, Ceph uses
chronyd to sync each node’s clock. Kolla Ansible is responsible for
installing and configuring chronyd into a Docker container on each Ceph
Monitor node. To administer chronyd, you must do so through Docker.
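For example, to check whether chronyd believes it is in sync on a given node, you can run chronyc inside the container. This is a sketch that assumes the container is named chrony, matching the docker restart chrony example later in this guide:

# Show chronyd's view of its synchronization state
docker exec chrony chronyc tracking

# List the NTP sources chronyd is polling
docker exec chrony chronyc sources -v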
Confirm Ceph’s Health
To confirm the status of this health check, execute ceph status and examine the output:
  cluster:
    id:     34fa49b3-fff8-4702-8b17-4e8d873c845f
    health: HEALTH_WARN
            clock skew detected on mon.focused-capybara, mon.lovely-ladybug
[...output truncated...]
Alternatively, use ceph health detail to see only the status of health checks:
HEALTH_WARN clock skew detected on mon.focused-capybara, mon.lovely-ladybug
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.focused-capybara, mon.lovely-ladybug
    mon.focused-capybara clock skew 0.663159s > max 0.05s (latency 0.000399254s)
    mon.lovely-ladybug clock skew 0.368233s > max 0.05s (latency 0.000385143s)
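Ceph can also report the clock offsets it has measured between monitors, which is a useful cross-check of what this health check sees. A minimal sketch; if your Ceph release lacks this command, comparing date output on each node works as well:

# Ask the monitors for their measured time offsets
ceph time-sync-status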
Examine Chrony Logs
From here, you may want to examine the chrony logs for each node. To view them through Docker, use:

docker logs chrony
Alternatively, consider viewing chrony’s logs on the local file system of each node.
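If the container log is long, Docker’s log options can help narrow it down to recent or suspicious entries; a minimal sketch:

# Show only the most recent 100 lines from the chrony container
docker logs --tail 100 chrony

# Search the container log for error- or step-related messages
docker logs chrony 2>&1 | grep -iE "error|step|skew"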
Addressing Clock Skew
There may be a number of methods for addressing clock skew. In this example, we
outline addressing the issue by restarting the chrony container on each node.
To address the MON_CLOCK_SKEW warning from the example output in this section, the
chrony container was restarted on each node. For example:
# docker restart chrony
chrony
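To apply the restart across all three control plane nodes in one pass, a loop over the hosts can help. This is a sketch that assumes SSH access as root between nodes and uses the hostnames from this example:

# Restart the chrony container on each control plane node
for host in relaxed-flamingo focused-capybara lovely-ladybug; do
    ssh root@$host "docker restart chrony"
done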
Next, perform the same Ceph health check as before to confirm the status. For example:
# ceph health detail
HEALTH_OK
If the clock skew issue is no longer present, you should see a status of
HEALTH_OK, assuming there are no other issues with the Ceph cluster.
Note! Restarting chrony may be a heavy-handed approach to
addressing the issue. Consider alternatively making use of chronyc
subcommands, such as sourcestats, to diagnose clock issues.
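For example, the following sketch, again assuming the Kolla Ansible container name chrony, shows the gentler path: inspect drift statistics first, then have chronyd step the clock only if needed:

# Show drift and offset statistics for each NTP source
docker exec chrony chronyc sourcestats -v

# If a large offset is reported, ask chronyd to correct the clock immediately
docker exec chrony chronyc makestep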