This guide covers failure and maintenance scenarios that can occur in an OpenStack cloud and how to address them.

Things can and will go wrong. It is good to be prepared for these events.

What should be done if a hardware node fails?

If a hardware node fails or needs to come down for maintenance, you should know what steps to take. Depending on the maintenance required, you will either need the assistance of our data center staff or be able to do the maintenance yourself.

Planned maintenance

This section describes the steps needed to take a hardware compute node out of the cloud in the event work needs to be done on it or the cloud needs to be reduced in size.

NOTE – If you know a node or nodes need maintenance that requires a hardware modification, you’ll need to create a ticket from the Flex Metal Central control panel so our data center staff can perform that task for you.

For official documentation on this subject, see OpenStack’s Compute Node Failures and Maintenance guide.

The general workflow for bringing a compute node down involves first disabling that node, finding the instances on that node, migrating those instances to another node, and removing any ceph Object Storage Daemons (OSDs). Optionally, you can migrate the instances back to the original node once the maintenance is done.
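The workflow can be sketched as a short shell function. This is an illustrative dry run, not an official script: it only prints the commands in order so they can be reviewed, and COMPUTE_NODE_NAME and INSTANCE_UUID are placeholders.

```shell
# Dry-run sketch of the removal workflow: prints each command for review
# rather than executing it (an illustration, not an official procedure).
maintenance_plan() {
    node="$1"
    echo "openstack compute service set --disable --disable-reason maintenance $node nova-compute"
    echo "openstack server list --host $node --all-projects"
    echo "openstack server migrate INSTANCE_UUID --live-migration"
    echo "docker stop nova_compute"
}

# Example: maintenance_plan COMPUTE_NODE_NAME
```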

OpenStackClient will be required to perform the maintenance.

Procedure for removing a compute node

Start with disabling the nova-compute service on the appropriate node:

$ openstack compute service set --disable --disable-reason \
  maintenance COMPUTE_NODE_NAME nova-compute

List the instances on that node:

$ openstack server list --host COMPUTE_NODE_NAME --all-projects

Migrate the instances to another node:

$ openstack server migrate INSTANCE_UUID --live-migration

NOTE – This deployment of OpenStack is using ceph as the backend shared storage so there is no need to pass the --block-migration flag to openstack server migrate.
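To move every instance off the node in one pass, the listing and migration steps can be combined. The emit_migrations helper below is a hypothetical convenience, not part of OpenStackClient; it prints one migrate command per UUID so the batch can be reviewed (or piped to sh) before anything runs.

```shell
# Hypothetical helper: read instance UUIDs on stdin and print the matching
# live-migration commands. A dry run by default; pipe the output to "sh"
# to actually execute it.
emit_migrations() {
    while read -r uuid; do
        [ -n "$uuid" ] && echo "openstack server migrate --live-migration $uuid"
    done
}

# Typical use (admin credentials must be sourced):
#   openstack server list --host COMPUTE_NODE_NAME --all-projects -f value -c ID \
#       | emit_migrations | sh
```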

Because OpenStack has been deployed using Kolla Ansible, each OpenStack service runs in a docker container.

Stop the nova_compute docker container:

# docker stop nova_compute

Perform the needed maintenance, and then restart the nova_compute service:

# docker start nova_compute

Verify the nova_compute docker container is running:

# docker ps | grep nova_compute
286e1b2e2ae5        kolla/centos-binary-nova-compute:train-centos8
"dumb-init --single-…"   2 months ago        Up 18 minutes

Finally, verify the nova-compute service has reconnected to the AMQP messaging service:

$ grep AMQP /var/log/kolla/nova/nova-compute.log

Unplanned maintenance

There are times when unplanned maintenance is required. This section describes what can be done in the event a compute node goes down unexpectedly.

The primary concern is that instances on the failed compute node will no longer work.
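Because this deployment uses ceph as shared storage, instances from a dead node can typically be rebuilt on another node with nova's evacuate call. The sketch below is hedged: the helper only prints the commands, and "nova evacuate" comes from python-novaclient, so confirm the exact client and flags for your release before running anything.

```shell
# Hedged sketch: read instance UUIDs on stdin and print the evacuation
# commands for review. "nova evacuate" is from python-novaclient; verify
# the command and flags against your deployment's release first.
emit_evacuations() {
    while read -r uuid; do
        [ -n "$uuid" ] && echo "nova evacuate $uuid"
    done
}

# Typical use, after confirming the node is down
# (check with "openstack compute service list"):
#   openstack server list --host FAILED_NODE_NAME --all-projects -f value -c ID \
#       | emit_evacuations
```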


Ceph failure scenarios and recovery

Ceph is by nature resilient to hardware failure and self-healing.

The primary concern with ceph is failed hard drives. How can an operator be alerted to a failed hard drive? Will ceph continue to function if a drive is lost?

Generally, ceph will continue to function if a drive is lost, however the drive should be replaced as soon as possible.

How do you know if a hard drive has failed?

Currently there is no monitoring for failed ceph drives, though the intention is to monitor for these events in the future. Until then, it is recommended that you put drive monitoring in place. Software such as Icinga or Nagios is a viable option for monitoring.
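Until dedicated monitoring is in place, a simple periodic SMART check can give early warning of a failing drive. The sketch below assumes smartmontools is installed on the node; the smart_ok helper only parses smartctl output, and the device glob should be adjusted to your hardware.

```shell
# Hypothetical minimal check: smart_ok reads "smartctl -H" output on stdin
# and succeeds only when the overall health result is PASSED.
smart_ok() { grep -q 'overall-health self-assessment test result: PASSED'; }

# Typical cron-able loop (requires smartmontools; adjust the device glob):
#   for d in /dev/sd?; do
#       smartctl -H "$d" | smart_ok || echo "SMART failure on $d"
#   done
```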

If you suspect a drive has failed, first confirm that this really is the case.

The overall procedure for determining if a drive has failed is to:

  • Check Ceph health
  • See if the OSD associated with the drive in question can be started if it is stopped
  • Check the OSD’s mount point using df -h
  • Use smartctl on the drive in question

The following explains these steps in more detail.



From one of the hardware nodes, perform the following checks:

Check if ceph is healthy:

# ceph health


Find the location of the OSD within the CRUSH map:

# ceph osd tree | grep -i down
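To pull just the failed OSD ids out of the tree for use in further checks, the output can be filtered. down_osds below is an illustrative parser that assumes the standard "ceph osd tree" column layout, where the status column reads "up" or "down".

```shell
# Hypothetical parser: print the ids (osd.N) of OSDs whose status column
# reads "down" in "ceph osd tree" output.
down_osds() { awk '/ down /' | grep -o 'osd\.[0-9]*'; }

# Typical use:
#   ceph osd tree | down_osds
```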


On the node that houses the OSD, try to start the OSD using systemctl where OSD is a placeholder for the actual OSD identifier:

# systemctl start ceph-osd@OSD.service


The systemd unit file for the OSD varies depending on which OSD has failed. For example, the unit file for OSD 0 is called ceph-osd@0.service.
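The remaining checks from the procedure list, the mount point and the SMART status, can use the default Ceph data layout. The helpers below are illustrative and assume the standard /var/lib/ceph paths; adjust them if your deployment differs.

```shell
# Hypothetical helpers mapping an OSD id to the names used by the checks
# above (default Ceph layout; adjust for custom paths).
osd_mount_point() { echo "/var/lib/ceph/osd/ceph-$1"; }
osd_unit() { echo "ceph-osd@$1.service"; }

# Typical checks for OSD 0 (replace /dev/sdX with the OSD's actual drive):
#   df -h "$(osd_mount_point 0)"   # data directory should still be mounted
#   smartctl -a /dev/sdX           # full SMART report for the drive
```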

If a hard drive has failed, our data center team will need to replace it. Create a ticket in Flex Metal Central to alert our team of the failure, and the drive or drives will be replaced.