About russellbryant

I'm an open source software engineer working for Red Hat on the OpenStack project.

Implementation of Pacemaker Managed OpenStack VM Recovery

I’ve discussed the use of Pacemaker as a method to detect compute node failures and recover the VMs that were running there.  The implementation of this is ready for testing.  Details can be found in this post to rdo-list.

The post mentions one pending enhancement to Nova that would improve things further:

Currently fence_compute loops, waiting for nova to recognise that the failed host is down, before we make a host-evacuate call which triggers nova to restart the VMs on another host. The discussed nova API extensions will speed up recovery times by allowing fence_compute to proactively push that information into nova instead.

The issue here is that the default backend for Nova’s servicegroup API relies on the nova-compute service to periodically check in to the Nova database to indicate that it is still running.  The delay in the recovery process is caused by Nova waiting on a configured timeout since the last time the service checked in.  Pacemaker is going to know about the failure much sooner, so it would be helpful if there was an API to tell Nova “trust me, this node is gone”.  This proposed spec intends to provide such an API.
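
To make the source of the delay concrete, here is a minimal sketch of a check-in based liveness test like the one described above.  The function name and the 60 second timeout are illustrative assumptions, not Nova's actual code.

```python
from datetime import datetime, timedelta

# Hypothetical timeout: how long after the last check-in a service is
# still treated as alive.  The value is illustrative only.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_check_in: datetime, now: datetime,
                  down_time: timedelta = SERVICE_DOWN_TIME) -> bool:
    """Return True while the last check-in is within the timeout."""
    return (now - last_check_in) <= down_time


# A node that dies right after checking in still looks healthy until
# the full timeout has elapsed.
last_seen = datetime(2015, 1, 1, 12, 0, 0)
still_up = service_is_up(last_seen, last_seen + timedelta(seconds=30))
now_down = service_is_up(last_seen, last_seen + timedelta(seconds=61))
```

The gap between the real failure and the moment this check starts returning False is the wait that an explicit "this node is gone" API would remove.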

The Different Facets of OpenStack HA

Last October, I wrote about a particular aspect of providing HA for workloads running on OpenStack. The HA problem space for OpenStack is much broader than what was addressed there. There has been a lot of work around HA for the OpenStack services themselves. The problems for OpenStack services seem to be pretty well understood. There are reference architectures with detailed instructions available from vendors and distributions that have been integrated into deployment tools. The upstream OpenStack project also has an HA guide that covers this topic.

Another major area that has received less attention and effort is HA for compute workloads running on top of OpenStack. Requirements are very diverse in this area, so I’d like to provide some broader context around this topic and expand on my thoughts about useful work that can be done either in or around OpenStack to better support legacy workloads.

Approaches to Recovery

My original thinking around all of this was that we should be building infrastructure that can automatically handle the recovery in response to failures.  It turns out that there is also a lot of demand for the ability to manage the recovery process in a completely custom way.  This was made particularly clear to me while discussing availability requirements for NFV workloads on OpenStack with participants in the OPNFV project.  With that said, we need to think about all failure types in two ways.

  1. Completely Automated Recovery – In this case, I envision a user wanting to enable an option that says “keep my VM running”.  The rest would be automatically handled by the cloud infrastructure.
  2. Custom Recovery – In this case, a set of applications has more strict requirements around failure detection (or even prediction) and availability.  They want to be notified of failures and be given the APIs necessary to implement their own recovery.

Types of Failures

We can break down the types of failures into 3 major categories that require different work for an OpenStack deployment.

  1. Failure of Infrastructure – This type of failure is when the hardware infrastructure supporting OpenStack workloads fails. A hardware failure on a given hypervisor is a prime example.
  2. Failure of the Guest Operating System – In this case, the base operating system in the VM fails for some reason.  Imagine a kernel panic in your VM that causes the VM to stop running, even though the hypervisor is still operating just fine.
  3. Failure of the Application – Something at the application layer may also fail.

Now let’s look at each failure type and consider the work that could be done to support each approach to recovery.

Infrastructure Failure

Failure of the infrastructure is what I wrote about last October.  I discussed the use of some recent enhancements to Pacemaker that would allow Pacemaker to monitor compute nodes.  This makes a lot of sense as Pacemaker is often already included in the HA architecture for the underlying OpenStack services.  There has since been work in cooperation with the Pacemaker team to build a proof-of-concept implementation of this approach.  Details for review and experimentation will be released soon.

The focus so far has been on completely automating the recovery process.  In particular that means issuing a nova evacuate API call for all of the instances that were running on the dead node.  However, this same architecture can be adapted to support the custom recovery approach.

Instead of fencing the node and issuing an evacuate, Pacemaker could optionally fence the node and emit a notification about the failure.  This notification could be put on the existing OpenStack notification message bus.  A deployment could choose to write something that consumes these notifications.  Another option could be to enhance Ceilometer to consume these notifications and turn them into an alarm that notifies an API consumer that the host that was running a VM has failed.
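
As a rough illustration of the "write something that consumes these notifications" option, here is a minimal in-process sketch.  The event type name and the payload layout are hypothetical, since no such notification is defined yet, and a real consumer would read from the message bus rather than from a Python call.

```python
from typing import Any, Callable, Dict, List

Notification = Dict[str, Any]
Handler = Callable[[Notification], None]

# Hypothetical event type; nothing by this name exists today.
HOST_FAILED = "compute.host.failed"


class NotificationDispatcher:
    """Routes notifications to the handlers registered for their event type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, notification: Notification) -> int:
        """Call every matching handler and return how many were called."""
        handlers = self._handlers.get(notification.get("event_type"), [])
        for handler in handlers:
            handler(notification)
        return len(handlers)


# A deployment-specific recovery hook would go here instead of a list.
failed_hosts: List[str] = []
dispatcher = NotificationDispatcher()
dispatcher.subscribe(HOST_FAILED, lambda n: failed_hosts.append(n["payload"]["host"]))
```

The point of the indirection is that the infrastructure only reports the failure, and the application owner decides what recovery means.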

Guest Operating System Failure

The libvirt/KVM driver in OpenStack already has great support for handling this type of failure and providing automated recovery.  In particular, you can set “hw:watchdog_action” in the extra_specs field of a flavor or “hw_watchdog_action” as an image property.  When this property is set, a watchdog device will be configured for the VM. If the value for this property is set to “reset”, the VM will automatically be rebooted if the guest operating system crashes, triggering the watchdog.

To support the custom recovery case, we could add a new watchdog_action called “notify”.  In this case, Nova could simply emit a notification to the OpenStack notification message bus.  Again, a deployment could have a custom application that consumes these notifications or a service like Ceilometer could turn them into something consumable by a public REST API.

Application Failure

In my opinion, providing an answer for application failures is the least important part of HA from an OpenStack perspective.  Dealing with this is not new for legacy applications.  It was necessary before running on OpenStack and there are a lot of solutions available.  For example, it’s possible to run a virtual Pacemaker cluster inside a set of VMs just like you would on a physical deployment.  This has not always been the case, though.  When we looked at this in much earlier days of OpenStack (pre-Neutron), it was much more difficult to accomplish.  Neutron gives you enough control so that IP addresses can be predictable before creating the VMs, making it easy to get all of the addresses needed for Pacemaker configuration into the VM at the time it’s created.

There is still some discussion about coming up with a cloud native approach to application monitoring and recovery handling.  Heat is often brought up in this context.  There was a thread on the openstack-dev list this past December about this.  It seems there is potential there, but it’s unclear if it’s really high enough on anyone’s priority list to work on.

Pulling Things Together

A major point with all of this is that there are different layers where failures can happen and there is not one solution for handling them all.  Breaking the problem space into pieces helps make it more manageable.  Once broken down, it seems clear to me that there is very reasonable work that can be done to enhance the ability of OpenStack to run traditional “pet” workloads.  Once these solutions are further along, it will be important to figure out how they integrate with deployment and management solutions, but let’s start with just making it work.

OpenStack Instance HA Proposal

In a perfect world, every workload that runs on OpenStack would be a cloud native application that is horizontally scalable and fault tolerant to anything that may cause a VM to go down.  However, the reality is quite different.  We continue to see a high demand for support of traditional workloads running on top of OpenStack and the HA expectations that come with them.

Traditional applications run on top of OpenStack just fine for the most part.  Some applications come with availability requirements that a typical OpenStack deployment will not provide automatically.  If a hypervisor goes down, there is nothing in place that tries to rescue VMs that were running there.  There are some features in place that allow manual rescue, but they require manual intervention from a cloud operator or an external orchestration tool.

This proposal discusses what it would take to provide automated detection of a failed hypervisor and the recovery of the VMs that were running there.  There are some differences to the solution based on what hypervisor you’re using.  I’m primarily concerned with libvirt/KVM, so I assume that for the rest of this post.  Except where libvirt is specifically mentioned, I think everything applies just as well to the use of the xenserver driver.

This topic is raised on a regular basis in the OpenStack community.  There has been pushback against putting this functionality directly in OpenStack.  Regardless of what components are used, I think we need to provide an answer to the question of how this problem should be approached.  I think this is quite achievable today using existing software.

Scope

This proposal is specific to recovery from infrastructure failures.  There are other types of failures that can affect application availability.  The guest operating system or the application itself could fail.  Recovery from these types of failures is primarily left up to the application developer and/or deployer.

It’s worth noting that the libvirt/KVM driver in OpenStack does contain one feature related to guest operating system failure.  The libvirt-watchdog blueprint was implemented in the Icehouse release of Nova.  This feature allows you to set the hw_watchdog_action property on either the image or flavor.  Valid values include poweroff, reset, pause, and none.  When this is enabled, libvirt will enable the i6300esb watchdog device for the guest and will perform the requested action if the watchdog is triggered.  This may be a helpful component of your strategy for recovery from guest failures.
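
For a sense of what this ends up looking like on the libvirt side, here is a small sketch that builds a watchdog device element in the shape used by libvirt's domain XML.  The helper name is hypothetical and this is not how Nova generates its guest configuration; the list of actions is just the one given above.

```python
from xml.etree import ElementTree as ET

# The actions named in the text above.
VALID_ACTIONS = ("poweroff", "reset", "pause", "none")


def watchdog_element(action: str) -> str:
    """Return a libvirt-style <watchdog> element for the given action."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported watchdog action: {action!r}")
    elem = ET.Element("watchdog", {"model": "i6300esb", "action": action})
    return ET.tostring(elem, encoding="unicode")


xml_fragment = watchdog_element("reset")
```

Rejecting unknown actions up front keeps a typo in an image or flavor property from silently producing a guest with no watchdog behavior.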

Architecture

A solution to this problem requires a few key components:

  1. Monitoring – A system to detect that a hypervisor has failed.
  2. Fencing – A system to fence failed compute nodes.
  3. Recovery – A system to orchestrate the rescue of VMs from the failed hypervisor.

Monitoring

There are two main requirements for the monitoring component of this solution.

  1. Detect that a host has failed.
  2. Trigger an automatic response to the failure (Fencing and Recovery).

It’s often suggested that the solution for this problem should be a part of OpenStack.  Many people have suggested that all of this functionality should be built into Nova.  The problem with putting it in Nova is that it assumes that Nova has proper visibility into the health of the infrastructure that Nova itself is running on.  There is a servicegroup API that does very basic group membership.  In particular, it keeps track of active compute nodes.  However, at best this can only tell you that the nova-compute service is not currently checking in.  There are several potential causes for this that would still leave the guest VMs running just fine.  Getting proper infrastructure visibility into Nova is really a layering violation.  Regardless, it would be a significant scope increase for Nova, and I really don’t expect the Nova team to agree to it.

It has also been proposed that this functionality be added to Heat.  The most fundamental problem with that is that a cloud user should not be required to use Heat to get their VM restarted if something fails.  There have been other proposals to use other (potentially new) OpenStack components for this.  I don’t like that for many of the same reasons I don’t think it should be in Nova.  I think it’s a job for the infrastructure supporting the OpenStack deployment, not OpenStack itself.

Instead of trying to figure out which OpenStack component to put it in, I think we should consider this a feature provided by the infrastructure supporting an OpenStack deployment.  Many OpenStack deployments already use Pacemaker to provide HA for portions of the deployment.  Historically, there have been scaling limits in the cluster stack that made Pacemaker not an option for use with compute nodes since there are far too many of them.  This limitation is actually in Corosync and not Pacemaker itself.  More recently, Pacemaker has added a new feature called pacemaker_remote, which allows a host to be a part of a Pacemaker cluster, without having to be a part of a Corosync cluster.  It seems like this may be a suitable solution for OpenStack compute nodes.

Many OpenStack deployments may already be using a monitoring solution like Nagios for their compute nodes.  That seems reasonable, as well.

Fencing

To recap, fencing is an operation that completely isolates a failed node.  It could be IPMI based where it ensures that the failed node is powered off, for example.  Fencing is important for several reasons.  There are many ways a node can fail, and we must be sure that the node is completely gone before starting the same VM somewhere else.  We don’t want the same VM running twice.  That is certainly not what a user expects.  Worse, since an OpenStack deployment doing automatic evacuation is probably using shared storage, running the same VM twice can result in data corruption, as two VMs will be trying to use the same disks.  Another problem would be having the same IPs on the network twice.

A huge benefit of using Pacemaker for this is that it has built-in integration with fencing, since it’s a key component of any proper HA solution.  If you went with Nagios, fencing integration may be left up to you to figure out.
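
The ordering constraint is the important part, so here is a minimal sketch of it.  The function and its two callables are hypothetical stand-ins for whatever the monitoring tool provides; the only point is that recovery never starts unless fencing has been confirmed.

```python
from typing import Callable, List


def handle_host_failure(host: str,
                        fence: Callable[[str], bool],
                        evacuate: Callable[[str], List[str]]) -> List[str]:
    """Fence the failed host first, and evacuate only if fencing succeeded.

    `fence` must return True only when the node is known to be isolated.
    `evacuate` returns the IDs of the instances it asked to be moved.
    """
    if not fence(host):
        # Without confirmed fencing the old VMs may still be running,
        # so starting them elsewhere risks disk corruption and
        # duplicate IPs.
        raise RuntimeError(f"fencing of {host} not confirmed; refusing to evacuate")
    return evacuate(host)
```

A monitoring system without fencing integration would have to supply its own `fence` step to use this safely.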

Recovery

Once a failure has been detected and the compute node has been fenced, the evacuation needs to be triggered.  To recap, evacuation is restarting an instance that was running on a failed host by moving it to another host.  Nova provides an API call to evacuate a single instance.  For this to work properly, instance disks should be on shared storage.  Alternatively, they could all be booted from Cinder volumes.  Interestingly, the evacuate API will still run even without either of these things.  The result is just a new VM from the same base image but without any data from the old one.  The only benefit then is that you get a VM back up and running under the same instance UUID.

A common use case with evacuation is “evacuate all instances from a given host”.  Since this is common enough, it was scripted as a feature in the novaclient library.  So, the monitoring tool can trigger this feature provided by novaclient.
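
Here is a rough sketch of what "evacuate all instances from a given host" amounts to.  The `ComputeClient` interface is a hypothetical stand-in, not novaclient's actual API; a real tool would call the novaclient feature mentioned above.

```python
from typing import Dict, List, Protocol


class ComputeClient(Protocol):
    """Hypothetical minimal interface; not the real novaclient API."""

    def list_instances_on_host(self, host: str) -> List[str]: ...

    def evacuate(self, instance_id: str) -> None: ...


def evacuate_host(client: ComputeClient, failed_host: str) -> Dict[str, str]:
    """Request evacuation of every instance on a host, one call per instance.

    Returns a per-instance result so that a single failure does not
    stop the remaining instances from being recovered.
    """
    results: Dict[str, str] = {}
    for instance_id in client.list_instances_on_host(failed_host):
        try:
            client.evacuate(instance_id)
            results[instance_id] = "accepted"
        except Exception as exc:
            results[instance_id] = f"failed: {exc}"
    return results
```

Collecting results per instance matters because the underlying API works on one instance at a time, so partial success is a normal outcome.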

If you want this functionality for all VMs in your OpenStack deployment, then we’re in good shape.  Many people have made the additional request that users should be able to request this behavior on a per-instance basis.  This does indeed seem reasonable, but poses an additional question.  How should we let a user indicate to the OpenStack deployment that it would like its instance automatically recovered?

The typical knobs used are image properties and flavor extra-specs.  That would certainly work, but it doesn’t seem quite flexible enough to me.  I don’t think a user should have to create a new image to mark it as “keep this running”.  Flavor extra-specs are fine if you want this for all VMs of a particular flavor or class of flavors.  In either case, the novaclient “evacuate a host” feature would have to be updated to optionally support it.

Another potential solution to this is by using a special tag that would be specified by the user.  There is a proposal up for review right now to provide a simple tagging API for instances in Nova.  For this discussion, let’s say the tag would be automatic-recovery.  We could also update the novaclient feature we’re using with support for “evacuate all instances on this host that have a given tag”.  The monitoring tool would trigger this feature and ask novaclient to evacuate a host of all VMs that were tagged with automatic-recovery.
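
A minimal sketch of the tag-based selection, assuming the monitoring tool can already get a mapping of instance IDs to their tags.  The mapping and function name are hypothetical, and the tagging API itself is still only a proposal.

```python
from typing import Dict, List, Set

# The tag name used in the discussion above.
RECOVERY_TAG = "automatic-recovery"


def instances_to_evacuate(instance_tags: Dict[str, Set[str]],
                          tag: str = RECOVERY_TAG) -> List[str]:
    """Return the IDs of instances on the failed host that carry the tag."""
    return sorted(instance_id
                  for instance_id, tags in instance_tags.items()
                  if tag in tags)


on_failed_host = {
    "vm-a": {"automatic-recovery", "web"},
    "vm-b": {"batch"},
    "vm-c": {"automatic-recovery"},
}
selected = instances_to_evacuate(on_failed_host)
```

Only the selected instances would be evacuated; the untagged ones are left for their owners to deal with.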

Conclusions and Next Steps

Instance HA is clearly something that many deployments would like to provide.  I believe that this could be put together for a deployment today using existing software, Pacemaker in particular.  A next step here is to provide detailed information on how to set this up and also do some testing.

I expect that some people might say, “but I’m already using system Foo (Nagios or whatever) for monitoring my compute nodes”.  You could go this route, as well.  I’m not sure about fencing integration with something like Nagios.  If you skip the use of fencing in this solution, you get to keep the pieces when it breaks.  Aside from that, your monitoring system could trigger the evacuation functionality of novaclient just like Pacemaker would.

Some really nice future development around this would be integration into an OpenStack management UI.  I’d like to have a dashboard of my deployment that shows me any failures that have occurred and what responses have been triggered.  This should be possible since pcsd offers a REST API (WIP) that could export this information.

Lastly, it’s worth thinking about this problem specifically in the context of TripleO.  If you’re deploying OpenStack with OpenStack, should the solution be different?  In that world, all of your baremetal nodes are OpenStack resources via Ironic.  Ceilometer could be used to monitor the status of those resources.  At that point, OpenStack itself does have enough information about the supporting infrastructure to perform this functionality.  Then again, instead of trying to reinvent all of this in OpenStack, we could just use the more general Pacemaker based solution there, as well.

PTLs and Project Success in OpenStack

We’re in the middle of another PTL change cycle.  Nominations have occurred over the last week.  We’ve also seen several people step down this cycle (Keystone, TripleO, Cinder, Heat, Glance).  This is becoming a regular thing in OpenStack.  The PTL position for most projects has changed hands over time.  The Heat project changes every cycle.  Nova has its 3rd PTL from a 3rd company (about to enter his 2nd cycle).  With all of this change, some people express some amount of discomfort and concern.

I think the change that we see is quite healthy.  This is our open governance model working well.  We should be thrilled that OpenStack is healthy enough that we don’t rely on any one person to move forward.

I’d like to thank everyone who steps up to fill a PTL position.  It is a huge commitment.  The recognition you get is great, but it’s really hard work.  It’s quite a bit more than just technical leadership.  It also involves project management and community management.  It’s a position subject to a lot of criticism and a lot of the work is thankless.  So, thank you very much to those that are serving as a PTL or have served as one in the past.

It’s quite important that everyone also realize that it takes a lot more than a PTL to make a project successful.  Our project governance and culture includes checks and balances.  There is a Technical Committee (TC) that is responsible for ensuring that OpenStack development overall remains healthy.  Some good examples of TC influence on projects would be the project reviews the TC has been doing over the last cycle, working with projects to apply course corrections (Neutron, Trove, Ceilometer, Horizon, Heat, Glance).  Most importantly, the PTL still must work for consensus of the project’s members (though the PTL is empowered to make the final call if necessary).

As a contributor to an OpenStack project, there is quite a bit you can do to help ensure project success beyond just writing code.  Here are some of those things:

Help the PTL

While the PTL is held accountable for the project, they do not have to be responsible for all of the work that needs to get done.  The larger projects have started officially delegating responsibilities.  There is talk about formalizing aspects of this, so take a look at that thread for examples.

If you’re involved in a project, you should work to understand these different aspects of keeping the project running.  See if there’s an area you can help out with.  This stuff is critically important.

If you aspire to be a PTL at some point in the future, I would say getting involved in these areas is the best way you can grow your skills, influence, and visibility in the project to make yourself a strong candidate in the future.

Participate in Project Discussions

The direction of a project is a result of many discussions.  You should be aware of these and participate in them as much as you can.  Watch the openstack-dev mailing list for discussions affecting your project.  It’s quite surprising how many people may be a core reviewer for a project but rarely participate in discussions on the mailing list.

Most projects also have an IRC channel.  Quite a bit of day-to-day discussion happens there.  This can be difficult due to time zones or other commitments.  Just join when you can.  If it’s time zone compatible, you should definitely make it a priority to join weekly project IRC meetings.  This is an important time to discuss current happenings in the project.

Finally, attend and participate in the design summit.  This is the time that projects try to sync up on goals for the cycle.  If you really want to play a role in affecting project direction and ensuring success, it’s important that you attend if possible.  Of course, there are many legitimate reasons some people may not be able to travel and those should be understood and respected by fellow project members.

Also keep in mind that project discussions span more than the project’s technical issues.  There are often project process and structure issues to work out.  Try to raise your awareness of these issues, provide input, and propose new ideas if you have them.  Some good recent examples of contributors doing this would be Daniel Berrange putting forth a detailed proposal to split out the virt drivers from nova, or Joe Gordon and John Garbutt pushing forward on evolving the blueprint handling process.

Do the Dirty Work

Most contributors to OpenStack are contributing on behalf of an employer.  Those employers have customers (which may be internal or external) and those customers have requirements.  It’s understandable that some amount of development time goes toward implementing features or fixing problems that are important to those customers.

It’s also critical that everyone understands that there is a good bit of common work that must get done.  If you want to build goodwill in a project while also helping its success, help in these areas.  Some of those include:

See You in Paris!

I hope this has helped some contributors think of new ways of helping ensure the success of OpenStack projects.  I hope to see you on the mailing list, IRC, and in Paris!

OpenStack Board Meeting – 2014-09-18

I’m not a member of the OpenStack board, but the board meetings are open with the exception of the occasional Executive Session for topics that really do need to be private.  I attended the meeting on September 18, 2014.  Jonathan Bryce has since posted a summary of the meeting, but I wanted to share some additional commentary.

Mid-Cycle Meetups

Rob Hirschfeld raised the topic of mid-cycle meetups.  This wasn’t discussed for too long in the meeting.  The specific request was that the foundation staff evaluate what kind of assistance the foundation could and should be providing these events.  So far they are self-organized by teams within the OpenStack project.

These have been increasing in popularity.  The OpenStack wiki lists 12 such events during the Juno development cycle.  Some developers involved in cross project efforts attended several events.  This increase in travel has resulted in some tension in the community, as well.  Some people have legitimate personal reasons for not wanting the additional travel burden.  Others mention the increased impact to company travel budgets.  Overall, the majority opinion is that the events are incredibly productive and useful.

One of the most insightful outcomes of the discussions about meetups has been that the normal design summit is no longer serving its original purpose well enough.  The design summit is supposed to be the time that we sync on project goals.  We need to improve the design summit so that it better achieves that goal, which would let the mid-cycle meetups focus primarily on working sessions that execute on the plan from the design summit.

There was a thread on the openstack-dev list discussing improvements we can make to the design summit based on mid-cycle meetup feedback.

Foundation Platinum Membership

Jonathan raised a topic about how the foundation and board should fill a platinum sponsor spot that is opening up soon.  The discussion was primarily about the process the board should use for handling this.

This may actually be quite interesting to watch.  Nebula is stepping down from its platinum sponsor spot.  There is a limit of 8 platinum sponsors.  This is the first time since the foundation launched that a platinum sponsor spot has been opened.  OpenStack has grown an enormous amount since the foundation launched, so it’s reasonable to expect that there will be contention around this spot.

Which companies will apply for the spot?  How will the foundation and board handle contention?  And ultimately, which company will be the new platinum sponsor of the foundation?

DefCore

This topic took up the majority of the board meeting and was quite eventful.  Early in the discussion there was concern about the uncertainty caused by the length of the process so far.  There seemed to be general agreement that this is a real concern.  Naturally, the conversation proceeded to how to move forward.

The goal of DefCore has been to define a single minimal definition to support the single OpenStack Powered trademark program.  The most important outcome from this discussion was that the board consensus seemed to be that this is causing far too much contention and that the DefCore committee should work with the foundation to propose a small set of trademark programs instead of trying to have a single one that covered all cases.

There were a couple of different angles discussed on creating additional trademark programs.  The first is about functionality.  So far, everything has been included in one program.  There was discussion of separate compute, storage, and networking trademark programs, for example.

Another angle that seemed quite likely based on the discussion was having OpenStack powered vs. OpenStack compatible trademark programs.  An OpenStack compatible program would only focus on compatibility with capabilities.  The OpenStack powered trademarks would include both capabilities and required code (designated sections).  The nice thing about this angle is that it would allow the board to press forward on the compatibility marks in support of the goal of interoperability, without getting held up by the debate around designated sections.

For more information on this, follow the ongoing discussion on the defcore-committee mailing list.

Next Board Meeting

The next OpenStack board meeting will be Nov. 2, 2014 in Paris, just before the OpenStack Summit.  There may also be an additional board meeting to continue the DefCore discussion.  Watch the foundation list for any news on additional meetings.

How to use the telephone (Phone Etiquette, or How I Learned to Love the Mute Button)

Many of us spend a lot of time on conference calls. Dan Smith and I recently put together an internal wiki page with some helpful information for our fellow conference call attendees. Dan wrote the instructions and I provided some additional visual aids.  Here is the content of that page. Feel free to share this information with anyone that you feel may benefit.

How to use the telephone

  1. Dial the number you wish to call
  2. Press the mute button
  3. If you wish to talk, place your finger on your mute button and press it, but keep your finger poised over the button
  4. Speak
  5. When finished speaking press the mute button before you return your finger to your keyboard
  6. If the call is continuing, go to step 3
  7. If the call is finished, hang up the telephone

Additional Tips

  1. Try to avoid using soft phones. They’re horrible and they often introduce echo and/or noise for everyone else on the call
  2. Try to avoid using cell phones. They’re horrible and they often introduce echo and/or noise for everyone else on the call
  3. Use a headset. Speakerphones aren’t as good as you think they are. If you have to use a speakerphone, be even more vigilant about your mute button.
  4. Stick to landline or proper voip phones whenever possible. Always use your mute button when you’re not actually speaking.
  5. If you’re using something that resembles but poorly emulates a real telephone, use the *6 mute feature as it mutes all the audio coming from your line, something that your emulated phone may not do well.
  6. Keep track of your mute status. It’s not that hard. It’s either on or off. Don’t announce to everyone that you forgot whether you were muted or not. If you need help determining if you’re muted or not, try the procedure below.
  7. Hold is not mute. Don’t use your phone’s hold function in any way when you’re connected to a conference. Your phone or phone system may play music or beep periodically into the conference, making it unusable until you return.

Steps to determine if you’re muted

  1. If you’re not talking right now, you’re muted (see step 3 in the How to use the telephone instructions)
  2. There is no step two. If you messed up step one and think you need a step two, review step one.

Helpful Images

A phone

Note: This is a Polycom IP 335. Your phone may differ.

polycom_soundpoint_ip_335

 

Location of Mute Button

It’s the single big red button.  Note that your phone should remain black and NOT turn grey while locating the mute button.

polycom_soundpoint_ip_335_mute_location

 

Light On While Muted

Note that the phone should remain black while muted.  It will not turn grey.

 

polycom_soundpoint_ip_335_mute_light

Juno Preview for OpenStack Compute (Nova)

We’re now well into the Juno release cycle. Here’s my take on a preview of some of what you can expect in Juno for Nova.

NFV

One area receiving a lot of focus this cycle is NFV. We’ve started an upstream NFV sub-team for OpenStack that is tracking and helping to drive requirements and development efforts in support of NFV use cases. If you’re not familiar with NFV, here’s a quick overview that was put together by the NFV sub-team:

NFV stands for Network Functions Virtualization. It defines the
replacement of usually stand alone appliances used for high and low
level network functions, such as firewalls, network address translation,
intrusion detection, caching, gateways, accelerators, etc, into virtual
instance or set of virtual instances, which are called Virtual Network
Functions (VNF). In other words, it could be seen as replacing some of
the hardware network appliances with high-performance software taking
advantage of high performance para-virtual devices, other acceleration
mechanisms, and smart placement of instances. The origin of NFV comes
from a working group from the European Telecommunications Standards
Institute (ETSI) whose work is the basis of most current
implementations. The main consumers of NFV are Service providers
(telecommunication providers and the like) who are looking to accelerate
the deployment of new network services, and to do that, need to
eliminate the constraint of slow renewal cycle of hardware appliances,
which do not autoscale and limit their innovation.

NFV support for OpenStack aims to provide the best possible
infrastructure for such workloads to be deployed in, while respecting
the design principles of a IaaS cloud. In order for VNF to perform
correctly in a cloud world, the underlying infrastructure needs to
provide a certain number of functionalities which range from scheduling
to networking and from orchestration to monitoring capacities. This
means that to correctly support NFV use cases in OpenStack,
implementations may be required across most, if not all, main OpenStack
projects, starting with Neutron and Nova.

The opportunities for OpenStack in the NFV space are huge. The work being tracked by the NFV sub-team spans more than just Nova, but here are some of the NFV related projects for Nova:

Upgrades

The road to live upgrades has been a long one. Progress has been made over the last several releases. The Icehouse release was the first release that supported live upgrades in some form. From Havana to Icehouse, you can do a control plane upgrade with some API downtime without having to upgrade your compute nodes at the same time. You can roll through upgrading the compute nodes with the control plane already upgraded to Icehouse.

For Juno we are continuing to improve on this in several areas. Since Nova is a highly distributed system, one of the biggest requirements for doing this is versioning everything about the interactions between components. First we went through and versioned all of the interfaces between components. Next we have been versioning all of the data passed between components. This versioning of the data is part of what Nova Objects provide. Nova Objects are an internal implementation detail, but are critical to upgrade support. The entire code base had not been converted as of the Icehouse release. For Juno we continue to do conversions over to this new object model.
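To make the idea of versioning the data passed between components a bit more concrete, here is a minimal sketch. It is not Nova's actual object code: the class name, the `locked` field, and the version numbers are all hypothetical, chosen only to show how a newer service could downgrade what it sends to an older one.

```python
# Illustrative sketch only; the class, fields, and versions are hypothetical.

def _as_tuple(version):
    """Turn a version string such as '1.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


class VersionedInstance:
    # Assume version 1.1 added the 'locked' field on top of 1.0.
    VERSION = "1.1"

    def __init__(self, uuid, host, locked=False):
        self.uuid = uuid
        self.host = host
        self.locked = locked

    def to_primitive(self, target_version=None):
        """Serialize for the wire, optionally downgrading for an older peer."""
        target_version = target_version or self.VERSION
        data = {"uuid": self.uuid, "host": self.host}
        # Only include fields that the target version knows about.
        if _as_tuple(target_version) >= (1, 1):
            data["locked"] = self.locked
        return {"name": "Instance", "version": target_version, "data": data}

    @classmethod
    def from_primitive(cls, primitive):
        """Rebuild an object, rejecting payloads newer than we understand."""
        if _as_tuple(primitive["version"]) > _as_tuple(cls.VERSION):
            raise ValueError("unsupported version %s" % primitive["version"])
        return cls(**primitive["data"])
```

In this sketch an upgraded control plane would call `to_primitive("1.0")` when talking to a compute node that has not been upgraded yet, and the older node would never see a field it does not understand.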

The other major improvement being looked at this release cycle is how we can reduce the downtime needed on the control plane by reducing how long it takes for database schema migrations to run. This is largely about developing new best practices about how migrations need to work going forward.

Finally, for the Icehouse release we added some basic testing of the live upgrade scenario to the OpenStack CI system. This testing runs OpenStack using the previous release and then upgrades everything except the nova-compute service. At that point, everything should continue to work. One goal for the Juno cycle is to improve this testing to verify that we can also run an older instance of the nova-network service with an upgraded control plane. This is critical for deployments that use nova-network in multi-host mode. In that case, you have nova-network running on each compute node, so we need to support a mixed version environment for nova-network, as well as nova-compute.

Scheduler

There’s always a lot of interest in improving the way host scheduling works in Nova. In the Icehouse cycle we identified that we wanted to split the scheduler out into a new project (codenamed Gantt). Doing so requires decoupling Nova’s scheduler as much as possible from the rest of Nova. This decoupling effort is the primary goal for the Juno cycle. Once the scheduler is independent of Nova, we can investigate ways to integrate other projects so that scheduling can use information that currently only lives in other projects such as Neutron or Cinder.

Docker

The Docker driver for Nova was moved to Stackforge during the Icehouse development cycle. The primary reason was the lack of CI running for the driver. However, there were also a number of feature gaps that made it difficult to get CI with tempest working as it needed to. Moving to Stackforge gave the team working on this driver an opportunity to iterate more quickly and fill these gaps.

There has been a lot of progress on the Docker driver in the Juno cycle. Some of the feature gap work has resulted in improvements to Docker itself, which is really great to see. For example, Docker now supports pause and unpause, which is a feature of the Nova API that the Docker driver is now able to support. Another area that has seen some focus is Cinder support. To make this work, we have to be able to support exposing block devices to Docker containers at creation time, as well as later on after they are already running. There has been work on Docker itself in this area, which will eventually lead to support in the Nova Docker driver.

Finally, there has been ongoing work to get CI with tempest running. It’s now running in OpenStack’s CI infrastructure. The progress is great to see, but it also seems most likely that the driver will return to Nova in the K release cycle instead of Juno.

Ironic

Nova introduced the baremetal driver in the Grizzly release.  This driver allows you to use Nova’s API to do provisioning of bare metal instead of virtual machines.  There was immediately a lot of interest in this functionality for OpenStack.  Soon after this driver was introduced, it was decided that we should start a new project dedicated to bare metal management.  That project is Ironic.

Ironic has come a long way since then.  The project is currently incubated and could potentially graduate for the K release.  One of the major tasks in moving towards graduation is getting the Ironic driver for Nova merged.  The spec has been approved and the code seems to be in good shape.  I’m very hopeful that we will have this step completed in the Juno release.

Database Integration

OpenStack has been a long time user of the SQLAlchemy library for its integration with relational databases.  More recently, some OpenStack projects have begun using Alembic for managing database schema migrations.  Michael Bayer, author of SQLAlchemy and Alembic, recently joined Red Hat to help with OpenStack, as well as continue to maintain SQLAlchemy and Alembic.  He has been surveying OpenStack’s current usage of SQLAlchemy and identifying areas where we can improve.  He has written up a fascinating wiki page with his findings.  I expect this to result in some very nice improvements to many OpenStack projects, including Nova.

Other

There are many other features being worked on right now for Nova. The best place to get an idea of what’s going on is to look at either the list of approved design specs or the list of specs under review.

Availability Zones and Host Aggregates in OpenStack Compute (Nova)

UPDATE 2014-06-18: There was a talk at the last OpenStack Summit in Atlanta on this topic, Divide and Conquer: Resource Segregation in the OpenStack Cloud.

Confusion around Host Aggregates and Availability Zones in Nova seems to be very common. In this post I’ll attempt to show how each is used. All information in this post is based on the way things work in the Grizzly version of Nova.

First, go ahead and forget everything you know about things called Availability Zones in other systems.  They are not the same thing and trying to map Nova’s concept of Availability Zones to what something else calls Availability Zones will only cause confusion.

The high level view is this: A host aggregate is a grouping of hosts with associated metadata.  A host can be in more than one host aggregate.  The concept of host aggregates is only exposed to cloud administrators.

A host aggregate may be exposed to users in the form of an availability zone. When you create a host aggregate, you have the option of providing an availability zone name. If specified, the host aggregate you have created is now available as an availability zone that can be requested.

Here is a tour of some commands.

Create a host aggregate:

$ nova aggregate-create test-aggregate1
+----+-----------------+-------------------+-------+----------+
| Id | Name            | Availability Zone | Hosts | Metadata |
+----+-----------------+-------------------+-------+----------+
| 1  | test-aggregate1 | None              |       |          |
+----+-----------------+-------------------+-------+----------+

Create a host aggregate that is exposed to users as an availability zone. (This is not creating a host aggregate within an availability zone! It is creating a host aggregate that is the availability zone!)

$ nova aggregate-create test-aggregate2 test-az
+----+-----------------+-------------------+-------+----------+
| Id | Name            | Availability Zone | Hosts | Metadata |
+----+-----------------+-------------------+-------+----------+
| 2  | test-aggregate2 | test-az           |       |          |
+----+-----------------+-------------------+-------+----------+

Add a host to a host aggregate, test-aggregate2. Since this host aggregate defines the availability zone test-az, adding a host to this aggregate makes it a part of the test-az availability zone.

$ nova aggregate-add-host 2 devstack
Aggregate 2 has been successfully updated.
+----+-----------------+-------------------+---------------+------------------------------------+
| Id | Name            | Availability Zone | Hosts         | Metadata                           |
+----+-----------------+-------------------+---------------+------------------------------------+
| 2  | test-aggregate2 | test-az           | [u'devstack'] | {u'availability_zone': u'test-az'} |
+----+-----------------+-------------------+---------------+------------------------------------+

Note that the novaclient output shows the availability zone twice. The data model on the backend only stores the availability zone in the metadata. There is not a separate column for it. The API returns the availability zone separately from the general list of metadata, though, since it’s a special piece of metadata.
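As a rough illustration of why the availability zone shows up twice, here is a small sketch of how an API view could be derived from a stored aggregate whose only record of the zone is a metadata key. The function name and the dictionary layout are assumptions made for this example and are not Nova's actual code.

```python
# Illustrative sketch only; function name and dict layout are hypothetical.

def aggregate_to_api_view(aggregate):
    """Build an API-style view of a stored aggregate.

    The stored record keeps the availability zone only as a metadata key.
    The view surfaces it as its own field and still leaves it in the
    metadata, which is why it appears twice in the client output.
    """
    metadata = dict(aggregate.get("metadata", {}))
    return {
        "id": aggregate["id"],
        "name": aggregate["name"],
        "availability_zone": metadata.get("availability_zone"),
        "hosts": list(aggregate.get("hosts", [])),
        "metadata": metadata,
    }


stored = {
    "id": 2,
    "name": "test-aggregate2",
    "hosts": ["devstack"],
    "metadata": {"availability_zone": "test-az"},
}
view = aggregate_to_api_view(stored)
```

An aggregate without that metadata key, such as test-aggregate1, would simply get `None` for its availability zone in this sketch.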

Now that the test-az availability zone has been defined and contains one host, a user can boot an instance and request this availability zone.

$ nova boot --flavor 84 --image 64d985ba-2cfa-434d-b789-06eac141c260 \
> --availability-zone test-az testinstance
$ nova show testinstance
+-------------------------------------+----------------------------------------------------------------+
| Property                            | Value                                                          |
+-------------------------------------+----------------------------------------------------------------+
| status                              | BUILD                                                          |
| updated                             | 2013-05-21T19:46:06Z                                           |
| OS-EXT-STS:task_state               | spawning                                                       |
| OS-EXT-SRV-ATTR:host                | devstack                                                       |
| key_name                            | None                                                           |
| image                               | cirros-0.3.1-x86_64-uec (64d985ba-2cfa-434d-b789-06eac141c260) |
| private network                     | 10.0.0.2                                                       |
| hostId                              | f038bdf5ff35e90f0a47e08954938b16f731261da344e87ca7172d3b       |
| OS-EXT-STS:vm_state                 | building                                                       |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000002                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname | devstack                                                       |
| flavor                              | m1.micro (84)                                                  |
| id                                  | 107d332a-a351-451e-9cd8-aa251ce56006                           |
| security_groups                     | [{u'name': u'default'}]                                        |
| user_id                             | d0089a5a8f5440b587606bc9c5b2448d                               |
| name                                | testinstance                                                   |
| created                             | 2013-05-21T19:45:48Z                                           |
| tenant_id                           | 6c9cfd6c838d4c29b58049625efad798                               |
| OS-DCF:diskConfig                   | MANUAL                                                         |
| metadata                            | {}                                                             |
| accessIPv4                          |                                                                |
| accessIPv6                          |                                                                |
| progress                            | 0                                                              |
| OS-EXT-STS:power_state              | 0                                                              |
| OS-EXT-AZ:availability_zone         | test-az                                                        |
| config_drive                        |                                                                |
+-------------------------------------+----------------------------------------------------------------+

All of the examples so far show how host aggregates provide an API driven mechanism for cloud administrators to define availability zones. The other use case host aggregates serve is tagging a group of hosts with a type of capability. When creating custom flavors, you can set a requirement for a capability. When a request is made to boot an instance of that type, the scheduler will only consider hosts in host aggregates tagged with this capability in their metadata.

We can add some metadata to the original host aggregate we created that was *not* also an availability zone, test-aggregate1.

$ nova aggregate-set-metadata 1 coolhardware=true
Aggregate 1 has been successfully updated.
+----+-----------------+-------------------+-------+----------------------------+
| Id | Name            | Availability Zone | Hosts | Metadata                   |
+----+-----------------+-------------------+-------+----------------------------+
| 1  | test-aggregate1 | None              | []    | {u'coolhardware': u'true'} |
+----+-----------------+-------------------+-------+----------------------------+

A flavor can include a set of key/value pairs called extra_specs. Here’s an example of creating a flavor that will only run on hosts in an aggregate with the coolhardware=true metadata.

$ nova flavor-create --is-public true m1.coolhardware 100 2048 20 2
+-----+-----------------+-----------+------+-----------+------+-------+-------------+-----------+
| ID  | Name            | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public |
+-----+-----------------+-----------+------+-----------+------+-------+-------------+-----------+
| 100 | m1.coolhardware | 2048      | 20   | 0         |      | 2     | 1.0         | True      |
+-----+-----------------+-----------+------+-----------+------+-------+-------------+-----------+
$ nova flavor-key 100 set coolhardware=true
$ nova flavor-show 100
+----------------------------+----------------------------+
| Property                   | Value                      |
+----------------------------+----------------------------+
| name                       | m1.coolhardware            |
| ram                        | 2048                       |
| OS-FLV-DISABLED:disabled   | False                      |
| vcpus                      | 2                          |
| extra_specs                | {u'coolhardware': u'true'} |
| swap                       |                            |
| os-flavor-access:is_public | True                       |
| rxtx_factor                | 1.0                        |
| OS-FLV-EXT-DATA:ephemeral  | 0                          |
| disk                       | 20                         |
| id                         | 100                        |
+----------------------------+----------------------------+
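
The matching between a flavor's extra_specs and aggregate metadata can be sketched roughly as follows. This is a simplified illustration and not the scheduler's real filter code: the function name is made up, and the host `coolhost` is a hypothetical member of test-aggregate1, which has no hosts in the commands above.

```python
# Illustrative sketch only; not the scheduler's actual filter implementation.

def host_passes(host, aggregates, extra_specs):
    """Return True if the host's aggregates satisfy every extra_spec."""
    # Collect the metadata of every aggregate the host belongs to.
    merged = {}
    for aggregate in aggregates:
        if host in aggregate["hosts"]:
            for key, value in aggregate["metadata"].items():
                merged.setdefault(key, set()).add(value)
    # Every key/value pair requested by the flavor must be present.
    return all(
        value in merged.get(key, set())
        for key, value in extra_specs.items()
    )


aggregates = [
    # 'coolhost' is a hypothetical host added for this example.
    {"name": "test-aggregate1", "hosts": ["coolhost"],
     "metadata": {"coolhardware": "true"}},
    {"name": "test-aggregate2", "hosts": ["devstack"],
     "metadata": {"availability_zone": "test-az"}},
]
flavor_extra_specs = {"coolhardware": "true"}

eligible = [
    host for host in ("coolhost", "devstack")
    if host_passes(host, aggregates, flavor_extra_specs)
]
```

With this data only `coolhost` would be eligible for the m1.coolhardware flavor, while a flavor with no extra_specs would pass on any host.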

Hopefully this provides some useful information on what host aggregates and availability zones are, and how they are used.

OpenStack Compute (Nova) Roadmap for Havana

The Havana design summit was held mid-April.  Since then we have been documenting the Havana roadmap and going full speed ahead on development of these features.  The list of features that developers have committed to completing for the Havana release is tracked using blueprints on Launchpad. At the time of writing, we have 74 blueprints listed that cover a wide range of development efforts.  Here are some highlights in no particular order:

Database Handling

Vish Ishaya made a change at the very beginning of the development cycle that will allow us to backport database migrations to the Grizzly release. This matters whenever a bug fix that requires a migration has to be backported.

Dan Smith and Chris Behrens are working on a unified object model. One of the things that has been in the way of rolling upgrades of a Nova deployment is that the code and the database schema are very tightly coupled. The primary goal of this effort is to decouple these things. This effort is bringing in some other improvements, as well, including better object serialization handling for rpc, as well as object versioning.

Boris Pavlovic continues to do a lot of cleanup of database support in Nova.  He’s adding tests (and more tests), adding unique constraints, improving session handling, and improving archiving.

Chris Behrens has been working on a native MySQL database driver that performs much better than the SQLAlchemy driver for use in large scale deployments.

Mike Wilson is working on supporting read-only database slaves. This will allow distributing some queries to other database servers to help scaling in large scale deployments.

Bare Metal

The Grizzly release of Nova included the bare metal provisioning driver. Interest in this functionality has been rapidly increasing. Devananda van der Veen proposed that the bare metal provisioning code be split out into a new project called Ironic. The new project was approved for incubation by the OpenStack Technical Committee last week. Once this has been completed, there will be a driver in Nova that talks to the Ironic API. The Ironic API will present some additional functionality that doesn’t make sense to present in the Compute API in Nova.

Prior to the focus shift to Ironic, some new features were added to the bare metal driver. USC-ISI added support for Tilera and Devananda added a feature that allows you to request a specific bare metal node when provisioning a server.

Version 3 (v3) of the Compute API

The Havana release will include a new revision of the compute REST API in Nova. This effort is being led by Christopher Yeoh, with help from others. The v3 API will include a new framework for implementing extensions, extension versioning, and a whole bunch of cleanup: (1) (2) (3) (4).

Networking

The OpenStack community has been maintaining two network stacks for some time. Nova includes the nova-network service. Meanwhile, the OpenStack Networking project has been developed from scratch to support much more than nova-network does. Nova currently supports both. OpenStack Networking is expected to reach and surpass feature parity with nova-network in the Havana cycle. As a result, it’s time to deprecate nova-network. Vish Ishaya (from the Nova side) and Gary Kotton (from the OpenStack Networking side) have agreed to take on the challenging task of figuring out how to migrate existing deployments using nova-network to an updated environment that includes OpenStack Networking.

Scheduling

The Havana roadmap includes a mixed bag of scheduler features.

Andrew Laski is going to make the changes required so that the scheduler becomes exclusively a resource that gets queried. Currently, when starting an instance, the request is handed off to the scheduler, which then hands it off to the compute node that is selected. This change will make it so proxying through nova-scheduler is no longer done. This will mean that every operation that uses the scheduler will interact with it the same way, as opposed to some operations querying and others proxying.

Phil Day will be adding an API extension that allows you to discover which scheduler hints are supported.  Phil is also looking at adding a way to allocate an entire host to a single tenant.

Inbar Shapira is looking at allowing multiple scheduling policies to be in effect at the same time.  This will allow you to have different sets of scheduler filters activated depending on some type of criteria (perhaps the requested availability zone).

Rerngvit Yanggratoke is implementing support for weighting scheduling decisions based on the CPU utilization of existing instances on a host.
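
As a very rough idea of what weighting by CPU utilization could mean, here is a small sketch. It is not the implementation being worked on: the function name and the input format (a mapping of host name to the fraction of CPU in use) are assumptions for illustration.

```python
# Illustrative sketch only; the function and its input format are hypothetical.

def weigh_hosts(cpu_utilization):
    """Order host names from most to least preferred.

    cpu_utilization maps a host name to the fraction of its CPU that is
    currently in use (0.0 to 1.0). A host with more idle CPU gets a
    higher weight and therefore sorts first.
    """
    weights = {host: 1.0 - used for host, used in cpu_utilization.items()}
    return sorted(weights, key=lambda host: weights[host], reverse=True)


ranking = weigh_hosts({"node-a": 0.9, "node-b": 0.2, "node-c": 0.5})
```

Here `ranking` puts the least loaded host first, so a scheduler using it would prefer `node-b`.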

Migrations

Nova includes support for different types of migrations. We have cold migrations (migrate) and live migrations (live-migrate). We also have resize and evacuate, which are closely related functions. The code paths for all of these features have evolved separately. It turns out that we can rework all of these things to share a lot of code. While we’re at it, we are restructuring the way these operations work to be primarily driven by the nova-conductor service.  This will allow the tasks to be tracked in a single place, as opposed to the flow of control being passed around between compute nodes. Having compute nodes tell each other what to do is also a very bad thing from a security perspective. These efforts are well underway. Tiago Rodrigues de Mello is working on moving cold migrations to nova-conductor and John Garbutt is working on moving live migrations. All of this is tracked under the parent blueprint for unified migrations.

And More!

This post doesn’t include every feature on the roadmap. You can find that here. I fully expect that more will be added to this list as Havana progresses. We don’t always know what features are being worked on in advance. If you have another feature you would like to propose, let’s talk about it on the openstack-dev list!

Deployment Considerations for nova-conductor Service in OpenStack Grizzly

The Grizzly release of OpenStack Nova includes a new service, nova-conductor. Some previous posts about this service can be found here and here. This post is intended to provide some additional insight into how this service should be deployed and how the service should be scaled as load increases.

Smaller OpenStack Compute deployments typically consist of a single controller node and one or more compute (hypervisor) nodes. The nova-conductor service fits in the category of controller services. In this style of deployment you would run nova-conductor on the controller node and not on the compute nodes. Note that most of what nova-conductor does in the Grizzly release is perform database operations on behalf of compute nodes. This means that the controller node will have more work to do than in previous releases. Load should be monitored and the controller services should be scaled out if necessary.

Here is a model of what this size of deployment might look like:

nova-simple

As Compute deployments get larger, the controller services are scaled horizontally. For example, there would be multiple instances of the nova-api service on multiple nodes sitting behind a load balancer. You may also be running multiple instances of the nova-scheduler service across multiple nodes. Load balancing is done automatically for the scheduler by the AMQP message broker, RabbitMQ or Qpid. The nova-conductor service should be scaled out in this same way. It can be run multiple times across multiple nodes and the load will be balanced automatically by the message broker.
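
To picture how running more copies of nova-conductor spreads the load, here is a toy simulation. It assumes the broker hands messages to the consumers of a shared queue in turn, and the worker names are made up. It does not talk to a real broker.

```python
# Toy simulation only; no real message broker is involved.
import itertools
from collections import Counter


def dispatch(messages, workers):
    """Hand each message to the next worker in turn.

    This mimics a broker distributing messages across several consumers
    that share one queue, and returns how many messages each worker got.
    """
    assignments = Counter()
    for _message, worker in zip(messages, itertools.cycle(workers)):
        assignments[worker] += 1
    return assignments


# Ten requests spread across two hypothetical conductor instances.
load = dispatch(range(10), ["conductor-1", "conductor-2"])
```

Adding a third worker to the list in this sketch lowers the share each one handles, which is the effect you want from scaling the service out.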

Here is a second deployment model. This gives an idea of how a deployment grows beyond a single controller node.

nova-complex

There are a couple of ways to monitor the performance of nova-conductor to see if it needs to be scaled out. The first is by monitoring CPU load. The second is by monitoring message queue size. If the queues are getting backed up, it is likely time to scale out services.  Both messaging systems provide at least one way to look at the state of message queues. For Qpid, try the qpid-stat command. For RabbitMQ, see the rabbitmqctl list_queues command.
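
As one way to act on queue size, here is a small sketch that scans the text printed by rabbitmqctl list_queues and reports queues with a backlog. It assumes the default output of one queue name and one message count per line, and the threshold of 100 messages and the sample numbers are arbitrary values chosen for illustration.

```python
# Illustrative sketch; the threshold and the sample output are made up.

def backed_up_queues(list_queues_output, threshold=100):
    """Return {queue_name: depth} for queues at or above the threshold.

    Expects the text printed by 'rabbitmqctl list_queues', assuming the
    default name and message count columns. Lines that do not look like
    'name count' (banners, headers) are ignored. Queue names containing
    whitespace are not handled by this simple parser.
    """
    backlog = {}
    for line in list_queues_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            name, depth = parts[0], int(parts[1])
            if depth >= threshold:
                backlog[name] = depth
    return backlog


sample_output = (
    "Listing queues ...\n"
    "conductor\t250\n"
    "scheduler\t3\n"
    "compute.devstack\t0\n"
    "...done.\n"
)
backlog = backed_up_queues(sample_output)
```

In this sample only the conductor queue is reported, which would be a hint to start another nova-conductor instance.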