Metal³ – Metal Kubed, Bare Metal Provisioning for Kubernetes

Project Introduction

There are a number of great open source tools for bare metal host provisioning, including Ironic.  Metal³ aims to build on these technologies to provide a Kubernetes native API for managing bare metal hosts via a provisioning stack that is also running on Kubernetes.  We believe that Kubernetes Native Infrastructure, or managing your infrastructure just like your applications, is a powerful next step in the evolution of infrastructure management.

The Metal³ project is also building integration with the Kubernetes cluster-api project, allowing Metal³ to be used as an infrastructure backend for Machine objects from the Cluster API.

Metal³ Repository Overview

There is a Metal³ overview and some more detailed design documents in the metal3-docs repository.

The baremetal-operator is the component that manages bare metal hosts.  It exposes a new BareMetalHost custom resource in the Kubernetes API that lets you manage hosts in a declarative way.
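
If you want to verify that the custom resource is available in a cluster running the operator, standard kubectl commands work; the exact CRD name below is an assumption based on the metalkube.org/v1alpha1 API group used later in this post:

$ kubectl api-resources | grep -i baremetal
$ kubectl get crd baremetalhosts.metalkube.org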

Finally, the cluster-api-provider-baremetal repository includes integration with the cluster-api project.  This provider currently includes a Machine actuator that acts as a client of the BareMetalHost custom resources.

Demo

The project has been going for a few months now, and there is enough in place to show some working code.

For this demonstration, I’ve started with a 3-node Kubernetes cluster installed using OpenShift.

$ kubectl get nodes
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   24h   v1.13.4+d4ce02c1d
master-1   Ready    master   24h   v1.13.4+d4ce02c1d
master-2   Ready    master   24h   v1.13.4+d4ce02c1d

Machine objects were created to reflect these 3 masters, as well.

$ kubectl get machines
NAME              INSTANCE   STATE   TYPE   REGION   ZONE   AGE
ostest-master-0                                             24h
ostest-master-1                                             24h
ostest-master-2                                             24h

For this cluster-api provider, each Machine has a corresponding BareMetalHost object, which represents the piece of hardware being managed.  There is a design document that covers the relationship between Nodes, Machines, and BareMetalHosts.

Since these hosts were provisioned earlier, they are in a special “externally provisioned” state, indicating that we enrolled them in management while they were already running in a desired state.  If changes are needed going forward, the baremetal-operator will be able to automate them.

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE           BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0   ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1   ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2   ipmi://192.168.111.1:6232                      true
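
For reference, an already-running host is enrolled in this state by declaring that fact in its spec.  The YAML below is a hypothetical sketch rather than something taken from this demo; in particular, the externallyProvisioned field name and the Secret name are assumptions based on the baremetal-operator design documents:

---
apiVersion: metalkube.org/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-master-0
spec:
  online: true
  # Assumption: marks the host as provisioned outside of the operator's control
  externallyProvisioned: true
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: openshift-master-0-bmc-secret  # hypothetical Secret name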

Now suppose we’d like to expand this cluster by adding another bare metal host to serve as a worker node.  First we need to create a new BareMetalHost object that adds this new host to the inventory of hosts managed by the baremetal-operator.  Here’s the YAML for the new BareMetalHost:

---
apiVersion: v1
kind: Secret
metadata:
  name: openshift-worker-0-bmc-secret
type: Opaque
data:
  username: YWRtaW4=
  password: cGFzc3dvcmQ=

---
apiVersion: metalkube.org/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-worker-0
spec:
  online: true
  bmc:
    address: ipmi://192.168.111.1:6233
    credentialsName: openshift-worker-0-bmc-secret
  bootMACAddress: 00:ab:4f:d8:9e:fa
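
The username and password values in the Secret are base64 encoded, as Kubernetes requires for Secret data.  The values above correspond to admin / password, the credentials used for this demo environment, and were generated like this:

$ echo -n 'admin' | base64
YWRtaW4=
$ echo -n 'password' | base64
cGFzc3dvcmQ=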

Now to add the BareMetalHost and its IPMI credentials Secret to the cluster:

$ kubectl create -f worker_crs.yaml 
secret/openshift-worker-0-bmc-secret created
baremetalhost.metalkube.org/openshift-worker-0 created

The list of BareMetalHosts now reflects a new host in the inventory that is ready to be provisioned.  It will remain in this “ready” state until it is claimed by a new Machine object.

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE           BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0   ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1   ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2   ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       ready                                      ipmi://192.168.111.1:6233   unknown            true

We have a MachineSet already created for workers, but it is currently scaled down to 0.

$ kubectl get machinesets
NAME              DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-worker-0   0         0                             24h

We can scale this MachineSet to 1 to indicate that we’d like a worker provisioned.  The baremetal cluster-api provider will then look for an available BareMetalHost, claim it, and trigger provisioning of that host.

$ kubectl scale machineset ostest-worker-0 --replicas=1

After the new Machine was created, our cluster-api provider claimed the available host and triggered it to be provisioned.

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE                 BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0         ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1         ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2         ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       provisioning             ostest-worker-0-jmhtc   ipmi://192.168.111.1:6233   unknown            true

This process takes some time.  Under the hood, the baremetal-operator is driving Ironic through a provisioning process.  This begins with wiping disks to ensure the host comes up in a clean state.  It will eventually write the desired OS image to disk and then reboot into that OS.  When complete, a new Kubernetes Node will register with the cluster.
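
While that runs, the host’s progress can be followed with ordinary kubectl tooling; nothing Metal³-specific is required:

$ kubectl get baremetalhosts -w
$ kubectl describe baremetalhost openshift-worker-0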

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE                 BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0         ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1         ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2         ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       provisioned              ostest-worker-0-jmhtc   ipmi://192.168.111.1:6233   unknown            true     

$ kubectl get nodes
NAME       STATUS   ROLES    AGE   VERSION
master-0   Ready    master   24h   v1.13.4+d4ce02c1d
master-1   Ready    master   24h   v1.13.4+d4ce02c1d
master-2   Ready    master   24h   v1.13.4+d4ce02c1d
worker-0   Ready    worker   68s   v1.13.4+d4ce02c1d

A screencast demonstrating this process is available, as well.

Removing a bare metal host from the cluster is very similar.  We just have to scale this MachineSet back down to 0.

$ kubectl scale machineset ostest-worker-0 --replicas=0

Once the Machine has been deleted, the baremetal-operator will deprovision the bare metal host.

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE           BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0   ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1   ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2   ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       deprovisioning                             ipmi://192.168.111.1:6233   unknown            false

Once the deprovisioning process is complete, the bare metal host will be back to its “ready” state, available in the host inventory to be claimed by a future Machine object.

$ kubectl get baremetalhosts
NAME                 STATUS   PROVISIONING STATUS      MACHINE           BMC                         HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ostest-master-0   ipmi://192.168.111.1:6230                      true     
openshift-master-1   OK       externally provisioned   ostest-master-1   ipmi://192.168.111.1:6231                      true     
openshift-master-2   OK       externally provisioned   ostest-master-2   ipmi://192.168.111.1:6232                      true     
openshift-worker-0   OK       ready                                      ipmi://192.168.111.1:6233   unknown            false

Getting Involved

All development is happening on GitHub.  We have a metal3-dev mailing list and use #cluster-api-baremetal on Kubernetes Slack to chat.  Occasional project updates are posted to @metal3_io on Twitter.

OVS 2.6 and The First Release of OVN

In January of 2015, the Open vSwitch team announced that they planned to start a new project within OVS called OVN (Open Virtual Network).  The timing could not have been better for me as I was looking around for a new project.  I dove in with a goal of figuring out whether OVN could be a promising next generation of Open vSwitch integration for OpenStack and have been contributing to it ever since.

OVS 2.6.0 has now been released, and it includes the first non-experimental version of OVN.  As a community, we have also built integration with OpenStack, Docker, and Kubernetes.

OVN is a system to support virtual network abstraction.  It complements the existing capabilities of OVS by adding native support for virtual network abstractions such as virtual L2 and L3 overlays and security groups.

Some high level features of OVN include:

  • Provides virtual networking abstraction for OVS, implemented using L2 and L3 overlays, but can also manage connectivity to physical networks
  • Supports flexible ACLs (security policies) implemented using flows that use OVS connection tracking
  • Native support for distributed L3 routing using OVS flows, with support for both IPv4 and IPv6
  • ARP and IPv6 Neighbor Discovery suppression for known IP-MAC bindings
  • Native support for NAT and load balancing using OVS connection tracking
  • Native fully distributed support for DHCP
  • Works with any OVS datapath (such as the default Linux kernel datapath, DPDK, or Hyper-V) that supports all required features, namely Geneve tunnels and OVS connection tracking (see the datapath feature list in the FAQ for details)
  • Supports L3 gateways from logical to physical networks
  • Supports software-based L2 gateways
  • Supports TOR (Top of Rack) based L2 gateways that implement the hardware_vtep schema
  • Can provide networking for both VMs and containers running inside of those VMs, without a second layer of overlay networking
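
To give a sense of the abstraction, here is a minimal sketch of creating a logical switch with a single port using ovn-nbctl; the names and addresses are made up for illustration:

$ ovn-nbctl ls-add sw0
$ ovn-nbctl lsp-add sw0 sw0-port1
$ ovn-nbctl lsp-set-addresses sw0-port1 "00:00:00:00:00:01 10.0.0.2"
$ ovn-nbctl show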

Support for large scale deployments is a key goal of OVN.  So far, we have seen physical deployments of several hundred nodes.  We’ve also done some larger scale testing by simulating deployments of thousands of nodes using the ovn-scale-test project.

OVN Architecture

Components

[Diagram: OVN architecture]

OVN is a distributed system.  There is a local SDN controller that runs on every host, called ovn-controller.  All of the controllers are coordinated through the southbound database.  There is also a centralized component, ovn-northd, that processes high level configuration placed in the northbound database. OVN’s architecture is discussed in detail in the ovn-architecture document.
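
In a running deployment, both databases can be inspected directly: ovn-nbctl show summarizes the logical configuration in the northbound database, while ovn-sbctl show lists the chassis and port bindings recorded in the southbound database.

$ ovn-nbctl show
$ ovn-sbctl show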

OVN uses databases for its control plane. One benefit is that scaling databases is a well understood problem.  OVN currently makes use of ovsdb-server as its database.  The use of ovsdb-server is particularly convenient within OVN, as it introduces no new dependencies: ovsdb-server is already in use everywhere OVS is used.  However, the project is also currently considering adding support for, or fully migrating to, etcd v3, since v3 includes all of the features we wanted for our system.

We have also found that this database driven architecture is much more reliable than RPC based approaches taken in other systems we have worked with.  In OVN, each instance of ovn-controller is always working with a consistent snapshot of the database.  It maintains a connection to the database and gets a feed of relevant updates as they occur.  If connectivity is interrupted, ovn-controller will always catch back up to the latest consistent snapshot of the relevant database contents and process them.

Logical Flows

OVN introduces a new intermediary representation of the system’s configuration called logical flows.  A typical centralized model would take the desired high level configuration, calculate the required physical flows for the environment, and program the switches on each node with those physical flows.  OVN breaks this problem up into a couple of steps.  It first calculates logical flows, which are similar to physical OpenFlow flows in their expressiveness, but operate only on logical entities.  The logical flows for a given network are identical across the whole environment.  These logical flows are then distributed to the local controller on each node, ovn-controller, which converts logical flows to physical flows.  This means that some deployment-wide computation is done once and the node-specific computation is fully distributed and done local to the node it applies to.
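
Both layers can be observed on a live system: ovn-sbctl lflow-list dumps the logical flows from the southbound database, and ovs-ofctl dump-flows on a hypervisor shows the physical OpenFlow flows that ovn-controller produced from them (br-int is the usual name of the integration bridge).

$ ovn-sbctl lflow-list
$ ovs-ofctl dump-flows br-int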

Logical flows have also proven to be powerful when it comes to implementing features.  As we’ve built up support for new capabilities in the logical flow syntax, most features are now implemented at the logical flow layer, which is much easier to work with than physical flows.

Data Path

OVN implements features natively in OVS wherever possible.  One such example is the implementation of security policies using OVS+conntrack integration.  I wrote about this in more detail previously.  This approach has led to significant data path performance improvements as compared to previous approaches.  The other area this makes a huge impact is how OVN implements distributed L3 routing.  Instead of combining OVS with several other layers of technology, we provide L3 routing purely with OVS flows.  In addition to the performance benefits, we also find this to be much simpler than the alternative approaches that other projects have taken to build routing on top of OVS.  Another benefit is that all of these features work with OVS+DPDK since we don’t rely on Linux kernel-specific features.
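
As an illustration of these conntrack-based security policies, here is a small sketch using the sw0 switch from the earlier example; allow-related ACLs like this one are implemented with OVS connection tracking:

$ ovn-nbctl acl-add sw0 to-lport 1000 'ip4 && tcp.dst == 22' allow-related
$ ovn-nbctl acl-list sw0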

Integrations

OpenStack

Integration with OpenStack was developed in parallel with OVN itself.  The OpenStack networking-ovn project contains an ML2 driver for OpenStack Neutron that provides integration with OVN.  It differs from Neutron’s original OVS integration in some significant ways.  It no longer makes use of the Neutron Python agents as all equivalent functionality has been moved into OVN.  As a result, it no longer uses RabbitMQ.  Neutron’s use of RabbitMQ for RPC has been replaced by OVN’s database driven control plane.  The following diagram gives a visual representation of the architecture of Neutron using OVN.  Even more detail can be found in our documented reference architecture.

[Diagram: OpenStack Neutron with OVN architecture]

There are a few different ways to test out OVN integration with OpenStack.  The most popular development environment for OpenStack is called DevStack.  We provide integration with DevStack, including some instructions on how to do simple testing with DevStack.
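
As a rough sketch, enabling the plugin in a DevStack local.conf looks something like the following; treat the repository URL as an assumption and refer to the instructions in the networking-ovn repository for the authoritative sample configuration:

[[local|localrc]]
# Assumed repository location; see the networking-ovn devstack documentation for the current URL
enable_plugin networking-ovn https://git.openstack.org/openstack/networking-ovn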

If you’re a Vagrant user, networking-ovn includes a vagrant setup for doing multi-node testing of OVN using DevStack.

The OpenStack TripleO deployment project includes support for OVN as of the OpenStack Newton release.

Finally, we also have manual installation instructions to help with integrating OVN into your own OpenStack environment.

Kubernetes

There is active development on a CNI plugin for OVN to be used with Kubernetes.  One of the key goals for OVN has been to support containers from the beginning, not just VMs.  Some important features were added to OVN to help support this integration.  For example, ovn-kubernetes makes use of OVN’s load balancing support, which is built on native load balancing support in OVS.
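
For a sense of what that looks like at the OVN layer, here is a hedged sketch of creating a load balancer and attaching it to a logical switch, assuming an ovn-nbctl new enough to provide the lb-add and ls-lb-add commands; the VIP and backend addresses are made up:

$ ovn-nbctl lb-add lb0 10.0.0.100:80 10.0.0.2:8080,10.0.0.3:8080
$ ovn-nbctl ls-lb-add sw0 lb0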

The README in that repository contains an overview, as well as instructions on how to use it.  There is also support for running an ovn-kubernetes environment using vagrant.

Docker

There is OVN integration with Docker networking, as well.  This currently resides in the main OVS repo, though it could be split out into its own repository in the future, similar to ovn-kubernetes.

Getting Involved

We would love feedback on your experience trying out OVN.  Here are some ways to get involved and provide feedback:

  • OVS and OVN are discussed on the OVS discuss mailing list.
  • OVN development occurs on the OVS development mailing list.
  • OVS and OVN are discussed in #openvswitch on the Freenode IRC network.
  • Development of the OVN Kubernetes integration occurs on GitHub but can be discussed on either the Open vSwitch IRC channel or discuss mailing list.
  • Integration of OVN with OpenStack is discussed in #openstack-neutron-ovn on Freenode, as well as the OpenStack development mailing list.