<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kubinception Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/kubinception/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/kubinception/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 12 Jul 2019 09:24:07 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Kubinception Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/kubinception/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Kubinception and etcd</title>
		<link>https://blog.ovhcloud.com/kubinception-and-etcd/</link>
		
		<dc:creator><![CDATA[Horacio Gonzalez]]></dc:creator>
		<pubDate>Fri, 08 Feb 2019 16:16:38 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[etcd]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Kubinception]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14425</guid>

					<description><![CDATA[In our previous post, we described the Kubinception architecture, and how we run Kubernetes over Kubernetes for the stateless components of the customer clusters' control planes. But what about the stateful component, the etcd?]]></description>
										<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p style="text-align: left;">Running Kubernetes over Kubernetes was a good idea for the stateless components of the control plane&#8230; but what about the etcd?</p></blockquote>



<div class="wp-block-image"><figure class="aligncenter"><img fetchpriority="high" decoding="async" width="885" height="385" src="/blog/wp-content/uploads/2019/02/IMG_0035.png" alt="Kubinception &amp; etcd" class="wp-image-14435" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0035.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0035-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0035-768x334.png 768w" sizes="(max-width: 885px) 100vw, 885px" /></figure></div>



<p>In our previous post, we described the Kubinception architecture, and how we run Kubernetes over Kubernetes for the stateless components of the customer clusters&#8217; control planes. But what about the stateful component, the etcd?</p>



<p>The need is clear: each customer cluster needs access to an etcd, to be able to store and retrieve data. The whole question is where and how to deploy etcd to make it available to every customer cluster.</p>



<h3 class="wp-block-heading">The simplest idea is not always the best one</h3>



<p>The first approach would be to simply follow the <strong>Kubinception logic</strong>: for each customer cluster, deploying an <strong>etcd cluster as pods</strong> running on the admin cluster.</p>



<div class="wp-block-image wp-image-14429 size-full"><figure class="aligncenter"><img decoding="async" width="885" height="614" src="/blog/wp-content/uploads/2019/02/IMG_0021.png" alt="" class="wp-image-14429" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0021.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0021-300x208.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0021-768x533.png 768w" sizes="(max-width: 885px) 100vw, 885px" /><figcaption>Full Kubinception for the etcd: deploying etcd cluster as pods</figcaption></figure></div>



<p>This full Kubinception approach has the merit of being <strong>simple</strong>; it seems like an extension of what we are doing with the stateless components. But when looking at it in detail, it shows its <strong>flaws</strong>: deploying an etcd cluster is not as easy and straightforward as deploying a stateless application, and since etcd is critical to the cluster operation, <strong>we couldn&#8217;t simply handle it manually</strong>; we needed an <strong>automated approach</strong> to manage it at a higher level.</p>



<h3 class="wp-block-heading">Using the operator</h3>



<p>We weren&#8217;t the only ones to think that the complexity of deploying and operating an etcd cluster on Kubernetes was excessive: the people at CoreOS had noticed it, and in 2016 they released <strong>an elegant solution</strong> to the problem: the <a href="https://www.diycode.cc/projects/coreos/etcd-operator" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>etcd operator</strong>.</a></p>



<p>An operator is a specific controller that <strong>extends the Kubernetes API</strong> to easily create, configure and operate instances of <strong>complex (often distributed) stateful applications</strong> on Kubernetes. For the record,&nbsp;the concept of operator was introduced by CoreOS with the&nbsp;<a href="https://coreos.com/blog/introducing-the-etcd-operator.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">etcd operator</a>.</p>
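
<p>To give an idea of how the API extension looks, the etcd operator defines an <code>EtcdCluster</code> custom resource; a minimal manifest, in the style of the operator&#8217;s own examples (the name and version below are illustrative, not our actual configuration), looks roughly like this:</p>

```yaml
# A minimal EtcdCluster custom resource, in the style of the
# etcd operator's examples; name and version are illustrative.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: EtcdCluster
metadata:
  name: customer-cluster-etcd
spec:
  size: 3            # the operator creates and maintains a 3-member etcd cluster
  version: "3.2.13"  # and can roll the members to a new version on spec change
```

<p>Declaring the desired cluster this way lets the operator, not a human, handle member creation, replacement and upgrades.</p>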



<div class="wp-block-image wp-image-14428 size-full"><figure class="aligncenter"><img decoding="async" width="885" height="630" src="/blog/wp-content/uploads/2019/02/IMG_0022.png" alt="Using the etcd operator to deploy the etcd clusters" class="wp-image-14428" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0022.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0022-300x214.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0022-768x547.png 768w" sizes="(max-width: 885px) 100vw, 885px" /><figcaption>Using the etcd operator to deploy the etcd clusters</figcaption></figure></div>



<p>The etcd operator manages etcd clusters deployed to&nbsp;<a href="https://www.ovh.com/fr/kubernetes/" rel="nofollow" data-wpel-link="exclude">Kubernetes</a>&nbsp;and automates operational tasks: creation, destruction, resizing, failover, rolling upgrades, backups&#8230;</p>



<p>As in the previous solution, the etcd cluster for each customer cluster is <strong>deployed as pods</strong> in the admin cluster. By default, the etcd operator deploys the etcd cluster using <strong>local, non-persistent storage</strong> for each etcd pod. That means that if all the pods die (unlikely), or are re-scheduled and spawned on another node (far more likely), we could <strong>lose the etcd data</strong>. And without it, the customer&#8217;s Kubernetes cluster is bricked.</p>



<p>The etcd operator can be <strong>configured to use&nbsp;<a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">persistent volumes</a>&nbsp;(PV)</strong> to store the data, so theoretically the problem was solved. Theoretically, because the volume management wasn&#8217;t mature enough when we tested it, and if an etcd pod was killed and re-scheduled, the new pod failed to retrieve its data from the PV. So the <strong>risk of total quorum loss</strong>, and the bricking of the customer cluster, was still there with the etcd operator.</p>



<p>In brief, we worked quite a bit with the etcd operator, and we found it not mature enough for our use case.</p>



<h3 class="wp-block-heading">The StatefulSet</h3>



<p>Setting aside the operator, another solution was to use a <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StatefulSet</a>, a kind of distributed <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Deployment</a> well suited for managing distributed stateful applications.</p>



<p>There is an <a href="https://github.com/helm/charts/tree/master/incubator/etcd" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official ETCD Helm chart</a>&nbsp;that allows you to deploy etcd clusters as StatefulSets, trading off some of the operator&#8217;s flexibility and user-friendliness for more robust PV management, which guarantees that a re-scheduled etcd pod will retrieve its data.</p>



<div class="wp-block-image size-full wp-image-14427"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="605" src="/blog/wp-content/uploads/2019/02/IMG_0023.png" alt="" class="wp-image-14427" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0023.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0023-300x205.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0023-768x525.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Using StatefulSets for the etcd clusters</figcaption></figure></div>



<p>The etcd StatefulSet is <strong>less convenient</strong> than the etcd operator, as it doesn&#8217;t offer an <strong>easy API for operations</strong> such as scaling, failover, rolling upgrades or backups. In exchange, you get some real <strong>improvements in PV management</strong>. The StatefulSet maintains a <strong>sticky identity</strong> for each of the etcd pods, and that persistent identifier is <strong>maintained across any rescheduling</strong>, making it simple to pair each pod with its PV.</p>
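
<p>As a sketch of how that sticky identity pairs pods with volumes, a StatefulSet declares its storage through <code>volumeClaimTemplates</code>; the names, image and sizes below are illustrative, not our actual configuration:</p>

```yaml
# Illustrative etcd StatefulSet: each pod gets a stable name (etcd-0,
# etcd-1, etcd-2) and a claim derived from it (data-etcd-0, ...), so a
# re-scheduled pod is re-attached to the same persistent volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd        # headless Service providing stable per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.2.13
        volumeMounts:
        - name: data
          mountPath: /var/run/etcd
  volumeClaimTemplates:    # one PV-backed claim per pod, kept across rescheduling
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 8Gi
```

<p>The claim named <code>data-etcd-0</code> always follows the pod named <code>etcd-0</code>, wherever it is rescheduled.</p>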



<p>The system is so <strong>resilient</strong> that, even if we lose all the etcd pods, when Kubernetes re-schedules them they will find their data, and the cluster will continue working without problems.</p>



<h3 class="wp-block-heading">Persistent Volumes, latency, and a simple cost calculation</h3>



<p>The etcd StatefulSet <strong>seemed a good solution</strong>&#8230; until we began to use it in an intensive way. The etcd StatefulSet uses PVs, i.e. <strong>network storage</strong> volumes. And etcd is rather sensitive to network <strong>latency</strong>; its performance degrades heavily when faced with latency.</p>



<p>Even if the latency could be kept under control (and that&#8217;s a big <em>if</em>), the more we thought about the idea, the more it seemed an <strong>expensive</strong> solution. For each customer cluster we would need to deploy three pods (effectively <strong>doubling the pod count</strong>) and three associated PVs; it doesn&#8217;t scale well for a managed service.</p>



<p>In the <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">OVH Managed Kubernetes service</a> we bill our customers according to the number of worker nodes they consume, i.e. the control plane is free. That means that, for the service to be competitive, it&#8217;s important to keep the resources consumed by the control planes under control, hence the need not to double the pod count because of etcd.</p>



<p>With Kubinception we had tried to think <strong>outside the box</strong>; it seemed that for etcd we needed to get out of that box once again.</p>



<h3 class="wp-block-heading">Multi-tenant etcd cluster</h3>



<p>If we didn&#8217;t want to deploy etcd inside Kubernetes, the alternative was to deploy it outside. We chose to deploy a <strong>multi-tenant etcd cluster on dedicated servers</strong>. All the customer clusters would use the same etcd, with every API server getting its own space in this multi-tenant etcd cluster.</p>



<div class="wp-block-image size-full wp-image-14426"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="696" src="/blog/wp-content/uploads/2019/02/IMG_0024.png" alt="" class="wp-image-14426" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0024.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0024-300x236.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0024-768x604.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Multi-tenant bare-metal dedicated etcd cluster</figcaption></figure></div>



<p>With this solution the <strong>resiliency is guaranteed</strong> by the usual etcd mechanisms, there is <strong>no latency</strong> problem as the data is on the local disk of each etcd node, and the <strong>pod count</strong> remains under control, so it solves the main problems we had with the other solutions. The <strong>trade-off</strong> here is that we need to <strong>install and operate</strong> this external etcd cluster, and manage <strong>access control</strong> to be sure that every API server accesses only its own data.</p>
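
<p>One way to picture the access-control side: each customer API server is pointed at the shared etcd with its own key prefix and its own client certificate. The flags below are standard kube-apiserver options, but the endpoints, prefix and file paths are hypothetical, not our actual setup:</p>

```yaml
# Hypothetical args for one customer's kube-apiserver pointing at the
# shared etcd cluster; endpoints, prefix and paths are illustrative.
command:
- kube-apiserver
- --etcd-servers=https://etcd-1.example:2379,https://etcd-2.example:2379,https://etcd-3.example:2379
- --etcd-prefix=/customer-cluster-a         # this API server reads and writes only under its own prefix
- --etcd-cafile=/etc/etcd/ca.crt
- --etcd-certfile=/etc/etcd/customer-a.crt  # per-tenant client certificate
- --etcd-keyfile=/etc/etcd/customer-a.key
```

<p>Combined with access rules on the etcd side, this keeps each customer&#8217;s data confined to its own keyspace.</p>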



<h3 class="graf graf--h3 wp-block-heading">What’s next?</h3>



<p>In the next posts in the Kubernetes series we will dive into other aspects of the construction of OVH Managed Kubernetes, and we will give the keyboard to some of our <strong>beta customers</strong> to narrate their <strong>experience</strong> using the service.</p>



<p>Next week let’s focus on another topic: we will deal with the <b><a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL</a></b> query language, and why we created and open-sourced it…</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Kubinception: using Kubernetes to run Kubernetes</title>
		<link>https://blog.ovhcloud.com/kubinception-using-kubernetes-to-run-kubernetes/</link>
		
		<dc:creator><![CDATA[Horacio Gonzalez]]></dc:creator>
		<pubDate>Fri, 25 Jan 2019 09:13:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Kubinception]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14319</guid>

					<description><![CDATA[When faced with the challenge of building a managed Kubernetes service at OVH, fully based on open-source tools, we had to take some tough design decisions. Today we review one of them…]]></description>
										<content:encoded><![CDATA[
<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p class="graf graf--h4">When faced with the challenge of building a managed Kubernetes service at OVH, fully based on open-source tools, we had to take some tough design decisions. Today we review one of&nbsp;them…</p></blockquote>



<div class="wp-block-image graf graf--figure"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="140" src="/blog/wp-content/uploads/2019/01/kubinception-header-300x140.jpg" alt="Kubinception" class="wp-image-14332" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/kubinception-header-300x140.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/kubinception-header-768x358.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/kubinception-header.jpg 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p class="graf graf--p">One of the most structural choices we made while building <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude"><strong class="markup--strong markup--p-strong">OVH Managed Kubernetes</strong></a> service was to deploy our customers’ clusters over our own ones. <strong class="markup--strong markup--p-strong">Kubinception</strong> indeed…</p>



<p class="graf graf--p">In this post, we relate our experience of running Kubernetes over Kubernetes, with hundreds of customers’ clusters. Why did we choose this architecture? What are the main stakes of such a design? What problems did we encounter? How did we deal with those issues? And, even more importantly, if we had to take the decision today, would we choose Kubinception again?</p>



<h3 class="graf graf--h3 wp-block-heading">How does a Kubernetes cluster&nbsp;work?</h3>



<p class="graf graf--p">To fully understand why we run Kubernetes on Kubernetes, we need at least a basic understanding of how a Kubernetes cluster works. A <a class="markup--anchor markup--p-anchor" href="https://kubernetes.io/docs/concepts/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://kubernetes.io/docs/concepts/" data-wpel-link="external">full explanation on this topic</a> is out of the context of this post, but let’s do a quick summary:</p>



<p class="graf graf--p">A working Kubernetes cluster is composed of:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">A <strong class="markup--strong markup--li-strong">control plane</strong> that makes global decisions about the cluster, and detects and responds to cluster events. This control plane is composed of several <a class="markup--anchor markup--li-anchor" href="https://kubernetes.io/docs/concepts/overview/components/#master-components" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://kubernetes.io/docs/concepts/overview/components/#master-components" data-wpel-link="external">master components</a>.</li><li class="graf graf--li">A set of <a class="markup--anchor markup--li-anchor" href="https://kubernetes.io/docs/concepts/architecture/nodes/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://kubernetes.io/docs/concepts/architecture/nodes/" data-wpel-link="external">nodes</a>, worker instances containing the services necessary to run <a class="markup--anchor markup--li-anchor" href="https://kubernetes.io/docs/concepts/workloads/pods/pod/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://kubernetes.io/docs/concepts/workloads/pods/pod/" data-wpel-link="external">pods</a>, with some node components running on every node, maintaining running pods and providing the Kubernetes runtime environment.</li></ul>






<div class="wp-block-image graf-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*SBeqRVXUkpAV1vB65yJ3Hg.png" alt="Simplified Kubernetes architecture"/><figcaption>Simplified Kubernetes architecture</figcaption></figure></div>






<h4 class="graf graf--h4 wp-block-heading">Master components</h4>



<p class="graf graf--p">In this category of components we have:</p>



<ul class="postList wp-block-list"><li class="graf graf--li"><strong class="markup--strong markup--li-strong">API Server</strong>: exposes the Kubernetes API. It is the entry point for the Kubernetes control plane.</li><li class="graf graf--li"><strong class="markup--strong markup--li-strong">Scheduler</strong>: watches newly created pods and selects a node for them to run on, managing resource allocation.</li><li class="graf graf--li"><strong class="markup--strong markup--li-strong">Controller-manager</strong>: runs controllers, control loops that watch the state of the cluster and move it towards the desired state.</li><li class="graf graf--li"><strong class="markup--strong markup--li-strong">ETCD</strong>: a consistent and highly-available key-value store used as Kubernetes&#8217; backing store for all cluster data. This topic deserves its own blog post, so we will speak about it in the coming weeks.</li></ul>



<h4 class="graf graf--h4 wp-block-heading">Node components</h4>



<p class="graf graf--p">In every node we have:</p>



<ul class="postList wp-block-list"><li class="graf graf--li"><strong class="markup--strong markup--li-strong">Kubelet</strong>: agent that makes sure that containers described in PodSpecs are running and healthy. It’s the link between the node and the control plane.</li><li class="graf graf--li"><strong class="markup--strong markup--li-strong">Kube-proxy</strong>: network proxy running in each node, enabling the Kubernetes service abstraction by maintaining network rules and performing connection forwarding.</li></ul>



<h3 class="graf graf--h3 wp-block-heading">Our goal: quick and painless cluster deployment</h3>



<p class="graf graf--p">How did we go from this simple Kubernetes architecture to a Kubernetes-over-Kubernetes one? The answer lies in one of our main goals in building the OVH Managed Kubernetes service: to be able to <strong class="markup--strong markup--p-strong">deploy clusters</strong> in the <strong class="markup--strong markup--p-strong">simplest</strong> and most <strong class="markup--strong markup--p-strong">automated</strong> way.</p>



<p class="graf graf--p">And we didn&#8217;t only want to deploy clusters, we wanted the deployed clusters to be:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">Resilient</li><li class="graf graf--li">Isolated</li><li class="graf graf--li">Cost-optimized</li></ul>



<h3 class="graf graf--h3 wp-block-heading">Kubinception: running Kubernetes over Kubernetes</h3>



<p class="graf graf--p">The idea is to use a Kubernetes cluster, which we call the admin cluster, to deploy the customer clusters.</p>



<p class="graf graf--p">Like every Kubernetes cluster, a customer cluster has a set of nodes and a control plane, composed of several master components (API server, scheduler…).</p>



<p class="graf graf--p">What we are doing is to <strong class="markup--strong markup--p-strong">deploy</strong> those customer cluster <strong class="markup--strong markup--p-strong">master components as pods</strong> in the admin cluster nodes.</p>
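
<p>As a sketch (with hypothetical names and flags, not our actual manifests), a customer cluster’s API server running on the admin cluster can be declared like any other workload:</p>

```yaml
# Illustrative Deployment: one customer cluster's API server running as
# an ordinary pod on the admin cluster. Names and flags are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver
  namespace: customer-cluster-a   # e.g. one admin-cluster namespace per customer cluster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-apiserver
  template:
    metadata:
      labels:
        app: kube-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: k8s.gcr.io/kube-apiserver:v1.13.2
        command:
        - kube-apiserver
        - --etcd-servers=https://etcd.example:2379   # the ETCD, covered in an upcoming post
        - --service-cluster-ip-range=10.96.0.0/12
```

<p>The customer cluster’s scheduler and controller-manager follow the same pattern: plain pods on the admin cluster.</p>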





<div class="wp-block-image graf-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*pFziI2YQgKEYx2qLWP68ww.png" alt="Simplified Kubinception architecture"/><figcaption>Simplified Kubinception architecture</figcaption></figure></div>






<p class="graf graf--p">So now we have the stateless components of the customer cluster control plane running as pods on the admin cluster nodes. We haven’t spoken about the ETCD yet, as we will cover it in a later post; for the moment, let’s only say that it is a dedicated component, living outside Kubernetes.</p>



<p class="graf graf--p">And the customer cluster worker nodes? They are normal Kubernetes nodes: OVH public cloud instances connecting to the customer cluster API server running in an admin cluster pod.</p>





<div class="wp-block-image graf-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*SqJhBXovAor41w3kTwu9rQ.png" alt="Customer cluster with nodes and ETCD"/><figcaption>Customer cluster with nodes and&nbsp;ETCD</figcaption></figure></div>






<p class="graf graf--p">Our goal is to manage lots of clusters, not only one, so how can we add another customer cluster? As you might expect, we deploy the new customer cluster’s control plane on the admin cluster nodes.</p>





<div class="wp-block-image graf-image"><figure class="aligncenter"><img decoding="async" src="https://cdn-images-1.medium.com/max/800/1*VJG5LJUK8Ou3MLYlsx3u5g.png" alt="Two customer clusters on Kubinception"/><figcaption>Two customer clusters on Kubinception</figcaption></figure></div>






<p class="graf graf--p">From the admin cluster point of view, we have simply deployed three new pods. Then we spawn some new node instances, connect an ETCD, and the cluster is up.</p>



<h3 class="graf graf--h3 wp-block-heading">If something can fail, it&nbsp;will</h3>



<p class="graf graf--p">We now have an architecture that allows us to quickly deploy new clusters. But if we go back to our goal, quick deployment was only half of it: we wanted the clusters to be <strong class="markup--strong markup--p-strong">resilient</strong>. Let’s begin with resiliency.</p>



<p class="graf graf--p">The customer cluster nodes are already resilient, as they are <strong class="markup--strong markup--p-strong">vanilla Kubernetes nodes</strong>, and the ETCD resiliency will be detailed in a specific blog post, so let’s look at the control plane resiliency, as it’s the specific part of our architecture.</p>



<p class="graf graf--p">And that’s the beauty of the Kubinception architecture: we deploy the customer clusters’ <strong class="markup--strong markup--p-strong">control planes</strong> as simple, standard, vanilla <strong class="markup--strong markup--p-strong">pods</strong> in our admin cluster. And that means they are as resilient as any other Kubernetes pod: if one of the customer cluster master components goes down, the controller-manager of the admin cluster will detect it, and the pod will be rescheduled and redeployed, without any manual action on our side.</p>



<h3 class="graf graf--h3 wp-block-heading">What better way to be sure that our Kubernetes is solid enough…</h3>



<p class="graf graf--p">Basing our Managed Kubernetes service on Kubernetes made us stumble upon facets of Kubernetes we hadn’t encountered before, and taught us lots of things about installing, deploying and operating Kubernetes. And all that knowledge and tooling was <strong class="markup--strong markup--p-strong">directly applied</strong> to our customers’ clusters, making the experience better for everybody.</p>



<h3 class="graf graf--h3 wp-block-heading">And what about&nbsp;scaling?</h3>



<p class="graf graf--p">The whole system has been designed from the ground up with this idea of scaling. The Kubernetes-over-Kubernetes architecture allows easy <strong class="markup--strong markup--p-strong">horizontal scaling</strong>: when an admin cluster begins to get too big, we can simply spawn a new one and deploy the next customer control planes there.</p>



<h3 class="graf graf--h3 wp-block-heading">What’s next?</h3>



<p class="graf graf--p">As this post is already long enough, I will leave the explanation of the <strong class="markup--strong markup--p-strong">ETCD</strong> for the next post in the series, in two weeks.</p>



<p class="graf graf--p">Next week let’s focus on another topic: we will deal with <a class="markup--anchor markup--p-anchor" rel="noopener noreferrer nofollow external" data-href="https://flink.apache.org/" href="https://flink.apache.org/" target="_blank" data-wpel-link="external">Apache Flink</a>, and how we use it for handling alerts at OVH scale…</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
