Running Kubernetes over Kubernetes was a good idea for the stateless components of the control plane… but what about etcd?
In our previous post, we described the Kubinception architecture and how we run Kubernetes over Kubernetes for the stateless components of the customer clusters’ control planes. But what about the stateful component, etcd?
The need is clear: each customer cluster needs access to an etcd to be able to store and retrieve data. The whole question is where and how to deploy that etcd to make it available to every customer cluster.
The simplest idea is not always the right one
The first approach would be to simply follow the Kubinception logic: for each customer cluster, deploy an etcd cluster as pods running on the admin cluster.
This full Kubinception approach has the merit of being simple; it looks like an extension of what we are doing with the stateless components. But when looking at it in detail, it shows its flaws: deploying an etcd cluster is not as easy and straightforward as deploying stateless components, and since etcd is critical to cluster operation we couldn’t simply handle it manually, we needed an automated approach to manage it at a higher level.
Using the operator
We weren’t the only ones to think that the complexity of deploying and operating an etcd cluster on Kubernetes was excessive; the people at CoreOS had noticed it, and in 2016 they released an elegant solution to the problem: the etcd operator.
An operator is a specific controller that extends the Kubernetes API to easily create, configure and operate instances of complex (often distributed) stateful applications on Kubernetes. For the record, the concept of operator was introduced by CoreOS with the etcd operator.
The etcd operator manages etcd clusters deployed to Kubernetes and automates operational tasks: creation, destruction, resizing, failover, rolling upgrades, backups…
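To make this concrete, here is a minimal sketch of how a cluster is requested from the operator, close to the example in the etcd operator documentation; the cluster name and etcd version are illustrative placeholders:

```yaml
# Ask the etcd operator for a 3-member etcd cluster through its custom resource.
# The cluster name and version below are illustrative placeholders.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "customer-cluster-etcd"
spec:
  size: 3            # the operator keeps three etcd members running
  version: "3.2.13"  # and can roll the cluster to a new version
```

The operator watches these objects and reconciles the actual etcd pods towards the declared size and version, which is exactly the kind of higher-level automation we were looking for.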
As in the previous approach, the etcd cluster for each customer cluster is deployed as pods in the admin cluster. By default, the etcd operator deploys the etcd cluster using local, non-persistent storage for each etcd pod. That means that if all the pods die (unlikely) or are re-scheduled and spawned on another node (far more likely), we could lose the etcd data. And without it, the customer’s Kubernetes cluster is bricked.
The etcd operator can be configured to use persistent volumes (PV) to store the data, so theoretically the problem was solved. Theoretically, because the volume management wasn’t mature enough when we tested it, and if an etcd pod was killed and re-scheduled, the new pod failed to retrieve its data from the PV. So the risk of total quorum loss, and of bricking the customer cluster, was still there with the etcd operator.
In brief, we worked quite a bit with the etcd operator, and we found it not mature enough for our use case.
The StatefulSet
Setting the operator aside, another solution was to use a StatefulSet, a kind of Deployment well suited to managing distributed stateful applications.
There is an official etcd Helm chart that deploys etcd clusters as StatefulSets. It trades off some of the operator’s flexibility and user-friendliness for a more robust PV management that guarantees that a re-scheduled etcd pod will retrieve its data.
The etcd StatefulSet is less convenient than the etcd operator, as it doesn’t offer an easy API for operations such as scaling, failover, rolling upgrades or backups. In exchange you get some real improvements in the PV management: the StatefulSet maintains a sticky identity for each of the etcd pods, and that persistent identifier is maintained across any rescheduling, making it simple to pair each pod with its PV.
The system is so resilient that, even if we lose all the etcd pods, when Kubernetes re-schedules them they will find their data and the cluster will keep working without problems.
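To illustrate the idea, here is a trimmed-down sketch of such a StatefulSet, not the actual chart: the names, image version and storage size are assumptions, and the etcd clustering flags are elided:

```yaml
# Sketch of an etcd StatefulSet: each pod gets a sticky identity (etcd-0,
# etcd-1, etcd-2) and its own PVC, re-attached on every rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd        # headless Service (not shown) giving stable per-pod DNS names
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.2.13   # illustrative version
          ports:
            - containerPort: 2379   # client traffic
            - containerPort: 2380   # peer traffic
          # The real chart also sets the etcd clustering flags
          # (--name, --initial-cluster, ...) derived from the pod's sticky identity.
          volumeMounts:
            - name: data
              mountPath: /var/run/etcd
  volumeClaimTemplates:     # one PVC per pod identity
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi    # illustrative size
```

Because the volumeClaimTemplates create one PVC per pod identity, a re-scheduled etcd-1 always re-attaches to etcd-1’s volume, which is precisely the guarantee we were missing with the operator.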
Persistent Volumes, latency, and a simple cost calculation
The etcd StatefulSet seemed a good solution… until we began to use it in an intensive way. The etcd StatefulSet uses PVs, i.e. network storage volumes. And etcd is rather sensitive to network latency; its performance degrades heavily when faced with it.
Even if the latency could be kept under control (and that’s a big if), the more we thought about the idea, the more it seemed an expensive solution. For each customer cluster we would need to deploy three etcd pods (effectively doubling the pod count) and three associated PVs; it doesn’t scale well for a managed service.
In the OVH Managed Kubernetes service we bill our customers according to the number of worker nodes they consume, i.e. the control plane is free. That means that for the service to be competitive it’s important to keep the resources consumed by the control planes under control, hence the need not to double the pod count with etcd.
With Kubinception we had tried to think outside the box; it seemed that for etcd we needed to get out of that box once again.
Multi-tenant etcd cluster
If we didn’t want to deploy etcd inside Kubernetes, the alternative was to deploy it outside. We chose to deploy a multi-tenant etcd cluster on dedicated servers: all the customer clusters would use the same etcd, with every API server getting its own space in this multi-tenant etcd cluster.
With this solution the resiliency is guaranteed by the usual etcd mechanisms, there is no latency problem as the data lives on the local disk of each etcd node, and the pod count remains under control, so it solves the main problems we had with the other solutions. The trade-off here is that we need to install and operate this external etcd cluster, and manage the access control to be sure that every API server can only access its own data.
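As an illustration of the kind of wiring involved, here is a sketch of the etcd-related part of a customer API server manifest; the endpoints, certificate paths and prefix are hypothetical placeholders, and all the non-etcd flags are omitted:

```yaml
# Sketch: a customer API server pointed at the shared, external etcd.
# Endpoints, certificate paths and prefix are hypothetical placeholders;
# every non-etcd kube-apiserver flag is omitted for brevity.
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver-customer-a
spec:
  containers:
    - name: kube-apiserver
      image: k8s.gcr.io/kube-apiserver:v1.11.2   # illustrative version
      command:
        - kube-apiserver
        - --etcd-servers=https://etcd-1.example.internal:2379,https://etcd-2.example.internal:2379,https://etcd-3.example.internal:2379
        - --etcd-prefix=/customer-cluster-a       # this cluster's own keyspace
        - --etcd-cafile=/etc/etcd/ca.crt          # trust the etcd server certificates
        - --etcd-certfile=/etc/etcd/customer-cluster-a.crt   # per-cluster client certificate
        - --etcd-keyfile=/etc/etcd/customer-cluster-a.key
```

Giving each API server its own key prefix and client certificate is one way to keep the keyspaces separate; etcd’s own authentication and role-based access control can then restrict each client to its own prefix.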
What’s next?
In the next posts in the Kubernetes series we will dive into other aspects of the construction of OVH Managed Kubernetes, and we will give the keyboard to some of our beta customers to narrate their experience using the service.
Next week let’s focus on another topic: we will deal with the TSL query language, and why we created and open-sourced it…