The OVHcloud Bastion - Part 1

Bastion? Are we talking about the indie game?

Not this time (although it’s a good game)!

At OVHcloud, a fair amount of our infrastructure is built on top of Linux boxes. We have a lot of different flavours, such as Debian, Ubuntu, Red Hat… and the list goes on. We even had good old Gentoos once! These all run on bare metal servers, on VMs, and in containers everywhere. As long as it has a CPU (or vCPU), we have probably booted some kind of Linux distro on it. But that’s not the whole story. We also had Solaris boxes, which later turned into OmniOS boxes, which have now turned into shiny FreeBSD boxes. We also have a lot of network devices, spread across different vendors and spanning a wide range of model generations.

As you’ve probably guessed, we have heterogeneous systems running to provide a handful of different services. But regardless of this heterogeneity, do you know what they all have in common? Yes, they all ping, but there’s something more interesting: they can all be administered through SSH.

The problem

SSH has been the de facto admin standard for quite some time now – replacing obsolete programs such as rlogin, which happily transferred your password in plaintext over the network – so we use it all the time, as most of the industry does.

There are two regular ways of using it: either you just type your account password when the remote server asks for it, which is more or less like rlogin (without your password being transmitted in plaintext over the wire); or you use public key authentication, by generating a so-called “keypair”, with the private key sitting on your desk (or in a smartcard), and the corresponding public key sitting on the remote server.

The issue is that neither of these two ways is really satisfactory in an enterprise context.

Password authentication

First, the password way. Well, we all already know that passwords suck. Either you pick one that is too easy to crack, or you pick a very complex one that you’ll never remember. This forces you to use a password manager that is protected by… a master password. Even strong passphrases, such as “Correct Horse Battery Staple”, are nothing more than an elaborate password in the end. They bring a whole range of problems, such as the fact that they’re always subject to bruteforce attacks, and some users might get hit by the password reuse plague. As a sysadmin, you never really sleep well when you know that the security of your systems is just one password away. Of course there are ways to mitigate the risk, such as forcing periodic password renewal, enforcing a minimum password length and/or complexity, or disabling an account after several failures. But you’re just putting an additional burden on your users and still not achieving a satisfactory level of security.
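To put rough numbers on the bruteforce point, here is a small Python sketch that computes the size of the search space for a secret. The 2048-word list is an assumption borrowed from the well-known comic that the passphrase above comes from, and the guess rate is a purely illustrative figure for an offline attack, not a measurement.

```python
import math

def search_space_bits(choices: int, length: int) -> float:
    """Entropy in bits of a secret made of `length` random picks among `choices`."""
    return length * math.log2(choices)

def hours_to_exhaust(bits: float, guesses_per_second: float) -> float:
    """Worst-case time, in hours, to try every candidate at the given rate."""
    return (2 ** bits) / guesses_per_second / 3600

# Four random words from an assumed 2048-word list.
passphrase_bits = search_space_bits(2048, 4)

# Eight random lowercase letters, for comparison.
password_bits = search_space_bits(26, 8)

# Hypothetical attacker rate: one billion guesses per second.
passphrase_hours = hours_to_exhaust(passphrase_bits, 1e9)
```

Under these assumptions, even the four-word passphrase only buys a bounded amount of time: it is still a single secret that can be guessed, which is the point made above.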

Public key authentication

Second, the pubkey way. It goes a long way to fixing password issues, but then the problem becomes scalability. Pushing your public key to your home server is easy, but when you have tens of thousands of servers/devices, as well as thousands of employees, administering some always-changing subset of said servers/devices, it quickly becomes complicated. Indeed, doing it properly and maintaining it in the long-term in an enterprise context is a real challenge.

PKI-based authentication

For the sake of completeness – because I can hear you from here, SSH gurus! – there is a third way in recent versions of SSH servers, namely authentication based on a PKI with a trusted Certificate Authority (CA). You install the public certificate of your CA on all your servers, and they’ll accept any connection authenticated by a certificate issued by said CA, relying on the subjectName of the certificate. This specifies which account can be accessed on the server, among other things. This is a very centralized way of managing your accesses, with all the power in the hands of whoever controls your CA. It can be highly successful if done very carefully, with a lot of security and processes around the certificate delivery workflows. Managing a CA correctly is no joke and can bite you quite hard if done improperly. This also happens to be a somewhat recent addition to OpenSSH, and given the heterogeneity we outlined above, it would have left a lot of systems on the side. There is also another reason why we haven’t chosen this method, but before diving into it, let’s talk about our needs.

What we needed

At OVHcloud, we have various technical teams that manage their own infrastructure, rather than relying on a generic internal IT department. This principle is part of the company culture and DNA. It does have its drawbacks, such as the added complexity of maintaining an exhaustive and up-to-date inventory of our own assets, but its advantages far outweigh them: multiple teams can iterate faster, as they can use existing OVHcloud products as building blocks to create new, innovative products. This must not, however, come at the cost of security, which is fundamental to everything we do at OVHcloud.

But how did we manage to develop security systems around the SSH management of all these servers without getting in the way of various operational teams?

A few important items are required:

DELEGATION
- Any kind of centralized “security team” responsible for handling access clearances for the whole company is a no-go. It doesn’t scale, no matter how you do it.
- Managers or technical leads should be completely autonomous in managing their own perimeter, both in terms of the servers/systems/devices it contains and of the people who are granted access to it.
- Moving a team member to another team, or out of the company, should be a completely seamless process, regardless of the kind of systems this person had access to (remember the heterogeneity above?).
- Giving access to a new team member must also be seamless, so they can get their hands dirty as fast as possible.
- Temporarily granting access to somebody outside of the team (or company) to a given asset for a limited amount of time should be easy.
- All of these actions should be easy to do autonomously.

AUDITABILITY & TRACEABILITY
- Every action must be logged with a lot of details, be it a clearance modification or a connection to a system, whether successful or not. We also want these logs to be pushable to a SIEM.
- Every terminal session should be recorded. Yup, you read correctly. This is the kind of feature you don’t ever need… until you do.
- It must be easy to generate reports for conducting access reviews.
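
As an illustration of what “logged with a lot of details” could look like, here is a minimal Python sketch of a structured audit event that a log shipper could forward to a SIEM. The function name and every field name are hypothetical and chosen for this example; they are not the bastion’s actual log format.

```python
import json
from datetime import datetime, timezone

def audit_event(account, action, target, allowed, when=None):
    """Serialize one audit record as a JSON line (hypothetical schema)."""
    when = when or datetime.now(timezone.utc)
    record = {
        "timestamp": when.isoformat(),  # always timezone-aware, in UTC
        "account": account,             # who performed the action
        "action": action,               # e.g. a connection or a clearance change
        "target": target,               # what the action applied to
        "allowed": allowed,             # denied attempts are logged too
    }
    return json.dumps(record, sort_keys=True)

line = audit_event(
    "john", "ssh-connect", "root@server42", True,
    when=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
```

One JSON object per line keeps the records easy to ship and to aggregate later, for instance when building an access review report.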

SECURITY & RESILIENCE
- We must bring more security than bare direct SSH access, at no additional cost.
- Any component that we have to add to answer those needs must be up and running at all times, even (and especially) when the rest of your infrastructure is falling apart, because that’s exactly when you’ll need SSH.

So what is the other reason we didn’t choose the PKI way? Well, this would have limited the autonomy of the team leads: only the CA would be able to issue or revoke certificates, but we want this power in the hands of our team leads. With the PKI way, if we wanted to give some power to them, we would have had to implement complex logic around the CA to make this possible, and we didn’t want to go down this route.

Enter the bastion!

To respond to our complex requirements, we have a specialized machine that sits between the admins and the infrastructures – a bastion – whose job it is to handle all the important items above, in addition to the decoupling of the authentication and the authorization phases. We’ll use public key authentication on both sides. Let’s take a moment to see a simple example of a connection workflow using this design:

- An admin wants to connect to a machine named server42.
- He can’t SSH directly from his company laptop to server42, because server42 is firewalled and only allows incoming SSH connections from the company’s bastion clusters.
- The admin starts an SSH session to the bastion instead, using his nominative account on it. His laptop negotiates the SSH session using his private key. This is the authentication phase: the bastion ensures that the admin presenting himself as John Admin is indeed this person, which is possible because the public key of John Admin sits inside his bastion account. We call this the *ingress* connection.
- Once John Admin is authenticated, he asks the bastion to open a connection to the root account on server42.
- The bastion verifies whether John Admin is allowed to access the root account on server42: this is the authorization part. Let’s say for the sake of this example that John Admin is indeed allowed to connect to this server, using his team’s bastion private key (more details about this later).
- The bastion initiates an SSH connection to server42, on John Admin’s behalf, using his team’s bastion private key.
- The firewall of server42 allows incoming SSH connections from the bastion, and the connection is negotiated successfully, as the bastion public key of John Admin’s team is installed on server42’s root account. We call this the *egress* connection.

We now have two established SSH connections: the ingress connection, between John Admin and the bastion, and the egress connection, between the bastion and server42.
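
The workflow above can be sketched as a toy Python model that keeps the authentication and authorization decisions separate. Everything here is hypothetical and simplified for illustration: the class names, the idea of storing accesses as (remote user, host) pairs, and above all the key check, which compares strings where real SSH would verify a signature made with the private key.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """A team on the bastion: one egress key, its members, its reachable targets."""
    egress_key: str
    members: set = field(default_factory=set)
    targets: set = field(default_factory=set)  # (remote_user, host) pairs

@dataclass
class Bastion:
    ingress_keys: dict = field(default_factory=dict)  # account -> public key
    groups: dict = field(default_factory=dict)        # group name -> Group

    def authenticate(self, account, presented_key):
        # Ingress side. Placeholder check: real SSH proves possession of the
        # private key instead of comparing values.
        return self.ingress_keys.get(account) == presented_key

    def authorize(self, account, remote_user, host):
        # Return the egress key of the first group granting this access.
        for group in self.groups.values():
            if account in group.members and (remote_user, host) in group.targets:
                return group.egress_key
        return None

    def connect(self, account, presented_key, remote_user, host):
        if not self.authenticate(account, presented_key):
            return "denied: authentication failed"
        egress_key = self.authorize(account, remote_user, host)
        if egress_key is None:
            return "denied: not authorized"
        # Egress side: the remote server only ever sees the team's key.
        return f"egress to {remote_user}@{host} using {egress_key}"

bastion = Bastion()
bastion.ingress_keys["john"] = "john-laptop-pubkey"
bastion.groups["team-a"] = Group(
    egress_key="team-a-egress-key",
    members={"john"},
    targets={("root", "server42")},
)
result = bastion.connect("john", "john-laptop-pubkey", "root", "server42")
```

Note that nothing about John Admin’s own key is ever needed on server42 in this model: removing him from the group’s members is enough to cut his access.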

Now, some magic happens, and the bastion “plugs” these two connections together, using a pseudo-terminal (a pty) in between. John Admin is now under the impression that he’s directly connected to server42, and can interact with it as if this were the case.
Meanwhile, the bastion can record everything that is typed by John Admin (or, more accurately, everything that is *seen* by John Admin: we won’t record passwords he types on noecho terminals!). This is handled by the ovh-ttyrec program.
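
The idea of recording through a pty can be illustrated with Python’s standard pty module. This is only a sketch of the principle, not how ovh-ttyrec is implemented: the hypothetical record_session function runs a command on a pseudo-terminal and keeps everything that the command displays. It assumes a Unix-like system and leaves out the relaying of keystrokes coming from the ingress side.

```python
import os
import pty

def record_session(argv):
    """Run argv on a pseudo-terminal and return everything it displayed."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: its stdin/stdout/stderr are now the slave side of the pty.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    chunks = []
    while True:
        try:
            data = os.read(master_fd, 1024)
        except OSError:
            break  # Linux reports EIO once the child side is closed
        if not data:
            break  # other systems report a plain end of file
        chunks.append(data)

    os.close(master_fd)
    os.waitpid(pid, 0)  # reap the child
    return b"".join(chunks)

recording = record_session(["echo", "hello from server42"])
```

Because the recording is taken from what the terminal displays, characters typed while echo is disabled (a password prompt, for instance) never appear in the captured stream.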

To be perfectly clear, server42 doesn’t know who John Admin is, and doesn’t need to: we’ve decoupled the authentication and authorization parts. Only the bastion needs to know and authenticate the admin; the remote server only knows and trusts the bastion (or, more accurately, John Admin’s team as it exists on the bastion). This opens up a whole range of possibilities… but more about that in the next post!

This post is the first in a series of posts about the bastion. In the next posts, we’ll dig into the authorization part, namely the personal keys and accesses, the groups, and everything that goes along with those. We will also look at the different roles that exist on the bastion to make it so versatile. We’ll talk about some design choices, and how we want security to be at the centre of these choices – with some gory technical details. The series continues in Part 2: Delegation Dizziness.

Stéphane Lesimple

Head of Security Tools Squad