The Bastion Archives - OVHcloud Blog

The Bastion – Part 4 – A new era

Stéphane Lesimple — Thu, 29 Oct 2020 09:25:56 +0000

This is the last article in the series about The Bastion. In the previous parts, we covered the principles of The Bastion, and talked about how delegation was at the core of the system. Then we explained how security was at the heart of the design principles, in a detailed but hopefully not too long article.

Today, we’re announcing something special. You might have guessed it already, thanks to the (not so) little breadcrumbs trail we left in the previous articles. Without further ado, and because pictures can say a thousand words on their own:

We’re going open-source! We’re very excited to share this news with you, and to mark this new milestone in the lifecycle of The Bastion. We think it’s a perfect reason to bump to the next major version: v3.00.00! Obviously, all previous versions were internal-only.

The code is available on GitHub, and from now on we’re also moving all the non-OVHcloud-specific development there.

The documentation is also available online (as well as offline, as reStructuredText files), and we encourage you to read it. For the most impatient, there is also a Docker image available on Docker Hub if you want to give it a try: the TL;DR section of the README.md on GitHub will get you started.

Many of the more advanced features (such as PIV support, 2FA/MFA support, the notion of realms, the HTTPS proxy, etc.) are not yet fully documented, but all the basics are already there. We will enhance this during the next few weeks/months. A few features are not yet open-sourced either, such as the db plugin we talked about in the previous post. But it’ll make it to the open-source version eventually.

We hope it’ll be of use to the community, as much as it is to us, and we can’t wait to hear from you! The GitHub page is over here.

The Bastion – Part 3 – Security at the core

Stéphane Lesimple — Fri, 23 Oct 2020 15:33:49 +0000

In previous parts, we’ve covered the basic principles of the bastion. We then explained how delegation was at the core of the system. This time, we’ll dig into some governing principles of how The Bastion is written.

In a nutshell, the main purpose of the bastion is to ensure security, auditability and reliability in all cases. To this end, the bastion is engineered in a very specific way, with some principles that must be respected when implementing new features. Today we’re going to zoom in on how one of the functionalities of the bastion has been implemented to ensure in-depth security. There are technical details ahead, so viewer discretion is advised!

The operating system is not just a scheduler

One of the engineering principles of the bastion is to leverage the underlying operating system’s security features, as additional guards on top of the code’s logic itself.

Usually, when developing a program, one doesn’t really need to think about the OS it’ll be running on, because all the business logic goes directly into the code. At its most basic level, the OS’s job is to ensure the program runs on top of the hardware it manages, by abstracting it, along with the other pieces of software that might be sharing this hardware. In other words, most of the time the OS is mainly a scheduler, whose job is to ensure all the programs are running properly, and don’t step on each other’s toes.

To this end, an OS has the notion of user (or “account”), who may be the owner of some running programs and some files on the filesystem, alongside the notion of group (of users), so that e.g. a folder can be written to by several users. We’ll go back to this in a few minutes.

Now, let’s talk about applications. Most of the time, applications needing to handle users have a database with a “users” table, detailing the information about each user. In that case, the application’s code logic handles all the behaviour the program must have with respect to its users. For example, to authenticate a user, it stores a hash of each user password in the database, and checks whether the entered password’s hash matches what is stored in the database. If it does, then it deems the user to be successfully logged in. All this logic is entirely expressed in the code, the operating system plays no role in the process whatsoever.
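As a rough illustration of this application-level approach, here is a minimal Python sketch. The in-memory `users` dictionary stands in for the “users” table, and the salt and PBKDF2 parameters are assumptions made for the example, not something described in this post.

```python
import hashlib
import hmac
import os

# Stands in for the application's "users" table: name -> (salt, password hash).
users = {}


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash from the password (PBKDF2 is an assumption of this sketch)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def register(name: str, password: str) -> None:
    """Store only a salted hash, never the password itself."""
    salt = os.urandom(16)
    users[name] = (salt, hash_password(password, salt))


def authenticate(name: str, password: str) -> bool:
    """Hash the entered password and compare it with the stored hash."""
    if name not in users:
        return False
    salt, stored = users[name]
    return hmac.compare_digest(stored, hash_password(password, salt))
```

Everything here happens inside the application: the OS only ever sees one process, running under one OS user.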

There is then only one operating system user dedicated to the application, regardless of how many users exist in the application’s database. The application will run under this OS user, and all files logically pertaining to different users in the application’s functional view will be owned by this same OS user. It works because the segregation between the functional users is done entirely by the code: even if the application can technically access all its users’ files, it will only allow, through its code logic, access to the proper files for the proper user.

Code has bugs, but it shouldn’t matter

Now, let’s imagine we’re talking about a program – let’s name it MySuperCloudApp – whose job is to store files for its users, so that they can later fetch them from the cloud. Let’s imagine there is a flaw in the code (of course, this never happens), which doesn’t properly escape the user’s requested file name. If, once logged in as my user, I request a download of the file named myfile.txt, the application will allow it because I’m logged in.

But what happens if I request ../somebodyelse/herfile.txt, instead? If the code hasn’t been engineered to detect and filter out this weird request, it’ll just pass the read command to the underlying filesystem, which will allow it because, remember, the application runs under one OS user and all the actual user logic is handled by the application itself. All the application files are owned by the same OS user, so the request seems completely legitimate from an OS standpoint. I’ve just found a way to steal all the other users’ files. This type of flaw is called a path traversal, and is, unfortunately, pretty common.
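The flaw can be sketched in a few lines of Python. The storage root and both function names are hypothetical, chosen only for this example; they do not come from any real application.

```python
import posixpath

STORAGE_ROOT = "/srv/mysupercloudapp"  # hypothetical: one sub-directory per user


def naive_path(user: str, requested: str) -> str:
    """Flawed: trusts the requested file name and simply joins it."""
    return posixpath.normpath(posixpath.join(STORAGE_ROOT, user, requested))


def checked_path(user: str, requested: str) -> str:
    """Refuses any request that resolves outside the user's own directory."""
    user_root = posixpath.join(STORAGE_ROOT, user)
    resolved = posixpath.normpath(posixpath.join(user_root, requested))
    if posixpath.commonpath([user_root, resolved]) != user_root:
        raise PermissionError(f"path traversal attempt: {requested!r}")
    return resolved
```

With the naive version, requesting ../somebodyelse/herfile.txt as user "me" resolves to a file inside somebodyelse's directory. The checked version rejects it, but note that this check is, once again, only code: it does not follow symbolic links, and it is exactly the kind of logic that can be forgotten or get subtly wrong.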

For the bastion, the OS is more than a scheduler: every bastion user is actually mapped to an operating system user underneath. Likewise, every bastion group is mapped to an operating system group underneath. So are all the group roles we’ve talked about in the previous post. This is a strong design choice: we end up with an application that is deeply intertwined with the OS it’s running on, and this comes with some cons. However, for a security asset, which the bastion is, the pros vastly outweigh them.

Had MySuperCloudApp adopted this design, mapping its application users to actual OS users, then the attack we’ve talked about before wouldn’t have worked. Even if the application’s code was flawed, and passed the read request to the OS below, the OS would have denied it, because down at the OS level, ../somebodyelse/herfile.txt is not owned by the same user. This is where the OS comes to the rescue of a flawed portion of code (which still needs to be corrected in all cases, of course!).

To take a more Bastion-y example, if a user belongs to groupA, and tricks the code into thinking he also belongs to groupB (because of a flaw in the bastion’s code logic), then it doesn’t matter too much because the OS will deny this user access to groupB’s keys, as he won’t have permission to read the file at the OS level. So he still won’t be able to access any of groupB’s servers. Technically, this is done by offloading the authentication part to sshd, which is well-known and does it quite well. When this phase succeeds, sshd creates a session under the proper OS user, and starts the bastion code entry point under this session.

We use the OS as an additional safety net in case there is a logic error or a vulnerability in the code: even if the code is tricked into taking bad decisions, the underlying OS will be there to deny the action, hence nullifying the impact.

In other words, all the OS bastion users have the bastion code declared as their system shell (instead of the usual /bin/sh). We’re even going further than that: the code is engineered in such a way that if a user succeeded in getting a real shell on the bastion, i.e. being able to run any command he’d like on the OS itself, completely bypassing all of the bastion code’s logic and checks, then he shouldn’t be able to do much more than what the normal bastion code logic allows him to. That’s another strong design principle, but it helps to drastically reduce the impact of a security vulnerability, should it happen.

Trust no one

For some features to work correctly, the design choices we’ve outlined above imply that the bastion must sometimes create and delete users at the OS level. This can’t be done using unprivileged accounts, hence some parts of the code need to run under elevated privileges.

In The Bastion jargon, those portions of the code are called helpers, and are separated from the other portions of the code, normally running under the OS user corresponding to the functional bastion user who’s running them.

The helpers don’t trust the rest of the bastion code, so they never blindly trust what is passed as input to them, even if theoretically, this input has already been validated by the bastion code launching the helper. Their higher privilege is granted using the sudo command, with a very strict sudoers configuration which ensures that the caller can only run the helpers it’s supposed to run, and with the parameters it’s supposed to be allowed to specify. Once the helper has finished working, it communicates back information to its caller using JSON.
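To make the idea concrete, here is a minimal Python sketch of a helper that re-validates everything it receives and answers in JSON. The real helpers are Perl scripts; the function name, the naming rule for groups and the JSON field names below are all hypothetical.

```python
import ipaddress
import json
import re

GROUP_RE = re.compile(r"[a-zA-Z0-9_-]{1,32}")  # hypothetical naming rule


def helper_add_server(group: str, ip: str, port: str) -> str:
    """Re-validate every argument, then report the outcome as JSON."""
    try:
        if not GROUP_RE.fullmatch(group):
            raise ValueError("invalid group name")
        network = ipaddress.ip_network(ip, strict=False)  # single IP or IP block
        port_number = int(port)
        if not 1 <= port_number <= 65535:
            raise ValueError("invalid port")
    except ValueError as exc:
        return json.dumps({"ok": False, "message": str(exc), "value": None})
    # A real helper would update the group's server list at this point.
    value = {"group": group, "ip": str(network), "port": port_number}
    return json.dumps({"ok": True, "message": None, "value": value})
```

The caller has supposedly validated these values already, but the helper runs with higher privileges, so it checks them again rather than relying on that.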

Let’s take the example of the groupAddServer command. As its name implies, this command is used by a group aclkeeper to add a new server to a bastion group. Let’s say the user guybrush is an aclkeeper of the bastion group island. On the OS level, the OS user guybrush will be a member of the island-aclkeeper system group. One part of the sudoers configuration will say this:

%island-aclkeeper ALL=(island) NOPASSWD: /usr/bin/env perl -T /opt/bastion/bin/helper/osh-groupAddServer --group island *

This line translates to:

all the members of the island-aclkeeper system group (i.e. all the aclkeepers of the island bastion group) can run, as the island system user, the osh-groupAddServer perl script, in tainted mode, but with the command line options forced to start with --group island

The island system user is not mapped to a logical user of the bastion; it is a technical account representing the island bastion group. The file listing the servers of the island bastion group is owned by this system user, and only the aclkeepers, through this sudo rule, can impersonate this system user to add a server to their group. Also note that the Perl taint mode is used here (-T). This is a special mode that instructs Perl to immediately halt execution of the program (here, the helper) if an attempt is made to use a variable influenced (tainted) by the outside environment, without checking for its validity first. This is an additional protection to ensure that an improperly sanitized input can’t make it through the program’s execution flow.
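The sudoers manual describes wildcards in command-line arguments as shell-style patterns matched against the arguments joined into a single string. The Python sketch below approximates that behaviour with fnmatch; it is only an illustration, not sudo’s actual matching code, and the --host option is a hypothetical argument used for the example.

```python
from fnmatch import fnmatchcase

ALLOWED_ARGS = "--group island *"  # argument pattern from the sudoers rule above


def args_allowed(argv: list) -> bool:
    """Approximate the rule: join the arguments, then glob-match the result."""
    return fnmatchcase(" ".join(argv), ALLOWED_ARGS)


# An aclkeeper of island can target island...
print(args_allowed(["--group", "island", "--host", "192.0.2.10"]))
# ...but not another group.
print(args_allowed(["--group", "monkey", "--host", "192.0.2.10"]))
```

The trailing wildcard accepts anything after --group island, which is one more reason why the helper itself still validates every parameter it receives.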

Going down the rabbit hole with minijail

For some plugins, we even went one level deeper. For example, we have a plugin to allow users to connect to a PostgreSQL database, using the classic psql client, but directly from the bastion. The idea is that the password to access the database is known to the bastion, not to the user, so the password can be extremely complex, and change every day if necessary. This is completely transparent to the user, who just connects to the bastion and asks to run the database plugin. This scheme is the same as when using SSH on both sides: as seen in the first post of this series, the ingress connection is between the user and the bastion (SSH), and the egress connection is between the bastion and the remote server. The only difference is that, in this case, the egress connection is not SSH, but SQL.

But how to secure psql so that, when running on the bastion, the user can’t escape from it? The problem is the same with the mysql client. Those programs are engineered to be run from the local computer, where the user can already run any command, so there’s no real reason to add a configuration option to those programs that forbids local execution of arbitrary commands (shell escape). However, on the bastion, we don’t want to allow that. Of course, maintaining a forked version of these SQL clients is a complete no-no, because the time we would allocate to maintaining these forks would be of better use in other projects. Instead, we’ve used a tool named minijail, whose purpose is to make readily available, to any program, the (not so) recent features from the Linux kernel – such as namespaces, capabilities, seccomp, the no_new_privs prctl() flag, etc. We’re not going to detail each and every one of these features, as there’s a lot of material online about them, but rather zoom in on how we’ve used them in the context of The Bastion.

Let’s start with the conclusion: here is how it looks on the bastion system itself, while somebody is using the database plugin:

[Screenshot (screen2.png): process listing on the bastion while a user runs the database plugin]

Don’t Panic yet, let’s go through this line by line.

The first line (PID 16) is the sshd system daemon. Nothing fancy here, this is your usual friendly daemon, listening on port 22 for incoming SSH connections.

The second line (PID 413) is the privileged process specially spawned when guybrush logged in successfully on the server. This is also completely standard SSH behavior: when somebody logs in, two sshd processes are spawned by the daemon, a privileged one, and an unprivileged one. Both are dedicated to handling the user, while the parent (the daemon) continues listening for new connections.

The third line (PID 417) is the corresponding unprivileged sshd process for guybrush. This one is responsible for starting up guybrush’s shell as soon as he’s logged in. Note that from now on, and until further notice, all code is executed under the user’s own (absence of) privileges.

The fourth line (PID 418) is guybrush’s shell. This is where it’s starting to differ from your usual server. In this case, the shell is not /bin/bash or /bin/zsh, but a portion of the code of the Bastion. As explained above, the bastion is declared as the user’s shell, so when somebody logs in, this is what gets executed instead of a more regular POSIX shell. This portion of the code is responsible for parsing the command-line the user specified, and executing the corresponding action, if this action is allowed. In this case, the user passed the -i parameter, which asks the bastion to start in interactive mode. This is a special mode where it’s easier to launch several bastion commands without having to re-authenticate oneself each time. So, this process is listening for commands from the user. Note that, at this stage, the user has already been authenticated by the system – as this is completely delegated to sshd. If the authentication fails, the user’s shell (here, the bastion code) is never executed.

The fifth line (PID 497) is the child of the interactive process, re-executing the user’s shell (osh.pl) with new parameters: --osh db, which will instruct this instance of the shell that the user wants to run the db bastion command.

The sixth line (PID 502) is the current bastion command the user is executing. This is the db plugin, and we can see part of the command line: --name lechuck. This tells the plugin that the user wants to connect to the database named lechuck.

The seventh line (PID 503) is the ttyrec parent process. As explained in the first post of this series, the entire console output of the session is being recorded by the bastion – this process is in charge of doing it.

The eighth line (PID 504) is the ttyrec child process, needed for pseudo-tty support, which in turn is needed for the recording. If you really want to know more about pseudo-ttys, head on to man openpty and/or the ttyrec code itself.

The ninth line (PID 505) is the sudo call to start minijail. This is needed because minijail needs to be root for a proper setup of the jail, before downgrading itself to an unprivileged account.

The tenth line (PID 506) is sudo’s child. This one is in charge of starting the subcommand (minijail in that case).

The eleventh line (PID 507) is the invocation of minijail. The complete command line we’re launching is:

/bin/minijail0 --logging=stderr -u guybrush -g guybrush -n -v --uts -d -P /tmp/chroot-guybrush-psql-wsvhp4 -S /etc/bastion/minijail/db-psql.seccomp -b /lib64 -b /lib -b /usr/lib -b /usr/share -k /home/guybrush/.psql /profile bind 0x10100E rw --set-env HOME=/ --set-env USER=guybrush --set-env LOGNAME=guybrush -- /usr/lib/postgresql/11/bin/psql --pset=pager=off -h dbserver.example.org -p 5432 -U lechuck -- lechuck

Quite a beast. But let’s go through this step by step.

This tells minijail to set up a new UTS namespace (--uts), and to set the no_new_privs flag (-n), so that any process it creates (and those processes’ own children) will never ever be able to be root again, no matter what. Under a no_new_privs process, even having a wildcard sudoers file, or knowing the root password and attempting to use su, is not enough to get back to UID 0. You just can’t.

We also ask minijail to create a new mount namespace (-v) then pivot_root (-P) to a temporary empty directory, /tmp/chroot-guybrush-psql-wsvhp4, so that the whole filesystem becomes completely inaccessible. As we still need to be able to run an SQL client in this environment, we bind-mount a few important directories in this new namespace, such as /lib64, /lib and such, and also just one directory in read-write, located in the user’s own home directory, so that from inside this jail, it can still have its .psql_history and .psqlrc files from past sessions.

We also set a few environment variables, so that the SQL CLI is not lost (HOME, USER, LOGNAME), then set up a seccomp policy on top of all that, to limit which syscalls can be made from this environment. For example, the execve() syscall is forbidden: the SQL CLI cannot create any other process, or it’ll get terminated. Last but not least, when all of this has been set up by minijail, it drops its privileges to the guybrush user (-u) and guybrush group (-g), before executing the psql binary.

The twelfth line (PID 508) is the psql process itself, running inside the jail we’ve built above. This way, it is extremely difficult to escape the psql binary and get out of the jail. The whole setup instantly disappears when the user disconnects. The only remains will be his .psql_history and .psqlrc files. Of course, the ttyrec session record of his SQL usage will remain, too (as executed outside of the jail).
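To recap which flag does what, here is a Python sketch that assembles a subset of the minijail0 command line shown earlier. The function name and its parameters are hypothetical; only flags that appear in the command above are used, and the logging, -d, -k and --set-env parts are left out to keep the sketch short.

```python
def build_minijail_argv(user, chroot_dir, seccomp_policy, client_argv):
    """Assemble a minijail0 command line from the flags discussed above."""
    argv = [
        "/bin/minijail0",
        "-u", user,            # drop privileges to this user before running the client
        "-g", user,            # ... and to this group
        "-n",                  # set the no_new_privs flag
        "-v",                  # create a new mount namespace
        "--uts",               # namespace flag, as in the command above
        "-P", chroot_dir,      # pivot_root into the temporary directory
        "-S", seccomp_policy,  # seccomp policy limiting the allowed syscalls
    ]
    for directory in ("/lib64", "/lib", "/usr/lib", "/usr/share"):
        argv += ["-b", directory]  # bind-mount what the SQL client needs
    # Everything after "--" is the program to run inside the jail.
    return argv + ["--"] + list(client_argv)
```

Passing the result as a list (rather than a single shell string) to the process launcher avoids any shell interpretation of the arguments.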

This concludes the post, where we’ve been detailing how some design principles help in delivering a resilient and secure system. Next week, in the final post of this series, we’ll be announcing something special. Stay tuned!

The OVHcloud SSH Bastion – Part 2: delegation dizziness

Stéphane Lesimple — Fri, 11 Sep 2020 15:05:44 +0000

This is the second part of a blog series, here is part one. We’ve previously found that the bastion is not your usual SSH jumphost (in fact, we found it is not a jumphost at all) and we discussed how delegation was one of the core features we’d originally needed. So, let’s dive into these concepts. There are two compatible access models on the bastion: personal and group-based.

Personal Accesses – Piece of Cake

On the bastion, each account has (at least) one set of personal egress keys. These beasts are generated when the account is first created. The personal egress private key sits in the bastion account home. The account user has no way to see it, or export it out of the bastion, but they can use it through the bastion’s code logic. The user can retrieve the corresponding public key at any time, and install it – or get it installed – on the remote servers he needs to access. Depending on your use case – and the level of autonomy you want to give to the teams – there are two ways of managing these personal accesses.

Help yourself

The first way mimics how you would manage accesses if you weren’t using an SSH bastion at all. This is a perfectly valid way to handle accesses on a simple level, without too many users and a limited number of machines. This allows anyone to grant themselves personal accesses on the bastion, without having to ask anyone else to do it. It sounds like a security hole, but it’s not. If someone adds a personal access to the remote server for themselves, it will only work if their personal egress public key has already been installed on the remote server. In other words, they either already had access to the remote server to do this – using means other than the bastion – or somebody who had access to the remote server accepted the addition of their key. Either way, they cannot magically grant themselves personal access without the admins of the remote server first permitting their key.

Ask the IT crowd

Another way to handle this can be to grant a limited number of people, such as security teams, the right to add personal accesses to others. This way people are less autonomous, but it might be useful if adding accesses has to be enacted via standardized processes. It also has some nice effects: as a sysadmin, one of the pros is that you can create 3 separate accounts on the remote machine, and map them to each bastion account you’re adding. This is a good method for achieving end-to-end traceability, including on the remote server, where you might want to install auditd or similar tools. It’s also doable in the help yourself mode, but it may be harder to enforce.

To be clear, this access model doesn’t scale so efficiently when we’re dealing with whole teams, or big infrastructures – this is where group-based access comes in handy.

Group Accesses – Let’s Rock

A group has three components:

A list of members (accounts, representing individual people)
At least one set of group egress keys
A list of servers (actually IPs)

Servers list

The servers list is actually a list of IPs, or IP blocks. They map to your servers, network devices, or anything else with SSH capability that has an IP (on which the egress group key has been installed). Technically, this list is actually composed of 3-tuple items: remote user, remote IP (or IP block), remote port. What applies to personal accesses also applies here: adding a server to the list doesn’t magically give access to it, it is first necessary to install the egress group public key. Of course, managing the installation of these keys manually quickly becomes impractical, but you can consider these part of the configuration of the servers, hence they should be managed with whichever centralized configuration system you already use (Puppet, Chef, Ansible, /bin/cp… wait, no, strike this last one).
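The 3-tuple structure can be sketched in Python as follows. The type and function names are hypothetical, and the exact matching rules (user and port compared for equality, IP checked against the block) are an assumption of this sketch rather than a description of the bastion’s code.

```python
import ipaddress
from typing import NamedTuple


class ServerEntry(NamedTuple):
    remote_user: str
    remote_network: str  # a single IP or an IP block, e.g. "192.0.2.0/24"
    remote_port: int


def is_listed(servers, user: str, ip: str, port: int) -> bool:
    """True if (user, ip, port) is covered by at least one entry of the list."""
    address = ipaddress.ip_address(ip)
    return any(
        entry.remote_user == user
        and entry.remote_port == port
        and address in ipaddress.ip_network(entry.remote_network)
        for entry in servers
    )


island_servers = [
    ServerEntry("root", "192.0.2.0/24", 22),     # a whole IP block
    ServerEntry("admin", "198.51.100.7", 2222),  # a single IP
]
```

Being listed is only half of the story: as explained above, the connection still fails unless the group’s egress public key is installed on the remote side.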

Members list

The members are people who can connect to any server listed in the group server list. They’ll be using the private egress group key they have access to, as members of said group. Of course, they have no way to extract this private key for their own use outside of the bastion, they can only use it through the bastion’s code logic.

Got a new team member? Just add them as a member of your group, and they instantly get access to all the group servers. Somebody leaves the company? Just delete their account on the bastion, and all the accesses are instantly gone. This is the case because all your servers should have incoming SSH sessions limited to your bastions. This way, any rogue SSH key that would have been added is no longer of any use.

And some more

We’ve covered the basics of the group-based approach, but as we need a lot of flexibility and delegation, there is a little more to cover. Remember when I said a group had 3 components? Well, I lied. A group has more than just members. Additional group roles include:

Guests
Gatekeepers
Aclkeepers
Owners

All of these are lists of accounts that have a specific role in the group.

First, guests. These are a bit like members, but with fewer privileges: they can connect to remote machines using the group key, but not to all the machines of the group, only to a subset. This is useful when somebody outside of the team needs a specific access to a specific server, potentially for a limited amount of time (as such accesses can be set to expire).

Then, gatekeepers. Those guys manage the list of members and guests of the group. In other terms, they have the right to give the right to get access. Nothing too complicated here. Then, there are the aclkeepers. As you may have guessed, they manage the list of servers that are part of the group. If you happen to have some automation managing the provisioning of servers of your infrastructure, this role could be granted to a robot account whose sole purpose would be to update the servers list on the bastion, in a completely integrated way with your provisioning. You can even tag such accounts so that they’ll never be able to use SSH through the bastion, even if somebody grants them access by mistake!

Last but not least, the owners have the highest privilege level on the group, which means they can manage the gatekeepers, aclkeepers and owners lists. They are permitted to give the right to give the right to get access. Moreover, users can accumulate these roles, which means some accounts may be a member and a gatekeeper at the same time, for example.
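The delegation chain between group roles can be summarized as a small table. The Python sketch below only encodes what is stated above; the names are hypothetical and the actual bastion may apply additional rules.

```python
# Which lists each group role may modify, as described above.
MANAGES = {
    "owner": {"owner", "gatekeeper", "aclkeeper"},
    "gatekeeper": {"member", "guest"},
    "aclkeeper": {"server"},
    "member": set(),
    "guest": set(),
}


def can_modify(roles, target_list: str) -> bool:
    """An account may hold several roles; one role granting the right is enough."""
    return any(target_list in MANAGES[role] for role in roles)


# A gatekeeper can add members, but cannot touch the servers list.
print(can_modify({"gatekeeper"}, "member"), can_modify({"gatekeeper"}, "server"))
```

Since roles accumulate, an account that is both a member and a gatekeeper simply gets the union of what each role allows.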

Global roles – Come Get Some

Beyond the roles we have just described – which are all scoped to a group – there are two additional roles, which are scoped to the whole bastion: the ‘superowner’ and the ‘bastion admin’.

In a nutshell, a superowner is the implicit owner of all groups present on the bastion. This comes in handy if the group becomes ownerless, as superowners are able to nominate a brand new owner. See where I’m going? Superowners are permitted to give the right to give the right to give the right to get access.

Dizzy yet? Now, for the most powerful role: the bastion admin. This role should only be given to a few individuals, as they can impersonate anyone (even if, of course, when they do, this is logged, and makes our SIEM go red), and in practice should not be given to anyone who is not already root on the bastion’s operating system itself. Among other things, they manage the configuration of the bastion, where the superowners are declared. Hold your breath. Ready? They are permitted to give the right to give the right to give the right to give the right to get access. This is why delegation is at the core of the system: everybody has their own set of responsibilities, and potential action, without having to ask the bastion admin.

Wrapping up

All the access management concepts we’ve talked about are mapped to actual commands. These can be run on the bastion after the user has authenticated himself (the famous ingress connection). They’re called osh commands in bastion jargon. There are no egress connections in this case, as these commands interact with the bastion itself:

As you may notice in the above screenshot, the version of the bastion software seems to be very close to 3.00.00! Perhaps, an interesting milestone is coming up?

In the next part of this blog series, we’ll dig into some implementation details of one of those osh plugins and, more precisely, into our security and defensive programming approach.

The OVHcloud Bastion – Part 1

Stéphane Lesimple — Wed, 03 Jun 2020 13:50:56 +0000

Bastion? Are we talking about the indie game?

Not this time! (Although it’s a good game!)

At OVHcloud, a fair amount of our infrastructure is built on top of Linux boxes. We have a lot of different flavours, such as Debian, Ubuntu, Red Hat… and the list goes on. We even had good old Gentoos once! These all run on bare metal servers, on VMs, and in containers everywhere. As long as it has a CPU (or vCPU), we probably booted some kind of Linux distro on it. But that’s not the whole story. We also had Solaris boxes, that later turned into OmniOS boxes, that have now turned into shiny FreeBSD boxes. We also have a lot of network devices, spread across different vendors, spanning a wide range of model generations.

As you’ve probably guessed, we have heterogeneous systems that are running to provide a handful of different services. But regardless of this heterogeneity, do you know what they all have in common? Yes, they all ping, but there’s something more interesting: they can all be administered through ssh.

The problem

SSH has been the de-facto admin standard for quite some time now – replacing obsolete programs such as rlogin, that were happily transferring your password in plaintext over the network – so we use it all the time, as most of the industry does.

There are two regular ways of using it: either you just type your account password when the remote server asks for it, which is more or less like rlogin (without your password transmitted in plaintext over the wire); or you use public key authentication, by generating a so-called “keypair”, with the private key sitting on your desk (or in a smartcard), and the corresponding public key sitting on a remote server.

The issue is that neither of these two ways is really satisfactory in an enterprise context.

Password authentication

First, the password way. Well, we all already know that passwords suck. Either you pick one that is too easy to crack, or you pick some very complex one that you’ll never remember. This forces you to use a password manager that is protected by… a master password. Even strong passphrases such as “Correct Horse Battery Staple” are nothing more than an elaborate password in the end. They bring a whole range of problems, such as the fact that they’re always subject to bruteforce attacks, and some users might get hit by the password reuse plague. As a sysadmin, you never really sleep well when you know that the security of your systems is just one password away. Of course there are ways to mitigate the risk, such as forcing a periodic password renewal, a minimum password length and/or complexity, or disabling an account after several failures, etc. But you’re just putting additional burden on your users and still not achieving a satisfactory level of security.

Public key authentication

Second, the pubkey way. It goes a long way to fixing password issues, but then the problem becomes scalability. Pushing your public key to your home server is easy, but when you have tens of thousands of servers/devices, as well as thousands of employees, administering some always-changing subset of said servers/devices, it quickly becomes complicated. Indeed, doing it properly and maintaining it in the long-term in an enterprise context is a real challenge.

PKI-based authentication

For the sake of completeness – because I can hear you from here, SSH gurus! – there is a third way in recent versions of SSH servers, namely authentication based on a PKI with a trusted Certificate Authority (CA). You install the public certificate of your CA on all your servers, and they’ll accept any connection authenticated by a certificate delivered by said CA, relying on the subjectName of the certificate. This specifies which account can be accessed on the server, among other things. This is a very centralized way of managing your accesses, with all the power in the hands of whoever controls your CA. It can be highly successful if done very carefully, with a lot of security and processes around the certificate delivery workflows. Managing a CA correctly is no joke and can bite you quite hard if done improperly. This also happens to be a somewhat recent addition to OpenSSH, and given the heterogeneity we outlined above, it would have left a lot of systems on the side. There is also another reason why we haven’t chosen this method, but before diving into it, let’s talk about our needs.

What we needed

At OVHcloud, we have various technical teams that manage their own infrastructure, rather than relying on a generic internal IT department. This principle is part of the company culture and DNA. It does have its drawbacks, such as the added complexity in maintaining an exhaustive and up-to-date inventory of our own assets, but its advantages far outweigh them: multiple teams can iterate faster, as they can use existing OVHcloud products as building blocks to create new, innovative products. This must not, however, come at the cost of security, which is fundamental to everything we do at OVHcloud.

But how did we manage to develop security systems around the SSH management of all these servers without getting in the way of various operational teams?

A few important items are required:

DELEGATION
- Any kind of centralized “security team” responsible for handling access clearances for the whole company is a no-go. It doesn’t scale, no matter how you do it.
- Managers or technical leads should be completely autonomous in managing their own perimeter, both in terms of servers/systems/devices and in terms of the people who are granted access within that perimeter.
- Moving a team member to another team, or out of the company, should be a completely seamless process, regardless of the kind of systems this person had access to (remember the heterogeneity above?).
- Giving access to a new team member must also be seamless, so they can get their hands dirty as fast as possible.
- Temporarily granting access to somebody outside of the team (or company) to a given asset for a limited amount of time should be easy.
- All of these actions should be easy to do autonomously.
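
To make these delegation requirements more concrete, here is a minimal Python sketch of a perimeter whose owner grants and revokes access without any central team being involved, including time-limited grants. Every name in it (`Perimeter`, `grant`, `revoke`, `has_access`) is hypothetical and chosen for illustration only; this is not The Bastion's actual interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Perimeter:
    """A set of assets managed autonomously by one team lead (hypothetical model)."""
    owner: str
    # Maps a person to the expiry of their access; None means no expiry.
    grants: Dict[str, Optional[datetime]] = field(default_factory=dict)

    def grant(self, actor: str, person: str, now: datetime,
              ttl: Optional[timedelta] = None) -> None:
        # Only the perimeter owner may change clearances: no central team involved.
        if actor != self.owner:
            raise PermissionError(f"{actor} does not manage this perimeter")
        self.grants[person] = (now + ttl) if ttl is not None else None

    def revoke(self, actor: str, person: str) -> None:
        if actor != self.owner:
            raise PermissionError(f"{actor} does not manage this perimeter")
        # Revoking someone who has no access is a harmless no-op.
        self.grants.pop(person, None)

    def has_access(self, person: str, now: datetime) -> bool:
        if person not in self.grants:
            return False
        expiry = self.grants[person]
        # A temporary grant silently stops working once its expiry has passed.
        return expiry is None or now < expiry
```

In this sketch, a temporary grant needs no cleanup step: it simply stops matching once the expiry date is reached, which is one way to keep "limited amount of time" accesses easy to hand out.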

AUDITABILITY & TRACEABILITY
- Every action must be logged in detail, whether it’s a clearance modification or a connection to a system, and whether it’s successful or not. We also want these logs to be pushable to a SIEM.
- Every terminal session should be recorded. Yup, you read correctly. This is the kind of feature you don’t ever need... until you do.
- It must be easy to generate reports for conducting access reviews.
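
As an illustration of the logging requirement, the sketch below builds one JSON line per event, a format that log shippers can typically forward to a SIEM. The `audit_event` helper and its field names are assumptions made for this example; they do not describe The Bastion's actual log format.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_event(action: str, account: str, target: str, allowed: bool,
                when: Optional[datetime] = None) -> str:
    """Serialize one auditable action as a single JSON line (hypothetical schema)."""
    when = when or datetime.now(timezone.utc)
    record = {
        "timestamp": when.isoformat(),
        "action": action,      # e.g. "connect" or "grant_access"
        "account": account,    # who performed the action
        "target": target,      # the system or clearance concerned
        "allowed": allowed,    # failures are logged too
    }
    # sort_keys keeps the output stable, which makes lines easier to diff and parse.
    return json.dumps(record, sort_keys=True)
```

Because refused attempts are logged with the same structure as successful ones, an access review can be produced by filtering these lines on `account` or `target`.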

SECURITY & RESILIENCE
- We must bring more security than a bare direct SSH access, with no additional cost.
- Any component that we have to add to answer those needs must be up and running at all times, even (and especially) when the rest of your infrastructure is falling apart, because that’s exactly when you’ll need SSH.

So what is the other reason we didn’t choose the PKI way? Well, this would have limited the autonomy of the team leads: only the CA would be able to deliver or revoke certificates, but we want this power in the hands of our team leads. With the PKI way, if we wanted to give some power to them, we would have had to implement a complex logic around the CA to make this possible, and we didn’t want to go down this route.

Enter the bastion!

To respond to our complex requirements, we have a specialized machine that sits between the admins and the infrastructure – a bastion – whose job is to handle all the important items above, in addition to decoupling the authentication and authorization phases. We’ll use public key authentication on both sides. Let’s take a moment to see a simple example of a connection workflow using this design:

An admin wants to connect to a machine named server42.
He can’t SSH directly from his company laptop to server42 because server42 is firewalled, and only allows incoming SSH connections from the company’s bastion clusters.
The admin starts an SSH session to the bastion instead, using his nominative account on it. His laptop negotiates the SSH session using his private key. This is the authentication phase: the bastion ensures that the admin presenting himself as John Admin is indeed this person, which is possible thanks to the fact that the public key of John Admin sits inside his bastion account. We call this the *ingress* connection.
Once John Admin is authenticated, he asks the bastion to open a connection to the root account on server42.
The bastion verifies whether John Admin is allowed to access the root account on server42. This is the authorization phase. Let’s say for the sake of this example that John Admin is indeed allowed to connect to this server, using his team’s bastion private key (more details about this later).
The bastion initiates an SSH connection to server42, on John Admin’s behalf, using his team’s bastion private key.
The firewall of server42 allows incoming SSH connections from the bastion, and the connection is negotiated successfully because the bastion public key of John Admin’s team is installed on server42’s root account. We call this the *egress* connection.

We now have two established SSH connections: the ingress connection, between John Admin and the bastion, and the egress connection, between the bastion and server42.
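
The workflow above can be summarized in a small Python sketch that keeps the two checks separate. It is a toy model under stated assumptions: the `Bastion` class and its methods are hypothetical, and a plain string comparison stands in for the real SSH public key negotiation, which proves possession of a private key rather than comparing keys.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Bastion:
    """Toy model of the two-phase check; not The Bastion's real code."""
    # Authentication data: account name -> public key registered for ingress.
    ingress_keys: Dict[str, str] = field(default_factory=dict)
    # Authorization data: account name -> allowed (remote_user, host) pairs.
    accesses: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)

    def authenticate(self, account: str, presented_key: str) -> bool:
        # Ingress: is the caller really who they claim to be?
        return self.ingress_keys.get(account) == presented_key

    def authorize(self, account: str, remote_user: str, host: str) -> bool:
        # Is this authenticated account allowed to reach remote_user@host?
        return (remote_user, host) in self.accesses.get(account, set())

    def connect(self, account: str, presented_key: str,
                remote_user: str, host: str) -> str:
        if not self.authenticate(account, presented_key):
            return "denied: authentication failed"
        if not self.authorize(account, remote_user, host):
            return "denied: not authorized"
        # Egress: a real bastion would now open its own SSH connection,
        # using the team's key rather than the admin's personal key.
        return f"egress connection to {remote_user}@{host} on behalf of {account}"
```

Note that the remote host never appears in `authenticate`, and the admin's personal key never appears in the egress step: the only place where both sides meet is the bastion itself.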

Now, some magic happens, and the bastion “plugs” these two connections together, using a pseudo-terminal (a pty) in between. John Admin is now under the impression that he’s directly connected to server42, and can interact with it as if this were the case.
Meanwhile, the bastion can record everything that is typed by John Admin (or, more accurately, everything that is *seen* by John Admin: we won’t record passwords he types on noecho terminals!). This is handled by the ovh-ttyrec program.
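
The difference between recording what is *typed* and what is *seen* can be shown with a short simulation. This is only a sketch of the idea, not how ovh-ttyrec is implemented: `FakeRemoteShell` and `relay` are hypothetical names, and the fake shell merely echoes its input back unless echo is turned off, as a password prompt would do.

```python
from typing import Iterable, List


class FakeRemoteShell:
    """Stand-in for the egress side (hypothetical): echoes input unless echo is off."""

    def __init__(self) -> None:
        self.echo = True

    def feed(self, typed: bytes) -> bytes:
        # With echo on, typed characters come back as output; a password
        # prompt turns echo off, so nothing comes back.
        return typed if self.echo else b""


def relay(typed_chunks: Iterable[bytes], remote: FakeRemoteShell,
          recording: List[bytes]) -> None:
    """Forward input to the remote side and record only what it sends back."""
    for chunk in typed_chunks:
        output = remote.feed(chunk)
        if output:
            recording.append(output)
```

Since only the output direction is appended to `recording`, anything typed while echo is off never reaches the recording, even though it was forwarded to the remote side.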

To be perfectly clear, server42 doesn’t know who John Admin is, and doesn’t need to: we’ve decoupled the authentication and authorization phases. Only the bastion needs to know and authenticate the admin; the remote server only knows and trusts the bastion (or, more accurately, the existence of John Admin’s team on the bastion). This opens up a whole range of possibilities… but more about that in the next post!

This post is the first in a series of posts about the bastion. In the next posts, we’ll dig into the authorization part, namely the personal keys and accesses, the groups, and everything that goes along with those. We will also look at the different roles that exist on the bastion to make it so versatile. We’ll talk about some design choices, and how we want security to be at the centre of these choices – with some gory technical details. Click here to read Part 2: Delegation Dizziness.