DevOps Archives - OVHcloud Blog

F.A.I.R. Principles in Data for AI

Lex Avstreikh — Mon, 30 Sep 2024 22:44:43 +0000

How the FAIR Data Principles apply to Machine Learning Data and Infrastructure

At Hopsworks, the FAIR Guiding Principles for scientific data management and stewardship have been a cornerstone of our approach to building a better machine learning platform. The F.A.I.R. principles initially became prevalent in academia and diverse fields of research, in an effort to make sure that the ever-growing amount of data could remain usable and beneficial to society, and they have since been widely adopted. However, few people mention them in the context of machine learning systems and data management. Yet those principles are even more relevant today in the fast-moving landscape of AI and LLMs, where new legislation is changing the rules of the game.

AI professionals should consider how questions of ethics, data management, and open frameworks may influence their choice of tools and machine learning platforms when implementing modern ML systems. In Hopsworks, we follow the F.A.I.R. principles in the design of a platform for managing machine learning data and infrastructure.

What are the Four Core Concepts of F.A.I.R.?

Findable; referring to mechanisms that make the data easily searchable and findable. Infrastructure, stakeholders, and projects need easy-to-use functionality for data discovery.

Data needs to follow clear naming conventions, be indexed for free-text search and have persistent uniquely identified metadata that clearly and explicitly describe the data.
The design and curation of metadata needs to have good system support.

Accessible; allow access not only to the data but also to its metadata and provenance.

Open, free, and universally implementable protocols allow access to the data itself, its metadata, and its provenance.
Access control support is required when sharing data. Role-based access control is good, but attribute-based access control and/or dynamic role-based access control provide even more fine-grained support for data sharing and reuse.

Interoperable; data should be easily shared between different computer systems. This is achieved by implementing open standards and formats for data.

Open and accessible file formats and transport protocols for accessing the data.

Reusable; data produced by one system should be easy to reuse in downstream systems, without copying the data. In order to reuse data, it’s important to include metadata related to the data licenses, provenance, community standards, and custom metadata that allow other institutions, teams, or groups to reuse the data.

Versioning, cataloging, provenance/lineage, data integrity, and custom metadata make it easier for users of data to decide on whether they can use the shared data.

Why F.A.I.R. is challenging for AI platforms and ML Systems

Some of the FAIR principles are directly applicable in the context of machine learning systems: there are lots of open source frameworks, file systems, and programming languages that are used for the operation of AI products and services. Still, some very serious challenges do emerge that are specifically due to the way any ML System needs to operate.

Findable; while strategies that apply metadata and clear nomenclature can be applied in the context of operational machine learning systems, practitioners will find it challenging to create a clear centralized logic across the different data sources and databases needed to operate such services. A modern ML system might need to be connected to multiple sources, some of which may be real-time, or to vector databases for large language models. Structuring the assets and the metadata clearly becomes a complex endeavor without a centralized solution capable of catering to the different scenarios.

Interoperable & Accessible; when open frameworks and open file formats are used, the core challenges of accessibility and interoperability should be easier to resolve. It therefore becomes important to favor open standards and compute engines and to avoid DSLs. One additional challenge, which stems from the very nature of the underlying data, is making it accessible for auditing (for example: what data was the model in production last year trained on?), review, and debugging while the system continuously updates and appends data.

Reusability; finally, a fundamental characteristic of machine learning models is that some of them require the data processing to be directly tied to the model that will be trained; we call these model-dependent transformations. Once applied, such transformations compromise the integrity of the data, and the underlying datasets can’t be reused in a different scenario. Not only does this prevent the reuse of the data itself, it also makes the data harder for a human to understand. This significantly limits the ability of any organization to reuse its data in different models, leading to duplication and the creation of monolithic pipelines that are notoriously harder to scale.
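One way to picture the difference is to store only the model-independent feature and to apply the model-dependent transformation when a training set is built. The following is a minimal Python sketch under that assumption; the feature values and the function name are hypothetical and are not part of the Hopsworks API.

```python
from statistics import mean, pstdev

# Model-independent feature: stored once, reusable by any model.
# (Hypothetical values, for illustration only.)
stored_amounts = [12.0, 48.0, 30.0, 90.0]


def standardize(values):
    """Model-dependent transformation: scale to zero mean and unit variance.

    It is applied when a training set is built, so the stored feature keeps
    its original, human-readable values.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Each model derives its own view; the stored data is never modified.
training_view = standardize(stored_amounts)
```

Because the stored values stay untouched, a second model that needs a different scaling can start from the same feature.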

Making Data for AI F.A.I.R.

Use case of Hopsworks with the Human Exposome Assessment Platform

At Hopsworks, we have a strong heritage of working with academia and research, participating in projects such as HEAP (Human Exposome Assessment Platform), which manages personal data from numerous medical institutes across the world. We have always been mindful of the evident privacy and security concerns, and of the need for efficiency in managing data following the FAIR principles. When approaching such a project, we consider those principles as a blueprint for refining our own software:

Using open frameworks,
Using open languages,
Modular technologies,
Reusable file formats.

Additionally, we strive to build strong abstractions and APIs that give users and organizations a better understanding of the models they are building and more flexibility in reusing their data pipelines. Those are core aspects of the Hopsworks platform, which we believe all state-of-the-art ML platforms should follow to be within the FAIR framework.

FAIR principles in practice at Hopsworks

5 ground rules to secure your storage

Charlotte Letamendia — Fri, 21 Apr 2023 16:39:01 +0000

My data is an asset. Let’s share the best practices to protect your data.

If you feel that security is a constraint, it’s time to think again! In this blog post, I will share with you 5 simple rules that can be easily implemented to secure your backups without a headache, thanks to the “Objects Storage Standard-S3 API” class of storage.

I am a DevOps or DevSecOps engineer. I am developing on my platform and want to stay focused on my business, where I add value. That is why I manage the deployment and scaling of my infrastructure with code, and delegate the management of that infrastructure to my cloud provider.

While developing my business, the volume of my data grows exponentially, so my data has value too!

As my business grows, I collect more data and keep a historical set year after year. I am even deploying in new locations around the world!

All this data (applications, user data, logs, media, analytics, reporting) is stored and backed up in object storage for flexibility, metadata search, and easy scaling. My data represents a great asset in my hands. Data drives my business and I want to protect it.

Data is an important asset that requires good governance!

Please don’t think, “cool I have copied my data to a secondary bucket, my backup is complete, I am safe.” Nope, this is not OK!

What S3 Object Storage doesn’t protect you from is yourself. Let’s take a look together at the 3 types of risks we need to protect ourselves from.

(1) The number one factor for data loss is human error, accidental deletion, or the overwriting of an object with garbage data. This is a scenario that you want to avoid.

(2) The second category relates to unpredictable events: software issues, hardware issues (drive failure), datacenter downtime, or natural/manmade disaster.

(3) The third category is the stuff that causes security experts to lose sleep at night: malicious actions such as malware, ransomware and viruses, acts of sabotage, DDoS…

Security is important and non-negotiable, let’s take a look at 5 easy rules to protect against these risks.

…and continue to work in all serenity!

Rule n° 1 – versioning

Versioning helps to protect against accidental overwriting. You can restore a previous version after an accidental deletion, or retrieve a specific version in the event of data corruption.
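To illustrate why versioning protects against both overwrites and deletions, here is a toy in-memory model written in Python. It is only a sketch of the idea: the class and method names are hypothetical and this is not the S3 API.

```python
class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not the S3 API)."""

    def __init__(self):
        # key -> list of bodies; None marks a deletion
        self._versions = {}

    def put(self, key, body):
        # An overwrite appends a new version instead of replacing the old one.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A delete only adds a marker; earlier versions are kept.
        self._versions.setdefault(key, []).append(None)

    def get(self, key, version=-1):
        body = self._versions[key][version]
        if body is None:
            raise KeyError(f"{key} is deleted in this version")
        return body


bucket = VersionedBucket()
bucket.put("report.csv", b"good data")
bucket.put("report.csv", b"garbage")  # accidental overwrite
bucket.delete("report.csv")           # accidental deletion
restored = bucket.get("report.csv", version=0)
```

Reading the latest version fails after the deletion, but the first version is still there to restore.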

Rule n° 2 – immutability

When your primary storage systems must be open and available, your backup data should be isolated and immutable.

Implement the Write Once, Read Many (WORM) model using the S3 Object Lock API.

You can define different parameters according to your needs, business, and type of data:

retention periods
legal hold
governance mode
compliance mode

This rule helps your organization to meet compliance requirements. To keep logs for a legally defined period, the “compliance mode” will help you set the duration needed. It’s quite handy because logs are generated every second, so it can be difficult to keep track. With compliance mode, you don’t need to worry about this anymore: you can set a period of 1, 3, or 5 years and the logs will remain protected throughout the designated period.
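As a sketch of what such a default retention could look like, the Python helper below builds a configuration dictionary in the shape of the S3 PutObjectLockConfiguration request. The helper name and the chosen values are assumptions made for illustration; check the Object Lock guide of your provider before relying on it.

```python
def default_retention(mode, years):
    """Build a default-retention configuration for a bucket.

    The dictionary follows the shape of the S3 PutObjectLockConfiguration
    request. `mode` is "GOVERNANCE" or "COMPLIANCE".
    """
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown retention mode: {mode}")
    if years < 1:
        raise ValueError("retention must be at least 1 year")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Years": years}},
    }


# Hypothetical choice: keep logs locked for 3 years in compliance mode.
log_retention = default_retention("COMPLIANCE", 3)
```

With boto3, for example, this dictionary could be passed as the `ObjectLockConfiguration` argument of `put_object_lock_configuration`, assuming a client already configured for your endpoint and a bucket created with Object Lock enabled.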

Rule n° 3 – data replication off-site

To be protected against hardware failures or geographical events, follow the well-known 3-2-1 model: keep at least three copies of your data, on two different types of storage, with one copy off-site.

Before setting up your copies of data, you need to evaluate the RPO and RTO of your target:

RPO (recovery point objective) = in case of geographical failure, how old the most recent restorable snapshot of your data may be, that is, how much recent data you can afford to lose
RTO (recovery time objective) = in case of geographical failure, the acceptable time to recover your data

Of course, everybody wants a 0-second recovery, but is it necessary?

Such a recovery plan requires costly resources to maintain. A good approach is to sort your data by level of criticality, fine-tune the plan for each category, and put backup retention policies in place.

Type	Backup policy
Non-business-critical	Weekly
Business-critical	Every day for 1 month, then monthly for 1 year
Archive	> 1 year
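The business-critical policy from the table can be sketched as a small retention check in Python. Two assumptions are made here that the policy itself does not state: "1 month" is approximated as 30 days, and the monthly copy is the backup taken on the first day of each month.

```python
from datetime import date, timedelta


def keep_backup(backup_day, today):
    """Decide whether a business-critical backup is still retained.

    Assumptions: daily copies are kept for 30 days, then only the copy
    taken on the 1st of each month is kept, for up to 365 days.
    """
    age = (today - backup_day).days
    if age < 0:
        raise ValueError("backup is dated in the future")
    if age <= 30:
        return True
    if age <= 365:
        return backup_day.day == 1
    return False


today = date(2023, 4, 21)
candidates = [today - timedelta(days=n) for n in (0, 10, 45, 51, 400)]
kept = [d for d in candidates if keep_backup(d, today)]
```

Here the 45-day-old backup (7 March) is dropped, the 51-day-old one (1 March) is kept as the monthly copy, and the 400-day-old one falls outside the one-year window.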

Rule n° 4 – encryption

When your data is at rest, you can encrypt it with your own key. We use a feature based on the AES-256 algorithm.

Encrypt your data using your own keys and AES-256-based encryption.

Note that the data in transit is encrypted thanks to the TLS protocol.
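Encryption with your own key corresponds to what the S3 API calls SSE-C (server-side encryption with customer-provided keys). As a sketch, the Python snippet below generates a 256-bit key and derives the three SSE-C request headers defined by the S3 API; the function name is hypothetical, and storing the key safely is left to you, since losing it makes the objects unreadable.

```python
import base64
import hashlib
import os


def sse_c_headers(key):
    """Return the SSE-C request headers for a 32-byte (AES-256) key."""
    if len(key) != 32:
        raise ValueError("AES-256 needs a 32-byte key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded.
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        # MD5 digest of the raw key, base64-encoded, used as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode("ascii"),
    }


key = os.urandom(32)  # random 256-bit key; keep it somewhere safe
headers = sse_c_headers(key)
```

The same headers must be sent again on every read of the object, which is why these requests should only travel over TLS.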

Rule n° 5 – user policies

Grant only the permissions that are required to perform a task, using:

User policy
Bucket policy (soon)
Bucket ACL

Extract your S3 policies every month and check them; it will not take much time and can be automated. Verify that you know all users and that the rights are adapted to each profile. Never let a wildcard * grant everyone access to a sensitive bucket or object.
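The monthly check can be partly automated. The Python sketch below scans a policy document in the usual S3 JSON form (`Statement`, `Effect`, `Action`, `Resource`) and flags "Allow" statements that rely on a wildcard. The function names, the sample policy, and the simplified wildcard test are assumptions made for illustration, not a complete audit.

```python
import json


def _is_wildcard(value):
    # Simplified test: everything, every S3 action, or every bucket.
    return value in ("*", "s3:*") or value.endswith(":::*")


def risky_statements(policy_json):
    """Return the "Allow" statements whose action or resource is a wildcard."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    risky = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        values = []
        for field in ("Action", "Resource"):
            entry = statement.get(field, [])
            values.extend([entry] if isinstance(entry, str) else entry)
        if any(_is_wildcard(v) for v in values):
            risky.append(statement)
    return risky


# Hypothetical policy with one scoped and one overly broad statement.
sample_policy = json.dumps({
    "Statement": [
        {"Sid": "ReadReports", "Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": ["arn:aws:s3:::reports/*"]},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": ["s3:*"],
         "Resource": ["arn:aws:s3:::*"]},
    ]
})
flagged = [s["Sid"] for s in risky_statements(sample_policy)]
```

A prefix wildcard inside a single bucket, as in the first statement, is not flagged by this simplified test; whether that is acceptable depends on how sensitive the bucket is.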

Be uncompromising in the implementation of security compliance.

If you are comfortable with these 5 rules you can rest assured. As for all security rules, a regular check-up/training is always useful!

Bonus rule – Traceability

When you are audited, or when you want to audit your architecture or your cloud provider, it is important to have all the elements you need.

The S3 logging feature will help you provide the traceability needed to know who accessed data, when, and how.

Thanks to our API, you can set up some triggers in order to be alerted in case of bad or simply abnormal behavior.

Want to know more about data protection? More blog posts are coming soon!

Meanwhile, feel free to consult our guides that will assist you in your data security implementation.

And discover OVHcloud Object Storage services with S3 API https://www.ovhcloud.com/en-ie/public-cloud/object-storage/

Who are you?	Guides
Data protection for SysAdmin	VMware user ? https://docs.ovh.com/ie/en/storage/object-storage/s3/veeam/ Nutanix user ? https://docs.ovh.com/ie/en/nutanix/hycu-backup/ https://docs.ovh.com/ie/en/nutanix/nutanix-veeam-backup/
Data protection with infrastructure as code	Kubernetes user ? https://docs.ovh.com/fr/kubernetes/backing-up-cluster-with-velero/ https://docs.ovh.com/ie/en/kubernetes/backup-and-restore-cluster-namespace-and-applications-with-trilio/#how-triliovault-for-kubernetes-works
Data protection for Developers	S3 API user ? https://docs.ovh.com/ie/en/storage/object-storage/s3/managing-object-lock/ https://docs.ovh.com/ie/en/storage/object-storage/s3/rclone/ https://docs.ovh.com/ie/en/storage/object-storage/s3/encrypt-your-objects-with-sse-c/ https://docs.ovh.com/ie/en/storage/object-storage/s3/identity-and-access-management/ https://docs.ovh.com/fr/storage/object-storage/s3/bucket-acl/ https://docs.ovh.com/fr/storage/object-storage/s3/server-access-logging/

Modernize your application deployment – Part 1

Jean-Daniel Bonnetot — Wed, 16 Feb 2022 16:47:45 +0000

The good old time

Many years ago I was a SysAdmin. Do you remember this old job? Let me remind you of a few recurring scenarios:

ssh someone@somewhere
apt-get install something
vi /etc/something
/etc/init.d/something restart

ssh someone@somewhere
fdisk /dev/sdb
mkfs.ext4 /dev/sdb1
vi /etc/fstab
mount /mountpoint

ssh someone@somewhere
modprobe ipt_MASQUERADE
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat ...

Sure, it was a bit more complicated than a few command lines, but I think you know what I mean. Oh, and does this one sound familiar to many of you?

ssh someone@somewhere
apt-get install pacemaker
#do many complicated stuffs
vi /etc/corosync/corosync.conf
#do other complicated requirements
crm configure property stonith...
crm configure property quorum...

I spent my time deploying and maintaining systems, building resilient architectures, and debugging or fixing failing servers. It was the good old days!

A few years ago I switched to more customer facing jobs, like product marketing management and technical evangelism, where I used my knowledge to promote products with a good technical approach.

But now guess what? Things have changed. No seriously, when I have to put my fingers on a terminal, a lot of things are different. It’s not just that /etc/init.d has been replaced by systemctl and that no one is using screen anymore; the whole way of thinking about deployments and resilient architectures is totally different.

Of course I’m not totally out of it, as I used to play around with OpenStack and understand microservices architectures on paper. But it takes a bit more practice to be comfortable speaking with real customers about real applications, deployed with a scalable architecture or in a cloud-native way on Kubernetes.

Sounds like a new adventure begins

I therefore suggest that you follow me on this journey: moving from a standalone, rock-solid deployment to a scalable architecture using the new primitives. I know the hype is around Kubernetes, but I also know a lot of people are not ready to go in that direction, because the knowledge base is not that easy to master and the step is a high one, as is the case for me.

I’m going to share what I discover along the way, and I’ll try to keep it as simple as possible with an educational approach.

On this journey, I will be taking small steps, one step at a time, and this post is the first describing the basics and theory for building a scalable architecture.

Leaving pets village

For those unfamiliar with the pets vs cattle analogy, let’s say we’re in a village. Each villager has a handful of animals, not many of them. And each animal requires a lot of time from its owner to feed it, take care of it, play with it and train it. They are pets, almost members of the family. And of course, our villagers invest affection and more in each animal. Therefore, losing a pet is really critical, as it is unique and cannot be replaced.

Coming back to IT, in the past we deployed applications and servers with the same idea. We spent time on installation and maintenance. One server had very little in common with the others, and for critical services we invested a great deal of time in building HA (high availability) architectures with voting solutions like quorum devices, fencing tools like STONITH (Shoot The Other Node In The Head) drivers, or dormant resources with active/passive components.

We had to address at least those three questions:

How can I scale my architecture when I need more power?
Usually the answer was “add more RAM” or “change to a better CPU”. Even if your infrastructure is virtualized, this approach has some limitations.
What should I do when something goes wrong?
This called for a debugging session, with more or less urgency depending on your pet’s notoriety.
What if… (no words, it’s too hard) what if the worst happens?
And yes, you know that, like all of us… “shit happens”. This is where we used some corosync stuff and other cool tools.

Overall, you might have a smirk on Monday morning if you check the event logs and find that a big infrastructure failure popped up on Saturday night. You were in the middle of the 16th episode of the 5th season of Lost (yes, we’re back in 2009), Locke had just asked Ben to kill Jacob, and you completely missed the alert email asking you to leave your sofa and fix the infra. But this Monday morning, you discovered with joy and happiness that your fantastic distributed system had worked as intended and put the system in degraded mode. And you were going to celebrate it with a double coffee, because now you would have to go from degraded mode back to normal mode and debug the failure.

This is the pet village, the one that many of us know quite well. But as I said, times have changed and we are leaving this place.

Destination: the cattle land

The country we are going to is full of “collections of fabrics” or “groups of things”. There’s a big shed, a huge pasture, and you feed your cattle with tools like trucks and grain silos. In a sense, you are managing the herd, not each head of cattle. I don’t want to sound like someone who doesn’t care about animal welfare, but in a herd, one cow is much like another (sorry vegetarians). If a cow is missing, you can easily replace it. You see what I mean?

It’s the same for us, IT people. In this country, no component is unique: each one shares a common configuration with the others in its group. The configuration has been industrialized, like a one-click deployment, and any component can be replaced with a command line, bringing us to a situation where losing a component is not a problem anymore.

And if we answer the previous questions again, it looks like this:

How can I scale my architecture when I need more power?
Easy, friend, you just need to add another component in the group.
What should I do when something goes wrong?
You could call the vet… but here I know an easier solution. You will say that I am cruel and I will say that I am talking about machines and software, not animals. What did you have in mind…?
What if the worst happens?
Here again, we’ll have everything to replace it easily.

On the road to cattle land

So the question is: how to do it? How do you move from a standalone deployment to a scalable architecture?

You might think you’ll have to rewrite everything and drop your app to rewrite a new one, but we’ve said we will be doing things step by step… So I would identify three main actions for our first step. These actions need to be addressed to move into cattle land:

Identify the stateless and stateful components
Move stateful components to managed services
Industrialize stateless components

At this point, we may need some clarification on what a stateless component is: it’s a component that doesn’t store any data or status locally and shares nothing with its peers. All data that needs to be persisted should be stored in a stateful storage service, usually a database. In other words, you can lose/kill/remove/destroy (strike out the unnecessary ones) any of the stateless components without impacting the application, because you won’t lose any data.
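The definition above can be sketched in a few lines of Python. All names here are hypothetical: the in-memory store stands in for a managed database, and the worker is a stateless component that keeps no data of its own.

```python
class InMemoryStore:
    """Stand-in for a managed database (hypothetical, for illustration)."""

    def __init__(self):
        self._posts = {}

    def save(self, post_id, text):
        self._posts[post_id] = text

    def load(self, post_id):
        return self._posts[post_id]


class BlogWorker:
    """Stateless component: every piece of data goes to the store."""

    def __init__(self, store):
        self._store = store

    def publish(self, post_id, text):
        self._store.save(post_id, text.strip())

    def read(self, post_id):
        return self._store.load(post_id)


store = InMemoryStore()
worker_a = BlogWorker(store)
worker_a.publish("hello", "  First post  ")
del worker_a                     # lose/kill/remove/destroy the worker

worker_b = BlogWorker(store)     # a replacement picks up where it left off
recovered = worker_b.read("hello")
```

Destroying the first worker loses nothing, because the only state lives in the store.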

And the stateful components will be delegated to your cloud provider, like OVHcloud. The provider is then responsible for managing high availability for you; managed services usually offer this option.

Let’s try to explain it with some drawings. Let’s start with an application deployed in the pets village, a basic blog:

The large box represents our standalone server. Highlighted in yellow are the stateful components, which are the database and the files (images). This is the data of which we don’t want to lose a single byte. The rest of the application consists of stateless components.

Now see the target we are speaking about:

We’ll move our stateful components to managed services at OVHcloud. The MongoDB database will be hosted by the Managed Databases for MongoDB service, and the images will be pushed to the Object Storage service, which provides an S3 API.

Now that we have dealt with the data, we can work on building our stateless components in cattle mode. The road to the cattle land is marked out and we can move forward. Let’s see how to do that in the next post, with a practical approach and some real demos.

Erlenmeyer and PromQL compatibility

Aurélien Hébert — Wed, 16 Dec 2020 11:03:27 +0000

Today in the monitoring world, we see the rise of the Prometheus tool. It’s a great tool to deploy in your infrastructure, as it allows you to scrape all of your servers or applications to retrieve, store, and analyze their metrics. All you have to do is extract and run it; it does all the work by itself. Of course, Prometheus comes with some trade-offs (a pull model, the handling of late ingestion) and some limits, as you keep your data for only a couple of days.

Context

How is it possible to handle long-term storage for Prometheus? A vast number of time series databases are now fully compatible with Prometheus. It’s easy to check that Prometheus ingestion is working well, but how can we validate the PromQL (Prometheus query) part? A few months ago, PromLabs released a new tool called the “PromQL compliance tester”. They recently created a page that lists the PromQL compliance test results of several products. In this blog post, we will see how this tool helped us improve our PromQL implementation.

Compliance tester

The PromQL compliance tester is open source and contains a full set of tests. It generates around 500 PromQL queries covering the vast majority of the language, including tests on simple scalars, selectors, time range functions, operators, and so on. The tool executes each query on both a Prometheus instance and the tested backend, and expects the backend to return the same result as Prometheus. It expects an exact match for all metadata of a series (tags and names). It is more flexible for the ticks, as you can set a parameter to round the comparison to the millisecond. Finally, the tool checks that the values returned by both queries are equal; because many things can affect floating-point results, it uses an approximate equality.
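The comparison described above can be sketched in Python as follows. This is an illustration, not the tester’s own code: `math.isclose` with a relative tolerance stands in for its approximate equality, timestamps are assumed to be already rounded to integer milliseconds, and the sample values are hypothetical.

```python
import math


def samples_match(expected, actual, tolerance):
    """Compare two lists of (timestamp_ms, value) samples.

    Timestamps must match exactly; values only need to be equal within
    the given relative tolerance.
    """
    if len(expected) != len(actual):
        return False
    return all(
        t_exp == t_act and math.isclose(v_exp, v_act, rel_tol=tolerance)
        for (t_exp, v_exp), (t_act, v_act) in zip(expected, actual)
    )


# Hypothetical samples: the same tick, with a tiny floating-point difference.
reference = [(1606323726058, 2.6928936527e10)]
candidate = [(1606323726058, 2.6928936530e10)]

close_enough = samples_match(reference, candidate, tolerance=0.00001)
too_far = samples_match([(0, 1.0)], [(0, 1.1)], tolerance=0.001)
```

A stricter tolerance catches more real divergences, at the price of failing on harmless rounding differences between two query engines.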

Erlenmeyer

At Metrics, we use a Warp10 TSDB with its own analytical query engine, WarpScript. We decided to build an open source tool called Erlenmeyer to transpile PromQL queries into WarpScript. The compliance tester was a great help in validating parts of our implementation and detecting which queries were not fully equivalent to native PromQL.

Set up

To start testing our PromQL experience, we set up a local Prometheus with a default configuration. This configuration makes Prometheus run and collect some “Demo” metrics, which we forwarded to one of our Metrics regions using Prometheus remote write. We added a local instance of Erlenmeyer to query the data stored in a distributed Warp10 backend. Then, we iterated on each set of tests of the PromLabs compliance tool to identify all issues and improve our existing PromQL implementation.

To be compliant, we had to relax the precision that the compliance tool applies to values: we set it to 0.001 instead of 0.00001. We also had to remove the Warp10 .app label from the results, because on a Warp10 instance we identify users based on this label.

A test query

When running the test, you will get a full report of your failing queries. Let’s take an example:

RESULT: FAILED: Query returned different results:
  model.Matrix{
  	&{
  		Metric: Inverse(DropResultLabels, s`{instance="demo.promlabs.com:10002", job="demo"}`),
  		Values: []model.SamplePair{
  			... // 52 identical elements
  			{Timestamp: s"1606323726.058", Value: Inverse(TranslateFloat64, float64(2.6928936527e+10))},
  			{Timestamp: s"1606323736.058", Value: Inverse(TranslateFloat64, float64(2.691644054725e+10))},
  			{
  				Timestamp: s"1606323746.058",
- 				Value:     Inverse(TranslateFloat64, float64(2.6922272529119648e+10)),
+ 				Value:     Inverse(TranslateFloat64, float64(2.689432207325e+10)),
  			},
  			{Timestamp: s"1606323756.058", Value: Inverse(TranslateFloat64, float64(2.6915188293125e+10))},
  			{Timestamp: s"1606323766.058", Value: Inverse(TranslateFloat64, float64(2.69215848005e+10))},
  			... // 4 identical elements
  		},
  	},
  }

The test report includes all errors that occurred during the test. In this example, we can see that for a single series 60 values are correct (56 of them are collapsed in the report as identical elements) and one is invalid, which is shown on two lines. The line starting with “-” is the expected value, and the line starting with “+” is the value returned by the tested instance. In this case, the value isn’t precise enough (2.68 instead of 2.69).

Results

Now that we have a full test setup running, we can see what we improved thanks to its results. If you want to access the full detailed fixes, you can check the code update made here. This tool helped us fix some implementation errors, sanitize known issues, learn which PromQL features we were missing, and detect a few new bugs! Let’s review the changes.

Quick implementation fixes

Running those tests was a great help in understanding some of the implementation errors we made when trying to match PromQL behavior. For example, the time range function was sampling before computing the operation. Reversing those steps gave us a direct match with a native query. It also helped us fix some minor bugs in the handling of comparison operators and of several functions such as label_replace, holt_winters, predict_linear, and the full set of time functions (hour, minute, month…).

We also improved our handling of the PromQL aggregation modifiers: by and without.

Sanitize known issues

We recently discovered that we were not matching PromQL behavior on the series name: we were keeping the name for all compute operations. Prometheus, however, takes a different approach and keeps the name only when it’s relevant. The compliance tester helped us validate this specific update for all queries.

Because this tool tests the validity of a query against a native PromQL query, it helps us sanitize our query output. We knew that, in case of missing values or empty series, our output was not equivalent to that of Prometheus. We have corrected the part of Erlenmeyer that handles the output to match all PromQL cases included in the tests.

Unimplemented features

Running the tests led us to discover that we were missing some native PromQL features. As a result, Erlenmeyer now supports the PromQL unary operators and the “bool” keyword. Unary support allows the use of “-my_series”, for example. In PromQL, the bool keyword converts the result of a comparison to booleans: it returns series whose values are 1 or 0 depending on the condition, where 1 stands for true and 0 for false.
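To make the two features concrete, here is a small Python sketch of their semantics on a single series of (timestamp, value) samples. It models PromQL behavior for a comparison against a scalar; it is not Erlenmeyer code, and the function names and sample values are hypothetical.

```python
def greater_than(series, threshold, bool_modifier=False):
    """Model of `my_series > threshold` and `my_series > bool threshold`.

    Without the modifier, the comparison filters samples. With it, every
    sample is kept and its value becomes 1.0 (true) or 0.0 (false).
    """
    if bool_modifier:
        return [(t, 1.0 if v > threshold else 0.0) for t, v in series]
    return [(t, v) for t, v in series if v > threshold]


def negate(series):
    """Model of the unary minus, as in `-my_series`."""
    return [(t, -v) for t, v in series]


my_series = [(0, 3.0), (10, 7.0)]
filtered = greater_than(my_series, 5)
as_bool = greater_than(my_series, 5, bool_modifier=True)
negated = negate(my_series)
```

The filtered result keeps only the sample above the threshold, while the bool variant keeps both timestamps and reports 0 and 1.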

Open issues

Running all compliance tests and improving our code base led us to a success rate of around 91%. For the rest, we opened new issues on Erlenmeyer. We detected that:

the handling of the over_time function is not correct when the range is below the data point frequency,
for rate, delta, increase and predict_linear, our result isn’t precise enough to match the PromQL output when the range is below 5 minutes,
there are some minor bugs on the series selector (!=) and on label_replace (some checks are missing in the parameter validators),
PromQL subqueries, as well as some functions, are not implemented: ^ and % on two series sets, and the deriv function.

Those are the 4 missing points to cover the full PromQL feature set with Erlenmeyer. Our documentation already listed all the missing implementations.

Actions

This tool was a great help in improving our PromQL compliance, and we are happy with the result. Indeed, we reach 91% on the provided test set:

General query tweaks:
*  Metrics test
================================================================================
Total: 496 / 541 (91.68%) passed

Our next action is to release those fixes and improvements on all our Metrics regions. We look forward to hearing what you think about our PromQL implementation!

We now see a lot of projects implementing Prometheus writes and reads. These projects bring Prometheus a lot of missing features such as long-term storage, deletes, late ingestion, historical data analysis, HA… Validating a PromQL implementation is a big challenge, and being able to do so is a great help in choosing the right backend for your needs.

Warden: the self-healing framework for local actions

Alexandre Gauthier — Wed, 09 Dec 2020 11:23:14 +0000

This article is the follow up to Selfheal at Webhosting – The External Part published on 2020-07-17.
Part two below covers the local self-healing system.

Introduction

With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data centers and geographical zones, a certain number of software failures is inevitable. They must be handled to ensure the servers are in a functional state to provide continuity of service.

The overhead only increases once you account for the supporting pieces of infrastructure that provide the service, or that clients use to access and manage their data.

Generally speaking, restarting failed services and reacting to failing health checks with automatic operations can be done swiftly with a simple install of, for example, Monit, or with systemd unit parameters.

Web-hosting infrastructure, however, poses unique challenges that require a holistic response.

It’s not only large, but it’s distributed and highly available. A web host encountering a failure will not degrade the service, as another node in a cluster will immediately take its place to service client requests.

Additionally, providing Shared Hosting as a service means you are mostly running Unknown Workloads. No two websites have the same requirements, performance, or behavior. You can’t therefore make assumptions about what is normal, and what isn’t, which in turn makes establishing a baseline for Abnormal Behavior difficult.

In this context, it is generally an inevitable fact of life that sometimes those workloads will misbehave, crash, or put the system into a state it cannot recover from without intervention.

Trying to prevent this is therefore futile. Facilitating recovery within isolated fault domains is a more productive approach and is where self-healing becomes useful.

Self-healing systems

While the highly available nature of the infrastructure means failure states don’t necessarily degrade the service – the cause still needs to be investigated and the system recovered before being returned to the pool of available hosts to serve requests.

Without automated systems in place to achieve this, it can easily turn into a battle of attrition. Systems to diagnose and clear can pile up and eat into actual time spent on improvements and long-term mitigation of failure states.

We therefore employ two self-healing systems at Webhosting to automate the process:

Healer: External self-healing, which handles hardware problems, the absence of connectivity, and anything the local systems can’t resolve locally.
Warden: A local agent that exposes a framework for self-healing on local nodes. Warden is the component we will be exploring today.

Enter Warden

Warden was designed as a simple, lightweight daemon process that exposes a plugin API, allowing members of the SRE team to quickly write small pluggable Python scripts that handle specific conditions found on the local system. It is meant to exist as an agent on every single server of the web-hosting fleet, where it will work to maintain integrity and record information about failure states.

Goals

Warden has a few specific long term goals, which are worth going over.

Maximize system availability

Warden attempts to detect scenarios that would degrade or otherwise disrupt the service and responds to fault events from the monitoring system. This allows the system to return quickly to a functional, clean state, so that it can rejoin the pool of available hosts and serve requests again. Being a local, per-server process, Warden is able to be reactive and process events in a timely fashion, avoiding network round trips and monitoring delays. This contributes to the general health of the infrastructure by keeping the number of hosts in a failure state at a bare minimum.

Log diagnostic data for later analysis

Being a local agent present on every system, Warden is in the enviable position of being able to collect all sorts of surrounding data for export upon detecting a failure state.

Warden keeps a detailed record of the failure state and surrounding system state, to be queried later. This ensures diagnosis is not a blocking point for returning the host to duty. It is also important to remember the goal is not to sweep failure states under the carpet, or mask them.

Additionally, since many of these failure states are non-critical (as other hosts take over transparently), it may be several days before someone gets to look at them, at which point the relevant state to inspect is long gone, and we’re just left with an empty, yet offline server.

The primary goal here is actually to increase visibility into failure states, and to be able to quickly identify trends and underlying issues that must be mitigated or resolved, while ensuring the relevant data is kept while fresh.

At runtime, Warden generates snapshots of interesting system aspects. A long term goal is to capture a meaningful representation of the entire system state at the time of event, preventing the need to perform diagnostics directly on affected hosts.

Minimise human overhead

Analysis of failure states can be highly time consuming, especially if you’re flooded by hundreds of systems reporting mostly the same issue. It can also be irritating to constantly deal with transient failure states that are considered “normal”, either due to known popular application bugs, or other known circumstances. Just sorting the signal from the noise can be a full time job, especially if your team is actively trying to maintain general health and resolve the issue long term.

This can quickly turn into a battle of attrition where resources are expended on managing the alerts, failure states and problems over actively working to mitigate and resolve them.

Warden hopes to streamline this process massively, allowing SRE people to focus on what actually matters and makes a difference in terms of Quality of Service.

Make writing self-healing plugins easy

The Warden API is meant to be simple. It abstracts away much of the nuts and bolts involved in plugin execution.

Plugin authors should not have to worry about scheduling their own run, or writing complex logic to obtain the information they are after, nor should they have to write solid logging code.

All of this should be handled by Warden. Plugin authors should be able to focus on describing their conditions, selecting what relevant data they want to record, and writing an action that hopefully restores functionality.

How does it work?

Warden Core

As previously mentioned, Warden is a small daemon written entirely in Python. On boot, it will enumerate the plugins it is configured to activate, and place them in a queue.

Plugins may have configuration values as well, exposing easily tunable thresholds for response, or other settings. The Warden Core essentially serves to orchestrate everything, as well as provide the plugin API.

It also keeps track of various internal decisions, plugin states and how many times a plugin has done a self-healing action.

Then, once booted, the main workflow starts.

State Collection

Warden immediately goes and collects system states from its available sources. This could be, for example, a monitoring probe sink – which can be queried remotely as well as locally – or a snapshot of the process table.

Some deeper information is also generated, on demand, to keep the system load as light as possible.

This information is then sent to plugins matching the type of state collector. For example, plugins that operate on the process table will be gently fed this information.

Plugin hand off

A Warden plugin consists of essentially three primary callbacks, which should be easy to implement.

Plugins are encouraged to terminate early if they do not find actionable items in the system state.

Scan Phase

In this phase, a Warden plugin will receive information about the system state, in a form it can easily digest, using standard Python data structures.
The plugin can select some particular pieces of information it would like to further analyze, if necessary.

If an event is detected that the plugin can respond to immediately, then this is recorded to a Central Store (provided by our own Logs Data Platform product).

If at this point, a self-heal action is necessary, the plugin can signal it by setting its internal state accordingly.

Analysis Phase

During this phase, the plugin will further dissect the received status, and/or collect information about the system – either requesting them from Warden, or collecting them itself.

This is where the diagnostic information will be exported to a Central Store, alongside a plethora of useful metadata (where, when, who, how).

At this point, if not already signaled by the previous phase, the plugin can mark its internal state as requiring an action.

Heal Phase

Warden will then check the internal state of the plugin, and if it needs to perform an action, this final phase will be executed.
This is where the logic to resolve the situation is written. Services get restarted, processes get terminated, maintenance scripts called, etc.

Success (or failure) is reported, and Warden will dutifully log the Action and its results to the Central Store.

At this point, if an action was taken, Warden will refresh the corresponding state before moving on to the next plugin in the queue.

This process is repeated at configurable intervals that can be kept short, since plugins are lightweight and exit quickly if no issue is found.
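The workflow described above can be sketched in a few lines of Python. This is only an illustration under assumptions: Warden's real plugin API is not shown in this article, so the class name `Plugin`, the callbacks `scan`, `analyze` and `heal`, the `needs_heal` flag and the `run_cycle` function are all hypothetical names chosen to mirror the three phases.

```python
from typing import Any, Callable, Dict, List


class Plugin:
    """Hypothetical plugin base: three callbacks plus a small internal state."""

    name = "base"
    collector = "process_table"  # kind of system state this plugin consumes

    def __init__(self) -> None:
        self.needs_heal = False  # set by scan() or analyze() to request a heal
        self.heal_count = 0      # how many heal actions have been attempted

    def scan(self, state: Any) -> bool:
        """Return True if the state contains something actionable."""
        return False

    def analyze(self, state: Any) -> Dict[str, Any]:
        """Return diagnostic data to export; may set self.needs_heal."""
        return {}

    def heal(self) -> bool:
        """Attempt recovery and return True on success."""
        return True


class StuckWorkerPlugin(Plugin):
    """Toy plugin: flags processes whose (made-up) status is 'stuck'."""

    name = "stuck_worker"

    def scan(self, state: List[Dict[str, Any]]) -> bool:
        self.stuck = [p for p in state if p["status"] == "stuck"]
        return bool(self.stuck)

    def analyze(self, state: List[Dict[str, Any]]) -> Dict[str, Any]:
        self.needs_heal = True
        return {"pids": [p["pid"] for p in self.stuck]}

    def heal(self) -> bool:
        # A real plugin would restart a service or terminate processes here.
        return True


def run_cycle(
    plugins: List[Plugin],
    collectors: Dict[str, Callable[[], Any]],
    log: Callable[[Dict[str, Any]], None],
) -> None:
    """Run every queued plugin once through scan, analysis and heal."""
    states = {kind: collect() for kind, collect in collectors.items()}
    for plugin in plugins:
        plugin.needs_heal = False
        state = states[plugin.collector]
        if not plugin.scan(state):
            continue  # nothing actionable: exit early
        log({"plugin": plugin.name, "phase": "analysis", "data": plugin.analyze(state)})
        if plugin.needs_heal:
            ok = plugin.heal()
            plugin.heal_count += 1
            log({"plugin": plugin.name, "phase": "heal", "success": ok})
            # Refresh the corresponding state before the next plugin runs.
            states[plugin.collector] = collectors[plugin.collector]()
```

In this sketch the `log` callable stands in for the Central Store, and a scheduler would simply call `run_cycle` again at each configured interval.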

Dashboards and Visibility

Extensive Grafana dashboards as well as Graylog interfaces have been built to closely monitor everything the Warden does.

They simply query the Central Store where every single system reports its events and actions.

We can tell how frequently a specific self-heal is triggered, for example, on what amount of systems, and where they occur the most.

We can also easily tell where self-heals fail the most, between individual failure domains, or down to individual systems within a cluster.
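As a rough idea of the kind of aggregation such dashboards perform, here is a small Python sketch. The event fields (`cluster`, `plugin`, `phase`, `success`) are assumptions made for the example; the actual schema stored in the Central Store is not described here.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

Key = Tuple[str, str]  # (cluster, plugin)


def summarize_heals(events: Iterable[Dict[str, Any]]) -> Tuple[Counter, Counter]:
    """Count triggered and failed self-heals per (cluster, plugin) pair."""
    triggered: Counter = Counter()
    failed: Counter = Counter()
    for event in events:
        if event.get("phase") != "heal":
            continue  # only heal actions are counted
        key: Key = (event["cluster"], event["plugin"])
        triggered[key] += 1
        if not event["success"]:
            failed[key] += 1
    return triggered, failed
```

Calling `triggered.most_common(5)` on the result would list the five cluster and plugin pairs where self-heals fire most often.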

They are made to be easy to drill down into, to get a bird’s-eye view of the global state as well as a detailed view of the exact actions taken by a single plugin.

Keeping this up on a TV monitor in the office has been of incredible value when it comes to casually noticing trends, as well as identifying which problems are recurrent and which are transient.

A Practical Example

As a practical example of how Warden can be tied into existing systems and handle their events, there is a probe on our servers that verifies the availability of the hosting runtime stack, ensuring it functions and is in the correct state to process requests.

It would often raise an alarm after some specific code in our hosting stack either terminated abnormally or created a scenario in which the stack was incapable of recovering on its own. This would generate an alert, mark the server as unavailable, and remove it from the active pool.

Rebooting the server or restarting the entire stack would obviously resolve the situation and return the system to the pool of available hosts, but this robs us of the opportunity to inspect the issue. Existing metrics and logs only shed partial light on what exactly had occurred to cause this, especially since reproducing it often depends on specific applications we host. Not to mention that by the time someone got to look at it, the chances are that the interesting state has long left the system.

In order to mitigate this, a Warden plugin was written with the following logic:

It scans the local alert sink for the failure state (exiting if it is not present)
During the analysis phase, crash dumps are collected, the filesystem state is recorded, relevant logs are extracted.
The exact version of the hosting stack is also collected, alongside everything relevant.
This is then sent to the Central Store alongside information about the host, the site, and timestamps.
The plugin then marks itself as needing to take action.
With everything relevant collected, the hosting stack is destroyed, cleaned, and relaunched.
Afterwards, the probe that raised the alert is refreshed. Congratulations, the system is now back online, and in a matter of minutes!
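The logic listed above could look roughly like the following Python sketch. It is not the actual plugin: the alert name, the class and method names, and the four injected callables (`collect_diagnostics`, `send_to_store`, `relaunch_stack`, `refresh_probe`) are hypothetical stand-ins for whatever Warden and the hosting stack really expose.

```python
import time
from typing import Any, Callable, Dict, List


class HostingStackPlugin:
    """Hypothetical plugin reacting to a 'hosting stack unavailable' alert."""

    ALERT = "hosting_stack_unavailable"  # made-up alert name

    def __init__(
        self,
        collect_diagnostics: Callable[[], Dict[str, Any]],
        send_to_store: Callable[[Dict[str, Any]], None],
        relaunch_stack: Callable[[], bool],
        refresh_probe: Callable[[], None],
    ) -> None:
        self.collect_diagnostics = collect_diagnostics
        self.send_to_store = send_to_store
        self.relaunch_stack = relaunch_stack
        self.refresh_probe = refresh_probe
        self.needs_heal = False

    def scan(self, alerts: List[str]) -> bool:
        """Scan phase: exit early unless the failure state is present."""
        return self.ALERT in alerts

    def analyze(self, host: str) -> None:
        """Analysis phase: export diagnostics, then request a heal."""
        record = {"host": host, "timestamp": time.time()}
        record.update(self.collect_diagnostics())  # crash dumps, logs, versions...
        self.send_to_store(record)
        self.needs_heal = True

    def heal(self) -> bool:
        """Heal phase: relaunch the stack, then refresh the probe."""
        if not self.needs_heal:
            return False
        ok = self.relaunch_stack()
        self.refresh_probe()
        self.needs_heal = False
        return ok
```

Passing the collaborators in as callables keeps the sketch free of any real system call, which also makes each phase easy to exercise in isolation.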

The turnaround time for writing the plugin was also reasonably short, and was deemed complete in two iterations (mostly to collect more data).

This information helped our developers pinpoint exactly what was happening, and it continues to be a solid metric for gauging the health of our infrastructure.

In Conclusion

So far, Warden has helped not only lower the amount of human resources expended towards diagnosing and resolving issues, but has generated targeted improvements to various components of our stack.

It has also identified issues that would otherwise have gone unnoticed simply by graphing a visual trend of certain non-fatal states, which has led to more fixes and improvements.

On-call duty cycles have also been noticeably more peaceful, as the barrier to automating the resolution of simple issues has been significantly lowered.

It has generally allowed us to better focus our energy where we are able to make a difference, and through further improvements, will hopefully continue to do so.

The Bastion – Part 4 – A new era

Stéphane Lesimple — Thu, 29 Oct 2020 09:25:56 +0000

This is the last article in the series about The Bastion. In the previous parts, we covered the principles of The Bastion, and talked about how delegation was at the core of the system. Then we explained how Security was at the heart of the design principles, in a detailed but hopefully not too-long article.

Today, we’re announcing something special. You might have guessed it already, thanks to the (not so) little breadcrumbs trail we left in the previous articles. Without further ado, and because pictures can say a thousand words on their own:

We’re going open-source! We’re very excited to share this news with you, and to mark this new milestone in the lifecycle of The Bastion. We think it’s a perfect reason to bump to the next major version: v3.00.00! Obviously, all previous versions were internal-only.

The code is available at GitHub, and we’re also moving all the non-OVHcloud-specific development there from now on.

The documentation is also available online (as well as offline as reStructuredText files), and we encourage you to read it. For the most impatient, there is also a Docker image available on Docker Hub if you want to give it a try: the TL;DR section of the README.md on GitHub will get you started.

Many of the more advanced features (such as PIV support, 2FA/MFA support, the notion of realms, the HTTPS proxy, etc.) are not yet fully documented, but all the basics are already there. We will enhance this during the next few weeks/months. A few features are not yet open-sourced either, such as the db plugin we talked about in the previous post. But it’ll make it to the open-source version eventually.

We hope it’ll be of use to the community, as much as it is to us, and we can’t wait to hear from you! The GitHub page is over here.

The Bastion – Part 3 – Security at the core

Stéphane Lesimple — Fri, 23 Oct 2020 15:33:49 +0000

In previous parts, we’ve covered the basic principles of the bastion. We then explained how delegation was at the core of the system. This time, we’ll dig into some governing principles of how The Bastion is written.

In a nutshell, the main purpose of the bastion is to ensure security, auditability and reliability in all cases. To this end, the bastion is engineered in a very specific way, with some principles that must be respected when implementing new features. Today we’re going to zoom in on how one of the functionalities of the bastion has been implemented to ensure in-depth security. There are technical details ahead, so viewer discretion is advised!

The operating system is not just a scheduler

One of the engineering principles of the bastion is to leverage the underlying operating system’s security features, as additional guards on top of the code’s logic itself.

Usually, when developing a program, one doesn’t really need to think about the OS it’ll be running on, because all the business logic goes directly into the code. At its basic level, the OS’s job is to ensure the program runs on top of the hardware it is in charge of, by abstracting it, along with the other pieces of software that might be sharing this hardware. In other words, most of the time the OS is mainly a scheduler, whose job is to ensure all the programs are running properly, and don’t step on each other’s toes.

To this end, an OS has the notion of user (or “account”), who may be the owner of some running programs and some files on the filesystem, alongside the notion of group (of users), so that e.g. a folder can be written to by several users. We’ll come back to this in a few minutes.

Now, let’s talk about applications. Most of the time, applications needing to handle users have a database with a “users” table, detailing the information about each user. In that case, the application’s code logic handles all the behaviour the program must have with respect to its users. For example, to authenticate a user, it stores a hash of each user password in the database, and checks whether the entered password’s hash matches what is stored in the database. If it does, then it deems the user to be successfully logged in. All this logic is entirely expressed in the code, the operating system plays no role in the process whatsoever.

There is then only one operating system user dedicated to the application, regardless of how many users exist in the application’s database. The application will run under this OS user, and all files logically pertaining to different users in the application’s functional view will be owned by this same OS user. It works because the segregation between the functional users is done entirely by the code: even if the application can technically access all its users’ files, it will only allow, through its code logic, access to the proper files for the proper user.

Code has bugs, but it shouldn’t matter

Now, let’s imagine we’re talking about a program – let’s name it MySuperCloudApp – whose job is to store files for its users, so that they can later fetch them from the cloud. Let’s imagine there is a flaw in the code (of course, this never happens), which doesn’t properly escape the user’s requested file name. If, once logged in as my user, I request a download of the file named myfile.txt, the application will allow it because I’m logged in.

But what happens if I request ../somebodyelse/herfile.txt, instead? If the code hasn’t been engineered to detect and filter out this weird request, it’ll just pass the read command to the underlying filesystem, which will allow it because, remember, the application runs under one OS user and all the actual user logic is handled by the application itself. All the application files are owned by the same OS user, so the request seems completely legitimate from an OS standpoint. I’ve just found a way to steal all the other users’ files. This type of flaw is called a path traversal, and is, unfortunately, pretty common.
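To make the flaw concrete, here is a minimal Python sketch of the vulnerable lookup next to a check that rejects the traversal. MySuperCloudApp is imaginary, so the storage root and both function names are assumptions made for the illustration.

```python
from pathlib import Path

STORAGE_ROOT = Path("/srv/mysupercloudapp")  # hypothetical storage location


def naive_path(user: str, requested: str) -> Path:
    """Flawed: trusts the requested name, so '../' segments escape the user's folder."""
    return STORAGE_ROOT / user / requested


def safe_path(user: str, requested: str) -> Path:
    """Resolve the request and refuse anything outside the user's own folder."""
    user_root = (STORAGE_ROOT / user).resolve()
    candidate = (user_root / requested).resolve()
    if user_root not in candidate.parents:
        raise PermissionError(f"{requested!r} is outside of {user}'s storage")
    return candidate
```

With the naive version, a request for `../somebodyelse/herfile.txt` ends up pointing into another user's folder, while `safe_path` raises `PermissionError` for the same input. This is the application-level fix only: it still relies on the code being right.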

For the bastion, the OS is more than a scheduler: every bastion user is actually mapped to an operating system user underneath. Likewise, every bastion group is mapped to an operating system group underneath. So are all the group roles we’ve talked about in the previous post. This is a strong design choice: we end up with an application that is deeply intertwined with the OS it’s running on, and this comes with some cons. However, for a security asset, which the bastion is, the pros vastly outweigh them.

Had MySuperCloudApp adopted this design, mapping its application users to actual OS users, then the attack we’ve talked about before wouldn’t have worked. Even if the application’s code was flawed, and passed the read request to the OS below, the OS would have denied it, because down at the OS level, ../somebodyelse/herfile.txt is not owned by the same user. This is where the OS comes to the rescue of a flawed portion of code (which still needs to be corrected in all cases, of course!).

To take a more Bastion-y example, if a user belongs to groupA, and tricks the code into thinking he also belongs to groupB (because of a flaw in the bastion’s code logic), then it doesn’t matter too much because the OS will deny this user access to groupB’s keys, as he won’t be allowed to read the file at the OS level. So he still won’t be able to access any of groupB’s servers. Technically, this is done by offloading the authentication part to sshd, which is well-known and does it quite well. When this phase succeeds, sshd creates a session under the proper OS user, and starts the bastion code entry point under this session.

We use the OS as an additional safety net in case there is a logic error or a vulnerability in the code: even if the code is tricked into taking bad decisions, the underlying OS will be there to deny the action, hence nullifying the impact.

In other words, all the OS bastion users have the bastion code declared as their system shell (instead of the usual /bin/sh). We’re even going further than that: the code is engineered in such a way that if a user succeeded in getting a real shell on the bastion, i.e. being able to run any command he’d like on the OS itself, completely bypassing all of the bastion code’s logic and checks, then he shouldn’t be able to do much more than what the normal bastion code logic allows him to. That’s another strong design principle, but it helps to drastically reduce the impact of a security vulnerability, should one happen.

Trust no one

For some features to work correctly, the design choices we’ve outlined above imply that the bastion must sometimes create and delete users on the OS level. This can’t be done using unprivileged accounts, hence some parts of the code need to run under elevated privileges.

In The Bastion jargon, those portions of the code are called helpers, and are separated from the other portions of the code, normally running under the OS user corresponding to the functional bastion user who’s running them.

The helpers don’t trust the rest of the bastion code, so they never blindly trust what is passed as input to them, even if theoretically, this input has already been validated by the bastion code launching the helper. Their higher privilege is granted using the sudo command, with a very strict sudoers configuration which ensures that the caller can only run the helpers it’s supposed to run, and with the parameters it’s supposed to be allowed to specify. Once the helper has finished working, it communicates back information to its caller using JSON.
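The "validate again, then answer in JSON" pattern can be illustrated with a short sketch. The real helpers are Perl scripts whose code is not shown in this post; the Python function below, its name, its validation rules and the JSON keys (`status`, `message`, `value`) are all assumptions made for the example.

```python
import ipaddress
import json
import re

GROUP_RE = re.compile(r"[a-zA-Z0-9_-]{1,32}")  # hypothetical naming rule


def helper_add_server(group: str, ip: str, port: str) -> str:
    """Re-validate every argument, even if the caller already did, and reply in JSON."""
    try:
        if not GROUP_RE.fullmatch(group):
            raise ValueError("invalid group name")
        address = ipaddress.ip_address(ip)  # raises ValueError on anything odd
        if not port.isdigit() or not 1 <= int(port) <= 65535:
            raise ValueError("invalid port")
    except ValueError as exc:
        return json.dumps({"status": "error", "message": str(exc)})
    # The privileged action itself (updating the group's server list) would go here.
    value = {"group": group, "ip": str(address), "port": int(port)}
    return json.dumps({"status": "ok", "value": value})
```

A caller would parse the returned string with `json.loads` and check the `status` field before trusting anything else in the reply.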

Let’s take the example of the groupAddServer command. As its name implies, this command is used by a group aclkeeper to add a new server to a bastion group. Let’s say the user guybrush is an aclkeeper of the bastion group island. On the OS level, the OS user guybrush will be a member of the island-aclkeeper system group. One part of the sudoers configuration will say this:

%island-aclkeeper ALL=(island) NOPASSWD: /usr/bin/env perl -T /opt/bastion/bin/helper/osh-groupAddServer --group island *

This line translates to:

all the members of the island-aclkeeper system group (i.e. all the aclkeepers of the island bastion group) can run, as the island system user, the osh-groupAddServer perl script, in tainted mode, but with the command line options forced to start with --group island

The island system user is not mapped to a logical user of the bastion; it is a technical account representing the island bastion group. The file listing the servers of the island bastion group is owned by this system user, and only the aclkeepers, through this sudo rule, can impersonate this system user to add a server to their group. Also note that the Perl taint mode is used here (-T). This is a special mode that instructs Perl to immediately halt execution of the program (here, the helper) if an attempt is made to use a variable influenced (tainted) by the outside environment, without checking for its validity first. This is an additional protection to ensure that an improperly sanitized input can’t make it through the program’s execution flow.

Going down the rabbit hole with minijail

For some plugins, we even went one level deeper. For example, we have a plugin to allow users to connect to a PostgreSQL database, using the classic psql client, but directly from the bastion. The idea is that the password to access the database is known to the bastion, not to the user, so the password can be extremely complex, and change every day if necessary. This is completely transparent to the user, who just connects to the bastion and asks to run the database plugin. This scheme is the same as when using SSH on both sides: as seen in the first post of this series, the ingress connection is between the user and the bastion (SSH), and the egress connection is between the bastion and the remote server. The only difference is that, in this case, the egress connection is not SSH, but SQL.

But how to secure psql so that, when running on the bastion, the user can’t escape from it? The problem is the same with the mysql client. Those programs are engineered to be run from the local computer, where the user can already run any command, so there’s no real reason to add a configuration option to those programs that forbids local execution of arbitrary commands (shell escape). However, on the bastion, we don’t want to allow that. Of course, maintaining a forked version of these SQL clients is a complete no-no, because the time we would allocate to maintaining these forks would be of better use in other projects. Instead, we’ve used a tool named minijail, whose purpose is to make readily available, to any program, the (not so) recent features from the Linux kernel, such as namespaces, capabilities, seccomp, the no_new_privs prctl() flag, etc. We’re not going to detail each and every one of these features, as there’s a lot of material online about them, but rather zoom in on how we’ve used them in the context of The Bastion.

Let’s start with the conclusion: here is how it looks on the bastion system itself, while somebody is using the database plugin:

[Screenshot (screen2.png): process listing on the bastion while guybrush is using the database plugin]

Don’t Panic yet, let’s go through this line by line.

The first line (PID 16) is the sshd system daemon. Nothing fancy here, this is your usual friendly daemon, listening on port 22 for incoming SSH connections.

The second line (PID 413) is the privileged process specially spawned when guybrush logged in successfully on the server. This is also completely standard SSH behavior: when somebody logs in, two sshd processes are spawned by the daemon, a privileged one, and an unprivileged one. Both are dedicated to handling the user, while the parent (the daemon) continues listening for new connections.

The third line (PID 417) is the corresponding unprivileged sshd process for guybrush. This one is responsible for starting up guybrush’s shell as soon as he’s logged in. Note that from now on, and until further notice, all code is executed under the user’s own (absence of) privileges.

The fourth line (PID 418) is guybrush’s shell. This is where it’s starting to differ from your usual server. In this case, the shell is not /bin/bash or /bin/zsh, but a portion of the code of the Bastion. As explained above, the bastion is declared as the user’s shell, so when somebody logs in, this is what gets executed instead of a more regular POSIX shell. This portion of the code is responsible for parsing the command line the user specified, and executing the corresponding action, if this action is allowed. In this case, the user passed the -i parameter, which asks the bastion to start in interactive mode. This is a special mode where it’s easier to launch several bastion commands without having to re-authenticate oneself each time. So, this process is listening for commands from the user. Note that, at this stage, the user has already been authenticated by the system, as this is completely delegated to sshd. If the authentication fails, the user’s shell (here, the bastion code) is never executed.

The fifth line (PID 497) is the child of the interactive process, re-executing the user’s shell (osh.pl) with new parameters: --osh db, which will instruct this instance of the shell that the user wants to run the db bastion command.

The sixth line (PID 502) is the current bastion command the user is executing. This is the db plugin, and we can see part of the command line: --name lechuck, which tells the plugin that the user wants to connect to the database named lechuck.

The seventh line (PID 503) is the ttyrec parent process. As explained in the first post of this series, the entire console output of the session is recorded by the bastion, and this process is in charge of doing it.

The eighth line (PID 504) is the ttyrec child process, needed for pseudo-tty support, which in turn is needed for the recording. If you really want to know more about pseudo-ttys, head on to man openpty and/or the ttyrec code itself.

The ninth line (PID 505) is the sudo call to start minijail. This is needed because minijail needs to be root for a proper setup of the jail, before downgrading itself to an unprivileged account.

The tenth line (PID 506) is sudo’s child, which is in charge of starting the subcommand (minijail in this case).

The eleventh line (PID 507) is the invocation of minijail. The complete command line we’re launching is:

/bin/minijail0 --logging=stderr -u guybrush -g guybrush -n -v --uts -d -P /tmp/chroot-guybrush-psql-wsvhp4 -S /etc/bastion/minijail/db-psql.seccomp -b /lib64 -b /lib -b /usr/lib -b /usr/share -k /home/guybrush/.psql /profile bind 0x10100E rw --set-env HOME=/ --set-env USER=guybrush --set-env LOGNAME=guybrush -- /usr/lib/postgresql/11/bin/psql --pset=pager=off -h dbserver.example.org -p 5432 -U lechuck -- lechuck

Quite a beast. But let’s go through this step by step.

This tells minijail to set up a new UTS namespace (--uts), and to set the no_new_privs flag (-n), so that any process it creates (and those processes’ own children) will never be able to become root again, no matter what. Under a no_new_privs process, even having a wildcard sudoers file, or knowing the root password and attempting to use su, is not enough to get back to UID 0. You just can’t.

We also ask minijail to create a new mount namespace (-v) then pivot_root (-P) to a temporary empty directory, /tmp/chroot-guybrush-psql-wsvhp4, so that the rest of the filesystem becomes completely inaccessible. As we still need to be able to run an SQL client in this environment, we bind-mount a few important directories in this new namespace, such as /lib64, /lib and such, and also just one directory in read-write, located in the user’s own home directory, so that from inside this jail, he can still have his .psql_history and .psqlrc files from past sessions.

We also set a few environment variables, so that the SQL CLI is not lost (HOME, USER, LOGNAME), then set up a seccomp policy on top of all that, to limit which syscalls can be made from this environment. For example, the execve() syscall is forbidden: the SQL CLI cannot create any other process, or it’ll get terminated. Last but not least, when all of this has been set up by minijail, it drops its privileges to the guybrush user (-u) and guybrush group (-g), before executing the psql binary.
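To see how the pieces of that command line fit together, here is a small Python sketch that assembles a similar argument list. It only reuses flags that appear in the command shown above; the function name and its parameters are hypothetical, and the -d and -k options are deliberately left out because their exact arguments are not explained here.

```python
from typing import Dict, List


def build_jail_argv(
    user: str,
    chroot_dir: str,
    seccomp_policy: str,
    ro_binds: List[str],
    env: Dict[str, str],
    command: List[str],
) -> List[str]:
    """Assemble a minijail0 command line similar to the one shown above."""
    argv = [
        "/bin/minijail0", "--logging=stderr",
        "-u", user, "-g", user,  # drop to the user's own UID and GID
        "-n",                    # set the no_new_privs flag
        "-v",                    # new mount namespace
        "--uts",                 # new UTS namespace
        "-P", chroot_dir,        # pivot_root into an empty temporary directory
        "-S", seccomp_policy,    # seccomp policy limiting the allowed syscalls
    ]
    for path in ro_binds:
        argv += ["-b", path]     # bind-mount the directories the client needs
    for key, value in env.items():
        argv += ["--set-env", f"{key}={value}"]
    return argv + ["--"] + command
```

Building the command as a list rather than as a single string means no shell is involved when it is eventually executed, so none of the values can be reinterpreted as shell syntax.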

The twelfth line (PID 508) is the psql process itself, running inside the jail we’ve built above. This way, it is extremely difficult to escape the psql binary and get out of the jail. The whole setup instantly disappears when the user disconnects. The only remains will be his .psql_history and .psqlrc files. Of course, the ttyrec session record of his SQL usage will remain, too (as executed outside of the jail).

This concludes the post, where we’ve been detailing how some design principles help in delivering a resilient and secure system. Next week, in the final post of this series, we’ll be announcing something special. Stay tuned!

OVHcloud Predictor, part 1

Alexandre Kalatzis — Mon, 05 Oct 2020 21:33:43 +0000

In our previous article concerning the CVE-2017-9841 vulnerability, we presented our web application firewall (WAF) implemented with NAXSI.

Usually, a WAF is run directly on the web server. At OVHcloud, we chose to run our web application firewall upstream, on a very powerful software layer that is specific to our web hosting infrastructures. These are the ‘Predictors’.

If you would like to learn more about them, this article focuses on them in detail.

They are a very crucial part of our infrastructure, and they’re like heroes you’d read about in fiction — but absolutely real!

Before we start describing the role of Predictors and how they work, we will need to understand how this infrastructure operates, powering 6 million websites.

To work on the internet, websites need a computer that serves queries from web browsers — this is a role a server plays.

But just one server is not enough to ensure that a website is available all the time. This means you need to add several servers in case one of them experiences an outage, as well as load balancers to redirect traffic to the right machine.

Websites use and produce data that needs to be conserved, even if the server that makes the website available on the internet goes down. To guarantee security for this data, we externalise storage within two distinct technical building blocks — file servers, and database servers.

These specialised servers have dedicated hardware and software, which offer optimal conservation for data and simplified backup strategies.

Here is a diagram representing the organisation of our web hosting clusters:

A vast architecture like this requires heavy financial investment. In practice, websites do not constantly use their allocated resources. With several websites hosted on this single infrastructure, the resources can be shared, and costs can be divided. This is the principle behind shared hosting.

At OVHcloud, we also offer hosting plans with guaranteed resources — Performance hosting plans. This means that instead of sharing a server’s resources with other customers, your website is placed on a separate server, and its resources are fully dedicated to you. Our Predictors use their talents here, too!

As you can see, the Predictors are included in the big family of load balancers.

So what sets them apart? They are specific to our web hosting plans, and they use data from our IT system to recognise the domains hosted, as well as the resources allocated to them.

They have four main roles:

Distributing queries across different clusters and web servers, depending on our customers’ offers.
Ensuring that resources are distributed equally to customers using web servers, and redistributing resources automatically if required.
Protecting the infrastructure.
Regulating traffic during incidents, and blocking access to certain servers while the technical team carry out interventions.

Since the subject is so vast, we will detail the first two points in this article, and we’ll cover the other two points in a second article.

To assign the right server for each HTTP request, Predictors use several different criteria.

A fair balance of traffic

To work as load balancers, Predictors analyze all incoming queries to determine which web server is the best target for each one, based on the domain name requested.

This process only adds a few milliseconds to the request response time, but it adapts the Predictors’ answers to server statuses and website traffic in real time. This way, we can guarantee optimal service availability.

In nominal operation, websites are reallocated periodically to balance the load for servers, and ensure optimal performance for customers.

Reallocation depending on the hosting plan

Predictors determine which hosting plan is linked to each incoming request. This enables them to redirect traffic to the corresponding web farm.

For example, customers who have opted for Performance offers with guaranteed resources are redirected to a web farm where the resources are dedicated to each website.

Some hosting offers like Cloud Web are grouped on dedicated clusters, which makes them easier to maintain.

With this system, we can also manage situations where the service is being misused. Although they may not do so knowingly, some customers use a high volume of resources on our infrastructure, and this can negatively impact performance for other websites hosted on the same server. To avoid this, a hosting plan can be temporarily redirected to a server cluster that is dedicated to managing these situations on a ‘best effort’ basis. When this happens, the customer is contacted accordingly.
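The plan-based routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the farm names, the "shared" plan label and the `flagged_domains` set are hypothetical, and the real Predictors read this information from the OVHcloud IT system, which is not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from hosting plan to web farm; names are illustrative only.
PLAN_TO_FARM = {
    "shared": "farm-shared",
    "performance": "farm-performance",  # guaranteed resources
    "cloud-web": "farm-cloud-web",      # grouped on dedicated clusters
}
BEST_EFFORT_FARM = "farm-best-effort"


@dataclass
class Router:
    # domain -> hosting plan, as it might be loaded from an information system
    plans: dict[str, str]
    # domains temporarily moved because they consume too many resources
    flagged_domains: set[str] = field(default_factory=set)

    def farm_for(self, host_header: str) -> str:
        """Return the farm that should serve a request for this Host header."""
        domain = host_header.lower().split(":")[0]  # drop an optional port
        if domain in self.flagged_domains:
            return BEST_EFFORT_FARM
        plan = self.plans.get(domain)
        if plan is None:
            raise KeyError(f"unknown domain: {domain}")
        return PLAN_TO_FARM[plan]
```

For example, `Router(plans={"example.com": "performance"}).farm_for("example.com")` returns the Performance farm, and adding the domain to `flagged_domains` sends it to the best-effort farm instead.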

We will discuss this in more detail in the next article.

Cache optimization

On our shared infrastructure, the data stored on hosting plans is grouped onto file servers that support multiple customers.

The web servers connect to these file servers using the NFS (Network File System) protocol.

This protocol is known for being robust and flexible, and enables us to share files across the network. Each time a read or write action is performed on the storage space dedicated to a website, data passes through the network before being read or written on the remote storage hardware. The use of this protocol — associated with the remote storage block — is a key factor that makes both the Predictor and the infrastructure resilient against website outages. The website data is instantly available on other web servers, so the website’s traffic can be redirected seamlessly from one server to another.

The simplest way of distributing queries across a web server cluster is to send the same number of queries to each server. But there are smarter ways of going even further to optimize resource usage!

To reduce queries to the storage server and speed up the website, we use a cache layer directly on the web servers to manage the file system.

This means that when a web user visits a website for the first time (whether they are using a CMS like WordPress, Joomla!, etc., or a website coded from scratch), the web server will read the website data on the file server, and store it in-memory via its VFS (Virtual File System) cache. This cache offers an abstraction of the underlying storage system, enabling the web server to use remote storage via NFS protocol — the same way as a local hard drive. For future queries, the HTTP server will avoid querying the network for this file.

But as Phil Karlton said, there are only two hard things in computer science: cache invalidation and naming things (https://quotesondesign.com/phil-karlton/). With each write operation, the cache needs to be updated, which generates traffic on the network.

To overcome this, the Predictors keep website visitors on the same server — which means the cache doesn’t have to be refreshed on other servers that are only used in the event of an incident, or rebalancing operation.

By assigning a single server to each hosting plan, we can significantly reduce the volume of network queries, which helps us get good value for money from the infrastructure.
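To illustrate the idea of keeping a hosting plan on one server, here is a minimal Python sketch. It is not the Predictors' actual algorithm: the class name, the "least number of pinned plans" selection rule and the in-memory dictionaries are all assumptions made for the example.

```python
class StickyBalancer:
    """Keep each hosting plan on one web server so its file cache stays warm."""

    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self.assigned = {}  # hosting plan -> server name

    def _load(self, server):
        # Number of hosting plans currently pinned to this server.
        return sum(1 for s in self.assigned.values() if s == server)

    def server_for(self, plan):
        current = self.assigned.get(plan)
        if current is not None and self.healthy[current]:
            return current  # same server as before, so its cache is reused
        candidates = [s for s, ok in self.healthy.items() if ok]
        if not candidates:
            raise RuntimeError("no healthy web server available")
        # Pin the plan to the healthy server with the fewest pinned plans.
        chosen = min(candidates, key=self._load)
        self.assigned[plan] = chosen
        return chosen

    def mark(self, server, healthy):
        """Record the latest health status of a server."""
        self.healthy[server] = healthy
```

A plan only moves when its server is marked unhealthy, which mirrors the point made above: other servers need to load the website's files only during an incident or a rebalancing operation.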

Predictors can also divide incoming traffic according to certain criteria (e.g. the solution, the source IP address), and balance the load for a single website across several web servers, while getting the optimal performance delivered by this cache optimization — but this is another story, which we’ll focus on in another article.

Monitoring

The Predictors also monitor the web servers constantly by retrieving their statuses. And they don’t just check the machine’s availability (does it ping? Is Apache responding?) — they also verify many other probes that are more specific to our hosting platform, and help us guarantee that websites are working properly.

If a web server becomes unavailable, the Predictors no longer redirect traffic to it, and the requests that would have been sent there are redirected to a working server. This significantly reduces the impact on the customer’s side.

This is when the self-healing mechanism described in another of our articles comes in: Selfheal at Webhosting – The external part

It takes over to repair the web server, before the Predictors reintegrate it into the cluster, and HTTP requests can be sent to it again.
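The monitoring loop can be pictured as running a set of probes against every web server and only routing traffic to servers that pass all of them. The sketch below is an assumption-laden illustration: the probe names echo the two examples given above (ping, Apache), and the function names `run_probes` and `routable` are hypothetical.

```python
from typing import Callable

Probe = Callable[[str], bool]  # takes a server name, returns True if the check passes


def run_probes(servers: list[str], probes: dict[str, Probe]) -> dict[str, list[str]]:
    """Return, for each server, the names of the probes that failed."""
    report = {}
    for server in servers:
        report[server] = [name for name, probe in probes.items() if not probe(server)]
    return report


def routable(report: dict[str, list[str]]) -> list[str]:
    """Servers that passed every probe and may receive HTTP requests."""
    return [server for server, failed in report.items() if not failed]
```

Keeping the list of failed probes, instead of a single boolean, gives a repair mechanism something to act on when a server is taken out of the pool.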

And our final challenge? Resolving how SSL certificates are issued.

Since 2016, we have delivered all of our web hosting plans with SSL certificates generated by our partner, Let’s Encrypt.

Before we deliver certificates, Let’s Encrypt needs to verify that the requester is legitimate. To verify this, Let’s Encrypt provides ‘challenges’ that can only be resolved by an infrastructure responding behind the domain name of the SSL certificate requester. This verification is carried out via an HTTP request to a unique URL that depends on the domain and is generated when our infrastructure launches the certificate request on the customer’s behalf.

On our infrastructure, the Predictors play a vital role in resolving these challenges! Placed on the critical path upstream from web servers, they can receive the Let’s Encrypt request, and process it without sending queries to the web servers.

This means we can generate SSL certificates with total transparency for our users, and simplify HTTPS access to websites!
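As an illustration, assuming the challenge type in use is ACME HTTP-01 (where the certificate authority fetches `/.well-known/acme-challenge/<token>` on the domain and expects the matching key authorization in the response body), a component placed in front of the web servers could answer like this. The class and method names are hypothetical, and the way pending challenges are stored is an assumption.

```python
ACME_PREFIX = "/.well-known/acme-challenge/"


class ChallengeResponder:
    """Answer HTTP-01 challenges upstream, without involving the web servers."""

    def __init__(self):
        self.pending = {}  # (domain, token) -> key authorization string

    def register(self, domain, token, key_authorization):
        # Called when a certificate request is launched for this domain.
        self.pending[(domain.lower(), token)] = key_authorization

    def handle(self, host, path):
        """Return (status, body) if the request is an ACME challenge, else None."""
        if not path.startswith(ACME_PREFIX):
            return None  # not a challenge: forward to the web servers as usual
        token = path[len(ACME_PREFIX):]
        body = self.pending.get((host.lower(), token))
        if body is None:
            return (404, "")
        return (200, body)
```

Returning `None` for ordinary paths keeps the challenge logic separate from normal routing: only requests under the challenge prefix are answered directly.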

And is that everything?

As you will have gathered from this article, Predictors are essential components of the OVHcloud shared hosting infrastructure.

They help us efficiently provide the features we offer, so that we can deliver web hosting solutions at the best price.

Like superheroes, they have more than one trick up their sleeve. And above all, they’re here to protect us!

In our next article, we will show you the benefits of Predictors in terms of security and stability.

The OVHcloud SSH Bastion – Part 2: delegation dizziness

Stéphane Lesimple — Fri, 11 Sep 2020 15:05:44 +0000

This is the second part of a blog series; here is part one. We’ve previously found that the bastion is not your usual SSH jumphost (in fact, we found it is not a jumphost at all) and we discussed how the delegation was one of the core features we’d originally needed. So, let’s dive into these concepts. There are two compatible access models on the bastion: personal and group-based.

Personal Accesses – Piece of Cake

On the bastion, each account has (at least) one set of personal egress keys. These beasts are generated when the account is first created. The personal egress private key sits in the bastion account home. The account user has no way to see it, or export it out of the bastion, but they can use it through the bastion’s code logic. The user can retrieve the corresponding public key at any time, and install it – or get it installed – on the remote servers they need to access. Depending on your use case – and the level of autonomy you want to give to the teams – there are two ways of managing these personal accesses.

Help yourself

The first way mimics how you would manage accesses if you weren’t using an SSH bastion at all. This is a perfectly valid way to handle accesses at a small scale, with few users and a limited number of machines. This allows anyone to grant themselves personal accesses on the bastion, without having to ask anyone else to do it. It sounds like a security hole, but it’s not. If someone adds a personal access to a remote server for themselves, it will only work if their personal egress public key has already been installed on that server. In other words, they either already had access to the remote server to do this – using means other than the bastion – or somebody who had access to the remote server accepted the addition of their key. Either way, they cannot magically grant themselves personal access without the admins of the remote server first permitting their key.

Ask the IT crowd

Another way to handle this can be to grant a limited number of people, such as security teams, the right to add personal accesses to others. This way people are less autonomous, but it might be useful if adding accesses has to be enacted via normalized processes. It also has some nice effects: as a sysadmin, one of the pros is that you can create 3 separate accounts on the remote machine, and map them to each bastion account you’re adding. This is a good method for achieving end-to-end traceability, including on the remote server, where you might want to install auditd or similar tools. It’s also doable in the help yourself mode, but it may be harder to enforce.

To be clear, this access model doesn’t scale so efficiently when we’re dealing with whole teams, or big infrastructures – this is where group-based access comes in handy.

Group Accesses – Let’s Rock

A group has three components:

A list of members (accounts, representing individual people)
At least one set of group egress keys
A list of servers (actually IPs)

Servers list

The servers list is actually a list of IPs, or IP blocks. They map to your servers, network devices, or anything else with SSH capability that has an IP (on which the egress group key has been installed). Technically, this list is actually composed of 3-tuple items: remote user, remote IP (or IP block), remote port. What applies to the personal accesses also applies here: adding a server to the list doesn’t magically give access to it; the egress group public key must first be installed on it. Of course, managing the installation of these keys manually quickly becomes impractical, but you can consider these part of the configuration of the servers, hence they should be managed with whichever centralized configuration system you already use (Puppet, Chef, Ansible, /bin/cp… wait, no, strike this last one).

Members list

The members are people who can connect to any server listed in the group server list. They’ll be using the private egress group key they have access to, as members of said group. Of course, they have no way to extract this private key for their own use outside of the bastion, they can only use it through the bastion’s code logic.

Got a new team member? Just add them as a member of your group, and they instantly get access to all the group servers. Somebody leaves the company? Just delete their account on the bastion, and all the accesses are instantly gone. This is the case because all your servers should have incoming SSH sessions limited to your bastions. This way, any rogue SSH key that would have been added is no longer of any use.
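The group model described so far (a members list plus a servers list made of user/IP-block/port 3-tuples) can be sketched as a small data structure. This is an illustrative Python model, not the bastion's actual code: the class and field names are assumptions, and key handling is deliberately left out.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerEntry:
    user: str
    network: str  # single IP or IP block in CIDR notation, e.g. "10.0.0.0/24"
    port: int

    def matches(self, user: str, ip: str, port: int) -> bool:
        # A single IP is treated as a one-address block.
        net = ipaddress.ip_network(self.network, strict=False)
        return (user == self.user and port == self.port
                and ipaddress.ip_address(ip) in net)


@dataclass
class Group:
    name: str
    members: set[str] = field(default_factory=set)
    servers: list[ServerEntry] = field(default_factory=list)

    def allows(self, account: str, user: str, ip: str, port: int) -> bool:
        """True if the account is a member and the target is in the servers list."""
        if account not in self.members:
            return False
        return any(entry.matches(user, ip, port) for entry in self.servers)
```

Removing an account from `members` is enough to make `allows` return False for every server of the group, which is the "instantly gone" behaviour described above. As in the text, a True result only means the bastion would attempt the connection; the group public key still has to be installed on the remote server.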

And some more

We’ve covered the basics of the group-based approach, but as we need a lot of flexibility and delegation, there is a little more to cover. Remember when I said a group had 3 components? Well, I lied. A group has more than just members. Additional group roles include:

Guests
Gatekeepers
Aclkeepers
Owners

All of these are lists of accounts that have a specific role in the group.

First, guests. These are a bit like members, but with fewer privileges: they can connect to remote machines using the group key, but not to all the machines of the group, only to a subset. This is useful when somebody outside of the team needs a specific access to a specific server, potentially for a limited amount of time (as such accesses can be set to expire).

Then, gatekeepers. Those guys manage the list of members and guests of the group. In other words, they have the right to give the right to get access. Nothing too complicated here. Then, there are the aclkeepers. As you may have guessed, they manage the list of servers that are part of the group. If you happen to have some automation managing the provisioning of servers of your infrastructure, this role could be granted to a robot account whose sole purpose would be to update the servers list on the bastion, in a completely integrated way with your provisioning. You can even tag such accounts so that they’ll never be able to use SSH through the bastion, even if somebody grants them by mistake!

Last but not least, the owners have the highest privilege level on the group, which means they can manage the gatekeepers, aclkeepers and owners list. They are permitted to give the right to give the right to get access. Moreover, users can accumulate these roles, which means some accounts may be a member and a gatekeeper at the same time, for example.
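The chain of delegation between group roles can be expressed as a small permission table. The following Python sketch follows the description above (gatekeepers manage members and guests, aclkeepers manage the servers list, owners manage gatekeepers, aclkeepers and owners); the class and method names are hypothetical and this is not how the bastion is implemented.

```python
from enum import Enum


class Role(Enum):
    MEMBER = "member"
    GUEST = "guest"
    GATEKEEPER = "gatekeeper"
    ACLKEEPER = "aclkeeper"
    OWNER = "owner"


# Which role lists each role may edit, following the description above.
CAN_MANAGE = {
    Role.GATEKEEPER: {Role.MEMBER, Role.GUEST},
    Role.OWNER: {Role.GATEKEEPER, Role.ACLKEEPER, Role.OWNER},
}


class GroupRoles:
    def __init__(self, first_owner):
        self.holders = {role: set() for role in Role}
        self.holders[Role.OWNER].add(first_owner)

    def roles_of(self, account):
        """Roles can accumulate: an account may appear in several lists."""
        return {role for role, accounts in self.holders.items() if account in accounts}

    def grant(self, actor, target, role):
        """Add target to a role list if one of actor's roles permits it."""
        allowed = any(role in CAN_MANAGE.get(r, set()) for r in self.roles_of(actor))
        if not allowed:
            raise PermissionError(f"{actor} may not grant {role.value}")
        self.holders[role].add(target)

    def can_edit_servers(self, account):
        return Role.ACLKEEPER in self.roles_of(account)
```

In this model an owner who also wants to edit the servers list simply grants themselves the aclkeeper role, which is one way to picture roles being accumulated on a single account.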

Global roles – Come Get Some

Beyond the roles we have just described – which are all scoped to a group – there are two additional roles, which are scoped to the whole bastion: the ‘superowner’ and the ‘bastion admin’.

In a nutshell, a superowner is the implicit owner of all groups present on the bastion. This comes in handy if the group becomes ownerless, as superowners are able to nominate a brand new owner. See where I’m going? Superowners are permitted to give the right to give the right to give the right to get access.

Dizzy yet? Now, for the most powerful role: the bastion admin. This role should only be given to a few individuals, as they can impersonate anyone (even if, of course, when they do, this is logged, and makes our SIEM go red), and in practice should not be given to anyone who is not already root on the bastion’s operating system itself. Among other things, they manage the configuration of the bastion, where the superowners are declared. Hold your breath. Ready? They are permitted to give the right to give the right to give the right to give the right to get access. This is why delegation is at the core of the system: everybody has their own set of responsibilities, and potential action, without having to ask the bastion admin.

Wrapping up

All the access management concepts we’ve talked about are mapped to actual commands. These can be run on the bastion after the user has authenticated himself (the famous ingress connection). They’re called osh commands in bastion jargon. There are no egress connections in this case, as these commands interact with the bastion itself:

As you may notice in the above screenshot, the version of the bastion software seems to be very close to 3.00.00! Perhaps, an interesting milestone is coming up?

In the next part of this blog series, we will dig into some implementation details of one of those osh plugins and, more precisely, into our security and defensive programming approach.

Selfheal at Webhosting – The external part

Florian Chardin — Fri, 17 Jul 2020 12:34:38 +0000

Introduction

With almost 6,000,000 websites hosted on more than 15,000 servers, the OVHcloud Webhosting SRE team manage lots of alerts during their working day.

Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally.

We need, therefore, some tools to help us. In our team, we call it the selfheal.

What is the selfheal?

The selfheal refers to the automation of alert solving in our production environments. The automated process is able to fix well-known issues, with no admin interaction.

Why do we need it?

We must limit the time we spend solving alerts as far as possible, not only so that we have time to run and maintain the infrastructure, but also so that we can stay up to date.

With the number of servers we manage, a small issue can represent dozens of alerts.

We need to be efficient by automating as many production chores as possible.

Hardware

Serving billions of HTTP requests each day requires a lot of resources, which is why we often use physical servers in our datacenters.

Even a single physical server requires a lot of follow-up. It takes a lot of time to diagnose, schedule downtime, request and manage an intervention with datacenter teams, or even to reinstall the operating system when a disk is faulty.

We cannot afford to spend hours on repetitive tasks when they can be automated.

Software

Even if software seems predictable, it will still encounter failure. This is especially true when managing the underlying infrastructure that hosts millions of lines of unknown code provided by our clients.

While we try to have a stable software stack, we cannot predict all behaviour. Many of the software problems can be solved with a restart or a quick fix, and lots of these operations can also be automated.

We should alert the on-call admin staff as little as possible, only when it’s absolutely necessary.

The idea is to log each action done by the selfheal to identify bug or error patterns and then work on longer term fixes.

The selfheal at Webhosting

At Webhosting we split selfheal into two parts:

External selfheal, which handles hardware problems or anything that cannot be solved by the host itself.
Internal selfheal, which is intended to solve software problems on a given system.

In this article, we will discuss the external part.

External selfheal

Context:

As we said earlier, the external part of our selfheal is mainly intended to solve hardware problems that cannot be solved by the server alone.

To accomplish this, we created a small micro-service application that listens to monitoring events.

We could have chosen an existing tool (like StackStorm), but we didn’t. Here’s why:

Building micro-services is really simple and fast at OVH.
Structured, detailed and simple logs with a unique UUID to follow each selfheal task in our internal logging system (which allows us to graph them easily).
Simple integration with our existing tools and ecosystem
Fast and easy deployment in all our regions
Simple CI/CD (unit-testing, etc)
Custom notifications, like chat-bot
Intelligence and history

How it works

Everything starts with our monitoring, which scrapes the servers’ probes and sends all alerts to a Kafka topic.

The application consumes Kafka events and then reacts instantly with the correct workflow, depending on the problem.

It does this by making the appropriate API calls to our different services and tools.

All actions performed are stored. This prevents the same fix from being applied several times on a given server, and helps us identify complex problems.
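The event-to-workflow dispatch described above can be sketched as follows. This is a simplified Python illustration, not the actual micro-service: events are plain dictionaries standing in for messages consumed from the Kafka topic, the alert names are hypothetical, and escalating to an admin when the same fix has already been applied on a host is an assumed policy.

```python
import uuid


class SelfHeal:
    def __init__(self):
        self.workflows = {}  # alert type -> callable taking the host name
        self.history = []    # stored actions: task id, host and alert type

    def register(self, alert_type, workflow):
        self.workflows[alert_type] = workflow

    def handle(self, event):
        """Process one monitoring event shaped like {"host": ..., "alert": ...}."""
        host, alert = event["host"], event["alert"]
        if any(h["host"] == host and h["alert"] == alert for h in self.history):
            return "escalate"  # same fix already applied: likely a deeper problem
        workflow = self.workflows.get(alert)
        if workflow is None:
            return "escalate"  # unknown issue: leave it to an admin
        task_id = str(uuid.uuid4())  # unique ID to follow the task in the logs
        workflow(host)
        self.history.append({"task": task_id, "host": host, "alert": alert})
        return "fixed"
```

Registering one callable per alert type keeps each workflow small and testable, and the stored history is what makes repeated alerts on the same host visible.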

Concrete example on faulty disk replacement

One of the most time-consuming alerts we’ve had to solve was the replacement of unhealthy HDDs found by SMART checks.

Being stateless, lots of our servers use a single disk with no RAID setup. This also means that replacing a disk requires reinstalling the host; fortunately, this can be done with a single API call.

To manage this alert, an admin had to do the following actions:

Put the server in maintenance to drain client requests
Create a datacenter request to replace the HDD
Reinstall the server

This whole process can take up to 3 hours and is hard to execute manually, especially when managing several issues at once.

The first thing we did was to automate the check with a probe.

Then, we decided to automate the whole thing with a simple workflow in our self-healing application, which orchestrates the API calls.

With this process, we are able to replace disks every day without any manual tasks performed by an admin.
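The three manual steps listed above map naturally onto a sequential workflow. Below is a minimal Python sketch; the `FakeApi` class and its method names are stand-ins invented for the example, since the internal services are not described here, and a real workflow would also have to wait for the datacenter intervention to finish before reinstalling.

```python
class FakeApi:
    """Stand-in for internal services; each method returns True on success."""

    def __init__(self, failing=()):
        self.failing = set(failing)
        self.calls = []

    def _do(self, name, host):
        self.calls.append((name, host))
        return name not in self.failing

    def set_maintenance(self, host):
        return self._do("maintenance", host)

    def request_disk_swap(self, host):
        return self._do("datacenter_request", host)

    def reinstall(self, host):
        return self._do("reinstall", host)


def replace_faulty_disk(host, api):
    """Run the three manual steps in order; stop at the first failure."""
    steps = [
        ("maintenance", api.set_maintenance),            # drain client requests
        ("datacenter_request", api.request_disk_swap),   # ask for the HDD replacement
        ("reinstall", api.reinstall),                    # reinstall the stateless host
    ]
    done = []
    for name, call in steps:
        if not call(host):
            return {"host": host, "status": "failed", "failed_step": name, "done": done}
        done.append(name)
    return {"host": host, "status": "ok", "done": done}
```

Stopping at the first failed step and reporting which steps completed gives an admin a clear starting point for the cases that automation cannot finish.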

To conclude

Last month, our external selfheal tool requested more than 70 interventions from datacenter teams, which represents a significant time saving.

We have also gained in reactivity: there is no more lag between the time an alert is detected and when it’s handled.

Alerts are handled instantly when detected by the monitoring system. It helps us to keep a clean monitoring backlog and to avoid “batches” of alert solving, which were complicated for both us and the datacenter teams.

Now, we just handle alerts that cannot be solved through automation and focus on corner cases, where admin interactions are valuable and required.