SRE Archives - OVHcloud Blog

Warden: the self-healing framework for local actions

Alexandre Gauthier — Wed, 09 Dec 2020 11:23:14 +0000

This article is the follow-up to Selfheal at Webhosting – The External Part, published on 2020-07-17.
Part two below covers the local self-healing system.

Introduction

With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data centers and geographical zones, a certain number of software failures is inevitable. They must be handled to ensure the servers remain in a functional state and provide continuity of service.

The overhead only increases once you account for the supporting pieces of infrastructure that provide the service, or that clients use to access and manage their data.

Generally speaking, restarting failed services and reacting to failing health checks with automatic operations can be done swiftly with a simple install of Monit, for example, or with systemd unit parameters.

Web-hosting infrastructure, however, poses unique challenges that require a holistic response.

It’s not only large, but also distributed and highly available. A web host encountering a failure will not degrade the service, as another node in the cluster will immediately take its place to service client requests.

Additionally, providing Shared Hosting as a service means you are mostly running Unknown Workloads. No two websites have the same requirements, performance, or behavior. You therefore can’t make assumptions about what is normal and what isn’t, which in turn makes establishing a baseline for Abnormal Behavior difficult.

In this context, it is generally an inevitable fact of life that sometimes those workloads will misbehave, crash, or put the system into a state it cannot recover from without intervention.

Trying to prevent this is therefore futile. Facilitating recovery within isolated fault domains is a more productive approach and is where self-healing becomes useful.

Self-healing systems

While the highly available nature of the infrastructure means failure states don’t necessarily degrade the service, the cause still needs to be investigated and the system recovered before it is returned to the pool of hosts available to serve requests.

Without automated systems in place to achieve this, it can easily turn into a battle of attrition. Systems waiting to be diagnosed and cleared can pile up and eat into the time available for improvements and long-term mitigation of failure states.

We therefore employ two self-healing systems at Webhosting to automate the process:

Healer: External self-healing, which handles hardware problems, the absence of connectivity, and anything the local systems can’t resolve locally.
Warden: A local agent that exposes a framework for self-healing on local nodes. Warden is the component we will be exploring today.

Enter Warden

Warden was designed as a simple, lightweight daemon process that exposes a plugin API, allowing members of the SRE team to quickly write small pluggable Python scripts that handle specific conditions found on the local system. It is meant to exist as an agent on every single server of the web-hosting fleet, where it works to maintain integrity and record information about failure states.

Goals

Warden has a few specific long-term goals, which are worth going over.

Maximize system availability

Warden attempts to detect scenarios that would degrade or otherwise disrupt the service, and responds to fault events from the monitoring system. This allows the system to return quickly to a functional, clean state, so it can rejoin the pool of available hosts and serve requests again. Being a local, per-server process, Warden is able to be reactive and process events in a timely fashion, avoiding network round trips and monitoring delays. This contributes to the general health of the infrastructure by keeping the number of hosts in a failure state to a bare minimum.

Log diagnostic data for later analysis

Being a local agent present on every system, Warden is in the enviable position of being able to collect all sorts of surrounding data for export upon detecting a failure state.

Warden keeps a detailed record of the failure state and surrounding system state, to be queried later. This ensures diagnosis is not a blocking point for returning the host to duty. It is also important to remember the goal is not to sweep failure states under the carpet, or mask them.

Additionally, since many of these failure states are non-critical (as other hosts take over transparently), it may be multiple days by the time someone gets to look at it, at which point the relevant state to inspect is long gone, and we’re just left with an empty, yet offline server.

The primary goal here is actually to increase visibility into failure states, and to be able to quickly identify trends and underlying issues that must be mitigated or resolved, while ensuring the relevant data is kept while fresh.

At runtime, Warden generates snapshots of interesting system aspects. A long-term goal is to capture a meaningful representation of the entire system state at the time of the event, removing the need to perform diagnostics directly on affected hosts.

Minimise human overhead

Analysis of failure states can be highly time-consuming, especially if you’re flooded by hundreds of systems reporting mostly the same issue. It can also be irritating to constantly deal with transient failure states that are considered “normal”, either due to known bugs in popular applications, or other known circumstances. Just sorting the signal from the noise can be a full-time job, especially if your team is actively trying to maintain general health and resolve the issue long term.

This can quickly turn into a battle of attrition where resources are expended on managing alerts, failure states and problems rather than on actively working to mitigate and resolve them.

Warden aims to streamline this process massively, allowing SREs to focus on what actually matters and makes a difference in terms of Quality of Service.

Make writing self-healing plugins easy

The Warden API is meant to be simple. It abstracts away much of the nuts and bolts of scheduling and execution.

Plugin authors should not have to worry about scheduling their own run, or writing complex logic to obtain the information they are after, nor should they have to write solid logging code.

All of this should be handled by Warden. Plugin authors should be able to focus on describing their conditions, selecting what relevant data they want to record, and writing an action that hopefully restores functionality.

How does it work?

Warden Core

As previously mentioned, Warden is a small daemon written entirely in Python. On boot, it will enumerate the plugins it is configured to activate, and place them in a queue.

Plugins may have configuration values as well, exposing easily tunable thresholds for response, or other settings. The Warden Core essentially serves to orchestrate everything, as well as provide the plugin API.

It also keeps track of various internal decisions, plugin states and how many times a plugin has done a self-healing action.

Then, once booted, the main workflow starts.

State Collection

Warden immediately goes and collects system states from its available sources. This could be, for example, a monitoring probe sink – which can be queried remotely as well as locally – or a snapshot of the process table.

Some deeper information is also generated, on demand, to keep the system load as light as possible.

This information is then sent to plugins matching the type of state collector. For example, plugins that operate on the process table will be gently fed this information.
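The routing described above can be sketched as follows. This is a minimal illustration, not Warden's actual code: the `COLLECTORS` registry, the `source` attribute, and the `dispatch` function are hypothetical names, and the collectors return static data so the example runs anywhere. Each state is collected only when a plugin asks for it, which is one way to keep the system load light.

```python
from typing import Any, Callable, Dict, List


def collect_process_table() -> List[Dict[str, Any]]:
    # A real collector would inspect the live process table;
    # static data keeps this sketch runnable anywhere.
    return [{"pid": 1, "name": "init"}, {"pid": 4242, "name": "php-fpm"}]


def collect_alert_sink() -> List[Dict[str, Any]]:
    # Stand-in for querying the local monitoring probe sink.
    return [{"probe": "hosting_stack", "status": "critical"}]


# Hypothetical registry mapping a state type to its collector.
COLLECTORS: Dict[str, Callable[[], Any]] = {
    "process_table": collect_process_table,
    "alert_sink": collect_alert_sink,
}


class ProcessPlugin:
    source = "process_table"  # the state type this plugin operates on

    def __init__(self) -> None:
        self.received: Any = None

    def scan(self, state: Any) -> None:
        self.received = state


def dispatch(plugins: List[Any]) -> Dict[str, Any]:
    """Collect each needed state once, then feed it to matching plugins."""
    cache: Dict[str, Any] = {}
    for plugin in plugins:
        if plugin.source not in cache:
            cache[plugin.source] = COLLECTORS[plugin.source]()
        plugin.scan(cache[plugin.source])
    return cache
```

With only `ProcessPlugin` enabled, the alert sink collector is never called, since no plugin declared it as a source.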

Plugin hand off

A Warden plugin consists of essentially three primary callbacks, which should be easy to implement.

Plugins are encouraged to terminate early if they do not find actionable items in the system state.

Scan Phase

In this phase, a Warden plugin will receive information about the system state, in a form it can easily digest, using standard Python data structures.
The plugin can select some particular pieces of information it would like to further analyze, if necessary.

If an event is detected that the plugin can respond to immediately, it is recorded to a Central Store (provided by our own Logs Data Platform product).

If, at this point, a self-heal action is necessary, the plugin can signal it by setting its internal state accordingly.

Analysis Phase

During this phase, the plugin will further dissect the received state, and/or collect more information about the system, either requesting it from Warden or collecting it itself.

This is where the diagnostic information will be exported to a Central Store, alongside a plethora of useful metadata (where, when, who, how).

At this point, if not already signaled by the previous phase, the plugin can mark its internal state as requiring an action.

Heal Phase

Warden will then check the internal state of the plugin, and if it needs to perform an action, this final phase will be executed.
This is where the logic to resolve the situation is written. Services get restarted, processes get terminated, maintenance scripts called, etc.

Success (or failure) is reported, and Warden will dutifully log the Action and its results to the Central Store.

At this point, if an action was taken, Warden will refresh the corresponding state before moving on to the next plugin in the queue.

This process is repeated at configurable intervals that can be kept short, since plugins are lightweight and exit quickly if no issue is found.
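To make the three phases concrete, here is a minimal sketch of one pass over the plugin queue. It is an illustration under assumptions, not Warden's real API: the `Plugin` base class, the `needs_action` flag, and `run_pass` are hypothetical names, and the sketch simplifies things by letting `scan` return a boolean to express the early exit.

```python
from typing import Any, Callable, Dict, List


class Plugin:
    """Hypothetical base class with the three callbacks."""

    def __init__(self) -> None:
        self.needs_action = False  # internal state checked by the core

    def scan(self, state: Any) -> bool:
        """Return True if there is something worth analysing."""
        return False

    def analyze(self, state: Any) -> None:
        """Export diagnostics; may set self.needs_action."""

    def heal(self) -> bool:
        """Attempt recovery; return True on success."""
        return True


def run_pass(plugins: List[Plugin], collect: Callable[[], Any],
             log: List[Dict[str, Any]]) -> None:
    """Run every queued plugin once against the current state."""
    state = collect()
    for plugin in plugins:
        plugin.needs_action = False
        if not plugin.scan(state):
            continue  # early exit: nothing actionable
        plugin.analyze(state)
        if plugin.needs_action:
            ok = plugin.heal()
            log.append({"plugin": type(plugin).__name__, "success": ok})
            state = collect()  # refresh the state before the next plugin


# Usage: a toy plugin that "restarts" a service held in a dict.
system = {"service_up": False}


class RestartService(Plugin):
    def scan(self, state: Any) -> bool:
        return not state["service_up"]

    def analyze(self, state: Any) -> None:
        self.needs_action = True

    def heal(self) -> bool:
        system["service_up"] = True
        return True


actions: List[Dict[str, Any]] = []
run_pass([RestartService()], lambda: dict(system), actions)
```

A daemon would call `run_pass` in a loop, sleeping for the configured interval between passes; a second pass over the healed system exits at the scan step and logs nothing.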

Dashboards and Visibility

Extensive Grafana dashboards as well as Graylog interfaces have been built to closely monitor everything Warden does.

They simply query the Central Store where every single system reports its events and actions.

We can tell, for example, how frequently a specific self-heal is triggered, on how many systems, and where it occurs the most.

We can also easily tell where self-heals fail the most, between individual failure domains, or down to individual systems within a cluster.
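The kind of questions these dashboards answer can be sketched over an in-memory list of events. This is only an illustration: the real dashboards query the Central Store through Grafana and Graylog, and the record fields used here (`plugin`, `cluster`, `host`, `success`) are assumed for the example.

```python
from collections import Counter
from typing import Any, Dict, List

Event = Dict[str, Any]

# Hypothetical event records, shaped like what a plugin might export.
EVENTS: List[Event] = [
    {"plugin": "stack_restart", "cluster": "c1", "host": "web1", "success": True},
    {"plugin": "stack_restart", "cluster": "c1", "host": "web2", "success": False},
    {"plugin": "stack_restart", "cluster": "c2", "host": "web9", "success": True},
    {"plugin": "tmp_cleanup", "cluster": "c1", "host": "web1", "success": True},
]


def heal_counts(events: List[Event]) -> Counter:
    """How often each self-heal was triggered."""
    return Counter(e["plugin"] for e in events)


def affected_hosts(events: List[Event], plugin: str) -> int:
    """How many distinct systems triggered a given self-heal."""
    return len({e["host"] for e in events if e["plugin"] == plugin})


def failures_by_cluster(events: List[Event]) -> Counter:
    """Where self-heals fail the most."""
    return Counter(e["cluster"] for e in events if not e["success"])
```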

They are made to be easy to drill down into, to get a bird’s-eye view of the global state as well as a detailed view of the exact actions taken by a single plugin.

Keeping this up on a TV monitor in the office has been of incredible value when it comes to casually noticing trends, as well as identifying which problems are recurrent and which are transient.

A Practical Example

As a practical example of how Warden can be tied into existing systems to handle their events, consider a probe on our servers that verifies the availability of the hosting runtime stack, ensuring it functions and is in the correct state to process requests.

It would often raise an alarm after some specific code in our hosting stack either terminated abnormally or created a scenario where the stack was incapable of recovering on its own. This would generate an alert, mark the server as unavailable, and remove it from the active pool.

Rebooting the server or restarting the entire stack would obviously resolve the situation and return the system to the pool of available hosts, but this would rob us of the opportunity to inspect the issue. Existing metrics and logs only shed partial light on what exactly had occurred, especially since reproducing it often depends on the specific applications we host. Moreover, by the time someone got to look at it, the interesting state had most likely left the system.

In order to mitigate this, a Warden plugin was written with the following logic:

It scans the local alert sink for the failure state (exiting if it is not present)
During the analysis phase, crash dumps are collected, the filesystem state is recorded, relevant logs are extracted.
The exact version of the hosting stack is also collected, alongside everything relevant.
This is then sent to the Central Store alongside information about the host, the site, and timestamps.
The plugin then marks itself as needing to take action.
With everything relevant collected, the hosting stack is destroyed, cleaned, and relaunched.
Afterwards, the probe that raised the alert is refreshed. Congratulations, the system is now back online, and in a matter of minutes!
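The steps above could look roughly like the following plugin. This is a hedged sketch rather than the actual plugin: the class name, the `export` and `relaunch` callables, and the record fields are all hypothetical, and the collection of crash dumps and filesystem state is left out.

```python
import time
from typing import Any, Callable, Dict, List


class HostingStackPlugin:
    """Hypothetical plugin mirroring the listed steps."""

    def __init__(self, export: Callable[[Dict[str, Any]], None],
                 relaunch: Callable[[], bool]) -> None:
        self.export = export        # sends a record to the central store
        self.relaunch = relaunch    # destroys, cleans, and relaunches the stack
        self.needs_action = False

    def scan(self, alerts: List[Dict[str, Any]]) -> bool:
        # Exit early when the probe is not reporting the failure state.
        return any(a["probe"] == "hosting_stack" and a["status"] == "critical"
                   for a in alerts)

    def analyze(self, host: str, stack_version: str, logs: List[str]) -> None:
        # A real plugin would also gather crash dumps and filesystem state.
        self.export({
            "host": host,
            "stack_version": stack_version,
            "logs": logs,
            "timestamp": time.time(),
        })
        self.needs_action = True

    def heal(self) -> bool:
        return self.relaunch()


# Usage with in-memory stand-ins for the central store and the relaunch step.
store: List[Dict[str, Any]] = []
plugin = HostingStackPlugin(export=store.append, relaunch=lambda: True)
alerts = [{"probe": "hosting_stack", "status": "critical"}]
healed = False
if plugin.scan(alerts):
    plugin.analyze("web1", "1.2.3", ["worker terminated abnormally"])
    if plugin.needs_action:
        healed = plugin.heal()
```

Passing `export` and `relaunch` in as callables keeps the plugin easy to test without a real store or a real hosting stack.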

The turnaround time for writing the plugin was also reasonably short, and was deemed complete in two iterations (mostly to collect more data).

This information helped our developers pinpoint exactly what was happening, and it continues to be a solid metric for gauging the health of our infrastructure.

In Conclusion

So far, Warden has not only helped lower the amount of human resources expended on diagnosing and resolving issues, but has also led to targeted improvements to various components of our stack.

It has also identified issues that would otherwise have gone unnoticed simply by graphing a visual trend of certain non-fatal states, which has led to more fixes and improvements.

On-call duty cycles have also been noticeably more peaceful, as the barrier to automating the resolution of simple issues has been significantly lowered.

It has generally allowed us to better focus our energy where we are able to make a difference, and through further improvements, will hopefully continue to do so.

Selfheal at Webhosting – The external part

Florian Chardin — Fri, 17 Jul 2020 12:34:38 +0000

Introduction

With almost 6,000,000 websites hosted on more than 15,000 servers, the OVHcloud Webhosting SRE team manages lots of alerts during the working day.

Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally.

We need, therefore, some tools to help us. In our team, we call it the selfheal.

What is the selfheal?

The selfheal refers to the automation of alert solving in our production environments. The automated process is able to fix well-known issues, with no admin interaction.

Why do we need it?

We must limit the time we spend solving alerts as far as possible, not only so that we have time to run and maintain the infrastructure, but also so that we can stay up to date.

With the number of servers we manage, a small issue can represent dozens of alerts.

We need to be efficient by automating as many production chores as possible.

Hardware

Serving billions of HTTP requests each day requires a lot of resources, which is why we often use physical servers in our datacenters.

Even a single physical server requires a lot of follow-up. It takes a lot of time to diagnose, schedule downtime, request and manage an intervention with datacenter teams, or even to reinstall the operating system when a disk is faulty.

We cannot afford to spend hours on repetitive tasks when they can be automated.

Software

Even if software seems predictable, it will still encounter failures. This is true even when managing the underlying infrastructure that hosts millions of lines of unknown code provided by our clients.

While we try to have a stable software stack, we cannot predict all behaviour. Many of the software problems can be solved with a restart or a quick fix, and lots of these operations can also be automated.

We should alert the on-call admin staff as little as possible, only when it’s absolutely necessary.

The idea is to log each action done by the selfheal to identify bug or error patterns and then work on longer-term fixes.

The selfheal at Webhosting

At Webhosting we split selfheal into two parts:

External selfheal, which handles hardware problems or anything that cannot be solved by the host itself.
Internal selfheal, which is intended to solve software problems on a given system.

In this article, we will discuss the external part.

External selfheal

Context:

As we said earlier, the external part of our selfheal is mainly intended to solve hardware problems that cannot be solved by the server alone.

To accomplish this, we created a small micro-service application that listens for monitoring events.

We could have chosen an existing tool (like StackStorm), but we didn’t. Here’s why:

Building micro-services is really simple and fast at OVH.
Structured, detailed and simple logs with a unique UUID to follow each selfheal task in our internal logging system (which allows us to graph them easily).
Simple integration with our existing tools and ecosystem
Fast and easy deployment in all our regions
Simple CI/CD (unit-testing, etc)
Custom notifications, like chat-bot
Intelligence and history

How it works

Everything starts with our monitoring, which scrapes the server probes and sends all alerts to a Kafka topic.

The application consumes the Kafka events and reacts instantly with the appropriate workflow, depending on the alert it gets. It does this by performing the correct API calls to our different services and tools.

All actions performed are stored. This prevents having to do the same fix several times on a given server, and helps to identify complex problems.
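A minimal sketch of this dispatch logic follows. It makes several assumptions: alerts are plain dictionaries rather than messages consumed from Kafka, the alert types and workflows in `WORKFLOWS` are invented for illustration, and the history is an in-memory list where the real application would use persistent storage.

```python
import uuid
from typing import Any, Callable, Dict, List, Optional

Alert = Dict[str, Any]

# Hypothetical workflows; real ones would call internal APIs.
WORKFLOWS: Dict[str, Callable[[Alert], str]] = {
    "smart_failure": lambda a: f"disk replacement started on {a['host']}",
    "host_unreachable": lambda a: f"hard reboot requested for {a['host']}",
}


class Selfheal:
    def __init__(self) -> None:
        self.history: List[Dict[str, Any]] = []  # every action performed

    def handle(self, alert: Alert) -> Optional[str]:
        """Run the matching workflow; return its task id, or None if skipped."""
        workflow = WORKFLOWS.get(alert["type"])
        if workflow is None:
            return None  # no automation known: leave it to a human
        key = (alert["host"], alert["type"])
        if any((h["host"], h["type"]) == key for h in self.history):
            return None  # same fix already applied: likely a deeper problem
        task_id = str(uuid.uuid4())  # unique id to follow the task in logs
        self.history.append({
            "task_id": task_id,
            "host": alert["host"],
            "type": alert["type"],
            "result": workflow(alert),
        })
        return task_id
```

Calling `handle` twice with the same host and alert type runs the workflow only once; the second call returns `None`, which a real system could treat as a signal to escalate.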

Concrete example on faulty disk replacement

One of the most time-consuming alerts we’ve had to solve was the replacement of unhealthy HDDs found by SMART checks.

Being stateless, lots of our servers use a single disk with no RAID setup. This also means that replacing a disk requires reinstalling the host; fortunately, this can be done with a single API call.

To manage this alert, an admin had to do the following actions:

Put the server in maintenance to drain client requests
Create a datacenter request to replace the HDD
Reinstall the server

This whole process can take up to 3 hours and is hard to execute manually, especially when managing several issues at once.

The first thing we did was to automate the check with a probe.

Then, we decided to automate the whole thing with a simple workflow in our self-healing application, which orchestrates the API calls.

With this process, we are able to replace disks every day without any manual tasks performed by an admin.
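The three manual steps translate naturally into a small workflow. The sketch below is an assumption-laden illustration: `InfraAPI` and its method names are hypothetical and do not correspond to any real OVHcloud API, and a real workflow would wait for the datacenter intervention to complete before reinstalling.

```python
from typing import List, Protocol, Tuple


class InfraAPI(Protocol):
    """Hypothetical client interface; method names are illustrative."""

    def set_maintenance(self, host: str) -> None: ...
    def request_disk_replacement(self, host: str, disk: str) -> str: ...
    def reinstall(self, host: str) -> None: ...


class RecordingAPI:
    """Test double that records calls instead of touching real services."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def set_maintenance(self, host: str) -> None:
        self.calls.append(("set_maintenance", host))

    def request_disk_replacement(self, host: str, disk: str) -> str:
        self.calls.append(("request_disk_replacement", host, disk))
        return f"ticket-{len(self.calls)}"

    def reinstall(self, host: str) -> None:
        self.calls.append(("reinstall", host))


def replace_faulty_disk(api: InfraAPI, host: str, disk: str) -> str:
    """Run the three steps in order and return the intervention ticket."""
    api.set_maintenance(host)                          # drain client requests
    ticket = api.request_disk_replacement(host, disk)  # datacenter request
    # A real workflow would wait here until the intervention is done.
    api.reinstall(host)                                # single reinstall call
    return ticket
```

Calling `replace_faulty_disk(RecordingAPI(), "web1", "sda")` exercises the workflow without contacting any real service, which is also how such a workflow could be unit-tested.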

To conclude

Last month, our external selfheal tool requested more than 70 interventions from the datacenter teams, which represents a big time saving.

We gained in reactivity. There is no more lag between the time an alert is detected and the time it is handled.

Alerts are handled instantly when detected by the monitoring system. This helps us keep a clean monitoring backlog and avoid “batches” of alert solving, which were complicated for both us and the datacenter teams.

Now, we just handle alerts that cannot be solved through automation and focus on corner cases, where admin interactions are valuable and required.