<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Automation Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/automation/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/automation/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 01 Mar 2022 15:49:38 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Automation Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/automation/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Warden: the self-healing framework for local actions</title>
		<link>https://blog.ovhcloud.com/warden-the-self-healing-framework-for-local-actions/</link>
		
		<dc:creator><![CDATA[Alexandre Gauthier]]></dc:creator>
		<pubDate>Wed, 09 Dec 2020 11:23:14 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Self-heal at Webhosting]]></category>
		<category><![CDATA[SRE]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19951</guid>

					<description><![CDATA[This article is the follow up to Selfheal at Webhosting &#8211; The External Part published on 2020-07-17. Part two below covers the local self-healing system. Introduction With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data-centers and geographical zones, a certain number of software failures [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwarden-the-self-healing-framework-for-local-actions%2F&amp;action_name=Warden%3A%20the%20self-healing%20framework%20for%20local%20actions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>This article is the follow up to <a href="https://www.ovh.com/blog/selfheal-at-webhosting-the-external-part/" data-wpel-link="exclude">Selfheal at Webhosting &#8211; The External Part</a> published on 2020-07-17.<br>Part two below covers the local self-healing system.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/11/IMG_0392-1024x537.png" alt="Warden: the self-healing framework for local actions" class="wp-image-20157" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392.png 1200w" sizes="(max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Introduction">Introduction</h3>



<p>With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data-centers and geographical zones, a certain number of software failures is inevitable. They must be handled to ensure the servers are in a functional state to provide continuity of service.</p>



<p>The overhead only increases once you account for the supporting pieces of the infrastructure that provide the service, or that are used by clients to access and manage their data.</p>



<p>Generally speaking, restarting failed services and reacting to failing health checks with automatic operations can be done swiftly with a simple installation of, for example, <em>Monit</em>, or with <em>systemd</em> unit parameters.</p>
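<p>As an illustration, a minimal <em>systemd</em> drop-in override of this kind might look like the following (the values are purely illustrative):</p>

```ini
# Drop-in override for a service unit (values illustrative):
# restart the service on failure, but give up if it crash-loops.
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
```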



<p>Web-hosting infrastructure, however, poses unique challenges that require a holistic response.</p>



<p>It&#8217;s not only large, but it&#8217;s distributed and&nbsp;<a href="https://www.ovh.com/blog/web-hosting-how-to-host-3-million-websites/" data-wpel-link="exclude">highly available</a>. A web host encountering a failure will not degrade the service, as another node in a cluster will immediately take its place to service client requests.</p>



<p>Additionally, providing Shared Hosting as a service means you are mostly running Unknown Workloads. No two websites have the same requirements, performance, or behavior. You therefore can&#8217;t make assumptions about what is normal and what isn&#8217;t, which in turn makes establishing a baseline for Abnormal Behavior difficult.</p>



<p>In this context, it is generally&nbsp;an inevitable fact of life that sometimes those workloads will misbehave, crash, or put the system into a state it cannot recover from without intervention.</p>



<p>Trying to prevent this is therefore futile. Facilitating recovery within isolated fault domains is a more productive approach and is where self-healing becomes useful. </p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Self-healingsystems">Self-healing systems</h3>



<p>While the highly available nature of the infrastructure means failure states don&#8217;t necessarily degrade the service &#8211; the cause still needs to be investigated and the system recovered before being returned to the pool of available hosts to serve requests.</p>



<p>Without automated systems in place to achieve this, it can easily turn into a battle of attrition. Systems to diagnose and clear can pile up and eat into actual time spent on improvements and long-term mitigation of failure states.</p>



<p>We therefore employ two self-healing systems at Webhosting to automate the process:</p>



<ul class="wp-block-list"><li><a href="https://www.ovh.com/blog/selfheal-at-webhosting-the-external-part/" data-wpel-link="exclude">Healer: External self-healing</a>, which handles hardware problems, the absence of connectivity, and anything the local systems can&#8217;t resolve on their own.</li><li>Warden: A local agent that exposes a framework for self-healing on local nodes. Warden is the component we will be exploring today.</li></ul>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-EnterWarden">Enter Warden</h3>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0393.png" alt="" class="wp-image-20161" width="302" height="236" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0393.png 604w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0393-300x234.png 300w" sizes="(max-width: 302px) 100vw, 302px" /></figure></div>



<p>Warden was designed as a simple, lightweight daemon process that exposes a plugin API, allowing members of the SRE team to quickly write small pluggable Python scripts that handle specific conditions found on the local system. It is meant to exist as an agent on every single server of the web-hosting fleet, where it will work to maintain integrity and record information about failure states.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Goals">Goals</h3>



<p>Warden has a few specific long-term goals, which are worth going over.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-MaximizeSystemAvailability">Maximize system availability</h4>



<p>Warden attempts to detect scenarios that would degrade or otherwise disrupt the service and responds to fault events from the monitoring system. This allows for the quick return of the system to a functional, clean state, so it can rejoin the pool of available hosts and serve requests again. Being a local, per-server process, Warden is able to be reactive and process events in a timely fashion, avoiding network round trips and monitoring delays. This contributes to the general health of the infrastructure by keeping the number of hosts in a failure state at a bare minimum.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Logdiagnosticdataforlateranalysis">Log diagnostic data for later analysis</h4>



<p>Being a local agent present on every system, Warden is in the enviable position of being able to collect all sorts of surrounding data for export upon detecting a failure state.</p>



<p>Warden keeps a detailed record of the failure state and surrounding system state, to be queried later. This ensures diagnosis is not a blocking point for returning the host to duty. It is also important to remember the goal is&nbsp;<strong>not</strong>&nbsp;to sweep failure states under the carpet, or mask them.</p>



<p>Additionally, since many of these failure states are non-critical (as other hosts take over transparently), it may be&nbsp;<em>multiple days</em>&nbsp;before someone gets to look at it, at which point the relevant state to inspect is long gone, and we&#8217;re just left with an empty, yet offline, server.</p>



<p>The primary goal here is actually to increase visibility into failure states, and to be able to quickly identify trends and underlying issues that must be mitigated or resolved, while ensuring the relevant data is kept while fresh.</p>



<p>At runtime, Warden generates snapshots of interesting system aspects. A long-term goal is to capture a meaningful representation of the entire system state at the time of the event, removing the need to perform diagnostics directly on affected hosts.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-MinimizeHumanOverhead">Minimise human overhead</h4>



<p>Analysis of failure states can be highly time-consuming, especially if you&#8217;re flooded by hundreds of systems reporting mostly the same issue. It can also be irritating to constantly deal with transient failure states that are considered &#8220;normal&#8221;, either due to known popular application bugs, or other known circumstances. Just sorting the signal from the noise can be a full-time job, especially if your team is actively trying to maintain general health and resolve the issue long term.</p>



<p>This can quickly turn into a battle of attrition where resources are expended on managing alerts, failure states, and problems rather than on actively working to mitigate and resolve them.</p>



<p>Warden hopes to streamline this process massively, allowing SREs to focus on what actually matters and makes a difference in terms of Quality of Service.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Makewritingself-healingpluginseasy">Make writing self-healing plugins easy</h4>



<p>The Warden API is meant to be simple. It abstracts much of the nuts and bolts involved in plugin execution.</p>



<p>Plugin authors should not have to worry about scheduling their own run, or writing complex logic to obtain the information they are after, nor should they have to write solid logging code.</p>



<p>All of this should be handled by Warden. Plugin authors should be able to focus on describing their conditions, selecting what relevant data they want to record, and writing an action that hopefully restores functionality.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Howdoesitwork?">How does it work?</h3>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="1024" height="462" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0394-1024x462.png" alt="Warden - How does it work?" class="wp-image-20164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-1024x462.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-300x135.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-768x346.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-1536x693.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394.png 1905w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-WardenCore">Warden Core</h4>



<p>As previously mentioned, Warden is a small daemon written entirely in Python. On boot, it will enumerate the plugins it is configured to activate, and place them in a queue.</p>



<p>Plugins may have configuration values as well, exposing easily tunable thresholds for response, or other settings. The Warden Core essentially serves to orchestrate everything, as well as provide the plugin API.</p>



<p>It also keeps track of various internal decisions, plugin states and how many times a plugin has done a self-healing action.</p>
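<p>A hypothetical sketch of this boot sequence (the names below are illustrative, not the real implementation) could look like:</p>

```python
# Hypothetical sketch of a Warden-style boot sequence: enumerate the
# plugins the daemon is configured to activate and place them in a queue.
# All names here are illustrative, not Warden's actual code.
from collections import deque

def boot(config):
    """Build the plugin queue from the daemon's configuration."""
    plugin_queue = deque()
    for name, settings in config["plugins"].items():
        if settings.get("enabled", True):
            plugin_queue.append({
                "name": name,
                "settings": settings,   # tunable thresholds, etc.
                "heal_count": 0,        # how many self-heal actions so far
            })
    return plugin_queue
```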



<p>Then, once booted, the main workflow starts.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-StateCollection">State Collection</h4>



<p>Warden immediately goes and collects system states from its available sources. This could be, for example, a monitoring probe sink &#8211; which can be queried remotely as well as locally &#8211; or a snapshot of the process table.</p>



<p>Some deeper information is also generated, on demand, to keep the system load as light as possible.</p>



<p>This information is then sent to plugins matching the type of state collector. For example, plugins that operate on the process table will be gently fed this information.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Pluginhandoff">Plugin hand off</h4>



<p>A Warden plugin consists essentially of three primary callbacks, which should be easy to implement.</p>



<p>Plugins are encouraged to terminate early if they do not find actionable items in the system state.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-ScanPhase">Scan Phase</h5>



<p>In this phase, a Warden plugin will receive information about the system state, in a form it can easily digest, using standard Python data structures.<br>The plugin can select some particular pieces of information it would like to further analyze, if necessary.</p>



<p>If an event is detected that the plugin can respond to immediately, then this is recorded to a&nbsp;<strong>Central Store</strong>&nbsp;(provided by our own Logs Data Platform product).</p>



<p>If at this point, a self-heal action is necessary, the plugin can signal it by setting its internal state accordingly.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-AnalysisPhase">Analysis Phase</h5>



<p>During this phase, the plugin will further dissect the received status, and/or collect information about the system &#8211; either requesting it from Warden, or collecting it itself.</p>



<p>This is where the diagnostic information will be exported to a&nbsp;<strong>Central Store</strong>, alongside a plethora of useful metadata (where, when, who, how).</p>



<p>At this point, if not already signaled by the previous phase, the plugin can mark its internal state as requiring an action.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-HealPhase">Heal Phase</h5>



<p>Warden will then check the internal state of the plugin, and if it needs to perform an action, this final phase will be executed.<br>This is where the logic to resolve the situation is written. Services get restarted, processes get terminated, maintenance scripts called, etc.<br><br>Success (or failure) is reported, and Warden will dutifully log the Action and its results to the&nbsp;<strong>Central Store.</strong></p>



<p>At this point, if an action was taken, Warden will refresh the corresponding state before moving on to the next plugin in the queue.</p>



<p>This process is repeated at configurable intervals that can be kept short, since plugins are lightweight and exit quickly if no issue is found.</p>
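<p>Put together, the three phases above could be sketched as follows (this is a hypothetical illustration; the class and method names are not Warden&#8217;s actual API):</p>

```python
# Hypothetical sketch of a Warden-style plugin; the class and method
# names below are illustrative, not Warden's actual plugin API.

class StuckWorkerPlugin:
    """Detects and restarts a hypothetical stuck worker process."""

    def __init__(self, config):
        # Tunable threshold exposed through the plugin's configuration.
        self.max_idle_seconds = config.get("max_idle_seconds", 300)
        self.needs_action = False
        self.findings = []

    def scan(self, process_table):
        """Scan phase: terminate early if nothing is actionable."""
        self.findings = [
            p for p in process_table
            if p["name"] == "worker" and p["idle_seconds"] > self.max_idle_seconds
        ]
        self.needs_action = bool(self.findings)
        return self.needs_action

    def analyze(self):
        """Analysis phase: build the diagnostic record for the Central Store."""
        return {
            "event": "stuck_worker",
            "processes": self.findings,
            "threshold": self.max_idle_seconds,
        }

    def heal(self):
        """Heal phase: a real plugin would restart the service here."""
        self.needs_action = False
        return True  # success is reported back to the core
```

<p>The core would run the scan callback on each cycle, and only invoke the later phases when the plugin signals that an action is needed.</p>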



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-DashboardsandVisibility">Dashboards and Visibility</h4>



<p>Extensive Grafana dashboards as well as Graylog interfaces have been built to closely monitor everything Warden does.</p>



<p>They simply query the Central Store where every single system reports its events and actions.</p>



<p>We can tell, for example, how frequently a specific self-heal is triggered, on how many systems, and where it occurs the most.</p>



<p>We can also easily tell where self-heals fail the most, between individual failure domains, or down to individual systems within a cluster.</p>



<p>They are made to be easy to drill down into, offering both a bird&#8217;s-eye view of the global state and a detailed view of the exact actions taken by a single plugin.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1024x540.png" alt="" class="wp-image-20168" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1536x810.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana.png 1893w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Keeping this up on a TV monitor in the office has been of incredible value when it comes to casually noticing trends, as well as identifying which problems are recurrent and which are transient.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-APracticalExample">A Practical Example</h3>



<p>As a practical example of how Warden can be tied into existing systems to handle their events, there exists a probe on our servers that verifies the availability of the hosting runtime stack, ensuring it functions and is in the correct state to process requests.</p>



<p>It would often raise an alarm after some specific code in our hosting stack either terminated abnormally, or created a scenario where the stack was incapable of recovering on its own. This would generate an alert, mark the server as unavailable, and remove it from the active pool.</p>



<p>Rebooting the server or restarting the entire stack would obviously resolve the situation and return the system to the pool of available hosts, but this robs us of the opportunity to inspect the issue. Existing metrics and logs only shed partial light on what exactly had occurred to cause this; especially since reproducing it will often depend on specific applications we host. Not to mention that by the time someone got to look at it, the chances are that the interesting state had long left the system.</p>



<p>In order to mitigate this, a Warden plugin was written with the following logic:</p>



<ul class="wp-block-list"><li>It scans the local alert sink for the failure state (exiting if it is not present)</li><li>During the analysis phase, crash dumps are collected, the filesystem state is recorded, and relevant logs are extracted.<br>The exact version of the hosting stack is also collected, alongside everything relevant.<br>This is then sent to the Central Store alongside information about the host, the site, and timestamps.<br>The plugin then marks itself as needing to take action.</li><li>Once everything relevant has been collected, the hosting stack is destroyed, cleaned, and relaunched.</li><li>Afterwards, the probe that raised the alert is refreshed. Congratulations, the system is now back online, and in a matter of minutes!</li></ul>



<p>The turnaround time for writing the plugin was also reasonably short, and was deemed complete in two iterations (mostly to collect more data).</p>



<p>This information helped our developers pin-point exactly what was happening, and it continues to be a solid metric for gauging the health of our infrastructure.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-InConclusion">In Conclusion</h3>



<p>So far, Warden has not only lowered the amount of human effort expended on diagnosing and resolving issues, but has also generated targeted improvements to various components of our stack.</p>



<p>It has also identified issues that would otherwise have gone unnoticed simply by graphing a visual trend of certain non-fatal states, which has led to more fixes and improvements.</p>



<p>On-call duty cycles have also been noticeably more peaceful, as the bar for automating the resolution of simple issues has been significantly lowered.</p>



<p>It has generally allowed us to better focus our energy where we are able to make a difference, and through further improvements, will hopefully continue to do so.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwarden-the-self-healing-framework-for-local-actions%2F&amp;action_name=Warden%3A%20the%20self-healing%20framework%20for%20local%20actions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Selfheal at Webhosting &#8211; The external part</title>
		<link>https://blog.ovhcloud.com/selfheal-at-webhosting-the-external-part/</link>
		
		<dc:creator><![CDATA[Florian Chardin]]></dc:creator>
		<pubDate>Fri, 17 Jul 2020 12:34:38 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Self-heal at Webhosting]]></category>
		<category><![CDATA[SRE]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18726</guid>

					<description><![CDATA[Introduction With almost 6000000 websites hosted on more than 15000 servers, the OVHcloud Webhosting SRE team manage lots of alerts during their working day. Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally. We need, therefore, some tools to help us.&#160;&#160;In our team, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fselfheal-at-webhosting-the-external-part%2F&amp;action_name=Selfheal%20at%20Webhosting%20%26%238211%3B%20The%20external%20part&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Introduction">Introduction</h3>



<p>With almost 6,000,000 websites hosted on more than 15,000 servers, the OVHcloud Webhosting SRE team manages lots of alerts during their working day.</p>



<p>Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally.</p>



<p>We need, therefore, some tools to help us.&nbsp;&nbsp;In our team, we call it the <em>selfheal</em>.&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-1024x537.jpeg" alt="Selfheal at Webhosting – The external part" class="wp-image-18848" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Whatistheselfheal?">What is the <em>selfheal</em>?</h3>



<p>The <em>selfheal </em>refers to the automation of alert solving in our production environments. The automated process is able to fix well-known issues, with no admin interaction.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Whydoweneedit?">Why do we need it?</h3>



<p>We must limit the time we spend solving alerts as far as possible. Not only so we have the time to run and maintain the infrastructure, but also to stay up to date.</p>



<p>With the number of servers we manage, a small issue can represent dozens of alerts.</p>



<p>We need to be efficient by automating as many production chores as possible.</p>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Hardware">Hardware</h4>



<p>Serving billions of HTTP requests each day requires a lot of resources, which is why we often use physical servers in our datacenters. </p>



<p>Even a single physical server requires significant follow-up. It takes a lot of time to diagnose, schedule downtime, request and manage an intervention with datacenter teams, or even to reinstall the operating system when a disk is faulty.</p>



<p>We cannot afford to spend hours on repetitive tasks when they can be automated.</p>



<h4 class="wp-block-heading">Software</h4>



<p>Even if software seems predictable, it will still encounter failure. This is true even when managing the underlying infrastructure&nbsp;that hosts millions of lines of unknown code provided by our clients.</p>



<p>While we try to have a stable software stack, we cannot predict all behaviour. Many of the software problems can be solved with a restart or a quick fix, and lots of these operations can also be automated.</p>



<p>We should alert the on-call admin staff as little as possible, only when it&#8217;s absolutely necessary.</p>



<p>The idea is to log each action done by the <em>selfheal</em> to identify bug or error patterns and then work on longer-term fixes.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Theselfhealatwebhosting">The <em>selfheal </em>at Webhosting</h3>



<p>At Webhosting, we split the <em>selfheal</em> into two parts:</p>



<ol class="wp-block-list"><li>External <em>selfheal</em>, which handles hardware problems or anything that cannot be solved by the host itself.</li><li>Internal <em>selfheal</em>, which is intended to solve software problems on a given system.</li></ol>



<p>In this article, we will discuss the external part.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Externalselfheal">External <em>selfheal</em></h3>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Context">Context:</h4>



<p>As we said earlier, the external part of our <em>selfheal </em>is mainly intended to solve hardware problems that cannot be solved by the server alone.</p>



<p>To accomplish this, we created a small micro-service application that listens for monitoring events.</p>



<p>We could have chosen an existing tool (like StackStorm), but we didn&#8217;t. Here&#8217;s why:</p>



<ul class="wp-block-list"><li>Building micro-services is really simple and fast at OVH.</li><li>Structured, detailed and simple logs with a unique UUID to follow each selfheal task in our internal logging system (which allows us to graph them easily).&nbsp;</li><li>Simple integration with our existing tools and ecosystem&nbsp;</li><li>Fast and easy deployment in all our regions</li><li>Simple CI/CD (unit testing, etc.)</li><li>Custom notifications, like chat-bot</li><li>Intelligence and history</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="280" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1024x280.png" alt="External Selfheal" class="wp-image-18826" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1024x280.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-300x82.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-768x210.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1536x419.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Howitworks">How it works</h4>



<p>Everything starts with our monitoring, which scrapes the server probes and sends all alerts to a Kafka topic.</p>



<p>The application consumes Kafka events and then reacts instantly with the correct workflow, depending on the problem.</p>



<p>It does this by performing the correct API calls to our different services and tools.</p>



<p>All actions performed are stored. This prevents doing the same fix several times on a given server, and helps identify complex problems.</p>
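<p>The dispatch logic can be sketched as follows (the alert names and workflow steps are hypothetical; in the real service, the events arrive from the Kafka topic):</p>

```python
# Hypothetical sketch: route one monitoring event to the matching
# workflow. Alert names and workflow steps are illustrative only;
# in production the events would be consumed from the Kafka topic.
import uuid

WORKFLOWS = {
    "disk_smart_failure": ["drain_server", "create_dc_ticket", "reinstall"],
    "host_unreachable": ["check_power", "reboot_via_api"],
}

def handle_alert(alert):
    """Return the workflow task for an alert, or None for unknown alerts."""
    steps = WORKFLOWS.get(alert["type"])
    if steps is None:
        return None  # unknown alert: left for a human to investigate
    return {
        "task_id": str(uuid.uuid4()),  # unique id to follow the task in logs
        "server": alert["server"],
        "steps": steps,
    }
```

<p>The unique task id makes it easy to follow each selfheal task through the logging system, and the stored tasks provide the history used to detect repeat failures.</p>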



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Concreteexampleonfaultydiskreplacement">Concrete example on faulty disk replacement</h4>



<p>One of the top time-consuming alerts we&#8217;ve had to solve was the replacement of unhealthy HDDs found by SMART checks.</p>



<p>Being stateless, lots of our servers use a single disk with no RAID setup.&nbsp;This means that replacing a disk also requires reinstalling the host; fortunately, this can be done with a single API call.</p>



<p>To manage this alert, an admin had to do the following actions:</p>



<ol class="wp-block-list"><li>Put the server in maintenance to drain client requests</li><li>Create a datacenter request to replace the HDD</li><li>Reinstall the server</li></ol>



<p>This whole process can take up to 3 hours and is hard to execute manually (while managing several issues at once).&nbsp;</p>



<p>The first thing we did, was to automate the check with a probe.</p>



<p>Then, we decided to automate the whole thing with a simple workflow in our self-healing application, which orchestrates the API calls.</p>
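<p>A sketch of such a workflow (the <code>api</code> client and its method names are hypothetical placeholders for the internal services being called):</p>

```python
# Hypothetical sketch of the automated disk-replacement workflow.
# The `api` object and its method names are illustrative placeholders
# for the internal services being called.

def replace_faulty_disk(api, server, disk_serial):
    """Run the three admin steps as one automated workflow."""
    actions = []

    # 1. Put the server in maintenance to drain client requests.
    api.set_maintenance(server, enabled=True)
    actions.append("maintenance_on")

    # 2. Create a datacenter request to replace the HDD.
    ticket = api.create_dc_intervention(server, component="hdd",
                                        serial=disk_serial)
    actions.append("dc_ticket:" + ticket)

    # 3. Reinstall the server once the disk has been swapped.
    api.reinstall(server)
    actions.append("reinstall")

    return actions  # stored, so the same fix is not replayed
```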



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="352" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1024x352.png" alt="Internal Selfheal" class="wp-image-18827" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1024x352.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-768x264.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1536x528.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With this process, we are able to replace disks every day without any manual tasks performed by an admin.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Toconclude">To conclude</h3>



<p>Last month, our external <em>selfheal</em> tool requested more than 70 interventions from datacenter teams, which represents a significant time saving.</p>



<p>We also gained in reactivity: there is no longer any lag between the time an alert is detected and the time it&#8217;s handled.</p>



<p>Alerts are handled as soon as they are detected by the monitoring system. This helps us keep a clean monitoring backlog and avoid &#8220;batches&#8221; of alert solving, which were complicated for both us and the datacenter teams.</p>



<p>Now, we just handle alerts that cannot be solved through automation and focus on corner cases, where admin interactions are valuable and required.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fselfheal-at-webhosting-the-external-part%2F&amp;action_name=Selfheal%20at%20Webhosting%20%26%238211%3B%20The%20external%20part&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Doing BIG automation with Celery</title>
		<link>https://blog.ovhcloud.com/doing-big-automation-with-celery/</link>
		
		<dc:creator><![CDATA[Bartosz Rabiega]]></dc:creator>
		<pubDate>Fri, 06 Mar 2020 16:14:18 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[ceph]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[workflows]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17100</guid>

					<description><![CDATA[Intro TL;DR: You might want to skip the intro and jump right into “Celery &#8211; Distributed Task Queue”. Hello! I’m Bartosz Rabiega, and I’m part of the R&#38;D/DevOps teams at OVHcloud. As part of our daily work, we’re developing and maintaining the Ceph-as-a-Service project, in order to provide highly available, solid, distributed storage for various [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdoing-big-automation-with-celery%2F&amp;action_name=Doing%20BIG%20automation%20with%20Celery&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Intro</h2>



<p><strong>TL;DR</strong>: You might want to skip the intro and jump right into “Celery &#8211; Distributed Task Queue”.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-1024x537.png" alt="" class="wp-image-17420" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Hello! I’m Bartosz Rabiega, and I’m part of the R&amp;D/DevOps teams at OVHcloud. As part of our daily work, we’re developing and maintaining the Ceph-as-a-Service project, in order to provide highly available, solid, distributed storage for various applications. We’re dealing with 60PB+ of data, across 10 regions, so as you might imagine, we’ve got quite a lot of work ahead in terms of replacing broken hardware, handling natural growth, provisioning new regions and datacentres, evaluating new hardware, optimising software and hardware configurations, researching new storage solutions, and much more!</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2.png" alt="" class="wp-image-17382" width="705" height="471" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2.png 940w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2-768x513.png 768w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure></div>



<p>Because of the wide scope of our work, we need to offload as many repetitive tasks as possible. And we do that through automation.</p>



<h2 class="wp-block-heading">Automating your work</h2>



<p>To some extent, every manual process can be described as a set of actions and conditions. If we could somehow get something to automatically perform the actions and check the conditions, we would be able to automate the process, resulting in an automated workflow. Take a look at the example below, which shows some generic steps for manually replacing hardware in our project.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="291" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1024x291.png" alt="" class="wp-image-17389" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1024x291.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-300x85.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-768x218.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1536x436.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295.png 1677w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Hmm… What could help us do this automatically? Doesn’t a computer sound like a perfect fit? 🙂 There are many ways to force computers to process automated workflows, but first we need to define some building blocks (let’s call them tasks) and get them to run sequentially or in parallel (i.e. a workflow). Fortunately, there are software solutions that can help with that, among which is Celery.</p>



<h2 class="wp-block-heading">Celery &#8211; Distributed Task Queue</h2>



<p>Celery is a well-known and widely adopted piece of software that allows us to process tasks asynchronously. The description of the project on its main page (<a href="http://www.celeryproject.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">http://www.celeryproject.org/</a>) may sound a little bit enigmatic, but we can narrow down its basic functionality to something like this:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-1024x539.png" alt="" class="wp-image-17414" width="768" height="404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447.png 1294w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>Such machinery is perfectly suited to tasks like sending emails asynchronously (i.e. &#8216;fire and forget&#8217;), but it can also be used for different purposes. So what other tasks could it handle? Basically, any tasks you can implement in Python (the main Celery language)! I won’t go too much into the details, as they are available in the Celery documentation. What matters is that since we can implement any task we want, we can use that to create the building blocks for our automation.</p>



<p>There is one more important thing&#8230; Celery natively supports combining such tasks into workflows (Celery primitives: chains, groups, chords, etc.). So let’s get through some examples&#8230;</p>



<p>We&#8217;ll use the following task definition &#8211; a single task, printing <em>args</em> and <em>kwargs</em>:</p>



<pre class="wp-block-code"><code class="">from celery import Celery

# Example broker URL - point this at your own RabbitMQ/Redis instance
celery_app = Celery('automation', broker='amqp://guest@localhost//')

@celery_app.task
def noop(*args, **kwargs):
    # Task accepts any arguments and does nothing
    print(args, kwargs)
    return True</code></pre>



<p>Now we can execute the task asynchronously, using the following code:</p>



<pre class="wp-block-code"><code class="">task = noop.s(777)
task.apply_async()</code></pre>



<p>The elementary tasks can be parametrised and combined into a complex workflow using Celery methods, i.e. &#8220;chain&#8221;, &#8220;group&#8221;, and &#8220;chord&#8221;. See the examples below. In each of them, the left side shows a visual representation of a workflow, while the right side shows the code snippet that generates it. The green box is the starting point, after which the workflow execution progresses vertically.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<h4 class="wp-block-heading">Chain &#8211; a set of tasks processed sequentially</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/705AD975-048B-4E6A-8BFF-F68775C9C5C7.png" alt="" class="wp-image-17394" width="92" height="320"/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = (
    chain([noop.s(i) for i in range(3)])
)</code></pre>
</div>
</div>



<h4 class="wp-block-heading">Group &#8211; a set of tasks processed in parallel</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110.png" alt="" class="wp-image-17396" width="317" height="169" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110.png 633w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110-300x160.png 300w" sizes="auto, (max-width: 317px) 100vw, 317px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = (
    group([noop.s(i) for i in range(5)])
)</code></pre>
</div>
</div>



<h4 class="wp-block-heading">Chord &#8211; a group of tasks chained to the following task</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4.png" alt="" class="wp-image-17397" width="311" height="223" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4.png 621w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4-300x215.png 300w" sizes="auto, (max-width: 311px) 100vw, 311px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = chord(
        [noop.s(i) for i in range(5)],
        noop.s()
)

# Equivalent:
workflow = chain([
        group([noop.s(i) for i in range(5)]),
        noop.s()
])</code></pre>
</div>
</div>
</div></div>



<p>An important point: the execution of a workflow will always stop in the event of a failed task. As a result, a chain won’t be continued if some task fails in the middle of it. This gives us quite a powerful framework for implementing some neat automation, and that’s exactly what we’re using for Ceph-as-a-Service at OVHcloud! We’ve implemented lots of small, flexible, parameterisable tasks, which we combine together to reach a common goal. Here are some real-life examples of elementary tasks, used for the automatic removal of old hardware:</p>



<ul class="wp-block-list"><li>Change weight of Ceph node (used to increase/decrease the amount of data on node. Triggers data rebalance)</li><li>Set service downtime (data rebalance triggers monitoring probes, but this is expected, so set downtime for this particular monitoring entry)</li><li>Wait until Ceph is healthy (wait until the data rebalance is complete &#8211; repeating task)</li><li>Remove Ceph node from a cluster (node is empty so it can simply be uninstalled)</li><li>Send info to technicians in DC (hardware is ready to be replaced)</li><li>Add new Ceph node to a cluster (install new empty node)</li></ul>
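<li>Change weight">
<p>To make the &#8220;stop on first failure&#8221; chain semantics concrete, here is a stdlib-only sketch of such a sequence. The task names are illustrative only &#8211; they are not our real tasks, and real chains run through Celery workers rather than a local loop:</p>

```python
# Stdlib-only sketch of chaining elementary tasks with Celery-like
# "stop on first failure" semantics. Task names are illustrative only.

def run_chain(tasks, ctx):
    """Run tasks sequentially; any exception aborts the rest of the chain."""
    done = []
    for task in tasks:
        task(ctx)  # an uncaught exception here stops the chain, as in Celery
        done.append(task.__name__)
    return done

def change_node_weight(ctx):
    ctx["weight"] = 0          # drain data off the node

def set_downtime(ctx):
    ctx["downtime"] = True     # silence monitoring during the rebalance

def wait_until_healthy(ctx):
    ctx["healthy"] = True      # in reality a repeating task polling Ceph

def remove_node(ctx):
    ctx["removed"] = True      # node is empty, safe to uninstall

state = {}
print(run_chain([change_node_weight, set_downtime,
                 wait_until_healthy, remove_node], state))
```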



<p>We parametrise these tasks and tie them together, using Celery chains, groups and chords to create the desired workflow. Celery then does the rest by asynchronously executing the workflow.</p>



<h2 class="wp-block-heading">Big workflows and Celery</h2>



<p>As our infrastructure grows, so do our automated workflows, with more tasks per workflow and ever-greater workflow complexity&#8230; So what do we understand as a big workflow? A workflow consisting of 1,000-10,000 tasks. To visualise this, take a look at the following examples:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<h4 class="wp-block-heading">A few chords chained together (57 tasks in total)</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/XZWOfqmSMu68u7GcbvceB0mc8_HA_v8higDeoG08dlO5oTlRd9R98QBSlf4sMLPuiFB2RPVgM-6i7vG86jtAxMCrKSLTkt0nK4z5JSbYE4QkXF96qkXh3uSJYj1X82UUm-agBMxu" alt=""/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])</code></pre>
</div>
</div>



<h4 class="wp-block-heading">More complex graph structure built from chains and groups (23 tasks in total)</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/gUQlIa5Nmb4a5oNDbojhBtukEn--6dSxlKrn-enggXk9eCtuBvgVBTxecwAczOMghEoZ0zOtKuz0nohZTsj01QqVBxkbX8bxqyVVvYjC6B1sfrpXN8pferDSgg-RE6TB6v5SOBdL" alt=""/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class=""># | is ‘chain’ operator in celery
workflow = (
    group(
        group(
            group([noop.s() for i in range(5)]),
            chain([noop.s() for i in range(5)])
        ) |
        noop.s() |
        group([noop.s() for i in range(5)]) |
        noop.s(),
        chain([noop.s() for i in range(5)])
    ) |
    noop.s()
)</code></pre>
</div>
</div>
</div></div>



<p>As you can probably imagine, visualisations get quite big and messy when 1,000 tasks are involved! Celery is a powerful tool, and has lots of features that are well-suited for automation, but it still struggles when it comes to processing big, complex, long-running workflows. Orchestrating the execution of 10,000 tasks, with a variety of dependencies, is no trivial thing. There are several issues we encountered when our automation grew too big:</p>



<ul class="wp-block-list"><li>Memory issues during workflow building (client side)</li><li>Serialisation issues (client -&gt; Celery backend transfer)</li><li>Nondeterministic, broken execution of workflows</li><li>Memory issues in Celery workers (Celery backend)</li><li>Disappearing tasks</li><li>And more&#8230;</li></ul>



<p>Take a look at some GitHub tickets:</p>



<ul class="wp-block-list"><li><a href="https://github.com/celery/celery/issues/5000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5000</a></li><li><a href="https://github.com/celery/celery/issues/5286" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5286</a></li><li><a href="https://github.com/celery/celery/issues/5327" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5327</a></li><li><a href="https://github.com/celery/celery/issues/3723" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/3723</a></li></ul>



<p>Using Celery for our particular use case became difficult and unreliable. Celery’s native support for workflows doesn’t seem to be the right choice for handling 100/1,000/10,000 tasks. In its current state, it’s just not enough. So here we stand, in front of a solid, concrete wall… Either we somehow fix Celery, or we rewrite our automation using a different framework.</p>



<h2 class="wp-block-heading">Celery &#8211; to fix&#8230; or to fix?</h2>



<p>Rewriting all of our automation would be possible, although relatively painful. Since I&#8217;m a rather lazy person, perhaps attempting to fix Celery wasn&#8217;t an entirely bad idea? So I took some time to dig through Celery&#8217;s code, and managed to find the parts responsible for building workflows, and executing chains and chords. It was still a little bit difficult for me to understand all the different code paths handling the wide range of use cases, but I realised it would be possible to implement a clean, straightforward orchestration that would handle all the tasks and their combinations in the same way. What&#8217;s more, I could see that it wouldn&#8217;t take too much effort to integrate it into our automation (let&#8217;s not forget the main goal!).</p>



<p>Unfortunately, introducing new orchestration into the Celery project would probably be quite hard, and would most likely break some backwards compatibility. So I decided to take a different approach &#8211; writing an extension or a plugin that wouldn’t require changes in Celery. Something pluggable, and as non-invasive as possible. That’s how Celery Dyrygent emerged&#8230;</p>



<h2 class="wp-block-heading">Celery Dyrygent</h2>



<p><a href="https://github.com/ovh/celery-dyrygent" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/celery-dyrygent</a></p>



<h3 class="wp-block-heading">How to represent a workflow</h3>



<p>You can think of a workflow as a directed acyclic graph (DAG), where each task is a separate graph node. When it comes to acyclic graphs, it is relatively easy to store and resolve dependencies between nodes, which leads to straightforward orchestration. Celery Dyrygent was implemented based on these features. Each task in the workflow has a unique identifier (Celery already assigns task IDs when a task is pushed for execution) and each one of them is wrapped into a workflow node. Each workflow node consists of a task signature (a plain Celery signature) and a list of IDs for the tasks it depends on. See the example below:</p>
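<p>The node-plus-dependencies idea can be illustrated with plain dictionaries. The sketch below shows dependency resolution on such a DAG; it is a simplified illustration, not Celery Dyrygent&#8217;s actual internal structure:</p>

```python
# Simplified illustration of workflow nodes: each node stores a (stringified)
# signature plus the IDs of the nodes it depends on. This is NOT
# celery-dyrygent's real data structure, just the idea behind it.

nodes = {
    "a": {"signature": "noop.s(0)", "depends_on": []},
    "b": {"signature": "noop.s(1)", "depends_on": ["a"]},
    "c": {"signature": "noop.s(2)", "depends_on": ["a"]},
    "d": {"signature": "noop.s(3)", "depends_on": ["b", "c"]},
}

def runnable(nodes, finished):
    """IDs of nodes whose dependencies are all finished and that haven't run."""
    return sorted(
        node_id
        for node_id, node in nodes.items()
        if node_id not in finished
        and all(dep in finished for dep in node["depends_on"])
    )

print(runnable(nodes, set()))            # ['a']
print(runnable(nodes, {"a"}))            # ['b', 'c']
print(runnable(nodes, {"a", "b", "c"}))  # ['d']
```

<p>Scheduling then boils down to repeatedly asking which nodes have become runnable, which is what makes the orchestration straightforward.</p>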



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-1024x533.png" alt="" class="wp-image-17400" width="512" height="267" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-1024x533.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-300x156.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-768x400.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918.png 1172w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">How to process a workflow</h3>



<p>So we know how to store a workflow in a clean and easy way. Now we just need to execute it. How about using&#8230; Celery? Why not? For this, Celery Dyrygent introduces a <strong>workflow processor</strong> task (an ordinary Celery task). This task wraps a whole workflow and schedules the execution of primitive tasks according to their dependencies. Once the scheduling part is over, the task repeats itself (it &#8216;ticks&#8217; with some delay).</p>



<p>Throughout the whole processing cycle, the workflow processor retains the state of the entire workflow internally, updating it with each repetition. You can see an orchestration example below:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-1024x553.png" alt="" class="wp-image-17416" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-1024x553.png" alt="" class="wp-image-17417" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-1024x553.png" alt="" class="wp-image-17418" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Most notably, the workflow processor stops its execution in two cases:</p>



<ul class="wp-block-list"><li>Once the whole workflow finishes, with all tasks successfully completed</li><li>When it can’t proceed any further, due to a failed task</li></ul>



<h3 class="wp-block-heading">How to integrate</h3>



<p>So how do we use this? Fortunately, I was able to find a way to use Celery Dyrygent quite easily. First of all, you need to inject the workflow processor task definition into your Celery application:</p>



<pre class="wp-block-code"><code class="">from celery import Celery
from celery_dyrygent.tasks import register_workflow_processor

app = Celery()  # your celery application instance
workflow_processor = register_workflow_processor(app)</code></pre>



<p>Next, you need to convert your Celery defined workflow into a Celery Dyrygent workflow:</p>



<pre class="wp-block-code"><code class="">from celery import chain, chord
from celery_dyrygent.workflows import Workflow

celery_workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])

workflow = Workflow()
workflow.add_celery_canvas(celery_workflow)</code></pre>



<p>Finally, simply execute the workflow, just as you would an ordinary Celery task:</p>



<pre class="wp-block-code"><code class="">workflow.apply_async()</code></pre>



<p>That’s it! You can always go back if you wish, as the small changes are very easy to undo.</p>



<h3 class="wp-block-heading">Give it a try!</h3>



<p>Celery Dyrygent is free to use, and its source code is available on Github (<a href="https://github.com/ovh/celery-dyrygent" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/celery-dyrygent</a>). Feel free to use it, improve it, request features, and report any bugs! It has a few additional features not described here, so I&#8217;d encourage you to take a look at the project&#8217;s readme file. For our automation requirements, it&#8217;s already a solid, battle-tested solution. We&#8217;ve been using it since the end of 2018, and it has processed thousands of workflows, consisting of hundreds of thousands of tasks. Here are some production stats, from June 2019 to February 2020:</p>



<ul class="wp-block-list"><li>936,248 elementary tasks executed</li><li>11,170 workflows processed</li><li>4,098 tasks in the biggest workflow so far</li><li>~84 tasks per workflow, on average</li></ul>



<p>Automation is always a good idea!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdoing-big-automation-with-celery%2F&amp;action_name=Doing%20BIG%20automation%20with%20Celery&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introducing Director – a tool to build your Celery workflows</title>
		<link>https://blog.ovhcloud.com/introducing-director-a-tool-to-build-your-celery-workflows/</link>
		
		<dc:creator><![CDATA[Nicolas Crocfer]]></dc:creator>
		<pubDate>Wed, 26 Feb 2020 12:38:57 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17064</guid>

					<description><![CDATA[As developers, we often need to execute tasks in the background. Fortunately, some tools already exist for this. In the Python ecosystem, for instance, the most well-known library is Celery. If you have already used it, you know how great it is! But you will also have probably discovered how complicated it can be to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fintroducing-director-a-tool-to-build-your-celery-workflows%2F&amp;action_name=Introducing%20Director%20%E2%80%93%20a%20tool%20to%20build%20your%20Celery%20workflows&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>As developers, we often need to execute tasks in the background. Fortunately, some tools already exist for this. In the Python ecosystem, for instance, the most well-known library is <a href="http://www.celeryproject.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Celery</a>. If you have already used it, you know how great it is! But you will also have probably discovered how complicated it can be to follow the state of a complex workflow.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-1024x537.jpeg" alt="" class="wp-image-17224" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><strong>Celery Director</strong> is a tool we created at OVHcloud to fix this problem. The code is now open-sourced and is available on <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Github</a>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="525" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/director-1024x525.png" alt="" class="wp-image-17098" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-1024x525.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-300x154.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-768x394.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director.png 1440w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Following the talk we did during <a href="https://fosdem.org/2020/schedule/event/python2020_celery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FOSDEM 2020</a>, this post aims to present the tool. We&#8217;ll take a close look at what Celery is, why we created Director, and how to use it.</p>



<h2 class="wp-block-heading">What is Celery?</h2>



<p>Here is the official description of Celery:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Celery is an asynchronous <strong>task queue</strong>/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.</p></blockquote>



<p>The important words here are &#8220;task queue&#8221;. This is a mechanism used to distribute work across a pool of machines or threads.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="572" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-1024x572.jpeg" alt="" class="wp-image-17220" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-1024x572.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-300x168.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-768x429.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F.jpeg 1156w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The queue, in the middle of the above diagram, stores messages sent by the producers (APIs, for instance). On the other side, consumers constantly read the queue to pick up new messages and execute the corresponding tasks.</p>



<p>In Celery, a message sent by the producer is the signature of a Python function: <code>send_email("john.doe")</code>, for example.</p>



<p>The queue (named the <em>broker</em> in Celery) stores this signature until a worker reads it and <strong>actually</strong> executes the function with the given parameters.</p>
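<p>This mechanism can be sketched in plain Python. The following is a simplified, single-process illustration of the broker idea (not how Celery is implemented): the producer serialises a function call into a message, and a consumer thread reads the message and actually executes it.</p>

```python
import json
import queue
import threading

# The "broker": a queue storing serialized task signatures.
broker = queue.Queue()

def send_email(user):
    return f"email sent to {user}"

TASKS = {"send_email": send_email}
results = []

def producer(user):
    # The producer only enqueues the signature; it never runs the function.
    broker.put(json.dumps({"task": "send_email", "args": [user]}))

def consumer():
    # The consumer reads messages and actually executes the functions.
    while True:
        message = json.loads(broker.get())
        results.append(TASKS[message["task"]](*message["args"]))
        broker.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer("john.doe")
broker.join()  # wait until the message has been consumed
```

<p>Celery follows the same pattern, except that the queue lives in an external broker (Redis, RabbitMQ&#8230;) and the consumers are worker processes, possibly running on other machines.</p>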



<p>But why execute a Python function <em>somewhere else</em>? The main reason is to quickly return a response in the case of long-running functions. Indeed, keeping users waiting several seconds or minutes for a response is not an option.</p>



<p>We can also imagine a producer that lacks the resources for a CPU-bound task: a more robust worker could handle its execution instead.</p>



<h2 class="wp-block-heading">How to use Celery</h2>



<p>So Celery is a library used to execute Python code <em>somewhere else</em>, but how does it do that? In fact, it&#8217;s really simple! To illustrate this, we&#8217;ll use some of the available methods to send tasks to the broker, then we&#8217;ll start a worker to consume them.</p>



<p>Here is the code to create a Celery task:</p>



<pre class="wp-block-code"><code class=""># tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://127.0.0.1:6379/0")

@app.task
def add(x, y):
    return x + y</code></pre>



<p>As you can see, a Celery task is just a Python function transformed so that it can be sent to a broker. Note that we passed the Redis connection URL to the Celery application (named <code>app</code>), telling it which broker will store the messages.</p>



<p>This means it&#8217;s now possible to send a task to the broker:</p>



<pre class="wp-block-code"><code class="">>>> from tasks import add
>>> add.delay(2, 3)</code></pre>



<p>That&#8217;s all! We used the <code>.delay()</code> method, so our producer didn&#8217;t execute the Python code but instead sent the task signature to the broker.</p>



<p>Now it&#8217;s time to consume it with a Celery worker:</p>



<pre class="wp-block-code"><code class="">$ celery worker -A tasks --loglevel=INFO
[...]
[2020-02-14 17:13:38,947: INFO/MainProcess] Received task: tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000]
[2020-02-14 17:13:38,954: INFO/ForkPoolWorker-2] Task tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000] succeeded in 0.0024250600254163146s: 5</code></pre>



<p>It&#8217;s even possible to combine the Celery tasks with some primitives (the full list is <a href="https://docs.celeryproject.org/en/stable/userguide/canvas.html#the-primitives" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>):</p>



<ul class="wp-block-list"><li>Chain: will execute tasks one after the other.</li><li>Group: will execute tasks in parallel by routing them to multiple workers.</li></ul>



<p>For example, the following code will make two additions in parallel, then sum the results:</p>



<pre class="wp-block-code"><code class="">from celery import chain, group

# Create the canvas
canvas = chain(
    group(
        add.si(1, 2),
        add.si(3, 4)
    ),
    sum_numbers.s()
)

# Execute it
canvas.delay()</code></pre>



<p>You probably noted we didn&#8217;t use the <code>.delay()</code> method here. Instead we created a <strong>canvas</strong>, used to combine a selection of tasks.</p>



<p>The <code>.si()</code> method is used to create an immutable signature (i.e. one that does not receive data from a previous task), while <code>.s()</code> relies on the data returned by the two previous tasks.</p>



<p>This introduction to Celery has just covered its very basic usage. If you&#8217;re keen to find out more, I invite you to read the documentation, where you&#8217;ll discover all the powerful features, including <strong>rate limits</strong>, <strong>task retries</strong>, or even <strong>periodic tasks</strong>.</p>



<h2 class="wp-block-heading">As a developer, I want&#8230;</h2>



<p>I&#8217;m part of a team whose goal is to deploy and monitor internal infrastructures. As part of this, we needed to launch some background tasks, and as Python developers our natural choice was to use Celery. But, out of the box, Celery didn&#8217;t support certain specific requirements for our projects:</p>



<ul class="wp-block-list"><li>Tracking the tasks&#8217; evolution and their dependencies in a WebUI.</li><li>Executing the workflows using API calls, or simply with a CLI.</li><li>Combining tasks to create workflows in YAML format.</li><li>Periodically executing a whole workflow.</li></ul>



<p>Some other cool tools exist for this, like Flower, but it only allows us to track each task individually, not a whole workflow and its component tasks.</p>



<p>And as we really needed these features, we decided to create <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Celery Director</a>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="377" height="377" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B.png" alt="" class="wp-image-17222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B.png 377w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B-150x150.png 150w" sizes="auto, (max-width: 377px) 100vw, 377px" /></figure></div>



<h2 class="wp-block-heading">How to use Director</h2>



<p>The installation can be done using the <code>pip</code> command:</p>



<pre class="wp-block-code"><code class="">$ pip install celery-director</code></pre>



<p>Director provides a simple command to create a new workspace folder:</p>



<pre class="wp-block-code"><code class="">$ director init workflows
[*] Project created in /home/ncrocfer/workflows
[*] Do not forget to initialize the database
You can now export the DIRECTOR_HOME environment variable</code></pre>



<p>A new tasks folder and a workflow example have been created for you:</p>



<pre class="wp-block-code"><code class="">$ tree -a workflows/
├── .env
├── tasks
│   └── etl.py
└── workflows.yml</code></pre>



<p>The <code>tasks/*.py</code> files will contain your Celery tasks, while the <code>workflows.yml</code> file will combine them:</p>



<pre class="wp-block-code"><code class="">$ cat workflows.yml
---
ovh.SIMPLE_ETL:
  tasks:
    - EXTRACT
    - TRANSFORM
    - LOAD</code></pre>



<p>This example, named <strong>ovh.SIMPLE_ETL</strong>, will execute three tasks, one after the other. You can find more examples in the <a href="https://ovh.github.io/celery-director/guides/build-workflows/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<p>After exporting the <code>DIRECTOR_HOME</code> variable and initialising the database with <code>director db upgrade</code>, you can execute this workflow:</p>



<pre class="wp-block-code"><code class="">$ director workflow list
+----------------+----------+-----------+
| Workflows (1)  | Periodic | Tasks     |
+----------------+----------+-----------+
| ovh.SIMPLE_ETL |    --    | EXTRACT   |
|                |          | TRANSFORM |
|                |          | LOAD      |
+----------------+----------+-----------+
$ director workflow run ovh.SIMPLE_ETL</code></pre>



<p>The broker has received the tasks, so now you can launch the Celery worker to execute them:</p>



<pre class="wp-block-code"><code class="">$ director celery worker --loglevel=INFO</code></pre>



<p>And then display the results using the webserver command (<code>director webserver</code>):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="530" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/director_etl-1024x530.png" alt="" class="wp-image-17094" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-1024x530.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-300x155.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-768x397.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl.png 1440w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is just the beginning, as Director provides other features, allowing you to parametrise a workflow or periodically execute it, for example. You will find more details on these features in the <a href="https://ovh.github.io/celery-director/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Our teams use Director regularly to launch our workflows. No more boilerplate, and no more need for advanced Celery knowledge&#8230; A new colleague can easily create their tasks in Python and combine them in YAML, without using the Celery primitives discussed earlier.</p>



<p>Sometimes we need to execute a workflow periodically (to populate a cache, for instance), and sometimes we need to manually call it from another web service (note that a workflow can also be executed through an <a href="https://ovh.github.io/celery-director/guides/run-workflows/#using-the-api" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">API call</a>). This is now possible using our single Director instance.</p>



<p>We invite you to try Director for yourself, and give us your feedback via GitHub, so we can continue to enhance it. The source code can be found on <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a>, and the 2020 FOSDEM presentation is available <a href="https://fosdem.org/2020/schedule/event/python2020_celery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fintroducing-director-a-tool-to-build-your-celery-workflows%2F&amp;action_name=Introducing%20Director%20%E2%80%93%20a%20tool%20to%20build%20your%20Celery%20workflows&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Simplify your research experiments with Kubernetes</title>
		<link>https://blog.ovhcloud.com/simplify-your-research-experiments-with-kubernetes/</link>
		
		<dc:creator><![CDATA[Laurent Parmentier]]></dc:creator>
		<pubDate>Fri, 06 Sep 2019 13:40:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[research]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=15990</guid>

					<description><![CDATA[Abstract As a researcher I need to conduct experiments to validate my hypotheses. When the field of Computer Science is involved, it is well known that practitioners tend to drive experiments on different environments (at the hardware level: x86/arm/…, CPU frequency, available memory, or at the software level: operating system, versions of libraries). The problem [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsimplify-your-research-experiments-with-kubernetes%2F&amp;action_name=Simplify%20your%20research%20experiments%20with%20Kubernetes&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Abstract</h2>



<p>As a <a href="https://www.ovh.com/blog/academics-ovh-ai-collaboration/" data-wpel-link="exclude">researcher</a> I need to conduct experiments to validate my hypotheses. When the field of Computer Science is involved, it is well known that practitioners tend to drive experiments on different environments (at the hardware level: x86/arm/…, CPU frequency, available memory, or at the software level: operating system, versions of libraries). The problem with these different environments is the difficulty of accurately reproducing an experiment as it has been presented in a research article.</p>



<p>In this post we provide a way of conducting experiments that can be reproduced, using Kubernetes-as-a-service, a managed platform to perform distributed computations, along with other tools (Argo, MinIO) that take advantage of the platform.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/IMG_0360-1024x539.png" alt="Simplify your research experiments with Kubernetes" class="wp-image-16028" width="768" height="404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360.png 1199w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>The article is organised as follows: we first recall the context and the problem faced by a researcher who needs to conduct experiments. Then we explain how to solve the problem with Kubernetes, and why we did not choose other solutions (e.g., HPC software). Finally, we give some tips on improving the setup.</p>



<h2 class="wp-block-heading">Introduction</h2>



<p>When I started my PhD, I read a bunch of articles related to the field I’m working on, i.e. <a href="https://en.wikipedia.org/wiki/Automated_machine_learning" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AutoML</a>. From this research, I realised how important it is to conduct experiments well in order to make them credible and verifiable. I started asking my colleagues how they carried out their experiments, and there was a common pattern: develop your solution, look at other solutions related to the same problem, run each solution <a href="https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">30 times if it is stochastic</a> with equivalent resources, and compare your results to the other solutions with statistical tests: Wilcoxon-Mann-Whitney when comparing two algorithms, or the Friedman test otherwise. As statistical tests are not the main topic of this article, I will not discuss them in detail.</p>
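<p>As an illustration of that pattern, comparing two stochastic solutions over 30 runs takes only a few lines with SciPy (assumed to be installed; the scores below are synthetic, for illustration only):</p>

```python
import random

from scipy.stats import mannwhitneyu

random.seed(42)

# Hypothetical scores from 30 stochastic runs of each solution.
algo_a = [random.gauss(0.80, 0.02) for _ in range(30)]
algo_b = [random.gauss(0.83, 0.02) for _ in range(30)]

# Wilcoxon-Mann-Whitney test on the two samples.
stat, p_value = mannwhitneyu(algo_a, algo_b, alternative="two-sided")
print(f"p-value: {p_value:.4f}")  # below 0.05 => significant difference
```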



<p>As an experienced DevOps, I had one question about automation: how do I find out how to reproduce an experiment, especially someone else&#8217;s solution? Guess the answer? Meticulously read the paper, or find a repository with all the information.</p>



<p>Either you are lucky and the source code is available, or pseudo-code is provided in the publication. In the latter case you need to re-implement the solution to be able to test and compare it. Even if you are lucky and the source code is available, the whole environment is often missing (e.g., the exact versions of the packages, the Python version itself, the JDK version, etc.). Not having the right information impacts performance and may potentially bias experiments. For example, new versions of packages and languages usually include better optimisations that an implementation can benefit from. Sometimes it is hard to find the versions that were used by practitioners.</p>



<p>The other problem is the complexity of setting up a cluster with <a href="https://en.wikipedia.org/wiki/Comparison_of_cluster_software" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HPC software</a> (e.g., Slurm, Torque). Indeed, it requires technical knowledge to manage such a solution: configuring the network, verifying that each node has the dependencies required by the runs installed, checking that nodes have the same versions of libraries, etc. These technical steps consume researchers&#8217; time, taking them away from their main job. Moreover, researchers usually extract the results manually: they retrieve the different files (through FTP or NFS), then perform statistical tests and save them by hand. Consequently, the workflow to perform an experiment is relatively costly and precarious.</p>



<p>From my point of view, this raises one <strong>big problem</strong>: an experiment cannot really be reproduced in the field of Computer Science.</p>



<h2 class="wp-block-heading">Solution</h2>



<p>OVH offers <a href="https://www.ovh.com/fr/public-cloud/kubernetes/" data-wpel-link="exclude">Kubernetes-as-a-service</a>, a managed cluster platform where you do not have to worry about how to configure the cluster (adding nodes, configuring the network, and so on), so I started to investigate how I could perform experiments similarly to the HPC solutions. <a href="https://argoproj.github.io/argo/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Argo Workflows</a> quickly stood out. This tool allows you to define a workflow of steps to run on your Kubernetes cluster, in which each step is confined to a <a href="https://en.wikipedia.org/wiki/List_of_Linux_containers" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">container</a>, loosely called an image. A container allows you to run a program under a specific software environment (language version, libraries, third parties), in addition to limiting the resources (CPU time, memory) used by the program.</p>



<p>The <strong>solution</strong> is linked to our big problem: reproducing an experiment becomes equivalent to running a workflow composed of steps under a specific environment.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="969" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/IMG_0359-1024x969.jpg" alt="Simplify your research experiments with Kubernetes: architecture" class="wp-image-16027" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-1024x969.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-300x284.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-768x727.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359.jpg 1483w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Use case: Evaluate an AutoML solution</h3>



<p>The use case we rely on in our research is measuring <a href="https://gitlab.com/automl/automl-smac-vanilla" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">the convergence of Bayesian Optimization (SMAC) on the AutoML problem</a>.</p>



<p>For this use case, we described the Argo workflow in the following <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">YAML file</a>.</p>



<h3 class="wp-block-heading">Set up the infrastructure</h3>



<p>First we will set up a Kubernetes cluster, then we will install the services on it, and lastly we will run an experiment.</p>



<h4 class="wp-block-heading">Kubernetes cluster</h4>



<p>Installing a Kubernetes cluster with OVH is child&#8217;s play. Connect to the <a href="https://www.ovh.com/manager" data-wpel-link="exclude">OVH Control Panel</a>, go to <code>Public Cloud &gt; Managed Kubernetes Service</code>, then <code>Create a Kubernetes cluster</code> and follow the steps depending on your needs.</p>



<p>Once the cluster is created:</p>



<ul class="wp-block-list"><li>Take the upgrade policy into consideration. If you are a researcher and your experiment takes some time to run, you want to avoid an update that would shut down your infrastructure mid-run. To avoid this situation, it is better to choose &#8220;Minimum unavailability&#8221; or &#8220;Do not update&#8221;.</li><li>Download the <code>kubeconfig</code> file; it will be used later with <code>kubectl</code> to connect to our cluster.</li><li>Add at least one node to your cluster.</li></ul>



<p>Once the cluster is up, you will <strong>need <a href="https://kubernetes.io/fr/docs/tasks/tools/install-kubectl/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">kubectl</a></strong>, a tool that allows you to manage your cluster.</p>



<p>If everything has been properly set up, you should get something like this:</p>



<pre class="wp-block-code"><code class="">kubectl top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   64m          3%     594Mi           11%</code></pre>



<h4 class="wp-block-heading">Installation of Argo</h4>



<p>As we mentioned before, Argo allows us to run a workflow composed of steps. To install the client and the service on the cluster, we were inspired by this <a href="https://github.com/argoproj/argo/blob/02f38262c40901346ddd622685bc6bfd344a2717/demo.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tutorial</a>.</p>



<p>First we download and install Argo (client):</p>



<pre class="wp-block-code"><code class="">curl -sSL -o /usr/local/bin/argo https://github.com/argoproj/argo/releases/download/v2.3.0/argo-linux-amd64
chmod +x /usr/local/bin/argo</code></pre>



<p>Then the controller and UI on our cluster:</p>



<pre class="wp-block-code"><code class="">kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.3.0/manifests/install.yaml</code></pre>



<p>Configure the <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">service account</a>:</p>



<pre class="wp-block-code"><code class="">kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default</code></pre>



<p>Then, with the <a href="https://github.com/argoproj/argo/blob/master/demo.md#1-download-argo" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">client</a> try a simple hello-world workflow to confirm the stack is working (Status: Succeeded):</p>



<pre class="wp-block-code"><code class="">argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-2lx9d
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Started:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Finished:            Tue Aug 13 16:51:56 +0200 (now)
Duration:            24 seconds

STEP                  PODNAME            DURATION  MESSAGE
 ✔ hello-world-2lx9d  hello-world-2lx9d  23s</code></pre>



<p>You can also access the UI dashboard through <code>http://localhost:8001</code>:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n argo service/argo-ui 8001:80</code></pre>



<h4 class="wp-block-heading">Configure an Artifact repository (MinIO)</h4>



<p><strong>Artifact</strong> is the term Argo uses for an archive containing the files returned by a step. In our case we will use this feature to return the final results, and to share intermediate results between steps.</p>



<p>In order to get artifacts working, we need an object storage. If you already have one you can skip the installation part, but you will still need to <a href="https://argoproj.github.io/docs/argo/ARTIFACT_REPO.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">configure it</a>.</p>



<p>As specified in the tutorial, we used <a href="https://min.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MinIO</a>, here is the manifest to install it (<code>minio-argo-artifact.install.yml</code>):</p>



<pre class="wp-block-code"><code class="">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # This name uniquely identifies the PVC. Will be used in deployment below.
  name: minio-pv-claim
  labels:
    app: minio-storage-claim
spec:
  # Read more about access modes here: https://kubernetes.io/docs/user-guide/persistent-volumes/#access-modes
  accessModes:
    - ReadWriteOnce
  resources:
    # This is the request for storage. Should be available in the cluster.
    requests:
      storage: 10Gi
  # Uncomment and add storageClass specific to your requirements below. Read more https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  #storageClassName:
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  # This name uniquely identifies the Deployment
  name: minio-deployment
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        # Label is used as selector in the service.
        app: minio
    spec:
      # Refer to the PVC created earlier
      volumes:
      - name: storage
        persistentVolumeClaim:
          # Name of the PVC created earlier
          claimName: minio-pv-claim
      containers:
      - name: minio
        # Pulls the default MinIO image from Docker Hub
        image: minio/minio
        args:
        - server
        - /storage
        env:
        # MinIO access key and secret key
        - name: MINIO_ACCESS_KEY
          value: "TemporaryAccessKey"
        - name: MINIO_SECRET_KEY
          value: "TemporarySecretKey"
        ports:
        - containerPort: 9000
        # Mount the volume into the pod
        volumeMounts:
        - name: storage # must match the volume name, above
          mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
spec:
  ports:
    - port: 9000
      targetPort: 9000
      protocol: TCP
  selector:
    app: minio</code></pre>



<p><strong>Note</strong>: Please edit the following key/values:</p>



<ul class="wp-block-list"><li><code>spec &gt; resources &gt; requests &gt; storage</code>: the amount of storage requested by MinIO from the cluster (give it a unit, e.g. <code>10Gi</code> for 10 GiB)</li><li><code>TemporaryAccessKey</code></li><li><code>TemporarySecretKey</code></li></ul>



<pre class="wp-block-code"><code class="">kubectl create ns minio
kubectl apply -n minio -f minio-argo-artifact.install.yml</code></pre>



<p><strong>Note</strong>: alternatively, you can install MinIO with <a href="https://hub.helm.sh/charts/stable/minio" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Helm</a>.</p>



<p>Now we need to configure Argo in order to use our object storage MinIO:</p>



<pre class="wp-block-code"><code class="">kubectl edit cm -n argo workflow-controller-configmap
...
data:
  config: |
    artifactRepository:
      s3:
        bucket: my-bucket
        endpoint: minio-service.minio:9000
        insecure: true
        # accessKeySecret and secretKeySecret are secret selectors.
        # It references the k8s secret named 'argo-artifacts'
        # which was created during the minio helm install. The keys,
        # 'accesskey' and 'secretkey', inside that secret are where the
        # actual minio credentials are stored.
        accessKeySecret:
          name: argo-artifacts
          key: accesskey
        secretKeySecret:
          name: argo-artifacts
          key: secretkey</code></pre>



<p>Add credentials:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic argo-artifacts --from-literal=accesskey="TemporaryAccessKey" --from-literal=secretkey="TemporarySecretKey"</code></pre>



<p><strong>Note</strong>: Use the correct credentials you specified above</p>



<p>Create the bucket <code>my-bucket</code> with <code>Read and write</code> rights by connecting to the interface at <code>http://localhost:9000</code>:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n minio service/minio-service 9000</code></pre>



<p>Check that Argo is able to use Artifact with the object storage:</p>



<pre class="wp-block-code"><code class="">argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/artifact-passing.yaml
Name:                artifact-passing-qzgxj
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Started:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Finished:            Wed Aug 14 15:36:16 +0200 (now)
Duration:            13 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ artifact-passing-qzgxj
 ├---✔ generate-artifact   artifact-passing-qzgxj-4183565942  5s
 └---✔ consume-artifact    artifact-passing-qzgxj-3706021078  7s</code></pre>



<p><strong>Note</strong>: If you are stuck with a <code>ContainerCreating</code> message, chances are that Argo is not able to access MinIO (e.g., bad credentials).</p>



<h4 class="wp-block-heading">Install a private registry</h4>



<p>Now that we have a way to run a workflow, we want each step to represent a specific software environment (i.e., an image). We defined this environment in a <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a>.</p>



<p>Because each step can run on a different node in our cluster, the image needs to be stored somewhere; in the case of Docker, we require a <a href="https://docs.docker.com/registry/deploying/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">private registry</a>.</p>



<p>You can get a private registry in different ways:</p>



<ul class="wp-block-list"><li><a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a></li><li><a href="https://gitlab.com/help/user/project/container_registry" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gitlab.com</a></li><li><a href="https://labs.ovh.com/private-registry" data-wpel-link="exclude">OVH</a> &#8211; <a href="https://labs.ovh.com/private-registry/documentation/creating-a-private-registry" data-wpel-link="exclude">tutorial</a></li><li><a href="https://hub.helm.sh/charts/choerodon/harbor" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Harbor</a>: allows you to have your own registry on your Kubernetes cluster</li></ul>



<p>In our case, we used the <a href="https://labs.ovh.com/private-registry" data-wpel-link="exclude">OVH private registry</a>.</p>



<pre class="wp-block-code"><code class=""># First we clone the repository
git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

# We build the image locally
docker build -t asv-environment:latest .

# We push the image to our private registry
docker login REGISTRY_SERVER -u REGISTRY_USERNAME
docker tag asv-environment:latest REGISTRY_IMAGE_PATH:latest
docker push REGISTRY_IMAGE_PATH:latest</code></pre>



<p>Allow our cluster to pull images from the registry:</p>



<pre class="wp-block-code"><code class="">kubectl create secret docker-registry docker-credentials --docker-server=REGISTRY_SERVER --docker-username=REGISTRY_USERNAME --docker-password=REGISTRY_PWD</code></pre>
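

<p>Creating the secret alone is not always enough: the workflow pods have to reference it when pulling images. One way to do that is to declare it as an <code>imagePullSecret</code> in the workflow spec, along the lines of the following sketch (the metadata and entrypoint names are illustrative, not taken from our repository):</p>



<pre class="wp-block-code"><code class=""># Sketch: reference the docker-credentials secret created above
# so that every pod in the workflow can pull from the private registry
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: automl-benchmark-
spec:
  imagePullSecrets:
  - name: docker-credentials
  entrypoint: main
  # ... templates follow here</code></pre>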



<h3 class="wp-block-heading">Try our experiment on the infrastructure</h3>



<pre class="wp-block-code"><code class="">git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

argo submit --watch misc/workflow-argo -p image=REGISTRY_IMAGE_PATH:latest -p git_ref=master -p dataset=iris
Name:                automl-benchmark-xlbbg
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Started:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Finished:            Tue Aug 20 12:39:29 +0000 (now)
Duration:            13 minutes 49 seconds
Parameters:
  image:             m1uuklj3.gra5.container-registry.ovh.net/automl/asv-environment:latest
  dataset:           iris
  git_ref:           master
  cutoff_time:       300
  number_of_evaluations: 100
  train_size_ratio:  0.75
  number_of_candidates_per_group: 10

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ automl-benchmark-xlbbg
 ├---✔ pre-run             automl-benchmark-xlbbg-692822110   2m
 ├-·-✔ run(0:42)           automl-benchmark-xlbbg-1485809288  11m
 | └-✔ run(1:24)           automl-benchmark-xlbbg-2740257143  9m
 ├---✔ merge               automl-benchmark-xlbbg-232293281   9s
 └---✔ plot                automl-benchmark-xlbbg-1373531915  10s</code></pre>



<p><strong>Note</strong>:</p>



<ul class="wp-block-list"><li>Here we only have 2 parallel runs; you can have many more by adding entries to the <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L59" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">list</a> <code>withItems</code>. In our case, the list corresponds to the seeds.</li><li><code>run(1:24)</code> corresponds to run <code>1</code> with seed <code>24</code></li><li>We limit the resources per run by using <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L114" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requests and limits</a>; see also <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Managing Compute Resources</a>.</li></ul>
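

<p>To make this more concrete, here is a simplified sketch (not the full manifest from the repository) of how <code>withItems</code> and the per-run resource limits fit together in the workflow file; the parameter names and values are illustrative:</p>



<pre class="wp-block-code"><code class=""># One "run" step is fanned out per seed in withItems,
# and each run is capped by requests/limits.
- name: run
  inputs:
    parameters:
    - name: seed
  container:
    image: "{{workflow.parameters.image}}"
    args: ["--seed", "{{inputs.parameters.seed}}"]
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 2Gi

- name: main
  steps:
  - - name: run
      template: run
      arguments:
        parameters:
        - name: seed
          value: "{{item}}"
      withItems: [42, 24]   # add more seeds here for more parallel runs</code></pre>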



<p>Then we just retrieve the results through the MinIO web user interface at <code>http://localhost:9000</code> (you can also do this with the <a href="https://github.com/minio/mc" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">client</a>).</p>



<p>The results are located in a directory with the same name as the Argo workflow; in our example, it is <code>my-bucket &gt; automl-benchmark-xlbbg</code>.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/performance-1024x768.jpg" alt="" class="wp-image-16026" width="768" height="576" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance.jpg 1920w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h2 class="wp-block-heading">Limitations of our solution</h2>



<p>The solution is not able to run the parallel steps on multiple nodes. This limitation is due to the way we merge the results of the parallel steps in the merge step. We are using <code>volumeClaimTemplates</code>, i.e., we are <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L7" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mounting a volume</a>, and this can&#8217;t be done across different nodes. The problem can be solved in two ways:</p>



<ul class="wp-block-list"><li>Using parallel artifacts and aggregating them; however, this is an <a href="https://github.com/argoproj/argo/issues/934" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ongoing issue with Argo</a></li><li>Implementing, directly in the code of your run, a way to store the results on accessible storage (the <a href="https://docs.min.io/docs/python-client-quickstart-guide.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MinIO SDK</a>, for example)</li></ul>



<p>The first approach is preferred, as it means you don&#8217;t have to change and customize the code for a specific storage file system.</p>
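

<p>For reference, the mechanism behind this limitation is the shared volume declared at the top of the workflow, roughly like this (a simplified sketch; the claim name and size are illustrative):</p>



<pre class="wp-block-code"><code class=""># volumeClaimTemplates create one PVC for the whole workflow;
# with a ReadWriteOnce volume, every step that mounts it must be
# scheduled on the same node.
spec:
  volumeClaimTemplates:
  - metadata:
      name: workdir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi</code></pre>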



<h2 class="wp-block-heading">Hints to improve the solution</h2>



<p>In case you are interested in going further with your setup, you should take a look at the following topics:</p>



<ul class="wp-block-list"><li><a href="https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Controlling access</a>: in order to confine users to different spaces (for security reasons, or to control resources).</li><li>Exploring <a href="https://github.com/argoproj/argo/blob/master/examples/node-selector.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Argo selectors</a> and <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes selectors</a>: useful if your cluster is composed of nodes with different hardware and an experiment requires specific hardware (e.g., a particular CPU or GPU).</li><li>Configuring a <a href="https://docs.min.io/docs/distributed-minio-quickstart-guide.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">distributed MinIO</a>: this ensures your data is replicated across multiple nodes and stays available in the event of a node failure.</li><li><a href="https://www.ovh.com/blog/how-to-monitor-your-kubernetes-cluster-with-ovh-metrics/" data-wpel-link="exclude">Monitoring your cluster</a>.</li></ul>



<h2 class="wp-block-heading">Conclusion</h2>



<p>We have shown that a complex cluster for running reproducible research experiments can be set up easily, without in-depth technical knowledge.</p>



<h2 class="wp-block-heading">Related links</h2>



<ul class="wp-block-list"><li><a href="https://www.youtube.com/watch?v=ZK510prml8o&amp;t=0s&amp;index=169&amp;list=PLj6h78yzYM2PZf9eA7bhWnIh_mK1vyOfU" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Automating Research Workflows at BlackRock</a></li><li><a href="https://www.stackhpc.com/the-state-of-hpc-containers.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The State of HPC Containers</a></li><li><a href="https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes Meets High-Performance Computing</a></li></ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsimplify-your-research-experiments-with-kubernetes%2F&amp;action_name=Simplify%20your%20research%20experiments%20with%20Kubernetes&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVH Private Cloud and HashiCorp Terraform &#8211; Part 1</title>
		<link>https://blog.ovhcloud.com/private_cloud_and_hashicorp_terraform_part1/</link>
		
		<dc:creator><![CDATA[Erwan Quelin]]></dc:creator>
		<pubDate>Fri, 03 May 2019 08:59:46 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[HashiCorp]]></category>
		<category><![CDATA[IaaC]]></category>
		<category><![CDATA[Private Cloud]]></category>
		<category><![CDATA[Terraform]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15407</guid>

					<description><![CDATA[When discussing the concepts of DevOps and Infrastructure-as-a-Code, the tools developed by HashiCorp quickly come up. With Terraform, HashiCorp offers a simple way to automate infrastructure provisioning in both public clouds and on-premises. Terraform has a long history of deploying and managing OVH&#8217;s Public Cloud resources. For example, you can find a complete guide on [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprivate_cloud_and_hashicorp_terraform_part1%2F&amp;action_name=OVH%20Private%20Cloud%20and%20HashiCorp%20Terraform%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><span class="tlid-translation translation" lang="en">When discussing the concepts of DevOps and Infrastructure-as-Code, the tools developed by <a href="https://www.hashicorp.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HashiCorp</a> quickly come up. With Terraform, HashiCorp offers a simple way to automate infrastructure provisioning in both public clouds and on-premises. Terraform has a long history of deploying and managing <a href="https://www.ovh.co.uk/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVH&#8217;s Public Cloud</a> resources. For example, you can find a complete guide on <a href="https://github.com/ovh/terraform-ovh-commons" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a>. In this article, we will focus on using Terraform to interact with another OVH solution:&nbsp;<a href="https://www.ovh.co.uk/private-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Private Cloud</a>.</span></p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="900" height="450" src="/blog/wp-content/uploads/2019/04/IMG_0225.jpg" alt="" class="wp-image-15430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225.jpg 900w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225-300x150.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225-768x384.jpg 768w" sizes="auto, (max-width: 900px) 100vw, 900px" /></figure></div>



<p><span class="tlid-translation translation" lang="en"><br>Private Cloud enables customers to benefit from a VMware vSphere infrastructure, hosted and managed by OVH. Terraform lets you automate the creation of resources and their life cycle. In this first article, we will explore the basic notions of Terraform. After reading it, you should be able to write a Terraform configuration file to deploy and customise a virtual machine from a template. In a second article, we will build on this example, and modify it so that it is more generic and can be easily adapted to your needs.</span></p>



<h3 class="wp-block-heading">Installation</h3>



<p><span class="tlid-translation translation" lang="en">Terraform is available on the <a href="https://www.terraform.io/downloads.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HashiCorp website</a> for almost all OSs as a simple binary. Just download it and copy it into a directory in your operating system&#8217;s PATH. To test that everything is working properly, run the <strong>terraform</strong> command.</span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform
Usage: terraform [-version] [-help] &lt;command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
    destroy            Destroy Terraform-managed infrastructure</code></pre>



<h3 class="wp-block-heading">Folders and files</h3>



<p><span class="tlid-translation translation" lang="en">Like other Infrastructure-as-Code tools, Terraform uses simple files to define the target configuration. To begin, we will create a directory and place a file named <code>main.tf</code> in it. By default, Terraform reads all the files in the working directory with the <code>.tf</code> extension, but to simplify things, we will start with a single file. We will see in a future article how to organise the data into several files.</span></p>



<p>Similarly, to make it easier to understand Terraform operations, we will specify all the necessary information directly in the files. This includes usernames, passwords and the names of the different resources (vCenter, cluster, etc.). This is obviously not advisable when using Terraform in production. The second article will also be an opportunity to improve this part of the code. But for now, let&#8217;s keep it simple!</p>



<h3 class="wp-block-heading">Providers</h3>



<p><span class="tlid-translation translation" lang="en">Providers specify how Terraform will communicate with the outside world. In our example, the vSphere provider will be in charge of connecting to your Private Cloud&#8217;s vCenter. We declare a provider as follows:</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">provider "vsphere" {
    user = "admin"
    password = "MyAwesomePassword"
    vsphere_server = "pcc-XXX-XXX-XXX-XXX.ovh.com"
}</code></pre>



<p><span class="tlid-translation translation" lang="en">We see here that Terraform uses its own way of structuring data (it is also possible to write everything in <a href="https://www.terraform.io/docs/configuration/syntax-json.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">JSON</a>, to facilitate the automatic generation of files!). Data is grouped in blocks (here a block named <strong>vsphere</strong>, which is of the <strong>provider</strong> type), and the data relating to the block takes the form of keys/values.</span></p>
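

<p>As an illustration of the JSON alternative mentioned above, the same provider block could be written in a <code>.tf.json</code> file roughly like this (a sketch, using the same placeholder credentials):</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
  "provider": {
    "vsphere": {
      "user": "admin",
      "password": "MyAwesomePassword",
      "vsphere_server": "pcc-XXX-XXX-XXX-XXX.ovh.com"
    }
  }
}</code></pre>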



<h3 class="wp-block-heading">Data</h3>



<p><span class="tlid-translation translation" lang="en">Now that Terraform is able to connect to the vCenter, we need to retrieve information about the vSphere infrastructure. Since we want to deploy a virtual machine, we need to know where we are going to create it: the datacentre, cluster, template, and so on. To do this, we will use <strong>data</strong>-type blocks:</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">data "vsphere_datacenter" "dc" {
  name = "pcc-XXX-XXX-XXX-XXX_datacenter3113"
}

data "vsphere_datastore" "datastore" {
  name          = "pcc-001234"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_virtual_machine" "template" {
  name          = "UBUNTU"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}</code></pre>



<p><span class="tlid-translation translation" lang="en">In the above example, we retrieve information about the datacentre named <code>pcc-XXX-XXX-XXX-XXX_datacenter3113</code>, the datastore named <code>pcc-001234</code>, and a template named <code>UBUNTU</code>. Note that we use the datacentre id to look up the objects associated with it.</span></p>
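

<p>The resource we define in the next section also references a cluster and a network; the corresponding <strong>data</strong> blocks follow exactly the same pattern (they appear again in the full example below):</p>



<pre class="wp-block-code"><code lang="json" class="language-json">data "vsphere_compute_cluster" "cluster" {
  name          = "Cluster1"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_network" "network" {
  name          = "vxw-dvs-57-virtualwire-2-sid-5001-Dc3113_5001"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}</code></pre>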



<h3 class="wp-block-heading">Resources</h3>



<p><span class="tlid-translation translation" lang="en">Resources are used to create and/or manage elements of the infrastructure. In our example, we will use a resource of type <code>vsphere_virtual_machine</code>, which, as its name suggests, will help us create a VM.</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">resource "vsphere_virtual_machine" "vm" {
  name             = "vm01"
  resource_pool_id = "${data.vsphere_compute_cluster.cluster.resource_pool_id}"
  datastore_id     = "${data.vsphere_datastore.datastore.id}"
  guest_id         = "${data.vsphere_virtual_machine.template.guest_id}"
  scsi_type        = "${data.vsphere_virtual_machine.template.scsi_type}"

  network_interface {
    network_id = "${data.vsphere_network.network.id}"
  }

  disk {
    label = "disk0"
    size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  }

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"

    customize {
      linux_options {
        host_name = "vm01"
        domain     = "example.com"
      }

      network_interface {
        ipv4_address = "192.168.1.2"
        ipv4_netmask = 24
      }

      ipv4_gateway    = "192.168.1.254"
      dns_suffix_list = ["example.com"]
      dns_server_list = ["192.168.1.1"]
    }
  }
}</code></pre>



<p><span class="tlid-translation translation" lang="en">The structure of this resource is a little more complex, because it is composed of several sub-blocks. We first define the name of the virtual machine, then provide information about its configuration (resource pool, datastore, etc.). The <code>network_interface</code> and <code>disk</code> blocks specify the configuration of its virtual devices. The <code>clone</code> sub-block lets you specify which template to use to create the VM, as well as the configuration of the operating system installed on it. The <code>customize</code> sub-block is specific to the type of OS you are cloning. At every level, we use information previously obtained in the <code>data</code> blocks.</span></p>



<h3 class="wp-block-heading">Full example</h3>



<pre class="wp-block-code"><code lang="json" class="language-json">provider "vsphere" {
    user = "admin"
    password = "MyAwesomePassword"
    vsphere_server = "pcc-XXX-XXX-XXX-XXX.ovh.com"
}

data "vsphere_datacenter" "dc" {
  name = "pcc-XXX-XXX-XXX-XXX_datacenter3113"
}

data "vsphere_datastore" "datastore" {
  name          = "pcc-001234"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "Cluster1"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_network" "network" {
  name          = "vxw-dvs-57-virtualwire-2-sid-5001-Dc3113_5001"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_virtual_machine" "template" {
  name          = "UBUNTU"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

resource "vsphere_virtual_machine" "vm" {
  name             = "vm01"
  resource_pool_id = "${data.vsphere_compute_cluster.cluster.resource_pool_id}"
  datastore_id     = "${data.vsphere_datastore.datastore.id}"
  guest_id         = "${data.vsphere_virtual_machine.template.guest_id}"
  scsi_type        = "${data.vsphere_virtual_machine.template.scsi_type}"

  network_interface {
    network_id = "${data.vsphere_network.network.id}"
  }

  disk {
    label = "disk0"
    size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  }

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"

    customize {
      linux_options {
        host_name = "vm01"
        domain     = "example.com"
      }

      network_interface {
        ipv4_address = "192.168.1.2"
        ipv4_netmask = 24
      }

      ipv4_gateway    = "192.168.1.254"
      dns_suffix_list = ["example.com"]
      dns_server_list = ["192.168.1.1"]
    }
  }
}</code></pre>



<h3 class="wp-block-heading">3&#8230; 2&#8230; 1&#8230; Ignition</h3>



<p>Let&#8217;s look at how to use our new config file with Terraform&#8230;</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/04/IMG_0223-1024x405.jpg" alt="OVH Private Cloud and HashiCorp Terraform" class="wp-image-15429" width="768" height="304" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-1024x405.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-300x119.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-768x303.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-1200x474.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223.jpg 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h4 class="wp-block-heading">Initialisation</h4>



<p><span class="tlid-translation translation" lang="en">Now that our configuration file is ready, we can use it to create our virtual machine. Let&#8217;s start by initialising the working environment with the <code>terraform init</code> command. This will take care of downloading the vSphere provider and creating the different files that Terraform needs to work.</span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform init

Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "vsphere" (1.10.0)...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

...

* provider.vsphere: version = "~> 1.10"

Terraform has been successfully initialized!
...</code></pre>



<h4 class="wp-block-heading">Plan</h4>



<p><span class="tlid-translation translation" lang="en"><span title="">The next step is to execute the <code>terraform plan</code> command to validate that our configuration file contains no errors and to visualise all the actions that Terraform will perform.</span><br></span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  + vsphere_virtual_machine.vm
      id:                                                   &lt;computed>
      boot_retry_delay:                                     "10000"
      change_version:                                       &lt;computed>
      clone.#:                                              "1"
      clone.0.customize.#:                                  "1"
      clone.0.customize.0.dns_server_list.#:                "1"
      clone.0.customize.0.dns_server_list.0:                "192.168.1.1"
      clone.0.customize.0.dns_suffix_list.#:                "1"
      clone.0.customize.0.dns_suffix_list.0:                "example.com"
      clone.0.customize.0.ipv4_gateway:                     "172.16.0.1"
      clone.0.customize.0.linux_options.#:                  "1"
      clone.0.customize.0.linux_options.0.domain:           "example.com"
      clone.0.customize.0.linux_options.0.host_name:        "vm01"
      clone.0.customize.0.linux_options.0.hw_clock_utc:     "true"
      clone.0.customize.0.network_interface.#:              "1"
      clone.0.customize.0.network_interface.0.ipv4_address: "192.168.1.2"
      clone.0.customize.0.network_interface.0.ipv4_netmask: "16"
      clone.0.customize.0.timeout:                          "10"
      clone.0.template_uuid:                                "42061bc5-fdec-03f3-67fd-b709ec06c7f2"
      clone.0.timeout:                                      "30"
      cpu_limit:                                            "-1"
      cpu_share_count:                                      &lt;computed>
      cpu_share_level:                                      "normal"
      datastore_id:                                         "datastore-93"
      default_ip_address:                                   &lt;computed>
      disk.#:                                               "1"
      disk.0.attach:                                        "false"
      disk.0.datastore_id:                                  "&lt;computed>"
      disk.0.device_address:                                &lt;computed>
      ...

Plan: 1 to add, 0 to change, 0 to destroy.</code></pre>



<p><span class="tlid-translation translation" lang="en">It is important to take the time to check all the information returned by the <code>plan</code> command before proceeding. It would be a mess to delete virtual machines in production due to an error in the configuration file&#8230; In the example above, we see that Terraform will create a new resource (here, a VM) and not modify or delete anything, which is exactly the goal!</span></p>



<h4 class="wp-block-heading">Apply</h4>



<p><span class="tlid-translation translation" lang="en"><span title="">In the last step, the <code>terraform apply</code> command will actually configure the infrastructure according to the information present in the configuration file.</span> <span title="">As a first step, the <code>plan</code> command will be executed, and Terraform will ask you to validate by typing <code>yes</code>.</span><br></span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform apply
...

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

vsphere_virtual_machine.vm: Creating...
  boot_retry_delay:                                     "" => "10000"
  change_version:                                       "" => "&lt;computed>"
  clone.#:                                              "" => "1"
  clone.0.customize.#:                                  "" => "1"
  clone.0.customize.0.dns_server_list.#:                "" => "1"
  clone.0.customize.0.dns_server_list.0:                "" => "192.168.1.1"
  clone.0.customize.0.dns_suffix_list.#:                "" => "1"
  clone.0.customize.0.dns_suffix_list.0:                "" => "example.com"
  clone.0.customize.0.ipv4_gateway:                     "" => "192.168.1.254"
  clone.0.customize.0.linux_options.#:                  "" => "1"
  clone.0.customize.0.linux_options.0.domain:           "" => "example.com"
  clone.0.customize.0.linux_options.0.host_name:        "" => "terraform-test"
  clone.0.customize.0.linux_options.0.hw_clock_utc:     "" => "true"
  clone.0.customize.0.network_interface.#:              "" => "1"
  clone.0.customize.0.network_interface.0.ipv4_address: "" => "192.168.1.2"
  clone.0.customize.0.network_interface.0.ipv4_netmask: "" => "16"
  clone.0.customize.0.timeout:                          "" => "10"
  clone.0.template_uuid:                                "" => "42061bc5-fdec-03f3-67fd-b709ec06c7f2"
  clone.0.timeout:                                      "" => "30"
  cpu_limit:                                            "" => "-1"
  cpu_share_count:                                      "" => "&lt;computed>"
  cpu_share_level:                                      "" => "normal"
  datastore_id:                                         "" => "datastore-93"
  default_ip_address:                                   "" => "&lt;computed>"
  disk.#:                                               "" => "1"
...
vsphere_virtual_machine.vm: Still creating... (10s elapsed)
vsphere_virtual_machine.vm: Still creating... (20s elapsed)
vsphere_virtual_machine.vm: Still creating... (30s elapsed)
...
vsphere_virtual_machine.vm: Still creating... (1m50s elapsed)
vsphere_virtual_machine.vm: Creation complete after 1m55s (ID: 42068313-d169-03ff-9c55-a23e66a44b48)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.</code></pre>



<p><span class="tlid-translation translation" lang="en"><span class="" title="">When you connect to the vCenter of your Private Cloud, you should see a new virtual machine in the inventory!<br></span></span></p>



<h4 class="wp-block-heading">Next steps</h4>



<p>Now that we have seen a standard Terraform workflow, you may want to test some modifications to your configuration file. For example, you can add another virtual disk to your VM by modifying the <code>vsphere_virtual_machine</code> resource block like this:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">disk {
  label = "disk0"
  size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
}

disk {
  label = "disk1"
  size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  unit_number = 1
}</code></pre>



<p>Then run <code>terraform plan</code> to see what Terraform is going to do in order to reconcile the infrastructure state with your configuration file.</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...
vsphere_virtual_machine.vm: Refreshing state... (ID: 4206be6f-f462-c424-d386-7bd0a0d2cfae)

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ vsphere_virtual_machine.vm
      disk.#:                  "1" => "2"
      disk.1.attach:           "" => "false"
      disk.1.datastore_id:     "" => "&lt;computed>"
      ...


Plan: 0 to add, 1 to change, 0 to destroy.</code></pre>
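


<p>As a side note, if you want to be certain that the apply step executes exactly the plan you have just reviewed, Terraform lets you save the plan to a file and apply that file (the <code>disk.tfplan</code> file name below is just an example):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Save the execution plan to a file...
$ terraform plan -out=disk.tfplan
# ...then apply exactly that saved plan
$ terraform apply disk.tfplan</code></pre>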



<p>If you agree with Terraform&#8217;s proposed actions, you can rerun <code>terraform apply</code> to add the new virtual disk to your virtual machine.</p>



<h4 class="wp-block-heading">Clean it up</h4>



<p>When you have finished your tests and no longer need the infrastructure, you can simply run the <code>terraform destroy</code> command to delete all the previously-created resources. Be careful with this command, as there is no way to get your data back afterwards!</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform destroy

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...
vsphere_virtual_machine.vm: Refreshing state... (ID: 42068313-d169-03ff-9c55-a23e66a44b48)

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  - vsphere_virtual_machine.vm


Plan: 0 to add, 0 to change, 1 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

vsphere_virtual_machine.vm: Destroying... (ID: 42068313-d169-03ff-9c55-a23e66a44b48)
vsphere_virtual_machine.vm: Destruction complete after 3s

Destroy complete! Resources: 1 destroyed.</code></pre>



<p>In this article, we have seen how to deploy a virtual machine with a Terraform configuration file. This allowed us to learn the basic commands <code>plan</code>, <code>apply</code> and <code>destroy</code>, as well as the notions of <code>provider</code>, <code>data</code> and <code>resource</code>. In the next article, we will build on this example, modifying it to make it more adaptable and generic.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprivate_cloud_and_hashicorp_terraform_part1%2F&amp;action_name=OVH%20Private%20Cloud%20and%20HashiCorp%20Terraform%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Dedicated Servers: twice the bandwidth for the same price</title>
		<link>https://blog.ovhcloud.com/dedicated-servers-twice-the-bandwidth-for-the-same-price/</link>
		
		<dc:creator><![CDATA[Yaniv Fdida]]></dc:creator>
		<pubDate>Wed, 27 Mar 2019 10:02:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[Bare Metal servers]]></category>
		<category><![CDATA[Datacenters & network]]></category>
		<category><![CDATA[Evolution]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15155</guid>

					<description><![CDATA[We announced it at the OVH Summit 2018&#8230; We were going to double the public bandwidth on OVH dedicated servers, without changing the price. A promise is a promise, so several weeks ago we fulfilled it: your servers now have twice the bandwidth, for the same price! We knew from the start that this upgrade [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdedicated-servers-twice-the-bandwidth-for-the-same-price%2F&amp;action_name=Dedicated%20Servers%3A%20twice%20the%20bandwidth%20for%20the%20same%20price&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[<p>We announced it at the OVH Summit 2018&#8230; We were going to <strong>double the public bandwidth</strong> on OVH dedicated servers, without changing the price.</p>
<p>A promise is a promise, so several weeks ago we fulfilled it: <strong>your servers now have twice the bandwidth, for the same price!</strong></p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-15270 size-medium" src="/blog/wp-content/uploads/2019/03/IMG_0185-300x218.jpg" alt="2019-03-27 - Dedicated servers : twice the bandwidth for the same price" width="300" height="218" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185-300x218.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185-768x557.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185.jpg 779w" sizes="auto, (max-width: 300px) 100vw, 300px" /></p>
<p>We knew from the start that this upgrade would be feasible, as our 20Tbps network core can definitely cope with the extra load!  We work daily to make sure you enjoy using our network, which is one of the largest in the world among hosting providers.</p>
<p>Indeed, our network is constantly evolving, and our teams work tirelessly to optimise the capacity planning and anticipate the load generated by all our customers, spread across our 28 datacentres.</p>
<p>It&#8217;s also more than capable of managing the <strong>waves of DDoS attacks</strong> that arrive almost daily, sending millions of requests to hosted servers in an attempt to render them unavailable. These are <strong>absorbed</strong> by our in-house <strong>Anti-DDoS Protection</strong>, without any customer impact! As a reminder, we suffered <a href="https://www.ovh.com/world/articles/news/a2367.the-ddos-that-didnt-break-the-camels-vac" data-wpel-link="exclude">one of the biggest attacks on record a few years ago</a>, which generated traffic of more than 1Tbps, but was nonetheless absorbed by our infrastructure, without any impact on our customers.</p>
<p>To guarantee this additional public bandwidth, our Network and Bare Metal teams have worked closely together to be more and more LEAN when it comes to our infrastructures. As a result, thousands of active devices (routers, switches, servers etc.) have been <strong>updated in a completely transparent</strong> <strong>way!</strong></p>
<p>The overall deployment process has taken some time, as we have done a rolling upgrade, taking a QoS and isolation approach to prevent possible traffic spikes. Product range by product range, datacentre by datacentre&#8230; The deployment itself was quick and painless, as it was fully automated. The potential bottleneck was making sure that everything worked as intended, which involved carefully monitoring our full server farm, as bandwidth doubling can have a huge impact, especially at OVH, where (let me mention it once again!) <strong>egress traffic</strong> is indeed <strong>unlimited</strong>!</p>
<p>Here&#8217;s a quick overview of the new bandwidth for each server range:</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-15169 size-full" src="/blog/wp-content/uploads/2019/03/Bandwith.png" alt="" width="1180" height="712" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith.png 1180w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-768x463.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-1024x618.png 1024w" sizes="auto, (max-width: 1180px) 100vw, 1180px" /></p>
<p>Even if the bandwidth doubling doesn&#8217;t yet cover the full extent of our ranges, or the So you Start and Kimsufi servers, we haven&#8217;t forgotten the customers who are using those servers. We have also updated our <a href="https://www.ovh.com/fr/serveurs_dedies/upgrade-bande-passante.xml" data-wpel-link="exclude">bandwidth options</a> to offer all our customers an even better service, at an even better price.</p>
<p>We aren&#8217;t going to stop there though! We will soon announce some nice new features on the network side of things.  And of course, lots of other innovations will arrive in the coming months. But those are other stories, which will be told in other blog posts&#8230; &#x1f609;</p>
<p><img loading="lazy" decoding="async" class="sc-htpNat eEWWwP aligncenter" src="https://media1.giphy.com/media/jaXDDTuKmeJvwI56kV/200w.gif?cid=3640f6095c8bc77b366f436e55606833" alt="stay tuned watch GIF by Gadi Schwartz NBC News" width="373" height="211" /><img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdedicated-servers-twice-the-bandwidth-for-the-same-price%2F&amp;action_name=Dedicated%20Servers%3A%20twice%20the%20bandwidth%20for%20the%20same%20price&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" /></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Continuous Delivery and Deployment Workflows with CDS</title>
		<link>https://blog.ovhcloud.com/continuous-delivery-and-deployment-workflows-with-cds/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Fri, 01 Mar 2019 12:38:13 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Industrialization]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14718</guid>

					<description><![CDATA[The CDS Workflow is a key feature of OVH CI/CD Platform. This structuring choice to add an additional concept above CI/CD pipelines and jobs is definitely an essential feature after more than three years of intense use.

Before going further on the explanation of a CDS workflow, we will make some reminders about the concepts of pipelines and jobs. Those concepts are based on the reference book 8 Principles of Continuous Delivery.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontinuous-delivery-and-deployment-workflows-with-cds%2F&amp;action_name=Continuous%20Delivery%20and%20Deployment%20Workflows%20with%20CDS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>The CDS Workflow is a key feature of the OVH CI/CD Platform. This structural choice adds an additional concept above&nbsp;CI/CD&nbsp;pipelines and jobs, and after&nbsp;more than three years of intensive use, is definitely an essential feature.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1024x354.jpeg" alt="Continuous Delivery and Deployment Workflows with CDS" class="wp-image-14861" width="768" height="266" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1024x354.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-300x104.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-768x266.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1200x415.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5.jpeg 1529w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>Before delving into a full explanation of CDS workflows, let&#8217;s review some of the key concepts behind pipelines and jobs.&nbsp;Those concepts are drawn from the reference book,&nbsp;<a href="https://devopsnet.com/2011/08/04/continuous-delivery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">8 Principles of Continuous Delivery</a></p>



<h3 class="wp-block-heading">The basic element: “The job”</h3>



<p>A job is composed of steps, which are run sequentially. A job is executed in a dedicated workspace (i.e. filesystem), and a new workspace is assigned for each new run of a job.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE.png" alt="CDS Job" class="wp-image-14858" width="512" height="444" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE.png 1276w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-300x259.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-768x664.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-1024x885.png 1024w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>A standard build job looks like this:</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D.png" alt="CDS build job" class="wp-image-14859" width="447" height="373" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D.png 1156w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-300x250.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-768x640.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-1024x853.png 1024w" sizes="auto, (max-width: 447px) 100vw, 447px" /></figure></div>



<p>You can use &#8220;built-in&#8221; actions, such as <code>checkoutApplication</code>, <code>script</code>, <code>jUnit</code> and artifact upload/download.</p>



<ul class="wp-block-list"><li>The&nbsp;<b><i>c</i></b><b><i>heckoutApplication</i></b>&nbsp;action clones&nbsp;your Git repository</li><li data-listid="1" data-aria-posinset="2" data-aria-level="1">The <b><i>Script</i></b>&nbsp;action executes your build command as “make build”</li><li data-listid="1" data-aria-posinset="1" data-aria-level="1">The&nbsp;<b><i>artifactUpload</i></b>&nbsp;action uploads&nbsp;previously-built binaries</li><li data-listid="1" data-aria-posinset="2" data-aria-level="1">The <b><i>jUnit</i></b>&nbsp;action parses a given Junit-formatted XML file to extract its test results</li></ul>



<h3 class="wp-block-heading">A pipeline: How to orchestrate your jobs with stages</h3>



<p>With CDS, a pipeline is not a job flow. A pipeline is a sequence of stages, each of which contains one or more jobs.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1024x847.png" alt="CDS Pipeline" class="wp-image-14857" width="512" height="424" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1024x847.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-300x248.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-768x635.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1200x992.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64.png 1364w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>A <strong>stage</strong> is a <strong>set of jobs that will be run in parallel</strong>. Stages are executed sequentially: a stage only starts if the previous stage was successful.</p>



<p>Let&#8217;s take a real-life use case: the pipeline that builds CDS. This pipeline has four stages:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="520" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds.png" alt="" class="wp-image-14721" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds-300x176.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds-768x451.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<ul class="wp-block-list"><li>The “Build Minimal” stage is launched for all Git branches. The main goal of this stage is to compile the Linux version of the CDS binaries.</li><li>The “Build other os/arch” stage is <i>only</i> launched on the master branch. This stage compiles the binaries for all the supported OS/architecture combinations: linux, openbsd, freebsd, darwin and windows, on 386, amd64 and arm.</li><li>The “Package” stage is launched for all Git branches. This stage prepares the Docker image and the Debian package.</li><li>Finally, the “Publish” stage is launched, whatever the Git branch.</li></ul>



<p>Most tasks are executed in parallel whenever possible. This results in very fast feedback, so we quickly know whether the compilation is OK or not.</p>



<h3 class="wp-block-heading">CDS Workflows: How to orchestrate your pipelines</h3>



<p>The workflow concept is a key feature, treated in CDS as a native, manageable and feature-rich entity. A CDS workflow allows you to chain pipelines with manual or automatic gates, using conditional branching. A workflow can be stored as code, designed in the CDS UI, or both, depending on what suits you best.</p>



<p>Let&#8217;s take an example: one workflow for building and deploying three micro-services:</p>



<ul class="wp-block-list"><li>Build each micro-service</li><li>Deploy them in preproduction</li><li>Run integration tests on the preproduction environment</li><li>Deploy them in production, then re-run the integration tests in production</li></ul>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="198" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow.png" alt="" class="wp-image-14728" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow-300x67.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow-768x172.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>For the build part, there is only one pipeline to manage, which is used three times in the workflow, with a different application/environment context each time. This is called the “pipeline context”.</p>



<p>Any conditional branching in the workflow (e.g. “automatic deployment on the staging environment, only if the current Git branch is master”) can be expressed through “run conditions” set on the pipeline.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="528" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_run_conditions.png" alt="" class="wp-image-14723" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions-300x179.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions-768x458.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Let&#8217;s look at a real use case. This is the workflow that builds, tests and deploys CDS in production at OVH (<em>yes, CDS builds and deploys itself!</em>):</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="446" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds.png" alt="" class="wp-image-14724" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds-300x151.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds-768x387.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<ol class="wp-block-list"><li>For each Git commit, the workflow is triggered</li><li>The UI is packaged, all the binaries are prepared, and the Docker images are built. The “UT” job launches the unit tests, while the “IT” job installs CDS in an ephemeral environment and launches the integration tests on it. Part 2 is automatically triggered for all Git commits.</li><li>Part 3 deploys CDS on our preproduction environment, then launches the integration tests on it. It is started automatically when the current branch is the master branch.</li><li>Last but not least, part 4 deploys CDS on our production environment.</li></ol>



<p>If there is a failure on a pipeline, it may look like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="298" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed.png" alt="" class="wp-image-14725" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed-768x259.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>The same kind of workflow is used for building and deploying the Prescience project (<a class="Hyperlink SCXW149395370" href="https://labs.ovh.com/machine-learning-platform" target="_blank" rel="noopener noreferrer" data-wpel-link="exclude">https://labs.ovh.com/machine-learning-platform</a>):</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience.png" alt="" class="wp-image-14726" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>But of course, you&#8217;re not limited to complex tasks with CDS Workflows! These two examples demonstrate that workflows allow you to build and deploy a coherent set of micro-services. If you have simpler needs, your workflows are, of course, simpler.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="204" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-300x204.png" alt="" class="wp-image-14727" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-300x204.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-768x522.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple.png 856w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>Pipeline reusability allows you to easily maintain the technical parts of your builds, tests and deployments, even if you have hundreds of applications. If hundreds of applications share the same kind of workflow, you can benefit from the maintainability of workflow templates. We will talk more about this in a future post.</p>



<h3 class="wp-block-heading">Much more than &#8220;Pipeline as Code&#8221;&#8230; &#8220;Workflow as Code&#8221;</h3>



<p>There is no compromise with CDS. Some users prefer to draw their workflows in the web UI, while others prefer to write YAML code. CDS lets you do both!</p>



<p>There are two ways to store workflows: either in the CDS database, or in your Git repository alongside your source code. We call the latter &#8220;Workflow as Code&#8221;.</p>



<p>This makes it possible to have a workflow on a given branch, and then develop it on another branch. CDS will instantiate the workflow on the fly, based on the YAML code present on the current branch.</p>
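<p>As an illustrative sketch, a workflow stored as code (conventionally under a <code>.cds/</code> directory in the repository) might look like the following YAML; the workflow, pipeline and application names here are hypothetical:</p>

```yaml
# .cds/my-workflow.yml -- illustrative sketch; names are hypothetical
name: my-workflow
version: v1.0
workflow:
  build:
    pipeline: build-pipeline
    application: my-app
  deploy:
    depends_on:
      - build
    pipeline: deploy-pipeline
    application: my-app
    environment: production
```

<p>Because this file lives on the branch, each branch can carry its own version of the workflow, which is exactly what lets CDS instantiate it on the fly.</p>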



<p>CDS is OVH&#8217;s open-source software, and can be found at <a href="https://github.com/ovh/cds" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">https://github.com/ovh/cds</a>, with documentation at <a href="https://ovh.github.io/cds" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">https://ovh.github.io/cds</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="150" src="/blog/wp-content/uploads/2019/02/cds-header-300x150.jpg" alt="CDS" class="wp-image-14628" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header-300x150.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header-768x384.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header.jpg 800w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>Previous posts:</p>



<ul class="wp-block-list"><li>CDS introduction: <a href="https://www.ovh.com/fr/blog/how-does-ovh-manage-the-ci-cd-at-scale/" data-wpel-link="exclude">https://www.ovh.com/fr/blog/how-does-ovh-manage-the-ci-cd-at-scale/</a></li><li>DataBuzzWord podcast (French): <a href="https://www.ovh.com/fr/blog/understanding-ci-cd-for-big-data-and-machine-learning/" data-wpel-link="exclude">https://www.ovh.com/fr/blog/understanding-ci-cd-for-big-data-and-machine-learning/</a></li></ul>



<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontinuous-delivery-and-deployment-workflows-with-cds%2F&amp;action_name=Continuous%20Delivery%20and%20Deployment%20Workflows%20with%20CDS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How does OVH manage the CI/CD at scale?</title>
		<link>https://blog.ovhcloud.com/how-does-ovh-manage-the-ci-cd-at-scale/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Thu, 14 Feb 2019 15:22:39 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Industrialization]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14461</guid>

					<description><![CDATA[From git commit to production, the delivery process is the set of steps that take place to deliver your service to your customers. Continuous Integration and Continuous Delivery – CI/CD - are practices based on the Agile Values which aim to automate this process as much as possible.



The Continuous Delivery Team @OVH has a mission: to help the OVH developers to industrialize and automate their delivery process. The CD team is here to advocate CI/CD best practices and maintain the ecosystem tools, with a maximum focus on as-a-service solutions.



The central point of this ecosystem is a tool built in-house at OVH, named CDS.
CDS is an OVH opensource software, you will find it on https://github.com/ovh/cds with documentation on https://ovh.github.io/cds.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-does-ovh-manage-the-ci-cd-at-scale%2F&amp;action_name=How%20does%20OVH%20manage%20the%20CI%2FCD%20at%20scale%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
					<content:encoded><![CDATA[<p><i>The delivery process is the set of steps &#8211; from git commit to production &#8211; that take place to deliver your service to your customers. Drawing on agile values, Continuous Integration and Continuous Delivery (CI/CD) are practices that aim to automate this process as much as possible.</i></p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-14529" src="/blog/wp-content/uploads/2019/02/FE68B6A7-7885-4C60-8FF4-B929005EEF96-300x56.png" alt="From git to production" width="512" height="97" /></p>
<p><i>The Continuous Delivery Team at OVH has one fundamental mission: to help the OVH developers industrialise and automate their delivery processes. The CD team is here to advocate CI/CD best practices and maintain our ecosystem tools, with the maximum focus on as-a-service solutions.</i></p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-14512" src="/blog/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png" alt="CDS" width="297" height="300" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png 297w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-768x775.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85.png 793w" sizes="auto, (max-width: 297px) 100vw, 297px" /></p>
<p><i>The centre of this ecosystem is a tool called CDS, developed in-house at OVH.</i><br />
<i>CDS is an open-source software solution that can be found at </i><a style="font-style: italic;" href="https://github.com/ovh/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/cds,</a><i> with documentation at </i><a style="font-style: italic;" href="https://ovh.github.io/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://ovh.github.io/cds</a><i>.</i></p>
<p>CDS is the third generation of CI/CD tools at OVH, following two previous solutions based on Bash, Jenkins, GitLab and Bamboo. It is the end result of 12 years&#8217; experience in the field of CI/CD. Familiar with most of the industry&#8217;s standard tools, we found that none completely matched our expectations on the four key aspects we had identified. That is what CDS tries to solve.</p>
<p>These four aspects are:</p>
<h3><strong>Elastic</strong></h3>
<p>CDS resources/workers are <strong>launched on demand</strong>, to guarantee low waiting times for users, with no over-consumption of idle resources.</p>
<h3><strong>Extensible</strong></h3>
<p>In CDS, any kind of action (Kubernetes and OpenStack deployments, pushing to Kafka, testing for CVEs…) can be captured in <strong>high-level plugins</strong>, to be used as <strong>building blocks</strong> by users. These plugins are straightforward to write and use, so it&#8217;s easy to meet the most exotic needs in an effective and stress-free way.</p>
<h3><strong>Flexible, but easy</strong></h3>
<p>CDS can run <strong>complex workflows</strong>, with all sorts of intermediary steps, including build, test, deploy 1/10/100, manual or automatic gates, rollback, conditional branches… These workflows can be <strong>stored as code</strong> in the git repository. CDS provides basic <strong>workflow templates</strong> for the Core team&#8217;s most common scenarios, in order to ease the adoption process. This way, building a functional CI/CD chain from nothing can be quick and easy.</p>
<h3><strong>Self-service</strong></h3>
<p>Finally, a key aspect is the idea of<strong> self-service</strong>. Once a CDS project is created by users, they are completely autonomous within that space, with the freedom to manage pipelines, delegate access rights etc. All users are free to customise their space as they see fit, and build on what is provided out-of-the-box. Personalising workflow templates, plugins, running build and tests on custom VM flavors or custom hardware… all this can be done without any intervention from the CDS administrators.</p>
<h3><strong>CI/CD in 2018 &#8211; 5.7 million workers!</strong></h3>
<ul>
<li>About 5.7M workers started and deleted on demand:</li>
<li>3.7M containers</li>
<li>2M virtual machines</li>
</ul>
<h3>How is it possible?</h3>
<p>One of the initial CDS objectives at OVH was to build and deploy 150 applications as containers in less than seven minutes. This has been a reality since 2015. So what&#8217;s the secret? Auto-scaling on demand!</p>
<p>With this approach, you can have hundreds of worker models that CDS will launch via hatcheries whenever necessary.</p>
<p><figure id="attachment_14542" aria-describedby="caption-attachment-14542" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14542" src="/blog/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-300x76.png" alt="CDS Hatchery" width="512" height="130" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-300x76.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-768x194.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14542" class="wp-caption-text">CDS Hatchery</figcaption></figure></p>
<p>&nbsp;</p>
<p>A hatchery is like an incubator: it gives birth to the CDS workers and maintains the power of life and death over them.</p>
<p><figure id="attachment_14546" aria-describedby="caption-attachment-14546" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14546" src="/blog/wp-content/uploads/2019/02/IMG_0052-300x206.png" alt="CDS Hatcheries - Worker @Scale" width="512" height="352" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052-300x206.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052-768x528.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14546" class="wp-caption-text">CDS Hatcheries &#8211; Worker @Scale</figcaption></figure></p>
<p>&nbsp;</p>
<p>Each hatchery is dedicated to an orchestrator, and a single CDS instance can create workers across many cloud platforms:</p>
<ul>
<li>The <strong>Kubernetes</strong> hatchery starts workers in pods</li>
<li>The <strong>OpenStack</strong> hatchery starts virtual machines</li>
<li>The <strong>Swarm</strong> hatchery starts Docker containers</li>
<li>The <strong>Marathon</strong> hatchery starts Docker containers</li>
<li>The <strong>VSphere</strong> hatchery starts virtual machines</li>
<li>The <strong>local</strong> hatchery starts processes on a host</li>
</ul>
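<p>The incubator idea described above (spawn a worker when a job needs one, keep control of its lifecycle) can be sketched in Go, CDS&#8217;s implementation language. This is a simplified illustration with made-up names, not CDS&#8217;s actual code:</p>

```go
package main

import "fmt"

// Job is a queued CDS job waiting for a worker (simplified).
type Job struct{ ID int }

// Orchestrator abstracts the platform a hatchery drives:
// Kubernetes pods, OpenStack VMs, Swarm containers, etc.
type Orchestrator interface {
	SpawnWorker(j Job) string
}

// Hatchery births workers on demand for queued jobs.
type Hatchery struct{ orch Orchestrator }

// Drain starts one worker per pending job and returns their names.
func (h Hatchery) Drain(queue []Job) []string {
	started := make([]string, 0, len(queue))
	for _, j := range queue {
		started = append(started, h.orch.SpawnWorker(j))
	}
	return started
}

// podOrchestrator is a stand-in for a Kubernetes-style backend.
type podOrchestrator struct{}

func (podOrchestrator) SpawnWorker(j Job) string {
	return fmt.Sprintf("pod/worker-%d", j.ID)
}

func main() {
	h := Hatchery{orch: podOrchestrator{}}
	for _, w := range h.Drain([]Job{{ID: 1}, {ID: 2}}) {
		fmt.Println(w)
	}
}
```

<p>In the real system, each orchestrator listed above would provide its own implementation behind such an interface, which is what lets one CDS instance drive several platforms at once.</p>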
<p><figure id="attachment_14548" aria-describedby="caption-attachment-14548" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14548" src="/blog/wp-content/uploads/2019/02/IMG_0053-300x87.png" alt="CDS Hatcheries" width="512" height="148" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053-300x87.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053-768x222.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14548" class="wp-caption-text">CDS Hatcheries</figcaption></figure></p>
<h3>What&#8217;s next?</h3>
<p>This is all just a <strong>preview of CDS</strong>&#8230; we have lots more to tell you about! The CI/CD tool offers a wide range of features that we will explore in depth in our <strong>upcoming articles</strong>. We promise, before 2019 is done, you will not look at your CI/CD tool the same way again&#8230;<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-does-ovh-manage-the-ci-cd-at-scale%2F&amp;action_name=How%20does%20OVH%20manage%20the%20CI%2FCD%20at%20scale%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" /></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding CI/CD for Big Data and Machine Learning</title>
		<link>https://blog.ovhcloud.com/understanding-ci-cd-for-big-data-and-machine-learning/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Thu, 14 Feb 2019 12:28:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DataBuzzWord]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Podcast]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14588</guid>

					<description><![CDATA[This week, the OVH Integration and Continuous Deployment team was invited to the&#160;DataBuzzWord&#160;podcast. Together, we explored the topic of continuous deployment in the context of machine learning and big data.&#160;We also discussed continuous deployment for environments like&#160;Kubernetes, Docker, OpenStack and&#160;VMware VSphere. If you missed it, or would like to review everything that was discussed, you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-ci-cd-for-big-data-and-machine-learning%2F&amp;action_name=Understanding%20CI%2FCD%20for%20Big%20Data%20and%20Machine%20Learning&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>This week, the OVH Integration and Continuous Deployment team was invited to the&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">DataBuzzWord</a>&nbsp;podcast.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="297" height="300" src="/blog/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png" alt="CDS" class="wp-image-14512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png 297w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-768x775.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85.png 793w" sizes="auto, (max-width: 297px) 100vw, 297px" /></figure></div>



<p>Together, we explored the topic of continuous deployment in the context of machine learning and big data.&nbsp;We also discussed continuous deployment for environments like&nbsp;<a href="https://www.ovh.com/fr/blog/?s=kubernetes" data-wpel-link="exclude">Kubernetes</a>, Docker, OpenStack and&nbsp;<a href="https://www.ovh.com/fr/blog/?s=vmware" data-wpel-link="exclude">VMware VSphere</a>.</p>



<p>If you missed it, or would like to review everything that was discussed, you can&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">listen to it again here</a>. We hope to return soon, to continue sharing our passion for testing, integration and continuous deployment.</p>



<p>Although the podcast was recorded in French, starting from tomorrow, we&#8217;ll be delving further into the key points of our discussion in a series of articles on this blog.</p>


<div class="lazyblock-youtube-gdpr-compliant-ZRXyrv wp-block-lazyblock-youtube-gdpr-compliant"><script type="module">
  import 'https://blog.ovhcloud.com/wp-content/assets/ovhcloud-gdrp-compliant-embedding-widgets/src/ovhcloud-gdrp-compliant-spreaker.js';
</script>
      
      <ovhcloud-gdrp-compliant-spreaker
          spreaker="17021384"
          debug></ovhcloud-gdrp-compliant-spreaker> 

</div>


<p>Find CDS on GitHub:</p>



<ul class="wp-block-list"><li><a href="https://github.com/ovh/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/cds</a></li></ul>



<p>&#8230; and follow us on Twitter:</p>



<ul class="wp-block-list"><li><a href="https://twitter.com/yesnault" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/yesnault</a></li><li><a href="https://twitter.com/francoissamin" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/francoissamin</a></li></ul>



<p>Come chat about these subjects with us on our Gitter channel:&nbsp;<a href="https://gitter.im/ovh-cds/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh-cds/</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-ci-cd-for-big-data-and-machine-learning%2F&amp;action_name=Understanding%20CI%2FCD%20for%20Big%20Data%20and%20Machine%20Learning&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
