<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Automation Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/automation/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/automation/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 01 Mar 2022 15:49:38 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Automation Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/automation/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Warden: the self-healing framework for local actions</title>
		<link>https://blog.ovhcloud.com/warden-the-self-healing-framework-for-local-actions/</link>
		
		<dc:creator><![CDATA[Alexandre Gauthier]]></dc:creator>
		<pubDate>Wed, 09 Dec 2020 11:23:14 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Self-heal at Webhosting]]></category>
		<category><![CDATA[SRE]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19951</guid>

					<description><![CDATA[This article is the follow up to Selfheal at Webhosting &#8211; The External Part published on 2020-07-17. Part two below covers the local self-healing system. Introduction With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data-centers and geographical zones, a certain number of software failures [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwarden-the-self-healing-framework-for-local-actions%2F&amp;action_name=Warden%3A%20the%20self-healing%20framework%20for%20local%20actions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>This article is the follow up to <a href="https://www.ovh.com/blog/selfheal-at-webhosting-the-external-part/" data-wpel-link="exclude">Selfheal at Webhosting &#8211; The External Part</a> published on 2020-07-17.<br>Part two below covers the local self-healing system.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/11/IMG_0392-1024x537.png" alt="Warden: the self-healing framework for local actions" class="wp-image-20157" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/11/IMG_0392.png 1200w" sizes="(max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Introduction">Introduction</h3>



<p>With over 15,000 servers dedicated to providing services for 6 million websites and web applications of all sorts, across multiple data-centers and geographical zones, a certain number of software failures is inevitable. They must be handled to ensure the servers are in a functional state to provide continuity of service.</p>



<p>The overhead only increases once you account for the supporting pieces of the infrastructure that provide the service, or that are used by clients to access and manage their data.</p>



<p>Generally speaking, restarting failed services and reacting to failing health checks with automatic operations can be done swiftly with a simple installation of, for example, <em>Monit</em>, or with <em>systemd</em> unit parameters.</p>
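<p>As an illustration, a minimal <em>systemd</em> drop-in override of this kind might look like the following (the values are purely illustrative):</p>

```ini
# Drop-in override for a service unit (values illustrative):
# restart the service on failure, but give up if it crash-loops.
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
```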



<p>Web-hosting infrastructure, however, poses unique challenges that require a holistic response.</p>



<p>It&#8217;s not only large, but it&#8217;s distributed and&nbsp;<a href="https://www.ovh.com/blog/web-hosting-how-to-host-3-million-websites/" data-wpel-link="exclude">highly available</a>. A web host encountering a failure will not degrade the service, as another node in a cluster will immediately take its place to service client requests.</p>



<p>Additionally, providing Shared Hosting as a service means you are mostly running Unknown Workloads. No two websites have the same requirements, performance, or behavior. You therefore can&#8217;t make assumptions about what is normal and what isn&#8217;t, which in turn makes establishing a baseline for Abnormal Behavior difficult.</p>



<p>In this context, it is generally&nbsp;an inevitable fact of life that sometimes those workloads will misbehave, crash, or put the system into a state it cannot recover from without intervention.</p>



<p>Trying to prevent this is therefore futile. Facilitating recovery within isolated fault domains is a more productive approach and is where self-healing becomes useful. </p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Self-healingsystems">Self-healing systems</h3>



<p>While the highly available nature of the infrastructure means failure states don&#8217;t necessarily degrade the service &#8211; the cause still needs to be investigated and the system recovered before being returned to the pool of available hosts to serve requests.</p>



<p>Without automated systems in place to achieve this, it can easily turn into a battle of attrition. Systems to diagnose and clear can pile up and eat into actual time spent on improvements and long-term mitigation of failure states.</p>



<p>We therefore employ two self-healing systems at Webhosting to automate the process:</p>



<ul class="wp-block-list"><li><a href="https://www.ovh.com/blog/selfheal-at-webhosting-the-external-part/" data-wpel-link="exclude">Healer: External self-healing</a>, which handles hardware problems, the absence of connectivity, and anything the local systems can&#8217;t resolve on their own.</li><li>Warden: A local agent that exposes a framework for self-healing on local nodes. Warden is the component we will be exploring today.</li></ul>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-EnterWarden">Enter Warden</h3>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0393.png" alt="" class="wp-image-20161" width="302" height="236" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0393.png 604w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0393-300x234.png 300w" sizes="(max-width: 302px) 100vw, 302px" /></figure></div>



<p>Warden was designed as a simple, lightweight daemon process that exposes a plugin API, allowing members of the SRE team to quickly write small pluggable Python scripts that handle specific conditions found on the local system. It is meant to exist as an agent on every single server of the web-hosting fleet, where it will work to maintain integrity and record information about failure states.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Goals">Goals</h3>



<p>Warden has a few specific long-term goals, which are worth going over.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-MaximizeSystemAvailability">Maximize system availability</h4>



<p>Warden attempts to detect scenarios that would degrade or otherwise disrupt the service and responds to fault events from the monitoring system. This allows for the quick return of the system to a functional, clean state, so it can rejoin the pool of available hosts and serve requests again. Being a local, per-server process, Warden is able to be reactive and process events in a timely fashion, avoiding network round trips and monitoring delays. This contributes to the general health of the infrastructure by keeping the number of hosts in a failure state at a bare minimum.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Logdiagnosticdataforlateranalysis">Log diagnostic data for later analysis</h4>



<p>Being a local agent present on every system, Warden is in the enviable position of being able to collect all sorts of surrounding data for export upon detecting a failure state.</p>



<p>Warden keeps a detailed record of the failure state and surrounding system state, to be queried later. This ensures diagnosis is not a blocking point for returning the host to duty. It is also important to remember the goal is&nbsp;<strong>not</strong>&nbsp;to sweep failure states under the carpet, or mask them.</p>



<p>Additionally, since many of these failure states are non-critical (as other hosts take over transparently), it may be&nbsp;<em>multiple days</em>&nbsp;before someone gets to look at it, at which point the relevant state to inspect is long gone, and we&#8217;re just left with an empty, yet offline, server.</p>



<p>The primary goal here is actually to increase visibility into failure states, and to be able to quickly identify trends and underlying issues that must be mitigated or resolved, while ensuring the relevant data is kept while fresh.</p>



<p>At runtime, Warden generates snapshots of interesting system aspects. A long-term goal is to capture a meaningful representation of the entire system state at the time of the event, removing the need to perform diagnostics directly on affected hosts.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-MinimizeHumanOverhead">Minimise human overhead</h4>



<p>Analysis of failure states can be highly time-consuming, especially if you&#8217;re flooded by hundreds of systems reporting mostly the same issue. It can also be irritating to constantly deal with transient failure states that are considered &#8220;normal&#8221;, either due to known popular application bugs, or other known circumstances. Just sorting the signal from the noise can be a full-time job, especially if your team is actively trying to maintain general health and resolve the issue long term.</p>



<p>This can quickly turn into a battle of attrition where resources are expended on managing alerts, failure states, and problems rather than on actively working to mitigate and resolve them.</p>



<p>Warden hopes to streamline this process massively, allowing SREs to focus on what actually matters and makes a difference in terms of Quality of Service.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Makewritingself-healingpluginseasy">Make writing self-healing plugins easy</h4>



<p>The Warden API is meant to be simple. It abstracts much of the nuts and bolts involved in plugin execution.</p>



<p>Plugin authors should not have to worry about scheduling their own run, or writing complex logic to obtain the information they are after, nor should they have to write solid logging code.</p>



<p>All of this should be handled by Warden. Plugin authors should be able to focus on describing their conditions, selecting what relevant data they want to record, and writing an action that hopefully restores functionality.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Howdoesitwork?">How does it work?</h3>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="1024" height="462" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0394-1024x462.png" alt="Warden - How does it work?" class="wp-image-20164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-1024x462.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-300x135.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-768x346.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394-1536x693.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0394.png 1905w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-WardenCore">Warden Core</h4>



<p>As previously mentioned, Warden is a small daemon written entirely in Python. On boot, it will enumerate the plugins it is configured to activate, and place them in a queue.</p>



<p>Plugins may have configuration values as well, exposing easily tunable thresholds for response, or other settings. The Warden Core essentially serves to orchestrate everything, as well as provide the plugin API.</p>



<p>It also keeps track of various internal decisions, plugin states and how many times a plugin has done a self-healing action.</p>
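<p>A hypothetical sketch of this boot sequence (the names below are illustrative, not the real implementation) could look like:</p>

```python
# Hypothetical sketch of a Warden-style boot sequence: enumerate the
# plugins the daemon is configured to activate and place them in a queue.
# All names here are illustrative, not Warden's actual code.
from collections import deque

def boot(config):
    """Build the plugin queue from the daemon's configuration."""
    plugin_queue = deque()
    for name, settings in config["plugins"].items():
        if settings.get("enabled", True):
            plugin_queue.append({
                "name": name,
                "settings": settings,   # tunable thresholds, etc.
                "heal_count": 0,        # how many self-heal actions so far
            })
    return plugin_queue
```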



<p>Then, once booted, the main workflow starts.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-StateCollection">State Collection</h4>



<p>Warden immediately goes and collects system states from its available sources. This could be, for example, a monitoring probe sink &#8211; which can be queried remotely as well as locally &#8211; or a snapshot of the process table.</p>



<p>Some deeper information is also generated, on demand, to keep the system load as light as possible.</p>



<p>This information is then sent to plugins matching the type of state collector. For example, plugins that operate on the process table will be gently fed this information.</p>



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-Pluginhandoff">Plugin hand off</h4>



<p>A Warden plugin consists essentially of three primary callbacks, which should be easy to implement.</p>



<p>Plugins are encouraged to terminate early if they do not find actionable items in the system state.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-ScanPhase">Scan Phase</h5>



<p>In this phase, a Warden plugin will receive information about the system state, in a form it can easily digest, using standard Python data structures.<br>The plugin can select some particular pieces of information it would like to further analyze, if necessary.</p>



<p>If an event is detected that the plugin can respond to immediately, then this is recorded to a&nbsp;<strong>Central Store</strong>&nbsp;(provided by our own Logs Data Platform product).</p>



<p>If at this point, a self-heal action is necessary, the plugin can signal it by setting its internal state accordingly.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-AnalysisPhase">Analysis Phase</h5>



<p>During this phase, the plugin will further dissect the received status, and/or collect information about the system &#8211; either requesting it from Warden, or collecting it itself.</p>



<p>This is where the diagnostic information will be exported to a&nbsp;<strong>Central Store</strong>, alongside a plethora of useful metadata (where, when, who, how).</p>



<p>At this point, if not already signaled by the previous phase, the plugin can mark its internal state as requiring an action.</p>



<h5 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-HealPhase">Heal Phase</h5>



<p>Warden will then check the internal state of the plugin, and if it needs to perform an action, this final phase will be executed.<br>This is where the logic to resolve the situation is written. Services get restarted, processes get terminated, maintenance scripts called, etc.<br><br>Success (or failure) is reported, and Warden will dutifully log the Action and its results to the&nbsp;<strong>Central Store.</strong></p>



<p>At this point, if an action was taken, Warden will refresh the corresponding state before moving on to the next plugin in the queue.</p>



<p>This process is repeated at configurable intervals that can be kept short, since plugins are lightweight and exit quickly if no issue is found.</p>
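<p>Put together, the three phases above could be sketched as follows (this is a hypothetical illustration; the class and method names are not Warden&#8217;s actual API):</p>

```python
# Hypothetical sketch of a Warden-style plugin; the class and method
# names below are illustrative, not Warden's actual plugin API.

class StuckWorkerPlugin:
    """Detects and restarts a hypothetical stuck worker process."""

    def __init__(self, config):
        # Tunable threshold exposed through the plugin's configuration.
        self.max_idle_seconds = config.get("max_idle_seconds", 300)
        self.needs_action = False
        self.findings = []

    def scan(self, process_table):
        """Scan phase: terminate early if nothing is actionable."""
        self.findings = [
            p for p in process_table
            if p["name"] == "worker" and p["idle_seconds"] > self.max_idle_seconds
        ]
        self.needs_action = bool(self.findings)
        return self.needs_action

    def analyze(self):
        """Analysis phase: build the diagnostic record for the Central Store."""
        return {
            "event": "stuck_worker",
            "processes": self.findings,
            "threshold": self.max_idle_seconds,
        }

    def heal(self):
        """Heal phase: a real plugin would restart the service here."""
        self.needs_action = False
        return True  # success is reported back to the core
```

<p>The core would run the scan callback on each cycle, and only invoke the later phases when the plugin signals that an action is needed.</p>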



<h4 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-DashboardsandVisibility">Dashboards and Visibility</h4>



<p>Extensive Grafana dashboards as well as Graylog interfaces have been built to closely monitor everything Warden does.</p>



<p>They simply query the Central Store where every single system reports its events and actions.</p>



<p>We can tell, for example, how frequently a specific self-heal is triggered, on how many systems, and where it occurs the most.</p>



<p>We can also easily tell where self-heals fail the most, between individual failure domains, or down to individual systems within a cluster.</p>



<p>They are made to be easy to drill down into, offering both a bird&#8217;s-eye view of the global state and a detailed view of the exact actions taken by a single plugin.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1024x540.png" alt="" class="wp-image-20168" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1024x540.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-768x405.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana-1536x810.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/Screenshot_2020-12-09-warden-Global-Statistics-Grafana.png 1893w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Keeping this up on a TV monitor in the office has been of incredible value when it comes to casually noticing trends, as well as identifying which problems are recurrent and which are transient.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-APracticalExample">A Practical Example</h3>



<p>As a practical example of how Warden can be tied into existing systems to handle their events, there exists a probe on our servers that verifies the availability of the hosting runtime stack, ensuring it functions and is in the correct state to process requests.</p>



<p>It would often raise an alarm after some specific code in our hosting stack either terminated abnormally, or created a scenario where the stack was incapable of recovering on its own. This would generate an alert, mark the server as unavailable, and remove it from the active pool.</p>



<p>Rebooting the server or restarting the entire stack would obviously resolve the situation and return the system to the pool of available hosts, but this robs us of the opportunity to inspect the issue. Existing metrics and logs only shed partial light on what exactly had occurred to cause this; especially since reproducing it will often depend on specific applications we host. Not to mention that by the time someone got to look at it, the chances are that the interesting state had long left the system.</p>



<p>In order to mitigate this, a Warden plugin was written with the following logic:</p>



<ul class="wp-block-list"><li>It scans the local alert sink for the failure state (exiting if it is not present)</li><li>During the analysis phase, crash dumps are collected, the filesystem state is recorded, and relevant logs are extracted.<br>The exact version of the hosting stack is also collected, alongside everything relevant.<br>This is then sent to the Central Store alongside information about the host, the site, and timestamps.<br>The plugin then marks itself as needing to take action.</li><li>Once everything relevant has been collected, the hosting stack is destroyed, cleaned, and relaunched.</li><li>Afterwards, the probe that raised the alert is refreshed. Congratulations, the system is now back online, and in a matter of minutes!</li></ul>



<p>The turnaround time for writing the plugin was also reasonably short, and was deemed complete in two iterations (mostly to collect more data).</p>



<p>This information helped our developers pin-point exactly what was happening, and it continues to be a solid metric for gauging the health of our infrastructure.</p>



<h3 class="wp-block-heading" id="Warden:ASelfhealingframeworkforlocalactions-InConclusion">In Conclusion</h3>



<p>So far, Warden has not only lowered the amount of human effort expended on diagnosing and resolving issues, but has also generated targeted improvements to various components of our stack.</p>



<p>It has also identified issues that would otherwise have gone unnoticed simply by graphing a visual trend of certain non-fatal states, which has led to more fixes and improvements.</p>



<p>On-call duty cycles have also been noticeably more peaceful, as the bar for automating the resolution of simple issues has been significantly lowered.</p>



<p>It has generally allowed us to better focus our energy where we are able to make a difference, and through further improvements, will hopefully continue to do so.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwarden-the-self-healing-framework-for-local-actions%2F&amp;action_name=Warden%3A%20the%20self-healing%20framework%20for%20local%20actions&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Selfheal at Webhosting &#8211; The external part</title>
		<link>https://blog.ovhcloud.com/selfheal-at-webhosting-the-external-part/</link>
		
		<dc:creator><![CDATA[Florian Chardin]]></dc:creator>
		<pubDate>Fri, 17 Jul 2020 12:34:38 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Self-heal at Webhosting]]></category>
		<category><![CDATA[SRE]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18726</guid>

					<description><![CDATA[Introduction With almost 6000000 websites hosted on more than 15000 servers, the OVHcloud Webhosting SRE team manage lots of alerts during their working day. Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally. We need, therefore, some tools to help us.&#160;&#160;In our team, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fselfheal-at-webhosting-the-external-part%2F&amp;action_name=Selfheal%20at%20Webhosting%20%26%238211%3B%20The%20external%20part&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Introduction">Introduction</h3>



<p>With almost 6,000,000 websites hosted on more than 15,000 servers, the OVHcloud Webhosting SRE team manages lots of alerts during their working day.</p>



<p>Our infrastructure is constantly growing, but to scale smoothly, the amount of time spent solving alerts should not increase proportionally.</p>



<p>We need, therefore, some tools to help us.&nbsp;&nbsp;In our team, we call it the <em>selfheal</em>.&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-1024x537.jpeg" alt="Selfheal at Webhosting – The external part" class="wp-image-18848" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/8C2CA3A8-F3E3-4B73-A4C5-087DACF7E88F.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Whatistheselfheal?">What is the <em>selfheal</em>?</h3>



<p>The <em>selfheal </em>refers to the automation of alert solving in our production environments. The automated process is able to fix well-known issues, with no admin interaction.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Whydoweneedit?">Why do we need it?</h3>



<p>We must limit the time we spend solving alerts as far as possible. Not only so we have the time to run and maintain the infrastructure, but also to stay up to date.</p>



<p>With the number of servers we manage, a small issue can represent dozens of alerts.</p>



<p>We need to be efficient by automating as many production chores as possible.</p>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Hardware">Hardware</h4>



<p>Serving billions of HTTP requests each day requires a lot of resources, which is why we often use physical servers in our datacenters. </p>



<p>Even a single physical server requires significant follow-up. It takes a lot of time to diagnose, schedule downtime, request and manage an intervention with datacenter teams, or even to reinstall the operating system when a disk is faulty.</p>



<p>We cannot afford to spend hours on repetitive tasks when they can be automated.</p>



<h4 class="wp-block-heading">Software</h4>



<p>Even if software seems predictable, it will still encounter failure. This is true even when managing the underlying infrastructure&nbsp;that hosts millions of lines of unknown code provided by our clients.</p>



<p>While we try to have a stable software stack, we cannot predict all behaviour. Many of the software problems can be solved with a restart or a quick fix, and lots of these operations can also be automated.</p>



<p>We should alert the on-call admin staff as little as possible, only when it&#8217;s absolutely necessary.</p>



<p>The idea is to log each action done by the <em>selfheal</em> to identify bug or error patterns and then work on longer-term fixes.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Theselfhealatwebhosting">The <em>selfheal </em>at Webhosting</h3>



<p>At Webhosting, we split the <em>selfheal</em> into two parts:</p>



<ol class="wp-block-list"><li>External <em>selfheal</em>, which handles hardware problems or anything that cannot be solved by the host itself.</li><li>Internal <em>selfheal</em>, which is intended to solve software problems on a given system.</li></ol>



<p>In this article, we will discuss the external part.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Externalselfheal">External <em>selfheal</em></h3>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Context">Context:</h4>



<p>As we said earlier, the external part of our <em>selfheal </em>is mainly intended to solve hardware problems that cannot be solved by the server alone.</p>



<p>To accomplish this, we created a small micro-service application that listens for monitoring events.</p>



<p>We could have chosen an existing tool (like StackStorm), but we didn&#8217;t. Here&#8217;s why:</p>



<ul class="wp-block-list"><li>Building micro-services is really simple and fast at OVH.</li><li>Structured, detailed and simple logs with a unique UUID to follow each selfheal task in our internal logging system (which allows us to graph them easily).&nbsp;</li><li>Simple integration with our existing tools and ecosystem&nbsp;</li><li>Fast and easy deployment in all our regions</li><li>Simple CI/CD (unit testing, etc.)</li><li>Custom notifications, like chat-bot</li><li>Intelligence and history</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="280" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1024x280.png" alt="External Selfheal" class="wp-image-18826" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1024x280.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-300x82.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-768x210.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16-1536x419.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/4FE23E51-6835-45F4-8ACE-819668BB9F16.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Howitworks">How it works</h4>



<p>Everything starts with our monitoring, which scrapes the server probes and sends all alerts to a Kafka topic.</p>



<p>The application consumes Kafka events and then reacts instantly with the correct workflow, depending on the problem.</p>



<p>It does this by performing the correct API calls to our different services and tools.</p>



<p>All actions performed are stored. This prevents doing the same fix several times on a given server, and helps identify complex problems.</p>
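<p>The dispatch logic can be sketched as follows (the alert names and workflow steps are hypothetical; in the real service, the events arrive from the Kafka topic):</p>

```python
# Hypothetical sketch: route one monitoring event to the matching
# workflow. Alert names and workflow steps are illustrative only;
# in production the events would be consumed from the Kafka topic.
import uuid

WORKFLOWS = {
    "disk_smart_failure": ["drain_server", "create_dc_ticket", "reinstall"],
    "host_unreachable": ["check_power", "reboot_via_api"],
}

def handle_alert(alert):
    """Return the workflow task for an alert, or None for unknown alerts."""
    steps = WORKFLOWS.get(alert["type"])
    if steps is None:
        return None  # unknown alert: left for a human to investigate
    return {
        "task_id": str(uuid.uuid4()),  # unique id to follow the task in logs
        "server": alert["server"],
        "steps": steps,
    }
```

<p>The unique task id makes it easy to follow each selfheal task through the logging system, and the stored tasks provide the history used to detect repeat failures.</p>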



<h4 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Concreteexampleonfaultydiskreplacement">Concrete example on faulty disk replacement</h4>



<p>One of the top time-consuming alerts we&#8217;ve had to solve was the replacement of unhealthy HDDs found by SMART checks.</p>



<p>Being stateless, lots of our servers use a single disk with no RAID setup.&nbsp;This means that replacing a disk also requires reinstalling the host; fortunately, this can be done with a single API call.</p>



<p>To manage this alert, an admin had to do the following actions:</p>



<ol class="wp-block-list"><li>Put the server in maintenance to drain client requests</li><li>Create a datacenter request to replace the HDD</li><li>Reinstall the server</li></ol>



<p>This whole process can take up to 3 hours and is hard to execute manually (while managing several issues at once).&nbsp;</p>



<p>The first thing we did, was to automate the check with a probe.</p>



<p>Then, we decided to automate the whole thing with a simple workflow in our self-healing application, which orchestrates the API calls.</p>
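<p>A sketch of such a workflow (the <code>api</code> client and its method names are hypothetical placeholders for the internal services being called):</p>

```python
# Hypothetical sketch of the automated disk-replacement workflow.
# The `api` object and its method names are illustrative placeholders
# for the internal services being called.

def replace_faulty_disk(api, server, disk_serial):
    """Run the three admin steps as one automated workflow."""
    actions = []

    # 1. Put the server in maintenance to drain client requests.
    api.set_maintenance(server, enabled=True)
    actions.append("maintenance_on")

    # 2. Create a datacenter request to replace the HDD.
    ticket = api.create_dc_intervention(server, component="hdd",
                                        serial=disk_serial)
    actions.append("dc_ticket:" + ticket)

    # 3. Reinstall the server once the disk has been swapped.
    api.reinstall(server)
    actions.append("reinstall")

    return actions  # stored, so the same fix is not replayed
```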



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="352" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1024x352.png" alt="Internal Selfheal" class="wp-image-18827" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1024x352.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-768x264.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2-1536x528.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/886B192D-9DF6-4DD7-AF70-77351AA6BFF2.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With this process, we are able to replace disks every day without any manual tasks performed by an admin.</p>



<h3 class="wp-block-heading" id="OVHCloudblogarticleSelfhealatWebhosting(part1)-Toconclude">To conclude</h3>



<p>Last month, our external <em>selfheal</em> tool requested more than 70 interventions from datacenter teams, which represents a significant time saving.</p>



<p>We also gained in reactivity: there is no longer any lag between the time an alert is detected and the time it&#8217;s handled.</p>



<p>Alerts are handled as soon as they are detected by the monitoring system. This helps us keep a clean monitoring backlog and avoid &#8220;batches&#8221; of alert solving, which were complicated for both us and the datacenter teams.</p>



<p>Now, we just handle alerts that cannot be solved through automation and focus on corner cases, where admin interactions are valuable and required.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fselfheal-at-webhosting-the-external-part%2F&amp;action_name=Selfheal%20at%20Webhosting%20%26%238211%3B%20The%20external%20part&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Doing BIG automation with Celery</title>
		<link>https://blog.ovhcloud.com/doing-big-automation-with-celery/</link>
		
		<dc:creator><![CDATA[Bartosz Rabiega]]></dc:creator>
		<pubDate>Fri, 06 Mar 2020 16:14:18 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[ceph]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[workflows]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17100</guid>

					<description><![CDATA[Intro TL;DR: You might want to skip the intro and jump right into “Celery &#8211; Distributed Task Queue”. Hello! I’m Bartosz Rabiega, and I’m part of the R&#38;D/DevOps teams at OVHcloud. As part of our daily work, we’re developing and maintaining the Ceph-as-a-Service project, in order to provide highly available, solid, distributed storage for various [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdoing-big-automation-with-celery%2F&amp;action_name=Doing%20BIG%20automation%20with%20Celery&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Intro</h2>



<p><strong>TL;DR</strong>: You might want to skip the intro and jump right into “Celery &#8211; Distributed Task Queue”.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-1024x537.png" alt="" class="wp-image-17420" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/2A010EF2-2666-42D4-91C1-F1FAE33148FE.png 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Hello! I’m Bartosz Rabiega, and I’m part of the R&amp;D/DevOps teams at OVHcloud. As part of our daily work, we’re developing and maintaining the Ceph-as-a-Service project, in order to provide highly available, solid, distributed storage for various applications. We’re dealing with 60PB+ of data, across 10 regions, so as you might imagine, we’ve got quite a lot of work ahead in terms of replacing broken hardware, handling natural growth, provisioning new regions and datacentres, evaluating new hardware, optimising software and hardware configurations, researching new storage solutions, and much more!</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2.png" alt="" class="wp-image-17382" width="705" height="471" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2.png 940w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2-300x200.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B87CD670-7779-4325-92D9-F30A1C8C71A2-768x513.png 768w" sizes="auto, (max-width: 705px) 100vw, 705px" /></figure></div>



<p>Because of the wide scope of our work, we need to offload as many repetitive tasks as possible. And we do that through automation.</p>



<h2 class="wp-block-heading">Automating your work</h2>



<p>To some extent, every manual process can be described as a set of actions and conditions. If we could somehow get something to automatically perform the actions and check the conditions, we would be able to automate the process, resulting in an automated workflow. Take a look at the example below, which shows some generic steps for manually replacing hardware in our project.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="291" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1024x291.png" alt="" class="wp-image-17389" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1024x291.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-300x85.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-768x218.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295-1536x436.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E9662233-9498-4F2F-9A7E-B640F85EE295.png 1677w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Hmm… What could help us do this automatically? Doesn’t a computer sound like a perfect fit? 🙂 There are many ways to force computers to process automated workflows, but first we need to define some building blocks (let’s call them tasks) and get them to run sequentially or in parallel (i.e. a workflow). Fortunately, there are software solutions that can help with that, among which is Celery.</p>



<h2 class="wp-block-heading">Celery &#8211; Distributed Task Queue</h2>



<p>Celery is a well-known and widely adopted piece of software that allows us to process tasks asynchronously. The description of the project on its main page (<a href="http://www.celeryproject.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">http://www.celeryproject.org/</a>) may sound a little bit enigmatic, but we can narrow down its basic functionality to something like this:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-1024x539.png" alt="" class="wp-image-17414" width="768" height="404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4749B2AA-AA5B-4BEF-BA3A-FC0B67FCD447.png 1294w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>Such machinery is perfectly suited to tasks like sending emails asynchronously (i.e. &#8216;fire and forget&#8217;), but it can also be used for different purposes. So what other tasks could it handle? Basically, any tasks you can implement in Python (the main Celery language)! I won’t go too much into the details, as they are available in the Celery documentation. What matters is that since we can implement any task we want, we can use that to create the building blocks for our automation.</p>



<p>There is one more important thing&#8230; Celery natively supports combining such tasks into workflows (Celery primitives: chains, groups, chords, etc.). So let’s get through some examples&#8230;</p>



<p>We&#8217;ll use the following task definition &#8211; a single task, printing <em>args</em> and <em>kwargs</em>:</p>



<pre class="wp-block-code"><code class="">from celery import Celery

# Example broker URL - point this at your own RabbitMQ/Redis instance
celery_app = Celery('automation', broker='amqp://guest@localhost//')

@celery_app.task
def noop(*args, **kwargs):
    # Task accepts any arguments and does nothing
    print(args, kwargs)
    return True</code></pre>



<p>Now we can execute the task asynchronously, using the following code:</p>



<pre class="wp-block-code"><code class="">task = noop.s(777)
task.apply_async()</code></pre>



<p>The elementary tasks can be parametrised and combined into a complex workflow using Celery methods, i.e. &#8220;chain&#8221;, &#8220;group&#8221;, and &#8220;chord&#8221;. See the examples below. In each of them, the left side shows a visual representation of a workflow, while the right side shows the code snippet that generates it. The green box is the starting point, after which the workflow execution progresses vertically.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<h4 class="wp-block-heading">Chain &#8211; a set of tasks processed sequentially</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/705AD975-048B-4E6A-8BFF-F68775C9C5C7.png" alt="" class="wp-image-17394" width="92" height="320"/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = (
    chain([noop.s(i) for i in range(3)])
)</code></pre>
</div>
</div>



<h4 class="wp-block-heading">Group &#8211; a set of tasks processed in parallel</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110.png" alt="" class="wp-image-17396" width="317" height="169" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110.png 633w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/B112B87E-2813-46DD-9105-4B528BB3C110-300x160.png 300w" sizes="auto, (max-width: 317px) 100vw, 317px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = (
    group([noop.s(i) for i in range(5)])
)</code></pre>
</div>
</div>



<h4 class="wp-block-heading">Chord &#8211; a group of tasks chained to the following task</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4.png" alt="" class="wp-image-17397" width="311" height="223" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4.png 621w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/4E75C373-2CE1-4A68-8599-245E768167A4-300x215.png 300w" sizes="auto, (max-width: 311px) 100vw, 311px" /></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = chord(
        [noop.s(i) for i in range(5)],
        noop.s()
)

# Equivalent:
workflow = chain([
        group([noop.s(i) for i in range(5)]),
        noop.s()
])</code></pre>
</div>
</div>
</div></div>



<p>An important point: the execution of a workflow will always stop in the event of a failed task. As a result, a chain won’t be continued if some task fails in the middle of it. This gives us quite a powerful framework for implementing some neat automation, and that’s exactly what we’re using for Ceph-as-a-Service at OVHcloud! We’ve implemented lots of small, flexible, parameterisable tasks, which we combine together to reach a common goal. Here are some real-life examples of elementary tasks, used for the automatic removal of old hardware:</p>



<ul class="wp-block-list"><li>Change weight of Ceph node (used to increase/decrease the amount of data on node. Triggers data rebalance)</li><li>Set service downtime (data rebalance triggers monitoring probes, but this is expected, so set downtime for this particular monitoring entry)</li><li>Wait until Ceph is healthy (wait until the data rebalance is complete &#8211; repeating task)</li><li>Remove Ceph node from a cluster (node is empty so it can simply be uninstalled)</li><li>Send info to technicians in DC (hardware is ready to be replaced)</li><li>Add new Ceph node to a cluster (install new empty node)</li></ul>
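<li>Change weight">
<p>To make the &#8220;stop on first failure&#8221; chain semantics concrete, here is a stdlib-only sketch of such a sequence. The task names are illustrative only &#8211; they are not our real tasks, and real chains run through Celery workers rather than a local loop:</p>

```python
# Stdlib-only sketch of chaining elementary tasks with Celery-like
# "stop on first failure" semantics. Task names are illustrative only.

def run_chain(tasks, ctx):
    """Run tasks sequentially; any exception aborts the rest of the chain."""
    done = []
    for task in tasks:
        task(ctx)  # an uncaught exception here stops the chain, as in Celery
        done.append(task.__name__)
    return done

def change_node_weight(ctx):
    ctx["weight"] = 0          # drain data off the node

def set_downtime(ctx):
    ctx["downtime"] = True     # silence monitoring during the rebalance

def wait_until_healthy(ctx):
    ctx["healthy"] = True      # in reality a repeating task polling Ceph

def remove_node(ctx):
    ctx["removed"] = True      # node is empty, safe to uninstall

state = {}
print(run_chain([change_node_weight, set_downtime,
                 wait_until_healthy, remove_node], state))
```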



<p>We parametrise these tasks and tie them together, using Celery chains, groups and chords to create the desired workflow. Celery then does the rest by asynchronously executing the workflow.</p>



<h2 class="wp-block-heading">Big workflows and Celery</h2>



<p>As our infrastructure grows, so do our automated workflows, with more tasks per workflow and ever-greater workflow complexity&#8230; So what do we understand as a big workflow? A workflow consisting of 1,000-10,000 tasks. To visualise this, take a look at the following examples:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<h4 class="wp-block-heading">A few chords chained together (57 tasks in total)</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh4.googleusercontent.com/XZWOfqmSMu68u7GcbvceB0mc8_HA_v8higDeoG08dlO5oTlRd9R98QBSlf4sMLPuiFB2RPVgM-6i7vG86jtAxMCrKSLTkt0nK4z5JSbYE4QkXF96qkXh3uSJYj1X82UUm-agBMxu" alt=""/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class="">workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])</code></pre>
</div>
</div>



<h4 class="wp-block-heading">More complex graph structure built from chains and groups (23 tasks in total)</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" src="https://lh5.googleusercontent.com/gUQlIa5Nmb4a5oNDbojhBtukEn--6dSxlKrn-enggXk9eCtuBvgVBTxecwAczOMghEoZ0zOtKuz0nohZTsj01QqVBxkbX8bxqyVVvYjC6B1sfrpXN8pferDSgg-RE6TB6v5SOBdL" alt=""/></figure></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<pre class="wp-block-code"><code class=""># | is ‘chain’ operator in celery
workflow = (
    group(
        group(
            group([noop.s() for i in range(5)]),
            chain([noop.s() for i in range(5)])
        ) |
        noop.s() |
        group([noop.s() for i in range(5)]) |
        noop.s(),
        chain([noop.s() for i in range(5)])
    ) |
    noop.s()
)</code></pre>
</div>
</div>
</div></div>



<p>As you can probably imagine, visualisations get quite big and messy when 1,000 tasks are involved! Celery is a powerful tool, and has lots of features that are well-suited for automation, but it still struggles when it comes to processing big, complex, long-running workflows. Orchestrating the execution of 10,000 tasks, with a variety of dependencies, is no trivial thing. There are several issues we encountered when our automation grew too big:</p>



<ul class="wp-block-list"><li>Memory issues during workflow building (client side)</li><li>Serialisation issues (client -&gt; Celery backend transfer)</li><li>Nondeterministic, broken execution of workflows</li><li>Memory issues in Celery workers (Celery backend)</li><li>Disappearing tasks</li><li>And more&#8230;</li></ul>



<p>Take a look at some GitHub tickets:</p>



<ul class="wp-block-list"><li><a href="https://github.com/celery/celery/issues/5000" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5000</a></li><li><a href="https://github.com/celery/celery/issues/5286" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5286</a></li><li><a href="https://github.com/celery/celery/issues/5327" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/5327</a></li><li><a href="https://github.com/celery/celery/issues/3723" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/celery/celery/issues/3723</a></li></ul>



<p>Using Celery for our particular use case became difficult and unreliable. Celery’s native support for workflows doesn’t seem to be the right choice for handling 100/1,000/10,000 tasks. In its current state, it’s just not enough. So here we stand, in front of a solid, concrete wall… Either we somehow fix Celery, or we rewrite our automation using a different framework.</p>



<h2 class="wp-block-heading">Celery &#8211; to fix&#8230; or to fix?</h2>



<p>Rewriting all of our automation would be possible, although relatively painful. Since I&#8217;m a rather lazy person, perhaps attempting to fix Celery wasn&#8217;t an entirely bad idea? So I took some time to dig through Celery&#8217;s code, and managed to find the parts responsible for building workflows, and executing chains and chords. It was still a little bit difficult for me to understand all the different code paths handling the wide range of use cases, but I realised it would be possible to implement a clean, straightforward orchestration that would handle all the tasks and their combinations in the same way. What&#8217;s more, I could see that it wouldn&#8217;t take too much effort to integrate it into our automation (let&#8217;s not forget the main goal!).</p>



<p>Unfortunately, introducing new orchestration into the Celery project would probably be quite hard, and would most likely break some backwards compatibility. So I decided to take a different approach &#8211; writing an extension or a plugin that wouldn’t require changes in Celery. Something pluggable, and as non-invasive as possible. That’s how Celery Dyrygent emerged&#8230;</p>



<h2 class="wp-block-heading">Celery Dyrygent</h2>



<p><a href="https://github.com/ovh/celery-dyrygent" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/celery-dyrygent</a></p>



<h3 class="wp-block-heading">How to represent a workflow</h3>



<p>You can think of a workflow as a directed acyclic graph (DAG), where each task is a separate graph node. When it comes to acyclic graphs, it is relatively easy to store and resolve dependencies between nodes, which leads to straightforward orchestration. Celery Dyrygent was implemented based on these features. Each task in the workflow has a unique identifier (Celery already assigns task IDs when a task is pushed for execution) and each one of them is wrapped into a workflow node. Each workflow node consists of a task signature (a plain Celery signature) and a list of IDs for the tasks it depends on. See the example below:</p>
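<p>The node-plus-dependencies idea can be illustrated with plain dictionaries. The sketch below shows dependency resolution on such a DAG; it is a simplified illustration, not Celery Dyrygent&#8217;s actual internal structure:</p>

```python
# Simplified illustration of workflow nodes: each node stores a (stringified)
# signature plus the IDs of the nodes it depends on. This is NOT
# celery-dyrygent's real data structure, just the idea behind it.

nodes = {
    "a": {"signature": "noop.s(0)", "depends_on": []},
    "b": {"signature": "noop.s(1)", "depends_on": ["a"]},
    "c": {"signature": "noop.s(2)", "depends_on": ["a"]},
    "d": {"signature": "noop.s(3)", "depends_on": ["b", "c"]},
}

def runnable(nodes, finished):
    """IDs of nodes whose dependencies are all finished and that haven't run."""
    return sorted(
        node_id
        for node_id, node in nodes.items()
        if node_id not in finished
        and all(dep in finished for dep in node["depends_on"])
    )

print(runnable(nodes, set()))            # ['a']
print(runnable(nodes, {"a"}))            # ['b', 'c']
print(runnable(nodes, {"a", "b", "c"}))  # ['d']
```

<p>Scheduling then boils down to repeatedly asking which nodes have become runnable, which is what makes the orchestration straightforward.</p>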



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-1024x533.png" alt="" class="wp-image-17400" width="512" height="267" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-1024x533.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-300x156.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918-768x400.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F4601B45-EB13-4710-9325-B9684BF77918.png 1172w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">How to process a workflow</h3>



<p>So we know how to store a workflow in a clean and easy way. Now we just need to execute it. How about using&#8230; Celery? Why not? For this, Celery Dyrygent introduces a <strong>workflow processor</strong> task (an ordinary Celery task). This task wraps a whole workflow and schedules the execution of primitive tasks according to their dependencies. Once the scheduling part is over, the task repeats itself (it &#8216;ticks&#8217; with some delay).</p>



<p>Throughout the whole processing cycle, the workflow processor retains the state of the entire workflow internally, updating it with each repetition. You can see an orchestration example below:</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-1024x553.png" alt="" class="wp-image-17416" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/CE6EE688-92F2-4BA5-9A6B-147BD956A0F0.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-1024x553.png" alt="" class="wp-image-17417" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/7764C3D5-1EF9-44A9-A588-4C37A275570B.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-1024x553.png" alt="" class="wp-image-17418" width="512" height="277" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-1024x553.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-300x162.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19-768x415.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/F2E6717E-B355-46AB-AD73-6C98B6CE4B19.png 1470w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Most notably, the workflow processor stops its execution in two cases:</p>



<ul class="wp-block-list"><li>Once the whole workflow finishes, with all tasks successfully completed</li><li>When it can’t proceed any further, due to a failed task</li></ul>



<h3 class="wp-block-heading">How to integrate</h3>



<p>So how do we use this? Fortunately, I was able to find a way to use Celery Dyrygent quite easily. First of all, you need to inject the workflow processor task definition into your Celery application:</p>



<pre class="wp-block-code"><code class="">from celery import Celery
from celery_dyrygent.tasks import register_workflow_processor

app = Celery()  # your celery application instance
workflow_processor = register_workflow_processor(app)</code></pre>



<p>Next, you need to convert your Celery defined workflow into a Celery Dyrygent workflow:</p>



<pre class="wp-block-code"><code class="">from celery import chain, chord
from celery_dyrygent.workflows import Workflow

celery_workflow = chain([
    noop.s(0),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    chord([noop.s(i) for i in range(10)], noop.s()),
    noop.s()
])

workflow = Workflow()
workflow.add_celery_canvas(celery_workflow)</code></pre>



<p>Finally, simply execute the workflow, just as you would an ordinary Celery task:</p>



<pre class="wp-block-code"><code class="">workflow.apply_async()</code></pre>



<p>That’s it! You can always go back if you wish, as the small changes are very easy to undo.</p>



<h3 class="wp-block-heading">Give it a try!</h3>



<p>Celery Dyrygent is free to use, and its source code is available on Github (<a href="https://github.com/ovh/celery-dyrygent" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/celery-dyrygent</a>). Feel free to use it, improve it, request features, and report any bugs! It has a few additional features not described here, so I&#8217;d encourage you to take a look at the project&#8217;s readme file. For our automation requirements, it&#8217;s already a solid, battle-tested solution. We&#8217;ve been using it since the end of 2018, and it has processed thousands of workflows, consisting of hundreds of thousands of tasks. Here are some production stats, from June 2019 to February 2020:</p>



<ul class="wp-block-list"><li>936,248 elementary tasks executed</li><li>11,170 workflows processed</li><li>4,098 tasks in the biggest workflow so far</li><li>~84 tasks per workflow, on average</li></ul>



<p>Automation is always a good idea!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdoing-big-automation-with-celery%2F&amp;action_name=Doing%20BIG%20automation%20with%20Celery&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introducing Director – a tool to build your Celery workflows</title>
		<link>https://blog.ovhcloud.com/introducing-director-a-tool-to-build-your-celery-workflows/</link>
		
		<dc:creator><![CDATA[Nicolas Crocfer]]></dc:creator>
		<pubDate>Wed, 26 Feb 2020 12:38:57 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[celery]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[python]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=17064</guid>

					<description><![CDATA[As developers, we often need to execute tasks in the background. Fortunately, some tools already exist for this. In the Python ecosystem, for instance, the most well-known library is Celery. If you have already used it, you know how great it is! But you will also have probably discovered how complicated it can be to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fintroducing-director-a-tool-to-build-your-celery-workflows%2F&amp;action_name=Introducing%20Director%20%E2%80%93%20a%20tool%20to%20build%20your%20Celery%20workflows&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>As developers, we often need to execute tasks in the background. Fortunately, some tools already exist for this. In the Python ecosystem, for instance, the most well-known library is <a href="http://www.celeryproject.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Celery</a>. If you have already used it, you know how great it is! But you will also have probably discovered how complicated it can be to follow the state of a complex workflow.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-1024x537.jpeg" alt="" class="wp-image-17224" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/7E201458-960D-44E8-8DF8-816CE1DE766E.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><strong>Celery Director</strong> is a tool we created at OVHcloud to fix this problem. The code is now open-sourced and is available on <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Github</a>.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="525" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/director-1024x525.png" alt="" class="wp-image-17098" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-1024x525.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-300x154.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director-768x394.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director.png 1440w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Following the talk we did during <a href="https://fosdem.org/2020/schedule/event/python2020_celery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">FOSDEM 2020</a>, this post aims to present the tool. We&#8217;ll take a close look at what Celery is, why we created Director, and how to use it.</p>



<h2 class="wp-block-heading">What is Celery?</h2>



<p>Here is the official description of Celery:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Celery is an asynchronous <strong>task queue</strong>/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.</p></blockquote>



<p>The important words here are &#8220;task queue&#8221;. This is a mechanism used to distribute work across a pool of machines or threads.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="572" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-1024x572.jpeg" alt="" class="wp-image-17220" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-1024x572.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-300x168.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F-768x429.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/51EA37AB-E3E5-453F-9EFD-92414C84523F.jpeg 1156w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The queue, in the middle of the above diagram, stores messages sent by the producers (APIs, for instance). On the other side, consumers constantly read the queue to pick up new messages and execute the corresponding tasks.</p>



<p>In Celery, a message sent by the producer is the signature of a Python function: <code>send_email("john.doe")</code>, for example.</p>



<p>The queue (named the <em>broker</em> in Celery) stores this signature until a worker reads it and <strong>actually</strong> executes the function with the given parameters.</p>
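<p>This mechanism can be sketched in plain Python. The following is a simplified, single-process illustration of the broker idea (not how Celery is implemented): the producer serialises a function call into a message, and a consumer thread reads the message and actually executes it.</p>

```python
import json
import queue
import threading

# The "broker": a queue storing serialized task signatures.
broker = queue.Queue()

def send_email(user):
    return f"email sent to {user}"

TASKS = {"send_email": send_email}
results = []

def producer(user):
    # The producer only enqueues the signature; it never runs the function.
    broker.put(json.dumps({"task": "send_email", "args": [user]}))

def consumer():
    # The consumer reads messages and actually executes the functions.
    while True:
        message = json.loads(broker.get())
        results.append(TASKS[message["task"]](*message["args"]))
        broker.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer("john.doe")
broker.join()  # wait until the message has been consumed
```

<p>Celery follows the same pattern, except that the queue lives in an external broker (Redis, RabbitMQ&#8230;) and the consumers are worker processes, possibly running on other machines.</p>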



<p>But why execute a Python function <em>somewhere else</em>? The main reason is to quickly return a response in the case of long-running functions. Indeed, keeping users waiting several seconds or minutes for a response is not an option.</p>



<p>We can also imagine a producer that lacks the resources for a CPU-bound task: a more robust worker could handle its execution instead.</p>



<h2 class="wp-block-heading">How to use Celery</h2>



<p>So Celery is a library used to execute Python code <em>somewhere else</em>, but how does it do that? In fact, it&#8217;s really simple! To illustrate this, we&#8217;ll use some of the available methods to send tasks to the broker, then we&#8217;ll start a worker to consume them.</p>



<p>Here is the code to create a Celery task:</p>



<pre class="wp-block-code"><code class=""># tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://127.0.0.1:6379/0")

@app.task
def add(x, y):
    return x + y</code></pre>



<p>As you can see, a Celery task is just a Python function transformed so that it can be sent to a broker. Note that we passed the Redis connection URL to the Celery application (named <code>app</code>), telling it which broker will store the messages.</p>



<p>This means it&#8217;s now possible to send a task to the broker:</p>



<pre class="wp-block-code"><code class="">>>> from tasks import add
>>> add.delay(2, 3)</code></pre>



<p>That&#8217;s all! We used the <code>.delay()</code> method, so our producer didn&#8217;t execute the Python code but instead sent the task signature to the broker.</p>



<p>Now it&#8217;s time to consume it with a Celery worker:</p>



<pre class="wp-block-code"><code class="">$ celery worker -A tasks --loglevel=INFO
[...]
[2020-02-14 17:13:38,947: INFO/MainProcess] Received task: tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000]
[2020-02-14 17:13:38,954: INFO/ForkPoolWorker-2] Task tasks.add[0e9b6ff2-7aec-46c3-b810-b62a32188000] succeeded in 0.0024250600254163146s: 5</code></pre>



<p>It&#8217;s even possible to combine the Celery tasks with some primitives (the full list is <a href="https://docs.celeryproject.org/en/stable/userguide/canvas.html#the-primitives" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>):</p>



<ul class="wp-block-list"><li>Chain: will execute tasks one after the other.</li><li>Group: will execute tasks in parallel by routing them to multiple workers.</li></ul>



<p>For example, the following code will make two additions in parallel, then sum the results:</p>



<pre class="wp-block-code"><code class="">from celery import chain, group

# Create the canvas
canvas = chain(
    group(
        add.si(1, 2),
        add.si(3, 4)
    ),
    sum_numbers.s()
)

# Execute it
canvas.delay()</code></pre>



<p>You probably noted we didn&#8217;t use the <code>.delay()</code> method here. Instead we created a <strong>canvas</strong>, used to combine a selection of tasks.</p>



<p>The <code>.si()</code> method is used to create an immutable signature (i.e. one that does not receive data from a previous task), while <code>.s()</code> relies on the data returned by the two previous tasks.</p>



<p>This introduction to Celery has just covered its very basic usage. If you&#8217;re keen to find out more, I invite you to read the documentation, where you&#8217;ll discover all the powerful features, including <strong>rate limits</strong>, <strong>task retries</strong>, or even <strong>periodic tasks</strong>.</p>



<h2 class="wp-block-heading">As a developer, I want&#8230;</h2>



<p>I&#8217;m part of a team whose goal is to deploy and monitor internal infrastructures. As part of this, we needed to launch some background tasks, and as Python developers our natural choice was to use Celery. But, out of the box, Celery didn&#8217;t support certain specific requirements for our projects:</p>



<ul class="wp-block-list"><li>Tracking the tasks&#8217; evolution and their dependencies in a WebUI.</li><li>Executing the workflows using API calls, or simply with a CLI.</li><li>Combining tasks to create workflows in YAML format.</li><li>Periodically executing a whole workflow.</li></ul>



<p>Some other cool tools exist for this, like Flower, but it only allows us to track each task individually, not a whole workflow and its component tasks.</p>



<p>And as we really needed these features, we decided to create <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Celery Director</a>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="377" height="377" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B.png" alt="" class="wp-image-17222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B.png 377w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B-300x300.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/2E75457D-256F-4CB9-942B-B1B8C00CF79B-150x150.png 150w" sizes="auto, (max-width: 377px) 100vw, 377px" /></figure></div>



<h2 class="wp-block-heading">How to use Director</h2>



<p>The installation can be done using the <code>pip</code> command:</p>



<pre class="wp-block-code"><code class="">$ pip install celery-director</code></pre>



<p>Director provides a simple command to create a new workspace folder:</p>



<pre class="wp-block-code"><code class="">$ director init workflows
[*] Project created in /home/ncrocfer/workflows
[*] Do not forget to initialize the database
You can now export the DIRECTOR_HOME environment variable</code></pre>



<p>A new tasks folder and a workflow example have been created for you:</p>



<pre class="wp-block-code"><code class="">$ tree -a workflows/
├── .env
├── tasks
│   └── etl.py
└── workflows.yml</code></pre>



<p>The <code>tasks/*.py</code> files will contain your Celery tasks, while the <code>workflows.yml</code> file will combine them:</p>



<pre class="wp-block-code"><code class="">$ cat workflows.yml
---
ovh.SIMPLE_ETL:
  tasks:
    - EXTRACT
    - TRANSFORM
    - LOAD</code></pre>



<p>This example, named <strong>ovh.SIMPLE_ETL</strong>, will execute three tasks, one after the other. You can find more examples in the <a href="https://ovh.github.io/celery-director/guides/build-workflows/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<p>After exporting the <code>DIRECTOR_HOME</code> variable and initialising the database with <code>director db upgrade</code>, you can execute this workflow:</p>



<pre class="wp-block-code"><code class="">$ director workflow list
+----------------+----------+-----------+
| Workflows (1)  | Periodic | Tasks     |
+----------------+----------+-----------+
| ovh.SIMPLE_ETL |    --    | EXTRACT   |
|                |          | TRANSFORM |
|                |          | LOAD      |
+----------------+----------+-----------+
$ director workflow run ovh.SIMPLE_ETL</code></pre>



<p>The broker has received the tasks, so now you can launch the Celery worker to execute them:</p>



<pre class="wp-block-code"><code class="">$ director celery worker --loglevel=INFO</code></pre>



<p>And then display the results using the webserver command (<code>director webserver</code>):</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="530" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/director_etl-1024x530.png" alt="" class="wp-image-17094" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-1024x530.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-300x155.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl-768x397.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/director_etl.png 1440w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This is just the beginning, as Director provides other features, allowing you to parametrise a workflow or periodically execute it, for example. You will find more details on these features in the <a href="https://ovh.github.io/celery-director/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a>.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>Our teams use Director regularly to launch our workflows. No more boilerplate, and no more need for advanced Celery knowledge&#8230; A new colleague can easily create their tasks in Python and combine them in YAML, without using the Celery primitives discussed earlier.</p>



<p>Sometimes we need to execute a workflow periodically (to populate a cache, for instance), and sometimes we need to manually call it from another web service (note that a workflow can also be executed through an <a href="https://ovh.github.io/celery-director/guides/run-workflows/#using-the-api" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">API call</a>). This is now possible using our single Director instance.</p>



<p>We invite you to try Director for yourself, and give us your feedback via GitHub, so we can continue to enhance it. The source code can be found on <a href="https://github.com/ovh/celery-director" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a>, and the 2020 FOSDEM presentation is available <a href="https://fosdem.org/2020/schedule/event/python2020_celery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fintroducing-director-a-tool-to-build-your-celery-workflows%2F&amp;action_name=Introducing%20Director%20%E2%80%93%20a%20tool%20to%20build%20your%20Celery%20workflows&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Simplify your research experiments with Kubernetes</title>
		<link>https://blog.ovhcloud.com/simplify-your-research-experiments-with-kubernetes/</link>
		
		<dc:creator><![CDATA[Laurent Parmentier]]></dc:creator>
		<pubDate>Fri, 06 Sep 2019 13:40:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[research]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=15990</guid>

					<description><![CDATA[Abstract As a researcher I need to conduct experiments to validate my hypotheses. When the field of Computer Science is involved, it is well known that practitioners tend to drive experiments on different environments (at the hardware level: x86/arm/…, CPU frequency, available memory, or at the software level: operating system, versions of libraries). The problem [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsimplify-your-research-experiments-with-kubernetes%2F&amp;action_name=Simplify%20your%20research%20experiments%20with%20Kubernetes&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Abstract</h2>



<p>As a <a href="https://www.ovh.com/blog/academics-ovh-ai-collaboration/" data-wpel-link="exclude">researcher</a> I need to conduct experiments to validate my hypotheses. When the field of Computer Science is involved, it is well known that practitioners tend to drive experiments on different environments (at the hardware level: x86/arm/…, CPU frequency, available memory, or at the software level: operating system, versions of libraries). The problem with these different environments is the difficulty of accurately reproducing an experiment as it has been presented in a research article.</p>



<p>In this post we provide a way of conducting experiments that can be reproduced, using Kubernetes-as-a-service, a managed platform to perform distributed computations, along with other tools (Argo, MinIO) that take advantage of the platform.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/IMG_0360-1024x539.png" alt="Simplify your research experiments with Kubernetes" class="wp-image-16028" width="768" height="404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0360.png 1199w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>The article is organised as follows: we first recall the context and the problem faced by a researcher who needs to conduct experiments. Then we explain how to solve the problem with Kubernetes, and why we did not choose other solutions (e.g., HPC software). Finally, we give some tips on improving the setup.</p>



<h2 class="wp-block-heading">Introduction</h2>



<p>When I started my PhD, I read a bunch of articles related to the field I’m working on, i.e. <a href="https://en.wikipedia.org/wiki/Automated_machine_learning" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">AutoML</a>. From this research, I realised how important it is to conduct experiments well in order to make them credible and verifiable. I started asking my colleagues how they carried out their experiments, and there was a common pattern: develop your solution, look at other solutions related to the same problem, run each solution <a href="https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">30 times if it is stochastic</a> with equivalent resources, and compare your results to the other solutions with statistical tests: Wilcoxon-Mann-Whitney when comparing two algorithms, or the Friedman test otherwise. As statistical tests are not the main topic of this article, I will not discuss them in detail.</p>
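<p>As an illustration of that pattern, comparing two stochastic solutions over 30 runs takes only a few lines with SciPy (assumed to be installed; the scores below are synthetic, for illustration only):</p>

```python
import random

from scipy.stats import mannwhitneyu

random.seed(42)

# Hypothetical scores from 30 stochastic runs of each solution.
algo_a = [random.gauss(0.80, 0.02) for _ in range(30)]
algo_b = [random.gauss(0.83, 0.02) for _ in range(30)]

# Wilcoxon-Mann-Whitney test on the two samples.
stat, p_value = mannwhitneyu(algo_a, algo_b, alternative="two-sided")
print(f"p-value: {p_value:.4f}")  # below 0.05 => significant difference
```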



<p>As an experienced DevOps, I had one question about automation: how do I find out how to reproduce an experiment, especially someone else&#8217;s solution? Guess the answer? Meticulously read the paper, or find a repository with all the information.</p>



<p>Either you are lucky and the source code is available, or pseudo-code is provided in the publication. In the latter case you need to re-implement the solution to be able to test and compare it. Even if you are lucky and the source code is available, the whole environment is often missing (e.g., the exact versions of the packages, the Python version itself, the JDK version, etc.). Not having the right information impacts performance and may potentially bias experiments. For example, new versions of packages and languages usually include better optimisations that an implementation can benefit from. Sometimes it is hard to find the versions that were used by practitioners.</p>



<p>The other problem is the complexity of setting up a cluster with <a href="https://en.wikipedia.org/wiki/Comparison_of_cluster_software" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HPC software</a> (e.g., Slurm, Torque). Indeed, it requires technical knowledge to manage such a solution: configuring the network, verifying that each node has the dependencies required by the runs installed, checking that nodes have the same versions of libraries, etc. These technical steps consume researchers&#8217; time, taking them away from their main job. Moreover, researchers usually extract the results manually: they retrieve the different files (through FTP or NFS), then perform statistical tests and save them by hand. Consequently, the workflow to perform an experiment is relatively costly and precarious.</p>



<p>From my point of view, this raises one <strong>big problem</strong>: an experiment cannot really be reproduced in the field of Computer Science.</p>



<h2 class="wp-block-heading">Solution</h2>



<p>OVH offers <a href="https://www.ovh.com/fr/public-cloud/kubernetes/" data-wpel-link="exclude">Kubernetes-as-a-service</a>, a managed cluster platform where you do not have to worry about how to configure the cluster (adding nodes, configuring the network, and so on), so I started to investigate how I could perform experiments similarly to the HPC solutions. <a href="https://argoproj.github.io/argo/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Argo Workflows</a> quickly stood out. This tool allows you to define a workflow of steps to run on your Kubernetes cluster, in which each step is confined to a <a href="https://en.wikipedia.org/wiki/List_of_Linux_containers" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">container</a>, loosely called an image. A container allows you to run a program under a specific software environment (language version, libraries, third parties), in addition to limiting the resources (CPU time, memory) used by the program.</p>



<p>The <strong>solution</strong> is linked to our big problem: reproducing an experiment becomes equivalent to running a workflow composed of steps under a specific environment.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="969" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/IMG_0359-1024x969.jpg" alt="Simplify your research experiments with Kubernetes: architecture" class="wp-image-16027" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-1024x969.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-300x284.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359-768x727.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/IMG_0359.jpg 1483w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Use case: Evaluate an AutoML solution</h3>



<p>The use case we rely on in our research is measuring <a href="https://gitlab.com/automl/automl-smac-vanilla" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">the convergence of Bayesian Optimization (SMAC) on the AutoML problem</a>.</p>



<p>For this use case, we described the Argo workflow in the following <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">YAML file</a>.</p>



<h3 class="wp-block-heading">Set up the infrastructure</h3>



<p>First we will set up a Kubernetes cluster, then we will install the services on it, and lastly we will run an experiment.</p>



<h4 class="wp-block-heading">Kubernetes cluster</h4>



<p>Installing a Kubernetes cluster with OVH is child&#8217;s play. Connect to the <a href="https://www.ovh.com/manager" data-wpel-link="exclude">OVH Control Panel</a>, go to <code>Public Cloud &gt; Managed Kubernetes Service</code>, then <code>Create a Kubernetes cluster</code> and follow the steps depending on your needs.</p>



<p>Once the cluster is created:</p>



<ul class="wp-block-list"><li>Take the upgrade policy into consideration. If you are a researcher and your experiment takes some time to run, you want to avoid an update that would shut down your infrastructure mid-run. To avoid this situation, it is better to choose &#8220;Minimum unavailability&#8221; or &#8220;Do not update&#8221;.</li><li>Download the <code>kubeconfig</code> file; it will be used later with <code>kubectl</code> to connect to our cluster.</li><li>Add at least one node to your cluster.</li></ul>



<p>Once the cluster is up, you will <strong>need <a href="https://kubernetes.io/fr/docs/tasks/tools/install-kubectl/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">kubectl</a></strong>, a tool that allows you to manage your cluster.</p>



<p>If everything has been properly set up, you should get something like this:</p>



<pre class="wp-block-code"><code class="">kubectl top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node01   64m          3%     594Mi           11%</code></pre>



<h4 class="wp-block-heading">Installation of Argo</h4>



<p>As we mentioned before, Argo allows us to run a workflow composed of steps. To install the client and the service on the cluster, we were inspired by this <a href="https://github.com/argoproj/argo/blob/02f38262c40901346ddd622685bc6bfd344a2717/demo.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">tutorial</a>.</p>



<p>First we download and install Argo (client):</p>



<pre class="wp-block-code"><code class="">curl -sSL -o /usr/local/bin/argo https://github.com/argoproj/argo/releases/download/v2.3.0/argo-linux-amd64
chmod +x /usr/local/bin/argo</code></pre>



<p>Then the controller and UI on our cluster:</p>



<pre class="wp-block-code"><code class="">kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.3.0/manifests/install.yaml</code></pre>



<p>Configure the <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">service account</a>:</p>



<pre class="wp-block-code"><code class="">kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default</code></pre>



<p>Then, with the <a href="https://github.com/argoproj/argo/blob/master/demo.md#1-download-argo" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">client</a> try a simple hello-world workflow to confirm the stack is working (Status: Succeeded):</p>



<pre class="wp-block-code"><code class="">argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-2lx9d
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Started:             Tue Aug 13 16:51:32 +0200 (24 seconds ago)
Finished:            Tue Aug 13 16:51:56 +0200 (now)
Duration:            24 seconds

STEP                  PODNAME            DURATION  MESSAGE
 ✔ hello-world-2lx9d  hello-world-2lx9d  23s</code></pre>



<p>You can also access the UI dashboard through <code>http://localhost:8001</code>:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n argo service/argo-ui 8001:80</code></pre>



<h4 class="wp-block-heading">Configure an Artifact repository (MinIO)</h4>



<p><strong>Artifact</strong> is the term Argo uses for an archive containing the files returned by a step. In our case we will use this feature to return the final results, and to share intermediate results between steps.</p>



<p>In order to get artifacts working, we need an object storage. If you already have one you can skip the installation part, but you will still need to <a href="https://argoproj.github.io/docs/argo/ARTIFACT_REPO.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">configure it</a>.</p>



<p>As specified in the tutorial, we used <a href="https://min.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MinIO</a>, here is the manifest to install it (<code>minio-argo-artifact.install.yml</code>):</p>



<pre class="wp-block-code"><code class="">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # This name uniquely identifies the PVC. Will be used in deployment below.
  name: minio-pv-claim
  labels:
    app: minio-storage-claim
spec:
  # Read more about access modes here: https://kubernetes.io/docs/user-guide/persistent-volumes/#access-modes
  accessModes:
    - ReadWriteOnce
  resources:
    # This is the request for storage. Should be available in the cluster.
    requests:
      storage: 10Gi
  # Uncomment and add storageClass specific to your requirements below. Read more https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  #storageClassName:
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  # This name uniquely identifies the Deployment
  name: minio-deployment
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        # Label is used as selector in the service.
        app: minio
    spec:
      # Refer to the PVC created earlier
      volumes:
      - name: storage
        persistentVolumeClaim:
          # Name of the PVC created earlier
          claimName: minio-pv-claim
      containers:
      - name: minio
        # Pulls the default MinIO image from Docker Hub
        image: minio/minio
        args:
        - server
        - /storage
        env:
        # MinIO access key and secret key
        - name: MINIO_ACCESS_KEY
          value: "TemporaryAccessKey"
        - name: MINIO_SECRET_KEY
          value: "TemporarySecretKey"
        ports:
        - containerPort: 9000
        # Mount the volume into the pod
        volumeMounts:
        - name: storage # must match the volume name, above
          mountPath: "/storage"
---
apiVersion: v1
kind: Service
metadata:
  name: minio-service
spec:
  ports:
    - port: 9000
      targetPort: 9000
      protocol: TCP
  selector:
    app: minio</code></pre>



<p><strong>Note</strong>: Please edit the following key/values:</p>



<ul class="wp-block-list"><li><code>spec &gt; resources &gt; requests &gt; storage</code>: the amount of storage requested by MinIO from the cluster (give it a unit, e.g. <code>10Gi</code> for 10 GiB)</li><li><code>TemporaryAccessKey</code></li><li><code>TemporarySecretKey</code></li></ul>



<pre class="wp-block-code"><code class="">kubectl create ns minio
kubectl apply -n minio -f minio-argo-artifact.install.yml</code></pre>



<p><strong>Note</strong>: alternatively, you can install MinIO with <a href="https://hub.helm.sh/charts/stable/minio" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Helm</a>.</p>



<p>Now we need to configure Argo in order to use our object storage MinIO:</p>



<pre class="wp-block-code"><code class="">kubectl edit cm -n argo workflow-controller-configmap
...
data:
  config: |
    artifactRepository:
      s3:
        bucket: my-bucket
        endpoint: minio-service.minio:9000
        insecure: true
        # accessKeySecret and secretKeySecret are secret selectors.
        # It references the k8s secret named 'argo-artifacts'
        # which was created during the minio helm install. The keys,
        # 'accesskey' and 'secretkey', inside that secret are where the
        # actual minio credentials are stored.
        accessKeySecret:
          name: argo-artifacts
          key: accesskey
        secretKeySecret:
          name: argo-artifacts
          key: secretkey</code></pre>



<p>Add credentials:</p>



<pre class="wp-block-code"><code class="">kubectl create secret generic argo-artifacts --from-literal=accesskey="TemporaryAccessKey" --from-literal=secretkey="TemporarySecretKey"</code></pre>



<p><strong>Note</strong>: Use the correct credentials you specified above</p>



<p>Create the bucket <code>my-bucket</code> with <code>Read and write</code> rights by connecting to the interface at <code>http://localhost:9000</code>:</p>



<pre class="wp-block-code"><code class="">kubectl port-forward -n minio service/minio-service 9000</code></pre>



<p>Check that Argo is able to use Artifact with the object storage:</p>



<pre class="wp-block-code"><code class="">argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/artifact-passing.yaml
Name:                artifact-passing-qzgxj
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Started:             Wed Aug 14 15:36:03 +0200 (13 seconds ago)
Finished:            Wed Aug 14 15:36:16 +0200 (now)
Duration:            13 seconds

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ artifact-passing-qzgxj
 ├---✔ generate-artifact   artifact-passing-qzgxj-4183565942  5s
 └---✔ consume-artifact    artifact-passing-qzgxj-3706021078  7s</code></pre>



<p><strong>Note</strong>: If you are stuck with a <code>ContainerCreating</code> message, chances are that Argo is not able to access MinIO (e.g., bad credentials).</p>



<h4 class="wp-block-heading">Install a private registry</h4>



<p>Now that we have a way to run a workflow, we want each step to represent a specific software environment (i.e., an image). We defined this environment in a <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/Dockerfile" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Dockerfile</a>.</p>



<p>Because each step can run on a different node in our cluster, the image needs to be stored somewhere; in the case of Docker, we require a <a href="https://docs.docker.com/registry/deploying/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">private registry</a>.</p>



<p>You can get a private registry in different ways:</p>



<ul class="wp-block-list"><li><a href="https://hub.docker.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Docker Hub</a></li><li><a href="https://gitlab.com/help/user/project/container_registry" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Gitlab.com</a></li><li><a href="https://labs.ovh.com/private-registry" data-wpel-link="exclude">OVH</a> &#8211; <a href="https://labs.ovh.com/private-registry/documentation/creating-a-private-registry" data-wpel-link="exclude">tutorial</a></li><li><a href="https://hub.helm.sh/charts/choerodon/harbor" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Harbor</a>: allows you to have your own registry on your Kubernetes cluster</li></ul>



<p>In our case, we used the <a href="https://labs.ovh.com/private-registry" data-wpel-link="exclude">OVH private registry</a>.</p>



<pre class="wp-block-code"><code class=""># First we clone the repository
git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

# We build the image locally
docker build -t asv-environment:latest .

# We push the image to our private registry
docker login REGISTRY_SERVER -u REGISTRY_USERNAME
docker tag asv-environment:latest REGISTRY_IMAGE_PATH:latest
docker push REGISTRY_IMAGE_PATH:latest</code></pre>



<p>Allow our cluster to pull images from the registry:</p>



<pre class="wp-block-code"><code class="">kubectl create secret docker-registry docker-credentials --docker-server=REGISTRY_SERVER --docker-username=REGISTRY_USERNAME --docker-password=REGISTRY_PWD</code></pre>
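

<p>Creating the secret alone is not always enough: the workflow pods have to reference it when pulling images. One way to do that is to declare it as an <code>imagePullSecret</code> in the workflow spec, along the lines of the following sketch (the metadata and entrypoint names are illustrative, not taken from our repository):</p>



<pre class="wp-block-code"><code class=""># Sketch: reference the docker-credentials secret created above
# so that every pod in the workflow can pull from the private registry
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: automl-benchmark-
spec:
  imagePullSecrets:
  - name: docker-credentials
  entrypoint: main
  # ... templates follow here</code></pre>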



<h3 class="wp-block-heading">Try our experiment on the infrastructure</h3>



<pre class="wp-block-code"><code class="">git clone git@gitlab.com:automl/automl-smac-vanilla.git
cd automl-smac-vanilla

argo submit --watch misc/workflow-argo -p image=REGISTRY_IMAGE_PATH:latest -p git_ref=master -p dataset=iris
Name:                automl-benchmark-xlbbg
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Started:             Tue Aug 20 12:25:40 +0000 (13 minutes ago)
Finished:            Tue Aug 20 12:39:29 +0000 (now)
Duration:            13 minutes 49 seconds
Parameters:
  image:             m1uuklj3.gra5.container-registry.ovh.net/automl/asv-environment:latest
  dataset:           iris
  git_ref:           master
  cutoff_time:       300
  number_of_evaluations: 100
  train_size_ratio:  0.75
  number_of_candidates_per_group: 10

STEP                       PODNAME                            DURATION  MESSAGE
 ✔ automl-benchmark-xlbbg
 ├---✔ pre-run             automl-benchmark-xlbbg-692822110   2m
 ├-·-✔ run(0:42)           automl-benchmark-xlbbg-1485809288  11m
 | └-✔ run(1:24)           automl-benchmark-xlbbg-2740257143  9m
 ├---✔ merge               automl-benchmark-xlbbg-232293281   9s
 └---✔ plot                automl-benchmark-xlbbg-1373531915  10s</code></pre>



<p><strong>Note</strong>:</p>



<ul class="wp-block-list"><li>Here we only have 2 parallel runs; you can have many more by adding entries to the <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L59" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">list</a> <code>withItems</code>. In our case, the list corresponds to the seeds.</li><li><code>run(1:24)</code> corresponds to run <code>1</code> with seed <code>24</code></li><li>We limit the resources per run by using <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L114" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">requests and limits</a>; see also <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Managing Compute Resources</a>.</li></ul>
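

<p>To make this more concrete, here is a simplified sketch (not the full manifest from the repository) of how <code>withItems</code> and the per-run resource limits fit together in the workflow file; the parameter names and values are illustrative:</p>



<pre class="wp-block-code"><code class=""># One "run" step is fanned out per seed in withItems,
# and each run is capped by requests/limits.
- name: run
  inputs:
    parameters:
    - name: seed
  container:
    image: "{{workflow.parameters.image}}"
    args: ["--seed", "{{inputs.parameters.seed}}"]
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 2Gi

- name: main
  steps:
  - - name: run
      template: run
      arguments:
        parameters:
        - name: seed
          value: "{{item}}"
      withItems: [42, 24]   # add more seeds here for more parallel runs</code></pre>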



<p>Then we just retrieve the results through the MinIO web user interface at <code>http://localhost:9000</code> (you can also do this with the <a href="https://github.com/minio/mc" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">client</a>).</p>



<p>The results are located in a directory with the same name as the Argo workflow; in our example, it is <code>my-bucket &gt; automl-benchmark-xlbbg</code>.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/09/performance-1024x768.jpg" alt="" class="wp-image-16026" width="768" height="576" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-1024x768.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-300x225.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance-768x576.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/09/performance.jpg 1920w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h2 class="wp-block-heading">Limitations of our solution</h2>



<p>The solution is not able to run the parallel steps on multiple nodes. This limitation is due to the way we merge the results of the parallel steps in the merge step. We are using <code>volumeClaimTemplates</code>, i.e., we are <a href="https://gitlab.com/automl/automl-smac-vanilla/blob/master/misc/workflow-argo.yml#L7" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">mounting a volume</a>, and this can&#8217;t be done across different nodes. The problem can be solved in two ways:</p>



<ul class="wp-block-list"><li>Using parallel artifacts and aggregating them; however, this is an <a href="https://github.com/argoproj/argo/issues/934" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ongoing issue with Argo</a></li><li>Implementing, directly in the code of your run, a way to store the results on accessible storage (the <a href="https://docs.min.io/docs/python-client-quickstart-guide.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">MinIO SDK</a>, for example)</li></ul>



<p>The first approach is preferred, as it means you don&#8217;t have to change and customize the code for a specific storage file system.</p>
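

<p>For reference, the mechanism behind this limitation is the shared volume declared at the top of the workflow, roughly like this (a simplified sketch; the claim name and size are illustrative):</p>



<pre class="wp-block-code"><code class=""># volumeClaimTemplates create one PVC for the whole workflow;
# with a ReadWriteOnce volume, every step that mounts it must be
# scheduled on the same node.
spec:
  volumeClaimTemplates:
  - metadata:
      name: workdir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi</code></pre>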



<h2 class="wp-block-heading">Hints to improve the solution</h2>



<p>In case you are interested in going further with your setup, you should take a look at the following topics:</p>



<ul class="wp-block-list"><li><a href="https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Controlling access</a>: in order to confine users to different spaces (for security reasons, or to control resources).</li><li>Exploring <a href="https://github.com/argoproj/argo/blob/master/examples/node-selector.yaml" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Argo selectors</a> and <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes selectors</a>: useful if your cluster is composed of nodes with different hardware and an experiment requires specific hardware (e.g., a particular CPU or GPU).</li><li>Configuring a <a href="https://docs.min.io/docs/distributed-minio-quickstart-guide.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">distributed MinIO</a>: this ensures your data is replicated across multiple nodes and stays available in the event of a node failure.</li><li><a href="https://www.ovh.com/blog/how-to-monitor-your-kubernetes-cluster-with-ovh-metrics/" data-wpel-link="exclude">Monitoring your cluster</a>.</li></ul>



<h2 class="wp-block-heading">Conclusion</h2>



<p>We have shown that a complex cluster for running reproducible research experiments can be set up easily, without in-depth technical knowledge.</p>



<h2 class="wp-block-heading">Related links</h2>



<ul class="wp-block-list"><li><a href="https://www.youtube.com/watch?v=ZK510prml8o&amp;t=0s&amp;index=169&amp;list=PLj6h78yzYM2PZf9eA7bhWnIh_mK1vyOfU" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Automating Research Workflows at BlackRock</a></li><li><a href="https://www.stackhpc.com/the-state-of-hpc-containers.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">The State of HPC Containers</a></li><li><a href="https://kubernetes.io/blog/2017/08/kubernetes-meets-high-performance/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes Meets High-Performance Computing</a></li></ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fsimplify-your-research-experiments-with-kubernetes%2F&amp;action_name=Simplify%20your%20research%20experiments%20with%20Kubernetes&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVH Private Cloud and HashiCorp Terraform &#8211; Part 1</title>
		<link>https://blog.ovhcloud.com/private_cloud_and_hashicorp_terraform_part1/</link>
		
		<dc:creator><![CDATA[Erwan Quelin]]></dc:creator>
		<pubDate>Fri, 03 May 2019 08:59:46 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[HashiCorp]]></category>
		<category><![CDATA[IaaC]]></category>
		<category><![CDATA[Private Cloud]]></category>
		<category><![CDATA[Terraform]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15407</guid>

					<description><![CDATA[When discussing the concepts of DevOps and Infrastructure-as-a-Code, the tools developed by HashiCorp quickly come up. With Terraform, HashiCorp offers a simple way to automate infrastructure provisioning in both public clouds and on-premises. Terraform has a long history of deploying and managing OVH&#8217;s Public Cloud resources. For example, you can find a complete guide on [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprivate_cloud_and_hashicorp_terraform_part1%2F&amp;action_name=OVH%20Private%20Cloud%20and%20HashiCorp%20Terraform%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><span class="tlid-translation translation" lang="en">When discussing the concepts of DevOps and Infrastructure-as-Code, the tools developed by <a href="https://www.hashicorp.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HashiCorp</a> quickly come up. With Terraform, HashiCorp offers a simple way to automate infrastructure provisioning in both public clouds and on-premises. Terraform has a long history of deploying and managing <a href="https://www.ovh.co.uk/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVH&#8217;s Public Cloud</a> resources. For example, you can find a complete guide on <a href="https://github.com/ovh/terraform-ovh-commons" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GitHub</a>. In this article, we will focus on using Terraform to interact with another OVH solution:&nbsp;<a href="https://www.ovh.co.uk/private-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Private Cloud</a>.</span></p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="900" height="450" src="/blog/wp-content/uploads/2019/04/IMG_0225.jpg" alt="" class="wp-image-15430" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225.jpg 900w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225-300x150.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0225-768x384.jpg 768w" sizes="auto, (max-width: 900px) 100vw, 900px" /></figure></div>



<p><span class="tlid-translation translation" lang="en"><br>Private Cloud enables customers to benefit from a VMware vSphere infrastructure, hosted and managed by OVH. Terraform lets you automate the creation of resources and their life cycle. In this first article, we will explore the basic notions of Terraform. After reading it, you should be able to write a Terraform configuration file to deploy and customise a virtual machine from a template. In a second article, we will build on this example, and modify it so that it is more generic and can be easily adapted to your needs.</span></p>



<h3 class="wp-block-heading">Installation</h3>



<p><span class="tlid-translation translation" lang="en">Terraform is available on the <a href="https://www.terraform.io/downloads.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">HashiCorp website</a> for almost all OSs as a simple binary. Just download it and copy it into a directory in your operating system&#8217;s PATH. To test that everything is working properly, run the <strong>terraform</strong> command.</span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform
Usage: terraform [-version] [-help] &lt;command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Common commands:
    apply              Builds or changes infrastructure
    console            Interactive console for Terraform interpolations
    destroy            Destroy Terraform-managed infrastructure</code></pre>



<h3 class="wp-block-heading">Folders and files</h3>



<p><span class="tlid-translation translation" lang="en">Like other Infrastructure-as-Code tools, Terraform uses simple files to define the target configuration. To begin, we will create a directory and place a file named <code>main.tf</code> in it. By default, Terraform reads all the files in the working directory with the <code>.tf</code> extension, but to simplify things, we will start with a single file. We will see in a future article how to organise the data into several files.</span></p>



<p>Similarly, to make it easier to understand Terraform operations, we will specify all the necessary information directly in the files. This includes usernames, passwords and the names of the different resources (vCenter, cluster, etc.). This is obviously not advisable when using Terraform in production. The second article will also be an opportunity to improve this part of the code. But for now, let&#8217;s keep it simple!</p>



<h3 class="wp-block-heading">Providers</h3>



<p><span class="tlid-translation translation" lang="en">Providers specify how Terraform will communicate with the outside world. In our example, the vSphere provider will be in charge of connecting to your Private Cloud&#8217;s vCenter. We declare a provider as follows:</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">provider "vsphere" {
    user = "admin"
    password = "MyAwesomePassword"
    vsphere_server = "pcc-XXX-XXX-XXX-XXX.ovh.com"
}</code></pre>



<p><span class="tlid-translation translation" lang="en">We see here that Terraform uses its own way of structuring data (it is also possible to write everything in <a href="https://www.terraform.io/docs/configuration/syntax-json.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">JSON</a>, to facilitate the automatic generation of files!). Data is grouped in blocks (here a block named <strong>vsphere</strong>, which is of the <strong>provider</strong> type), and the data relating to the block takes the form of keys/values.</span></p>
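

<p>As an illustration of the JSON alternative mentioned above, the same provider block could be written in a <code>.tf.json</code> file roughly like this (a sketch, using the same placeholder credentials):</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
  "provider": {
    "vsphere": {
      "user": "admin",
      "password": "MyAwesomePassword",
      "vsphere_server": "pcc-XXX-XXX-XXX-XXX.ovh.com"
    }
  }
}</code></pre>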



<h3 class="wp-block-heading">Data</h3>



<p><span class="tlid-translation translation" lang="en">Now that Terraform is able to connect to the vCenter, we need to retrieve information about the vSphere infrastructure. Since we want to deploy a virtual machine, we need to know where we are going to create it: the datacentre, cluster, template, and so on. To do this, we will use <strong>data</strong>-type blocks:</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">data "vsphere_datacenter" "dc" {
  name = "pcc-XXX-XXX-XXX-XXX_datacenter3113"
}

data "vsphere_datastore" "datastore" {
  name          = "pcc-001234"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_virtual_machine" "template" {
  name          = "UBUNTU"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}</code></pre>



<p><span class="tlid-translation translation" lang="en">In the above example, we retrieve information about the datacentre named <code>pcc-XXX-XXX-XXX-XXX_datacenter3113</code>, the datastore named <code>pcc-001234</code>, and a template named <code>UBUNTU</code>. Note that we use the datacentre id to look up the objects associated with it.</span></p>
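

<p>The resource we define in the next section also references a cluster and a network; the corresponding <strong>data</strong> blocks follow exactly the same pattern (they appear again in the full example below):</p>



<pre class="wp-block-code"><code lang="json" class="language-json">data "vsphere_compute_cluster" "cluster" {
  name          = "Cluster1"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_network" "network" {
  name          = "vxw-dvs-57-virtualwire-2-sid-5001-Dc3113_5001"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}</code></pre>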



<h3 class="wp-block-heading">Resources</h3>



<p><span class="tlid-translation translation" lang="en">Resources are used to create and/or manage elements of the infrastructure. In our example, we will use a resource of type <code>vsphere_virtual_machine</code>, which, as its name suggests, will help us create a VM.</span></p>



<pre class="wp-block-code"><code lang="json" class="language-json">resource "vsphere_virtual_machine" "vm" {
  name             = "vm01"
  resource_pool_id = "${data.vsphere_compute_cluster.cluster.resource_pool_id}"
  datastore_id     = "${data.vsphere_datastore.datastore.id}"
  guest_id         = "${data.vsphere_virtual_machine.template.guest_id}"
  scsi_type        = "${data.vsphere_virtual_machine.template.scsi_type}"

  network_interface {
    network_id = "${data.vsphere_network.network.id}"
  }

  disk {
    label = "disk0"
    size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  }

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"

    customize {
      linux_options {
        host_name = "vm01"
        domain     = "example.com"
      }

      network_interface {
        ipv4_address = "192.168.1.2"
        ipv4_netmask = 24
      }

      ipv4_gateway    = "192.168.1.254"
      dns_suffix_list = ["example.com"]
      dns_server_list = ["192.168.1.1"]
    }
  }
}</code></pre>



<p><span class="tlid-translation translation" lang="en">The structure of this resource is a little more complex, because it is composed of several sub-blocks. We first define the name of the virtual machine, then provide information about its configuration (resource pool, datastore, etc.). The <code>network_interface</code> and <code>disk</code> blocks specify the configuration of its virtual devices. The <code>clone</code> sub-block lets you specify which template to use to create the VM, as well as the configuration of the operating system installed on it. The <code>customize</code> sub-block is specific to the type of OS you are cloning. At every level, we use information previously obtained in the <code>data</code> blocks.</span></p>



<h3 class="wp-block-heading">Full example</h3>



<pre class="wp-block-code"><code lang="json" class="language-json">provider "vsphere" {
    user = "admin"
    password = "MyAwesomePassword"
    vsphere_server = "pcc-XXX-XXX-XXX-XXX.ovh.com"
}

data "vsphere_datacenter" "dc" {
  name = "pcc-XXX-XXX-XXX-XXX_datacenter3113"
}

data "vsphere_datastore" "datastore" {
  name          = "pcc-001234"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "Cluster1"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_network" "network" {
  name          = "vxw-dvs-57-virtualwire-2-sid-5001-Dc3113_5001"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

data "vsphere_virtual_machine" "template" {
  name          = "UBUNTU"
  datacenter_id = "${data.vsphere_datacenter.dc.id}"
}

resource "vsphere_virtual_machine" "vm" {
  name             = "vm01"
  resource_pool_id = "${data.vsphere_compute_cluster.cluster.resource_pool_id}"
  datastore_id     = "${data.vsphere_datastore.datastore.id}"
  guest_id         = "${data.vsphere_virtual_machine.template.guest_id}"
  scsi_type        = "${data.vsphere_virtual_machine.template.scsi_type}"

  network_interface {
    network_id = "${data.vsphere_network.network.id}"
  }

  disk {
    label = "disk0"
    size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  }

  clone {
    template_uuid = "${data.vsphere_virtual_machine.template.id}"

    customize {
      linux_options {
        host_name = "vm01"
        domain     = "example.com"
      }

      network_interface {
        ipv4_address = "192.168.1.2"
        ipv4_netmask = 24
      }

      ipv4_gateway    = "192.168.1.254"
      dns_suffix_list = ["example.com"]
      dns_server_list = ["192.168.1.1"]
    }
  }
}</code></pre>



<h3 class="wp-block-heading">3&#8230; 2&#8230; 1&#8230; Ignition</h3>



<p>Let&#8217;s look at how to use our new config file with Terraform&#8230;</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/04/IMG_0223-1024x405.jpg" alt="OVH Private Cloud and HashiCorp Terraform" class="wp-image-15429" width="768" height="304" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-1024x405.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-300x119.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-768x303.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223-1200x474.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0223.jpg 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h4 class="wp-block-heading">Initialisation</h4>



<p><span class="tlid-translation translation" lang="en">Now that our configuration file is ready, we can use it to create our virtual machine. Let&#8217;s start by initialising the working environment with the <code>terraform init</code> command. This will take care of downloading the vSphere provider and creating the different files that Terraform needs to work.</span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform init

Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "vsphere" (1.10.0)...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

...

* provider.vsphere: version = "~> 1.10"

Terraform has been successfully initialized!
...</code></pre>



<h4 class="wp-block-heading">Plan</h4>



<p><span class="tlid-translation translation" lang="en"><span title="">The next step is to execute the <code>terraform plan</code> command to validate that our configuration file contains no errors and to visualise all the actions that Terraform will perform.</span><br></span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  + vsphere_virtual_machine.vm
      id:                                                   &lt;computed>
      boot_retry_delay:                                     "10000"
      change_version:                                       &lt;computed>
      clone.#:                                              "1"
      clone.0.customize.#:                                  "1"
      clone.0.customize.0.dns_server_list.#:                "1"
      clone.0.customize.0.dns_server_list.0:                "192.168.1.1"
      clone.0.customize.0.dns_suffix_list.#:                "1"
      clone.0.customize.0.dns_suffix_list.0:                "example.com"
      clone.0.customize.0.ipv4_gateway:                     "172.16.0.1"
      clone.0.customize.0.linux_options.#:                  "1"
      clone.0.customize.0.linux_options.0.domain:           "example.com"
      clone.0.customize.0.linux_options.0.host_name:        "vm01"
      clone.0.customize.0.linux_options.0.hw_clock_utc:     "true"
      clone.0.customize.0.network_interface.#:              "1"
      clone.0.customize.0.network_interface.0.ipv4_address: "192.168.1.2"
      clone.0.customize.0.network_interface.0.ipv4_netmask: "16"
      clone.0.customize.0.timeout:                          "10"
      clone.0.template_uuid:                                "42061bc5-fdec-03f3-67fd-b709ec06c7f2"
      clone.0.timeout:                                      "30"
      cpu_limit:                                            "-1"
      cpu_share_count:                                      &lt;computed>
      cpu_share_level:                                      "normal"
      datastore_id:                                         "datastore-93"
      default_ip_address:                                   &lt;computed>
      disk.#:                                               "1"
      disk.0.attach:                                        "false"
      disk.0.datastore_id:                                  "&lt;computed>"
      disk.0.device_address:                                &lt;computed>
      ...

Plan: 1 to add, 0 to change, 0 to destroy.</code></pre>



<p><span class="tlid-translation translation" lang="en">It is important to take the time to check all the information returned by the <code>plan</code> command before proceeding. It would be a mess to delete virtual machines in production due to an error in the configuration file&#8230; In the example above, we see that Terraform will create a new resource (here, a VM) and not modify or delete anything, which is exactly the goal!</span></p>



<h4 class="wp-block-heading">Apply</h4>



<p><span class="tlid-translation translation" lang="en"><span title="">In the last step, the <code>terraform apply</code> command will actually configure the infrastructure according to the information present in the configuration file.</span> <span title="">As a first step, the <code>plan</code> command will be executed, and Terraform will ask you to validate by typing <code>yes</code>.</span><br></span></p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform apply
...

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

vsphere_virtual_machine.vm: Creating...
  boot_retry_delay:                                     "" => "10000"
  change_version:                                       "" => "&lt;computed>"
  clone.#:                                              "" => "1"
  clone.0.customize.#:                                  "" => "1"
  clone.0.customize.0.dns_server_list.#:                "" => "1"
  clone.0.customize.0.dns_server_list.0:                "" => "192.168.1.1"
  clone.0.customize.0.dns_suffix_list.#:                "" => "1"
  clone.0.customize.0.dns_suffix_list.0:                "" => "example.com"
  clone.0.customize.0.ipv4_gateway:                     "" => "192.168.1.254"
  clone.0.customize.0.linux_options.#:                  "" => "1"
  clone.0.customize.0.linux_options.0.domain:           "" => "example.com"
  clone.0.customize.0.linux_options.0.host_name:        "" => "terraform-test"
  clone.0.customize.0.linux_options.0.hw_clock_utc:     "" => "true"
  clone.0.customize.0.network_interface.#:              "" => "1"
  clone.0.customize.0.network_interface.0.ipv4_address: "" => "192.168.1.2"
  clone.0.customize.0.network_interface.0.ipv4_netmask: "" => "16"
  clone.0.customize.0.timeout:                          "" => "10"
  clone.0.template_uuid:                                "" => "42061bc5-fdec-03f3-67fd-b709ec06c7f2"
  clone.0.timeout:                                      "" => "30"
  cpu_limit:                                            "" => "-1"
  cpu_share_count:                                      "" => "&lt;computed>"
  cpu_share_level:                                      "" => "normal"
  datastore_id:                                         "" => "datastore-93"
  default_ip_address:                                   "" => "&lt;computed>"
  disk.#:                                               "" => "1"
...
vsphere_virtual_machine.vm: Still creating... (10s elapsed)
vsphere_virtual_machine.vm: Still creating... (20s elapsed)
vsphere_virtual_machine.vm: Still creating... (30s elapsed)
...
vsphere_virtual_machine.vm: Still creating... (1m50s elapsed)
vsphere_virtual_machine.vm: Creation complete after 1m55s (ID: 42068313-d169-03ff-9c55-a23e66a44b48)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.</code></pre>



<p><span class="tlid-translation translation" lang="en"><span class="" title="">When you connect to the vCenter of your Private Cloud, you should see a new virtual machine in the inventory!<br></span></span></p>



<h4 class="wp-block-heading">Next steps</h4>



<p>Now that we have seen a standard Terraform workflow, you may want to test some modifications to your configuration file. For example, you can add another virtual disk to your VM by modifying the <code>vsphere_virtual_machine</code> resource block like this:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">disk {
  label = "disk0"
  size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
}

disk {
  label = "disk1"
  size  = "${data.vsphere_virtual_machine.template.disks.0.size}"
  unit_number = 1
}</code></pre>



<p>Then run <code>terraform plan</code> to see what Terraform is going to do in order to reconcile the infrastructure state with your configuration file.</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...
vsphere_virtual_machine.vm: Refreshing state... (ID: 4206be6f-f462-c424-d386-7bd0a0d2cfae)

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ vsphere_virtual_machine.vm
      disk.#:                  "1" => "2"
      disk.1.attach:           "" => "false"
      disk.1.datastore_id:     "" => "&lt;computed>"
      ...


Plan: 0 to add, 1 to change, 0 to destroy.</code></pre>
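


<p>As a side note, if you want to be certain that the apply step executes exactly the plan you have just reviewed, Terraform lets you save the plan to a file and apply that file (the <code>disk.tfplan</code> file name below is just an example):</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash"># Save the execution plan to a file...
$ terraform plan -out=disk.tfplan
# ...then apply exactly that saved plan
$ terraform apply disk.tfplan</code></pre>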



<p>If you agree with Terraform&#8217;s proposed actions, you can rerun <code>terraform apply</code> to add the new virtual disk to your virtual machine.</p>



<h4 class="wp-block-heading">Clean it up</h4>



<p>When you have finished your tests and no longer need the infrastructure, you can simply run the <code>terraform destroy</code> command to delete all the previously-created resources. Be careful with this command, as there is no way to get your data back afterwards!</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">$ terraform destroy

data.vsphere_datacenter.dc: Refreshing state...
data.vsphere_compute_cluster.cluster: Refreshing state...
data.vsphere_datastore.datastore: Refreshing state...
data.vsphere_network.network: Refreshing state...
data.vsphere_virtual_machine.template: Refreshing state...
vsphere_virtual_machine.vm: Refreshing state... (ID: 42068313-d169-03ff-9c55-a23e66a44b48)

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  - vsphere_virtual_machine.vm


Plan: 0 to add, 0 to change, 1 to destroy.

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

vsphere_virtual_machine.vm: Destroying... (ID: 42068313-d169-03ff-9c55-a23e66a44b48)
vsphere_virtual_machine.vm: Destruction complete after 3s

Destroy complete! Resources: 1 destroyed.</code></pre>



<p>In this article, we have seen how to deploy a virtual machine with a Terraform configuration file. This allowed us to learn the basic commands <code>plan</code>, <code>apply</code> and <code>destroy</code>, as well as the notions of <code>provider</code>, <code>data</code> and <code>resource</code>. In the next article, we will build on this example, modifying it to make it more adaptable and generic.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprivate_cloud_and_hashicorp_terraform_part1%2F&amp;action_name=OVH%20Private%20Cloud%20and%20HashiCorp%20Terraform%20%26%238211%3B%20Part%201&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Dedicated Servers: twice the bandwidth for the same price</title>
		<link>https://blog.ovhcloud.com/dedicated-servers-twice-the-bandwidth-for-the-same-price/</link>
		
		<dc:creator><![CDATA[Yaniv Fdida]]></dc:creator>
		<pubDate>Wed, 27 Mar 2019 10:02:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[Bare Metal servers]]></category>
		<category><![CDATA[Datacenters & network]]></category>
		<category><![CDATA[Evolution]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15155</guid>

					<description><![CDATA[We announced it at the OVH Summit 2018&#8230; We were going to double the public bandwidth on OVH dedicated servers, without changing the price. A promise is a promise, so several weeks ago we fulfilled it: your servers now have twice the bandwidth, for the same price! We knew from the start that this upgrade [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdedicated-servers-twice-the-bandwidth-for-the-same-price%2F&amp;action_name=Dedicated%20Servers%3A%20twice%20the%20bandwidth%20for%20the%20same%20price&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[<p>We announced it at the OVH Summit 2018&#8230; We were going to <strong>double the public bandwidth</strong> on OVH dedicated servers, without changing the price.</p>
<p>A promise is a promise, so several weeks ago we fulfilled it: <strong>your servers now have twice the bandwidth, for the same price!</strong></p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-15270 size-medium" src="/blog/wp-content/uploads/2019/03/IMG_0185-300x218.jpg" alt="2019-03-27 - Dedicated servers : twice the bandwidth for the same price" width="300" height="218" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185-300x218.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185-768x557.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0185.jpg 779w" sizes="auto, (max-width: 300px) 100vw, 300px" /></p>
<p>We knew from the start that this upgrade would be feasible, as our 20Tbps network core can definitely cope with the extra load!  We work daily to make sure you enjoy using our network, which is one of the largest in the world among hosting providers.</p>
<p>Indeed, our network is constantly evolving, and our teams work tirelessly to optimise the capacity planning and anticipate the load generated by all our customers, spread across our 28 datacentres.</p>
<p>It&#8217;s also more than capable of managing the <strong>waves of DDoS attacks</strong> that arrive almost daily, sending millions of requests to hosted servers in an attempt to render them unavailable. These are <strong>absorbed</strong> by our in-house <strong>Anti-DDoS Protection</strong>, without any customer impact! As a reminder, we suffered <a href="https://www.ovh.com/world/articles/news/a2367.the-ddos-that-didnt-break-the-camels-vac" data-wpel-link="exclude">one of the biggest attacks on record a few years ago</a>, which generated traffic of more than 1Tbps, but was nonetheless absorbed by our infrastructure, without any impact on our customers.</p>
<p>To guarantee this additional public bandwidth, our Network and Bare Metal teams have worked closely together to be more and more LEAN when it comes to our infrastructures. As a result, thousands of active devices (routers, switches, servers etc.) have been <strong>updated in a completely transparent</strong> <strong>way!</strong></p>
<p>The overall deployment process has taken some time, as we have done a rolling upgrade, taking a QoS and isolation approach to prevent possible traffic spikes. Product range by product range, datacentre by datacentre&#8230; The deployment itself was quick and painless, as it was fully automated. The potential bottleneck was making sure that everything worked as intended, which involved carefully monitoring our full server farm, as bandwidth doubling can have a huge impact, especially at OVH, where (let me mention it once again!) <strong>egress traffic</strong> is indeed <strong>unlimited</strong>!</p>
<p>Here&#8217;s a quick overview of the new bandwidth for each server range:</p>
<p><img loading="lazy" decoding="async" class="alignnone wp-image-15169 size-full" src="/blog/wp-content/uploads/2019/03/Bandwith.png" alt="" width="1180" height="712" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith.png 1180w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-300x181.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-768x463.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Bandwith-1024x618.png 1024w" sizes="auto, (max-width: 1180px) 100vw, 1180px" /></p>
<p>Even if the bandwidth doubling doesn&#8217;t yet cover the full extent of our ranges, or the So you Start and Kimsufi servers, we haven&#8217;t forgotten the customers who are using those servers. We have also updated our <a href="https://www.ovh.com/fr/serveurs_dedies/upgrade-bande-passante.xml" data-wpel-link="exclude">bandwidth options</a> to offer all our customers an even better service, at an even better price.</p>
<p>We aren&#8217;t going to stop there though! We will soon announce some nice new features on the network side of things.  And of course, lots of other innovations will arrive in the coming months. But those are other stories, which will be told in other blog posts&#8230; &#x1f609;</p>
<p><img loading="lazy" decoding="async" class="sc-htpNat eEWWwP aligncenter" src="https://media1.giphy.com/media/jaXDDTuKmeJvwI56kV/200w.gif?cid=3640f6095c8bc77b366f436e55606833" alt="stay tuned watch GIF by Gadi Schwartz NBC News" width="373" height="211" /><img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fdedicated-servers-twice-the-bandwidth-for-the-same-price%2F&amp;action_name=Dedicated%20Servers%3A%20twice%20the%20bandwidth%20for%20the%20same%20price&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" /></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Continuous Delivery and Deployment Workflows with CDS</title>
		<link>https://blog.ovhcloud.com/continuous-delivery-and-deployment-workflows-with-cds/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Fri, 01 Mar 2019 12:38:13 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Industrialization]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14718</guid>

					<description><![CDATA[The CDS Workflow is a key feature of OVH CI/CD Platform. This structuring choice to add an additional concept above CI/CD pipelines and jobs is definitely an essential feature after more than three years of intense use.

Before going further on the explanation of a CDS workflow, we will make some reminders about the concepts of pipelines and jobs. Those concepts are based on the reference book 8 Principles of Continuous Delivery.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontinuous-delivery-and-deployment-workflows-with-cds%2F&amp;action_name=Continuous%20Delivery%20and%20Deployment%20Workflows%20with%20CDS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>The CDS Workflow is a key feature of the OVH CI/CD Platform. This structural choice adds an additional concept above&nbsp;CI/CD&nbsp;pipelines and jobs, and after&nbsp;more than three years of intensive use, is definitely an essential feature.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1024x354.jpeg" alt="Continuous Delivery and Deployment Workflows with CDS" class="wp-image-14861" width="768" height="266" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1024x354.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-300x104.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-768x266.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5-1200x415.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DE383951-7D79-4320-BB30-5EAE0F8186E5.jpeg 1529w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>Before delving into a full explanation of CDS workflows, let&#8217;s review some of the key concepts behind pipelines and jobs.&nbsp;Those concepts are drawn from the reference book,&nbsp;<a href="https://devopsnet.com/2011/08/04/continuous-delivery/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">8 Principles of Continuous Delivery</a></p>



<h3 class="wp-block-heading">The basic element: “The job”</h3>



<p>A job is composed of steps, which are run sequentially. A job is executed in a dedicated workspace (i.e. filesystem), and a new workspace is assigned for each new run of a job.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE.png" alt="CDS Job" class="wp-image-14858" width="512" height="444" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE.png 1276w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-300x259.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-768x664.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/1C6F2AC3-2321-4BC2-B449-DE3EA8FC1BCE-1024x885.png 1024w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>A standard build job looks like this:</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D.png" alt="CDS build job" class="wp-image-14859" width="447" height="373" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D.png 1156w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-300x250.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-768x640.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/60B4A0FC-A0CA-44E2-9E06-79E94258DC6D-1024x853.png 1024w" sizes="auto, (max-width: 447px) 100vw, 447px" /></figure></div>



<p>You can use &#8220;built-in&#8221; actions, such as <code>checkoutApplication</code>, <code>script</code>, <code>jUnit</code> and artifact upload/download.</p>



<ul class="wp-block-list"><li>The&nbsp;<b><i>c</i></b><b><i>heckoutApplication</i></b>&nbsp;action clones&nbsp;your Git repository</li><li data-listid="1" data-aria-posinset="2" data-aria-level="1">The <b><i>Script</i></b>&nbsp;action executes your build command as “make build”</li><li data-listid="1" data-aria-posinset="1" data-aria-level="1">The&nbsp;<b><i>artifactUpload</i></b>&nbsp;action uploads&nbsp;previously-built binaries</li><li data-listid="1" data-aria-posinset="2" data-aria-level="1">The <b><i>jUnit</i></b>&nbsp;action parses a given Junit-formatted XML file to extract its test results</li></ul>



<h3 class="wp-block-heading">A pipeline: How to orchestrate your jobs with stages</h3>



<p>With CDS, a pipeline is not a job flow. A pipeline is a sequence of stages, each of which contains one or more jobs.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1024x847.png" alt="CDS Pipeline" class="wp-image-14857" width="512" height="424" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1024x847.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-300x248.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-768x635.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64-1200x992.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/69F07485-71CE-49F9-9DCF-CD866B709D64.png 1364w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>A <strong>stage</strong> is a <strong>set of jobs that will be run in parallel</strong>. Stages are executed sequentially: a stage only starts if the previous stage was successful.</p>



<p>Let&#8217;s take a real-life use case: the pipeline that builds CDS. This pipeline has four stages:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="520" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds.png" alt="" class="wp-image-14721" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds-300x176.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_pipeline_cds-768x451.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<ul class="wp-block-list"><li>The “Build Minimal” stage is launched for all Git branches. The main goal of this stage is to compile the Linux version of the CDS binaries.</li><li>The “Build other os/arch” stage is <i>only</i> launched on the master branch. This stage compiles the binaries for all the supported OS/architecture combinations: linux, openbsd, freebsd, darwin and windows, on 386, amd64 and arm.</li><li>The “Package” stage is launched for all Git branches. This stage prepares the Docker image and the Debian package.</li><li>Finally, the “Publish” stage is launched, whatever the Git branch.</li></ul>



<p>Most tasks are executed in parallel whenever possible. This results in very fast feedback, so we quickly know whether the compilation is OK or not.</p>



<h3 class="wp-block-heading">CDS Workflows: How to orchestrate your pipelines</h3>



<p>The workflow concept is a key feature, treated in CDS as a native, manageable and feature-rich entity. A CDS workflow allows you to chain pipelines with manual or automatic gates, using conditional branching. A workflow can be stored as code, designed in the CDS UI, or both, depending on what suits you best.</p>



<p>Let&#8217;s take an example: one workflow for building and deploying three micro-services:</p>



<ul class="wp-block-list"><li>Build each micro-service</li><li>Deploy them in preproduction</li><li>Run integration tests on the preproduction environment</li><li>Deploy them in production, then re-run the integration tests in production</li></ul>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="198" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow.png" alt="" class="wp-image-14728" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow-300x67.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow-768x172.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>For the build part, there is only one pipeline to manage, which is used three times in the workflow, with a different application/environment context each time. This is called the “pipeline context”.</p>



<p>Any conditional branching in the workflow (e.g. “automatic deployment on the staging environment, only if the current Git branch is master”) can be expressed through “run conditions” set on the pipeline.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="528" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_run_conditions.png" alt="" class="wp-image-14723" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions-300x179.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_run_conditions-768x458.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Let&#8217;s look at a real use case. This is the workflow that builds, tests and deploys CDS in production at OVH (<em>yes, CDS builds and deploys itself!</em>):</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="446" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds.png" alt="" class="wp-image-14724" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds-300x151.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_cds-768x387.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<ol class="wp-block-list"><li>For each Git commit, the workflow is triggered</li><li>The UI is packaged, all the binaries are prepared, and the Docker images are built. The “UT” job launches the unit tests, while the “IT” job installs CDS in an ephemeral environment and launches the integration tests on it. Part 2 is automatically triggered for all Git commits.</li><li>Part 3 deploys CDS on our preproduction environment, then launches the integration tests on it. It is started automatically when the current branch is the master branch.</li><li>Last but not least, part 4 deploys CDS on our production environment.</li></ol>



<p>If there is a failure on a pipeline, it may look like this:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="298" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed.png" alt="" class="wp-image-14725" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed-300x101.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_failed-768x259.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>The same kind of workflow is used for building and deploying the Prescience project (<a class="Hyperlink SCXW149395370" href="https://labs.ovh.com/machine-learning-platform" target="_blank" rel="noopener noreferrer" data-wpel-link="exclude">https://labs.ovh.com/machine-learning-platform</a>):</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience.png" alt="" class="wp-image-14726" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_prescience-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>But of course, you&#8217;re not limited to complex tasks with CDS Workflows! These two examples demonstrate that workflows allow you to build and deploy a coherent set of micro-services. If you have simpler needs, your workflows are, of course, simpler.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="204" src="/blog/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-300x204.png" alt="" class="wp-image-14727" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-300x204.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple-768x522.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds_blog_art2_workflow_simple.png 856w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>Pipeline reusability allows you to easily maintain the technical parts of your builds, tests and deployments, even if you have hundreds of applications. If hundreds of applications share the same kind of workflow, you can benefit from the maintainability of workflow templates. We will talk more about this in a future post.</p>



<h3 class="wp-block-heading">Much more than &#8220;Pipeline as Code&#8221;&#8230; &#8220;Workflow as Code&#8221;</h3>



<p>There is no compromise with CDS. Some users prefer to draw their workflows in the web UI, while others prefer to write YAML code. CDS lets you do both!</p>



<p>There are two ways to store workflows: either in the CDS database, or in your Git repository alongside your source code. We call the latter &#8220;Workflow as Code&#8221;.</p>



<p>This makes it possible to have a workflow on a given branch, and then develop it on another branch. CDS will instantiate the workflow on the fly, based on the YAML code present on the current branch.</p>
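<p>As an illustrative sketch, a workflow stored as code (conventionally under a <code>.cds/</code> directory in the repository) might look like the following YAML; the workflow, pipeline and application names here are hypothetical:</p>

```yaml
# .cds/my-workflow.yml -- illustrative sketch; names are hypothetical
name: my-workflow
version: v1.0
workflow:
  build:
    pipeline: build-pipeline
    application: my-app
  deploy:
    depends_on:
      - build
    pipeline: deploy-pipeline
    application: my-app
    environment: production
```

<p>Because this file lives on the branch, each branch can carry its own version of the workflow, which is exactly what lets CDS instantiate it on the fly.</p>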



<p>CDS is OVH&#8217;s open-source software, and can be found at <a href="https://github.com/ovh/cds" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">https://github.com/ovh/cds</a>, with documentation at <a href="https://ovh.github.io/cds" target="_blank" rel="noopener noreferrer nofollow external" data-wpel-link="external">https://ovh.github.io/cds</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="150" src="/blog/wp-content/uploads/2019/02/cds-header-300x150.jpg" alt="CDS" class="wp-image-14628" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header-300x150.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header-768x384.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/cds-header.jpg 800w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>Previous posts:</p>



<ul class="wp-block-list"><li>CDS introduction: <a href="https://www.ovh.com/fr/blog/how-does-ovh-manage-the-ci-cd-at-scale/" data-wpel-link="exclude">https://www.ovh.com/fr/blog/how-does-ovh-manage-the-ci-cd-at-scale/</a></li><li>DataBuzzWord podcast (French): <a href="https://www.ovh.com/fr/blog/understanding-ci-cd-for-big-data-and-machine-learning/" data-wpel-link="exclude">https://www.ovh.com/fr/blog/understanding-ci-cd-for-big-data-and-machine-learning/</a></li></ul>



<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontinuous-delivery-and-deployment-workflows-with-cds%2F&amp;action_name=Continuous%20Delivery%20and%20Deployment%20Workflows%20with%20CDS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How does OVH manage the CI/CD at scale?</title>
		<link>https://blog.ovhcloud.com/how-does-ovh-manage-the-ci-cd-at-scale/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Thu, 14 Feb 2019 15:22:39 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Industrialization]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14461</guid>

					<description><![CDATA[From git commit to production, the delivery process is the set of steps that take place to deliver your service to your customers. Continuous Integration and Continuous Delivery – CI/CD - are practices based on the Agile Values which aim to automate this process as much as possible.



The Continuous Delivery Team @OVH has a mission: to help the OVH developers to industrialize and automate their delivery process. The CD team is here to advocate CI/CD best practices and maintain the ecosystem tools, with a maximum focus on as-a-service solutions.



The central point of this ecosystem is a tool built in-house at OVH, named CDS.
CDS is an OVH opensource software, you will find it on https://github.com/ovh/cds with documentation on https://ovh.github.io/cds.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-does-ovh-manage-the-ci-cd-at-scale%2F&amp;action_name=How%20does%20OVH%20manage%20the%20CI%2FCD%20at%20scale%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
					<content:encoded><![CDATA[<p><i>The delivery process is the set of steps &#8211; from git commit to production &#8211; that take place to deliver your service to your customers. Drawing on agile values, Continuous Integration and Continuous Delivery (CI/CD) are practices that aim to automate this process as much as possible.</i></p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-14529" src="/blog/wp-content/uploads/2019/02/FE68B6A7-7885-4C60-8FF4-B929005EEF96-300x56.png" alt="From git to production" width="512" height="97" /></p>
<p><i>The Continuous Delivery Team at OVH has one fundamental mission: to help the OVH developers industrialise and automate their delivery processes. The CD team is here to advocate CI/CD best practices and maintain our ecosystem tools, with the maximum focus on as-a-service solutions.</i></p>
<p><img loading="lazy" decoding="async" class="aligncenter size-medium wp-image-14512" src="/blog/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png" alt="CDS" width="297" height="300" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png 297w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-768x775.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85.png 793w" sizes="auto, (max-width: 297px) 100vw, 297px" /></p>
<p><i>The centre of this ecosystem is a tool called CDS, developed in-house at OVH.</i><br />
<i>CDS is an open-source software solution that can be found at </i><a style="font-style: italic;" href="https://github.com/ovh/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/cds,</a><i> with documentation at </i><a style="font-style: italic;" href="https://ovh.github.io/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://ovh.github.io/cds</a><i>.</i></p>
<p>CDS is the third generation of CI/CD tools at OVH, following two previous solutions based on Bash, Jenkins, GitLab and Bamboo. It is the end result of 12 years&#8217; experience in the field of CI/CD. Familiar with most of the industry&#8217;s standard tools, we found that none completely matched our expectations on the four key aspects we had identified. That is what CDS tries to solve.</p>
<p>These four aspects are:</p>
<h3><strong>Elastic</strong></h3>
<p>CDS resources/workers are <strong>launched on demand</strong>, to guarantee low waiting times for users, with no over-consumption of idle resources.</p>
<h3><strong>Extensible</strong></h3>
<p>In CDS, any kind of action (Kubernetes and OpenStack deployments, pushing to Kafka, testing for CVEs…) can be captured in <strong>high-level plugins</strong>, to be used as <strong>building blocks</strong> by users. These plugins are straightforward to write and use, so it&#8217;s easy to meet the most exotic needs in an effective and stress-free way.</p>
<h3><strong>Flexible, but easy</strong></h3>
<p>CDS can run <strong>complex workflows</strong>, with all sorts of intermediary steps, including build, test, deploy 1/10/100, manual or automatic gates, rollback, conditional branches… These workflows can be <strong>stored as code</strong> in the git repository. CDS provides basic <strong>workflow templates</strong> for the Core team&#8217;s most common scenarios, in order to ease the adoption process. This way, building a functional CI/CD chain from nothing can be quick and easy.</p>
<h3><strong>Self-service</strong></h3>
<p>Finally, a key aspect is the idea of<strong> self-service</strong>. Once a CDS project is created by users, they are completely autonomous within that space, with the freedom to manage pipelines, delegate access rights etc. All users are free to customise their space as they see fit, and build on what is provided out-of-the-box. Personalising workflow templates, plugins, running build and tests on custom VM flavors or custom hardware… all this can be done without any intervention from the CDS administrators.</p>
<h3><strong>CI/CD in 2018 &#8211; 5.7 million workers!</strong></h3>
<ul>
<li>About 5.7M workers started and deleted on demand:</li>
<li>3.7M containers</li>
<li>2M virtual machines</li>
</ul>
<h3>How is it possible?</h3>
<p>One of the initial CDS objectives at OVH was to build and deploy 150 applications as containers in less than seven minutes. This has been a reality since 2015. So what&#8217;s the secret? Auto-scaling on demand!</p>
<p>With this approach, you can have hundreds of worker models that CDS will launch via hatcheries whenever necessary.</p>
<p><figure id="attachment_14542" aria-describedby="caption-attachment-14542" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14542" src="/blog/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-300x76.png" alt="CDS Hatchery" width="512" height="130" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-300x76.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35-768x194.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/DA5984F5-6B7D-48B4-840E-6D7F3F590A35.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14542" class="wp-caption-text">CDS Hatchery</figcaption></figure></p>
<p>&nbsp;</p>
<p>A hatchery is like an incubator: it gives birth to the CDS workers and maintains the power of life and death over them.</p>
<p><figure id="attachment_14546" aria-describedby="caption-attachment-14546" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14546" src="/blog/wp-content/uploads/2019/02/IMG_0052-300x206.png" alt="CDS Hatcheries - Worker @Scale" width="512" height="352" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052-300x206.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052-768x528.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0052.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14546" class="wp-caption-text">CDS Hatcheries &#8211; Worker @Scale</figcaption></figure></p>
<p>&nbsp;</p>
<p>Each hatchery is dedicated to an orchestrator, and a single CDS instance can create workers across many cloud platforms:</p>
<ul>
<li>The <strong>Kubernetes</strong> hatchery starts workers in pods</li>
<li>The <strong>OpenStack</strong> hatchery starts virtual machines</li>
<li>The <strong>Swarm</strong> hatchery starts Docker containers</li>
<li>The <strong>Marathon</strong> hatchery starts Docker containers</li>
<li>The <strong>VSphere</strong> hatchery starts virtual machines</li>
<li>The <strong>local</strong> hatchery starts processes on a host</li>
</ul>
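<p>The incubator idea described above (spawn a worker when a job needs one, keep control of its lifecycle) can be sketched in Go, CDS&#8217;s implementation language. This is a simplified illustration with made-up names, not CDS&#8217;s actual code:</p>

```go
package main

import "fmt"

// Job is a queued CDS job waiting for a worker (simplified).
type Job struct{ ID int }

// Orchestrator abstracts the platform a hatchery drives:
// Kubernetes pods, OpenStack VMs, Swarm containers, etc.
type Orchestrator interface {
	SpawnWorker(j Job) string
}

// Hatchery births workers on demand for queued jobs.
type Hatchery struct{ orch Orchestrator }

// Drain starts one worker per pending job and returns their names.
func (h Hatchery) Drain(queue []Job) []string {
	started := make([]string, 0, len(queue))
	for _, j := range queue {
		started = append(started, h.orch.SpawnWorker(j))
	}
	return started
}

// podOrchestrator is a stand-in for a Kubernetes-style backend.
type podOrchestrator struct{}

func (podOrchestrator) SpawnWorker(j Job) string {
	return fmt.Sprintf("pod/worker-%d", j.ID)
}

func main() {
	h := Hatchery{orch: podOrchestrator{}}
	for _, w := range h.Drain([]Job{{ID: 1}, {ID: 2}}) {
		fmt.Println(w)
	}
}
```

<p>In the real system, each orchestrator listed above would provide its own implementation behind such an interface, which is what lets one CDS instance drive several platforms at once.</p>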
<p><figure id="attachment_14548" aria-describedby="caption-attachment-14548" style="width: 512px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" class="wp-image-14548" src="/blog/wp-content/uploads/2019/02/IMG_0053-300x87.png" alt="CDS Hatcheries" width="512" height="148" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053-300x87.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053-768x222.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/IMG_0053.png 885w" sizes="auto, (max-width: 512px) 100vw, 512px" /><figcaption id="caption-attachment-14548" class="wp-caption-text">CDS Hatcheries</figcaption></figure></p>
<h3>What&#8217;s next?</h3>
<p>This is all just a <strong>preview of CDS</strong>&#8230; we have lots more to tell you about! The CI/CD tool offers a wide range of features that we will explore in depth in our <strong>upcoming articles</strong>. We promise, before 2019 is done, you will not look at your CI/CD tool the same way again&#8230;<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-does-ovh-manage-the-ci-cd-at-scale%2F&amp;action_name=How%20does%20OVH%20manage%20the%20CI%2FCD%20at%20scale%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" /></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding CI/CD for Big Data and Machine Learning</title>
		<link>https://blog.ovhcloud.com/understanding-ci-cd-for-big-data-and-machine-learning/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Thu, 14 Feb 2019 12:28:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DataBuzzWord]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Podcast]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14588</guid>

					<description><![CDATA[This week, the OVH Integration and Continuous Deployment team was invited to the&#160;DataBuzzWord&#160;podcast. Together, we explored the topic of continuous deployment in the context of machine learning and big data.&#160;We also discussed continuous deployment for environments like&#160;Kubernetes, Docker, OpenStack and&#160;VMware VSphere. If you missed it, or would like to review everything that was discussed, you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-ci-cd-for-big-data-and-machine-learning%2F&amp;action_name=Understanding%20CI%2FCD%20for%20Big%20Data%20and%20Machine%20Learning&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>This week, the OVH Integration and Continuous Deployment team was invited to the&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">DataBuzzWord</a>&nbsp;podcast.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="297" height="300" src="/blog/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png" alt="CDS" class="wp-image-14512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png 297w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-768x775.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85.png 793w" sizes="auto, (max-width: 297px) 100vw, 297px" /></figure></div>



<p>Together, we explored the topic of continuous deployment in the context of machine learning and big data.&nbsp;We also discussed continuous deployment for environments like&nbsp;<a href="https://www.ovh.com/fr/blog/?s=kubernetes" data-wpel-link="exclude">Kubernetes</a>, Docker, OpenStack and&nbsp;<a href="https://www.ovh.com/fr/blog/?s=vmware" data-wpel-link="exclude">VMware VSphere</a>.</p>



<p>If you missed it, or would like to review everything that was discussed, you can&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">listen to it again here</a>. We hope to return soon, to continue sharing our passion for testing, integration and continuous deployment.</p>



<p>Although the podcast was recorded in French, starting from tomorrow, we&#8217;ll be delving further into the key points of our discussion in a series of articles on this blog.</p>


<div class="lazyblock-youtube-gdpr-compliant-ZRXyrv wp-block-lazyblock-youtube-gdpr-compliant"><script type="module">
  import 'https://blog.ovhcloud.com/wp-content/assets/ovhcloud-gdrp-compliant-embedding-widgets/src/ovhcloud-gdrp-compliant-spreaker.js';
</script>
      
      <ovhcloud-gdrp-compliant-spreaker
          spreaker="17021384"
          debug></ovhcloud-gdrp-compliant-spreaker> 

</div>


<p>Find CDS on GitHub:</p>



<ul class="wp-block-list"><li><a href="https://github.com/ovh/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/cds</a></li></ul>



<p>&#8230; and follow us on Twitter:</p>



<ul class="wp-block-list"><li><a href="https://twitter.com/yesnault" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/yesnault</a></li><li><a href="https://twitter.com/francoissamin" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/francoissamin</a></li></ul>



<p>Come chat about these subjects with us on our Gitter channel:&nbsp;<a href="https://gitter.im/ovh-cds/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh-cds/</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Funderstanding-ci-cd-for-big-data-and-machine-learning%2F&amp;action_name=Understanding%20CI%2FCD%20for%20Big%20Data%20and%20Machine%20Learning&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
