<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OVHcloud Observability Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/ovhcloud-observability/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/ovhcloud-observability/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Wed, 08 Jul 2020 14:00:16 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>OVHcloud Observability Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/ovhcloud-observability/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Loops: Powering Continuous Queries with Observability FaaS</title>
		<link>https://blog.ovhcloud.com/v8-hot-code-injection-for-continuous-queries/</link>
		
		<dc:creator><![CDATA[Rémi Collignon-Ducret]]></dc:creator>
		<pubDate>Thu, 04 Apr 2019 08:29:51 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Loops]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<category><![CDATA[V8]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14893</guid>

					<description><![CDATA[We&#8217;re all familiar with that small snippet of code that adds reasonable value to your business unit. It can materialise as a script, a program, a line of code&#8230; and it will produce a report, new metrics,&#160; KPIs, or create new composite data. This code is intended to run periodically, to meet requirements for up-to-date [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fv8-hot-code-injection-for-continuous-queries%2F&amp;action_name=Loops%3A%20Powering%20Continuous%20Queries%20with%20Observability%20FaaS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>We&#8217;re all familiar with that small snippet of code that adds tangible value to your business unit. It can materialise as a script, a program, a line of code&#8230; and it will produce a report, new metrics, KPIs, or create new composite data. This code is intended to run periodically, to meet requirements for up-to-date information.</p>



<p>In the Observability team, we encounter these snippets as queries within the <strong>Time Series Database</strong> (TSDB): continuous queries responsible for automating various use cases, such as deletions, rollups, or any business logic that needs to manipulate Time Series data.</p>



<p>We already introduced&nbsp;<a href="/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/" target="_blank" rel="noopener noreferrer" data-wpel-link="internal">TSL in a previous blog post</a>, which demonstrated how our customers use the available OVH Metrics protocols, like Graphite, OpenTSDB, PromQL and WarpScript™. But when it comes to <strong>manipulating, or even creating, new data</strong>, you don&#8217;t have many options, although you can use WarpScript™ or TSL as a scripting language instead of a query language.</p>



<p>In most cases, this business logic requires building an application, which is more <strong>time-consuming</strong> than expressing the logic as a query targeting a TSDB. Building the base application code is the first step, followed by the CI/CD (or any delivery process), and setting up its monitoring. However, managing hundreds of little apps like these adds an ongoing cost, due to the need to&nbsp;<strong>maintain them</strong>, along with the underlying infrastructure.</p>



<p>We wanted to ensure these valuable tasks did not stack up on the heads of a few developers, who would then need to carry the responsibilities of <strong>data ownership</strong> and <strong>computing resources</strong>, so we wondered how we could automate things, without relying on the team to set up the compute jobs each time someone needed something.</p>



<p>We wanted a solution that would focus on the <strong>business logic</strong>, without needing to run an entire app. This way, someone wanting to generate a JSON file with a daily data report (for example) would only need to express the corresponding query.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0194-1-1024x466.jpg" alt="Running business logic over Loops" class="wp-image-15311" width="512" height="233" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0194-1-1024x466.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0194-1-300x136.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0194-1-768x349.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0194-1-1200x546.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0194-1.jpg 1552w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">You shall not FaaS!</h3>



<p>Scheduling jobs is an old, familiar routine. Be it bash cron jobs, runners, or specialised schedulers, when it comes to wrapping a snippet of code and making it run periodically, there is a name for it: FaaS.</p>



<p>FaaS was born with a simple goal in mind: <strong>reduce development time</strong>. We could have found an open source implementation to evaluate (e.g. <a href="https://www.openfaas.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenFaas</a>), but most of these relied upon a managed container stack. Having one container per query would be very costly, plus warming up a container to execute the function and then freezing it would have been very counterproductive.</p>



<p>This would have required more scheduling and automation than we wanted for our end goal, would have&nbsp;led to suboptimal performance, and would have introduced a new requirement for cluster capacity management. There is also a build step required to deploy a new function in a container, so deployment isn&#8217;t free either.</p>



<h3 class="wp-block-heading">#def &lt;Loops&gt;</h3>



<p class="graf graf--p">That was when we decided to build &#8220;Loops&#8221;: an application platform where you can push the code you want to run. That&#8217;s all. The goal is to <strong>push a function</strong> (literally!) rather than a module, like all current FaaS solutions do:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">function dailyReport(event) {
    return Promise.resolve('Today, everything is fine !')
}</code></pre>



<p class="graf graf--p">You can then execute it manually, with either an HTTP call or a Cron-like scheduler.<br>Both these aspects are necessary, since you might (for example) have a monthly report, but one day require an additional one, 15 days after the last. Loops makes it easy to manually generate this new report, in addition to the monthly one.</p>



<p class="graf graf--p">There were some necessary constraints when we began building Loops:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">The platform must be able to <strong>scale</strong> easily, to support OVH&#8217;s production load</li><li class="graf graf--li">It must be <strong>highly available</strong></li><li class="graf graf--li">It must be <strong>language-agnostic</strong>, because some of us prefer Python, and others JavaScript</li><li class="graf graf--li">It must be <strong>reliable</strong></li><li class="graf graf--li">The scheduling part must be decoupled from the execution part (<strong>μService</strong> culture)</li><li class="graf graf--li">It must be <strong>secure</strong> and isolated, so that anybody can safely push arbitrary code to the platform</li></ul>



<h3 class="wp-block-heading">Loops implementation</h3>



<p class="graf graf--p">We chose to build our first version on <strong>V8</strong>, with JavaScript as the first language, because it&#8217;s easy to learn, and asynchronous data flows are easily managed using Promises. It also fits a FaaS very well, since JavaScript functions are highly expressive. We built it around the new <strong>NodeJS </strong><a class="markup--anchor markup--p-anchor" href="https://nodejs.org/dist/latest-v11.x/docs/api/vm.html#vm_vm_executing_javascript" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://nodejs.org/dist/latest-v11.x/docs/api/vm.html#vm_vm_executing_javascript" data-wpel-link="external"><strong>VM module</strong></a>, which allows you to execute code in <strong>a dedicated <a class="markup--anchor markup--p-anchor" href="https://fr.wikipedia.org/wiki/V8_%28moteur_JavaScript%29" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://fr.wikipedia.org/wiki/V8_(moteur_JavaScript)" data-wpel-link="external">V8</a> context</strong>.</p>



<p class="graf graf--p">A V8 context is like an object (think JSON), isolated from your main execution. In a context, you can find native functions and objects. However, if you craft a new V8 context, you will see that some variables and functions are not natively available (<strong class="markup--strong markup--p-strong">setTimeout()</strong>, <strong class="markup--strong markup--p-strong">setInterval()</strong> or <strong class="markup--strong markup--p-strong">Buffer</strong>, for example). If you want to use them, you will have to inject them into your new context. The last important thing to remember is that once you have your new context, you can easily execute a JavaScript script, supplied as a string, inside it.</p>
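<p>As a minimal sketch of this mechanism, using Node&#8217;s built-in <code>vm</code> module: a fresh context is isolated, and lacks Node globals such as <code>setTimeout</code>, which have to be injected explicitly if you need them:</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">// A fresh V8 context only sees what we put into it.
const vm = require('vm');

const sandbox = vm.createContext({ result: null });
vm.runInContext('result = 21 * 2', sandbox);
console.log(sandbox.result); // 42

// Node globals are absent unless injected into the context:
vm.runInContext('result = typeof setTimeout', sandbox);
console.log(sandbox.result); // undefined</code></pre>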



<p class="graf graf--p">Contexts fulfil the most important part of our original list of requirements: <strong>isolation</strong>.&nbsp;Each V8 context is isolated, so it cannot talk to another context. This means a global variable defined in one context is not available in a different one. You will have to build a bridge between them if you want this to be the case.</p>



<p class="graf graf--p">We didn&#8217;t want to execute scripts with <a class="markup--anchor markup--p-anchor" href="https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/eval" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/eval" data-wpel-link="external"><strong class="markup--strong markup--p-strong">eval()</strong></a>, since a call to this function allows you to execute JS code on the main shared context, with the code calling it. You can then access to the same objects, constants, variables, etc. This security issue was a deal breaker for the new platform.</p>



<p class="graf graf--p">Now that we know how to execute our scripts, let&#8217;s implement some management for them. To be <strong>stateless</strong>, each Loops worker instance (i.e. a JavaScript engine able to run code in a VM context) must have the latest version of each Loop (a Loop being a script to execute). This means that when a user pushes a new Loop, we have to sync it to each Loops worker. This model fits the pub/sub paradigm well, and since we already use <strong>Kafka</strong> as a pub/sub infrastructure, it was just a matter of creating a dedicated topic and consuming it from the workers. Publication involves an API where a user submits their Loop, which produces a Kafka event containing the function body. As each worker has its own Kafka consumer group, they all receive the same messages.</p>



<p class="graf graf--p">Workers subscribe to Loops updates as Kafka consumers and maintain a Loop store: an embedded key/value map from the Loop hash to the function&#8217;s current revision. In the API part, Loop hashes are used as URL parameters to identify which Loop to execute. Once called, a Loop is retrieved from the map, then <strong>injected into a V8 context</strong>, executed, and dropped. This <strong>hot code reload</strong> mechanism ensures that each Loop can be executed on every worker. We can also leverage our load balancers&#8217; capabilities to distribute the load across the workers. This simple distribution model <strong>avoids</strong> <strong>complex scheduling</strong> and <strong>eases the maintenance</strong> of the overall infrastructure.</p>



<p class="graf graf--p">In order to be <strong>reboot-proof</strong>, we make use of Kafka&#8217;s very handy&nbsp;<em class="markup--em markup--p-em">log compaction</em> feature. Log compaction allows Kafka to keep the last version of each keyed message. When a user creates a new Loop, it will be given a unique ID, which is used as a Kafka message key. When a user updates a Loop, this new message will be forwarded to all consumers, but since the key already exists, only the last revision will be kept by Kafka. When a worker restarts, it will consume all messages to rebuild its internal KV, so the previous state will be restored. Kafka is used here as a <strong>persistent store</strong>.</p>
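<p>To illustrate (a simplified sketch, with invented message values): replaying a compacted topic naturally rebuilds the key/value store, since later revisions of a key overwrite earlier ones:</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">// Messages replayed from a compacted topic: keyed by Loop ID, with
// updates appearing as newer messages for the same key.
const replayedMessages = [
  { key: 'loop-a', value: 'revision 1' },
  { key: 'loop-b', value: 'revision 1' },
  { key: 'loop-a', value: 'revision 2' }, // update to loop-a
];

// Rebuilding the worker's internal store: the last value per key wins.
const store = new Map();
for (const msg of replayedMessages) {
  store.set(msg.key, msg.value);
}
console.log(store.get('loop-a')); // revision 2</code></pre>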



<div class="wp-block-image wp-image-15313"><figure class="aligncenter is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0196-1024x286.jpg" alt="Creating, editing and deleting loops" class="wp-image-15313" width="512" height="143" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0196-1024x286.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0196-300x84.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0196-768x214.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0196-1200x335.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0196.jpg 1577w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">Loops runtimes</h3>



<p>Even if the underlying engine is able to run native JavaScript, as stated above, we wanted it to run more idiomatic Time Series queries, like <strong>TSL</strong> or <strong>WarpScript™</strong>. To achieve this, we created a <strong>Loops Runtime</strong> abstraction, which wraps not only JavaScript, but also TSL and WarpScript™ queries, into JavaScript code. Users have to declare a Loop with its runtime, after which the wrappers do the rest. For example, executing a WarpScript™ Loop involves taking the plain WarpScript™ and sending it through a node-request HTTP call.</p>
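<p>As a rough sketch of the idea (the function names here are invented, and the transport is stubbed so the snippet is self-contained; the real wrapper performs an actual HTTP call to the Warp&nbsp;10 endpoint), a non-JavaScript Loop is just its query wrapped into a JavaScript function:</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">// Wrap a WarpScript™ query (a plain string) into a JavaScript Loop.
function wrapWarpScript(script) {
  return function (event) {
    // In production, this would be an HTTP POST of the script to Warp 10.
    return postToWarp10(script);
  };
}

// Stubbed transport, so that this sketch runs standalone:
function postToWarp10(script) {
  return Promise.resolve('executed: ' + script.trim());
}

const loop = wrapWarpScript('NOW ISO8601');
loop({}).then((res) => console.log(res)); // executed: NOW ISO8601</code></pre>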



<div class="wp-block-image wp-image-15318"><figure class="aligncenter is-resized"><img decoding="async" src="/blog/wp-content/uploads/2019/04/IMG_0198.jpg" alt="Running a Loop" class="wp-image-15318" width="579" height="396" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0198.jpg 1158w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0198-300x205.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0198-768x525.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/04/IMG_0198-1024x700.jpg 1024w" sizes="(max-width: 579px) 100vw, 579px" /><figcaption>Running a Loop</figcaption></figure></div>



<h3 class="wp-block-heading">Loops feedback</h3>



<p>Executing code safely is a start, but when it comes to executing arbitrary code, it&#8217;s also useful to get some <strong>feedback on the execution state</strong>. Was it successful or not? Is there an error in the function? If a Loop is in a failure state, the user should be notified straight away.</p>



<p>This leads us to one special condition: a user&#8217;s script must be able to tell us whether everything is OK or not. There are two ways to do that in the <strong>underlying JavaScript engine</strong>: callbacks and Promises.<br>We chose to go with Promises, which offer better asynchronous management. <strong>Every Loop returns a Promise</strong> at the end of the script. A rejected Promise will produce an HTTP 500 error status, while a resolved one will produce an HTTP 200 status.</p>
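<p>The contract can be sketched as follows (an illustrative harness, not the actual Loops code): the platform settles the Loop&#8217;s Promise and maps the outcome to an HTTP status:</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">// Illustrative harness: settle a Loop's Promise and map it to a status code.
function runLoop(loopFn) {
  return Promise.resolve()
    .then(() => loopFn({}))
    .then((body) => ({ status: 200, body }))              // resolved: HTTP 200
    .catch((err) => ({ status: 500, body: String(err) })); // rejected: HTTP 500
}

function dailyReport(event) {
  return Promise.resolve('Today, everything is fine !');
}

function brokenReport(event) {
  return Promise.reject(new Error('query failed'));
}

runLoop(dailyReport)
  .then((res) => console.log(res.status)) // 200
  .then(() => runLoop(brokenReport))
  .then((res) => console.log(res.status)); // 500</code></pre>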



<h3 class="wp-block-heading">Loops scheduling</h3>



<p>When publishing Loops, you can declare several <strong>triggers</strong>, in a similar way to Cron.&nbsp;Each trigger will perform an <strong>HTTP call to your Loop</strong>, with optional parameters.</p>



<p>Based on these semantics, to generate multiple reports, we can register a single function scheduled with different contexts, defined by various parameters (region, rate, etc.). See the example below:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">functions:
  warp_apps_by_cells:
    handler: apps-by-cells.mc2
    runtime: ws
    timeout: 30
    environment:
    events:
      - agra:
          rate: R/2018-01-01T00:00:00Z/PT5M/ET1M
          params:
            cell: ovh-a-gra
      - abhs:
          rate: R/2018-01-01T00:00:00Z/PT5M/ET1M
          params:
            cell: ovh-a-bhs</code></pre>



<p>The scheduling is based on <a href="https://github.com/ovh/metronome" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Metronome</a>, which is an open-source event scheduler, with a specific focus on scheduling rather than execution. It&#8217;s a perfect fit for Loops, since Loops handle the execution, while relying on Metronome to drive execution calls.</p>



<h3 class="wp-block-heading">Loops pipelines</h3>



<p>A Loops project can have several Loops. One of our customers&#8217; common use cases was to use <strong>Loops as a data platform</strong>, in a data-flow fashion. Data flow is a way to describe a pipeline of execution steps. In a Loops context, there is a global <code>Loop</code> object, which allows a script to execute another Loop by name. You can then <strong>chain Loop executions</strong> that will act as step functions.</p>
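<p>A chained pipeline might look like the sketch below. Note that <code>Loop.run</code> and the step names are invented for illustration (and stubbed here so the snippet is self-contained); the real platform API may differ:</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">// Stub of the platform-provided global `Loop` object:
const Loop = {
  run: (name, payload) =>
    Promise.resolve(name + ' received ' + payload.rows + ' rows'),
};

// First step produces data, then chains into the next Loop:
function extractStep(event) {
  return Promise.resolve({ rows: 42 })
    .then((data) => Loop.run('transform-step', data));
}

extractStep({}).then((out) => console.log(out)); // transform-step received 42 rows</code></pre>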



<h3 class="wp-block-heading">Pain points: scaling a NodeJS application</h3>



<p>Loops workers are NodeJS applications. Most NodeJS developers know that NodeJS uses a <strong>mono-threaded event loop</strong>. If you don&#8217;t take care of the threading model of your NodeJS application, you will likely suffer from a lack of performance, since only one host thread will be used.</p>



<p>NodeJS also has a<strong> <em>cluster</em> module</strong> available, which allows an app to use multiple processes. That&#8217;s why, in a Loops worker, we start N-1 worker processes for handling API calls, where N is the total number of cores available, which leaves one dedicated to the master process.</p>



<p>The <strong>master process</strong> is in charge of <strong>consuming the Kafka</strong> topic and<strong> maintaining the Loops store</strong>, while each worker process starts an API server. For every requested Loop execution, the worker asks the master for the script content, and executes it in a dedicated V8 context.</p>
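<p>The split can be sketched with the <code>cluster</code> module as follows (a simplified, self-contained sketch: the real master consumes Kafka and serves scripts to the workers, and the workers start API servers rather than exiting):</p>

<pre class="wp-block-code"><code lang="javascript" class="language-javascript">const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // Master: in Loops, this process consumes Kafka and owns the Loop store.
  const apiWorkers = Math.max(os.cpus().length - 1, 1); // N-1 API workers
  console.log('master: forking ' + apiWorkers + ' API workers');
  let i = apiWorkers;
  while (i > 0) {
    cluster.fork();
    i -= 1;
  }
} else {
  // Worker: in Loops, this process would start the API server.
  console.log('worker ' + cluster.worker.id + ': would start the API server');
  process.exit(0);
}</code></pre>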



<p>With this setup, one NodeJS application with one Kafka consumer is started per server, which makes it very easy to scale out the infrastructure, simply by adding servers or cloud instances.</p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>In this post, we previewed Loops, a scalable,&nbsp;<strong>metrics-oriented FaaS</strong> with native <strong>JavaScript support</strong>, and extended <strong>WarpScript™ and TSL support</strong>.</p>



<p>We still have a few things to enhance, like ES5-style dependency imports and metrics previews for our customers&#8217; Loops projects. We also plan to add more <strong>runtimes</strong>, especially <strong>WASM</strong>, which would allow many other languages that can target it, like <strong>Go, Rust or Python</strong>, to suit most developer preferences.</p>



<p>The Loops platform was part of a requirement to build higher-level features around OVH Observability products. It&#8217;s a first step towards offering more automated services, like <strong>metrics rollups</strong>, <strong>aggregation pipelines</strong>,&nbsp;or <strong>logs-to-metrics extractors</strong>.</p>



<p>This tool was built as part of the Observability products suite with a higher abstraction level in mind, but you might also want direct access to the API, in order to implement your own automated logic for your metrics. Would you be interested in such a feature? Visit <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">our Gitter channel</a> to discuss it with us!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fv8-hot-code-injection-for-continuous-queries%2F&amp;action_name=Loops%3A%20Powering%20Continuous%20Queries%20with%20Observability%20FaaS&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to monitor your Kubernetes Cluster with OVH Observability</title>
		<link>https://blog.ovhcloud.com/how-to-monitor-your-kubernetes-cluster-with-ovh-observability/</link>
		
		<dc:creator><![CDATA[Adrien Carreira]]></dc:creator>
		<pubDate>Fri, 08 Mar 2019 13:33:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14897</guid>

					<description><![CDATA[Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&#160;last week,&#160;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (here&#160;and&#160;here, to get you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-monitor-your-kubernetes-cluster-with-ovh-observability%2F&amp;action_name=How%20to%20monitor%20your%20Kubernetes%20Cluster%20with%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p">Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">last week,</a>&nbsp;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" data-wpel-link="exclude">here</a>&nbsp;and&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" data-wpel-link="exclude">here,</a> to get you started).</p>



<p>In the <a href="https://labs.ovh.com/machine-learning-platform" data-wpel-link="exclude">Prescience team</a>, we have used Kubernetes for more than a year now. Our cluster includes 40 nodes, running on top of PCI. We continuously run about 800 pods, and generate a lot of metrics as a result.</p>



<p>Today, we&#8217;ll look at how we handle these metrics to monitor our Kubernetes Cluster, and (equally importantly!) how to do this with your own cluster.</p>



<h3 class="graf graf--h3 wp-block-heading">OVH Metrics</h3>



<p class="graf graf--p">Like any infrastructure, you need to monitor your Kubernetes Cluster. You need to know exactly how your nodes, cluster and applications behave once they have been deployed in order to provide reliable services to your customers. To do this with our own Cluster, we use <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">OVH Observability</a>.</p>



<p class="graf graf--p">OVH Observability is backend-agnostic, so we can push metrics with one format and query with another one. It can handle:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">Graphite</li><li class="graf graf--li">InfluxDB</li><li class="graf graf--li">Metrics2.0</li><li class="graf graf--li">OpenTSDB</li><li class="graf graf--li">Prometheus</li><li class="graf graf--li">Warp 10</li></ul>



<p class="graf graf--p">It also incorporates a managed <a class="markup--anchor markup--p-anchor" href="https://grafana.metrics.ovh.net" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.metrics.ovh.net" data-wpel-link="external">Grafana</a>, in order to display metrics and create monitoring dashboards.</p>



<h3 class="graf graf--h3 wp-block-heading">Monitoring Nodes</h3>



<p class="graf graf--p">The first thing to monitor is the health of nodes. Everything else starts from there.</p>



<p class="graf graf--p">In order to monitor your nodes, we will use <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/noderig" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/noderig" data-wpel-link="external">Noderig</a> and <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/beamium" data-wpel-link="external">Beamium</a>, as described <a href="/monitoring-guidelines-for-ovh-observability/" data-wpel-link="internal">here</a>. We will also use Kubernetes DaemonSets to start the process on all our nodes.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg" alt="" class="wp-image-15024" width="768" height="578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135.jpg 2039w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p class="graf graf--p">So let’s start creating a namespace&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create namespace metrics</code></pre>



<p class="graf graf--p">Next, create a secret containing the Metrics write token, which you can find in the OVH Control Panel.</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create secret generic w10-credentials --from-literal=METRICS_TOKEN=your-token -n metrics</code></pre>



<p class="graf graf--p">Copy the following <code class="markup--code markup--p-code">metrics.yml</code> into a file:</p>



<pre title="metrics.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># This will configure Beamium to scrape Noderig
# and push the metrics to Warp 10.
# We also add the HOSTNAME to the labels of the pushed metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: beamium-config
  namespace: metrics
data:
  config.yaml: |
    scrapers:
      noderig:
        url: http://0.0.0.0:9100/metrics
        period: 30000
        format: sensision
        labels:
          app: noderig

    sinks:
      warp:
        url: https://warp10.gra1.metrics.ovh.net/api/v0/update
        token: $METRICS_TOKEN

    labels:
      host: $HOSTNAME

    parameters:
      log-file: /dev/stdout
---
# This is a custom collector that report the uptime of the node
apiVersion: v1
kind: ConfigMap
metadata:
  name: noderig-collector
  namespace: metrics
data:
  uptime.sh: |
    #!/bin/sh
    echo 'os.uptime' `date +%s%N | cut -b1-10` `awk '{print $1}' /proc/uptime`
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: metrics-daemon
  namespace: metrics
spec:
  selector:
    matchLabels:
      name: metrics-daemon
  template:
    metadata:
      labels:
        name: metrics-daemon
    spec:
      terminationGracePeriodSeconds: 10
      hostNetwork: true
      volumes:
      - name: config
        configMap:
          name: beamium-config
      - name: noderig-collector
        configMap:
          name: noderig-collector
          defaultMode: 0777
      - name: beamium-persistence
        emptyDir: {}
      containers:
      - image: ovhcom/beamium:latest
        imagePullPolicy: Always
        name: beamium
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: TEMPLATE_CONFIG
          value: /config/config.yaml
        envFrom:
        - secretRef:
            name: w10-credentials
            optional: false
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi
        workingDir: /beamium
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /beamium
          name: beamium-persistence
      - image: ovhcom/noderig:latest
        imagePullPolicy: Always
        name: noderig
        args: ["-c", "/collectors", "--net", "3"]
        volumeMounts:
        - mountPath: /collectors/60/uptime.sh
          name: noderig-collector
          subPath: uptime.sh
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi</code></pre>



<p class="graf graf--p"><em class="markup--em markup--p-em">Don’t hesitate to change the collector levels if you need more information.</em></p>



<p>Then apply the configuration with kubectl&#8230;</p>



<pre class="wp-block-code console"><code class="">$ kubectl apply -f metrics.yml
# Then, just wait a minute for the pods to start
$ kubectl get all -n metrics
NAME                       READY   STATUS    RESTARTS   AGE
pod/metrics-daemon-2j6jh   2/2     Running   0          5m15s
pod/metrics-daemon-t6frh   2/2     Running   0          5m14s

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/metrics-daemon    40        40        40      40           40          &lt;none&gt;          122d</code></pre>



<p class="graf graf--p">You can import our dashboard into your Grafana from <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9876" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9876" data-wpel-link="external">here</a>, and view some metrics about your nodes straight away.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1842" height="631" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png" alt="" class="wp-image-14899" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png 1842w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-768x263.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1024x351.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1200x411.png 1200w" sizes="auto, (max-width: 1842px) 100vw, 1842px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Kube Metrics</h3>



<p>As the OVH Kube is a managed service, you don&#8217;t need to monitor the apiserver, etcd, or control plane; the OVH Kubernetes team takes care of this. So we will focus on <a href="https://github.com/google/cadvisor/blob/master/info/v1/container.go" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">cAdvisor</a> metrics and <a href="https://github.com/kubernetes/kube-state-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kube state metrics</a>.</p>



<p>The most mature solution for dynamically scraping metrics inside the Kube (for now) is <a href="https://github.com/prometheus/prometheus" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a>.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg" alt="Kube metrics" class="wp-image-15033" width="512" height="385" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p class="graf graf--p"><em class="markup--em markup--p-em">In the next Beamium release, we should be able to reproduce the features of the Prometheus scraper.</em></p>



<p class="graf graf--p">To install the Prometheus server, you first need Helm on the cluster&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
    --clusterrole cluster-admin \
    --serviceaccount=kube-system:tiller
helm init --service-account tiller</code></pre>



<p class="graf graf--p">You then need to create the following two files:&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code> and <code class="markup--code markup--p-code">values.yml</code>.</p>



<pre title="prometheus.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># Based on https://github.com/prometheus/prometheus/blob/release-2.2/documentation/examples/prometheus-kubernetes.yml
serverFiles:
  prometheus.yml:
    remote_write:
    - url: "https://prometheus.gra1.metrics.ovh.net/remote_write"
      remote_timeout: 120s
      bearer_token: $TOKEN
      write_relabel_configs:
      # Filter metrics to keep
      - action: keep
        source_labels: [__name__]
        regex: "eagle.*|\
            kube_node_info.*|\
            kube_node_spec_taint.*|\
            container_start_time_seconds|\
            container_last_seen|\
            container_cpu_usage_seconds_total|\
            container_fs_io_time_seconds_total|\
            container_fs_write_seconds_total|\
            container_fs_usage_bytes|\
            container_fs_limit_bytes|\
            container_memory_working_set_bytes|\
            container_memory_rss|\
            container_memory_usage_bytes|\
            container_network_receive_bytes_total|\
            container_network_transmit_bytes_total|\
            machine_memory_bytes|\
            machine_cpu_cores"

    scrape_configs:
    # Scrape config for Kubelet cAdvisor.
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      
      relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        
      metric_relabel_configs:
      # Only keep important systemd services (docker|containerd|kubelet) and kubepods.
      # We also want machine_cpu_cores, which has no id, so we concatenate the metric name so it can be matched too.
      # The concatenated string is the id and the name, separated by a ;
      # `/;container_cpu_usage_seconds_total` OK
      # `/system.slice;container_cpu_usage_seconds_total` OK
      # `/system.slice/minion.service;container_cpu_usage_seconds_total` NOK, Useless
      # `/kubepods/besteffort/e2514ad43202;container_cpu_usage_seconds_total` Best Effort POD OK
      # `/kubepods/burstable/e2514ad43202;container_cpu_usage_seconds_total` Burstable POD OK
      # `/kubepods/e2514ad43202;container_cpu_usage_seconds_total` Guaranteed POD OK
      # `/docker/pod104329ff;container_cpu_usage_seconds_total` OK, Container that run on docker but not managed by kube
      # `;machine_cpu_cores` OK, there is no id on these metrics, but we want to keep them also
      - source_labels: [id,__name__]
        regex: "^((/(system.slice(/(docker|containerd|kubelet).service)?|(kubepods|docker).*)?);.*|;(machine_cpu_cores|machine_memory_bytes))$"
        action: keep
      # Remove useless parent keys like `/kubepods/burstable` or `/docker`
      - source_labels: [id]
        regex: "(/kubepods/burstable|/kubepods/besteffort|/kubepods|/docker)"
        action: drop
        # cAdvisor gives metrics per container, and sometimes also sums them up per pod
        # As we already have the children, we will sum them up ourselves, so we drop the POD-level metrics and keep the container metrics
        # POD-level metrics have no container_name, so we drop entries that only have a pod_name
      - source_labels: [container_name,pod_name]
        regex: ";(.+)"
        action: drop
    
    # Scrape config for service endpoints.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    # pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: host
      - action: labeldrop
        regex: (pod_template_generation|job|release|controller_revision_hash|workload_user_cattle_io_workloadselector|pod_template_hash)
</code></pre>
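<p>The cAdvisor <code>keep</code> rule above is easy to get wrong. As a sanity check, here is a short Python sketch (an illustration, not part of the setup) that runs the example strings from the comments through the same regex:</p>

```python
import re

# The `keep` rule concatenates the `id` and `__name__` labels with ";"
# before matching, exactly as in the metric_relabel_configs above.
KEEP = re.compile(
    r"^((/(system.slice(/(docker|containerd|kubelet).service)?"
    r"|(kubepods|docker).*)?);.*"
    r"|;(machine_cpu_cores|machine_memory_bytes))$"
)

samples = {
    "/;container_cpu_usage_seconds_total": True,
    "/system.slice;container_cpu_usage_seconds_total": True,
    "/system.slice/minion.service;container_cpu_usage_seconds_total": False,
    "/kubepods/burstable/e2514ad43202;container_cpu_usage_seconds_total": True,
    ";machine_cpu_cores": True,
}

for series, expected in samples.items():
    kept = KEEP.search(series) is not None
    assert kept == expected, series
```

<p>If you add your own services to the regex, extend the sample dictionary accordingly before shipping the config.</p>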



<pre title="values.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">alertmanager:
  enabled: false
pushgateway:
  enabled: false
nodeExporter:
  enabled: false
server:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: traefik
      ingress.kubernetes.io/auth-type: basic
      ingress.kubernetes.io/auth-secret: basic-auth
    hosts:
    - prometheus.domain.com
  image:
    tag: v2.7.1
  persistentVolume:
    enabled: false
</code></pre>



<p class="graf graf--p">Don’t forget to replace your token!</p>



<p>The Prometheus scraper is quite powerful. You can relabel your time series, keep a few that match your regex, etc. This config removes a lot of useless metrics, so don’t hesitate to tweak it if you want to see more cAdvisor metrics (for example).</p>



<p class="graf graf--p">&nbsp;Install it with Helm&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">helm install stable/prometheus \
    --namespace=metrics \
    --name=metrics \
    --values=values.yml \
    --values=prometheus.yml</code></pre>



<p class="graf graf--p">Then add a basic-auth secret&#8230;</p>



<pre class="wp-block-code console"><code class="">$ htpasswd -c auth foo
New password: &lt;bar>
Re-type new password:
Adding password for user foo
$ kubectl create secret generic basic-auth --from-file=auth -n metrics
secret "basic-auth" created</code></pre>



<p class="graf graf--p">You can access the Prometheus server interface through <code class="markup--code markup--p-code">prometheus.domain.com</code>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1876" height="809" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png" alt="" class="wp-image-14933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png 1876w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-768x331.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1024x442.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1200x517.png 1200w" sizes="auto, (max-width: 1876px) 100vw, 1876px" /></figure></div>



<p class="graf graf--p">You will see all the metrics for your Cluster, although only the ones you have filtered will be pushed to OVH Metrics.</p>



<p>The Prometheus interface is a good way to explore your metrics, as it&#8217;s quite straightforward to display them and monitor your infrastructure. You can find our dashboard <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9880" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9880" data-wpel-link="external">here</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1843" height="653" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png" alt="" class="wp-image-14900" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png 1843w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1024x363.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1200x425.png 1200w" sizes="auto, (max-width: 1843px) 100vw, 1843px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Resources Metrics</h3>



<p class="graf graf--p">As @<a class="markup--user markup--p-user" href="https://medium.com/u/7dfbd8de8b55" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/u/7dfbd8de8b55" data-anchor-type="2" data-user-id="7dfbd8de8b55" data-action-value="7dfbd8de8b55" data-action="show-user-card" data-action-type="hover" data-wpel-link="external">Martin Schneppenheim</a> said in this <a class="markup--anchor markup--p-anchor" href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" data-wpel-link="external">post</a>, in order to correctly manage a Kubernetes Cluster, you also need to monitor pod resources.</p>



<p>We will install <a class="markup--anchor markup--p-anchor" href="https://github.com/google-cloud-tools/kube-eagle" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/google-cloud-tools/kube-eagle" data-wpel-link="external">Kube Eagle</a>, which will fetch and expose some metrics about CPU and RAM requests and limits, so they can be fetched by the Prometheus server you just installed.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg" alt="Kube Eagle" class="wp-image-15035" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Create a file named <code class="markup--code markup--p-code">eagle.yml</code>.</p>



<pre title="eagle.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - pods
  verbs:
  - get
  - list
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
subjects:
- kind: ServiceAccount
  name: kube-eagle
  namespace: kube-eagle
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-eagle
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-eagle
  labels:
    app: kube-eagle
  name: kube-eagle
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-eagle
  name: kube-eagle
  labels:
    app: kube-eagle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
      labels:
        app: kube-eagle
    spec:
      serviceAccountName: kube-eagle
      containers:
      - name: kube-eagle
        image: "quay.io/google-cloud-tools/kube-eagle:1.0.0"
        imagePullPolicy: IfNotPresent
        env:
        - name: PORT
          value: "8080"
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
</code></pre>



<pre class="wp-block-code console"><code class="">$ kubectl create namespace kube-eagle
$ kubectl apply -f eagle.yml</code></pre>



<p class="graf graf--p">Next, import this <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9875/revisions" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9875/revisions" data-wpel-link="external">Grafana dashboard</a> (it’s the same dashboard as Kube Eagle’s, but ported to Warp10).</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1838" height="784" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png" alt="" class="wp-image-14901" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png 1838w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-768x328.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1024x437.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1200x512.png 1200w" sizes="auto, (max-width: 1838px) 100vw, 1838px" /></figure></div>



<p class="graf graf--p">You now have an easy way of monitoring your pod resources in the Cluster!</p>



<h3 class="graf graf--h3 wp-block-heading">Custom Metrics</h3>



<p>How does Prometheus know that it needs to scrape kube-eagle? If you look at the metadata in <code class="markup--code markup--p-code">eagle.yml</code>, you&#8217;ll see the following:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080" # The port where to find the metrics
  prometheus.io/path: "/metrics" # The path where to find the metrics</code></pre>



<p>These annotations will trigger the Prometheus auto-discovery process (described in <code class="markup--code markup--p-code">prometheus.yml</code> line 114).</p>



<p>This means you can easily add these annotations to pods or services that contain a Prometheus exporter, and then forward these metrics to OVH Observability. <a href="https://prometheus.io/docs/instrumenting/exporters/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">You can find a non-exhaustive list of Prometheus exporters here</a>.</p>
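<p>To see what such an exporter looks like, here is a minimal, self-contained sketch using only the Python standard library (the metric name <code>myapp_jobs_processed_total</code> is made up for the example; a real application would normally use one of the Prometheus client libraries instead):</p>

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Prometheus text exposition format: one "name value" line per sample.
METRICS = (
    "# TYPE myapp_jobs_processed_total counter\n"
    "myapp_jobs_processed_total 42\n"
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Serve on an ephemeral port and scrape ourselves once, as Prometheus would.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
payload = urllib.request.urlopen(
    f"http://127.0.0.1:{server.server_port}/metrics"
).read().decode()
server.shutdown()
print(payload)
```

<p>Annotate the pod that runs such a server with <code>prometheus.io/scrape: "true"</code> and the right port, and the <code>kubernetes-pods</code> job above will pick it up automatically.</p>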



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg" alt="" class="wp-image-15027" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Volumetrics Analysis</h3>



<p>As you saw in the&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code>, we&#8217;ve tried to filter out a lot of useless metrics. For example, a fresh cluster with cAdvisor, only three real production pods, and the whole kube-system and Prometheus namespaces generates about 2,600 metrics per node. With a smart cleaning approach, you can reduce this to 126 series.</p>



<p>Here&#8217;s a table to show the approximate number of metrics you will generate, based on the number of nodes&nbsp;<strong>(N)</strong> and the number of production pods <strong>(P) </strong>you have:</p>



<figure class="wp-block-table"><table><tbody><tr><td>&nbsp;</td><td><strong>Noderig</strong></td><td><strong>cAdvisor</strong></td><td><strong>Kube State</strong></td><td><strong>Eagle</strong></td><td><strong>Total</strong></td></tr><tr><td>nodes</td><td>N * 13<sup id="cite_ref-ned_1-3" class="reference">(1)</sup></td><td>N * 2<sup id="cite_ref-ned_1-3" class="reference">(2)</sup></td><td>N * 1<sup id="cite_ref-ned_1-3" class="reference">(3)</sup></td><td>N * 8<sup id="cite_ref-ned_1-3" class="reference">(4)</sup></td><td><strong>N * 24</strong></td></tr><tr><td>system.slice</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(5)</sup> * 16<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>0</td><td><strong>N * 80</strong></td></tr><tr><td>kube-system + kube-proxy + metrics</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>N * 160</strong></td></tr><tr><td>Production Pods</td><td>0</td><td>P * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>P * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>P * 32</strong></td></tr></tbody></table></figure>



<p>For example, if you run three nodes with 60 pods, you will generate 264 * 3 + 32 * 60 ~= 2,700 metrics.</p>
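<p>The same arithmetic can be written as a quick helper (this just restates the table above; the per-node constant is 24 + 80 + 160 = 264):</p>

```python
def estimated_series(nodes: int, production_pods: int) -> int:
    """Rough series-count estimate, from the volumetrics table above."""
    per_node = 24 + 80 + 160  # node metrics + system.slice + kube-system/kube-proxy/metrics
    per_pod = 32              # 26 cAdvisor metrics + 6 Kube Eagle metrics per production pod
    return nodes * per_node + production_pods * per_pod

print(estimated_series(3, 60))  # 3 nodes, 60 pods
```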



<p><em>NB: A pod has a unique name, so if you redeploy a deployment, you will create 32 new metrics each time.</em></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(1) Noderig metrics: <code class="markup--code markup--p-code">os.mem / os.cpu / os.disk.fs / os.load1 / os.net.dropped (in/out) / os.net.errs (in/out) / os.net.packets (in/out) / os.net.bytes (in/out)/ os.uptime</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(2) cAdvisor nodes metrics: <code class="markup--code markup--p-code">machine_memory_bytes / machine_cpu_cores</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(3) Kube state nodes metrics: <code class="markup--code markup--p-code">kube_node_info</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(4) Kube Eagle nodes metrics: <code class="markup--code markup--p-code">eagle_node_resource_allocatable_cpu_cores / eagle_node_resource_allocatable_memory_bytes / eagle_node_resource_limits_cpu_cores / eagle_node_resource_limits_memory_bytes / eagle_node_resource_requests_cpu_cores / eagle_node_resource_requests_memory_bytes / eagle_node_resource_usage_cpu_cores / eagle_node_resource_usage_memory_bytes</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(5) With our filters, we will monitor around five system.slice services.</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(6)&nbsp;Metrics are reported per container. A pod is a set of containers (a minimum of two): your container, plus the pause container for the network. So we can count 2 * 10 + 6 = 26 metrics per pod: 10 cAdvisor metrics per container (see (7)) and six for the network (see (8)). For system.slice, we will have 10 + 6, because it&#8217;s treated as a single container.</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(7) cAdvisor will provide these metrics for each container: <code class="markup--code markup--p-code">container_start_time_seconds / container_last_seen / container_cpu_usage_seconds_total / container_fs_io_time_seconds_total / container_fs_write_seconds_total / container_fs_usage_bytes / container_fs_limit_bytes / container_memory_working_set_bytes / container_memory_rss / container_memory_usage_bytes</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(8) cAdvisor will provide these metrics for each interface: <code class="markup--code markup--p-code">container_network_receive_bytes_total * per interface / container_network_transmit_bytes_total * per interface</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(9) <code class="markup--code markup--p-code">kube-dns / beamium-noderig-metrics / kube-proxy / canal / metrics-server&nbsp;</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(10) Kube Eagle pods metrics: <code class="markup--code markup--p-code"> eagle_pod_container_resource_limits_cpu_cores /  eagle_pod_container_resource_limits_memory_bytes / eagle_pod_container_resource_requests_cpu_cores / eagle_pod_container_resource_requests_memory_bytes / eagle_pod_container_resource_usage_cpu_cores / eagle_pod_container_resource_usage_memory_bytes</code></sup></p>



<h3 class="graf graf--h3 wp-block-heading">Conclusion</h3>



<p class="graf graf--p">As you can see, monitoring your Kubernetes Cluster with OVH Observability is easy. You don&#8217;t need to worry about how and where to store your metrics, leaving you free to focus on leveraging your Kubernetes Cluster to handle your business workloads effectively, as we have in the Machine Learning Services Team.</p>



<p class="graf graf--p">The next step will be to add an alerting system, to notify you when your nodes are down (for example). For this, you can use the free&nbsp;<a class="markup--anchor markup--p-anchor" href="https://studio.metrics.ovh.net/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://studio.metrics.ovh.net/" data-wpel-link="external">OVH Alert Monitoring</a>&nbsp;tool.</p>



<h4 class="graf graf--h4 graf-after--p wp-block-heading" id="a936">Stay in&nbsp;touch</h4>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to&nbsp;<a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join the Observability Gitter</a>&nbsp;or <a href="https://gitter.im/ovh/kubernetes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes Gitter!</a><br>Follow us on Twitter: <a href="https://twitter.com/OVH" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">@OVH</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Monitoring guidelines for OVH Observability</title>
		<link>https://blog.ovhcloud.com/monitoring-guidelines-for-ovh-observability/</link>
		
		<dc:creator><![CDATA[Kevin Georges]]></dc:creator>
		<pubDate>Thu, 07 Mar 2019 11:19:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14929</guid>

					<description><![CDATA[At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second. This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p graf-after--h3">At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second.</p>



<p class="graf graf--p graf-after--p">This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, memory, network, disk…) while application monitoring is based on the service and its scalability (requests, processing, business logic…).</p>



<p>We provide this service for internal teams, who enjoy the same experience as our customers. Basically, our Observability service is SaaS with a compatibility layer (supporting InfluxDB, OpenTSDB, Warp10, Prometheus, and Graphite) that allows it to integrate with most of the existing solutions out there. This way, a team that is used to a particular tool, or have already deployed a monitoring solution, won&#8217;t need to invest much time or effort when migrating to a fully managed and scalable service: they just pick a token, use the right endpoint, and they&#8217;re done. Besides, our compatibility layer offers a choice: you can push your data with OpenTSDB, then query it in either PromQL or WarpScript. Combining protocols in this way results in a unique open-source interoperability that delivers more value, with no restrictions created by a solution&#8217;s query capabilities.</p>



<h3 id="816f" class="graf graf--h3 graf-after--p wp-block-heading">Scollector, Snap, Telegraf, Graphite, Collectd…</h3>



<p class="graf graf--p graf-after--h3">Drawing on this experience, we collectively tried most of the collection tools, but we always arrived at the same conclusion: we were witnessing&nbsp;<strong class="markup--strong markup--p-strong">metrics bleeding.</strong> Each tool focused on scraping every reachable bit of data, which is great if you are a graph addict, but can be counterproductive from an operational point of view if you have to monitor thousands of hosts. While it&#8217;s possible to filter them, teams still need to understand the whole metrics set in order to know what needs to be filtered.</p>



<p class="graf graf--p graf-after--p">At OVH, we use laser-cut collections of metrics. Each host has a specific template (web server, database, automation…) that exports a fixed set of metrics, which can be used for health diagnostics and monitoring application performance.</p>



<p>This finely-grained management leads to greater understanding for operational teams, since they know what&#8217;s available and can progressively add metrics to manage their own services.</p>



<h3 id="9619" class="graf graf--h3 graf-after--p wp-block-heading">Beamium &amp; Noderig — The Perfect&nbsp;Fit</h3>



<p class="graf graf--p graf-after--h3">Our requirements were rather simple:<br>—&nbsp;<strong class="markup--strong markup--p-strong">Scalable</strong>: Monitor one node in the same way as we&#8217;d monitor thousands<br>—&nbsp;<strong class="markup--strong markup--p-strong">Laser-cut</strong>: Only collect the metrics that are relevant<br>—&nbsp;<strong class="markup--strong markup--p-strong">Reliable</strong>: We want metrics to be available even in the worst conditions<br>—&nbsp;<strong class="markup--strong markup--p-strong">Simple</strong>: Multiple plug-and-play components, instead of intricate ones<br>—&nbsp;<strong class="markup--strong markup--p-strong">Efficient</strong>: We believe in impact-free metrics collection</p>



<h4 id="babf" class="graf graf--h4 graf-after--p wp-block-heading">The first solution was&nbsp;Beamium</h4>



<p class="graf graf--p graf-after--h4"><a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://github.com/runabove/beamium" data-wpel-link="external">Beamium</a>&nbsp;handles two aspects of the monitoring process: application data <strong>scraping</strong> and metrics <strong>forwarding</strong>.</p>



<p class="graf graf--p graf-after--p">Application data is collected in the well-known and widely-used&nbsp;<a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/exposition_formats/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/exposition_formats/" data-wpel-link="external"><strong>Prometheus format</strong></a>. We chose Prometheus as the community was growing rapidly at the time, and many <a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/clientlibs/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/clientlibs/" data-wpel-link="external">instrumentation libraries</a> were available for it. There are two key concepts in Beamium: Sources and Sinks.</p>
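<p>To give an idea of what a Source looks like (this is a hedged sketch, not part of Beamium itself), a service only has to expose a plain-text HTTP endpoint in the Prometheus exposition format to become scrapeable. A minimal, stdlib-only Python version, with illustrative metric names, could look like this:</p>

```python
# Minimal sketch of a Prometheus-format /metrics endpoint that a scraper
# like Beamium could poll -- stdlib only, metric names are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(metrics):
    """Render a dict of {(name, labels): value} in the Prometheus text format."""
    lines = []
    for (name, labels), value in metrics.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    # Static sample values; a real exporter would read live counters.
    metrics = {("http_requests_total", (("method", "get"),)): 1027,
               ("process_cpu_seconds_total", ()): 12.5}

    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(self.metrics).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9100), MetricsHandler).serve_forever()
```

<p>Pointing a scraper at <code>http://127.0.0.1:9100/metrics</code> would then be enough to start collecting.</p>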



<p>The Sources, where Beamium scrapes data, are just Prometheus HTTP endpoints: supplying the HTTP endpoint, and optionally a few parameters, is all it takes. The scraped data is then routed to Sinks, and can be filtered during the routing process between a Source and a Sink. Sinks are Warp 10® endpoints, to which we push the data.</p>



<p class="graf graf--p graf-after--p">Once scraped, metrics are first stored on disk, before being routed to a Sink. This Disk Fail-Over (DFO) mechanism allows recovery from network or remote failures. This way, we retain the Prometheus pull logic locally, but decentralise it and reverse the flow into a push that feeds the platform, which has several advantages:</p>



<ul class="wp-block-list"><li>support for a transactional logic over the metrics platform</li><li>recovery from network partitioning or platform unavailability</li><li>dual writes with data consistency (as there&#8217;s otherwise no guarantee that two Prometheus instances would scrape the same data at the same timestamp)</li></ul>
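<p>The Disk Fail-Over idea itself fits in a few lines. The sketch below is a hypothetical reconstruction, not Beamium&#8217;s actual (Rust) implementation: batches are persisted before any network I/O, and a spooled file is only deleted once the push succeeds, so an outage never loses data:</p>

```python
# Hypothetical sketch of the Disk Fail-Over (DFO) idea. Paths, file naming
# and the push callback are illustrative, not Beamium's actual code.
import os, time, uuid

class DiskFailOver:
    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def store(self, batch: str) -> str:
        """Persist a scraped batch to the spool before any network I/O."""
        path = os.path.join(self.spool_dir,
                            f"{time.time_ns()}-{uuid.uuid4().hex}.metrics")
        with open(path, "w") as f:
            f.write(batch)
        return path

    def flush(self, push) -> int:
        """Try to push every spooled file, oldest first; keep failures on disk."""
        sent = 0
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path) as f:
                batch = f.read()
            try:
                push(batch)        # e.g. an HTTP POST to a Warp 10 sink
            except OSError:
                break              # sink unreachable: leave the file, retry later
            os.remove(path)
            sent += 1
        return sent
```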



<p>We have many different customers, some of whom use the Time Series store behind the Observability product to manage their product consumption or transactional changes over licensing. These use cases can&#8217;t be handled with Prometheus instances, which are better suited to metrics-based monitoring.</p>



<div class="wp-block-image graf graf--figure graf-after--p"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg" alt="From Prometheus to OVH Observability with Beamium" class="wp-image-14996" width="512" height="442" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-300x259.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-768x662.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1200x1035.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7.jpeg 1488w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 id="cd7f" class="graf graf--h4 graf-after--figure wp-block-heading">The second was Noderig</h4>



<p class="graf graf--p graf-after--h4">During conversations with some of our customers, we came to the conclusion that the existing tools needed a certain level of expertise if they were to be used at scale. For example, a team with a 20k node cluster with Scollector would end up with more than 10 million metrics, just for the nodes&#8230; In fact, depending on the hardware configuration, Scollector would generate between 350 and 1,000 metrics from a single node.</p>



<p class="graf graf--p graf-after--h4">That&#8217;s the reason behind <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Noderig</a>. We wanted it to be as simple to use as the node-exporter from Prometheus, but with more finely-grained metrics production as the default.</p>



<p>Noderig collects OS metrics (CPU, memory, disk, and network) using a simple level semantic. This allows you to collect the right amount of metrics for any kind of host, which is particularly suitable for containerized environments.</p>



<p class="graf graf--p graf-after--p">We made it compatible with Scollector&#8217;s custom collectors, to ease migration and allow for extensibility. External collectors are simple executables that provide data, which Noderig then collects like any other metric.</p>
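<p>Assuming the Scollector external-collector convention of one <code>name timestamp value [tag=value ...]</code> line per metric on stdout, a collector can be as small as this illustrative Python script (the metric name and the load-average source are made up for the example):</p>

```python
#!/usr/bin/env python3
# Sketch of an external collector in the Scollector style: it prints one
# "name timestamp value tag=value" line per metric on stdout, and the
# collecting agent picks the output up like any other metric.
import os, time

def collect():
    ts = int(time.time())
    load1, load5, load15 = os.getloadavg()   # Unix 1/5/15-minute load averages
    return [
        f"sys.loadavg {ts} {load1:.2f} period=1m",
        f"sys.loadavg {ts} {load5:.2f} period=5m",
        f"sys.loadavg {ts} {load15:.2f} period=15m",
    ]

if __name__ == "__main__":
    print("\n".join(collect()))
```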



<p class="graf graf--p graf-after--p">The collected metrics are available through a simple REST endpoint, allowing you to see your metrics in real time and easily integrate them with Beamium.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg" alt="Noderig and Beamium" class="wp-image-14998" width="512" height="364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-300x213.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-768x546.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1200x853.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF.jpeg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">Does it work?</h3>



<p class="graf graf--p graf-after--h3">Beamium and Noderig are extensively used at OVH, and support the monitoring of very large infrastructures. At the time of writing, we collect and store hundreds of millions of metrics using these tools. So they certainly seem to work!</p>



<p class="graf graf--p graf-after--h3">In fact, we&#8217;re currently working on the 2.0 release, which will be a rework, incorporating autodiscovery and hot reload.</p>



<h3 id="a936" class="graf graf--h4 graf-after--p wp-block-heading">Stay in&nbsp;touch</h3>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join our Gitter</a>!<br>Follow us on Twitter: @OVH</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
