<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Metrics Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/metrics/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/metrics/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 08 Jan 2021 10:54:57 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Metrics Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/metrics/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Erlenmeyer and PromQL compatibility</title>
		<link>https://blog.ovhcloud.com/erlenmeyer-and-promql-compatibility/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 16 Dec 2020 11:03:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20123</guid>

					<description><![CDATA[Today in the monitoring world, we see the rise of the Prometheus tool. It&#8217;s a great tool to deploy in your infrastructure, as it can scrape all of your servers or applications to retrieve, store and analyze their metrics. All you have to do is extract and run it; it does [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Today in the monitoring world, we see the rise of the <a href="https://prometheus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a> tool. It&#8217;s a great tool to deploy in your infrastructure, as it can scrape all of your servers or applications to retrieve, store and analyze their metrics. All you have to do is extract and run it; it does all the work by itself. Of course, Prometheus comes with some trade-offs (a pull model, late-ingestion handling) and some limits, as it only keeps your data for a couple of days.</p>
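<p>For instance, a minimal Prometheus scrape configuration looks like this (the target address is a placeholder):</p>

```yaml
# prometheus.yml -- scrape one node_exporter target every 15 seconds
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
```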



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0404-1024x537.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20266" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404.png 1200w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Context</h2>



<p>How is it possible to handle Prometheus long-term storage? A vast number of time series databases are now advertised as fully compatible with Prometheus. It&#8217;s easy to check that Prometheus ingest is working well; however, how can we validate the PromQL &#8211; or Prometheus query &#8211; part? A few months ago, <a href="https://promlabs.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromLabs</a> released a new tool called &#8220;<a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a>&#8220;. They recently created <a href="https://promlabs.com/promql-compliance-tests" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">this page</a>, where they reference the PromQL compliance test results of several products. In this blog post, we will see how this tool helped us improve our PromQL implementation. </p>



<h3 class="wp-block-heading">Compliance tester</h3>



<p>The <a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a> is open source and contains a full set of tests. It generates around 500 PromQL queries covering the vast majority of the language, including tests on simple scalars, selectors, time range functions, operators, and so on. The tool executes each query against both a reference Prometheus instance and the backend under test, and expects the backend to return the same result as the Prometheus output. It expects an exact match for all series metadata (tags and names). It is more flexible on timestamps, as a parameter lets you round the comparison to the millisecond. Finally, it checks the equality of the returned values; since many things can affect floating-point reproducibility, it computes an approximate equality. </p>
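<p>The comparison logic can be sketched as follows (a simplified illustration; the tester&#8217;s actual implementation may differ):</p>

```python
def timestamps_match(ts_a_ms, ts_b_ms, round_to_ms=1):
    # Timestamps may be rounded to a configurable number of milliseconds.
    return round(ts_a_ms / round_to_ms) == round(ts_b_ms / round_to_ms)

def values_match(a, b, fraction=0.00001):
    # Approximate float equality: the two values must agree within a
    # relative tolerance, since exact floating-point equality is fragile.
    if a == b:
        return True
    return abs(a - b) <= fraction * max(abs(a), abs(b))

print(values_match(1.0000000001, 1.0))  # True
print(values_match(1.01, 1.0))          # False
```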



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p>At Metrics, we use the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> TSDB with its own analytics query engine, <a href="https://warp10.io/content/03_Documentation/04_WarpScript" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpScript</a>. We decided to build an open source tool, <a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a>, to transpile PromQL queries into WarpScript. The compliance tester was a great help in validating parts of our implementation and in detecting which queries were not fully compliant.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="573" height="295" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0406.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20265" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406.png 573w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406-300x154.png 300w" sizes="(max-width: 573px) 100vw, 573px" /></figure></div>



<h3 class="wp-block-heading">Set up</h3>



<p>To start testing our PromQL experience, we set up a local Prometheus with a default configuration. This configuration makes Prometheus run and collect some &#8220;demo&#8221; metrics, which we then forwarded to one of our Metrics regions using Prometheus remote write. We added a local instance of Erlenmeyer to query the data stored in a distributed Warp10 backend. Then, we iterated on each set of tests of the PromLabs compliance tool to identify all the issues and improve our existing PromQL implementation. </p>
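<p>The forwarding step relies on Prometheus&#8217; standard remote write configuration; a minimal fragment might look like this (the endpoint URL is a placeholder, not our actual ingest address):</p>

```yaml
# prometheus.yml -- forward every collected sample to a remote backend
remote_write:
  - url: "https://metrics.example.com/prometheus/remote_write"  # hypothetical endpoint
```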



<p>To be compliant, we had to reduce the value precision required by the compliance tool: we set it to <code>0.001</code> instead of <code>0.00001</code>. We also had to remove the Warp10 <code>.app</code> label from the results, as we use this label to identify users on a Warp10 instance.</p>



<h3 class="wp-block-heading">A test query</h3>



<p>When running the test, you will get a full report of your failing queries. Let&#8217;s take an example:</p>



<pre class="wp-block-code"><code class="">RESULT: FAILED: Query returned different results:
  model.Matrix{
  	&amp;{
  		Metric: Inverse(DropResultLabels, s`{instance="demo.promlabs.com:10002", job="demo"}`),
  		Values: []model.SamplePair{
  			... // 52 identical elements
  			{Timestamp: s"1606323726.058", Value: Inverse(TranslateFloat64, float64(2.6928936527e+10))},
  			{Timestamp: s"1606323736.058", Value: Inverse(TranslateFloat64, float64(2.691644054725e+10))},
  			{
  				Timestamp: s"1606323746.058",
- 				Value:     Inverse(TranslateFloat64, float64(2.6922272529119648e+10)),
+ 				Value:     Inverse(TranslateFloat64, float64(2.689432207325e+10)),
  			},
  			{Timestamp: s"1606323756.058", Value: Inverse(TranslateFloat64, float64(2.6915188293125e+10))},
  			{Timestamp: s"1606323766.058", Value: Inverse(TranslateFloat64, float64(2.69215848005e+10))},
  			... // 4 identical elements
  		},
  	},
  }</code></pre>



<p>The test report includes all errors occurring during the run. In this example, we can see that, for a single series, almost all values are correct but one is invalid. It appears on two lines: the first, starting with &#8220;-&#8220;, is the expected value; the second, starting with &#8220;+&#8221;, is the value returned by the tested instance. In this case, the value isn&#8217;t precise enough (2.689 instead of 2.692).</p>
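<p>We can verify this with the two values from the report: their relative difference sits just above the <code>0.001</code> tolerance we configured, which is why the query fails (the tool&#8217;s exact error formula is an assumption here):</p>

```python
expected = 2.6922272529119648e+10  # reference Prometheus value (the "-" line)
got      = 2.689432207325e+10      # tested backend value (the "+" line)

# Relative error with respect to the expected value.
relative_error = abs(expected - got) / abs(expected)
print(round(relative_error, 6))  # 0.001038 -> just above a 0.001 tolerance
```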



<h2 class="wp-block-heading">Results</h2>



<p>Now that we have a full test set-up running, we can see what we improved based on its results. If you want the full details of the fixes, you can check the code update made <code><a href="https://github.com/ovh/erlenmeyer/pull/48" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></code>. This tool helped us fix parts of our implementation, sanitize known issues, learn which PromQL features we had missed, and detect a few new bugs! Let&#8217;s review the changes.</p>



<h3 class="wp-block-heading">Quick implementation fixes</h3>



<p>Running those tests was a great help in understanding some of the implementation errors we had made when trying to match PromQL behavior. For example, our time range functions were sampling before computing the operation; reversing those steps gave us a direct match with a native query. The tests also helped us fix some minor bugs in the handling of comparison operators, and in several functions such as label_replace, holt_winters, predict_linear and the full set of time functions (hour, minute, month&#8230;). </p>



<p>We also improved our handling of the PromQL aggregation modifiers: by and without. </p>



<h3 class="wp-block-heading">Sanitize known issues</h3>



<p>We had recently discovered that we were not matching PromQL behavior for the series name: we kept the name through all compute operations, whereas Prometheus only keeps it when it is relevant. The compliance tester helped us validate this specific update across all queries. </p>



<p>Because this tool tests the validity of a query against a native PromQL query, it also helped us sanitize our query output. We knew that, in the case of missing values or empty series, we were not fully compliant. We corrected the part of Erlenmeyer that handles the output so that it matches all the PromQL cases included in the tests.</p>



<h3 class="wp-block-heading">Unimplemented features</h3>



<p>Running the tests led us to discover that we had missed some native PromQL features. As a result, Erlenmeyer now supports unary operators and the &#8220;bool&#8221; keyword. Unary support allows expressions such as &#8220;-my_series&#8221;, for example. In PromQL, the bool keyword converts the result of a comparison to booleans: the series values become 1 or 0 depending on the condition, where 1 stands for true and 0 for false.</p>
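<p>As an illustration, the bool behavior can be sketched like this (a simplified, values-only model; Erlenmeyer&#8217;s real implementation operates on Warp10 series via WarpScript):</p>

```python
def greater_than_bool(series_values, threshold):
    # PromQL "my_series > bool 5": instead of filtering out points that
    # fail the condition, every value becomes 1.0 (true) or 0.0 (false).
    return [1.0 if v > threshold else 0.0 for v in series_values]

print(greater_than_bool([3.0, 7.5, 5.0, 9.1], 5))  # [0.0, 1.0, 0.0, 1.0]
```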



<h3 class="wp-block-heading">Open issues</h3>



<p>Running all the compliance tests and improving our code base led us to around 91% success. For the rest, we opened <a href="https://github.com/ovh/erlenmeyer/issues" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">new issues</a> on Erlenmeyer. We detected that: </p>



<ul class="wp-block-list"><li>the handling of the over_time functions is not correct when the range is below the data point frequency,</li><li>for rate, delta, increase and predict_linear, our results aren&#8217;t precise enough to match PromQL output when the range is below 5 minutes,</li><li>there are some minor bugs in the series selectors (!=) and in label_replace (some checks are missing in the parameter validators),</li><li>PromQL subqueries, as well as some functions, are not implemented: ^ and % between two series sets, and the deriv function.</li></ul>



<p>Those are the four remaining points needed to cover the full PromQL feature set with Erlenmeyer. Our <a href="https://github.com/ovh/erlenmeyer/blob/master/doc/promql.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a> already lists all the missing implementations. </p>



<h2 class="wp-block-heading">Actions</h2>



<p>This tool was a great help in improving our PromQL compliance, and we are happy with the result. Indeed, we reached 91% on the provided test suite: </p>



<pre class="wp-block-code"><code class="">General query tweaks:
*  Metrics test
================================================================================
Total: 496 / 541 (91.68%) passed</code></pre>



<p>Our next action is to release those fixes and improvements on all our Metrics regions. We are looking forward to hearing what you think about our PromQL implementation! </p>



<p>We now see a lot of projects implementing Prometheus writes and reads. These projects bring Prometheus many missing features, such as long-term storage, deletion, late ingestion, historical data analysis, HA&#8230; Being able to validate a PromQL implementation is a big challenge, and it is a great help in choosing the right backend for your needs. </p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Open Source Metrics family welcomes Catalyst and Erlenmeyer</title>
		<link>https://blog.ovhcloud.com/the-open-source-metrics-family-welcomes-catalyst-and-erlenmeyer/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 20 Mar 2020 09:43:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16747</guid>

					<description><![CDATA[At OVHcloud Metrics, we love open source! Our goal is to provide all of our users with a full experience. We rely on the Warp10 time series database, which enables us to build open source tools for our users&#8217; benefit. Let&#8217;s take a look at some in this blog post. Storage tool Our Infrastructure is based [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At <strong>OVHcloud Metrics</strong>, we love open source! Our goal is to provide all of our users with a <strong>full experience</strong>. We rely on the<strong> <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> time series database, which enables us to build <strong>open source tools</strong> for our users&#8217; benefit. Let&#8217;s take a look at some of them in this blog post. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png" alt="" class="wp-image-17633" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Storage tool</h3>



<p>Our infrastructure is based on the <strong>open source time series database: <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong>. This database comes in two versions: a stand-alone one and a <strong>distributed one</strong>. The distributed version relies on distributed tools such as <strong>Apache Kafka, Apache Hadoop</strong> and <strong>Apache HBase</strong>. </p>



<p>Unsurprisingly, our team makes its own <strong>contributions to the <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> platform. Due to our unique requirements, we even <strong>contribute</strong> to the <a href="https://www.ovh.com/blog/contributing-to-apache-hbase-custom-data-balancing/" data-wpel-link="exclude">underlying open source database <strong>HBase</strong></a>! </p>



<h3 class="wp-block-heading">Metrics data ingest</h3>



<p>As a matter of fact, the <strong>ingest process</strong> was the first thing we got stuck into! We often build dedicated tools to collect and push <strong>monitoring</strong> <strong>data</strong> to Warp10 &#8211; this is how <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Noderig</strong></a> came to life. Noderig is a tool that <strong>collects</strong> a <strong>simple core</strong> of metrics from <strong>any server or any virtual machine</strong>. To send these metrics safely to a backend, <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Beamium</strong></a>, a Rust tool, pushes the Noderig metrics to <strong>one or several Warp 10</strong> backend(s).</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="369" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png" alt="" class="wp-image-17638" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-300x108.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-768x277.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1536x553.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E.png 1857w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><em>What if I want to collect my <strong>own custom metrics</strong>?</em> First, you&#8217;ll need to expose them following the &#8216;Prometheus model&#8217;. <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Beamium</a> is then able to <strong>scrape</strong> applications based on its configuration file and forward all the data to the configured Warp 10 backend(s)! </p>



<p>If you are looking to monitor <strong>specific applications</strong> using the <strong><a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Influx Telegraf</a> agent</strong> (in order to expose the Metrics you require) we have also contributed the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/outputs/warp10" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 Telegraf</a> <strong>connector</strong>, which was recently merged!</p>



<p><em>This looks great so far, but what if I usually push Graphite, Prometheus, Influx or OpenTSDB metrics; <strong>how can I simply migrate to Warp10</strong>? </em>Our answer is <strong>Catalyst</strong>: a proxy layer that is able to parse metrics in each of these formats and convert them to the native Warp10 format. </p>



<h3 class="wp-block-heading">Catalyst</h3>



<p><a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a> is a Go HTTP proxy that parses the write protocols of multiple open source time series databases. At the moment, it supports OpenTSDB, PromQL, Prometheus remote write, Influx and Graphite. Catalyst runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>, starting with the <strong>time series protocol name</strong> followed by the <strong>native query</strong> path. For example, in order to send Influx data, you simply send a request to <code>influxdb/write</code>. Catalyst natively parses the <code>influx</code> data and converts it to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a> native ingest format.</p>
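<p>A highly simplified sketch of that kind of conversion is shown below. The <code>measurement.field</code> class-naming scheme and the single-tag, single-field parsing are illustrative assumptions, not Catalyst&#8217;s exact mapping; Influx timestamps are in nanoseconds, while Warp10 ingests microseconds:</p>

```python
def influx_to_warp10(line):
    # Parse an Influx line protocol entry such as:
    #   "cpu,host=a usage=0.5 1600000000000000000"
    head, fields, ts = line.rsplit(" ", 2)
    measurement, *tags = head.split(",")
    labels = ",".join(tags)
    out = []
    for field in fields.split(","):
        name, value = field.split("=")
        ts_us = int(ts) // 1000  # Influx ns -> Warp10 us
        # Warp10 ingest format: TS// class{labels} value
        out.append(f"{ts_us}// {measurement}.{name}{{{labels}}} {value}")
    return out

print(influx_to_warp10("cpu,host=a usage=0.5 1600000000000000000"))
```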



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="700" height="158" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png" alt="" class="wp-image-17636" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png 700w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5-300x68.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure></div>



<h3 class="wp-block-heading">Metrics queries</h3>



<p>Data collection is an important first step, but we have also considered how existing query Monitoring protocols could be used on top of Warp10. This has led us to implement <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a>. TSL was discussed at length during the <a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series Meetup</a> as well as in this <a href="https://www.ovh.com/blog/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/" data-wpel-link="exclude">blog post</a>. </p>



<p>Now let&#8217;s take a user who is using Telegraf and pushing data to Warp10 with Catalyst. They will wish to use the <strong>native Influx Grafana dashboards</strong>, but how? And what about users who <strong>automate</strong> queries with the <strong>OpenTSDB</strong> query protocol? Our answer was to develop a proxy: <strong>Erlenmeyer</strong>.</p>



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Erlenmeyer</strong></a> is a Go HTTP proxy that enables users to query <strong>Warp 10</strong> using <strong>open source</strong> query protocols. At the moment, it supports multiple open source time series formats, such as PromQL, Prometheus remote read, InfluxQL, OpenTSDB or Graphite. Erlenmeyer runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>, starting with the <strong>time series protocol name</strong> followed by the <strong>native query</strong> path. For example, to run a PromQL query, the user sends a request to <code>prometheus/api/v0/query</code>. Erlenmeyer natively parses the <code>promQL</code> request and then builds a native WarpScript request that any <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> backend can support. </p>
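<p>For example, such a query URL can be assembled as follows (the host, query and timestamp are placeholders; only the path shape comes from the description above):</p>

```python
from urllib.parse import urlencode

def promql_query_url(base, query, ts):
    # Erlenmeyer exposes each protocol under its own path prefix;
    # PromQL instant queries go to prometheus/api/v0/query.
    return f"{base}/prometheus/api/v0/query?" + urlencode({"query": query, "time": ts})

print(promql_query_url("http://localhost:8080", "up", 1606323746))
```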



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="687" height="228" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png" alt="" class="wp-image-17635" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png 687w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E-300x100.png 300w" sizes="auto, (max-width: 687px) 100vw, 687px" /></figure></div>



<h2 class="wp-block-heading">To be continued</h2>



<p>At first, <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> were a quick implementation of native protocols, aimed at helping <strong>internal teams</strong> migrate while still <strong>using a familiar tool</strong>. We have now integrated a lot of the <strong>native functionality</strong> of each protocol, and feel they are ready for sharing. It&#8217;s time to make them <strong>available to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> community</strong>, so we can receive <strong>feedback</strong> and continue to work hard on supporting open source protocols. You can find us in the <strong>OVHcloud Metrics <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">gitter room</a></strong>!</p>



<p>Other Warp10 users may require protocols that are not yet implemented. They will be able to use <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> to support them on their <strong>own Warp10 backends</strong>.</p>



<p><strong>Welcome</strong> <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> &#8211; <strong>Metrics Open Source projects</strong>!</p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Jerem: An Agile Bot</title>
		<link>https://blog.ovhcloud.com/jerem-an-agile-bot/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 16:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Agile Telemetry]]></category>
		<category><![CDATA[Agility]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16943</guid>

					<description><![CDATA[At OVHCloud, we are open sourcing our “Agility Telemetry” project. Jerem, as our data collector, is the main component of this project. Jerem scrapes our JIRA at regular intervals, and extracts specific metrics for each project. It then forwards them to our long-term storage application, the OVHCloud Metrics Data Platform.&#160;&#160; Agility concepts from a developer&#8217;s [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At OVHCloud, we are open sourcing our “Agility Telemetry” project. <strong><a href="https://github.com/ovh/jerem" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jerem</a></strong>, our data collector, is the main component of this project. Jerem scrapes our <strong>JIRA</strong> at regular intervals and extracts <strong>specific metrics</strong> for each project. It then forwards them to our long-term storage application, the <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>.&nbsp;&nbsp;</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg" alt="Jerem: an agile bot" class="wp-image-17160" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Agility concepts from a developer&#8217;s point of view</h3>



<p>To help you understand our goals for <strong>Jerem</strong>, we need to explain some agility concepts we will be using. First, we establish a <strong>technical quarterly roadmap</strong> for a product, which sets out all the <strong>features</strong> we <strong>plan to release</strong> every three months. Each of these features is what we call an <strong>epic</strong>.&nbsp;</p>



<p>For each epic, we identify the tasks that will need to be completed. We then evaluate the complexity of each of those tasks using <strong>story points</strong>, during a team preparation session. A story point reflects the effort required to complete the specific JIRA task. </p>



<p>Then, to advance our roadmap, we conduct regular <strong>sprints</strong>, each corresponding to a period of <strong>ten days</strong>, during which the team takes on several tasks. The number of story points taken into a sprint should match, or be close to, the <strong>team velocity</strong>: the average number of story points that the team is able to complete each day.</p>
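<p>As a rough illustration with hypothetical numbers, velocity and the resulting sprint capacity can be computed like this:</p>

```python
# Hypothetical history: story points completed in each past 10-day sprint.
completed_per_sprint = [38, 42, 40]
sprint_days = 10

# Velocity as defined above: average story points completed per day.
velocity = sum(completed_per_sprint) / (len(completed_per_sprint) * sprint_days)

# The number of story points to take into the next sprint should be close to:
capacity = velocity * sprint_days
print(velocity, capacity)  # 4.0 40.0
```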



<p>However, other urgent tasks may arise unexpectedly during sprints. That’s what we call an <strong>impediment</strong>. We might, for example, need to factor in helping customers, bug fixes, or urgent infrastructure tasks.&nbsp;&nbsp;&nbsp;</p>



<h3 class="wp-block-heading">How Jerem works </h3>



<p>At OVH, we use JIRA to track our activity. Our <strong>Jerem</strong> bot scrapes our <strong>projects</strong> <strong>from</strong> <strong>JIRA</strong> and exports all the necessary data to our <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>. Jerem can also push data to any Warp 10-compatible database. In Grafana, you simply query the Metrics platform (using the <a href="https://github.com/ovh/ovh-warp10-datasource" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 datasource</a>), with, for example, our <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">program management dashboard</a>. All your KPIs are now available in a nice dashboard!</p>
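<p>As a rough illustration of what pushing a datapoint to a Warp 10-compatible backend looks like, here is a minimal Python sketch. The endpoint URL, token, and helper names are placeholders of ours; the line itself follows Warp 10&#8217;s plain-text ingress format (<code>TS// class{labels} value</code>, with timestamps in microseconds):</p>

```python
import time
import urllib.request

WARP10_URL = "https://warp10.example.org/api/v0/update"  # placeholder endpoint
WRITE_TOKEN = "YOUR_WRITE_TOKEN"                         # placeholder token

def format_datapoint(name, labels, value, ts_us=None):
    """One line of Warp 10's plain-text ingress format: 'TS// class{labels} value'."""
    ts_us = ts_us or int(time.time() * 1_000_000)  # Warp 10 timestamps are in microseconds
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{ts_us}// {name}{{{label_str}}} {value}"

def push(line):
    """POST one datapoint to the /api/v0/update endpoint."""
    req = urllib.request.Request(
        WARP10_URL,
        data=line.encode(),
        headers={"X-Warp10-Token": WRITE_TOKEN},
        method="POST",
    )
    return urllib.request.urlopen(req)  # raises on HTTP errors

line = format_datapoint(
    "jerem.jira.epic.storypoint", {"project": "SAN"}, 14, ts_us=1600000000000000)
print(line)  # → 1600000000000000// jerem.jira.epic.storypoint{project=SAN} 14
```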



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="256" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg" alt="" class="wp-image-17164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-300x75.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-768x192.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1536x384.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45.jpeg 1720w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Discover Jerem metrics</h3>



<p>Now that we have an overview of the main Agility concepts involved, let&#8217;s dive into Jerem! How do we convert those Agility concepts into metrics? First of all, we&#8217;ll retrieve all metrics related to epics (i.e. new features). Then, we will have a deep look at the sprint metrics.</p>



<h4 class="wp-block-heading">Epic data</h4>



<p>To explain Jerem epic metrics, we&#8217;ll start by creating a new one. In this example, we called it <code>Agile Telemetry</code>. We add a Q2-20 label, which means that we plan to release it for Q2. To record an epic with Jerem, you need to set a quarter for the final delivery! Next, we&#8217;ll simply add four tasks, as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="651" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Epic-1-1024x651.png" alt="" class="wp-image-16984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-1024x651.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-300x191.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-768x489.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1.png 1182w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To get the metrics, we need to evaluate each individual task. We do this together during preparation sessions. In this example, we have assigned story points to each task. For example, we estimated the <code>write a BlogPost about Jerem</code> task as a 3.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="472" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png" alt="" class="wp-image-16957" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1536x709.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10.png 1697w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a result, Jerem now has everything it needs to start collecting epic metrics. This example provides five metrics:</p>



<ul class="wp-block-list"><li><code>jerem.jira.epic.storypoint</code>: the total number of story points needed to complete this epic. The value here is 14 (the sum of all the epic&#8217;s story points). This metric will evolve whenever the epic is updated by adding or removing tasks.</li><li><code>jerem.jira.epic.storypoint.done</code>: the number of completed story points. In our example, we have already completed <code>Write Jerem bot</code> and <code>Deploy Jerem Bot</code>, which account for eight story points.</li><li><code>jerem.jira.epic.storypoint.inprogress</code>: the number of story points for &#8216;in progress&#8217; tasks, such as <code>Write a BlogPost about Jerem</code>.</li><li><code>jerem.jira.epic.unestimated</code>: the number of unestimated tasks, shown as <code>Unestimated Task</code> in our example.</li><li><code>jerem.jira.epic.dependency</code>: the number of tasks that have dependency labels, indicating that they are mandatory for other epics or projects.</li></ul>
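<p>As an illustrative sketch of how these five metrics could be derived from an epic&#8217;s task list (the field names and story-point values below are our assumptions, not Jerem&#8217;s actual JIRA schema):</p>

```python
# Hypothetical task list; statuses and points are invented for illustration.
tasks = [
    {"summary": "Write Jerem bot",              "points": 5,    "status": "Done",        "labels": []},
    {"summary": "Deploy Jerem Bot",             "points": 3,    "status": "Done",        "labels": []},
    {"summary": "Write a BlogPost about Jerem", "points": 3,    "status": "In Progress", "labels": []},
    {"summary": "Unestimated Task",             "points": None, "status": "To Do",       "labels": []},
]

def epic_metrics(tasks):
    """Derive the five per-epic metrics from a list of JIRA tasks."""
    return {
        "jerem.jira.epic.storypoint": sum(t["points"] or 0 for t in tasks),
        "jerem.jira.epic.storypoint.done": sum(
            t["points"] or 0 for t in tasks if t["status"] == "Done"),
        "jerem.jira.epic.storypoint.inprogress": sum(
            t["points"] or 0 for t in tasks if t["status"] == "In Progress"),
        "jerem.jira.epic.unestimated": sum(
            1 for t in tasks if t["points"] is None),
        "jerem.jira.epic.dependency": sum(
            1 for t in tasks if "dependency" in t["labels"]),
    }

print(epic_metrics(tasks)["jerem.jira.epic.storypoint.done"])  # → 8
```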



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="443" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metris-epics-1024x443.png" alt="" class="wp-image-16958" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1024x443.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-300x130.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-768x332.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1536x665.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics.png 1784w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This way, for each epic in a project, Jerem collects five unique metrics.  </p>



<h4 class="wp-block-heading">Sprint data</h4>



<p>To complete epic tasks, we work using a <strong>sprint</strong> process. When doing sprints, we want to provide a lot of <strong>insights</strong> into our <strong>achievements</strong>. That&#8217;s why Jerem collects sprint data too! </p>



<p>So let&#8217;s open a new sprint in JIRA and start working on our task. This gives us the following JIRA view:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="241" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Sprint-ui-1024x241.png" alt="" class="wp-image-16963" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1024x241.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-300x71.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-768x181.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1536x362.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui.png 1804w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Jerem collects the following metrics for each sprint:&nbsp;</p>



<ul class="wp-block-list"><li><code>jerem.jira.sprint.storypoint.total</code>: the total number of story points onboarded into a sprint.</li><li><code>jerem.jira.sprint.storypoint.inprogress</code>: the number of story points currently in progress within a sprint.</li><li><code>jerem.jira.sprint.storypoint.done</code>: the number of story points currently completed within a sprint.</li><li><code>jerem.jira.sprint.events</code>: the &#8216;start&#8217; and &#8216;end&#8217; dates of sprint events, recorded as Warp10 string values.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png" alt="" class="wp-image-16964" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-300x141.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-768x360.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1536x720.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the Metrics view above, we record every sprint metric twice. We do this to provide a quick view of the active sprint, which is why we use the &#8216;current&#8217; label. This also enables us to query past sprints, using the real sprint name. Of course, an active sprint can also be queried using its name.</p>



<h4 class="wp-block-heading">Impediment data</h4>



<p>Starting a sprint means you need to know all the tasks you will be working on over the next few days. But how can we track and measure unplanned tasks? For example, the very urgent one from your manager, or a teammate who needs a bit of help?</p>



<p>We can add special tickets in JIRA to keep track of those tasks. That&#8217;s what we call an &#8216;impediment&#8217;. They are labelled according to their nature. If, for example, production requires your attention, then it&#8217;s an &#8216;Infra&#8217; impediment. You will also retrieve metrics for &#8216;Total&#8217; (all kinds of impediments), &#8216;Excess&#8217; (unplanned tasks), &#8216;Support&#8217; (helping teammates), and &#8216;Bug fixes or other&#8217; (all other kinds of impediment).</p>



<p>Each impediment belongs to the active sprint it was closed in. To close an impediment, you only have to flag it as &#8216;Done&#8217; or &#8216;Closed&#8217;.</p>



<p>We also retrieve metrics like:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.TYPE.count</code>: the number of impediments that occurred during a sprint.</li><li><code>jerem.jira.impediment.TYPE.timespent</code>: the amount of time spent on impediments during a sprint.</li></ul>



<p><code>TYPE</code> corresponds to the <strong>kind</strong> of recorded impediment. As we didn&#8217;t open any actual impediments, Jerem collects only the <code>total</code> metrics.</p>
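<p>To make the <code>TYPE</code> naming concrete, here is an illustrative sketch of aggregating impediment tickets into per-type <code>count</code> and <code>timespent</code> series, with every impediment also feeding the <code>total</code> series. The field names and hour-based durations are our assumptions, not Jerem&#8217;s actual schema:</p>

```python
from collections import defaultdict

# Invented tickets, for illustration only.
impediments = [
    {"type": "infra",   "timespent_h": 2.0},
    {"type": "support", "timespent_h": 0.5},
    {"type": "infra",   "timespent_h": 1.0},
]

def impediment_metrics(impediments):
    """Aggregate tickets into jerem.jira.impediment.TYPE.{count,timespent}."""
    metrics = defaultdict(float)
    for imp in impediments:
        for t in (imp["type"], "total"):  # every impediment also counts in 'total'
            metrics[f"jerem.jira.impediment.{t}.count"] += 1
            metrics[f"jerem.jira.impediment.{t}.timespent"] += imp["timespent_h"]
    return dict(metrics)

print(impediment_metrics(impediments)["jerem.jira.impediment.total.count"])  # → 3.0
```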



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png" alt="" class="wp-image-16965" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-768x325.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1536x650.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment.png 1773w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To start recording impediments, we simply create a new JIRA task, to which we add an &#8216;impediment&#8217; label. We also set its nature, and the actual time spent on it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="819" height="903" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png" alt="" class="wp-image-16967" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png 819w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-768x847.png 768w" sizes="auto, (max-width: 819px) 100vw, 819px" /></figure>



<p>For impediments, we also retrieve the global metrics that Jerem always records:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.total.created</code>: the time spent from the creation date to complete an impediment. This allows us to retrieve a total impediment count. We can also record all impediment actions, even outside sprints.&nbsp;</li></ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>For a single JIRA project, like our example, you can expect around 300 metrics. This number will vary depending on the epics you create and flag in JIRA, and the ones you close.</p></blockquote>



<h3 class="wp-block-heading">Grafana dashboard</h3>



<p>We love building Grafana dashboards! They provide both the team and the manager with a lot of insights into KPIs. The best part of it for me, as a developer, is that I can see why it&#8217;s worth filling in a JIRA task!</p>



<p>In our first <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Grafana dashboard</a>, you will find all the key program management KPIs. Let&#8217;s start with the global overview:</p>



<h4 class="wp-block-heading">Quarter data overview</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png" alt="" class="wp-image-16968" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox.png 1840w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here, you will find the epics currently in progress, as well as the global team KPIs, such as predictability, velocity, and impediment stats. This is where the magic happens! When filled in correctly, this dashboard shows you exactly what your team should deliver in the current quarter. This means you have quick access to all the important current subjects. You will also be able to see if your team is expected to deliver too many subjects, so you can quickly take action and delay some of the new features.</p>



<h4 class="wp-block-heading">Active sprint data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/sprintdata-1024x286.png" alt="" class="wp-image-16969" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-768x214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1536x428.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata.png 1839w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The active sprint data panel is often a great support during our daily meetings. In this panel, we get a quick overview of the team&#8217;s achievements, and can establish the time spent on parallel tasks. </p>



<h4 class="wp-block-heading">Detailed data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/detail-KPI-1024x286.png" alt="" class="wp-image-16970" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-768x215.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1536x429.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI.png 1847w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The last part provides more detailed data. Using the epic Grafana variable, we can check specific epics, along with the completion of more global projects. There is also a <code>velocity chart</code>, which plots past sprints and compares the expected story points to the ones actually completed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The Grafana dashboard is directly available in the Jerem project. You can import it directly into Grafana, provided you have a valid Warp 10 datasource configured. </p><p>To make the dashboard work as required, you have to configure the Grafana project variable in the form of a WarpScript list: <code>[ 'SAN' 'OTHER-PROJECT' ]</code>. If our program manager can do it, I am sure you can! 😉 </p></blockquote>



<p>Setting up Jerem and automatically loading program management data gives us a lot of insights. As a developer, I really enjoy it, and I&#8217;ve quickly become used to tracking far more events in JIRA than I did before. You can directly see the impact of your tasks. For example, you can see how quickly the roadmap is advancing, and identify the bottlenecks that are causing impediments. Those bottlenecks then become epics. In other words, once we start using Jerem, we just auto-fill it! I hope you will enjoy it too! If you have any feedback, we would love to hear it.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Contributing to Apache HBase: custom data balancing</title>
		<link>https://blog.ovhcloud.com/contributing-to-apache-hbase-custom-data-balancing/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Fri, 14 Feb 2020 16:37:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16524</guid>

<description><![CDATA[In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring. The context Have you ever wondered how: we generate the graphs for your OVHcloud server or web hosting package? our internal teams monitor their [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg" alt="Contributing to Apache HBase: custom data balancing" class="wp-image-17086" width="1024" height="537" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">The context</h3>



<p>Have you ever wondered how:</p>



<ul class="wp-block-list"><li>we generate the graphs for your OVHcloud server or web hosting package? </li><li>our internal teams monitor their own servers and applications?</li></ul>



<p><strong>All internal teams are constantly gathering telemetry and monitoring data</strong> and sending them to a <strong>dedicated team,</strong> who are responsible for <strong>handling all the metrics and logs generated by OVHcloud&#8217;s infrastructure</strong>: the Observability team.</p>



<p>We tried a lot of different <strong>Time Series databases</strong>, and eventually chose <a href="https://warp10.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> to handle our workloads. <strong>Warp10</strong> can be integrated with the various <strong>big-data solutions</strong> provided by the <a href="https://www.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Foundation.</a> In our case, we use <a href="http://hbase.apache.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a> as the long-term storage datastore for our metrics. </p>



<p><a href="http://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, a datastore built on top of <a href="http://hadoop.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hadoop</a>, provides <strong>an elastic, distributed, key-ordered map.</strong> As such, one of the key features of Apache HBase for us is the ability to <strong>scan</strong>, i.e. retrieve a range of keys. Thanks to this feature, we can fetch <strong>thousands of datapoints in an optimised way</strong>.</p>



<p>We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:</p>



<ul class="wp-block-list"><li>between 1.6 and 2 million writes per second, 24/7</li><li>between 4 and 6 million reads per second</li><li>around 300TB of telemetry, stored within Apache HBase</li></ul>



<p>As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding repartition, as <strong>every</strong> <strong>bit is hot data, and should be accessible at any time</strong>. Let&#8217;s dive in!</p>



<h3 class="wp-block-heading">How does balancing work in Apache HBase?</h3>



<p>Before diving into the balancer, let&#8217;s take a look at how Apache HBase organises data. Data is split into shards called <code>Regions</code>, and distributed across <code>RegionServers</code>. The number of regions increases as data comes in, and regions are split as a result. This is where the <code>Balancer</code> comes in. It <strong>moves regions</strong> to avoid hotspotting a single <code>RegionServer</code>, and effectively distributes the load.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg" alt="" class="wp-image-17007" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The actual implementation, called <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StochasticBalancer</a>, uses <strong>a cost-based approach:</strong></p>



<ol class="wp-block-list"><li>It first computes the <strong>overall cost</strong> of the cluster, by looping through the <code>cost functions</code>. Every cost function <strong>returns a number between 0 and 1 inclusive</strong>, where 0 is the lowest-cost, best solution, and 1 is the highest-cost, worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, and the number of regions per RegionServer&#8230; The computed costs are <strong>scaled by their respective coefficients, defined in the configuration</strong>. </li><li>Now that the initial cost is computed, we can try to <code>Mutate</code> our cluster. For this, the Balancer creates a random <code>nextAction</code>, which could be something like <strong>swapping two regions</strong>, or <strong>moving one region to another RegionServer</strong>. The action is <strong>applied virtually</strong>, and then the <strong>new cost is calculated</strong>. If the new cost is lower than the previous one, the action is stored. If not, it is skipped. This operation is repeated <code>thousands of times</code>, hence the <code>Stochastic</code>. </li><li>At the end,<strong> the list of valid actions is applied to the actual cluster. </strong></li></ol>
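<p>The loop described above can be sketched in a few lines. This is a toy illustration rather than HBase&#8217;s actual Java implementation: the cluster is reduced to a map of RegionServers to regions, the only mutation is moving a single region, and the single cost function measures region-count skew:</p>

```python
import random
import statistics

def skew_cost(cluster):
    """Region-count skew, normalised to [0, 1]: 0 when perfectly even."""
    counts = [len(regions) for regions in cluster.values()]
    total = sum(counts)
    # worst case: every region piled onto a single server
    worst = statistics.pstdev([total] + [0] * (len(counts) - 1))
    return statistics.pstdev(counts) / worst if worst else 0.0

def balance(cluster, steps=2000, seed=42):
    """Stochastic loop: propose a random move, keep it only if the cost drops."""
    rng = random.Random(seed)
    servers = list(cluster)
    cost = skew_cost(cluster)
    moves = []
    for _ in range(steps):
        src, dst = rng.sample(servers, 2)  # random "nextAction": move one region
        if not cluster[src]:
            continue
        region = cluster[src].pop()        # apply the action virtually
        cluster[dst].append(region)
        new_cost = skew_cost(cluster)
        if new_cost < cost:                # improvement: store the action
            cost = new_cost
            moves.append((region, src, dst))
        else:                              # no improvement: undo it
            cluster[dst].pop()
            cluster[src].append(region)
    return moves

cluster = {"rs1": [f"r{i}" for i in range(9)], "rs2": ["r9"], "rs3": []}
balance(cluster)
print(sorted(len(regions) for regions in cluster.values()))  # → [3, 3, 4]
```

<p>In the real <code>StochasticLoadBalancer</code>, the proposed actions also include region swaps, and the cost is a weighted sum of many functions (locality, load, table skew&#8230;), but the accept-if-cheaper loop is the same idea.</p>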



<h3 class="wp-block-heading">What was not working for us?</h3>



<p>We found out that <strong>for our specific use case</strong>, which involved:</p>



<ul class="wp-block-list"><li>Single table</li><li>Dedicated Apache HBase and Apache Hadoop, <strong>tailored for our requirements</strong></li><li>Good key distribution</li></ul>



<p>&#8230; <strong>the number of regions per RegionServer was the real limit for us</strong>.</p>



<p>Even if the balancing strategy seems simple, <strong>we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital</strong>, especially in cloud environments, because you <strong>may not be able to buy the same server specs again in the future.</strong> In our case, our cluster grew from 80 to ~250 machines in four years. Throughout that time, we bought new dedicated server references, and even tested some special internal references.</p>



<p>We ended up with different groups of hardware: <strong>some servers can handle only 180 regions, whereas the biggest can handle more than 900</strong>. Because of this disparity, we had to disable the Load Balancer to avoid the <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L1194" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">RegionCountSkewCostFunction</a>, which would try to bring all RegionServers to the same number of regions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg" alt="RegionCountSkewCostFunction balancing" class="wp-image-17010" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Two years ago, we developed some internal tools responsible for load balancing regions across RegionServers. The tooling worked really well for our use case, simplifying the day-to-day operation of our cluster. </p>



<p><strong>Open source is in the DNA of OVHcloud</strong>, which means that we build our tools on open-source software, but also that we <strong>contribute</strong> and give back to the community. When we talked to others, we saw that we weren&#8217;t the only ones concerned by the heterogeneous cluster problem. We decided to rewrite our tooling to make it more general, and to <strong>contribute</strong> it <strong>directly upstream</strong> to the HBase project.</p>



<h3 class="wp-block-heading">Our contributions</h3>



<p>The first contribution was pretty simple: the cost function list was a <a href="https://github.com/apache/hbase/blob/8cb531f207b9f9f51ab1509655ae59701b66ac37/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L199-L213" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">constant</a>. We <a href="https://github.com/apache/hbase/commit/836f26976e1ad8b35d778c563067ed0614c026e9" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">added the possibility to load custom cost functions</a>.</p>



<p>The second contribution was about <a href="https://github.com/apache/hbase/commit/42d535a57a75b58f585b48df9af9c966e6c7e46a" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">adding an optional costFunction to balance regions according to a capacity rule</a>.</p>



<h3 class="wp-block-heading">How does it work?</h3>



<p>The balancer will load a file containing lines of rules. <strong>A rule is composed of a regexp for the hostname, and a limit.</strong> For example, we could have:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>RegionServers whose <strong>hostnames match the first rule will have a limit of 200</strong>, and <strong>the others 50</strong>. If there&#8217;s no match, a default is applied.</p>
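<p>The matching logic can be sketched as follows. The parsing details and the default limit of 100 are our assumptions, not the exact HBase code; we also assume a regexp must match the whole hostname, so that <code>rs12</code> falls under the second rule rather than the first:</p>

```python
import re

RULES = """\
rs[0-9] 200
rs1[0-9] 50
"""
DEFAULT_LIMIT = 100  # assumed default when no rule matches

def parse_rules(text):
    """Parse 'regexp limit' lines into (compiled pattern, limit) pairs."""
    rules = []
    for line in text.strip().splitlines():
        pattern, limit = line.rsplit(" ", 1)
        rules.append((re.compile(pattern), int(limit)))
    return rules

def limit_for(hostname, rules, default=DEFAULT_LIMIT):
    for pattern, limit in rules:
        if pattern.fullmatch(hostname):  # whole-hostname match (our assumption)
            return limit
    return default

rules = parse_rules(RULES)
print(limit_for("rs3", rules), limit_for("rs12", rules), limit_for("db1", rules))
# → 200 50 100
```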



<p>Thanks to these rules, we have two key pieces of information:</p>



<ul class="wp-block-list"><li>the <strong>maximum number of regions for this cluster</strong></li><li>the <strong>rule for each server</strong></li></ul>



<p>The <code>HeterogeneousRegionCountCostFunction</code> will try to <strong>balance regions, according to their capacity.</strong></p>



<p>Let&#8217;s take an example&#8230; Imagine that we have 20 RS:</p>



<ul class="wp-block-list"><li>10 RS, named <code>rs0</code> to <code>rs9</code>, loaded with 60 regions each, which can each handle 200 regions.</li><li>10 RS, named <code>rs10</code> to <code>rs19</code>, loaded with 60 regions each, which can each handle 50 regions.</li></ul>



<p>So, based on the following rules:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>&#8230; we can see that the <strong>second group is overloaded</strong>, whereas the first group has plenty of space.</p>



<p>We know that we can handle a maximum of <strong>2,500 regions</strong> (200&#215;10 + 50&#215;10), and we currently have <strong>1,200 regions</strong> (60&#215;20). As such, the <code>HeterogeneousRegionCountCostFunction</code> will understand that the cluster is <strong>48.0% full</strong> (1200/2500). Based on this information, the balancer will then <strong>try to bring every RegionServer to ~48% of its own capacity, according to the rules.</strong></p>
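<p>The arithmetic behind this can be sketched as follows. The per-server targets shown are our way of illustrating what converging towards ~48% means for each group of servers, not the actual cost computation inside HBase:</p>

```java
public class Main {
    // Usage ratio of the cluster: total regions divided by total capacity.
    static double clusterUsage(int[] limits, int[] regionCounts) {
        int capacity = 0;
        int regions = 0;
        for (int limit : limits) capacity += limit;
        for (int count : regionCounts) regions += count;
        return (double) regions / capacity;
    }

    public static void main(String[] args) {
        // The example cluster: 10 servers capped at 200 regions,
        // 10 capped at 50, each currently holding 60 regions.
        int[] limits = new int[20];
        int[] counts = new int[20];
        for (int i = 0; i < 20; i++) {
            limits[i] = (i < 10) ? 200 : 50; // rs0-rs9 vs rs10-rs19
            counts[i] = 60;
        }
        double usage = clusterUsage(limits, counts);
        System.out.println(usage); // 1200 / 2500 = 0.48

        // Converging to ~48% means a per-server target of limit * usage:
        System.out.println(Math.round(limits[0] * usage));  // 96 regions for rs0-rs9
        System.out.println(Math.round(limits[10] * usage)); // 24 regions for rs10-rs19
    }
}
```

<p>Running this prints the 0.48 usage ratio, and per-server targets of 96 and 24 regions for the large and small servers respectively.</p>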



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg" alt="HeterogeneousRegionCountCostFunction balancing" class="wp-image-17084" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Where to next?</h3>



<p>Thanks to Apache HBase&#8217;s contributors, our patches are now <strong>merged</strong> into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This <strong>will allow more automation on our side, and ease operations for the Observability Team.</strong></p>



<p>Contributing was an awesome journey. What I love most about open source is the ability to contribute back, and build stronger software. We <strong>had an opinion</strong> about how a particular issue should be addressed, but <strong>the discussions with the community helped us to refine it</strong>. We spoke with <strong>engineers from other companies, who were struggling with Apache HBase&#8217;s cloud deployments, just as we were</strong>, and thanks to those exchanges, <strong>our contribution became more and more relevant.</strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontributing-to-apache-hbase-custom-data-balancing%2F&amp;action_name=Contributing%20to%20Apache%20HBase%3A%20custom%20data%20balancing&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL (or how to query time series databases)</title>
		<link>https://blog.ovhcloud.com/tsl-or-how-to-query-time-series-databases/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 31 Jan 2020 13:41:34 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16734</guid>

					<description><![CDATA[Last year, we released TSL as an open source tool to query a Warp 10 platform, and by extension, the OVHcloud Metrics Data Platform. But how has it evolved since then? Is TSL ready to query other time series databases? What about TSL states on the Warp10 eco-system? TSL to query many time series databases [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Last year, we released <a href="https://github.com/ovh/tsl/)" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a> as an <strong>open source tool</strong> to <strong>query</strong> a<strong> Warp 10</strong> platform, and by extension, the <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude"><strong>OVHcloud Metrics Data Platform</strong></a>. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg" alt="TSL by OVHcloud" class="wp-image-16774" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-768x405.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A.jpeg 1202w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how has it evolved since then? Is TSL ready to query <strong>other time series databases</strong>? And what is the state of TSL within the <strong>Warp 10 ecosystem</strong>?</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">TSL to query many time series databases</h3>



<p>We wanted TSL to be usable in front of <strong>multiple time series databases</strong>. That&#8217;s why we also released a <strong>PromQL query generator</strong>.</p>



<p>One year later, we now know this wasn&#8217;t the way to go. Based on what we learned, we open sourced the <strong><a href="https://github.com/aurrelhebert/TSL-Adaptor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor</a> project</strong> as a proof of concept for how TSL can be used to query a <em>non-Warp 10</em> database. Put simply, TSL-Adaptor allows TSL to <strong>query an InfluxDB</strong>.</p>



<h4 class="wp-block-heading">What is TSL-Adaptor?</h4>



<p>TSL-Adaptor is a <strong><a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> Java REST API</strong> that can be used to query a backend. TSL-Adaptor parses the TSL query, identifies the fetch operation, and natively loads raw data from the backend. It then computes the TSL operations on the data, before returning the result to the user. The main goal of TSL-Adaptor is <strong>to make TSL available</strong> on top of <strong>other TSDBs</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="971" height="702" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png" alt="" class="wp-image-16866" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png 971w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-300x217.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-768x555.png 768w" sizes="auto, (max-width: 971px) 100vw, 971px" /></figure></div>



<p>In concrete terms, we are running a Java REST API that <strong>integrates the WarpScript library</strong> in its runtime. TSL is then used to compile the query into a valid WarpScript one. This is <strong>fully transparent</strong>: users only ever deal with TSL queries.</p>



<p>To load raw data from InfluxDB, we created a WarpScript extension. This extension integrates an abstract class, <code>LOADSOURCERAW</code>, that needs to be implemented to create a TSL-Adaptor data source. This requires only two methods: <code>find</code> and <code>fetch</code>. <code>Find</code> gathers all series selectors matching a query (class names, tags or labels), while <code>fetch</code> actually retrieves the raw data within a time span.</p>
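<p>As an illustration, here is what a minimal <code>find</code>/<code>fetch</code> pair could look like over a toy in-memory store. The names and signatures below are ours, for illustration only, not the actual <code>LOADSOURCERAW</code> API:</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class Main {
    // A raw data point: a timestamp and a value.
    record Point(long ts, double value) {}

    // Toy in-memory store, keyed by series name.
    static final Map<String, List<Point>> STORE = new LinkedHashMap<>();

    static {
        STORE.put("disk.used", List.of(new Point(0, 1.0), new Point(10, 2.0)));
        STORE.put("cpu.idle", List.of(new Point(5, 0.5)));
    }

    // find: list all series whose name matches the selector regexp.
    static List<String> find(String selector) {
        Pattern p = Pattern.compile(selector);
        List<String> names = new ArrayList<>();
        for (String name : STORE.keySet()) {
            if (p.matcher(name).matches()) names.add(name);
        }
        return names;
    }

    // fetch: retrieve the raw points of one series within [start, end].
    static List<Point> fetch(String name, long start, long end) {
        List<Point> points = new ArrayList<>();
        for (Point pt : STORE.getOrDefault(name, List.of())) {
            if (pt.ts() >= start && pt.ts() <= end) points.add(pt);
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(find("disk.*"));           // only the disk series
        System.out.println(fetch("disk.used", 0, 5)); // points with ts in [0, 5]
    }
}
```

<p>Here <code>find</code> resolves the selector to matching series, and <code>fetch</code> keeps only the points whose timestamps fall within the requested time span — the same division of labour the extension expects.</p>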



<h4 class="wp-block-heading">Query Influx with TSL-Adaptor</h4>



<p>To get started, simply run an <a href="https://www.influxdata.com/products/influxdb-overview/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">InfluxDB</a> instance locally on port 8086. Then, start an influx <a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Telegraf</a> agent and record the Telegraf data on the local influx instance.</p>



<p>Next, make sure you have installed TSL-Adaptor locally and updated its config with the path to a <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>tsl.so</code> library</a>.</p>



<p>To specify a custom influx address or databases, update the <a href="https://github.com/aurrelhebert/TSL-Adaptor/blob/master/src/main/resources/application.properties" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor configuration</a> file accordingly.</p>



<p>You can start TSL-Adaptor with the following example command:</p>



<pre class="wp-block-code"><code class="">java -XX:TieredStopAtLevel=1 -Xverify:none -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar build/tsl-adaptor-0.0.1-SNAPSHOT-runner.jar </code></pre>



<p>And that&#8217;s it! You can now query your influx database with TSL and TSL-Adaptor.</p>



<p>Let&#8217;s start by retrieving the time series relating to disk measurements.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk")'</code></pre>



<p>Now let&#8217;s use TSL&#8217;s analytics power!</p>



<p>First, we would like to retrieve only the data containing a mode set to <code>rw</code>.&nbsp;</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw")'</code></pre>



<p>Next, we would like to retrieve the maximum value for every five-minute interval over the last 20 minutes. The TSL query will therefore be:</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw").last(20m).sampleBy(5m,max)'</code></pre>



<p>Now it&#8217;s your turn to have some fun with TSL and InfluxDB. You can find details of all the implemented functions in the <a href="https://github.com/ovh/tsl/blob/master/spec/doc.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL documentation</a>. Enjoy exploring!</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">What&#8217;s new on TSL with Warp10?</h3>



<p>We originally built TSL as a <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GO proxy</a> in front of Warp 10. <strong>TSL</strong> has now joined the Warp 10 ecosystem, as a <strong>Warp 10 extension</strong>, or as a <strong>WASM library</strong>! We have also added some <strong>new native TSL functions</strong> to make the language even richer!</p>



<h4 class="wp-block-heading">TSL as a WarpScript function</h4>



<p>To make TSL work as a Warp 10 function, you need to have the <code>tsl.so</code> library available locally. This library can be found in the <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL GitHub repository</a>. We have also made a <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL WarpScript extension</a> available from <a href="https://warpfleet.senx.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>, the extension repository of the Warp 10 community.</p>



<p>To set up the TSL extension on your Warp 10 instance, simply download the JAR indicated in <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>. You can then configure the extension in the Warp 10 configuration file:</p>



<pre class="wp-block-code"><code class="">warpscript.extensions = io.ovh.metrics.warp10.script.ext.tsl.TSLWarpScriptExtension
warpscript.tsl.libso.path = &lt;PATH_TO_THE_tsl.so_FILE></code></pre>



<p>Once you have rebooted Warp 10, you are ready to go. You can check that it&#8217;s working by running the following query:</p>



<pre class="wp-block-code"><code class="">// You will need to put here a valid Warp10 token when computing a TSL select statement
// '&lt;A_VALID_WARP_10_TOKEN>' 

// A valid TOKEN isn't needed on the create series statement in this example
// You can simply put an empty string
''

// Native TSL create series statement
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])) 
'>
TSL</code></pre>



<p>With the WarpScript TSL function, you can use native WarpScript variables in your script, as shown in the example below:</p>



<pre class="wp-block-code"><code class="">// Set a Warp10 variable

NOW 'test' STORE

'' 

// A Warp10 variable can be reused in TSL script as a native TSL variable
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])).add(test)
'>
TSL</code></pre>



<h4 class="wp-block-heading">TSL WASM</h4>



<p>To expand TSL&#8217;s potential uses, we have also exported it as a Wasm library, so you can use it directly in a browser! The Wasm version of the library parses TSL queries locally and generates the WarpScript. The result can then be used to query a Warp 10 backend. You will find more details on the <a href="https://github.com/ovh/tsl#use-tsl-with-webassembly" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL github</a>.</p>



<h4 class="wp-block-heading">TSL&#8217;s new features</h4>



<p>As TSL has grown in popularity, we have detected and fixed a few bugs, and also added some additional native functions to accommodate new use cases.</p>



<p>We added the <code>setLabelFromName</code> method, to set a new label on a series, based on its name. This label can be the exact series name, or the result of a regular expression.</p>



<p>We also completed the <code>sort</code> method, to allow users to sort their series set based on series metadata (i.e. selector, name or labels).</p>



<p>Finally, we added a <code>filterWithoutLabels</code> function, to filter a series set and remove any series that do not contain specific labels.</p>
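<p>To give an idea of the <code>setLabelFromName</code> behaviour described above — a label that is either the exact series name or the result of a regular expression — here is the extraction logic sketched in plain Java. This is an illustration of the described semantics, not TSL&#8217;s implementation:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Derive a label value from a series name: if the regexp matches and has a
    // capturing group, use the first group; otherwise fall back to the name.
    static String labelFromName(String seriesName, String regexp) {
        Matcher m = Pattern.compile(regexp).matcher(seriesName);
        if (m.find() && m.groupCount() >= 1) {
            return m.group(1);
        }
        return seriesName;
    }

    public static void main(String[] args) {
        // Extract a host label from a hypothetical name like "cpu.usage.host42"
        System.out.println(labelFromName("cpu.usage.host42", "\\.(host[0-9]+)$")); // host42
        // No capturing group: keep the exact series name as the label
        System.out.println(labelFromName("cpu.usage.host42", "cpu.*")); // cpu.usage.host42
    }
}
```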



<p>Thanks for reading! I hope you will give TSL a try, as I would love to hear your feedback.  </p>



<hr class="wp-block-separator"/>



<h2 class="wp-block-heading">Paris Time Series meetup</h2>



<p>We are delighted to soon be <strong>hosting</strong> the third <strong><a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series meetup</a></strong>, organised by Nicolas Steinmetz, at the <strong>OVHcloud office</strong> in Paris. During this meetup, we will be speaking about TSL, as well as listening to an introduction to the Redis Time Series platform.</p>



<p>If you are available, we will be happy to meet you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>IOT: Pushing data to OVHcloud metrics timeseries from Arduino</title>
		<link>https://blog.ovhcloud.com/iot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino/</link>
		
		<dc:creator><![CDATA[Cyrille Meichel]]></dc:creator>
		<pubDate>Thu, 24 Oct 2019 12:56:04 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Arduino]]></category>
		<category><![CDATA[IOT]]></category>
		<category><![CDATA[Metrics]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16087</guid>

					<description><![CDATA[Last spring, I built a wood oven in my garden. I&#8217;ve wanted to have one for years, and I finally decided to make it. To use it, I make a big fire inside for two hours, remove all the embers, and then it&#8217;s ready for cooking. The oven accumulates the heat during the fire and [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fiot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino%2F&amp;action_name=IOT%3A%20Pushing%20data%20to%20OVHcloud%20metrics%20timeseries%20from%20Arduino&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/IMG_0417-1024x537.jpg" alt="From Arduino to OVHcloud Metrics" class="wp-image-16156" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-1024x537.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-300x157.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417.jpg 1199w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Last spring, I built a wood oven in my garden. I&#8217;ve wanted to have one for years, and I finally decided to make it. To use it, I make a big fire inside for two hours, remove all the embers, and then it&#8217;s ready for cooking. The oven accumulates the heat during the fire and then releases it.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/IMG_20190820_162149.jpg" alt="" class="wp-image-16089" width="252" height="225" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149.jpg 800w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149-300x268.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149-768x686.jpg 768w" sizes="auto, (max-width: 252px) 100vw, 252px" /></figure></div>



<p>Once the embers are removed, I have to prioritise the dishes I want to cook as the temperature drops:</p>



<ul class="wp-block-list"><li>Pizza: 280°C</li><li>Bread: 250°C</li><li>Rice pudding: 180°C</li><li>Meringues: 100°C</li></ul>



<p>I built a first version of a thermometer with an Arduino, to be able to check the temperature. This thermometer, made of a thermocouple (i.e. a sensor that measures high temperatures), displays the inside temperature on a little LCD screen.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/thermocouple-arduino-1024x324.png" alt="" class="wp-image-16090" width="461" height="145" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-1024x324.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-300x95.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-768x243.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino.png 1159w" sizes="auto, (max-width: 461px) 100vw, 461px" /></figure></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/diagram-1024x598.png" alt="" class="wp-image-16092" width="338" height="197" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-1024x598.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-300x175.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-768x449.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram.png 1364w" sizes="auto, (max-width: 338px) 100vw, 338px" /></figure></div>



<p>The next step was to anticipate when to stuff dishes into the oven. Watching the temperature drop for hours was not a good idea. I needed the heat diagram of my oven! A heat diagram is just the chart of the temperature over a given period of time. But writing down the temperature on paper every ten minutes&#8230; wait&#8230; that would last more than 30 hours.</p>



<p><strong>Please, let me sleep!</strong></p>



<p>This needs some automation. Fortunately, OVHcloud has the solution: Metrics Data Platform: <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a></p>



<h2 class="wp-block-heading">The Hardware</h2>



<p>The aim of the project is to plug a sensor onto an Arduino that will send data to <strong>OVHcloud Metrics Data Platform</strong> (<a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a>) via the network. Basically, the Arduino will use the local wifi network to push temperature data to OVHcloud servers.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/buy-esp8266.png" alt="" class="wp-image-16095" width="239" height="86" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266.png 995w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266-300x109.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266-768x278.png 768w" sizes="auto, (max-width: 239px) 100vw, 239px" /></figure></div>



<p>Do you know ESP8266? It&#8217;s&nbsp;a low-cost (<strong>less than</strong> <strong>2€!</strong>) wifi microchip with full TCP/IP stack and microcontroller capability.</p>



<h3 class="wp-block-heading">ESP8266 functional diagram</h3>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/esp8266-1.png" alt="" class="wp-image-16096" width="754" height="364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1.png 933w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1-300x145.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1-768x370.png 768w" sizes="auto, (max-width: 754px) 100vw, 754px" /></figure>



<h3 class="wp-block-heading">Implementation: Wemos</h3>



<p>ESP8266 is not quite so easy to use on its own:</p>



<ul class="wp-block-list"><li>Must be powered at 3.3V (not too much, or it will burn)</li><li>No USB</li></ul>



<div class="wp-block-image"><figure class="alignright"><img loading="lazy" decoding="async" width="207" height="250" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/wemos.png" alt="" class="wp-image-16097"/></figure></div>



<p>That&#8217;s why it is better to use a solution that implements ESP8266 for us. Here is the Wemos!</p>



<ul class="wp-block-list"><li>Powered at 5V (6V is still ok)</li><li>USB for serial communication (for debugging)</li><li>Can be programmed via USB</li><li>Can be programmed with Arduino IDE</li><li>Costs less than 3€</li></ul>



<h2 class="wp-block-heading">Prepare your Arduino IDE</h2>



<h3 class="wp-block-heading">Install the integrated development environment</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/download-arduino.png" alt="" class="wp-image-16098" width="423" height="169" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino.png 917w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino-300x120.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino-768x307.png 768w" sizes="auto, (max-width: 423px) 100vw, 423px" /></figure></div>



<p>First of all you need to install Arduino IDE. It&#8217;s free, and available for any platform (Mac, Windows, Linux). Go to&nbsp;<a href="https://www.arduino.cc/en/main/software" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.arduino.cc/en/main/software</a>&nbsp;and download the version corresponding to your platform. At the time of writing, the current version is 1.8.10.</p>



<h3 class="wp-block-heading">Additional configuration for ESP8266</h3>



<p>When you install the Arduino IDE, it will only be capable of programming official Arduinos. Let&#8217;s add the firmware and libraries for ESP8266&#8230;&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step06.png" alt="" class="wp-image-16099" width="392" height="226" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06.png 999w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06-300x174.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06-768x445.png 768w" sizes="auto, (max-width: 392px) 100vw, 392px" /></figure></div>



<p>Start Arduino and open the &#8220;Preferences&#8221; window (<strong>File &gt; Preferences</strong>).</p>



<p>Enter&nbsp;<code><a href="https://arduino.esp8266.com/stable/package_esp8266com_index.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://arduino.esp8266.com/stable/package_esp8266com_index.json</a></code>&nbsp;into the&nbsp;&#8220;Additional Board Manager URLs&#8221;&nbsp;field. You can add multiple URLs, separating them with commas.</p>



<div style="height:115px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step07.png" alt="" class="wp-image-16100" width="393" height="220" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07.png 1002w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07-768x431.png 768w" sizes="auto, (max-width: 393px) 100vw, 393px" /></figure></div>



<p>Now open &#8220;Boards Manager&#8221; from the<strong>&nbsp;Tools &gt; Board</strong>&nbsp;menu and install&nbsp;the <em>esp8266</em>&nbsp;platform (don&#8217;t forget to select your ESP8266 board from the <strong>Tools &gt; Board</strong> menu after installation).</p>



<p>You are now ready!</p>



<h2 class="wp-block-heading">Order a Metrics Data Platform</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step00-1024x484.png" alt="" class="wp-image-16101" width="308" height="145" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-1024x484.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-300x142.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-768x363.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00.png 1321w" sizes="auto, (max-width: 308px) 100vw, 308px" /></figure></div>



<p></p>



<p>Go&nbsp;to the&nbsp;OVHcloud&nbsp;Metrics&nbsp;Data&nbsp;Platform&nbsp;website:&nbsp;<a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a>. Click&nbsp;on&nbsp;the&nbsp;free&nbsp;trial, and finalise your order. If you don&#8217;t have an account, just create one. With this trial you will have 12 metrics (i.e. 12 sets of records). In this example, we will only use one.</p>



<h3 class="wp-block-heading">Retrieve your token</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step03.png" alt="" class="wp-image-16102" width="127" height="129" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step03.png 353w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step03-296x300.png 296w" sizes="auto, (max-width: 127px) 100vw, 127px" /></figure></div>



<p>Go to the OVH Control Panel:&nbsp;<a href="https://www.ovh.com/manager/cloud/#/" data-wpel-link="exclude">https://www.ovh.com/manager/cloud/#/</a>. On the left-hand panel, you should have&nbsp;<strong>Metrics</strong>&nbsp;and a new service inside.</p>



<div style="height:65px" aria-hidden="true" class="wp-block-spacer"></div>



<p>In the&nbsp;&#8220;Tokens&#8221;&nbsp;tab, you can copy the<strong>&nbsp;write&nbsp;token</strong>.&nbsp;Keep it, as we will need it later.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step04-1024x295.png" alt="" class="wp-image-16103" width="633" height="182" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-1024x295.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-300x87.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-768x222.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04.png 1539w" sizes="auto, (max-width: 633px) 100vw, 633px" /></figure></div>



<p>Note that to configure Grafana, you will need the <strong>read token</strong>.</p>



<h3 class="wp-block-heading">Retrieve the host of the Metrics Data Platform</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step05-1024x439.png" alt="" class="wp-image-16104" width="339" height="146" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-1024x439.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-768x329.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05.png 1526w" sizes="auto, (max-width: 339px) 100vw, 339px" /></figure></div>



<p>The host of your Metrics Data Platform is given in your service description.&nbsp;In the &#8220;Platforms&#8221; tab, copy the <strong>opentsdb host</strong>. Keep it, as we will need it later.</p>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">Deeper into the program</h2>



<p>Now let&#8217;s have a look at an example. Here is some code that will push static data to OVHcloud Metrics Data Platform. You can use it with your sensor; you just have to code the sensor measurement. When running, the Wemos will:</p>



<ul class="wp-block-list"><li>Try to connect to your wifi network</li><li>If successful, push data to OVHcloud Metrics Data Platform</li></ul>



<p>The whole source code is available on my GitHub: <a href="https://github.com/landru29/ovh_metrics_wemos" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/landru29/ovh_metrics_wemos</a>.</p>



<p>There are six main files:</p>



<ul class="wp-block-list"><li><em>ovh_metrics_wemos.ino</em>: the main file</li><li><em>wifi.cpp</em>: class that implements the process to connect to wifi via WPS (Wi-Fi Protected Setup)</li><li><em>wifi.h</em>: header file for the wifi</li><li><em>metrics.cpp</em>: class that sends the metric data to OVHcloud Metrics Data Platform via HTTPS</li><li><em>metrics.h</em>: header file for metrics</li><li><em>config.h.sample</em>: model to create your configuration file (see below)</li></ul>



<h3 class="wp-block-heading">Create your configuration file</h3>



<p>If you try to compile the program, you will get errors, as some definitions are missing. We need to declare them in a file:&nbsp;<strong>config.h</strong>.</p>



<ol class="wp-block-list"><li>Copy&nbsp;<strong>config.h.sample</strong>&nbsp;into&nbsp;<strong>config.h</strong></li><li>Copy the write token you got in paragraph 5.1 (#define&nbsp;<strong>TOKEN</strong>&nbsp;&#8220;xxxxxx&#8221;)</li><li>Copy the host you got in paragraph 5.2 (#define&nbsp;<strong>HOST</strong>&nbsp;&#8220;xxxxxx&#8221;)</li></ol>



<h3 class="wp-block-heading">Get the fingerprint of the certificate</h3>



<p>As the Wemos will make its requests over HTTPS, we need the certificate fingerprint. You will need the host you just grabbed from the&nbsp;&#8220;Platforms&#8221;&nbsp;tab, and then:</p>



<h4 class="wp-block-heading">Linux users</h4>



<p>Just run this little script:</p>



<pre class="wp-block-code"><code class="">HOST=opentsdb.gra1.metrics.ovh.net; echo | openssl s_client -showcerts -servername ${HOST} -connect ${HOST}:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1 -inform pem | sed -e "s/.*=//g" | sed -e "s/\:/ /g"</code></pre>



<p>Copy the result in your&nbsp;<strong><code>config.h&nbsp;</code></strong><code>(#define&nbsp;<strong>FINGERPRINT</strong>&nbsp;"xx xx ..")</code>.</p>



<h4 class="wp-block-heading">MAC users</h4>



<p>Just run this little script:</p>



<pre class="wp-block-code"><code class="">HOST=opentsdb.gra1.metrics.ovh.net; echo | openssl s_client -showcerts -servername ${HOST} -connect ${HOST}:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1 -inform pem | sed -e "s/.*=//g" | sed -e "s/\:/ /g"</code></pre>



<p>Copy the result in your&nbsp;<strong><code>config.h&nbsp;</code></strong><code>(#define&nbsp;<strong>FINGERPRINT</strong>&nbsp;"xx xx ..")</code>.</p>



<h4 class="wp-block-heading">Windows users</h4>



<p>In your browser, go to <a href="https://opentsdb.gra1.metrics.ovh.net" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://opentsdb.gra1.metrics.ovh.net</a>. Click on the lock next to the URL to display the fingerprint of the certificate. Replace all &#8216;:&#8217; with one space.</p>
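<p>The same fingerprint can also be computed programmatically, whatever your OS. Here is a minimal Python sketch (the helper names are ours, not part of the project):</p>

```python
import hashlib
import ssl

def format_fingerprint(der_bytes: bytes) -> str:
    """SHA-1 digest as space-separated hex pairs, the format config.h expects."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return " ".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def cert_fingerprint(host: str, port: int = 443) -> str:
    """Fetch the server certificate and return its SHA-1 fingerprint."""
    pem = ssl.get_server_certificate((host, port))
    return format_fingerprint(ssl.PEM_cert_to_DER_cert(pem))

# cert_fingerprint("opentsdb.gra1.metrics.ovh.net")
```

<p>The returned string can be pasted directly into the <code>FINGERPRINT</code> define.</p>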



<h2 class="wp-block-heading">Compile the project and upload it to the Wemos</h2>



<ol class="wp-block-list"><li>Open the&nbsp;<code>.ino</code>&nbsp;file in the Arduino IDE (you should have six tabs in the project)</li><li>Plug the&nbsp;Wemos&nbsp;into your computer</li><li>Select the port from&nbsp;<strong>Tools &gt; Port</strong></li><li>On the top-left side, click on the arrow to upload the program</li><li>Once uploaded, you can open the serial monitor:&nbsp;<strong>Tools &gt; Serial Monitor</strong></li></ol>



<p>Right now, the program should fail, as the&nbsp;Wemos&nbsp;will not be able to connect to your wifi network.</p>



<h2 class="wp-block-heading">Run the program</h2>



<p>As we&#8217;ve already seen, the first run fails, because the Wemos cannot join your wifi network yet. You need to launch a WPS transaction on your internet modem: depending on the model, this is either a physical button on the modem, or a software action to trigger on its console (<a href="https://en.wikipedia.org/wiki/Wi-Fi_Protected_Setup" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://en.wikipedia.org/wiki/Wi-Fi_Protected_Setup</a>).</p>



<p>Once the WPS process is launched on the modem side, you have about 30 seconds to power up the Wemos.</p>



<ol class="wp-block-list"><li>Plug in your&nbsp;Wemos&nbsp;via USB =&gt; the program is running</li><li>Select the port from&nbsp;<strong>Tools &gt; Port</strong>&nbsp;(it may have changed)</li><li>Open the serial monitor:<strong>&nbsp;Tools &gt; Serial Monitor</strong></li></ol>



<p>Now you can follow the process.</p>



<h3 class="wp-block-heading">Wifi connection</h3>



<p>In the serial monitor (set the baud rate to 9600), you should get:</p>



<pre class="wp-block-preformatted"><code>Try to connect</code>
&nbsp;
<code>WPS config start</code>
<code>Trying to connect to &lt;your modem&gt; with saved config ...|SUCCESS</code>
<code>IP address:&nbsp;192.168.xx.xx</code></pre>



<p>If the wifi connection was successful, the serial console should display a local IP address (192.168.xx.xx); otherwise, the connection failed. Try again by triggering WPS on your modem and restarting the Wemos (unplug it and plug it back in).</p>



<h3 class="wp-block-heading">Sending data to OVHcloud Metrics Data Platform</h3>



<p>Now the Wemos POSTs a request to the OVHcloud server. The serial console shows you the JSON it will send:</p>



<pre class="wp-block-preformatted"><code>------------------------------------------------</code>
<code>POST opentsdb.gra1.metrics.ovh.net/api/put</code>
<code>[{"metric":&nbsp;"universe","value":42,"tags":{}}]</code>
<code>------------------------------------------------</code>
<code>beginResult:&nbsp;0</code>
<code>http:&nbsp;204</code>
<code>response: xxxx</code></pre>



<p>If&nbsp;<strong><code>beginResult</code></strong>&nbsp;is negative, connection to the OVHcloud server failed. It could mean that the&nbsp;<code>FINGERPRINT</code>&nbsp;is wrong.</p>



<p>If&nbsp;<strong><code>http</code></strong>&nbsp;is not&nbsp;<code>2xx</code>&nbsp;(it should be&nbsp;<code>204</code>), the server could not process your request. It may mean that the&nbsp;<code>TOKEN</code>&nbsp;is wrong.</p>
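<p>You can reproduce this request from any HTTP client to check your token and host independently of the Wemos. A minimal Python sketch (the helper name is ours; the authentication scheme is an assumption mirroring the Grafana data source settings, i.e. Basic auth with user <code>metrics</code> and the token as password):</p>

```python
import base64
import json

def build_put_request(host: str, token: str, metric: str, value, tags=None):
    """Build the same OpenTSDB /api/put call the Wemos makes.

    Returns (url, headers, body). Basic auth with user "metrics" and the
    write token as password is an assumption mirroring the Grafana setup.
    """
    body = json.dumps([{"metric": metric, "value": value, "tags": tags or {}}])
    auth = base64.b64encode(f"metrics:{token}".encode()).decode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth}",
    }
    return f"https://{host}/api/put", headers, body
```

<p>As with the Wemos, a <code>204</code> response means the datapoint was accepted.</p>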



<p>You got a 204? Great! It&#8217;s a success. Let&#8217;s check that on Grafana&#8230;</p>



<h2 class="wp-block-heading">Configure Grafana</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step08.png" alt="" class="wp-image-16107" width="280" height="168" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step08.png 747w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step08-300x180.png 300w" sizes="auto, (max-width: 280px) 100vw, 280px" /></figure></div>



<p>Go to OVHcloud Grafana:&nbsp;<a href="https://grafana.metrics.ovh.net/login" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://grafana.metrics.ovh.net/login</a>.&nbsp;Log in with your OVHcloud account.</p>



<h3 class="wp-block-heading">Configure a data source</h3>



<p>Click on &#8220;Add data source&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="93" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana00-1024x93.png" alt="" class="wp-image-16108" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-1024x93.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-300x27.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-768x70.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00.png 1885w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana01.png" alt="" class="wp-image-16109" width="273" height="224" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01.png 973w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01-300x247.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01-768x631.png 768w" sizes="auto, (max-width: 273px) 100vw, 273px" /></figure></div>



<ul class="wp-block-list"><li><em>Name</em>: choose one</li><li><em>Type</em>: OpenTSDB</li><li><em>URL</em>: https://&lt;host you got from your manager (see above)&gt;</li><li><em>Access</em>: direct</li><li>Check &#8220;Basic Auth&#8221;</li><li><em>User</em>: metrics</li><li><em>Password</em>: &lt;Read token from your manager (see above)&gt;</li></ul>



<p>Click on the &#8220;Add&#8221; button&#8230;</p>



<p>&#8230; and save it.</p>



<h3 class="wp-block-heading">Create your first chart</h3>



<p>Go back to&nbsp;<a href="https://grafana.metrics.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://grafana.metrics.ovh.net/</a>&nbsp;and click on &#8220;New Dashboard&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="84" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana03-1024x84.png" alt="" class="wp-image-16110" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-1024x84.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-300x24.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-768x63.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03.png 1886w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click on &#8220;Graph&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="84" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana04-1024x84.png" alt="" class="wp-image-16111" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-1024x84.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-300x25.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-768x63.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04.png 1894w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click on &#8220;Panel title&#8221;, then &#8220;Edit&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="126" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana05-1024x126.png" alt="" class="wp-image-16112" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-1024x126.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-300x37.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-768x95.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05.png 1887w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Select your metric in the &#8220;metric name&#8221; field. Grafana should suggest the name <em>universe</em> (the name specified in the Arduino program). If it doesn&#8217;t, the metrics were not correctly sent by the Wemos. Close the &#8220;edit&#8221; panel (click the cross on the right) and save your configuration (top-left of the window).</p>



<h2 class="wp-block-heading">Result analysis</h2>



<h3 class="wp-block-heading">Temperature rise</h3>



<p>The first result to analyse is the temperature rise. The sensor was lying on the bricks of the oven. The yellow chart is the oven temperature, and the green chart is the ambient temperature.</p>



<ol class="wp-block-list"><li>Between 11:05 and 11:10, there is a step at about 85°C. This seems to be the moisture in the oven drying out.</li><li>Then there&#8217;s a temperature drop, because I added some more wood to the oven (i.e. introduced cold stuff).</li><li>At about 11:20, the slope is shallower, and I have no idea why. Fire not strong enough? Moisture deeper in the bricks?</li></ol>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="418" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/four-1024x418.png" alt="" class="wp-image-16113" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-1024x418.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four.png 1871w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Temperature drop</h3>



<p>At this point, I moved all the embers at the back of the oven and put the sensor where the fire was burning. That&#8217;s why the chart begins at 400°C.</p>



<ol class="wp-block-list"><li>The temperature drop seems to follow something like <strong>F(t) = A/t</strong></li><li>At about 15:40, I changed the power supply from a phone charger plugged into 230V to a car battery with a voltage regulator (which turned out to be rather poor)</li><li>The ambient temperature is quite high between 15:00 and 17:00. It was a sunny day, so the sun was directly heating the circuit.</li></ol>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="417" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/four2-1024x417.png" alt="" class="wp-image-16114" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-1024x417.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2.png 1871w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
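<p>The <strong>F(t) = A/t</strong> guess can be checked against the recorded datapoints with a small least-squares fit. A sketch with synthetic values (with real data, export the series from Grafana first):</p>

```python
def fit_inverse_time(ts, ys):
    """Least-squares fit of y = A/t.

    Minimising sum((y_i - A/t_i)^2) over A gives the closed form
    A = sum(y_i/t_i) / sum(1/t_i^2).
    """
    num = sum(y / t for t, y in zip(ts, ys))
    den = sum(1.0 / (t * t) for t in ts)
    return num / den

# Synthetic cool-down sampled at 1, 2 and 4 hours: an exact y = 400/t curve
print(fit_inverse_time([1.0, 2.0, 4.0], [400.0, 200.0, 100.0]))  # 400.0
```

<p>If the fitted curve deviates a lot from the measurements, the cooling is probably not a pure A/t law.</p>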
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fiot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino%2F&amp;action_name=IOT%3A%20Pushing%20data%20to%20OVHcloud%20metrics%20timeseries%20from%20Arduino&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to monitor your Kubernetes Cluster with OVH Observability</title>
		<link>https://blog.ovhcloud.com/how-to-monitor-your-kubernetes-cluster-with-ovh-observability/</link>
		
		<dc:creator><![CDATA[Adrien Carreira]]></dc:creator>
		<pubDate>Fri, 08 Mar 2019 13:33:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14897</guid>

					<description><![CDATA[Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&#160;last week,&#160;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (here&#160;and&#160;here, to get you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-monitor-your-kubernetes-cluster-with-ovh-observability%2F&amp;action_name=How%20to%20monitor%20your%20Kubernetes%20Cluster%20with%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p">Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">last week,</a>&nbsp;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" data-wpel-link="exclude">here</a>&nbsp;and&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" data-wpel-link="exclude">here,</a> to get you started).</p>



<p>In the <a href="https://labs.ovh.com/machine-learning-platform" data-wpel-link="exclude">Prescience team</a>, we have used Kubernetes for more than a year now. Our cluster includes 40 nodes, running on top of PCI. We continuously run about 800 pods, and generate a lot of metrics as a result.</p>



<p>Today, we&#8217;ll look at how we handle these metrics to monitor our Kubernetes Cluster, and (equally importantly!) how to do this with your own cluster.</p>



<h3 class="graf graf--h3 wp-block-heading">OVH Metrics</h3>



<p class="graf graf--p">Like any infrastructure, you need to monitor your Kubernetes Cluster. You need to know exactly how your nodes, cluster and applications behave once they have been deployed in order to provide reliable services to your customers. To do this with our own Cluster, we use <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">OVH Observability</a>.</p>



<p class="graf graf--p">OVH Observability is backend-agnostic, so we can push metrics in one format and query them in another. It can handle:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">Graphite</li><li class="graf graf--li">InfluxDB</li><li class="graf graf--li">Metrics2.0</li><li class="graf graf--li">OpenTSDB</li><li class="graf graf--li">Prometheus</li><li class="graf graf--li">Warp10</li></ul>



<p class="graf graf--p">It also incorporates a managed <a class="markup--anchor markup--p-anchor" href="https://grafana.metrics.ovh.net" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.metrics.ovh.net" data-wpel-link="external">Grafana</a>, in order to display metrics and create monitoring dashboards.</p>



<h3 class="graf graf--h3 wp-block-heading">Monitoring Nodes</h3>



<p class="graf graf--p">The first thing to monitor is the health of nodes. Everything else starts from there.</p>



<p class="graf graf--p">In order to monitor your nodes, we will use <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/noderig" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/noderig" data-wpel-link="external">Noderig</a> and <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/beamium" data-wpel-link="external">Beamium</a>, as described <a href="/monitoring-guidelines-for-ovh-observability/" data-wpel-link="internal">here</a>. We will also use Kubernetes DaemonSets to start the process on all our nodes.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg" alt="" class="wp-image-15024" width="768" height="578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135.jpg 2039w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p class="graf graf--p">So let’s start creating a namespace&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create namespace metrics</code></pre>



<p class="graf graf--p">Next, create a secret containing the Metrics write token, which you can find in the OVH Control Panel.</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create secret generic w10-credentials --from-literal=METRICS_TOKEN=your-token -n metrics</code></pre>



<p class="graf graf--p">Copy the following <code class="markup--code markup--p-code">metrics.yml</code> into a file:</p>



<pre title="metrics.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># This will configure Beamium to scrape Noderig
# and push the metrics to Warp 10
# We also add the HOSTNAME to the labels of the pushed metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: beamium-config
  namespace: metrics
data:
  config.yaml: |
    scrapers:
      noderig:
        url: http://0.0.0.0:9100/metrics
        period: 30000
        format: sensision
        labels:
          app: noderig

    sinks:
      warp:
        url: https://warp10.gra1.metrics.ovh.net/api/v0/update
        token: $METRICS_TOKEN

    labels:
      host: $HOSTNAME

    parameters:
      log-file: /dev/stdout
---
# This is a custom collector that reports the uptime of the node
apiVersion: v1
kind: ConfigMap
metadata:
  name: noderig-collector
  namespace: metrics
data:
  uptime.sh: |
    #!/bin/sh
    echo 'os.uptime' `date +%s%N | cut -b1-10` `awk '{print $1}' /proc/uptime`
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: metrics-daemon
  namespace: metrics
spec:
  selector:
    matchLabels:
      name: metrics-daemon
  template:
    metadata:
      labels:
        name: metrics-daemon
    spec:
      terminationGracePeriodSeconds: 10
      hostNetwork: true
      volumes:
      - name: config
        configMap:
          name: beamium-config
      - name: noderig-collector
        configMap:
          name: noderig-collector
          defaultMode: 0777
      - name: beamium-persistence
        emptyDir: {}
      containers:
      - image: ovhcom/beamium:latest
        imagePullPolicy: Always
        name: beamium
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: TEMPLATE_CONFIG
          value: /config/config.yaml
        envFrom:
        - secretRef:
            name: w10-credentials
            optional: false
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi
        workingDir: /beamium
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /beamium
          name: beamium-persistence
      - image: ovhcom/noderig:latest
        imagePullPolicy: Always
        name: noderig
        args: ["-c", "/collectors", "--net", "3"]
        volumeMounts:
        - mountPath: /collectors/60/uptime.sh
          name: noderig-collector
          subPath: uptime.sh
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi</code></pre>
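<p class="graf graf--p">The custom <code class="markup--code markup--p-code">uptime.sh</code> collector simply prints one <code class="markup--code markup--p-code">name timestamp value</code> line, which Noderig picks up and exposes on port 9100. For reference, the same line can be produced in Python (a sketch; the field order is taken from the shell script above):</p>

```python
import time

def uptime_line(uptime_seconds, now=None):
    """Format an uptime sample the way uptime.sh does:
    '<metric-name> <unix-timestamp> <value>'."""
    ts = int(time.time() if now is None else now)
    return f"os.uptime {ts} {uptime_seconds}"

# uptime_line(123.45, now=1600000000) -> "os.uptime 1600000000 123.45"
```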



<p class="graf graf--p"><em class="markup--em markup--p-em">Don’t hesitate to change the collector levels if you need more information.</em></p>



<p>Then apply the configuration with kubectl&#8230;</p>



<pre class="wp-block-code console"><code class="">$ kubectl apply -f metrics.yml
# Then, just wait a minute for the pods to start
$ kubectl get all -n metrics
NAME                       READY   STATUS    RESTARTS   AGE
pod/metrics-daemon-2j6jh   2/2     Running   0          5m15s
pod/metrics-daemon-t6frh   2/2     Running   0          5m14s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/metrics-daemon  40        40        40      40           40          &lt;none&gt;          122d</code></pre>



<p class="graf graf--p">You can import our dashboard into your Grafana from <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9876" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9876" data-wpel-link="external">here</a>, and view some metrics about your nodes straight away.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1842" height="631" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png" alt="" class="wp-image-14899" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png 1842w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-768x263.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1024x351.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1200x411.png 1200w" sizes="auto, (max-width: 1842px) 100vw, 1842px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Kube Metrics</h3>



<p>As the OVH Kube is a managed service, you don&#8217;t need to monitor the apiserver, etcd, or controlplane. The OVH Kubernetes team takes care of this. So we will focus on <a href="https://github.com/google/cadvisor/blob/master/info/v1/container.go" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">cAdvisor</a> metrics and <a href="https://github.com/kubernetes/kube-state-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kube state metrics</a>.</p>



<p>The most mature solution for dynamically scraping metrics inside the Kube (for now) is <a href="https://github.com/prometheus/prometheus" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a>.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg" alt="Kube metrics" class="wp-image-15033" width="512" height="385" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p class="graf graf--p"><em class="markup--em markup--p-em">In the next Beamium release, we should be able to reproduce the features of the Prometheus scraper.</em></p>



<p class="graf graf--p">To install the Prometheus server, you need to install Helm on the cluster&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
    --clusterrole cluster-admin \
    --serviceaccount=kube-system:tiller
helm init --service-account tiller</code></pre>



<p class="graf graf--p">You then need to create the following two files:&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code> and <code class="markup--code markup--p-code">values.yml</code>.</p>



<pre title="prometheus.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># Based on https://github.com/prometheus/prometheus/blob/release-2.2/documentation/examples/prometheus-kubernetes.yml
serverFiles:
  prometheus.yml:
    remote_write:
    - url: "https://prometheus.gra1.metrics.ovh.net/remote_write"
      remote_timeout: 120s
      bearer_token: $TOKEN
      write_relabel_configs:
      # Filter metrics to keep
      - action: keep
        source_labels: [__name__]
        regex: "eagle.*|\
            kube_node_info.*|\
            kube_node_spec_taint.*|\
            container_start_time_seconds|\
            container_last_seen|\
            container_cpu_usage_seconds_total|\
            container_fs_io_time_seconds_total|\
            container_fs_write_seconds_total|\
            container_fs_usage_bytes|\
            container_fs_limit_bytes|\
            container_memory_working_set_bytes|\
            container_memory_rss|\
            container_memory_usage_bytes|\
            container_network_receive_bytes_total|\
            container_network_transmit_bytes_total|\
            machine_memory_bytes|\
            machine_cpu_cores"

    scrape_configs:
    # Scrape config for Kubelet cAdvisor.
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      
      relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        
      metric_relabel_configs:
      # Only keep systemd important services like docker|containerd|kubelet and kubepods,
      # We also want machine_cpu_cores that don't have id, so we need to add the name of the metric in order to be matched
      # The string will concat id with name and the separator is a ;
      # `/;container_cpu_usage_seconds_total` OK
      # `/system.slice;container_cpu_usage_seconds_total` OK
      # `/system.slice/minion.service;container_cpu_usage_seconds_total` NOK, Useless
      # `/kubepods/besteffort/e2514ad43202;container_cpu_usage_seconds_total` Best Effort POD OK
      # `/kubepods/burstable/e2514ad43202;container_cpu_usage_seconds_total` Burstable POD OK
      # `/kubepods/e2514ad43202;container_cpu_usage_seconds_total` Guaranteed POD OK
      # `/docker/pod104329ff;container_cpu_usage_seconds_total` OK, Container that run on docker but not managed by kube
      # `;machine_cpu_cores` OK, there is no id on these metrics, but we want to keep them also
      - source_labels: [id,__name__]
        regex: "^((/(system.slice(/(docker|containerd|kubelet).service)?|(kubepods|docker).*)?);.*|;(machine_cpu_cores|machine_memory_bytes))$"
        action: keep
      # Remove Useless parents keys like `/kubepods/burstable` or `/docker`
      - source_labels: [id]
        regex: "(/kubepods/burstable|/kubepods/besteffort|/kubepods|/docker)"
        action: drop
        # cAdvisor give metrics per container and sometimes it sum up per pod
        # As we already have the child, we will sum up ourselves, so we drop metrics for the POD and keep containers metrics
        # Metrics for the POD don't have container_name, so we drop if we have just the pod_name
      - source_labels: [container_name,pod_name]
        regex: ";(.+)"
        action: drop
    
    # Scrape config for service endpoints.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    # pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: host
      - action: labeldrop
        regex: (pod_template_generation|job|release|controller_revision_hash|workload_user_cattle_io_workloadselector|pod_template_hash)
</code></pre>



<pre title="values.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">alertmanager:
  enabled: false
pushgateway:
  enabled: false
nodeExporter:
  enabled: false
server:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: traefik
      ingress.kubernetes.io/auth-type: basic
      ingress.kubernetes.io/auth-secret: basic-auth
    hosts:
    - prometheus.domain.com
  image:
    tag: v2.7.1
  persistentVolume:
    enabled: false
</code></pre>



<p class="graf graf--p">Don’t forget to replace your token!</p>



<p>The Prometheus scraper is quite powerful. You can relabel your time series, keep a few that match your regex, etc. This config removes a lot of useless metrics, so don’t hesitate to tweak it if you want to see more cAdvisor metrics (for example).</p>
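<p>If you would rather go the other way and whitelist metrics explicitly, a sketch of an additional <code>keep</code> rule on metric names (the metric list below is illustrative, adapt it to what you need) could look like this:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Illustrative sketch: keep only an explicit whitelist of cAdvisor series,
# matched on the metric name itself
- source_labels: [__name__]
  regex: "(container_cpu_usage_seconds_total|container_memory_working_set_bytes)"
  action: keep</code></pre>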



<p class="graf graf--p">&nbsp;Install it with Helm&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">helm install stable/prometheus \
    --namespace=metrics \
    --name=metrics \
    --values=values/values.yaml \
    --values=values/prometheus.yaml</code></pre>



<p class="graf graf--p">Then add a basic-auth secret&#8230;</p>



<pre class="wp-block-code console"><code class="">$ htpasswd -c auth foo
New password: &lt;bar>
Re-type new password: &lt;bar>
Adding password for user foo
$ kubectl create secret generic basic-auth --from-file=auth -n metrics
secret "basic-auth" created</code></pre>



<p class="graf graf--p">You can access the Prometheus server interface through <code class="markup--code markup--p-code">prometheus.domain.com</code>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1876" height="809" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png" alt="" class="wp-image-14933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png 1876w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-768x331.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1024x442.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1200x517.png 1200w" sizes="auto, (max-width: 1876px) 100vw, 1876px" /></figure></div>



<p class="graf graf--p">You will see all the metrics for your Cluster, although only the ones you have filtered will be pushed to OVH Metrics.</p>



<p>The Prometheus interface is a good way to explore your metrics, as it&#8217;s quite straightforward to display and monitor your infrastructure. You can find our dashboard <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9880" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9880" data-wpel-link="external">here</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1843" height="653" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png" alt="" class="wp-image-14900" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png 1843w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1024x363.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1200x425.png 1200w" sizes="auto, (max-width: 1843px) 100vw, 1843px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Resources Metrics</h3>



<p class="graf graf--p">As @<a class="markup--user markup--p-user" href="https://medium.com/u/7dfbd8de8b55" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/u/7dfbd8de8b55" data-anchor-type="2" data-user-id="7dfbd8de8b55" data-action-value="7dfbd8de8b55" data-action="show-user-card" data-action-type="hover" data-wpel-link="external">Martin Schneppenheim</a> said in this <a class="markup--anchor markup--p-anchor" href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" data-wpel-link="external">post</a>, in order to correctly manage a Kubernetes Cluster, you also need to monitor pod resources.</p>



<p>We will install <a class="markup--anchor markup--p-anchor" href="https://github.com/google-cloud-tools/kube-eagle" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/google-cloud-tools/kube-eagle" data-wpel-link="external">Kube Eagle</a>, which collects and exposes metrics about CPU and RAM requests, limits and usage, so the Prometheus server you just installed can scrape them.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg" alt="Kube Eagle" class="wp-image-15035" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Create a file named <code class="markup--code markup--p-code">eagle.yml</code>.</p>



<pre title="eagle.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - pods
  verbs:
  - get
  - list
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
subjects:
- kind: ServiceAccount
  name: kube-eagle
  namespace: kube-eagle
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-eagle
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-eagle
  labels:
    app: kube-eagle
  name: kube-eagle
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-eagle
  name: kube-eagle
  labels:
    app: kube-eagle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
      labels:
        app: kube-eagle
    spec:
      serviceAccountName: kube-eagle
      containers:
      - name: kube-eagle
        image: "quay.io/google-cloud-tools/kube-eagle:1.0.0"
        imagePullPolicy: IfNotPresent
        env:
        - name: PORT
          value: "8080"
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
</code></pre>



<pre class="wp-block-code console"><code class="">$ kubectl create namespace kube-eagle
$ kubectl apply -f eagle.yml</code></pre>



<p class="graf graf--p">Next, import this <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9875/revisions" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9875/revisions" data-wpel-link="external">Grafana dashboard</a> (it’s the same dashboard as Kube Eagle, but ported to Warp10).</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1838" height="784" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png" alt="" class="wp-image-14901" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png 1838w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-768x328.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1024x437.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1200x512.png 1200w" sizes="auto, (max-width: 1838px) 100vw, 1838px" /></figure></div>



<p class="graf graf--p">You now have an easy way of monitoring your pod resources in the Cluster!</p>



<h3 class="graf graf--h3 wp-block-heading">Custom Metrics</h3>



<p>How does Prometheus know that it needs to scrape kube-eagle? If you look at the metadata in <code class="markup--code markup--p-code">eagle.yml</code>, you&#8217;ll see the following:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080" # The port where to find the metrics
  prometheus.io/path: "/metrics" # The path where to find the metrics</code></pre>



<p>These annotations will trigger the Prometheus auto-discovery process (described in <code class="markup--code markup--p-code">prometheus.yml</code> line 114).</p>



<p>This means you can easily add these annotations to pods or services that contain a Prometheus exporter, and then forward these metrics to OVH Observability. <a href="https://prometheus.io/docs/instrumenting/exporters/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">You can find a non-exhaustive list of Prometheus exporters here</a>.</p>
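<p>As a sketch (the service name and port are illustrative), annotating a service that exposes an exporter is all it takes; the <code>kubernetes-service-endpoints</code> job shown above will then discover and scrape it:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: my-app                      # illustrative name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"      # the port serving the metrics
    prometheus.io/path: "/metrics"  # optional, /metrics is the default
spec:
  selector:
    app: my-app
  ports:
  - port: 9090</code></pre>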



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg" alt="" class="wp-image-15027" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Volumetrics Analysis</h3>



<p>As you saw in the&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code>, we&#8217;ve tried to filter out a lot of useless metrics. For example, cAdvisor on a fresh cluster (with only three real production pods, plus the whole kube-system and Prometheus namespaces) produces about 2,600 metrics per node. With a smart cleaning approach, you can reduce this to 126 series.</p>



<p>Here&#8217;s a table to show the approximate number of metrics you will generate, based on the number of nodes&nbsp;<strong>(N)</strong> and the number of production pods <strong>(P) </strong>you have:</p>



<figure class="wp-block-table"><table><tbody><tr><td>&nbsp;</td><td><strong>Noderig</strong></td><td><strong>cAdvisor</strong></td><td><strong>Kube State</strong></td><td><strong>Eagle</strong></td><td><strong>Total</strong></td></tr><tr><td>nodes</td><td>N * 13<sup id="cite_ref-ned_1-3" class="reference">(1)</sup></td><td>N * 2<sup id="cite_ref-ned_1-3" class="reference">(2)</sup></td><td>N * 1<sup id="cite_ref-ned_1-3" class="reference">(3)</sup></td><td>N * 8<sup id="cite_ref-ned_1-3" class="reference">(4)</sup></td><td><strong>N * 24</strong></td></tr><tr><td>system.slice</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(5)</sup> * 16<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>0</td><td><strong>N * 80</strong></td></tr><tr><td>kube-system + kube-proxy + metrics</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>N * 160</strong></td></tr><tr><td>Production Pods</td><td>0</td><td>P * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>P * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>P * 32</strong></td></tr></tbody></table></figure>



<p>For example, if you run three nodes with 60 pods, you will generate 264 * 3 + 32 * 60 ~= 2,700 metrics.</p>
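<p>The table boils down to a simple linear formula; here is a small Python sketch (using the per-node and per-pod coefficients from the table above) to estimate your own volume:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Rough series-count estimate based on the coefficients in the table:
# each node costs N * (24 + 80 + 160) series, each production pod P * (26 + 6).
PER_NODE = 24 + 80 + 160  # Noderig + cAdvisor + Kube State + Eagle, per node
PER_POD = 26 + 6          # cAdvisor + Kube Eagle, per production pod

def estimate_series(nodes: int, pods: int) -> int:
    """Approximate number of series pushed to the platform."""
    return nodes * PER_NODE + pods * PER_POD

print(estimate_series(3, 60))  # 3 nodes, 60 pods -> 2712, i.e. ~2,700 series</code></pre>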



<p><em>NB: A pod has a unique name, so if you redeploy a deployment, you will create 32 new metrics each time.</em></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(1) Noderig metrics: <code class="markup--code markup--p-code">os.mem / os.cpu / os.disk.fs / os.load1 / os.net.dropped (in/out) / os.net.errs (in/out) / os.net.packets (in/out) / os.net.bytes (in/out)/ os.uptime</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(2) cAdvisor nodes metrics: <code class="markup--code markup--p-code">machine_memory_bytes / machine_cpu_cores</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(3) Kube state nodes metrics: <code class="markup--code markup--p-code">kube_node_info</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(4) Kube Eagle nodes metrics: <code class="markup--code markup--p-code">eagle_node_resource_allocatable_cpu_cores / eagle_node_resource_allocatable_memory_bytes / eagle_node_resource_limits_cpu_cores / eagle_node_resource_limits_memory_bytes / eagle_node_resource_requests_cpu_cores / eagle_node_resource_requests_memory_bytes / eagle_node_resource_usage_cpu_cores / eagle_node_resource_usage_memory_bytes</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(5) With our filters, we will monitor around five system.slices&nbsp;</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(6) Metrics are reported per container. A pod is a set of containers (a minimum of two: your container, plus the pause container for the network). So we can consider (2 * 10 + 6) = 26 metrics per pod: 10 metrics from cAdvisor per container, and six for the network (see below). For system.slice we will have 10 + 6 = 16, because it&#8217;s treated as a single container.</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(7) cAdvisor will provide these metrics for each container</sup><sup id="cite_ref-ned_1-3" class="reference">: </sup><sup id="cite_ref-ned_1-3" class="reference"><code class="markup--code markup--p-code">container_start_time_seconds / container_last_seen / container_cpu_usage_seconds_total / container_fs_io_time_seconds_total / container_fs_write_seconds_total / container_fs_usage_bytes / container_fs_limit_bytes / container_memory_working_set_bytes / container_memory_rss / container_memory_usage_bytes </code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(8) cAdvisor will provide these metrics for each interface: <code class="markup--code markup--p-code">container_network_receive_bytes_total * per interface / container_network_transmit_bytes_total * per interface</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(9) <code class="markup--code markup--p-code">kube-dns / beamium-noderig-metrics / kube-proxy / canal / metrics-server&nbsp;</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(10) Kube Eagle pods metrics: <code class="markup--code markup--p-code"> eagle_pod_container_resource_limits_cpu_cores /  eagle_pod_container_resource_limits_memory_bytes / eagle_pod_container_resource_requests_cpu_cores / eagle_pod_container_resource_requests_memory_bytes / eagle_pod_container_resource_usage_cpu_cores / eagle_pod_container_resource_usage_memory_bytes</code></sup></p>



<h3 class="graf graf--h3 wp-block-heading">Conclusion</h3>



<p class="graf graf--p">As you can see, monitoring your Kubernetes Cluster with OVH Observability is easy. You don&#8217;t need to worry about how and where to store your metrics, leaving you free to focus on leveraging your Kubernetes Cluster to handle your business workloads effectively, like we have in the Machine Learning Services Team.</p>



<p class="graf graf--p">The next step will be to add an alerting system, to notify you when your nodes are down (for example). For this, you can use the free&nbsp;<a class="markup--anchor markup--p-anchor" href="https://studio.metrics.ovh.net/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://studio.metrics.ovh.net/" data-wpel-link="external">OVH Alert Monitoring</a>&nbsp;tool.</p>



<h4 class="graf graf--h4 graf-after--p wp-block-heading" id="a936">Stay in&nbsp;touch</h4>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to&nbsp;<a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join the Observability Gitter</a>&nbsp;or <a href="https://gitter.im/ovh/kubernetes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes Gitter!</a><br>Follow us on Twitter: <a href="https://twitter.com/OVH" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">@OVH</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-monitor-your-kubernetes-cluster-with-ovh-observability%2F&amp;action_name=How%20to%20monitor%20your%20Kubernetes%20Cluster%20with%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Monitoring guidelines for OVH Observability</title>
		<link>https://blog.ovhcloud.com/monitoring-guidelines-for-ovh-observability/</link>
		
		<dc:creator><![CDATA[Kevin Georges]]></dc:creator>
		<pubDate>Thu, 07 Mar 2019 11:19:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14929</guid>

					<description><![CDATA[At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second. This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmonitoring-guidelines-for-ovh-observability%2F&amp;action_name=Monitoring%20guidelines%20for%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p graf-after--h3">At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second.</p>



<p class="graf graf--p graf-after--p">This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, memory, network, disk…) while application monitoring is based on the service and its scalability (requests, processing, business logic…).</p>



<p>We provide this service for internal teams, who enjoy the same experience as our customers. Basically, our Observability service is SaaS with a compatibility layer (supporting InfluxDB, OpenTSDB, Warp10, Prometheus, and Graphite) that allows it to integrate with most of the existing solutions out there. This way, a team that is used to a particular tool, or have already deployed a monitoring solution, won&#8217;t need to invest much time or effort when migrating to a fully managed and scalable service: they just pick a token, use the right endpoint, and they&#8217;re done. Besides, our compatibility layer offers a choice: you can push your data with OpenTSDB, then query it in either PromQL or WarpScript. Combining protocols in this way results in a unique open-source interoperability that delivers more value, with no restrictions created by a solution&#8217;s query capabilities.</p>



<h3 id="816f" class="graf graf--h3 graf-after--p wp-block-heading">Scollector, Snap, Telegraf, Graphite, Collectd…</h3>



<p class="graf graf--p graf-after--h3">Drawing on this experience, we collectively tried most of the collection tools, but we always arrived at the same conclusion: we were witnessing&nbsp;<strong class="markup--strong markup--p-strong">metrics bleeding.</strong> Each tool focused on scraping every reachable bit of data, which is great if you are a graph addict, but can be counterproductive from an operational point of view if you have to monitor thousands of hosts. While it&#8217;s possible to filter them, teams still need to understand the whole metrics set in order to know what needs to be filtered.</p>



<p class="graf graf--p graf-after--p">At OVH, we use laser-cut collections of metrics. Each host has a specific template (web server, database, automation…) that exports a set amount of metrics, which can be used for health diagnostics and monitoring application performance.</p>



<p>This finely-grained management leads to greater understanding for operational teams, since they know what&#8217;s available and can progressively add metrics to manage their own services.</p>



<h3 id="9619" class="graf graf--h3 graf-after--p wp-block-heading">Beamium &amp; Noderig — The Perfect&nbsp;Fit</h3>



<p class="graf graf--p graf-after--h3">Our requirements were rather simple:<br>—&nbsp;<strong class="markup--strong markup--p-strong">Scalable</strong>: Monitor one node in the same way as we&#8217;d monitor thousands<br>—&nbsp;<strong class="markup--strong markup--p-strong">Laser-cut</strong>: Only collect the metrics that are relevant<br>—&nbsp;<strong class="markup--strong markup--p-strong">Reliable</strong>: We want metrics to be available even in the worst conditions<br>—&nbsp;<strong class="markup--strong markup--p-strong">Simple</strong>: Multiple plug-and-play components, instead of intricate ones<br>—&nbsp;<strong class="markup--strong markup--p-strong">Efficient</strong>: We believe in impact-free metrics collection</p>



<h4 id="babf" class="graf graf--h4 graf-after--p wp-block-heading">The first solution was&nbsp;Beamium</h4>



<p class="graf graf--p graf-after--h4"><a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://github.com/runabove/beamium" data-wpel-link="external">Beamium</a>&nbsp;handles two aspects of the monitoring process: application data <strong>scraping</strong> and metrics <strong>forwarding</strong>.</p>



<p class="graf graf--p graf-after--p">Application data is collected in the well-known and widely-used&nbsp;<a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/exposition_formats/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/exposition_formats/" data-wpel-link="external"><strong>Prometheus format</strong></a><strong>.</strong> We chose Prometheus as the community was growing rapidly at the time, and many <a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/clientlibs/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/clientlibs/" data-wpel-link="external">instrumentation libraries</a> were available for it. There are two key concepts in Beamium: Sources and Sinks.</p>
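<p>For reference, a Prometheus endpoint simply serves plain text in the exposition format; an illustrative counter looks like this:</p>



<pre class="wp-block-code"><code class=""># HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027</code></pre>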



<p>The Sources,&nbsp;where Beamium will scrape data, are just Prometheus HTTP endpoints. This means it&#8217;s as simple as supplying the HTTP endpoint, and eventually adding a few parameters. This data will be routed to Sinks, which allows us to filter them during the routing process between a Source and a Sink. Sinks are Warp 10(R) endpoints, where we can push the data.</p>



<p class="graf graf--p graf-after--p">Once scraped, metrics are first stored on disk, before being routed to a Sink. The Disk Fail-Over (DFO) mechanism allows for recovery from network or remote failures. This way, we retain the Prometheus pull logic locally, but in a decentralised fashion, and reverse it into a push to feed the platform, which has many advantages:</p>



<ul class="wp-block-list"><li>support for a transactional logic over the metrics platform</li><li>recovery from network partitioning or platform unavailability</li><li>dual writes with data consistency (as there&#8217;s otherwise no guarantee that two Prometheus instances would scrape the same data at the same timestamp)</li></ul>
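<p>To make the Source/Sink idea concrete, a minimal Beamium configuration might look like the sketch below. The key names, URL and token here are purely indicative; refer to the Beamium README for the authoritative configuration format:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Indicative sketch of a Beamium configuration (check the README for exact keys)
scrapers:                # Sources: Prometheus HTTP endpoints to scrape
  my-app:
    url: http://127.0.0.1:9100/metrics
    period: 10000        # scrape every 10 seconds
sinks:                   # Sinks: Warp 10 endpoints to push the data to
  observability:
    url: https://warp10.example.net/api/v0/update
    token: YOUR_WRITE_TOKEN</code></pre>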



<p>We have many different customers, some of whom use the Time Series store behind the Observability product to manage their product consumption or transactional changes over licensing. These use cases can&#8217;t be handled with Prometheus instances, which are better suited to metrics-based monitoring.</p>



<div class="wp-block-image graf graf--figure graf-after--p"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg" alt="From Prometheus to OVH Observability with Beamium" class="wp-image-14996" width="512" height="442" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-300x259.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-768x662.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1200x1035.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7.jpeg 1488w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 id="cd7f" class="graf graf--h4 graf-after--figure wp-block-heading">The second was Noderig</h4>



<p class="graf graf--p graf-after--h4">During conversations with some of our customers, we came to the conclusion that the existing tools needed a certain level of expertise if they were to be used at scale. For example, a team with a 20k node cluster with Scollector would end up with more than 10 million metrics, just for the nodes&#8230; In fact, depending on the hardware configuration, Scollector would generate between 350 and 1,000 metrics from a single node.</p>



<p class="graf graf--p graf-after--h4">That&#8217;s the reason behind <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Noderig</a>. We wanted it to be as simple to use as the node-exporter from Prometheus, but with more finely-grained metrics production as the default.</p>



<p>Noderig collects OS metrics (CPU, memory, disk, and network) using a simple level semantic. This allows you to collect the right amount of metrics for any kind of host, which is particularly suitable for containerized environments.</p>
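<p>As an indicative sketch of this level semantic (see the Noderig README for the exact format and level values), each collector is assigned a verbosity level in a small config file, so a containerised host might only need:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Indicative Noderig config sketch: one verbosity level per collector
cpu: 1    # aggregated CPU usage only
mem: 1
load: 1
disk: 1
net: 1</code></pre>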



<p class="graf graf--p graf-after--p">We made it compatible with Scollector&#8217;s custom collectors to ease the migration process, and allow for extensibility. External collectors are simple executables that act as providers for data that is collected by Noderig, as with any other metrics.</p>



<p class="graf graf--p graf-after--p">The collected metrics are available through a simple rest endpoint, allowing you to see your metrics in real-time, and easily integrate them with Beamium.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg" alt="Noderig and Beamium" class="wp-image-14998" width="512" height="364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-300x213.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-768x546.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1200x853.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF.jpeg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">Does it work?</h3>



<p class="graf graf--p graf-after--h3">Beamium and Noderig are extensively used at OVH, and support the monitoring of very large infrastructures. At the time of writing, we collect and store hundreds of millions of metrics using these tools. So they certainly seem to work!</p>



<p class="graf graf--p graf-after--h3">In fact, we&#8217;re currently working on the 2.0 release, which will be a rework, incorporating autodiscovery and hot reload.</p>



<h3 id="a936" class="graf graf--h4 graf-after--p wp-block-heading">Stay in&nbsp;touch</h3>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join our Gitter</a>!<br>Follow us on Twitter: @OVH</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmonitoring-guidelines-for-ovh-observability%2F&amp;action_name=Monitoring%20guidelines%20for%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL: a developer-friendly Time Series query language for all our metrics</title>
		<link>https://blog.ovhcloud.com/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 13 Feb 2019 13:11:28 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14445</guid>

					<description><![CDATA[At the Metrics team we have been working on time series for several years. From our experience the data analytics capabilities of a Time Series Database (TSDB) platform is a key factor to create value from your metrics. And these analytics capabilities are mostly defined by the query languages they support. 

TSL stands for Time Series Language. In a few words, TSL is an abstracted way, in the form of an HTTP proxy, to generate queries for different TSDB backends. Currently it supports Warp 10's WarpScript and Prometheus' PromQL query languages, but we aim to extend the support to other major TSDBs.


To better understand why we created TSL, we review some of the TSDB query languages supported on the OVH Metrics Data Platform. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to simplify the querying on our platform, before open-sourcing it for use with any TSDB solution. 

Why did we decide to invest some of our time in such a proxy? Let me tell you the story of the OVH Metrics protocol!<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-a-developer-friendly-time-series-query-language-for-all-our-metrics%2F&amp;action_name=TSL%3A%20a%20developer-friendly%20Time%20Series%20query%20language%20for%20all%20our%20metrics&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>At the Metrics team, we have been working on Time Series for several years now. In our experience, the <strong>data analytics capabilities</strong> of a Time Series Database (TSDB) platform are a key factor in <strong>creating value</strong> from your metrics. These analytics capabilities are mostly defined by the <strong>query languages</strong> they support. </p>



<p>TSL stands for <strong>Time Series Language</strong>. In simple terms, <a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is an abstracted way of generating queries for <strong>different TSDB backends</strong>, in the form of an HTTP proxy. It currently supports Warp 10&#8217;s WarpScript and  Prometheus&#8217; PromQL query languages, but we aim to extend the support to other major TSDBs.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><em>To provide some context around why we created TSL, it began with a review of some of the <strong>TSDB query languages</strong> supported on the&nbsp;<strong>OVH Metrics Data Platform</strong>. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to <strong>simplify the querying</strong> on our platform, before open-sourcing it to use it on any TSDB solution.&nbsp;</em></p>



<p><em>So why did we decide to invest our time in developing such a proxy? Well, let me tell you <strong>the story of the OVH Metrics protocol</strong>!</em></p>



<h3 class="wp-block-heading">From OpenTSDB&#8230;</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="80" src="/blog/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png" alt="OpenTSDB" class="wp-image-14507" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-768x205.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2.png 810w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>The first aim of our platform is to be able to support the OVH infrastructure and application monitoring. When this project started, a lot of people were using <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a>, and were familiar with its query syntax. <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a> is a scalable database for Time Series. The <a href="http://opentsdb.net/docs/build/html/user_guide/query/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenTSDB query</a> syntax is easy to read, as you send a JSON document describing the request. The document below will load all <code>sys.cpu.0</code> metrics of the <code>test</code> datacentre, summing them between the <code>start</code> and <code>end</code> dates:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1356998400,
    "end": 1356998460,
    "queries": [
        {
            "aggregator": "sum",
            "metric": "sys.cpu.0",
            "tags": {
                "host": "*",
                "dc": "test"
            }
        }
    ]
}</code></pre>



<p>This enables the quick retrieval of specific data, in a specific time range. At OVH, this was used for graphing purposes, in conjunction with Grafana, and helped us to spot potential issues in real time, as well as investigate past events. OpenTSDB integrates simple queries, where you can define your own sampling and deal with counter data, as well as filtered and aggregated raw data.</p>



<p>OpenTSDB was the first protocol supported by the Metrics team, and is still widely used today. Internal statistics show that 30-40% of our traffic is based on OpenTSDB queries. A lot of internal use cases can still be entirely resolved with this protocol, and the queries are easy to write and understand.</p>



<p>For example, a query with OpenTSDB to get the max value of <code>usage_system</code> for the CPUs 0 to 9, sampled into 2-minute spans by averaging their values, looks like this:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1535797890,
    "end": 1535818770,
    "queries":  [{
        "metric":"cpu.usage_system",
        "aggregator":"max",
        "downsample":"2m-avg",
        "filters": [{
            "type":"regexp",
            "tagk":"cpu",
            "filter":"cpu[0-9]+",
            "groupBy":false
            }]
        }]
}</code></pre>



<p>However, OpenTSDB quickly shows its limitations, and some specific use cases can&#8217;t be resolved with it. For example, you can&#8217;t apply any operations directly on the back-end. You have to load the data on an external tool and use it to apply any analytics.</p>



<p>One of the main areas where OpenTSDB (version 2.3) is lacking is operators over multiple Time Series sets, which allow operations such as dividing one series by another. Such operators are a useful way to compute the individual time per request, when you have (for example) a series of the total time spent serving requests and a series of the total request count. That’s one of the reasons why the OVH Metrics Data Platform supports other protocols.</p>
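<p><em>As an illustration of the missing operator, here is a minimal Python sketch (not OpenTSDB code) of dividing one series by another to derive a mean time per request; joining only on timestamps present in both series is a simplifying assumption.</em></p>

```python
def divide_series(numerator, denominator):
    """Point-by-point division of two aligned time series.

    Each series is a dict mapping timestamp -> value; only timestamps
    present in both series (with a non-zero denominator) produce a result.
    """
    return {ts: numerator[ts] / denominator[ts]
            for ts in sorted(numerator.keys() & denominator.keys())
            if denominator[ts] != 0}

# Total time spent serving requests (ms) and request counts, per minute
total_time = {60: 1200.0, 120: 2400.0, 180: 900.0}
request_count = {60: 40, 120: 60, 180: 30}

# Mean time per request for each minute
mean_latency = divide_series(total_time, request_count)
```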



<h2 class="wp-block-heading">&#8230; to PromQL</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="150" height="150" src="/blog/wp-content/uploads/2019/02/790C8FFD-E734-475E-BF7E-B93F0708C604-150x150.png" alt="Prometheus" class="wp-image-14503"/></figure></div>



<p>The second protocol we worked on was <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL</a>, the query language of the <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> Time Series database. When we made that choice in 2015, <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> was gaining some traction, and it still has an impressive adoption rate. But if Prometheus is a success, it isn&#8217;t because of its query language, PromQL. This language never took off internally, although it has started to gain some adoption recently, mainly due to the arrival of people who worked with Prometheus at their previous companies. Internally,&nbsp;<strong>PromQL queries represent about 1-2%</strong> of our daily traffic. The main reasons are that a lot of simple use cases can be solved quickly, and with more control of the raw data, with OpenTSDB queries, while many more complex use cases cannot be solved with PromQL. A similar request to the one defined in OpenTSDB would be:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">api/v1/query_range?
query=max(cpu.usage_system{cpu=~"cpu[0-9]%2B"})&amp;
start=1535797890&amp;
end=1535818770&amp;
step=2m</code></pre>



<p>With PromQL, <strong>you lose control of how you sample the data</strong>, as the only operator is <strong>last</strong>. This means that if (for example) you want to downsample your series with a 5-minute duration, you are only able to keep the last value of each 5-minute series span. In contrast, all competitors include a range of operators. For example, with OpenTSDB, you can choose between several operators, including average, count, standard deviation, first, last, percentiles, minimum, maximum, or the sum of all values inside your defined span.</p>
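<p><em>The difference can be sketched in a few lines of Python (a toy model, not Prometheus internals): the bucketing step is the same, but PromQL fixes the aggregator to &#8220;last&#8221;, whereas OpenTSDB lets the caller choose it.</em></p>

```python
from statistics import mean

def downsample(points, span, aggregator):
    """Group (timestamp, value) points into fixed spans, then aggregate each bucket."""
    buckets = {}
    for ts, value in points:  # points are assumed sorted by timestamp
        buckets.setdefault(ts - ts % span, []).append(value)
    return {start: aggregator(vals) for start, vals in sorted(buckets.items())}

points = [(0, 1.0), (60, 5.0), (310, 2.0), (420, 8.0)]

# PromQL-style: only the last value of each 5-minute (300 s) span survives
last_only = downsample(points, 300, lambda vals: vals[-1])

# OpenTSDB-style: any aggregator, e.g. the average of each span
averaged = downsample(points, 300, mean)
```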



<p>In the end, a lot of people choose to use a much more complex method: WarpScript, which is powered by the Warp 10 Analytics Engine we use behind the scenes.</p>



<h2 class="wp-block-heading">Our internal adoption of WarpScript</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="106" src="/blog/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png" alt="WarpScript by SenX" class="wp-image-14505" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9.png 837w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/content/03_Documentation/04_WarpScript/01_Concepts" target="_blank" data-wpel-link="external">WarpScript</a> is the current Time Series language of <a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/" target="_blank" data-wpel-link="external">Warp 10(R)</a>, our underlying backend. WarpScript will help for any complex Time Series use case, and solves numerous real-world problems, as you have full control of all your operations. You have dedicated frameworks of functions to sample raw data and fill missing values. You also have frameworks to apply operations on single-value or window operations. You can apply operations on multiple Time Series sets, and have dedicated functions to manipulate Time Series times, statistics, etc.</p>



<p>It works with a Reverse Polish Notation (like a good, old-fashioned HP48, for those who&#8217;ve got one!), and simple use cases can be easy to express. But when it comes to analytics, while it certainly solves problems, it’s still complex to learn. In particular, Time Series use cases are complex and require a particular way of thinking, and WarpScript helped to solve a lot of hard ones.</p>



<p>This is why it&#8217;s still the main query language used at OVH on the OVH Metrics platform, with <strong>nearly 60%</strong> of internal queries making use of it. The same request that we just computed in OpenTSDB and PromQL would be as follows in WarpScript:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">[ "token" "cpu.average" { "cpu" "~cpu[0–9]+" } NOW 2 h ] FETCH
[ SWAP bucketizer.mean 0 2 m 0 ] BUCKETIZE
[ SWAP [ "host" ] reducer.max ] REDUCE</code></pre>



<p>A lot of users find it hard to learn WarpScript at first, but after solving their initial issues with some (sometimes a lot of) support, it becomes&nbsp;the first step of their Time Series adventure. Later, they figure out some new ideas about how they can gain knowledge from their metrics. They then come back with many demands and questions about their daily issues, some of which can be solved quickly, with their own knowledge and experience.</p>



<p>What we learned from WarpScript is that it’s a fantastic tool with which to build analytics for our Metrics data. We pushed many complex use cases with advanced signal-processing algorithms like <a href="https://www.warp10.io/doc/LTTB#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LTTB</a>, <a href="https://www.warp10.io/tags/outlier" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Outliers</a> or <a href="https://www.warp10.io/doc/PATTERNDETECTION#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Patterns</a> detections, and <a href="https://en.wikipedia.org/wiki/Kernel_smoother" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kernel Smoothing,</a> where it proved to be a real enabler. However, it proved quite expensive to support for basic requirements, and feedback indicated the syntax and overall complexity were big concerns.</p>



<p>A WarpScript can involve dozens (or even hundreds) of lines, and a successful execution is often an accomplishment, with the special feeling that comes from having made full use of one&#8217;s brainpower. In fact, an inside joke amongst our team is that being able to write a WarpScript in a single day earns you a WarpScript Pro Gamer badge! That&#8217;s why we&#8217;ve distributed Metrics t-shirts to users who have achieved significant successes with the Metrics Data Platform.</p>



<p>We liked the WarpScript semantic, but we wanted it to have a significant impact on a broader range of use cases. This is why we started to write TSL, with a few simple goals:</p>



<ul class="wp-block-list"><li>Offer a clear Time Series analytics semantic</li><li>Simplify the writing, making it developer-friendly</li><li>Support data flow queries and ease debugging for complex queries</li><li>Don&#8217;t try to be the ultimate toolbox. Keep it simple.</li></ul>



<p>We know that users will probably have to switch back to WarpScript every so often. However, we hope that using TSL will simplify their learning curve. TSL is simply a new step in the Time Series adventure!</p>



<h3 class="wp-block-heading">The path to&nbsp;TSL</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is the result of three years of Time Series analytics support, and offers a functional Time Series Language. The aim of TSL is to build a Time Series data flow as code. </p>



<p>With TSL, native methods, such as <code>select</code> and <code>where</code>, exist to choose which metrics to work on. Then, as Time Series data is time-related, we have to use a time selector method on the selected metadata. The two available methods are <code>from</code> and <code>last</code>. The vast majority of the other TSL methods take Time Series sets as input and provide Time Series sets as the result. For example, you have methods that only select values above a specific threshold, compute rates, and so on. We have also included specific operations to apply to multiple subsets of Time Series sets, such as additions or multiplications.</p>



<p>In addition, to make the language more readable, you can define variables to store Time Series queries and reuse them in your script any time you wish. For now, we support only a few native types, such as <code>Numbers</code>, <code>Strings</code>, <code>Time durations</code>, <code>Lists</code>, and <code>Time Series</code> (of course!).</p>



<p>Finally, the same query used throughout this article will be as follows in TSL:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">select("cpu.usage_system")
.where("cpu~cpu[0-9]+")
.last(12h)
.sampleBy(2m,mean)
.groupBy(max)</code></pre>



<p>You can also write more complex queries. For example, we condensed our WarpScript hands-on, designed to detect <a href="https://helloexoworld.github.io/hew-hands-on/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">exoplanets from NASA raw data</a>, into a single TSL query:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">sample = select('sap.flux')
 .where('KEPLERID=6541920')
 .from("2009-05-02T00:56:10.000000Z", to="2013-05-11T12:02:06.000000Z")
 .timesplit(6h,100,"record")
 .filterByLabels('record~[2-5]')
 .sampleBy(2h, min, false, "none")

trend = sample.window(mean, 5, 5)

sub(sample,trend)
 .on('KEPLERID','record')
 .lessThan(-20.0)</code></pre>



<p>So what did we do here? First we instantiated a <strong>sample</strong> variable, in which we loaded the ‘sap.flux’ raw data of one star, KEPLERID 6541920. We then cleaned the series using the timesplit function (to split the star series wherever there is a hole in the data longer than 6h), keeping only four records. Finally, we sampled the result, keeping the minimal value of each 2-hour bucket.</p>



<p>We then used this result to compute the series trend, using a moving average of 10 hours.</p>



<p>To conclude, the query returns only the points below -20.0 in the series obtained by subtracting the trend from the sample series.</p>
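<p><em>The same three steps (sample, trend, residual threshold) can be pictured with a short Python sketch; the toy flux values and the centered moving-average window are illustrative assumptions, not the actual Kepler data or the exact TSL semantics.</em></p>

```python
def moving_average(values, before, after):
    """Centered moving average: `before` points back, `after` points forward."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        out.append(sum(window) / len(window))
    return out

def dips_below(sample, trend, threshold):
    """Indices where (sample - trend) drops below the threshold."""
    return [i for i, (s, t) in enumerate(zip(sample, trend))
            if s - t < threshold]

# Toy flux series with one sharp dip, mimicking a transit-like event
flux = [100.0] * 10
flux[5] = 60.0  # the dip

trend = moving_average(flux, 5, 5)
transits = dips_below(flux, trend, -20.0)  # indices of candidate transits
```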



<h3 class="wp-block-heading">TSL is Open Source</h3>



<p>Even if our first community of users was mostly inside OVH, we&#8217;re pretty confident that TSL can be used to solve a lot of Time Series use cases.</p>



<p>We are currently beta testing TSL on our OVH Metrics public platform. Furthermore, TSL is <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">open-sourced on Github</a>, so you can also test it on your own platforms.</p>



<p>We would love to get your feedback or comments on TSL, or Time Series in general. We&#8217;re available on the <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVH Metrics gitter,</a> and you can find out more about TSL &nbsp;in <a href="https://labs.ovh.com/metrics-beta-features/" data-wpel-link="exclude">our Beta features documentation</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-a-developer-friendly-time-series-query-language-for-all-our-metrics%2F&amp;action_name=TSL%3A%20a%20developer-friendly%20Time%20Series%20query%20language%20for%20all%20our%20metrics&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Handling OVH&#8217;s alerts with Apache Flink</title>
		<link>https://blog.ovhcloud.com/handling-ovhs-alerts-with-apache-flink/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Thu, 31 Jan 2019 09:01:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Omni]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14337</guid>

					<description><![CDATA[OVH relies extensively on metrics to effectively monitor its entire stack. Whenever they are low-level or business centric, they allow teams to gain insight into how our services are operating on a daily basis. The need to store millions of datapoints per second has produced the need to create a dedicated team to build a operate a product to handle that load: Metrics Data Platform. By relying on Apache Hbase, Apache Kafka and Warp 10, we succeeded in creating a fully distributed platform that is handling all our metrics... and yours!

After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: alerting. 
<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhandling-ovhs-alerts-with-apache-flink%2F&amp;action_name=Handling%20OVH%26%238217%3Bs%20alerts%20with%20Apache%20Flink&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>OVH relies extensively on <strong>metrics</strong> to effectively monitor its entire stack. Whether they are <strong>low-level</strong> or <strong>business</strong> centric, they allow teams to gain <strong>insight</strong> into how our services are operating on a daily basis. The need to store <strong>millions of datapoints per second</strong> has produced the need to create a dedicated team to build and operate a product to handle that load: <strong><a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a>.</strong> By relying on <strong><a href="https://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hbase</a>, <a href="https://kafka.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Kafka</a></strong> and <a href="https://www.warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Warp 10</strong></a>, we succeeded in creating a fully distributed platform that is handling all our metrics&#8230; and yours! </p>



<p>After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: <strong>alerting</strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="885" height="290" src="https://www.ovh.com/blog/wp-content/uploads/2019/01/001-1.png" alt="OVH &amp; Apache Flink" class="wp-image-14367" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-300x98.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-768x252.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure>



<h3 class="wp-block-heading" id="6c01">Meet OMNI, our alerting&nbsp;layer</h3>



<p>OMNI is our code name for a&nbsp;<strong>fully distributed</strong>,&nbsp;<strong>as-code</strong>,&nbsp;<strong>alerting</strong>&nbsp;system that we developed on top of Metrics. It is split into two components:</p>



<ul class="wp-block-list"><li><strong>The management part</strong>, which takes your alert definitions from a Git repository and represents them as continuous queries,</li><li><strong>The query executor</strong>, which schedules your queries in a distributed way.</li></ul>



<p>The query executor pushes the query results into Kafka, ready to be handled! We now need to perform all the tasks that an alerting system does:</p>



<ul class="wp-block-list"><li>Handling alert&nbsp;<strong>deduplication</strong>&nbsp;and&nbsp;<strong>grouping</strong>, to avoid&nbsp;<a href="https://en.wikipedia.org/wiki/Alarm_fatigue" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">alert fatigue</a>.</li><li>Handling&nbsp;<strong>escalation</strong>&nbsp;steps,&nbsp;<strong>acknowledgement</strong>&nbsp;or&nbsp;<strong>snooze</strong>.</li><li><strong>Notifying</strong>&nbsp;the end user through different&nbsp;<strong>channels</strong>: SMS, mail, push notifications,&nbsp;…</li></ul>
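<p><em>As a toy illustration of the deduplication/grouping step (not Beacon&#8217;s actual code), alerts can be keyed by a few labels so that related alerts collapse into a single notification; the label names below are invented.</em></p>

```python
from collections import OrderedDict

def group_alerts(alerts, group_labels):
    """Group raw alerts by the chosen labels; each group yields one notification."""
    groups = OrderedDict()
    for alert in alerts:
        key = tuple(alert.get(label) for label in group_labels)
        groups.setdefault(key, []).append(alert)
    return groups

alerts = [
    {"name": "high_cpu", "host": "web-1", "team": "metrics"},
    {"name": "high_cpu", "host": "web-2", "team": "metrics"},
    {"name": "disk_full", "host": "db-1", "team": "storage"},
]

# Grouping by (name, team): the two high_cpu alerts collapse into one notification
notifications = group_alerts(alerts, ("name", "team"))
```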



<p>To handle that, we looked at open-source projects, such as&nbsp;<a href="https://github.com/prometheus/alertmanager" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus AlertManager</a>&nbsp;and <a href="https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">LinkedIn Iris</a>, and discovered the&nbsp;<em>hidden</em>&nbsp;truth:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Handling alerts as streams of data,<br>moving from one operator to&nbsp;another.</p></blockquote>



<p>We embraced it, and decided to leverage <a href="https://flink.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Flink</a> to create&nbsp;<strong>Beacon</strong>. In the next section we are going to describe the architecture of Beacon, and how we built and operate it.</p>



<p>If you want some more information on Apache Flink, we suggest reading the introduction article on the official website:&nbsp;<a href="https://flink.apache.org/flink-architecture.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">What is Apache Flink?</a></p>



<h3 class="wp-block-heading" id="6caa">Beacon architecture</h3>



<p>At its core, Beacon reads events from&nbsp;<strong>Kafka</strong>. Everything is represented as a&nbsp;<strong>message</strong>, from alerts to aggregation rules, snooze orders and so on. The pipeline is divided into two branches:</p>



<ul class="wp-block-list"><li>One that is running the&nbsp;<strong>aggregations</strong>, and triggering notifications based on customer’s rules.</li><li>One that is handling the&nbsp;<strong>escalation steps</strong>.</li></ul>



<p>Then everything is merged to&nbsp;<strong>generate a notification</strong>, which is then forwarded to the right person. A notification message is pushed into Kafka, where it will be consumed by another component called&nbsp;<strong>beacon-notifier</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="470" src="/blog/wp-content/uploads/2019/01/002.png" alt="Beacon architecture" class="wp-image-14349" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/002.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-768x408.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>If you are new to streaming architecture, I recommend reading&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/programming-model.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dataflow Programming Model</a>&nbsp;from Flink official documentation.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/01/003.png" alt="Handling state" class="wp-image-14350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/003.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Everything is merged into a DataStream,&nbsp;<strong>partitioned</strong>&nbsp;(<a href="https://medium.com/r/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.7%2Fdev%2Fstream%2Fstate%2Fstate.html%23keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">keyed by&nbsp;</a>in the Flink API) by users. Here&#8217;s an example:</p>



<pre class="wp-block-code"><code lang="java" class="language-java">final DataStream&lt;Tuple4&lt;PlanIdentifier, Alert, Plan, Operation>> alertStream =

  // Partitioning Stream per AlertIdentifier
  cleanedAlertsStream.keyBy(0)
  // Applying a Map Operation which is setting since when an alert is triggered
  .map(new SetSinceOnSelector())
  .name("setting-since-on-selector").uid("setting-since-on-selector")

  // Partitioning again Stream per AlertIdentifier
  .keyBy(0)
  // Applying another Map Operation which is setting State and Trend
  .map(new SetStateAndTrend())
  .name("setting-state").uid("setting-state");</code></pre>



<ul class="wp-block-list"><li><strong>SetSinceOnSelector</strong>, which sets&nbsp;<strong>since</strong>&nbsp;when the alert has been triggered,</li><li><strong>SetStateAndTrend</strong>, which sets the&nbsp;<strong>state</strong>&nbsp;(ONGOING, RECOVERY or OK) and the&nbsp;<strong>trend</strong> (whether we have more or fewer metrics in error).</li></ul>



<p>Each of these classes is under 120 lines of code, because Flink&nbsp;<strong>handles all the difficulties</strong>. Most of the pipeline is&nbsp;<strong>composed only</strong>&nbsp;of&nbsp;<strong>classic transformations</strong>&nbsp;such as&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Map, FlatMap, Reduce</a>, including their&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#rich-functions" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Rich</a>&nbsp;and&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#using-managed-keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Keyed</a>&nbsp;versions. We have a few&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/process_function.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Process Functions</a>, which are&nbsp;<strong>very handy</strong>&nbsp;for developing, for example, the escalation timer.</p>
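<p>To give an idea of what such an escalation timer has to compute, here is a plain-Java sketch of the step calculation. In a real Flink Process Function this would be driven by registered timers rather than computed on demand, and the delays below are made-up values.</p>

```java
import java.time.Duration;
import java.util.List;

// Illustrative sketch: given when an alert was triggered and the configured
// escalation delays, compute which escalation step has been reached.
// The real implementation relies on Flink timers; this is plain Java.
public class EscalationSketch {

    public static int escalationStep(long triggeredAtMs, long nowMs, List<Duration> delays) {
        long elapsed = nowMs - triggeredAtMs;
        int step = 0;
        long cumulative = 0;
        for (Duration d : delays) {
            cumulative += d.toMillis();
            if (elapsed >= cumulative) {
                step++; // this escalation level has fired
            }
        }
        return step;
    }

    public static void main(String[] args) {
        List<Duration> delays = List.of(Duration.ofMinutes(5), Duration.ofMinutes(15));
        // 10 minutes after the trigger: the first escalation has fired, the second not yet.
        System.out.println(escalationStep(0, Duration.ofMinutes(10).toMillis(), delays));
    }
}
```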



<h3 class="wp-block-heading" id="a77e">Integration tests</h3>



<p>As the number of classes grew, we needed to test our pipeline. Because it is only wired to Kafka, we wrapped the consumer and producer to create what we call&nbsp;<strong>scenari</strong>: a series of integration tests running different scenarios.</p>
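<p>A minimal sketch of the idea, where a hypothetical harness replaces Kafka with in-memory lists (the real tests wrap actual consumers and producers, and the event types here are simplified to strings):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical scenario harness: feed a list of input events through a
// pipeline transformation and collect the outputs, instead of going
// through Kafka. Purely illustrative of the testing approach.
public class ScenarioSketch {

    public static List<String> runScenario(List<String> inputs, Function<String, String> pipeline) {
        List<String> outputs = new ArrayList<>();
        for (String event : inputs) {
            outputs.add(pipeline.apply(event));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Scenario: every raw event should come out tagged with a state.
        List<String> out = runScenario(List.of("cpu-high", "disk-full"), e -> e + ":ONGOING");
        System.out.println(out);
    }
}
```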



<h3 class="wp-block-heading" id="5f8f">Queryable state</h3>



<p>One killer feature of Apache Flink is the&nbsp;<strong>capability of&nbsp;</strong><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/queryable_state.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>querying the internal state</strong></a><strong>&nbsp;of an operator</strong>. Even though it is a beta feature, it allows us to get the current state of the different parts of the job:</p>



<ul class="wp-block-list"><li>which escalation step we are on</li><li>whether the alert is snoozed or <em>ack</em>-ed</li><li>which alerts are ongoing</li><li>and so on.</li></ul>
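<p>Conceptually, queryable state behaves like a read-only lookup into the operator&#8217;s keyed state, which Flink exposes over the network. The tiny in-memory model below is only an illustration of that contract; the class and field names are invented and are not OVH&#8217;s actual schema.</p>

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of queryable state: the job keeps per-alert state keyed by an
// identifier, and an external API layer can read (but not modify) it.
public class QueryableStateModel {

    public static class AlertView {
        public final String state;
        public final int escalationStep;
        public final boolean snoozed;

        public AlertView(String state, int escalationStep, boolean snoozed) {
            this.state = state;
            this.escalationStep = escalationStep;
            this.snoozed = snoozed;
        }
    }

    private static final Map<String, AlertView> keyedState = new ConcurrentHashMap<>();

    // The job updates its state as events flow through the operators.
    public static void update(String alertId, AlertView view) {
        keyedState.put(alertId, view);
    }

    // The API layer queries the state without touching the job itself.
    public static Optional<AlertView> query(String alertId) {
        return Optional.ofNullable(keyedState.get(alertId));
    }

    public static void main(String[] args) {
        update("alert-42", new AlertView("ONGOING", 1, false));
        System.out.println(query("alert-42").orElseThrow().state);
    }
}
```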



<div class="wp-block-image size-full wp-image-14361"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="617" src="/blog/wp-content/uploads/2019/01/004-1.png" alt="Queryable state overview" class="wp-image-14361" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Queryable state overview</figcaption></figure></div>



<p>Thanks to this, we easily developed an&nbsp;<strong>API</strong>&nbsp;over the queryable state, which powers the&nbsp;<strong>alerting view</strong>&nbsp;in&nbsp;<a href="https://studio.metrics.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Metrics Studio</a>, our codename for the web UI of the Metrics Data Platform.</p>



<h3 class="wp-block-heading" id="1bc7">Apache Flink deployment</h3>



<p>We deployed the latest version of Flink (<strong>1.7.1</strong>&nbsp;at the time of writing) directly on bare-metal servers, with a dedicated ZooKeeper cluster, using Ansible. Operating Flink has been a really nice surprise for us, with&nbsp;<strong>clear documentation and configuration</strong>&nbsp;and&nbsp;<strong>impressive resilience</strong>. We can&nbsp;<strong>reboot</strong>&nbsp;the whole Flink cluster, and the job&nbsp;<strong>restarts from its last saved state</strong>, as if nothing had happened.</p>



<p>We are using&nbsp;<strong>RocksDB</strong>&nbsp;as a state backend, backed by the OpenStack&nbsp;<strong>Swift storage</strong>&nbsp;provided by OVH Public Cloud.</p>
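<p>In Flink&#8217;s <code>flink-conf.yaml</code>, such a setup boils down to a handful of options. This is a sketch, not our actual configuration: the Swift container, paths and ZooKeeper hosts below are placeholders.</p>

```yaml
# flink-conf.yaml (illustrative values only)
state.backend: rocksdb
state.checkpoints.dir: swift://flink-state.provider/checkpoints
state.savepoints.dir: swift://flink-state.provider/savepoints
high-availability: zookeeper
high-availability.storageDir: swift://flink-state.provider/ha
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
```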



<p>For monitoring, we are relying on the&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus Exporter</a>&nbsp;with&nbsp;<a href="https://github.com/ovh/beamium" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Beamium</a>&nbsp;to gain&nbsp;<strong>observability</strong>&nbsp;over the job&#8217;s health.</p>



<h3 class="wp-block-heading" id="8d7c">In short, we love Apache&nbsp;Flink!</h3>



<p>If you are used to working with stream-related software, you may have noticed that we did not use any rocket science or tricks. We may only be relying on the basic streaming features offered by Apache Flink, but they allowed us to tackle many business and scalability problems with ease.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png" alt="Apache Flink" class="wp-image-14354" width="437" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png 874w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-300x275.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-768x703.png 768w" sizes="auto, (max-width: 437px) 100vw, 437px" /></figure></div>



<p>As such, we highly recommend that every developer take a look at Apache Flink. I encourage you to go through the <a href="https://medium.com/r/?url=https%3A%2F%2Ftraining.da-platform.com%2F" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache Flink Training</a>, written by Data Artisans.&nbsp;Furthermore, the community has put a lot of effort into making Apache Flink easy to deploy to Kubernetes, so you can easily try Flink using our&nbsp;Managed Kubernetes!</p>



<h3 class="wp-block-heading">What’s next?</h3>



<p>Next week we will come back to Kubernetes, as we explain how we deal with etcd in our OVH <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">Managed Kubernetes service</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
