<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observability Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/observability/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/observability/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 17 Apr 2023 14:43:35 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Observability Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/observability/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Picking our Prometheus&#8217; remote storage</title>
		<link>https://blog.ovhcloud.com/picking-our-prometheus-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Mon, 17 Apr 2023 14:43:34 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24588</guid>

					<description><![CDATA[If you are running an IT system, you are most likely running an Observability stack alongside it. Nowadays, the question is no longer whether you need Observability, but how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now. During the last year, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpicking-our-prometheus-remote-storage%2F&amp;action_name=Picking%20our%20Prometheus%26%238217%3B%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>If you are running an IT system, you are most likely running an Observability stack alongside it. Nowadays, the question is no longer whether you need Observability, but how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now. </p>



<p>Over the last year, we had the opportunity to reassess our technical choices. Prometheus is the <em>de facto</em> standard, but that choice is only the beginning of the process. Thanks to open source communities, there are a lot of possible options. </p>



<p>The <a href="https://blog.ovhcloud.com/tag/prometheus/" data-wpel-link="internal">previous posts</a> were about the process we followed to select our new backend; this one concludes the series and shares what we have chosen and why. In case you missed them, this series covers an <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">introduction to Prometheus remote storage</a> and how to bench such a solution from both the <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">write</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">read</a> perspectives, the hard way or <a href="https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">like a pro</a>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg" alt="Picking our Prometheus' remote storage" class="wp-image-25069" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488.jpg 1199w" sizes="(max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">And the winner is&#8230; Grafana Mimir!</h3>



<p>After all the experimentation we have done, we have chosen Grafana Mimir. The first reason this solution is a good fit for us is its excellent read/write performance, as well as its scalability. My team, core-observability, has as its main mission to provide a resilient and feature-full observability infrastructure. Many teams rely on us, and each of them has its own particularities. Multitenancy is therefore a must-have for us: we must be able to prevent side effects and &#8220;noisy neighbour&#8221; issues, which is why rate limiting was on our bucket list. Mimir provides a lot of settings, both at the cluster level and at the tenant level, to make sure one tenant does not impact the others or degrade the quality of service.</p>
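<p>As an illustration only (the tenant names and values below are made up, and the exact limit names should be checked against the Mimir documentation for your version), such per-tenant limits are expressed as overrides in Mimir&#8217;s runtime configuration:</p>

```yaml
# Hypothetical per-tenant overrides: tenant names and thresholds are examples.
overrides:
  team-a:
    ingestion_rate: 100000             # accepted samples per second
    max_global_series_per_user: 1000000
  team-b:
    ingestion_rate: 10000
    max_global_series_per_user: 150000
```

<p>Anything not overridden falls back to the cluster-wide defaults, which is how one noisy tenant can be throttled without touching the others.</p>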



<figure class="wp-block-image alignright size-full is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png" alt="Grafana Mimir" class="wp-image-25072" width="265" height="96" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png 529w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489-300x108.png 300w" sizes="(max-width: 265px) 100vw, 265px" /></figure>



<p>Like many cloud-native technologies, Mimir relies on an <a href="https://www.ovhcloud.com/en/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">object storage</a> where the timeseries are stored. Doing so decouples compute from storage, and therefore avoids having to add more computing power or bigger disks to offer the retention your users need. Data is compacted to achieve the smallest possible storage footprint, and therefore cost efficiency.</p>



<p>As we said in our previous posts, Prometheus is today the <em>de facto</em> standard when it comes to timeseries. We wanted to offer our users the full experience: 100% compliance with <a href="https://promlabs.com/promql-compliance-tests/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">promql</a>, recording and alerting rules. Mimir is fully featured on this side; it&#8217;s even part of a bigger picture with more integrations, which is the icing on the cake. Let&#8217;s start with Grafana, which is of course fully compatible with Mimir; you can also manage your recording and alerting rules directly from its UI. Then comes <a href="https://grafana.com/oss/loki/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Loki</a>, which is like Prometheus but for logs: it allows you to query your logs just like your metrics. And finally <a href="https://grafana.com/oss/tempo/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Tempo</a>, which covers the last observability pillar: distributed tracing.</p>



<p>On the operational side, there is no doubt that Mimir has been built with production stability and resiliency in mind. The default settings are production-ready, the documentation is crystal clear, and you also have the material to facilitate the day-to-day care of Mimir in production. As SREs running Mimir, you can use their knowledge base: you have at your disposal ready-to-use <a href="https://github.com/grafana/mimir/tree/main/operations/mimir-mixin-compiled/dashboards" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/rules.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">recording</a> &amp; <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">alerting</a> rules and <a href="https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">runbooks</a>. Of course, deployments may differ from one another; this is a very good opportunity to contribute back to the vivid open source community around Grafana Labs. No matter the size of the contribution, it is always welcome and reviewed in a timely manner.
Whether you need to adjust the <a href="https://github.com/grafana/mimir/pull/2657" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, add a <a href="https://github.com/grafana/mimir/pull/2864" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">feature</a>&nbsp;or <a href="https://github.com/grafana/mimir/pull/1803" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">build deb/rpm packages</a>, you can always <a href="https://github.com/grafana/mimir/tree/main/docs/internal/contributing" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">contribute</a>.</p>



<p>The definitive reason why we have chosen Mimir is the core values of its maintainers. Kudos to them. They are welcoming, easy-going and, more importantly, they take <a href="https://grafana.com/blog/2022/03/30/announcing-grafana-mimir/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open source seriously</a>, just like us at OVHcloud. If you want a glimpse of that, come by their Slack and see how fast they answer.</p>



<p>My team can&#8217;t wait to see all the beautiful things our users will do with this backend. One thing is sure: we&#8217;ll contribute back and make sure Mimir thrives. But let&#8217;s reserve that part for a future blog post.</p>
<img decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpicking-our-prometheus-remote-storage%2F&amp;action_name=Picking%20our%20Prometheus%26%238217%3B%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus like a pro with k6</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Tue, 04 Apr 2023 12:19:05 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24585</guid>

					<description><![CDATA[In our previous posts about choosing a Prometheus remote storage, we have seen how to&#160;set up a benchmarking infrastructure and how to benchmark promql performance. We have been able to obtain results, but the whole benchmark is flawed in many ways. This blog post discusses how we should have benchmarked our remote storage. How to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-like-a-pro-with-k6%2F&amp;action_name=Benchmarking%20Prometheus%20like%20a%20pro%20with%20k6&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In our previous posts about choosing a Prometheus remote storage, we have seen how to&nbsp;<a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">set up a benchmarking infrastructure</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance" data-wpel-link="internal">how to benchmark promql performance</a>. We have been able to obtain results, but the whole benchmark is flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more infrastructure than necessary to assess a particular aspect of your remote storage.</li>



<li>it&#8217;s hard to reproduce exactly the same setup; even with the same configuration and software versions you will get similar results, but not exactly the same ones.</li>



<li>you&#8217;re not always benchmarking what you think you are. We have spent quite some time troubleshooting performance issues which were in fact in the <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or HAProxy configuration.</li>



<li>it focuses mainly on the write path, without stress on the read path, which is not realistic.</li>
</ul>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg" alt="Benchmarking Prometheus like a pro with k6" class="wp-image-24943" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>This blog post discusses how we should have benchmarked our remote storage.</p>



<h3 class="wp-block-heading">How to do a good benchmark? K6 to the rescue</h3>



<p>A good benchmark needs to be accurate and reproducible. Moreover, for our use case we want a tool that takes into account both Prometheus&#8217;s read and write paths. Finally, we need to be able to remove all unnecessary pieces, so that we can focus on the remote storage only.</p>



<p>Such software could be a project on its own, but fortunately for us there is an open source solution for that: <a href="https://k6.io/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">K6</a>.</p>



<p>K6 is a modern, general-purpose load testing tool which can be extended with modules to support Prometheus remote storage. Sounds interesting, don&#8217;t you think?</p>



<p>In our previous blog post we explained how we built our benchmarking infrastructure, which was rather complex in order to be accurate.</p>



<figure class="wp-block-image aligncenter"><img decoding="async" src="https://github.com/wilfriedroset/remote-storage-wars/blob/master/assets/generic-infrastructure.png?raw=true" alt="generic-infrastructure.png"/></figure>



<p>With k6 as the benchmarking tool, the infrastructure can be greatly simplified:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png" alt="With k6 as benchmarking tool the infrastructure can be greatly simplified" class="wp-image-24941" width="512" height="218" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-768x327.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6.png 1127w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>K6 is quite flexible and configurable. Its input is a load testing script: you can either write your own script or reuse an <a href="https://github.com/grafana/mimir/tree/main/operations/k6" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open-sourced one</a>. As the whole logic lives in the load testing script, the benchmark becomes easily reproducible, which is exactly what we need.</p>
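<p>As a minimal sketch (the endpoint and query below are placeholders, not the Grafana script we actually reused), a k6 load testing script, run with <code>k6 run script.js</code>, looks like this:</p>

```javascript
// Minimal k6 sketch; requires the k6 binary (this is not plain Node.js code).
// The target URL is a made-up placeholder.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 10,        // ten virtual users
  duration: '5m', // sustained load for five minutes
};

export default function () {
  // Exercise the read path with an instant PromQL query
  const res = http.get(
    'https://remote-storage.example.com/prometheus/api/v1/query?query=up'
  );
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

<p>The <code>options</code> object drives the load profile (number of virtual users, duration, or ramp-up stages), while the default function is what each virtual user executes in a loop; the checks it performs are what end up in the report shown below.</p>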



<p>To launch a benchmark you need two pieces of infrastructure:</p>



<ul class="wp-block-list">
<li>Somewhere to run k6, which could be a <a href="https://www.ovhcloud.com/en-ie/public-cloud/prices/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">c2-120 instance on our public cloud</a></li>



<li>A remote storage to benchmark; for a quick start, users are one Helm install away on <a href="https://www.ovhcloud.com/en-ie/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">k8s</a></li>
</ul>



<p>For our use case, we have chosen to reuse the load testing script from Grafana, which does exactly what we are looking for. All the useful information to tune and assess your remote storage is output by k6.</p>



<pre class="wp-block-code"><code class="">     ✓ write worked

     █ instant query high cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     █ range query

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field is 'success' to equal 'success'
       ✓ expected resultType is 'matrix' to equal 'matrix'

     █ instant query low cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     checks............................................................................: 100.00% ✓ 1454     ✗ 0
     ✓ { type:read }...................................................................: 0.00%   ✓ 0        ✗ 0
     ✓ { type:write }..................................................................: 100.00% ✓ 6        ✗ 0
     data_received.....................................................................: 1.0 MB  8.4 kB/s
     data_sent.........................................................................: 277 kB  2.3 kB/s
     group_duration....................................................................: avg=64.61ms min=39.94ms med=60.43ms max=230.05ms p(90)=80.39ms p(95)=107.93ms
     http_req_blocked..................................................................: avg=4.65ms  min=2µs     med=6µs     max=96.84ms  p(90)=11µs    p(95)=58.42ms
     http_req_connecting...............................................................: avg=1.31ms  min=0s      med=0s      max=21.87ms  p(90)=0s      p(95)=16.99ms
     http_req_duration.................................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
       { expected_response:true }......................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
     ✓ { type:read }...................................................................: avg=53.8ms  min=34.23ms med=52.76ms max=164.1ms  p(90)=66.85ms p(95)=71.62ms
     ✓ { url:https://admin:security-matters@remote-storage.poc.ovh.net/api/v1/push }...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_failed...................................................................: 0.00%   ✓ 0        ✗ 368
     http_req_receiving................................................................: avg=92.34µs min=32µs    med=89µs    max=301µs    p(90)=125.3µs p(95)=150µs
     http_req_sending..................................................................: avg=49.05µs min=12µs    med=40µs    max=566µs    p(90)=68µs    p(95)=94.59µs
     http_req_tls_handshaking..........................................................: avg=3.11ms  min=0s      med=0s      max=54.28ms  p(90)=0s      p(95)=39.39ms
     http_req_waiting..................................................................: avg=53.56ms min=33.94ms med=52.56ms max=163.93ms p(90)=66.88ms p(95)=71.66ms
     http_reqs.........................................................................: 368     3.064697/s
     iteration_duration................................................................: avg=64.88ms min=40.34ms med=60.78ms max=230.27ms p(90)=80.87ms p(95)=108.47ms
     iterations........................................................................: 368     3.064697/s
     vus...............................................................................: 26      min=26     max=26
     vus_max...........................................................................: 26      min=26     max=26
</code></pre>



<p>What a time saver! With k6, we have been able to efficiently assess all remote storage solutions. This is a <strong>significant</strong> improvement compared to our previous benchmarking plan.</p>



<p>The next and final post will be about which remote storage we have chosen to be our internal solution.</p>



<p>Stay tuned.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-like-a-pro-with-k6%2F&amp;action_name=Benchmarking%20Prometheus%20like%20a%20pro%20with%20k6&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus promql performance</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/</link>
		
		<dc:creator><![CDATA[Julien Girard]]></dc:creator>
		<pubDate>Fri, 17 Mar 2023 12:00:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24598</guid>

					<description><![CDATA[Here at OVHcloud, we are trying to replace our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choice, we wanted to apply a bullet-proof test to the challengers. You can do two main things with a storage [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-promql-performance%2F&amp;action_name=Benchmarking%20Prometheus%20promql%20performance&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Here at OVHcloud, we are trying to replace our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choice, we wanted to apply a bullet-proof test to the challengers.</p>



<p>You can do two main things with a storage backend: you can write to it, or you can read from it. It is the test of this latter part that we are focusing on today. We wanted our test to reproduce a production-oriented scenario; let&#8217;s see how we built it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg" alt="Benchmarking Prometheus promql performance" class="wp-image-24878" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>In this blog post we won&#8217;t cover the building of the underlying TSDB, as what follows could apply to any of them as long as it ensures PromQL compatibility. We will also assume that you can write to the TSDB using the Prometheus remote write protocol.</p>



<p>Now that we have our bench cluster up and running, we need to fill it up with data and this is the subject of the following part.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Let’sfindsome“real”data">Let’s find some <em>&#8220;real&#8221;</em> data</h3>



<p>As a cloud provider, all our solutions use compute instances, whether they are virtual or baremetal. One of our most common use cases is to <em>&#8220;look&#8221;</em> at system server metrics through automatic <a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">recording rules</a> or through Grafana dashboards. All these queries are PromQL expressions.</p>



<p>To emulate our ingestion workflow, we deployed nodes exposing their metrics through <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node exporter</a>. We also tasked a couple of Prometheus servers with scraping them several times each, to emulate a large number of hosts (several thousands of them). Those Prometheus servers are in charge of writing the scraped metrics, using the remote write protocol, to the various backends we are benchmarking.</p>
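<p>As a sketch (hostnames, labels and the remote write endpoint below are made up for the example), scraping the same node exporter several times under different labels looks like this in a Prometheus configuration:</p>

```yaml
# prometheus.yml fragment: the same node exporter is scraped twice,
# with a distinguishing label, so it shows up as two distinct hosts.
scrape_configs:
  - job_name: 'node-emulated-1'
    static_configs:
      - targets: ['node1.example.com:9100']
        labels:
          emulated_host: 'host-0001'
  - job_name: 'node-emulated-2'
    static_configs:
      - targets: ['node1.example.com:9100']
        labels:
          emulated_host: 'host-0002'

remote_write:
  - url: 'https://tsdb.example.com/api/v1/push'  # backend-specific endpoint
```

<p>Multiplying such scrape jobs is how a handful of real machines can stand in for thousands of emulated hosts on the write path.</p>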



<p>After waiting several hours or days, our backends are full of data and we can move on. If you need more info on this subject, we have written another <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">blog post</a> about it.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimetobench">It’s time to bench</h3>



<p>As we said earlier, our production read workload has two components: automatic recording rules and Grafana dashboards. As our alerting system is not widely distributed, we won&#8217;t discuss it here, so let&#8217;s focus on the Grafana part. A dashboard is a collection of requests to execute against a backend, which is why we have extracted both the range and instant queries from one.</p>



<p>Once we had this first result, we needed a way to execute these requests against the backend. As a PromQL request is mainly an HTTP call, we can use an HTTP benchmark tool to support our test. One of the most widely used is <a href="https://jmeter.apache.org" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache JMeter</a>, and this is the one we used.</p>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png" alt="Graphana dashboard" class="wp-image-24880" width="235" height="184" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png 469w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336-300x235.png 300w" sizes="auto, (max-width: 235px) 100vw, 235px" /></figure>



<p>As Apache JMeter is not able to directly execute PromQL requests against a Prometheus-compliant backend, the previously extracted queries have been converted into a test plan. This test plan takes various parameters, three of which are quite important: the timestamp, the interval and the step, which apply to every query forwarded to the backend, just like when you submit a time frame to a dashboard in Grafana.</p>
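<p>Concretely, each such query is a single HTTP call. The sketch below (the base URL and query are placeholders) builds a <code>query_range</code> request for a given time frame and step, the way a Grafana panel would:</p>

```javascript
// Build a Prometheus range-query URL for a given time frame,
// the way a Grafana dashboard panel would. The base URL is a placeholder.
function rangeQueryUrl(baseUrl, promql, endSeconds, rangeSeconds, stepSeconds) {
  const params = new URLSearchParams({
    query: promql,
    start: String(endSeconds - rangeSeconds),
    end: String(endSeconds),
    step: String(stepSeconds),
  });
  return `${baseUrl}/api/v1/query_range?${params.toString()}`;
}

// Example: a 1h time frame with a 15s step
const url = rangeQueryUrl(
  'https://tsdb.example.com',
  'rate(node_cpu_seconds_total[5m])',
  1700000000, // end timestamp (Unix seconds)
  3600,       // 1h range
  15          // 15s step
);
```

<p>Varying the range and step then emulates the different time frames a user can select in Grafana.</p>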



<p>We are now able to emulate the load of a dashboard over various time frames and extract meaningful information from each run, as Apache JMeter is quite a powerful tool. It allows us to use a warm-up period to exploit the benefits of caches, or a ramp-up to study the response of our cluster when the load increases, loading either always the same data or data from new nodes.</p>



<p>For our first bench, we decided to go with <a href="https://grafana.com/grafana/dashboards/1860-node-exporter-full" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">the most widely used node exporter dashboard</a>. We also identified widely used time frames (5m, 15m, 30m, 1h, 6h, 12h, 24h, 2d, 3d, 4d, 5d, 6d, 7d); those are mainly the default time frames proposed by Grafana.</p>



<p>With the set of tools defined above, we identified three tests we wanted to run against each one of those time frames.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Firsttest&quot;Hotandcoldstorage&quot;">First test &#8220;Hot and cold storage&#8221;</h3>



<p>A lot of solutions use hot and cold storage, sometimes also named short-term storage and long-term storage. With this test we want to measure the performance of those various layers.</p>



<p>As the purpose of this test is to check the response time of the various underlying storage layers, you want to make sure to disable any cache that may alter the results.</p>



<p>Moreover, we do not want to test the saturation of the platform, so we will emulate only ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Secondtest&quot;Cachingperformances&quot;">Second test &#8220;Caching performances&#8221;</h3>



<p>This test is quite the opposite of the previous one. Here we want to test the response time of the TSDB in the best possible scenario (data already cached).</p>



<p>To get the best results from this test, we will use a warm-up period that will populate the various caches and then measure the response time of the TSDB.</p>



<p>Once again, in this test, we do not want to test the saturation of the platform, so we will emulate only ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Thirdtest&quot;Fillingupthecache&quot;">Third test &#8220;Filling up the cache&#8221;</h3>



<p>The purpose of this last bench is to test the saturation of the platform. Here we will use a ramp-up, adding more and more clients to the test over a defined period of time, and check the corresponding errors and response times of the underlying platform.</p>



<p>At a certain point, we should see that the platform is not able to handle any more clients. We assume this number of clients will differ with the lookup time frame.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Conclusion">Conclusion</h3>



<p>The benchmark led to two expected conclusions.</p>



<ul class="wp-block-list">
<li>Some storage media are way faster than others (memory is faster than local disk, which is faster than a distant object storage).</li>



<li>The use of the various caches proposed is a game changer.</li>
</ul>



<h4 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimeforasecondconclusion">It’s time for a second conclusion</h4>



<p>Our approach to the benchmark is quite interesting, as it aims to emulate our production workload as precisely as possible. You may be wondering where we store this wonderful collection of tools. Well, here is the truth: maybe those tools don&#8217;t need to be shared, for several reasons:</p>



<ul class="wp-block-list">
<li>The result of the test largely depends on the data stored inside the TSDB, which is itself the result of another procedure and is difficult to reproduce. That leads to results that are subject to interpretation</li>



<li>The tooling is difficult to use and time-consuming</li>



<li>Just as time flies, the truth of today is not that of tomorrow, and your production reality of today will probably be quite different from the one to come</li>



<li>We like to fight the &#8220;not invented here&#8221; syndrome</li>
</ul>



<p>In consequence, we need a more convenient tool, ideally one used by others, with a more reproducible benchmarking pattern. We will discuss how we should have benchmarked our remote storage in the next blog post.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-promql-performance%2F&amp;action_name=Benchmarking%20Prometheus%20promql%20performance&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Prometheus&#8217; remote storage playground</title>
		<link>https://blog.ovhcloud.com/prometheus-remote-storage-playground/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Sun, 05 Mar 2023 23:49:35 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24583</guid>

					<description><![CDATA[Introduction In the previous post we have discuss how important remote storage are for Prometheus. We have also covered several attention points. In the following post we are covering remote write storage and how to bench them. Context After you have identify one (or more) remote storage who might suit your must bench it. However [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprometheus-remote-storage-playground%2F&amp;action_name=Prometheus%26%238217%3B%20remote%20storage%20playground&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="Remotestorageplayground-Introduction">Introduction</h3>



<p>In the <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">previous post</a> we discussed how important remote storage is for Prometheus. We also covered several points of attention. In this post, we cover remote <strong>write</strong> storage and how to benchmark it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg" alt="Prometheus' remote storage playground" class="wp-image-24835" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h4 class="wp-block-heading" id="Remotestorageplayground-Context">Context</h4>



<p>After you have identified one (or more) remote storage solutions that might suit your needs, you must benchmark them. However, it is not as straightforward as it seems. Let&#8217;s review what we will need for this experiment:</p>



<ul class="wp-block-list">
<li>A (scalable) remote storage, in our case one that supports remote write</li>



<li>One or more data generator</li>
</ul>



<h3 class="wp-block-heading" id="Remotestorageplayground-IntroducingHachimon">Introducing Hachimon</h3>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png" alt="Hachimon path" class="wp-image-24832" width="277" height="213" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png 554w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322-300x231.png 300w" sizes="auto, (max-width: 277px) 100vw, 277px" /></figure>



<p>Benchmarking is always fun, but you know what is even more fun? Gamification! With my teammates, we created a short benchmark plan which we called the <a href="https://narutofanon.fandom.com/wiki/Hachimon" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Hachimon path</a>:</p>



<ul class="wp-block-list">
<li>Gate of Opening
<ul class="wp-block-list">
<li>1k targets</li>



<li>1000 series/target</li>



<li>~ 66k datapoints/sec</li>
</ul>
</li>



<li>Gate of Healing
<ul class="wp-block-list">
<li>2k targets</li>



<li>1000 series/target</li>



<li>~133k datapoints/sec</li>
</ul>
</li>



<li>Gate of Life
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec</li>
</ul>
</li>



<li>Gate of Pain
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec after deduplication</li>



<li>dual prometheus to increase pressure on deduplication features</li>
</ul>
</li>



<li>Gate of Limit
<ul class="wp-block-list">
<li>4k targets</li>



<li>2500 series/target to increase pressure on storage</li>



<li>~660k datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of View
<ul class="wp-block-list">
<li>8k targets</li>



<li>2500 series/target</li>



<li>~1.3M datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of Wonder
<ul class="wp-block-list">
<li>10k targets</li>



<li>2500 series/target</li>



<li>~1.6M datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of Death
<ul class="wp-block-list">
<li>Add as many targets as you can until the backend is almost on fire</li>
</ul>
</li>
</ul>
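<p>The datapoint rates listed for each gate follow from a simple calculation: targets &#215; series per target &#247; scrape interval. The quick sketch below assumes a 15-second scrape interval, which the figures above are consistent with but which we do not state explicitly:</p>

```python
# Sanity-check the gate figures: datapoints/sec = targets * series_per_target / scrape_interval.
# The 15 s scrape interval is an assumption, not stated in the post.
SCRAPE_INTERVAL_S = 15

def datapoints_per_sec(targets: int, series_per_target: int, interval_s: int = SCRAPE_INTERVAL_S) -> float:
    return targets * series_per_target / interval_s

print(datapoints_per_sec(1_000, 1000))   # Gate of Opening: ~66k
print(datapoints_per_sec(4_000, 1000))   # Gate of Life: ~266k
print(datapoints_per_sec(10_000, 2500))  # Gate of Wonder: ~1.6M
```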



<p>To walk the Hachimon path, we&#8217;ve built an infrastructure where only the central piece, the remote storage, changes. Doing so helps us compare results.</p>



<p>The write path is stressed by one or more Prometheus clusters, which scrape the same <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node_exporter</a> many times under different sets of labels. Doing so allows us to emulate an infrastructure bigger than it is. To increase the cardinality, we can tweak the node_exporter configuration to expose more or fewer series. By deploying one or more Prometheus clusters, we can both stress the deduplication feature of the backend and work around the hardware limitations of a given Prometheus.</p>
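<p>As a sketch, such a setup can be expressed in the Prometheus scrape configuration by declaring the same exporter address several times under different label sets. The addresses and label names below are illustrative, not our actual configuration:</p>

```yaml
# Illustrative prometheus.yml fragment: the same node_exporter instance is
# scraped twice, each time under a different label set, to emulate a fleet
# larger than the real one.
scrape_configs:
  - job_name: "emulated-fleet-a"
    static_configs:
      - targets: ["10.0.0.1:9100"]
        labels:
          fleet: "a"
          replica: "1"
  - job_name: "emulated-fleet-b"
    static_configs:
      - targets: ["10.0.0.1:9100"]   # same exporter, different labels
        labels:
          fleet: "b"
          replica: "2"
```

Each distinct label set produces a distinct set of series in the backend, so one small instance can stand in for many.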



<p>This approach is very similar to the one of <a href="https://valyala.medium.com/promscale-vs-victoriametrics-resource-usage-on-production-workload-91c8e3786c03" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Victoriametrics</a> which has inspired us. Kudos!</p>



<p>By the time we reached the end of our tests, the infrastructure we had built looked like the following:</p>



<figure class="wp-block-image"><img decoding="async" src="https://raw.githubusercontent.com/wilfriedroset/remote-storage-wars/master/assets/generic-infrastructure.png" alt=""/></figure>



<p>This is the infrastructure we used to bench both the read and the write paths of the remote storages. There is load balancing on both sides, and multiple pairs of Prometheus servers to put more or less pressure on the write path and the deduplication. Finally, the data comes from small instances exposing node_exporter metrics.</p>



<h3 class="wp-block-heading" id="Remotestorageplayground-Expectation">Expectation</h3>



<p>Thanks to this benchmarking plan, we have been able to differentiate the remote storages from a performance perspective. We got a first understanding of how each remote storage works, how to tune them, and what you can and cannot do with them. It seems to us that ease of operation is just as important as good performance. But most importantly, we learnt a lot while having fun.</p>



<h3 class="wp-block-heading" id="Remotestorageplayground-Conclusion">Conclusion</h3>



<p>This benchmarking plan is obviously flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more infrastructure than necessary to assess a particular aspect of your remote storage.</li>



<li>it&#8217;s hard to reproduce 100% the same setup; even with the same configuration and software version, you will get a similar result but not exactly the same one.</li>



<li>you&#8217;re not always benchmarking what you think you are. We spent quite some time troubleshooting performance issues which turned out to be in <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or HAProxy configuration.</li>



<li>it focuses mainly on the write path, without stress from the read path, which is not realistic.</li>
</ul>



<p>The next two posts of this series continue to focus on benchmarking. The first one focuses on read performance.</p>



<p>The second one focuses on how we should have benchmarked our solution from the beginning.</p>



<p>Stay tuned!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprometheus-remote-storage-playground%2F&amp;action_name=Prometheus%26%238217%3B%20remote%20storage%20playground&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Welcome to Prometheus world of remote storage</title>
		<link>https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Thu, 16 Feb 2023 16:29:25 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24579</guid>

					<description><![CDATA[At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for on open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwelcome-to-prometheus-world-of-remote-storage%2F&amp;action_name=Welcome%20to%20Prometheus%20world%20of%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for an open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission was to find a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. We begin this series with an introduction to Prometheus remote storage&#8230;</em></p>



<p>Over the last decade, Prometheus has become one of the standards for Observability. Its core concept is well suited to today&#8217;s technological use cases, and it makes sense that the open source community loves it. While Prometheus does a lot of things really well, when it comes to long-term storage users must find a solution. This blog post series discusses Prometheus&#8217; remote storages, the technical challenges they aim to solve and, more importantly, how to pick the right one for <strong>you</strong>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg" alt="Prometheus love remote storage" class="wp-image-24617" width="640" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-300x159.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-768x407.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2.jpg 1194w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">What is a remote storage?</h2>



<p>Prometheus can be configured to read from or write to a remote storage on top of its local storage. This allows it to support long-term storage of user data. The two features are called&nbsp;<a href="https://prometheus.io/docs/operating/configuration/#remote_read" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_read</a> and <a href="https://prometheus.io/docs/operating/configuration/#remote_write" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_write</a>.</p>



<p>With remote_read configured, Prometheus will answer read queries with data from the remote storage. The remote_write feature is responsible for shipping samples to the remote storage. Both of them are extremely useful and highly configurable. For the rest of this blog post, let&#8217;s focus on remote write.</p>
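<p>As a minimal sketch, both features are enabled in <code>prometheus.yml</code> with a few lines; the endpoint URLs below are placeholders, not a real service:</p>

```yaml
# Minimal sketch: endpoint URLs are placeholders.
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    queue_config:
      max_samples_per_send: 2000   # tune batching for your backend
remote_read:
  - url: "https://remote-storage.example.com/api/v1/read"
    read_recent: false             # serve recent data from the local TSDB
```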



<p>Whether you are a cloud provider or building an in-house Observability stack, it is not always appropriate, nor even possible, to connect to your customers&#8217; infrastructure to extract data.</p>



<p>With a remote write approach, customers can keep strict control over what comes in and out of their infrastructure. We could argue that iptables coupled with authentication is secure enough, but this is still one more door to keep an eye on. With tight security taken into account, we understand that remote write makes a lot of sense from a service provider&#8217;s point of view.</p>



<p>Now that we know that we want a remote-write-compatible storage, we must take into account that not all remote storages are equal. The list of solutions keeps growing every day; let&#8217;s see if we can differentiate them.</p>



<p>When we write metrics to a remote storage, it is because we want to read them back later. Most Observability use cases imply writing down tons of data that will be queried afterwards. PromQL is the query language used to query Prometheus, and therefore the associated remote storages. It would make sense to check how PromQL-compliant the solutions are. Fear not, the Prometheus community is already tackling this question for us with <a href="https://promlabs.com/promql-compliance-tests/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">PromQL Compliance</a>.</p>
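<p>To give an idea of what such compliance suites exercise, here is a typical PromQL query mixing range selectors, <code>rate()</code> and aggregation; the metric is a standard node_exporter one, used purely as an example:</p>

```promql
# Per-instance CPU usage over the last 5 minutes, excluding idle time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

A backend must evaluate expressions like this with the same semantics and accuracy as Prometheus itself to be considered compliant.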



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="621" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png" alt="" class="wp-image-24580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-300x182.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-768x466.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1536x932.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-2048x1243.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">PromQL Compliance results as of 2021-10-14</figcaption></figure>



<p>As you can see, most remote storages are 100% compliant with Prometheus&#8217; results. Good news: this means users have a plethora of solutions to choose from.</p>



<p>However, readers must not underestimate this point. Indeed, compliance impacts what you can query from the backend, how you can query it and the accuracy of the results. It might not be trivial to reach full compliance and to stay compliant. Maintainers might also choose not to be compliant, and explain <a href="https://medium.com/@romanhavronenko/victoriametrics-promql-compliance-d4318203f51e" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">why</a>.</p>



<p>The Prometheus world is growing in adoption and is under active development. If a solution is compliant today, there is no guarantee it&#8217;ll stay compliant tomorrow.</p>



<p>Which brings us to the second point: the community. How healthy, large and active is the community behind each piece of software?<br>Is it easy to contact them? To discuss issues? To propose features and PRs? We tend to take for granted the fact that PRs will be reviewed, and that we&#8217;ll find someone to help us troubleshoot a bug, but this is not necessarily the case.</p>



<h3 class="wp-block-heading">Feature set</h3>



<p>To better address the technical challenges that are your own, you must pick the solution that has the features you need. If you need multi-tenancy, check that point. If you need to downsample your data, add this to your checklist. Don&#8217;t be shy; dig a little deeper. Test the feature, look for its limitations. Tests are the only way to make an informed decision.</p>



<p>To give you an idea you might want to have a look at the following features:   </p>



<ul class="wp-block-list">
<li>multi tenancy</li>



<li>rate limiting</li>



<li>deduplication</li>



<li>deletion</li>



<li>downsampling</li>
</ul>



<h3 class="wp-block-heading">Scalability</h3>



<p>Nowadays, the word scalability is present almost everywhere. How well does each remote storage scale? Can you write 2M samples/sec? Can you answer 1M queries/sec? Can you have 200M active series in total? <a href="https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">1B active series</a>? Per tenant?</p>



<p>You can get a rough understanding of the bottlenecks by looking at the architecture diagram. But to have a crystal-clear answer there is only one way: you need to make a proof of concept.</p>



<p>By the way, you can even try a remote storage right now on our <a href="https://www.ovhcloud.com/en/public-cloud/kubernetes" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">managed k8s</a>. Most of the open source remote storages offer Helm charts or operators to do so: <a href="https://github.com/VictoriaMetrics/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">VictoriaMetrics</a>, <a href="https://github.com/timescale/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Timescale</a>, <a href="https://github.com/grafana/mimir/tree/main/operations/helm/charts/mimir-distributed" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Mimir</a>.</p>



<h3 class="wp-block-heading">Cost</h3>



<p>Along with scalability comes <em>TCO</em>, which stands for <em>Total Cost of Ownership</em>. This boils down to how expensive a solution and its infrastructure can be when you take all costs into account. For remote storage, on top of the team operating the infrastructure, we must take into account the infrastructure itself. Any technical solution relies on four categories of cost: trained engineers, compute resources, network and&#8230; storage. It is critical to take into account all aspects of the target solution. Otherwise, be ready for a surprise at the end of the month.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>As we have shown, we have a lot of technical solutions to address long-term storage. However, before putting one solution in production, we need to thoroughly identify and assess all the trade-offs. In the next posts, we will have a look at how to get to know your remote storage, bench it, and break it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwelcome-to-prometheus-world-of-remote-storage%2F&amp;action_name=Welcome%20to%20Prometheus%20world%20of%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVHcloud Web Statistics: A new statistics interface for your OVHcloud hosted website</title>
		<link>https://blog.ovhcloud.com/ovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website/</link>
		
		<dc:creator><![CDATA[Matias Hastaran]]></dc:creator>
		<pubDate>Fri, 29 May 2020 12:43:41 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18344</guid>

					<description><![CDATA[If you have ever managed or edited a website,&#160;you will likely have experience tracking page views and hit statistics. If this is the case, then this article is for you! Get ready to step into 2020 with a brand new interface! A bit of history There are multiple solutions on the market designed to help [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website%2F&amp;action_name=OVHcloud%20Web%20Statistics%3A%20A%20new%20statistics%20interface%20for%20your%20OVHcloud%20hosted%20website&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>If you have ever managed or edited a website,&nbsp;you will likely have experience tracking page views and hit statistics. </p>



<p>If this is the case, then this article is for you! Get ready to step into 2020 with a brand new interface!</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-1024x538.jpeg" alt="OVHcloud Web Statistics: A new statistics interface for your OVHcloud hosted website" class="wp-image-18375" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-1024x538.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Abitofhistory">A bit of history</h2>



<p>There are multiple solutions on the market designed to help you analyse visits to your website.</p>



<p>Two methods exist to help you gather this information:</p>



<ul class="wp-block-list"><li>Embed some code on your website to track your visits and send those results to a third party to render those results</li><li>Keep control of your data: analyse your own raw logs to compute the relevant metrics</li></ul>



<p>In 2004, in order to keep your data safe, we decided to use an on-premise solution called Urchin&#8230; but it&#8217;s time for a change!</p>



<p>Why?</p>



<ul class="wp-block-list"><li>Urchin was bought&nbsp;by Google and&nbsp;the software has been discontinued. It has therefore not evolved since 2012.</li><li>Urchin is Flash Player based. Flash Player is discontinued and will be stopped by Adobe in 2020. There will be no more support for it.</li><li>It doesn&#8217;t offer the best possible experience.</li><li>Urchin doesn&#8217;t allow users to visualize subdomain statistics (for example: app.mydomain.com).</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52.jpeg" alt="" class="wp-image-18378" width="302" height="133" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52.jpeg 603w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52-300x132.jpeg 300w" sizes="auto, (max-width: 302px) 100vw, 302px" /></figure></div>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Howdoweprovideyourwebsitestats?">How do we provide your website stats?</h2>



<p>Every day, we compute statistics for several million websites. This is a specific requirement, and few solutions exist to fill it.</p>



<p>What are those needs?</p>



<ul class="wp-block-list"><li>Being able to compute the statistics of all the websites as fast as possible</li><li>Aggregating data to show anonymized results</li><li>Not embedding code or trackers on your site</li><li>Having an easy-to-use interface aligned with today&#8217;s standards</li><li>Giving you the ability to see your statistics at the subdomain level</li><li>Migrating your previous statistics from Urchin so as not to lose them</li></ul>



<p>For a long time, we tried to avoid the &#8220;not invented here&#8221; effect, because rebuilding a statistics tool is not our main job. So we tried a lot of solutions on the market, open source or not, free or licensed. And we did not find a solution able to scale to our quantity of logs and compute statistics for all the websites we host!</p>



<p>So, we decided to&nbsp;develop an&nbsp;alternative solution, and proposed it by default for everyone: OVHcloud Web Statistics (or OWStats).</p>



<h3 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-So,what'snew?">So, what&#8217;s new?</h3>



<p>A new user interface to quickly visualise the most relevant statistics.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="496" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-01-1024x496.png" alt="" class="wp-image-18369" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-1024x496.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-300x145.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-768x372.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-1536x743.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01.png 1818w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>You can find a few sections:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="184" height="300" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-02.png" alt="" class="wp-image-18370"/></figure></div>



<ul class="wp-block-list"><li><strong>Dashboard</strong>:&nbsp;A summary of the activities on your domain with the dashboard&nbsp;</li><li><strong>Browsers</strong>: More technical information relating to the various browsers and platforms used to visit your domain</li><li><strong>Geolocalization</strong>: Which country/region visits your domain (the data is anonymized,&nbsp;so this is only a high level overview)</li><li><strong>Requests</strong>:&nbsp;Overview of the most viewed pages</li><li><strong>Robots</strong>:&nbsp;Analysis of the bots visiting your domain</li><li><strong>Status</strong>: Status code evolution and which&nbsp;pages are raising errors and should be investigated.</li></ul>



<h3 class="wp-block-heading">It would be simpler if we showed some pictures, wouldn&#8217;t it?</h3>



<p>Well, of course it would! Here you go:</p>



<p>The dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="543" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-03-1024x543.png" alt="" class="wp-image-18371" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-1024x543.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-768x407.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-1536x814.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03.png 1571w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Geolocalization page:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-04-1024x576.png" alt="" class="wp-image-18372" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-1536x865.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04.png 1588w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And the status pages:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="476" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-05-1024x476.png" alt="" class="wp-image-18373" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-1024x476.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-768x357.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-1536x714.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05.png 1597w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Somenumbers">Some numbers</h2>



<p>Thanks to this new tool, we can&nbsp;<strong>compute your statistics up to 8x faster.</strong></p>



<p>We also process&nbsp;<strong>2.5 TB of logs per day!</strong></p>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Wanttoseemore?">Want more information?</h2>



<p>This post is an early preview of our upcoming OVHcloud Web Statistics service. We will come back to you with more posts about the technical details as the release date approaches!</p>


<div class="sabox-plus-item"></div><img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website%2F&amp;action_name=OVHcloud%20Web%20Statistics%3A%20A%20new%20statistics%20interface%20for%20your%20OVHcloud%20hosted%20website&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Jerem: An Agile Bot</title>
		<link>https://blog.ovhcloud.com/jerem-an-agile-bot/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 16:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Agile Telemetry]]></category>
		<category><![CDATA[Agility]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16943</guid>

					<description><![CDATA[At OVHCloud, we are open sourcing our “Agility Telemetry” project. Jerem, as our data collector, is the main component of this project. Jerem scrapes our JIRA at regular intervals, and extracts specific metrics for each project. It then forwards them to our long-time storage application, the OVHCloud Metrics Data Platform.&#160;&#160; Agility concepts from a developer&#8217;s [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fjerem-an-agile-bot%2F&amp;action_name=Jerem%3A%20An%20Agile%20Bot&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>At OVHCloud, we are open sourcing our “Agility Telemetry” project. <strong><a href="https://github.com/ovh/jerem" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jerem</a></strong>, as our data collector, is the main component of this project. Jerem scrapes our <strong>JIRA</strong> at regular intervals, and extracts <strong>specific metrics</strong> for each project. It then forwards them to our long-term storage application, the <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>.&nbsp;&nbsp;</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg" alt="Jerem: an agile bot" class="wp-image-17160" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Agility concepts from a developer&#8217;s point of view</h3>



<p>To help you understand our goals for <strong>Jerem</strong>, we need to explain some Agility concepts we will be using. First, we will establish a <strong>technical quarterly roadmap</strong> for a product, which sets out all <strong>features</strong> we <strong>plan to release</strong> every three months. This is what we call an <strong>epic</strong>.&nbsp;</p>



<p>For each epic, we identify the tasks that will need to be completed. We then evaluate the complexity of each of those tasks using <strong>story points</strong>, during a team preparation session. A story point reflects the effort required to complete the specific JIRA task. </p>



<p>Then, to advance our roadmap, we will conduct regular <strong>sprints</strong>, each corresponding to a period of <strong>ten days</strong>, during which the team will onboard several tasks. The number of story points taken into a sprint should match, or be close to, the <strong>team velocity</strong>; in other words, the average number of story points that the team is able to complete each day.</p>
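<p>To make the arithmetic concrete, here is a minimal sketch of how velocity and sprint capacity relate (the numbers are made up for illustration; this is not part of Jerem):</p>

```python
# Hypothetical sketch: deriving team velocity from past sprints.
# Velocity here is the average number of story points completed per day,
# matching the definition used in this post.
completed_points = [20, 18, 22]   # story points completed in the last three sprints
sprint_length_days = 10

velocity_per_day = sum(completed_points) / (len(completed_points) * sprint_length_days)

# Capacity of the next ten-day sprint: the number of story points the
# team should onboard to stay close to its velocity.
next_sprint_capacity = velocity_per_day * sprint_length_days

print(velocity_per_day)       # 2.0
print(next_sprint_capacity)   # 20.0
```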



<p>However, other urgent tasks may arise unexpectedly during sprints. That’s what we call an <strong>impediment</strong>. We might, for example, need to factor in helping customers, bug fixes, or urgent infrastructure tasks.&nbsp;&nbsp;&nbsp;</p>



<h3 class="wp-block-heading">How Jerem works </h3>



<p>At OVH we use JIRA to track our activity. Our <strong>Jerem</strong> bot scrapes our <strong>projects</strong> <strong>from</strong> <strong>JIRA</strong> and exports all the necessary data to our <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>. Jerem can also push data to any Warp 10-compatible database. In Grafana, you simply query the Metrics platform (using the <a href="https://github.com/ovh/ovh-warp10-datasource" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 datasource</a>), for example with our <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">program management dashboard</a>. All your KPIs are now available in a nice dashboard!</p>
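<p>As a rough illustration of this pipeline, here is a simplified Python sketch of the scrape-and-push loop (Jerem itself is written in Go; the helper names and the epic record are assumptions for the example, only the Warp 10 plain-text ingestion format <code>TIMESTAMP// class{labels} value</code> is real):</p>

```python
import time

# Illustrative sketch, not Jerem's actual code: turn scraped JIRA data
# into Warp 10 ingestion-format lines, ready to POST to /api/v0/update.

def to_warp10_line(ts_us, classname, labels, value):
    """Format one datapoint as 'TIMESTAMP// class{label=value,...} value'."""
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{ts_us}// {classname}{{{label_str}}} {value}"

def collect_epic_metrics(project, epics):
    """Emit one storypoint gauge per scraped epic (epic records are made up)."""
    now_us = int(time.time() * 1_000_000)  # Warp 10 timestamps are in microseconds
    return [
        to_warp10_line(now_us, "jerem.jira.epic.storypoint",
                       {"project": project, "epic": e["key"]}, e["storypoints"])
        for e in epics
    ]

lines = collect_epic_metrics("SAN", [{"key": "TELE-1", "storypoints": 14}])
print(lines[0])
```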



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="256" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg" alt="" class="wp-image-17164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-300x75.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-768x192.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1536x384.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45.jpeg 1720w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Discover Jerem metrics</h3>



<p>Now that we have an overview of the main Agility concepts involved, let&#8217;s dive into Jerem! How do we convert those Agility concepts into metrics? First of all, we&#8217;ll retrieve all metrics related to epics (i.e. new features). Then, we will have a deep look at the sprint metrics.</p>



<h4 class="wp-block-heading">Epic data</h4>



<p>To explain Jerem epic metrics, we&#8217;ll start by creating a new one. In this example, we called it <code>Agile Telemetry</code>. We add a Q2-20 label, which means that we plan to release it for Q2. To record an epic with Jerem, you need to set a quarter for the final delivery! Next, we&#8217;ll simply add four tasks, as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="651" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Epic-1-1024x651.png" alt="" class="wp-image-16984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-1024x651.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-300x191.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-768x489.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1.png 1182w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To get the metrics, we need to evaluate each individual task, which we do together during preparation sessions. In this example, we set story points on each task; for instance, we estimated the <code>write a BlogPost about Jerem</code> task as being a 3.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="472" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png" alt="" class="wp-image-16957" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1536x709.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10.png 1697w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a result, Jerem now has everything it needs to start collecting epic metrics. This example provides five metrics:</p>



<ul class="wp-block-list"><li><code>jerem.jira.epic.storypoint</code>: the total number of story points needed to complete this epic. The value here is 14 (the sum of all the epic story points). This metric will evolve whenever the epic is updated by adding or removing tasks.&nbsp;</li><li><code>jerem.jira.epic.storypoint.done</code>: the number of story points completed. In our example, we have already completed <code>Write Jerem bot</code> and <code>Deploy Jerem Bot</code>, so eight story points are done.</li><li><code>jerem.jira.epic.storypoint.inprogress</code>: the number of story points for &#8216;in progress&#8217; tasks, such as <code>Write a BlogPost about Jerem</code>.</li><li><code>jerem.jira.epic.unestimated</code>: the number of unestimated tasks, shown as <code>Unestimated Task</code> in our example.</li><li><code>jerem.jira.epic.dependency</code>: the number of tasks that carry dependency labels, indicating that they are mandatory for other epics or projects.</li></ul>
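<p>The mapping from JIRA tasks to these five gauges can be sketched as follows (an illustrative Python toy loosely mirroring the example above, not Jerem&#8217;s actual Go code; a fifth task is added purely to illustrate the dependency gauge):</p>

```python
# Illustrative sketch: deriving the five epic gauges from a list of JIRA
# tasks. Status values, fields and the extra "dependency" task are
# assumptions for the example.
tasks = [
    {"summary": "Write Jerem bot",              "points": 5,    "status": "Done",        "labels": []},
    {"summary": "Deploy Jerem Bot",             "points": 3,    "status": "Done",        "labels": []},
    {"summary": "Write a BlogPost about Jerem", "points": 3,    "status": "In Progress", "labels": []},
    {"summary": "Unestimated Task",             "points": None, "status": "To Do",       "labels": []},
    {"summary": "Hypothetical dependency task", "points": 3,    "status": "To Do",       "labels": ["dependency"]},
]

metrics = {
    "jerem.jira.epic.storypoint":            sum(t["points"] or 0 for t in tasks),
    "jerem.jira.epic.storypoint.done":       sum(t["points"] or 0 for t in tasks if t["status"] == "Done"),
    "jerem.jira.epic.storypoint.inprogress": sum(t["points"] or 0 for t in tasks if t["status"] == "In Progress"),
    "jerem.jira.epic.unestimated":           sum(1 for t in tasks if t["points"] is None),
    "jerem.jira.epic.dependency":            sum(1 for t in tasks if "dependency" in t["labels"]),
}

print(metrics["jerem.jira.epic.storypoint"])       # 14
print(metrics["jerem.jira.epic.storypoint.done"])  # 8
```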



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="443" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metris-epics-1024x443.png" alt="" class="wp-image-16958" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1024x443.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-300x130.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-768x332.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1536x665.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics.png 1784w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This way, for each epic in a project, Jerem collects five unique metrics.  </p>



<h4 class="wp-block-heading">Sprint data</h4>



<p>To complete epic tasks, we work using a <strong>sprint</strong> process. When doing sprints, we want to provide a lot of <strong>insights</strong> into our <strong>achievements</strong>. That&#8217;s why Jerem collects sprint data too! </p>



<p>So let&#8217;s open a new sprint in JIRA and start working on our task. This gives us the following JIRA view:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="241" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Sprint-ui-1024x241.png" alt="" class="wp-image-16963" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1024x241.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-300x71.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-768x181.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1536x362.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui.png 1804w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Jerem collects the following metrics for each sprint:&nbsp;</p>



<ul class="wp-block-list"><li><code>jerem.jira.sprint.storypoint.total</code>: the total number of story points onboarded into a sprint.</li><li><code>jerem.jira.sprint.storypoint.inprogress</code>: the story points currently in progress within a sprint.</li><li><code>jerem.jira.sprint.storypoint.done</code>: the number of story points currently completed within a sprint.</li><li><code>jerem.jira.sprint.events</code>: the &#8216;start&#8217; and &#8216;end&#8217; dates of sprint events, recorded as Warp10 string values.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png" alt="" class="wp-image-16964" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-300x141.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-768x360.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1536x720.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the Metrics view above, we record every sprint metric twice. We do this to provide a quick view of the active sprint, which is why we use the &#8216;current&#8217; label. This also enables us to query past sprints, using the real sprint name. Of course, an active sprint can also be queried using its name.</p>
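<p>This double recording can be sketched like this (an illustrative toy; Jerem&#8217;s actual Go code differs, and the function name is an assumption):</p>

```python
# Illustrative sketch: every sprint gauge is written twice, once under the
# real sprint name and once under the 'current' alias, so the active sprint
# can be queried without knowing its name, while past sprints stay
# addressable by name.
def record_sprint_gauge(classname, sprint_name, value):
    """Return the two labelled datapoints pushed for one sprint gauge."""
    return [
        (classname, {"sprint": sprint_name}, value),
        (classname, {"sprint": "current"}, value),
    ]

points = record_sprint_gauge("jerem.jira.sprint.storypoint.total", "SAN Sprint 1", 14)
for name, labels, value in points:
    print(name, labels["sprint"], value)
```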



<h4 class="wp-block-heading">Impediment data</h4>



<p>Starting a sprint means you need to know all the tasks you will have to work on over the next few days. But how can we track and measure unplanned tasks? For example, a very urgent one for your manager, or a teammate who needs a bit of help?</p>



<p>We can add special tickets in JIRA to keep track of those tasks. That&#8217;s what we call an &#8216;impediment&#8217;. They are labelled according to their nature. If, for example, production requires your attention, then it&#8217;s an &#8216;Infra&#8217; impediment. You will also retrieve metrics for the &#8216;Total&#8217; (all kinds of impediments), &#8216;Excess&#8217; (the unplanned tasks), &#8216;Support&#8217; (helping teammates), and &#8216;Bug fixes or other&#8217; (for all other kinds of impediment).</p>



<p>Each impediment belongs to the active sprint it was closed in. To close an impediment, you only have to flag it as &#8216;Done&#8217; or &#8216;Closed&#8217;.</p>



<p>We also retrieve metrics like:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.TYPE.count</code>: the number of impediments that occurred during a sprint.</li><li><code>jerem.jira.impediment.TYPE.timespent</code>: the amount of time spent on impediments during a sprint.</li></ul>



<p><code>TYPE</code> corresponds to the <strong>kind</strong> of recorded impediment. As we didn&#8217;t open any actual impediments, Jerem collects only the <code>total</code> metrics.</p>
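<p>The per-type aggregation can be sketched as follows (a Python toy with made-up impediment records, not Jerem&#8217;s actual Go code):</p>

```python
from collections import defaultdict

# Illustrative sketch: aggregating closed impediments into per-TYPE count
# and timespent gauges. The impediment records below are made up.
impediments = [
    {"type": "infra",   "timespent_h": 4},
    {"type": "support", "timespent_h": 1},
    {"type": "infra",   "timespent_h": 2},
]

count = defaultdict(int)
timespent = defaultdict(int)
for imp in impediments:
    count[imp["type"]] += 1
    timespent[imp["type"]] += imp["timespent_h"]
    # every impediment also feeds the 'total' series
    count["total"] += 1
    timespent["total"] += imp["timespent_h"]

print(count["infra"], timespent["infra"])   # 2 6
print(count["total"], timespent["total"])   # 3 7
```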



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png" alt="" class="wp-image-16965" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-768x325.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1536x650.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment.png 1773w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To start recording impediments, we simply create a new JIRA task, in which we add an &#8216;impediment&#8217; label. We also set its nature, and the actual time spent on it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="819" height="903" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png" alt="" class="wp-image-16967" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png 819w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-768x847.png 768w" sizes="auto, (max-width: 819px) 100vw, 819px" /></figure>



<p>For impediments, we also retrieve a global metric that Jerem always records:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.total.created</code>: the time spent from the creation date to complete an impediment. This allows us to retrieve a total impediment count. We can also record all impediment actions, even outside sprints.&nbsp;</li></ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>For a single JIRA project, like our example, you can expect around 300 metrics. This number might increase depending on the epics you create and flag in JIRA, and the ones you close.</p></blockquote>



<h3 class="wp-block-heading">Grafana dashboard</h3>



<p>We love building Grafana dashboards! They provide both the team and the manager with a lot of insight into KPIs. The best part for me, as a developer, is that I can see why it&#8217;s worth filling in a JIRA task!</p>



<p>In our first <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Grafana dashboard</a>, you will retrieve all the best program management KPIs. Let&#8217;s start with the global overview:</p>



<h4 class="wp-block-heading">Quarter data overview</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png" alt="" class="wp-image-16968" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox.png 1840w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here, you will find the current epics in progress. You will also find the global team KPIs, such as predictability, velocity, and impediment stats. This is where the magic happens! When filled correctly, this dashboard will show you exactly what your team should deliver in the current quarter. This means you have quick access to all important current subjects. You will also be able to see if your team is expected to deliver on too many subjects, so you can quickly take action and delay some of the new features.</p>



<h4 class="wp-block-heading">Active sprint data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/sprintdata-1024x286.png" alt="" class="wp-image-16969" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-768x214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1536x428.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata.png 1839w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The active sprint data panel is often a great support during our daily meetings. In this panel, we get a quick overview of the team&#8217;s achievements, and can establish the time spent on parallel tasks. </p>



<h4 class="wp-block-heading">Detailed data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/detail-KPI-1024x286.png" alt="" class="wp-image-16970" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-768x215.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1536x429.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI.png 1847w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The last part provides more detailed data. Using the epic Grafana variable, we can check specific epics, along with the completion of more global projects. You also have a <code>velocity chart</code>, which plots past sprints and compares the expected story points to the ones actually completed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The Grafana dashboard is directly available in the Jerem project. You can import it directly into Grafana, provided you have a valid Warp 10 datasource configured. </p><p>To make the dashboard work as required, you have to configure the Grafana project variable in the form of a WarpScript list <code>[ 'SAN' 'OTHER-PROJECT' ]</code>. If our program manager can do it, I am sure you can! 😉 </p></blockquote>



<p>Setting up Jerem and automatically loading program management data gives us a lot of insight. As a developer, I really enjoy it, and I&#8217;ve quickly become used to tracking a lot more events in JIRA than I did before. You can directly see the impact of your tasks. For example, you see how quickly the roadmap is advancing, and you become able to identify any bottlenecks that are causing impediments. Those bottlenecks then become epics. In other words, once we started using Jerem, it practically filled itself! I hope you will enjoy it too! If you have any feedback, we would love to hear it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fjerem-an-agile-bot%2F&amp;action_name=Jerem%3A%20An%20Agile%20Bot&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Contributing to Apache HBase: custom data balancing</title>
		<link>https://blog.ovhcloud.com/contributing-to-apache-hbase-custom-data-balancing/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Fri, 14 Feb 2020 16:37:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16524</guid>

					<description><![CDATA[In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring. The context Have you ever wondered how: we generate the graphs for your OVHcloud server or web hosting package? our internal teams monitor their [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontributing-to-apache-hbase-custom-data-balancing%2F&amp;action_name=Contributing%20to%20Apache%20HBase%3A%20custom%20data%20balancing&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg" alt="Contributing to Apache HBase: custom data balancing" class="wp-image-17086" width="1024" height="537" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">The context</h3>



<p>Have you ever wondered how:</p>



<ul class="wp-block-list"><li>we generate the graphs for your OVHcloud server or web hosting package? </li><li>our internal teams monitor their own servers and applications?</li></ul>



<p><strong>All internal teams are constantly gathering telemetry and monitoring data</strong> and sending them to a <strong>dedicated team,</strong> who are responsible for <strong>handling all the metrics and logs generated by OVHcloud&#8217;s infrastructure</strong>: the Observability team.</p>



<p>We tried a lot of different <strong>Time Series databases</strong>, and eventually chose <a href="https://warp10.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> to handle our workloads. <strong>Warp10</strong> can be integrated with the various <strong>big-data solutions</strong> provided by the <a href="https://www.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Foundation.</a> In our case, we use <a href="http://hbase.apache.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a> as the long-term storage datastore for our metrics. </p>



<p><a href="http://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, a datastore built on top of <a href="http://hadoop.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hadoop</a>, provides <strong>an elastic, distributed, key-ordered map.</strong> As such, one of the key features of Apache HBase for us is the ability to <strong>scan</strong>, i.e. retrieve a range of keys. Thanks to this feature, we can fetch <strong>thousands of datapoints in an optimised way</strong>.</p>
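<p>To see why scans matter, here is a toy illustration of the key-ordered-map property in plain Python (the key layout is purely illustrative, not Warp 10&#8217;s actual schema):</p>

```python
import bisect

# Toy illustration of the property the post relies on: a key-ordered map
# lets you fetch a whole range of keys (a "scan") in one ordered pass.
# The <class>#<timestamp> key layout below is made up for the example.
store = {
    "cpu.usage#1000": 0.42,
    "cpu.usage#2000": 0.57,
    "cpu.usage#3000": 0.61,
    "mem.usage#1000": 0.80,
}
keys = sorted(store)

def scan(start, stop):
    """Return (key, value) pairs for start <= key < stop, like an HBase scan."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(k, store[k]) for k in keys[lo:hi]]

print(scan("cpu.usage#1000", "cpu.usage#3000"))
# [('cpu.usage#1000', 0.42), ('cpu.usage#2000', 0.57)]
```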



<p>We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:</p>



<ul class="wp-block-list"><li>between 1.6 and 2 million writes per second, 24/7</li><li>between 4 and 6 million reads per second</li><li>around 300TB of telemetry, stored within Apache HBase</li></ul>



<p>As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding repartition, as <strong>every</strong> <strong>bit is hot data, and should be accessible at any time</strong>. Let&#8217;s dive in!</p>



<h3 class="wp-block-heading">How does balancing work in Apache HBase?</h3>



<p>Before diving into the balancer, let&#8217;s take a look at how it works. In Apache HBase, data is split into shards called <code>Regions</code>, which are distributed across <code>RegionServers</code>. The number of regions increases as data comes in, and regions are split as a result. This is where the <code>Balancer</code> comes in. It <strong>moves regions</strong> to avoid hotspotting a single <code>RegionServer</code> and to effectively distribute the load.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg" alt="" class="wp-image-17007" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The actual implementation, called <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StochasticBalancer</a>, uses <strong>a cost-based approach:</strong></p>



<ol class="wp-block-list"><li>It first computes the <strong>overall cost</strong> of the cluster, by looping through <code>cost functions</code>. Every cost function <strong>returns a number between 0 and 1 inclusive</strong>, where 0 is the lowest-cost, best solution, and 1 is the highest-cost, worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, and the number of regions per RegionServer&#8230; The computed costs are <strong>scaled by their respective coefficients, defined in the configuration</strong>. </li><li>Now that the initial cost is computed, we can try to <code>Mutate</code> our cluster. For this, the Balancer creates a random <code>nextAction</code>, which could be something like <strong>swapping two regions</strong>, or <strong>moving one region to another RegionServer</strong>. The action is <strong>applied virtually</strong>, and then the <strong>new cost is calculated</strong>. If the new cost is lower than the previous one, the action is stored. If not, it is skipped. This operation is repeated <code>thousands of times</code>, hence the <code>Stochastic</code>. </li><li>At the end,<strong> the list of valid actions is applied to the actual cluster. </strong></li></ol>
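<p>The three steps above can be sketched in a few lines (a toy Python illustration of the cost-based search, not HBase&#8217;s actual Java implementation; a single region-count-skew cost function stands in for HBase&#8217;s full weighted set):</p>

```python
import random

# Toy sketch of the StochasticBalancer loop: compute a normalised cost,
# apply random mutations virtually, and keep only those that lower it.

def skew_cost(counts):
    """0 when regions are evenly spread; 1 when they all sit on one server."""
    mean = sum(counts) / len(counts)
    worst = (sum(counts) - mean) ** 2 + (len(counts) - 1) * mean ** 2
    return sum((c - mean) ** 2 for c in counts) / worst

def balance(cluster, steps=10_000, seed=42):
    """Greedy stochastic descent over random single-region moves."""
    rng = random.Random(seed)
    cost = skew_cost(cluster)
    actions = []
    for _ in range(steps):
        src, dst = rng.sample(range(len(cluster)), 2)
        if cluster[src] == 0:
            continue
        cluster[src] -= 1          # apply the mutation virtually...
        cluster[dst] += 1
        new_cost = skew_cost(cluster)
        if new_cost < cost:        # ...keep it if the cluster got cheaper
            cost = new_cost
            actions.append((src, dst))
        else:                      # ...otherwise roll it back
            cluster[src] += 1
            cluster[dst] -= 1
    return cluster, actions

balanced, actions = balance([12, 3, 0, 1])
print(balanced)  # [4, 4, 4, 4]
```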



<h3 class="wp-block-heading">What was not working for us?</h3>



<p>We found out that <strong>for our specific use case</strong>, which involved:</p>



<ul class="wp-block-list"><li>Single table</li><li>Dedicated Apache HBase and Apache Hadoop, <strong>tailored for our requirements</strong></li><li>Good key distribution</li></ul>



<p>&#8230; <strong>the number of regions per RegionServer was the real limit for us</strong>.</p>



<p>Even if the balancing strategy seems simple, <strong>we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital</strong>, especially in cloud environments, because you <strong>may not be able to buy the same server specs again in the future.</strong> In our earlier example, our cluster grew from 80 to ~250 machines in four years. Throughout that time, we bought new dedicated server references, and even tested some special internal references.</p>



<p>We ended up with different groups of hardware: <strong>some servers can handle only 180 regions, whereas the biggest can handle more than 900</strong>. Because of this disparity, we had to disable the Load Balancer to avoid the <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L1194" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">RegionCountSkewCostFunction</a>, which would try to bring all RegionServers to the same number of regions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg" alt="RegionCountSkewCostFunction balancing" class="wp-image-17010" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Two years ago we developed some internal tools responsible for load balancing regions across RegionServers. The tooling worked really well for our use case, simplifying the day-to-day operation of our cluster.</p>



<p><strong>Open source is in the DNA of OVHcloud</strong>: we build our tools on open source software, but we also <strong>contribute</strong> and give back to the community. When we talked to others, we saw that we weren&#8217;t the only ones concerned by the heterogeneous cluster problem. We decided to rewrite our tooling to make it more general, and to <strong>contribute</strong> it <strong>directly upstream</strong> to the HBase project.</p>



<h3 class="wp-block-heading">Our contributions</h3>



<p>The first contribution was pretty simple, the cost function list was a <a href="https://github.com/apache/hbase/blob/8cb531f207b9f9f51ab1509655ae59701b66ac37/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L199-L213" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">constant</a>. We <a href="https://github.com/apache/hbase/commit/836f26976e1ad8b35d778c563067ed0614c026e9" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">added the possibility to load custom cost functions</a>.</p>



<p>The second contribution was about <a href="https://github.com/apache/hbase/commit/42d535a57a75b58f585b48df9af9c966e6c7e46a" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">adding an optional costFunction to balance regions according to a capacity rule</a>.</p>



<h3 class="wp-block-heading">How does it work?</h3>



<p>The balancer will load a file containing lines of rules. <strong>A rule is composed of a regexp for hostname, and a limit.</strong> For example, we could have:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>RegionServers with <strong>hostnames matching the first rule will have a limit of 200</strong>, and <strong>the others 50</strong>. If there&#8217;s no match, a default limit is applied.</p>
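<p>Rule resolution can be sketched in a few lines (illustrative Python, not the actual HBase implementation; it assumes full-string matching so that <code>rs12</code> is caught by the second rule rather than the first, and the default limit of 100 is hypothetical):</p>

```python
import re

# Capacity rules as described above: first matching rule wins.
RULES = [(re.compile(r"rs[0-9]"), 200), (re.compile(r"rs1[0-9]"), 50)]
DEFAULT_LIMIT = 100  # hypothetical default, used when no rule matches

def limit_for(hostname):
    for pattern, limit in RULES:
        if pattern.fullmatch(hostname):   # match against the whole hostname
            return limit
    return DEFAULT_LIMIT

print(limit_for("rs5"), limit_for("rs12"), limit_for("db7"))
```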



<p>Thanks to these rules, we have two key pieces of information:</p>



<ul class="wp-block-list"><li>the<strong> maximum number of regions for this cluster</strong></li><li>the<strong> rule for each server</strong></li></ul>



<p>The <code>HeterogeneousRegionCountCostFunction</code> will try to <strong>balance regions, according to their capacity.</strong></p>



<p>Let&#8217;s take an example&#8230; Imagine that we have 20 RS:</p>



<ul class="wp-block-list"><li>10 RS, named <code>rs0</code> to <code>rs9</code>, loaded with 60 regions each, which can each handle 200 regions.</li><li>10 RS, named <code>rs10</code> to <code>rs19</code>, loaded with 60 regions each, which can each handle 50 regions.</li></ul>



<p>So, based on the following rules:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>&#8230; we can see that the <strong>second group is overloaded</strong>, whereas the first group has plenty of space.</p>



<p>We know that we can handle a maximum of <strong>2,500 regions</strong> (200&#215;10 + 50&#215;10), and that we currently have <strong>1,200 regions</strong> (60&#215;20). As such, the <code>HeterogeneousRegionCountCostFunction</code> will understand that the cluster is <strong>48% full</strong> (1200/2500). Based on this information, the Balancer will then <strong>try to put all the RegionServers at ~48% of their capacity, according to the rules.</strong></p>
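<p>The arithmetic above can be checked with a short sketch that reproduces the example's numbers (not the balancer code itself):</p>

```python
# The 20-RegionServer example above: 10 big servers and 10 small ones,
# all currently holding 60 regions each.
capacity = {f"rs{i}": (200 if i < 10 else 50) for i in range(20)}
regions = {rs: 60 for rs in capacity}

total_capacity = sum(capacity.values())   # 200*10 + 50*10 = 2500
total_regions = sum(regions.values())     # 60*20 = 1200
fill = total_regions / total_capacity     # 0.48 -> cluster is 48% full

# Target: bring every RegionServer to ~48% of its own capacity.
targets = {rs: round(cap * fill) for rs, cap in capacity.items()}
print(f"{fill:.0%} -> rs0: {targets['rs0']}, rs15: {targets['rs15']}")
```

<p>A big server ends up with a target of 96 regions and a small one with 24: both sit at the same 48% relative load, and the per-server targets still add up to the 1,200 regions the cluster holds.</p>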



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg" alt="HeterogeneousRegionCountCostFunction balancing" class="wp-image-17084" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Where to next?</h3>



<p>Thanks to Apache HBase&#8217;s contributors, our patches are now <strong>merged</strong> into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This <strong>will allow more automation on our side, and ease operations for the Observability Team.</strong></p>



<p>Contributing was an awesome journey. What I love most about open source is the opportunity to contribute back, and build stronger software. We <strong>had an opinion</strong> about how a particular issue should be addressed, but <strong>the discussions with the community helped us to refine it</strong>. We spoke with <strong>engineers from other companies, who were struggling with Apache HBase&#8217;s cloud deployments, just as we were</strong>, and thanks to those exchanges, <strong>our contribution became more and more relevant.</strong></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL (or how to query time series databases)</title>
		<link>https://blog.ovhcloud.com/tsl-or-how-to-query-time-series-databases/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 31 Jan 2020 13:41:34 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16734</guid>

					<description><![CDATA[Last year, we released TSL as an open source tool to query a Warp 10 platform, and by extension, the OVHcloud Metrics Data Platform. But how has it evolved since then? Is TSL ready to query other time series databases? What about TSL states on the Warp10 eco-system? TSL to query many time series databases [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Last year, we released <a href="https://github.com/ovh/tsl/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a> as an <strong>open source tool</strong> to <strong>query</strong> a<strong> Warp 10</strong> platform, and by extension, the <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude"><strong>OVHcloud Metrics Data Platform</strong></a>. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg" alt="TSL by OVHcloud" class="wp-image-16774" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-768x405.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A.jpeg 1202w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how has it evolved since then? Is TSL ready to query <strong>other time series databases</strong>? And what is the state of TSL within the <strong>Warp 10 ecosystem</strong>?</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">TSL to query many time series databases</h3>



<p>We wanted TSL to be usable in front of <strong>multiple time series databases</strong>. That&#8217;s why we also released a <strong>PromQL query generator</strong>.</p>



<p>One year later, we now know this wasn&#8217;t the way to go. Based on what we learned, the <strong><a href="https://github.com/aurrelhebert/TSL-Adaptor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor</a> project</strong> was open sourced, as a proof of concept for how TSL can be used  to query a <em>non-Warp 10</em> database. Put simply, TSL-Adaptor allows TSL to <strong>query an InfluxDB</strong>.</p>



<h4 class="wp-block-heading">What is TSL-Adaptor?</h4>



<p>TSL-Adaptor is a <strong><a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> Java REST API</strong> that can be used to query a backend. TSL-Adaptor parses the TSL query, identifies the fetch operation, and natively loads the raw data from the backend. It then computes the TSL operations on the data, before returning the result to the user. The main goal of TSL-Adaptor is <strong>to make TSL available</strong> on top of <strong>other TSDBs</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="971" height="702" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png" alt="" class="wp-image-16866" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png 971w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-300x217.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-768x555.png 768w" sizes="auto, (max-width: 971px) 100vw, 971px" /></figure></div>



<p>In concrete terms, we are running a Java REST API that <strong>integrates the WarpScript library</strong> in its runtime. TSL is then used to compile the query into a valid WarpScript one. This is <strong>fully transparent</strong>: on the user&#8217;s side, you only deal with TSL queries. </p>



<p>To load raw data from the InfluxDB, we created a WarpScript extension. This extension integrates an abstract class <code>LOADSOURCERAW</code> that needs to be implemented to create a TSL-Adaptor data source. This requires only two methods: <code>find</code> and <code>fetch</code>. <code>Find</code> gathers all series selectors matching a query (class names, tags or labels), while <code>fetch</code> actually retrieves the raw data within a time span.</p>
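<p>A rough, self-contained illustration of that two-method contract might look as follows. The shapes here are hypothetical and chosen for clarity; the real <code>LOADSOURCERAW</code> is a Java abstract class whose exact signatures differ:</p>

```python
import fnmatch

class InMemorySource:
    """Toy data source exposing the two-method contract described above:
    `find` returns the series selectors matching a query, `fetch` returns
    the raw points of one series within a time span."""

    def __init__(self, series):
        # {(class_name, labels): [(timestamp, value), ...]}
        self.series = series

    def find(self, name_pattern):
        return [key for key in self.series
                if fnmatch.fnmatch(key[0], name_pattern)]

    def fetch(self, selector, start, end):
        return [(ts, v) for ts, v in self.series[selector]
                if start <= ts <= end]

src = InMemorySource({
    ("disk", frozenset({"mode=rw"})): [(0, 1.0), (60, 2.0), (120, 3.0)],
    ("cpu", frozenset()): [(0, 10.0)],
})
selector = src.find("disk*")[0]
print(selector[0], src.fetch(selector, 30, 120))
```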



<h4 class="wp-block-heading">Query Influx with TSL-Adaptor</h4>



<p>To get started, simply run an <a href="https://www.influxdata.com/products/influxdb-overview/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">InfluxDB</a> locally on port 8086. Then, let&#8217;s start an influx <a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Telegraf</a> agent and record Telegraf data on the local influx instance.</p>



<p>Next, make sure you have locally installed TSL-Adaptor and updated its config with the path to a <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>tsl.so</code> library</a>.</p>



<p>To specify a custom influx address or databases, update the <a href="https://github.com/aurrelhebert/TSL-Adaptor/blob/master/src/main/resources/application.properties" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor configuration</a> file accordingly.</p>



<p>You can start TSL-Adaptor with the following example command:</p>



<pre class="wp-block-code"><code class="">java -XX:TieredStopAtLevel=1 -Xverify:none -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar build/tsl-adaptor-0.0.1-SNAPSHOT-runner.jar </code></pre>



<p>And that&#8217;s it! You can now query your influx database with TSL and TSL-Adaptor.</p>



<p>Let&#8217;s start with the retrieval of the time series relating to the disk measurements.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk")'</code></pre>



<p>Now let&#8217;s use the TSL analytics power! </p>



<p>First, we would like to retrieve only the data containing a mode set to <code>rw</code>.&nbsp;</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw")'</code></pre>



<p>We would like to retrieve the maximum value at every five-minute interval, for the last 20 minutes. The TSL query will therefore be:</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw").last(20m).sampleBy(5m,max)'</code></pre>



<p>Now it&#8217;s your turn to have some fun with TSL and InfluxDB. You can find details of all the implemented functions in the <a href="https://github.com/ovh/tsl/blob/master/spec/doc.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL documentation</a>. Enjoy exploring!</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">What&#8217;s new on TSL with Warp10?</h3>



<p>We originally built TSL as a <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GO proxy</a> in front of Warp 10. <strong>TSL</strong> has now been integrated into the Warp 10 ecosystem, as a <strong>Warp 10 extension</strong>, or as a <strong>WASM library</strong>! We have also added some <strong>new native TSL functions</strong> to make the language even richer!</p>



<h4 class="wp-block-heading">TSL as a WarpScript function</h4>



<p>To make TSL work as a Warp 10 function, you need to have the <code>tsl.so</code> library available locally. This library can be found in the <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL GitHub repository</a>. We have also made a <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL WarpScript extension</a> available from <a href="https://warpfleet.senx.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>, the extension repository of the Warp 10 community. </p>



<p>To set up the TSL extension on your Warp 10, simply download the JAR indicated in <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>. You can then configure the extension in the Warp 10 configuration file: </p>



<pre class="wp-block-code"><code class="">warpscript.extensions = io.ovh.metrics.warp10.script.ext.tsl.TSLWarpScriptExtension
warpscript.tsl.libso.path = &lt;PATH_TO_THE_tsl.so_FILE></code></pre>



<p>Once you reboot Warp 10, you are ready to go. You can test if it&#8217;s working by running the following query:</p>



<pre class="wp-block-code"><code class="">// You will need to put here a valid Warp10 token when computing a TSL select statement
// '&lt;A_VALID_WARP_10_TOKEN>' 

// A valid TOKEN isn't needed on the create series statement in this example
// You can simply put an empty string
''

// Native TSL create series statement
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])) 
'>
TSL</code></pre>



<p>With the WarpScript TSL function, you can use native WarpScript variables in your script, as shown in the example below:</p>



<pre class="wp-block-code"><code class="">// Set a Warp10 variable

NOW 'test' STORE

'' 

// A Warp10 variable can be reused in TSL script as a native TSL variable
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])).add(test)
'>
TSL</code></pre>



<h4 class="wp-block-heading">TSL WASM</h4>



<p>To expand TSL&#8217;s potential uses, we have also exported it as a Wasm library, so you can use it directly in a browser! The Wasm version of the library parses TSL queries locally and generates the WarpScript. The result can then be used to query a Warp 10 backend. You will find more details on the <a href="https://github.com/ovh/tsl#use-tsl-with-webassembly" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL github</a>.</p>



<h4 class="wp-block-heading">TSL&#8217;s new features</h4>



<p>As TSL has grown in popularity, we have detected and fixed a few bugs, and also added some additional native functions to accommodate new use cases.</p>



<p>We added the <code>setLabelFromName</code> method, to set a new label on a series, based on its name. This label can be the exact series name, or the result of a regular expression. </p>



<p>We also completed the <code>sort</code> method, to allow users to sort their series set based on series metadata (i.e. selector, name or labels).</p>



<p>Finally, we added a <code>filterWithoutLabels</code> function, to filter a series set and remove any series that do not contain specific labels.</p>



<p>Thanks for reading! I hope you will give TSL a try, as I would love to hear your feedback.  </p>



<hr class="wp-block-separator"/>



<h2 class="wp-block-heading">Paris Time Series meetup</h2>



<p>We are delighted to soon be <strong>hosting</strong> the third <strong><a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series meetup</a></strong>, organised by Nicolas Steinmetz, at the <strong>OVHcloud office</strong> in Paris. During this meetup, we will be speaking about TSL, as well as listening to an introduction to the Redis Time Series platform.</p>



<p>If you are available, we will be happy to meet you there!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Alerting based on IPMI data collection</title>
		<link>https://blog.ovhcloud.com/alerting-based-on-ipmi-data-collection/</link>
		
		<dc:creator><![CDATA[Morvan Le Goff]]></dc:creator>
		<pubDate>Fri, 10 May 2019 13:56:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Data Collection]]></category>
		<category><![CDATA[IPMI]]></category>
		<category><![CDATA[Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14974</guid>

					<description><![CDATA[The problem to solve&#8230; How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them&#160;– this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">The problem to solve&#8230;</h2>



<p>How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them&nbsp;– this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in order to improve the quality of service delivered to our customers.</p>



<p>We began by splitting the problem into four general steps:</p>



<ul class="wp-block-list"><li>Data collection</li><li>Data storage</li><li>Data analytics</li><li>Visualisation/actions</li></ul>



<h2 class="wp-block-heading">Data collection</h2>



<p>How did we collect massive amounts of server health data, in a non-intrusive way, within short time intervals?</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg" alt="" class="wp-image-15455" width="768" height="500" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-300x195.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-768x500.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1200x782.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D.jpeg 1725w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Which data to collect?</h3>



<p>On modern servers, a BMC (Baseboard Management Controller) allows us to control firmware updates, reboots, etc. This controller is independent of the system running on the server. In addition, the BMC gives us access to sensors for all the motherboard components through an I2C bus. The protocol used to communicate with the BMC is the IPMI protocol, which is accessible via LAN (RMCP).</p>



<h4 class="wp-block-heading">What is IPMI?</h4>



<ul class="wp-block-list"><li>Intelligent Platform Management Interface.</li><li>Management and monitoring capabilities independently of the host’s OS.</li><li>Led by INTEL, first published in 1998.</li><li>Supported by more than 200 computer system vendors such as Cisco, DELL, HP, Intel, SuperMicro…</li></ul>



<h4 class="wp-block-heading">Why use IPMI?</h4>



<ul class="wp-block-list"><li>Access to hardware sensors (cpu temp, memory temp, chassis status, power, etc.).</li><li>No dependency on the OS (i.e. an agentless solution)</li><li>IPMI functions accessible after OS/system failure</li><li>Restricted access to IPMI functionalities via user privileges</li></ul>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg" alt="IPMI-poller node" class="wp-image-15456" width="768" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-300x156.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-768x400.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1200x625.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869.jpeg 2000w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Multi-source data collection</h3>



<p>We needed a scalable and responsive multi-source data collection tool to grab the IPMI data of about 400k servers at fixed intervals.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png" alt="Akka" class="wp-image-15467" width="200" height="71" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png 352w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4-300x106.png 300w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></div>



<p>We decided to build our IPMI data collector on the <a href="https://github.com/akka/akka" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Akka</a> framework. Akka is an open-source toolkit and runtime, simplifying the construction of concurrent and distributed applications on the JVM.</p>



<p>The Akka framework defines an abstraction built on top of threads, called an &#8216;actor&#8217;. An actor is an entity that handles messages. This abstraction eases the creation of multi-threaded applications, with no need to fight against deadlocks. By selecting the dispatcher policy for a group of actors, you can fine-tune your application to be fully reactive and adaptable to the load. This way, we were able to design an efficient data collector that could adapt to the load, as we intended to grab each sensor value every minute.</p>
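<p>The actor principle can be illustrated with a minimal mailbox-per-actor sketch (plain Python, not Akka; Akka adds dispatchers, supervision and clustering on top of this idea, and names like <code>poll_sensor</code> are illustrative):</p>

```python
import queue, threading

class Actor:
    """Minimal actor: a mailbox plus a single worker thread. The handler
    never runs concurrently with itself, so no locks are needed inside it."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self._handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, message):          # asynchronous, never blocks the sender
        self.mailbox.put(message)

    def _run(self):
        while True:
            self._handler(self.mailbox.get())

results = []
done = threading.Event()

def poll_sensor(msg):
    if msg == "stop":
        done.set()
    else:
        results.append(msg * 2)       # stand-in for one IPMI sensor read

actor = Actor(poll_sensor)
for reading in (1, 2, 3):
    actor.tell(reading)
actor.tell("stop")
done.wait(timeout=2)
print(results)
```

<p>Because each actor processes its mailbox sequentially, messages are handled in order without any explicit synchronisation in the handler itself.</p>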



<p>In addition, the cluster architecture provided by the framework allowed us to handle all the servers in a datacentre with a single cluster. The cluster architecture also helped us to design a resilient system: if a node of the cluster crashes or becomes too slow, it is automatically restarted, and the servers monitored by the failing node are handled by the remaining, valid nodes of the cluster.</p>



<p>On top of the cluster architecture, we implemented a quorum feature, which takes down the whole cluster if the minimal number of started nodes is not reached. This also lets us solve the split-brain problem: if the connection between nodes is broken, the cluster is split into two entities, and the one that does not reach the quorum is automatically shut down.</p>
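<p>Assuming a strict-majority quorum (one common way to configure the minimal node count), the survival rule can be sketched as:</p>

```python
def partition_survives(partition_size, cluster_size):
    """A partition keeps running only if it holds a strict majority."""
    return partition_size >= cluster_size // 2 + 1

# A 5-node cluster split into 3 + 2: only the majority side stays up,
# so the two sides can never both act on the same servers.
print(partition_survives(3, 5), partition_survives(2, 5))
```

<p>At most one partition can hold a strict majority, which is exactly why the quorum resolves split-brain: the minority side shuts itself down.</p>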



<p>A REST API is defined to communicate with the data collector in two ways:</p>



<ul class="wp-block-list"><li>To send the configurations</li><li>To get information on the monitored servers </li></ul>



<p>A cluster node runs on one JVM, and we are able to launch one or more nodes on a dedicated server. Each dedicated server used in the cluster is placed in an OVH vRack. An IPMI gateway pool is used to access the BMC of each server, with the communication between the gateway and the IPMI data collector secured by IPsec connections.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg" alt="IPMI-poller clustering" class="wp-image-15457" width="512" height="436" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-300x256.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-768x654.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1200x1022.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349.jpeg 1491w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Data storage</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png" alt="OVH Metrics" class="wp-image-15470" width="199" height="179" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png 409w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631-300x268.png 300w" sizes="auto, (max-width: 199px) 100vw, 199px" /></figure></div>



<p>Of course, we use the OVH Metrics service for data storage! Before storing the data, the IPMI data collector unifies the metrics by qualifying each sensor. The final metric name is defined by the entity the sensor belongs to and the base unit of the value. This eases the post-processing and the data visualisation/comparison.</p>



<p>Each datacentre IPMI collector pushes its data to a Metrics live cache server with a limited persistence time. All important information is persisted in the OVH Metrics server.</p>



<h2 class="wp-block-heading">Data analytics</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png" alt="Warp 10" class="wp-image-15468" width="201" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D.png 450w" sizes="auto, (max-width: 201px) 100vw, 201px" /></figure></div>



<p>We store our metrics in <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a>. Warp 10 comes with a time series scripting language, WarpScript, which makes it easy to manipulate and post-process (on the server side) our collected data.</p>



<p>We have defined three levels of analysis to monitor the health of the servers:</p>



<ul class="wp-block-list">
<li>A simple threshold per server metric.</li>
<li>Using the OVH Metrics loops service, we aggregate data per rack and per room and compute a mean. We set a threshold on this mean, which lets us detect rack- or room-wide failures in the cooling or power supply systems.</li>
<li>The OVH MLS service performs anomaly detection on the racks and rooms by forecasting the likely evolution of each metric from its past values. If the actual value falls outside this forecast envelope, an anomaly is raised.</li>
</ul>
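<p>The first two levels can be sketched as follows (the thresholds, server names and data layout are illustrative, not the production implementation, and the MLS forecasting level is omitted):</p>

```python
from statistics import mean

TEMP_THRESHOLD_C = 75.0       # illustrative per-server threshold
RACK_MEAN_THRESHOLD_C = 40.0  # illustrative per-rack mean threshold

def server_alerts(samples: dict[str, float]) -> list[str]:
    """Level 1: flag every server whose metric crosses a fixed threshold."""
    return [srv for srv, temp in samples.items() if temp > TEMP_THRESHOLD_C]

def rack_alerts(racks: dict[str, dict[str, float]]) -> list[str]:
    """Level 2: aggregate per rack and flag racks whose mean is abnormal,
    which hints at a shared cooling or power supply issue."""
    return [rack for rack, samples in racks.items()
            if mean(samples.values()) > RACK_MEAN_THRESHOLD_C]

racks = {
    # Every server is below its own threshold, but the rack mean (41.0)
    # is high: a rack-level issue, not a single faulty server.
    "rack-a": {"srv1": 38.0, "srv2": 41.0, "srv3": 44.0},
    # One hot server (76.0) in an otherwise cool rack (mean 39.0).
    "rack-b": {"srv4": 20.0, "srv5": 76.0, "srv6": 21.0},
}
all_samples = {s: t for r in racks.values() for s, t in r.items()}
```

<p>The point of layering the two checks is visible in the example: the per-server rule catches <code>srv5</code>, while the per-rack mean catches <code>rack-a</code>, whose individual servers all look healthy in isolation.</p>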



<h2 class="wp-block-heading">Visualisation/actions</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/F8551D2C-5386-4754-912B-2A0C0F278684-150x150.png" alt="TAT" class="wp-image-15472" width="100" height="100"/></figure></div>



<p>All the alerts generated by the data analysis are pushed to <a href="https://github.com/ovh/tat" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TAT</a>, an OVH tool we use to handle the alerting flow.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/47315BB4-8989-46AB-8882-25E804AFBFC1.png" alt="Grafana" class="wp-image-15473" width="150" height="124"/></figure></div>



<p>Grafana is used to monitor the metrics. We have dashboards to visualise the metrics and aggregations for each rack and room, the detected anomalies, and the evolution of open alerts.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="163" src="/blog/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png" alt="" class="wp-image-14985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-768x417.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1024x556.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1200x651.png 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
