<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Omni Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/omni/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/omni/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 12 Jul 2019 09:22:55 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Omni Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/omni/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Handling OVH&#8217;s alerts with Apache Flink</title>
		<link>https://blog.ovhcloud.com/handling-ovhs-alerts-with-apache-flink/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Thu, 31 Jan 2019 09:01:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Omni]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14337</guid>

					<description><![CDATA[OVH relies extensively on metrics to effectively monitor its entire stack. Whether they are low-level or business centric, they allow teams to gain insight into how our services are operating on a daily basis. The need to store millions of datapoints per second led us to create a dedicated team to build and operate a product to handle that load: Metrics Data Platform. By relying on Apache HBase, Apache Kafka and Warp 10, we succeeded in creating a fully distributed platform that handles all our metrics... and yours!

After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: alerting.
]]></description>
										<content:encoded><![CDATA[
<p>OVH relies extensively on <strong>metrics</strong> to effectively monitor its entire stack. Whether they are <strong>low-level</strong> or <strong>business</strong> centric, they allow teams to gain <strong>insight</strong> into how our services are operating on a daily basis. The need to store <strong>millions of datapoints per second</strong> led us to create a dedicated team to build and operate a product to handle that load: <strong><a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a>.</strong> By relying on <strong><a href="https://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, <a href="https://kafka.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Kafka</a></strong> and <a href="https://www.warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Warp 10</strong></a>, we succeeded in creating a fully distributed platform that handles all our metrics&#8230; and yours!</p>



<p>After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: <strong>alerting</strong>.</p>



<figure class="wp-block-image"><img fetchpriority="high" decoding="async" width="885" height="290" src="https://www.ovh.com/blog/wp-content/uploads/2019/01/001-1.png" alt="OVH &amp; Apache Flink" class="wp-image-14367" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-300x98.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-768x252.png 768w" sizes="(max-width: 885px) 100vw, 885px" /></figure>



<h3 class="wp-block-heading" id="6c01">Meet OMNI, our alerting&nbsp;layer</h3>



<p>OMNI is our code name for a&nbsp;<strong>fully distributed</strong>,&nbsp;<strong>as-code</strong>,&nbsp;<strong>alerting</strong>&nbsp;system that we developed on top of Metrics. It is split into two components:</p>



<ul class="wp-block-list"><li><strong>The management part</strong>, which takes your alert definitions from a Git repository and represents them as continuous queries,</li><li><strong>The query executor</strong>, which schedules your queries in a distributed way.</li></ul>



<p>The query executor pushes the query results into Kafka, ready to be handled! We then need to perform all the tasks expected of an alerting system:</p>



<ul class="wp-block-list"><li>Handling alert&nbsp;<strong>deduplication</strong>&nbsp;and&nbsp;<strong>grouping</strong>, to avoid&nbsp;<a href="https://en.wikipedia.org/wiki/Alarm_fatigue" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">alert fatigue</a>.</li><li>Handling&nbsp;<strong>escalation</strong>&nbsp;steps,&nbsp;<strong>acknowledgement</strong>&nbsp;or&nbsp;<strong>snooze</strong>.</li><li><strong>Notifying</strong>&nbsp;the end user through different&nbsp;<strong>channels</strong>: SMS, mail, push notifications,&nbsp;…</li></ul>
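


<p>To make deduplication concrete, here is a minimal sketch in plain Java. It is not Beacon’s actual code: the AlertDeduplicator class, its shouldNotify method and the time window are hypothetical names and parameters chosen for illustration. A repeated alert with the same key is silenced until the window has elapsed.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: suppresses repeated notifications for the same alert key. */
final class AlertDeduplicator {
    private final long windowMillis;
    // Time of the last notification that was let through, per alert key.
    private final Map&lt;String, Long> lastNotified = new HashMap&lt;>();

    AlertDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if a notification should be sent for this key at this time. */
    boolean shouldNotify(String alertKey, long nowMillis) {
        Long last = lastNotified.get(alertKey);
        if (last == null || nowMillis - last >= windowMillis) {
            // First occurrence, or the window has elapsed: notify and remember when.
            lastNotified.put(alertKey, nowMillis);
            return true;
        }
        // Duplicate inside the window: stay silent.
        return false;
    }
}</code></pre>



<p>With a one-minute window, a second alert for the same key after 30 seconds is dropped, while an alert for another key still goes through.</p>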



<p>To handle that, we looked at open-source projects such as&nbsp;<a href="https://github.com/prometheus/alertmanager" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus AlertManager</a>&nbsp;and&nbsp;<a href="https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">LinkedIn Iris</a>, and discovered the&nbsp;<em>hidden</em>&nbsp;truth:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Handling alerts as streams of data,<br>moving from one operator to&nbsp;another.</p></blockquote>



<p>We embraced it, and decided to leverage <a href="https://flink.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Flink</a> to create&nbsp;<strong>Beacon</strong>. In the next section we are going to describe the architecture of Beacon, and how we built and operate it.</p>



<p>If you want more information on Apache Flink, we suggest reading the introductory article on the official website:&nbsp;<a href="https://flink.apache.org/flink-architecture.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">What is Apache Flink?</a></p>



<h3 class="wp-block-heading" id="6caa">Beacon architecture</h3>



<p>


At its core, Beacon reads events from&nbsp;<strong>Kafka</strong>. Everything is represented as a&nbsp;<strong>message</strong>, from alerts to aggregation rules, snooze orders and so on. The pipeline is divided into two branches:



</p>



<ul class="wp-block-list"><li>One that runs the&nbsp;<strong>aggregations</strong> and triggers notifications based on customers’ rules.</li><li>One that handles the&nbsp;<strong>escalation steps</strong>.</li></ul>



<p>Then everything is merged to&nbsp;<strong>generate</strong>&nbsp;<strong>a</strong>&nbsp;<strong>notification</strong>, which is going to be forwarded to the right person. The notification message is pushed into Kafka, where it is consumed by another component called&nbsp;<strong>beacon-notifier.</strong></p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" width="885" height="470" src="/blog/wp-content/uploads/2019/01/002.png" alt="Beacon architecture" class="wp-image-14349" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/002.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-768x408.png 768w" sizes="(max-width: 885px) 100vw, 885px" /></figure></div>



<p>If you are new to streaming architecture, I recommend reading the&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/programming-model.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dataflow Programming Model</a>&nbsp;page from the official Flink documentation.</p>



<div class="wp-block-image"><figure class="aligncenter"><img decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/01/003.png" alt="Handling state" class="wp-image-14350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/003.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-768x535.png 768w" sizes="(max-width: 885px) 100vw, 885px" /></figure></div>



<p>Everything is merged into a DataStream,&nbsp;<strong>partitioned</strong>&nbsp;(<a href="https://medium.com/r/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.7%2Fdev%2Fstream%2Fstate%2Fstate.html%23keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">keyed by</a>&nbsp;in the Flink API) by users. Here&#8217;s an example:</p>



<pre class="wp-block-code"><code lang="java" class="language-java">final DataStream&lt;Tuple4&lt;PlanIdentifier, Alert, Plan, Operation>> alertStream =

  // Partitioning Stream per AlertIdentifier
  cleanedAlertsStream.keyBy(0)
  // Applying a Map Operation which is setting since when an alert is triggered
  .map(new SetSinceOnSelector())
  .name("setting-since-on-selector").uid("setting-since-on-selector")

  // Partitioning again Stream per AlertIdentifier
  .keyBy(0)
  // Applying another Map Operation which is setting State and Trend
  .map(new SetStateAndTrend())
  .name("setting-state").uid("setting-state");</code></pre>



<ul class="wp-block-list"><li><strong>SetSinceOnSelector</strong>, which sets&nbsp;<strong>since</strong>&nbsp;when the alert has been triggered</li><li><strong>SetStateAndTrend</strong>, which sets the&nbsp;<strong>state</strong>&nbsp;(ONGOING, RECOVERY or OK) and the&nbsp;<strong>trend</strong>&nbsp;(do we have more or fewer metrics in error).</li></ul>
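


<p>As an illustration of what such an operator could compute, here is a minimal plain-Java sketch. It is not the actual SetStateAndTrend class: the article only names the three states, so the mapping from error counts to states, the UP/DOWN/STABLE trend values and the StateAndTrend class itself are assumptions.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">/** Hypothetical sketch: derives a state and a trend from two consecutive error counts. */
final class StateAndTrend {
    enum State { ONGOING, RECOVERY, OK }
    enum Trend { UP, DOWN, STABLE }

    // Assumption: any metric in error means ONGOING; the first evaluation
    // without errors after an error is RECOVERY; otherwise the alert is OK.
    static State state(int previousErrors, int currentErrors) {
        if (currentErrors > 0) {
            return State.ONGOING;
        }
        return previousErrors > 0 ? State.RECOVERY : State.OK;
    }

    // The trend tells whether more or fewer metrics are in error than before.
    static Trend trend(int previousErrors, int currentErrors) {
        int cmp = Integer.compare(currentErrors, previousErrors);
        if (cmp > 0) {
            return Trend.UP;
        }
        return cmp == 0 ? Trend.STABLE : Trend.DOWN;
    }
}</code></pre>



<p>In a keyed stream, the previous error count would live in per-key state, so each alert is compared only with its own history.</p>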



<p><strong>SetSinceOnSelector</strong> and <strong>SetStateAndTrend</strong> are each under 120 lines of code because Flink is&nbsp;<strong>handling all the difficulties</strong>. Most of the pipeline is&nbsp;<strong>only composed</strong>&nbsp;of&nbsp;<strong>classic transformations</strong>&nbsp;such as&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Map, FlatMap, Reduce</a>, including their&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#rich-functions" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Rich</a>&nbsp;and&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#using-managed-keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Keyed</a>&nbsp;versions. We also have a few&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/process_function.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Process Functions</a>, which are&nbsp;<strong>very handy</strong>&nbsp;for developing, for example, the escalation timer.</p>
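


<p>To show why a process function suits the escalation timer, here is a simplified sketch in plain Java, without any Flink dependency. EscalationTimer and its methods are hypothetical: onAlert and onAck stand in for what processElement would do in a Flink process function, and onTime stands in for onTimer. The fixed delay between steps and the number of steps are assumptions.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">/** Hypothetical per-alert escalation logic, modelled on the process function callbacks. */
final class EscalationTimer {
    private static final long NO_TIMER = -1L;

    private final long stepDelayMillis; // delay before moving to the next step
    private final int lastStep;         // highest escalation step
    private int step;                   // current step, 0 = first contact
    private long timerAt = NO_TIMER;    // when the pending timer fires

    EscalationTimer(long stepDelayMillis, int lastStep) {
        this.stepDelayMillis = stepDelayMillis;
        this.lastStep = lastStep;
    }

    // An alert arrives: start at step 0 and arm the timer if a next step exists.
    void onAlert(long now) {
        step = 0;
        timerAt = lastStep > 0 ? now + stepDelayMillis : NO_TIMER;
    }

    // An acknowledgement arrives: cancel the pending timer.
    void onAck() {
        timerAt = NO_TIMER;
    }

    // Time advances: returns the step escalated to, or -1 when nothing fires.
    int onTime(long now) {
        if (timerAt == NO_TIMER || timerAt > now) {
            return -1;
        }
        step++;
        // Re-arm the timer unless the last step has been reached.
        timerAt = step >= lastStep ? NO_TIMER : timerAt + stepDelayMillis;
        return step;
    }
}</code></pre>



<p>In Flink, the step and the timer would be kept in keyed state and registered with the timer service, so they would survive a restart together with the rest of the job state.</p>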



<h3 class="wp-block-heading" id="a77e">Integration tests</h3>



<p>As the number of classes grew, we needed to test our pipeline. Because it is only wired to Kafka, we wrapped the Kafka consumer and producer to create what we call&nbsp;<strong>scenari:&nbsp;</strong>a series of integration tests running different scenarios.</p>



<h3 class="wp-block-heading" id="5f8f">Queryable state</h3>



<p>One killer feature of Apache Flink is the&nbsp;<strong>capability of&nbsp;</strong><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/queryable_state.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>querying the internal state</strong></a><strong>&nbsp;of an operator</strong>. Even though it is a beta feature, it allows us to get the current state of the different parts of the job:</p>



<ul class="wp-block-list"><li>which escalation step we are on,</li><li>whether the alert is snoozed or <em>ack</em>-ed,</li><li>which alert is ongoing,</li><li>and so on.</li></ul>



<div class="wp-block-image size-full wp-image-14361"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="617" src="/blog/wp-content/uploads/2019/01/004-1.png" alt="Queryable state overview" class="wp-image-14361" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Queryable state overview</figcaption></figure></div>



<p>Thanks to this, we easily developed an&nbsp;<strong>API</strong>&nbsp;over the queryable state, which powers our&nbsp;<strong>alerting view</strong>&nbsp;in&nbsp;<a href="https://studio.metrics.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Metrics Studio</a>, our codename for the Web UI of the Metrics Data Platform.</p>



<h3 class="wp-block-heading" id="1bc7">Apache Flink deployment</h3>



<p>


We deployed the latest version of Flink (<strong>1.7.1</strong>&nbsp;at the time of writing) directly on bare metal servers, with a dedicated ZooKeeper cluster, using Ansible. Operating Flink has been a really nice surprise for us, with&nbsp;<strong>clear documentation and configuration</strong>, and an&nbsp;<strong>impressive resilience</strong>. We are capable of&nbsp;<strong>rebooting</strong>&nbsp;the whole Flink cluster, and the job&nbsp;<strong>restarts from its last saved state</strong>, as if nothing had happened.


</p>



<p>We are using&nbsp;<strong>RocksDB</strong>&nbsp;as a state backend, backed by OpenStack&nbsp;<strong>Swift storage&nbsp;</strong>provided by OVH Public Cloud.</p>



<p>For monitoring, we rely on the&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus Exporter</a>&nbsp;with&nbsp;<a href="https://github.com/ovh/beamium" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Beamium</a>&nbsp;to gain&nbsp;<strong>observability</strong>&nbsp;over the job’s health.</p>



<h3 class="wp-block-heading" id="8d7c">In short, we love Apache&nbsp;Flink!</h3>



<p>If you are used to working with stream-related software, you may have realized that we did not use any rocket science or tricks. We may be relying on the basic streaming features offered by Apache Flink, but they allowed us to tackle many business and scalability problems with ease.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png" alt="Apache Flink" class="wp-image-14354" width="437" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png 874w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-300x275.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-768x703.png 768w" sizes="auto, (max-width: 437px) 100vw, 437px" /></figure></div>



<p>As such, we highly recommend that every developer take a look at Apache Flink. I encourage you to go through <a href="https://medium.com/r/?url=https%3A%2F%2Ftraining.da-platform.com%2F" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache Flink Training</a>, written by Data Artisans.&nbsp;Furthermore, the community has put a lot of effort into making Apache Flink easy to deploy on Kubernetes, so you can easily try Flink using our&nbsp;Managed Kubernetes!</p>



<h3 class="wp-block-heading">What’s next?</h3>



<p>Next week we will come back to Kubernetes and explain how we deal with etcd in our OVH <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">Managed Kubernetes service</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
