<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Alerting Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/alerting/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/alerting/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 27 Sep 2019 09:00:28 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Alerting Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/alerting/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Alerting based on IPMI data collection</title>
		<link>https://blog.ovhcloud.com/alerting-based-on-ipmi-data-collection/</link>
		
		<dc:creator><![CDATA[Morvan Le Goff]]></dc:creator>
		<pubDate>Fri, 10 May 2019 13:56:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Data Collection]]></category>
		<category><![CDATA[IPMI]]></category>
		<category><![CDATA[Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14974</guid>

					<description><![CDATA[The problem to solve&#8230; How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them&#160;– this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">The problem to solve&#8230;</h2>



<p class="wp-block-paragraph">How could we continuously monitor the health of all OVH servers, with no impact on their performance and no intrusion into the operating systems running on them? This was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failures, in order to improve the quality of service delivered to our customers.</p>



<p class="wp-block-paragraph">We began by splitting the problem into four general steps:</p>



<ul class="wp-block-list"><li style="list-style-type: none;">
<ul>
<li>Data collection</li>
<li>Data storage</li>
<li>Data analytics</li>
<li>Visualisation/actions</li>
</ul>
</li></ul>



<h2 class="wp-block-heading">Data collection</h2>



<p class="wp-block-paragraph">How did we collect massive amounts of server health data, in a non-intrusive way, within short time intervals?</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg" alt="" class="wp-image-15455" width="768" height="500" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-300x195.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-768x500.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1200x782.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D.jpeg 1725w" sizes="(max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Which data to collect?</h3>



<p class="wp-block-paragraph">On modern servers, a BMC (Baseboard Management Controller) allows us to control firmware updates, reboots, and so on. This controller is independent of the system running on the server. In addition, the BMC gives us access to sensors for all the motherboard components through an I2C bus. The protocol used to communicate with the BMC is IPMI, which is accessible over the LAN (RMCP).</p>



<h4 class="wp-block-heading">What is IPMI?</h4>



<ul class="wp-block-list"><li>Intelligent Platform Management Interface.</li><li>Management and monitoring capabilities independent of the host’s OS.</li><li>Led by Intel, first published in 1998.</li><li>Supported by more than 200 computer system vendors such as Cisco, Dell, HP, Intel, SuperMicro…</li></ul>



<h4 class="wp-block-heading">Why use IPMI?</h4>



<ul class="wp-block-list"><li>Access to hardware sensors (CPU temperature, memory temperature, chassis status, power, etc.)</li><li>No dependency on the OS (i.e. an agentless solution)</li><li>IPMI functions remain accessible after an OS/system failure</li><li>Access to IPMI functionality can be restricted via user privileges</li></ul>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg" alt="IPMI-poller node" class="wp-image-15456" width="768" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-300x156.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-768x400.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1200x625.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869.jpeg 2000w" sizes="(max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Multi-source data collection</h3>



<p class="wp-block-paragraph">We needed a scalable and responsive multi-source data collection tool to grab the IPMI data of about 400k servers at fixed intervals.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png" alt="Akka" class="wp-image-15467" width="200" height="71" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png 352w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4-300x106.png 300w" sizes="(max-width: 200px) 100vw, 200px" /></figure></div>



<p class="wp-block-paragraph">We decided to build our IPMI data collector on the&nbsp;<a href="https://github.com/akka/akka" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Akka</a> framework.&nbsp;Akka&nbsp;is an open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM.</p>



<p class="wp-block-paragraph">The Akka framework defines an abstraction built on top of threads, called an &#8216;actor&#8217;. An actor is an entity that handles messages. This abstraction eases the creation of multi-threaded applications, so there&#8217;s no need to fight against deadlocks. By selecting the dispatcher policy for a group of actors, you can fine-tune your application to be fully reactive and adaptable to the load. This way, we were able to design an efficient data collector that could adapt to the load, as we intended to grab each sensor value every minute.</p>



<p class="wp-block-paragraph">In addition, the cluster architecture provided by the framework allowed us to handle all the servers in a datacentre with a single cluster. The cluster architecture also helped us to design a resilient system, so if a node of the cluster crashes or becomes too slow, it will automatically restart. The servers monitored by the failing node are then handled by the remaining, valid nodes of the cluster.</p>



<p class="wp-block-paragraph">With the cluster architecture, we implemented a quorum feature that takes down the whole cluster if the minimum number of started nodes is not reached. This feature easily solves the split-brain problem: if the connection between nodes is broken, the cluster splits into two entities, and the one that does not reach the quorum is automatically shut down.</p>
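


<p class="wp-block-paragraph">The quorum rule can be illustrated with a minimal sketch. It assumes the quorum is a strict majority of the configured nodes, which the description above does not specify. The class and method names are hypothetical and are not taken from the collector or from Akka.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">public final class QuorumCheck {

    private QuorumCheck() {
    }

    // Assumption: the quorum is the smallest strict majority of the cluster.
    public static int quorumSize(int totalNodes) {
        return totalNodes / 2 + 1;
    }

    // True when a partition still sees enough nodes to keep running.
    public static boolean hasQuorum(int reachableNodes, int totalNodes) {
        return reachableNodes >= quorumSize(totalNodes);
    }

    public static void main(String[] args) {
        int totalNodes = 5;
        // A network split leaves 3 nodes on one side and 2 on the other.
        System.out.println("side A keeps running: " + hasQuorum(3, totalNodes)); // true
        System.out.println("side B keeps running: " + hasQuorum(2, totalNodes)); // false
    }
}</code></pre>



<p class="wp-block-paragraph">With five nodes, a 3/2 split leaves only one side at or above the quorum, so at most one half of the cluster keeps polling the servers.</p>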



<p class="wp-block-paragraph">A REST API is defined to communicate with the data collector in two ways:</p>



<ul class="wp-block-list"><li>To send the configurations</li><li>To get information on the monitored servers </li></ul>



<p class="wp-block-paragraph">Each cluster node runs in its own JVM, and we are able to launch one or more nodes on a dedicated server. Each dedicated server used in the cluster is placed in an OVH vRack. An IPMI gateway pool is used to access the BMC of each server, with the communication between the gateway and the IPMI data collector secured by IPsec connections.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg" alt="IPMI-poller clustering" class="wp-image-15457" width="512" height="436" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-300x256.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-768x654.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1200x1022.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349.jpeg 1491w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Data storage</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png" alt="OVH Metrics" class="wp-image-15470" width="199" height="179" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png 409w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631-300x268.png 300w" sizes="auto, (max-width: 199px) 100vw, 199px" /></figure></div>



<p class="wp-block-paragraph">Of course, we use the OVH Metrics service for data storage! Before storing the data, the IPMI data collector unifies the metrics by qualifying each sensor. The final metric name is defined by the entity the sensor belongs to and the base unit of the value. This eases post-processing and data visualisation/comparison.</p>
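


<p class="wp-block-paragraph">As a rough illustration of this unification step, the sketch below derives a metric name from an entity and a base unit. The <code>ipmi</code> prefix, the dot-separated layout, the normalisation rule and all names are assumptions made for the example, as the actual naming scheme is not described here.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">import java.util.Locale;

public final class MetricNames {

    // Hypothetical prefix, chosen only for this example.
    private static final String PREFIX = "ipmi";

    private MetricNames() {
    }

    // Lower-cases the label and replaces each run of other characters with "_".
    public static String normalise(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
    }

    // Builds a name from the entity the sensor belongs to and the base unit.
    public static String metricName(String entity, String baseUnit) {
        return PREFIX + "." + normalise(entity) + "." + normalise(baseUnit);
    }

    public static void main(String[] args) {
        // Vendors label the same kind of sensor differently; the name stays uniform.
        System.out.println(metricName("Power Supply", "Watts")); // ipmi.power_supply.watts
        System.out.println(metricName("CPU", "degrees C"));      // ipmi.cpu.degrees_c
    }
}</code></pre>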



<p class="wp-block-paragraph">Each datacentre IPMI collector pushes its data to a Metrics live cache server with a limited persistence time. All important information is persisted in the OVH Metrics server.</p>



<h2 class="wp-block-heading">Data analytics</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png" alt="Warp 10" class="wp-image-15468" width="201" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D.png 450w" sizes="auto, (max-width: 201px) 100vw, 201px" /></figure></div>



<p class="wp-block-paragraph">We store our metrics in <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a>. Warp 10 comes with a time series scripting language, WarpScript, which makes the analytics powerful and lets us easily manipulate and post-process our collected data on the server side.</p>



<p class="wp-block-paragraph">We have defined three levels of analysis to monitor the health of the servers:</p>



<ul class="wp-block-list"><li style="list-style-type: none;">
<ul>
<li>A simple threshold-per-server metric.</li>
<li>Using the OVH Metrics loops service, we aggregate data per rack and per room and calculate a mean. Setting a threshold on this mean lets us detect failures common to a whole rack or room, such as a fault in the cooling or power supply system.</li>
<li>The OVH MLS service performs anomaly detection on the racks and rooms by forecasting the likely evolution of each metric from its past values. If a metric&#8217;s value falls outside this forecast, an anomaly is raised.</li>
</ul>
</li></ul>
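


<p class="wp-block-paragraph">The second level can be pictured with a small sketch in plain Java. It is not the WarpScript used in the Metrics loops service; the class name, method names, threshold values and sample temperatures are all made up for the illustration.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">public final class RackThreshold {

    private RackThreshold() {
    }

    // Arithmetic mean of the latest sensor value of every server in a rack.
    public static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no sensor values");
        }
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // True when the rack-level mean exceeds the rack-level threshold.
    public static boolean rackAlert(double[] serverValues, double rackThreshold) {
        return mean(serverValues) > rackThreshold;
    }

    public static void main(String[] args) {
        // Hypothetical inlet temperatures (degrees Celsius) for four servers.
        double[] rack = {52.0, 53.0, 51.0, 54.0};
        // No server exceeds a per-server threshold of 60, yet the whole rack
        // runs warm, which the mean-based check reports.
        System.out.println(mean(rack));            // 52.5
        System.out.println(rackAlert(rack, 50.0)); // true
    }
}</code></pre>



<p class="wp-block-paragraph">The point of the mean is that a moderate drift shared by every server in a rack stays below the per-server threshold but still shows up at rack level.</p>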



<h2 class="wp-block-heading">Visualisation/actions</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/F8551D2C-5386-4754-912B-2A0C0F278684-150x150.png" alt="TAT" class="wp-image-15472" width="100" height="100"/></figure></div>



<p class="wp-block-paragraph">All the alerts generated by the data analysis are pushed to <a href="https://github.com/ovh/tat" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TAT</a>, which is an OVH tool we use to handle the alerting flow.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/47315BB4-8989-46AB-8882-25E804AFBFC1.png" alt="Grafana" class="wp-image-15473" width="150" height="124"/></figure></div>



<p class="wp-block-paragraph">Grafana is used to monitor the metrics. We have dashboards to visualise the metrics and the aggregations for each rack and room, the detected anomalies, and the evolution of open alerts.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="163" src="/blog/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png" alt="" class="wp-image-14985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-768x417.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1024x556.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1200x651.png 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Handling OVH&#8217;s alerts with Apache Flink</title>
		<link>https://blog.ovhcloud.com/handling-ovhs-alerts-with-apache-flink/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Thu, 31 Jan 2019 09:01:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Omni]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14337</guid>

					<description><![CDATA[OVH relies extensively on metrics to effectively monitor its entire stack. Whether they are low-level or business centric, they allow teams to gain insight into how our services are operating on a daily basis. The need to store millions of datapoints per second led us to create a dedicated team to build and operate a product to handle that load: Metrics Data Platform. By relying on Apache HBase, Apache Kafka and Warp 10, we succeeded in creating a fully distributed platform that handles all our metrics... and yours!

After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: alerting.
]]></description>
										<content:encoded><![CDATA[
<p class="wp-block-paragraph">OVH relies extensively on <strong>metrics</strong> to effectively monitor its entire stack. Whether they are <strong>low-level</strong> or <strong>business</strong> centric, they allow teams to gain <strong>insight</strong> into how our services are operating on a daily basis. The need to store <strong>millions of datapoints per second</strong> led us to create a dedicated team to build and operate a product to handle that load: <strong><a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a>.</strong> By relying on <strong><a href="https://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, <a href="https://kafka.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Kafka</a></strong> and <a href="https://www.warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Warp 10</strong></a>, we succeeded in creating a fully distributed platform that handles all our metrics&#8230; and yours! </p>



<p class="wp-block-paragraph">After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: <strong>alerting. </strong></p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="885" height="290" src="https://www.ovh.com/blog/wp-content/uploads/2019/01/001-1.png" alt="OVH &amp; Apache Flink" class="wp-image-14367" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-300x98.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-768x252.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure>



<h3 class="wp-block-heading" id="6c01">Meet OMNI, our alerting&nbsp;layer</h3>



<p class="wp-block-paragraph">OMNI is our code name for a&nbsp;<strong>fully distributed</strong>,&nbsp;<strong>as-code</strong>,&nbsp;<strong>alerting</strong>&nbsp;system that we developed on top of Metrics. It is split into two components:</p>



<ul class="wp-block-list"><li><strong>The management part</strong>, which takes your alert definitions from a Git repository and represents them as continuous queries,</li><li><strong>The query executor</strong>, which schedules your queries in a distributed way.</li></ul>



<p class="wp-block-paragraph">The query executor pushes the query results into Kafka, ready to be handled! We now need to perform all the tasks that an alerting system does:</p>



<ul class="wp-block-list"><li>Handling alert&nbsp;<strong>deduplication</strong>&nbsp;and&nbsp;<strong>grouping</strong>, to avoid&nbsp;<a href="https://en.wikipedia.org/wiki/Alarm_fatigue" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">alert fatigue</a>.</li><li>Handling&nbsp;<strong>escalation</strong>&nbsp;steps,&nbsp;<strong>acknowledgement&nbsp;</strong>or&nbsp;<strong>snooze</strong>.</li><li><strong>Notifying</strong>&nbsp;the end user through different&nbsp;<strong>channels</strong>: SMS, mail, push notifications,&nbsp;…</li></ul>



<p class="wp-block-paragraph">To handle that, we looked at open-source projects such as&nbsp;<a href="https://github.com/prometheus/alertmanager" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus AlertManager</a>&nbsp;and&nbsp;<a href="https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">LinkedIn Iris</a>, and discovered the&nbsp;<em>hidden</em>&nbsp;truth:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Handling alerts as streams of data,<br>moving from one operator to&nbsp;another.</p></blockquote>



<p class="wp-block-paragraph">We embraced it, and decided to leverage <a href="https://flink.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Flink</a> to create&nbsp;<strong>Beacon</strong>. In the next section we are going to describe the architecture of Beacon, and how we built and operate it.</p>



<p class="wp-block-paragraph">If you want more information on Apache Flink, we suggest reading the introductory article on the official website:&nbsp;<a href="https://flink.apache.org/flink-architecture.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">What is Apache Flink?</a></p>



<h3 class="wp-block-heading" id="6caa">Beacon architecture</h3>



<p class="wp-block-paragraph">At its core, Beacon reads events from&nbsp;<strong>Kafka</strong>. Everything is represented as a&nbsp;<strong>message</strong>, from alerts to aggregation rules, snooze orders and so on. The pipeline is divided into two branches:</p>



<ul class="wp-block-list"><li>One that is running the&nbsp;<strong>aggregations</strong>, and triggering notifications based on customer’s rules.</li><li>One that is handling the&nbsp;<strong>escalation steps</strong>.</li></ul>



<p class="wp-block-paragraph">Then everything is merged to&nbsp;<strong>generate</strong>&nbsp;<strong>a</strong>&nbsp;<strong>notification</strong>, which is forwarded to the right person. The notification message is pushed into Kafka, where it is consumed by another component called&nbsp;<strong>beacon-notifier.</strong></p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="470" src="/blog/wp-content/uploads/2019/01/002.png" alt="Beacon architecture" class="wp-image-14349" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/002.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-768x408.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p class="wp-block-paragraph">If you are new to streaming architecture, I recommend reading&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/programming-model.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dataflow Programming Model</a>&nbsp;from the official Flink documentation.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/01/003.png" alt="Handling state" class="wp-image-14350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/003.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p class="wp-block-paragraph">Everything is merged into a DataStream,&nbsp;<strong>partitioned</strong>&nbsp;(<a href="https://medium.com/r/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.7%2Fdev%2Fstream%2Fstate%2Fstate.html%23keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">keyed by&nbsp;</a>in the Flink API) by user. Here&#8217;s an example:</p>



<pre class="wp-block-code"><code lang="java" class="language-java">final DataStream&lt;Tuple4&lt;PlanIdentifier, Alert, Plan, Operation>> alertStream =

  // Partitioning Stream per AlertIdentifier
  cleanedAlertsStream.keyBy(0)
  // Applying a Map Operation which is setting since when an alert is triggered
  .map(new SetSinceOnSelector())
  .name("setting-since-on-selector").uid("setting-since-on-selector")

  // Partitioning again Stream per AlertIdentifier
  .keyBy(0)
  // Applying another Map Operation which is setting State and Trend
  .map(new SetStateAndTrend())
  .name("setting-state").uid("setting-state");</code></pre>



<ul class="wp-block-list"><li><strong>SetSinceOnSelector</strong>, which sets&nbsp;<strong>since</strong>&nbsp;when the alert has been triggered</li><li><strong>SetStateAndTrend</strong>, which sets the&nbsp;<strong>state&nbsp;</strong>(ONGOING, RECOVERY or OK) and the&nbsp;<strong>trend</strong> (whether more or fewer metrics are in error).</li></ul>
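


<p class="wp-block-paragraph">To give an idea of what such a map operation computes, here is a minimal sketch in plain Java, without the Flink API. The transition rules are assumptions: only the state names and the meaning of the trend come from the description above. <code>AlertStateSketch</code>, its enums and its methods are hypothetical names, not Beacon code.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">public final class AlertStateSketch {

    public enum State { ONGOING, RECOVERY, OK }

    public enum Trend { UP, DOWN, STABLE }

    private AlertStateSketch() {
    }

    // Assumed rule: any metric in error means ONGOING, the first clean
    // evaluation after an error means RECOVERY, anything else is OK.
    public static State state(int previousErrors, int currentErrors) {
        if (currentErrors > 0) {
            return State.ONGOING;
        }
        if (previousErrors > 0) {
            return State.RECOVERY;
        }
        return State.OK;
    }

    // Trend: are more or fewer metrics in error than at the last evaluation?
    public static Trend trend(int previousErrors, int currentErrors) {
        if (currentErrors > previousErrors) {
            return Trend.UP;
        }
        if (previousErrors > currentErrors) {
            return Trend.DOWN;
        }
        return Trend.STABLE;
    }

    public static void main(String[] args) {
        System.out.println(state(0, 3) + " " + trend(0, 3)); // ONGOING UP
        System.out.println(state(3, 1) + " " + trend(3, 1)); // ONGOING DOWN
        System.out.println(state(1, 0) + " " + trend(1, 0)); // RECOVERY DOWN
        System.out.println(state(0, 0) + " " + trend(0, 0)); // OK STABLE
    }
}</code></pre>



<p class="wp-block-paragraph">In a keyed Flink stream, the previous error count would typically live in the operator&#8217;s keyed state rather than being passed in as an argument.</p>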



<p class="wp-block-paragraph">Both SetSinceOnSelector and SetStateAndTrend are under 120 lines of code because Flink&nbsp;<strong>handles all the difficulties</strong>. Most of the pipeline is&nbsp;<strong>composed only</strong>&nbsp;of&nbsp;<strong>classic transformations</strong>&nbsp;such as&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Map, FlatMap, Reduce</a>, including their&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#rich-functions" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Rich</a>&nbsp;and&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#using-managed-keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Keyed</a>&nbsp;versions. We have a few&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/process_function.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Process Functions</a>, which are&nbsp;<strong>very handy</strong>&nbsp;for developing, for example, the escalation timer.</p>
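


<p class="wp-block-paragraph">The arithmetic behind an escalation timer can be sketched in plain Java as follows. This is not Beacon&#8217;s Process Function: the fixed step duration, the cap on the last step and every name in the block are assumptions made for the example.</p>



<pre class="wp-block-code"><code lang="java" class="language-java">import java.time.Duration;

public final class EscalationSketch {

    private EscalationSketch() {
    }

    // Zero-based escalation step reached when an alert triggered at
    // triggeredAtMillis is still unacknowledged at nowMillis.
    public static int escalationStep(long triggeredAtMillis, long nowMillis,
                                     long stepMillis, int lastStep) {
        if (1 > stepMillis) {
            throw new IllegalArgumentException("stepMillis must be positive");
        }
        // A clock that runs backwards is treated as "no time elapsed".
        long elapsed = Math.max(0L, nowMillis - triggeredAtMillis);
        long step = elapsed / stepMillis;
        // Stay on the last step once the escalation chain is exhausted.
        return (int) Math.min(step, (long) lastStep);
    }

    public static void main(String[] args) {
        long step = Duration.ofMinutes(15).toMillis();
        long tenMinutes = Duration.ofMinutes(10).toMillis();
        long twentyMinutes = Duration.ofMinutes(20).toMillis();
        long twoHours = Duration.ofHours(2).toMillis();

        System.out.println(escalationStep(0L, tenMinutes, step, 2));    // 0
        System.out.println(escalationStep(0L, twentyMinutes, step, 2)); // 1
        System.out.println(escalationStep(0L, twoHours, step, 2));      // 2
    }
}</code></pre>



<p class="wp-block-paragraph">In a streaming job, a timer callback would re-evaluate the step when each deadline fires, instead of polling the clock.</p>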



<h3 class="wp-block-heading" id="a77e">Integration tests</h3>



<p class="wp-block-paragraph">As the number of classes grew, we needed to test our pipeline. Because it is only wired to Kafka, we wrapped the consumer and producer to create what we call&nbsp;<strong>scenari:&nbsp;</strong>a series of integration tests running different scenarios.</p>



<h3 class="wp-block-heading" id="5f8f">Queryable state</h3>



<p class="wp-block-paragraph">One killer feature of Apache Flink is the&nbsp;<strong>capability of&nbsp;</strong><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/queryable_state.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>querying the internal state</strong></a><strong>&nbsp;of an operator</strong>. Even though it is a beta feature, it allows us to get the current state of the different parts of the job:</p>



<ul class="wp-block-list"><li>which escalation step we are on</li><li>whether an alert is snoozed or <em>ack</em>-ed</li><li>which alert is ongoing</li><li>and so on.</li></ul>



<div class="wp-block-image size-full wp-image-14361"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="617" src="/blog/wp-content/uploads/2019/01/004-1.png" alt="Queryable state overview" class="wp-image-14361" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Queryable state overview</figcaption></figure></div>



<p class="wp-block-paragraph">Thanks to this, we easily developed an&nbsp;<strong>API</strong>&nbsp;over the queryable state, that is powering our&nbsp;<strong>alerting view</strong>&nbsp;in&nbsp;<a href="https://studio.metrics.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Metrics Studio,</a> our codename for the Web UI of the Metrics Data Platform.</p>



<h3 class="wp-block-heading" id="1bc7">Apache Flink deployment</h3>



<p class="wp-block-paragraph">We deployed the latest version of Flink (<strong>1.7.1</strong>&nbsp;at the time of writing) directly on bare metal servers with a dedicated ZooKeeper cluster, using Ansible. Operating Flink has been a really nice surprise for us, with&nbsp;<strong>clear documentation and configuration</strong>, and an&nbsp;<strong>impressive resilience</strong>. We are capable of&nbsp;<strong>rebooting</strong>&nbsp;the whole Flink cluster, and the job&nbsp;<strong>restarts from its last saved state</strong>, as if nothing had happened.</p>



<p class="wp-block-paragraph">We are using&nbsp;<strong>RocksDB</strong>&nbsp;as a state backend, backed by OpenStack&nbsp;<strong>Swift storage&nbsp;</strong>provided by OVH Public Cloud.</p>



<p class="wp-block-paragraph">For monitoring, we are relying on the&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus Exporter</a>&nbsp;with&nbsp;<a href="https://github.com/ovh/beamium" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Beamium</a>&nbsp;to gain&nbsp;<strong>observability</strong>&nbsp;over the job’s health.</p>



<h3 class="wp-block-heading" id="8d7c">In short, we love Apache&nbsp;Flink!</h3>



<p class="wp-block-paragraph">If you are used to working with stream-related software, you may have realised that we did not use any rocket science or tricks. We may be relying on the basic streaming features offered by Apache Flink, but they allowed us to tackle many business and scalability problems with ease.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png" alt="Apache Flink" class="wp-image-14354" width="437" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png 874w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-300x275.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-768x703.png 768w" sizes="auto, (max-width: 437px) 100vw, 437px" /></figure></div>



<p class="wp-block-paragraph">As such, we highly recommend that every developer take a look at Apache Flink. I encourage you to go through <a href="https://medium.com/r/?url=https%3A%2F%2Ftraining.da-platform.com%2F" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache Flink Training</a>, written by Data Artisans.&nbsp;Furthermore, the community has put a lot of effort into making Apache Flink easy to deploy on Kubernetes, so you can easily try Flink using our&nbsp;Managed Kubernetes!</p>



<h3 class="wp-block-heading">What’s next?</h3>



<p class="wp-block-paragraph">Next week, we return to Kubernetes and explain how we deal with etcd in our OVH <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">Managed Kubernetes service</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
