<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Metrics Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/metrics/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/metrics/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 08 Jan 2021 10:54:57 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Metrics Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/metrics/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Erlenmeyer and PromQL compatibility</title>
		<link>https://blog.ovhcloud.com/erlenmeyer-and-promql-compatibility/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 16 Dec 2020 11:03:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20123</guid>

					<description><![CDATA[Today in the monitoring world, we see the rise of the Prometheus tool. It&#8217;s a great tool to deploy in your infrastructure, as it can scrape all of your servers or applications to retrieve, store and analyze their metrics. All you have to do is extract and run it; it does [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Today in the monitoring world, we see the rise of the <a href="https://prometheus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a> tool. It&#8217;s a great tool to deploy in your infrastructure, as it can scrape all of your servers or applications to retrieve, store and analyze their metrics. All you have to do is extract and run it; it does all the work by itself. Of course, Prometheus comes with some trade-offs (a pull model, late-ingestion handling) and some limits, as it only keeps your data for a couple of days.</p>
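<p>For instance, a minimal Prometheus scrape configuration looks like this (the target address is a placeholder):</p>

```yaml
# prometheus.yml -- scrape one node_exporter target every 15 seconds
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
```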



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0404-1024x537.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20266" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404.png 1200w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Context</h2>



<p>How is it possible to handle Prometheus long-term storage? A vast number of time series databases are now advertised as fully compatible with Prometheus. It&#8217;s easy to check that Prometheus ingest is working well; however, how can we validate the PromQL &#8211; or Prometheus query &#8211; part? A few months ago, <a href="https://promlabs.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromLabs</a> released a new tool called &#8220;<a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a>&#8220;. They recently created <a href="https://promlabs.com/promql-compliance-tests" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">this page</a>, where they reference the PromQL compliance test results of several products. In this blog post, we will see how this tool helped us improve our PromQL implementation. </p>



<h3 class="wp-block-heading">Compliance tester</h3>



<p>The <a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a> is open source and contains a full set of tests. It generates around 500 PromQL queries covering the vast majority of the language, including tests on simple scalars, selectors, time range functions, operators, and so on. The tool executes each query against both a reference Prometheus instance and the backend under test, and expects the backend to return the same result as the Prometheus output. It expects an exact match for all series metadata (tags and names). It is more flexible on timestamps, as a parameter lets you round the comparison to the millisecond. Finally, it checks the equality of the returned values; since many things can affect floating-point reproducibility, it computes an approximate equality. </p>
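<p>The comparison logic can be sketched as follows (a simplified illustration; the tester&#8217;s actual implementation may differ):</p>

```python
def timestamps_match(ts_a_ms, ts_b_ms, round_to_ms=1):
    # Timestamps may be rounded to a configurable number of milliseconds.
    return round(ts_a_ms / round_to_ms) == round(ts_b_ms / round_to_ms)

def values_match(a, b, fraction=0.00001):
    # Approximate float equality: the two values must agree within a
    # relative tolerance, since exact floating-point equality is fragile.
    if a == b:
        return True
    return abs(a - b) <= fraction * max(abs(a), abs(b))

print(values_match(1.0000000001, 1.0))  # True
print(values_match(1.01, 1.0))          # False
```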



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p>At Metrics, we use the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> TSDB with its own analytics query engine, <a href="https://warp10.io/content/03_Documentation/04_WarpScript" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpScript</a>. We decided to build an open source tool, <a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a>, to transpile PromQL queries into WarpScript. The compliance tester was a great help in validating parts of our implementation and in detecting which queries were not fully compliant.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="573" height="295" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0406.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20265" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406.png 573w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406-300x154.png 300w" sizes="(max-width: 573px) 100vw, 573px" /></figure></div>



<h3 class="wp-block-heading">Set up</h3>



<p>To start testing our PromQL experience, we set up a local Prometheus with a default configuration. This configuration makes Prometheus run and collect some &#8220;demo&#8221; metrics, which we then forwarded to one of our Metrics regions using Prometheus remote write. We added a local instance of Erlenmeyer to query the data stored in a distributed Warp10 backend. Then, we iterated on each set of tests of the PromLabs compliance tool to identify all the issues and improve our existing PromQL implementation. </p>
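<p>The forwarding step relies on Prometheus&#8217; standard remote write configuration; a minimal fragment might look like this (the endpoint URL is a placeholder, not our actual ingest address):</p>

```yaml
# prometheus.yml -- forward every collected sample to a remote backend
remote_write:
  - url: "https://metrics.example.com/prometheus/remote_write"  # hypothetical endpoint
```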



<p>To be compliant, we had to reduce the value precision required by the compliance tool: we set it to <code>0.001</code> instead of <code>0.00001</code>. We also had to remove the Warp10 <code>.app</code> label from the results, as we use this label to identify users on a Warp10 instance.</p>



<h3 class="wp-block-heading">A test query</h3>



<p>When running the test, you will get a full report of your failing queries. Let&#8217;s take an example:</p>



<pre class="wp-block-code"><code class="">RESULT: FAILED: Query returned different results:
  model.Matrix{
  	&amp;{
  		Metric: Inverse(DropResultLabels, s`{instance="demo.promlabs.com:10002", job="demo"}`),
  		Values: []model.SamplePair{
  			... // 52 identical elements
  			{Timestamp: s"1606323726.058", Value: Inverse(TranslateFloat64, float64(2.6928936527e+10))},
  			{Timestamp: s"1606323736.058", Value: Inverse(TranslateFloat64, float64(2.691644054725e+10))},
  			{
  				Timestamp: s"1606323746.058",
- 				Value:     Inverse(TranslateFloat64, float64(2.6922272529119648e+10)),
+ 				Value:     Inverse(TranslateFloat64, float64(2.689432207325e+10)),
  			},
  			{Timestamp: s"1606323756.058", Value: Inverse(TranslateFloat64, float64(2.6915188293125e+10))},
  			{Timestamp: s"1606323766.058", Value: Inverse(TranslateFloat64, float64(2.69215848005e+10))},
  			... // 4 identical elements
  		},
  	},
  }</code></pre>



<p>The test report includes all errors occurring during the run. In this example, we can see that, for a single series, almost all values are correct but one is invalid. It appears on two lines: the first, starting with &#8220;-&#8220;, is the expected value; the second, starting with &#8220;+&#8221;, is the value returned by the tested instance. In this case, the value isn&#8217;t precise enough (2.689 instead of 2.692).</p>
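<p>We can verify this with the two values from the report: their relative difference sits just above the <code>0.001</code> tolerance we configured, which is why the query fails (the tool&#8217;s exact error formula is an assumption here):</p>

```python
expected = 2.6922272529119648e+10  # reference Prometheus value (the "-" line)
got      = 2.689432207325e+10      # tested backend value (the "+" line)

# Relative error with respect to the expected value.
relative_error = abs(expected - got) / abs(expected)
print(round(relative_error, 6))  # 0.001038 -> just above a 0.001 tolerance
```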



<h2 class="wp-block-heading">Results</h2>



<p>Now that we have a full test set-up running, we can see what we improved based on its results. If you want the full details of the fixes, you can check the code update made <code><a href="https://github.com/ovh/erlenmeyer/pull/48" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></code>. This tool helped us fix parts of our implementation, sanitize known issues, learn which PromQL features we had missed, and detect a few new bugs! Let&#8217;s review the changes.</p>



<h3 class="wp-block-heading">Quick implementation fixes</h3>



<p>Running those tests was a great help in understanding some of the implementation errors we had made when trying to match PromQL behavior. For example, our time range functions were sampling before computing the operation; reversing those steps gave us a direct match with a native query. The tests also helped us fix some minor bugs in the handling of comparison operators, and in several functions such as label_replace, holt_winters, predict_linear and the full set of time functions (hour, minute, month&#8230;). </p>



<p>We also improved our handling of the PromQL aggregation modifiers: by and without. </p>



<h3 class="wp-block-heading">Sanitize known issues</h3>



<p>We had recently discovered that we were not matching PromQL behavior for the series name: we kept the name through all compute operations, whereas Prometheus only keeps it when it is relevant. The compliance tester helped us validate this specific update across all queries. </p>



<p>Because this tool tests the validity of a query against a native PromQL query, it also helped us sanitize our query output. We knew that, in the case of missing values or empty series, we were not fully compliant. We corrected the part of Erlenmeyer that handles the output so that it matches all the PromQL cases included in the tests.</p>



<h3 class="wp-block-heading">Unimplemented features</h3>



<p>Running the tests led us to discover that we had missed some native PromQL features. As a result, Erlenmeyer now supports unary operators and the &#8220;bool&#8221; keyword. Unary support allows expressions such as &#8220;-my_series&#8221;, for example. In PromQL, the bool keyword converts the result of a comparison to booleans: the series values become 1 or 0 depending on the condition, where 1 stands for true and 0 for false.</p>
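<p>As an illustration, the bool behavior can be sketched like this (a simplified, values-only model; Erlenmeyer&#8217;s real implementation operates on Warp10 series via WarpScript):</p>

```python
def greater_than_bool(series_values, threshold):
    # PromQL "my_series > bool 5": instead of filtering out points that
    # fail the condition, every value becomes 1.0 (true) or 0.0 (false).
    return [1.0 if v > threshold else 0.0 for v in series_values]

print(greater_than_bool([3.0, 7.5, 5.0, 9.1], 5))  # [0.0, 1.0, 0.0, 1.0]
```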



<h3 class="wp-block-heading">Open issues</h3>



<p>Running all the compliance tests and improving our code base led us to around 91% success. For the rest, we opened <a href="https://github.com/ovh/erlenmeyer/issues" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">new issues</a> on Erlenmeyer. We detected that: </p>



<ul class="wp-block-list"><li>the handling of the over_time functions is not correct when the range is below the data point frequency,</li><li>for rate, delta, increase and predict_linear, our results aren&#8217;t precise enough to match PromQL output when the range is below 5 minutes,</li><li>there are some minor bugs in the series selectors (!=) and in label_replace (some checks are missing in the parameter validators),</li><li>PromQL subqueries, as well as some functions, are not implemented: ^ and % between two series sets, and the deriv function.</li></ul>



<p>Those are the four remaining points needed to cover the full PromQL feature set with Erlenmeyer. Our <a href="https://github.com/ovh/erlenmeyer/blob/master/doc/promql.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a> already lists all the missing implementations. </p>



<h2 class="wp-block-heading">Actions</h2>



<p>This tool was a great help in improving our PromQL compliance, and we are happy with the result. Indeed, we reached 91% on the provided test suite: </p>



<pre class="wp-block-code"><code class="">General query tweaks:
*  Metrics test
================================================================================
Total: 496 / 541 (91.68%) passed</code></pre>



<p>Our next action is to release those fixes and improvements on all our Metrics regions. We are looking forward to hearing what you think about our PromQL implementation! </p>



<p>We now see a lot of projects implementing Prometheus writes and reads. These projects bring Prometheus many missing features, such as long-term storage, deletion, late ingestion, historical data analysis, HA&#8230; Being able to validate a PromQL implementation is a big challenge, and it is a great help in choosing the right backend for your needs. </p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Open Source Metrics family welcomes Catalyst and Erlenmeyer</title>
		<link>https://blog.ovhcloud.com/the-open-source-metrics-family-welcomes-catalyst-and-erlenmeyer/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 20 Mar 2020 09:43:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16747</guid>

					<description><![CDATA[At OVHcloud Metrics, we love open source! Our goal is to provide all of our users with a full experience. We rely on the Warp10 time series database, which enables us to build open source tools for our users&#8217; benefit. Let&#8217;s take a look at some in this blog post. Storage tool Our Infrastructure is based [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At <strong>OVHcloud Metrics</strong>, we love open source! Our goal is to provide all of our users with a <strong>full experience</strong>. We rely on the<strong> <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> time series database, which enables us to build <strong>open source tools</strong> for our users&#8217; benefit. Let&#8217;s take a look at some of them in this blog post. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png" alt="" class="wp-image-17633" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Storage tool</h3>



<p>Our infrastructure is based on the <strong>open source time series database: <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong>. This database comes in two versions: a stand-alone one and a <strong>distributed one</strong>. The distributed version relies on distributed tools such as <strong>Apache Kafka, Apache Hadoop</strong> and <strong>Apache HBase</strong>. </p>



<p>Unsurprisingly, our team makes its own <strong>contributions to the <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> platform. Due to our unique requirements, we even <strong>contribute</strong> to the <a href="https://www.ovh.com/blog/contributing-to-apache-hbase-custom-data-balancing/" data-wpel-link="exclude">underlying open source database <strong>HBase</strong></a>! </p>



<h3 class="wp-block-heading">Metrics data ingest</h3>



<p>As a matter of fact, the <strong>ingest process</strong> was the first thing we got stuck into! We often build dedicated tools to collect and push <strong>monitoring</strong> <strong>data</strong> to Warp10 &#8211; this is how <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Noderig</strong></a> came to life. Noderig is a tool that <strong>collects</strong> a <strong>simple core</strong> of metrics from <strong>any server or any virtual machine</strong>. To send these metrics safely to a backend, <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Beamium</strong></a>, a Rust tool, pushes the Noderig metrics to <strong>one or several Warp 10</strong> backend(s).</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="369" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png" alt="" class="wp-image-17638" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-300x108.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-768x277.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1536x553.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E.png 1857w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><em>What if I want to collect my <strong>own custom metrics</strong>?</em> First, you&#8217;ll need to expose them following the &#8216;Prometheus model&#8217;. <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Beamium</a> is then able to <strong>scrape</strong> applications based on its configuration file and forward all the data to the configured Warp 10 backend(s)! </p>



<p>If you are looking to monitor <strong>specific applications</strong> using the <strong><a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Influx Telegraf</a> agent</strong> (in order to expose the Metrics you require) we have also contributed the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/outputs/warp10" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 Telegraf</a> <strong>connector</strong>, which was recently merged!</p>



<p><em>This looks great so far, but what if I usually push Graphite, Prometheus, Influx or OpenTSDB metrics; <strong>how can I simply migrate to Warp10</strong>? </em>Our answer is <strong>Catalyst</strong>: a proxy layer that is able to parse metrics in each of these formats and convert them to the native Warp10 format. </p>



<h3 class="wp-block-heading">Catalyst</h3>



<p><a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a> is a Go HTTP proxy that parses the write protocols of multiple open source time series databases. At the moment, it supports OpenTSDB, PromQL, Prometheus remote write, Influx and Graphite. Catalyst runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>, starting with the <strong>time series protocol name</strong> followed by the <strong>native query</strong> path. For example, in order to send Influx data, you simply send a request to <code>influxdb/write</code>. Catalyst natively parses the <code>influx</code> data and converts it to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a> native ingest format.</p>
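<p>A highly simplified sketch of that kind of conversion is shown below. The <code>measurement.field</code> class-naming scheme and the single-tag, single-field parsing are illustrative assumptions, not Catalyst&#8217;s exact mapping; Influx timestamps are in nanoseconds, while Warp10 ingests microseconds:</p>

```python
def influx_to_warp10(line):
    # Parse an Influx line protocol entry such as:
    #   "cpu,host=a usage=0.5 1600000000000000000"
    head, fields, ts = line.rsplit(" ", 2)
    measurement, *tags = head.split(",")
    labels = ",".join(tags)
    out = []
    for field in fields.split(","):
        name, value = field.split("=")
        ts_us = int(ts) // 1000  # Influx ns -> Warp10 us
        # Warp10 ingest format: TS// class{labels} value
        out.append(f"{ts_us}// {measurement}.{name}{{{labels}}} {value}")
    return out

print(influx_to_warp10("cpu,host=a usage=0.5 1600000000000000000"))
```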



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="700" height="158" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png" alt="" class="wp-image-17636" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png 700w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5-300x68.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure></div>



<h3 class="wp-block-heading">Metrics queries</h3>



<p>Data collection is an important first step, but we have also considered how existing query Monitoring protocols could be used on top of Warp10. This has led us to implement <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a>. TSL was discussed at length during the <a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series Meetup</a> as well as in this <a href="https://www.ovh.com/blog/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/" data-wpel-link="exclude">blog post</a>. </p>



<p>Now let&#8217;s take a user who is using Telegraf and pushing data to Warp10 with Catalyst. They will wish to use the <strong>native Influx Grafana dashboards</strong>, but how? And what about users who <strong>automate</strong> queries with the <strong>OpenTSDB</strong> query protocol? Our answer was to develop a proxy: <strong>Erlenmeyer</strong>.</p>



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Erlenmeyer</strong></a> is a Go HTTP proxy that enables users to query <strong>Warp 10</strong> using <strong>open source</strong> query protocols. At the moment, it supports multiple open source time series formats, such as PromQL, Prometheus remote read, InfluxQL, OpenTSDB or Graphite. Erlenmeyer runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>, starting with the <strong>time series protocol name</strong> followed by the <strong>native query</strong> path. For example, to run a PromQL query, the user sends a request to <code>prometheus/api/v0/query</code>. Erlenmeyer natively parses the <code>promQL</code> request and then builds a native WarpScript request that any <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> backend can support. </p>
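<p>For example, such a query URL can be assembled as follows (the host, query and timestamp are placeholders; only the path shape comes from the description above):</p>

```python
from urllib.parse import urlencode

def promql_query_url(base, query, ts):
    # Erlenmeyer exposes each protocol under its own path prefix;
    # PromQL instant queries go to prometheus/api/v0/query.
    return f"{base}/prometheus/api/v0/query?" + urlencode({"query": query, "time": ts})

print(promql_query_url("http://localhost:8080", "up", 1606323746))
```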



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="687" height="228" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png" alt="" class="wp-image-17635" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png 687w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E-300x100.png 300w" sizes="auto, (max-width: 687px) 100vw, 687px" /></figure></div>



<h2 class="wp-block-heading">To be continued</h2>



<p>At first, <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> were a quick implementation of native protocols, aimed at helping <strong>internal teams</strong> migrate while still <strong>using a familiar tool</strong>. We have now integrated a lot of the <strong>native functionality</strong> of each protocol, and feel they are ready for sharing. It&#8217;s time to make them <strong>available to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> community</strong>, so we can receive <strong>feedback</strong> and continue to work hard on supporting open source protocols. You can find us in the <strong>OVHcloud Metrics <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">gitter room</a></strong>!</p>



<p>Other Warp10 users may require protocols that are not yet implemented. They will be able to use <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> to support them on their <strong>own Warp10 backends</strong>.</p>



<p><strong>Welcome</strong> <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> &#8211; <strong>Metrics Open Source projects</strong>!</p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Jerem: An Agile Bot</title>
		<link>https://blog.ovhcloud.com/jerem-an-agile-bot/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 16:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Agile Telemetry]]></category>
		<category><![CDATA[Agility]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16943</guid>

					<description><![CDATA[At OVHCloud, we are open sourcing our “Agility Telemetry” project. Jerem, as our data collector, is the main component of this project. Jerem scrapes our JIRA at regular intervals, and extracts specific metrics for each project. It then forwards them to our long-term storage application, the OVHCloud Metrics Data Platform.&#160;&#160; Agility concepts from a developer&#8217;s [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At OVHCloud, we are open sourcing our “Agility Telemetry” project. <strong><a href="https://github.com/ovh/jerem" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jerem</a></strong>, our data collector, is the main component of this project. Jerem scrapes our <strong>JIRA</strong> at regular intervals and extracts <strong>specific metrics</strong> for each project. It then forwards them to our long-term storage application, the <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>.&nbsp;&nbsp;</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg" alt="Jerem: an agile bot" class="wp-image-17160" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Agility concepts from a developer&#8217;s point of view</h3>



<p>To help you understand our goals for <strong>Jerem</strong>, we need to explain some agility concepts we will be using. First, we establish a <strong>technical quarterly roadmap</strong> for a product, which sets out all the <strong>features</strong> we <strong>plan to release</strong> every three months. Each of these features is what we call an <strong>epic</strong>.&nbsp;</p>



<p>For each epic, we identify the tasks that will need to be completed. We then evaluate the complexity of each of those tasks using <strong>story points</strong>, during a team preparation session. A story point reflects the effort required to complete the specific JIRA task. </p>



<p>Then, to advance our roadmap, we conduct regular <strong>sprints</strong>, each corresponding to a period of <strong>ten days</strong>, during which the team takes on several tasks. The number of story points taken into a sprint should match, or be close to, the <strong>team velocity</strong>: the average number of story points that the team is able to complete each day.</p>
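<p>As a rough illustration with hypothetical numbers, velocity and the resulting sprint capacity can be computed like this:</p>

```python
# Hypothetical history: story points completed in each past 10-day sprint.
completed_per_sprint = [38, 42, 40]
sprint_days = 10

# Velocity as defined above: average story points completed per day.
velocity = sum(completed_per_sprint) / (len(completed_per_sprint) * sprint_days)

# The number of story points to take into the next sprint should be close to:
capacity = velocity * sprint_days
print(velocity, capacity)  # 4.0 40.0
```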



<p>However, other urgent tasks may arise unexpectedly during sprints. That’s what we call an <strong>impediment</strong>. We might, for example, need to factor in helping customers, bug fixes, or urgent infrastructure tasks.&nbsp;&nbsp;&nbsp;</p>



<h3 class="wp-block-heading">How Jerem works </h3>



<p>At OVH, we use JIRA to track our activity. Our <strong>Jerem</strong> bot scrapes our <strong>projects</strong> <strong>from</strong> <strong>JIRA</strong> and exports all the necessary data to our <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>. Jerem can also push data to any Warp 10-compatible database. In Grafana, you simply query the Metrics platform (using the <a href="https://github.com/ovh/ovh-warp10-datasource" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 datasource</a>), with, for example, our <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">program management dashboard</a>. All your KPIs are now available in a nice dashboard!</p>
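<p>As a rough illustration of what pushing a datapoint to a Warp 10-compatible backend looks like, here is a minimal Python sketch. The endpoint URL, token, and helper names are placeholders of ours; the line itself follows Warp 10&#8217;s plain-text ingress format (<code>TS// class{labels} value</code>, with timestamps in microseconds):</p>

```python
import time
import urllib.request

WARP10_URL = "https://warp10.example.org/api/v0/update"  # placeholder endpoint
WRITE_TOKEN = "YOUR_WRITE_TOKEN"                         # placeholder token

def format_datapoint(name, labels, value, ts_us=None):
    """One line of Warp 10's plain-text ingress format: 'TS// class{labels} value'."""
    ts_us = ts_us or int(time.time() * 1_000_000)  # Warp 10 timestamps are in microseconds
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{ts_us}// {name}{{{label_str}}} {value}"

def push(line):
    """POST one datapoint to the /api/v0/update endpoint."""
    req = urllib.request.Request(
        WARP10_URL,
        data=line.encode(),
        headers={"X-Warp10-Token": WRITE_TOKEN},
        method="POST",
    )
    return urllib.request.urlopen(req)  # raises on HTTP errors

line = format_datapoint(
    "jerem.jira.epic.storypoint", {"project": "SAN"}, 14, ts_us=1600000000000000)
print(line)  # → 1600000000000000// jerem.jira.epic.storypoint{project=SAN} 14
```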



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="256" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg" alt="" class="wp-image-17164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-300x75.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-768x192.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1536x384.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45.jpeg 1720w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Discover Jerem metrics</h3>



<p>Now that we have an overview of the main Agility concepts involved, let&#8217;s dive into Jerem! How do we convert those Agility concepts into metrics? First of all, we&#8217;ll retrieve all metrics related to epics (i.e. new features). Then, we will have a deep look at the sprint metrics.</p>



<h4 class="wp-block-heading">Epic data</h4>



<p>To explain Jerem epic metrics, we&#8217;ll start by creating a new one. In this example, we called it <code>Agile Telemetry</code>. We add a Q2-20 label, which means that we plan to release it for Q2. To record an epic with Jerem, you need to set a quarter for the final delivery! Next, we&#8217;ll simply add four tasks, as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="651" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Epic-1-1024x651.png" alt="" class="wp-image-16984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-1024x651.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-300x191.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-768x489.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1.png 1182w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To get the metrics, we need to evaluate each individual task. We do this together during preparation sessions. In this example, we have assigned story points to each task. For example, we estimated the <code>write a BlogPost about Jerem</code> task as a 3.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="472" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png" alt="" class="wp-image-16957" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1536x709.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10.png 1697w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a result, Jerem now has everything it needs to start collecting epic metrics. This example provides five metrics:</p>



<ul class="wp-block-list"><li><code>jerem.jira.epic.storypoint</code>: the total number of story points needed to complete this epic. The value here is 14 (the sum of all the epic&#8217;s story points). This metric will evolve whenever the epic is updated by adding or removing tasks.</li><li><code>jerem.jira.epic.storypoint.done</code>: the number of completed story points. In our example, we have already completed <code>Write Jerem bot</code> and <code>Deploy Jerem Bot</code>, which account for eight story points.</li><li><code>jerem.jira.epic.storypoint.inprogress</code>: the number of story points for &#8216;in progress&#8217; tasks, such as <code>Write a BlogPost about Jerem</code>.</li><li><code>jerem.jira.epic.unestimated</code>: the number of unestimated tasks, shown as <code>Unestimated Task</code> in our example.</li><li><code>jerem.jira.epic.dependency</code>: the number of tasks that have dependency labels, indicating that they are mandatory for other epics or projects.</li></ul>
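<p>As an illustrative sketch of how these five metrics could be derived from an epic&#8217;s task list (the field names and story-point values below are our assumptions, not Jerem&#8217;s actual JIRA schema):</p>

```python
# Hypothetical task list; statuses and points are invented for illustration.
tasks = [
    {"summary": "Write Jerem bot",              "points": 5,    "status": "Done",        "labels": []},
    {"summary": "Deploy Jerem Bot",             "points": 3,    "status": "Done",        "labels": []},
    {"summary": "Write a BlogPost about Jerem", "points": 3,    "status": "In Progress", "labels": []},
    {"summary": "Unestimated Task",             "points": None, "status": "To Do",       "labels": []},
]

def epic_metrics(tasks):
    """Derive the five per-epic metrics from a list of JIRA tasks."""
    return {
        "jerem.jira.epic.storypoint": sum(t["points"] or 0 for t in tasks),
        "jerem.jira.epic.storypoint.done": sum(
            t["points"] or 0 for t in tasks if t["status"] == "Done"),
        "jerem.jira.epic.storypoint.inprogress": sum(
            t["points"] or 0 for t in tasks if t["status"] == "In Progress"),
        "jerem.jira.epic.unestimated": sum(
            1 for t in tasks if t["points"] is None),
        "jerem.jira.epic.dependency": sum(
            1 for t in tasks if "dependency" in t["labels"]),
    }

print(epic_metrics(tasks)["jerem.jira.epic.storypoint.done"])  # → 8
```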



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="443" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metris-epics-1024x443.png" alt="" class="wp-image-16958" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1024x443.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-300x130.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-768x332.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1536x665.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics.png 1784w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This way, for each epic in a project, Jerem collects five unique metrics.  </p>



<h4 class="wp-block-heading">Sprint data</h4>



<p>To complete epic tasks, we work using a <strong>sprint</strong> process. When doing sprints, we want to provide a lot of <strong>insights</strong> into our <strong>achievements</strong>. That&#8217;s why Jerem collects sprint data too! </p>



<p>So let&#8217;s open a new sprint in JIRA and start working on our task. This gives us the following JIRA view:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="241" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Sprint-ui-1024x241.png" alt="" class="wp-image-16963" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1024x241.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-300x71.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-768x181.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1536x362.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui.png 1804w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Jerem collects the following metrics for each sprint:&nbsp;</p>



<ul class="wp-block-list"><li><code>jerem.jira.sprint.storypoint.total</code>: the total number of story points onboarded into a sprint.</li><li><code>jerem.jira.sprint.storypoint.inprogress</code>: the number of story points currently in progress within a sprint.</li><li><code>jerem.jira.sprint.storypoint.done</code>: the number of story points currently completed within a sprint.</li><li><code>jerem.jira.sprint.events</code>: the &#8216;start&#8217; and &#8216;end&#8217; dates of sprint events, recorded as Warp10 string values.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png" alt="" class="wp-image-16964" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-300x141.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-768x360.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1536x720.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the Metrics view above, we record every sprint metric twice. We do this to provide a quick view of the active sprint, which is why we use the &#8216;current&#8217; label. This also enables us to query past sprints, using the real sprint name. Of course, an active sprint can also be queried using its name.</p>



<h4 class="wp-block-heading">Impediment data</h4>



<p>Starting a sprint means you need to know all the tasks you will be working on over the next few days. But how can we track and measure unplanned tasks? For example, the very urgent one from your manager, or a teammate who needs a bit of help?</p>



<p>We can add special tickets in JIRA to keep track of those tasks. That&#8217;s what we call an &#8216;impediment&#8217;. They are labelled according to their nature. If, for example, production requires your attention, then it&#8217;s an &#8216;Infra&#8217; impediment. You will also retrieve metrics for &#8216;Total&#8217; (all kinds of impediments), &#8216;Excess&#8217; (unplanned tasks), &#8216;Support&#8217; (helping teammates), and &#8216;Bug fixes or other&#8217; (all other kinds of impediment).</p>



<p>Each impediment belongs to the active sprint it was closed in. To close an impediment, you only have to flag it as &#8216;Done&#8217; or &#8216;Closed&#8217;.</p>



<p>We also retrieve metrics like:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.TYPE.count</code>: the number of impediments that occurred during a sprint.</li><li><code>jerem.jira.impediment.TYPE.timespent</code>: the amount of time spent on impediments during a sprint.</li></ul>



<p><code>TYPE</code> corresponds to the <strong>kind</strong> of recorded impediment. As we didn&#8217;t open any actual impediments, Jerem collects only the <code>total</code> metrics.</p>
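<p>To make the <code>TYPE</code> naming concrete, here is an illustrative sketch of aggregating impediment tickets into per-type <code>count</code> and <code>timespent</code> series, with every impediment also feeding the <code>total</code> series. The field names and hour-based durations are our assumptions, not Jerem&#8217;s actual schema:</p>

```python
from collections import defaultdict

# Invented tickets, for illustration only.
impediments = [
    {"type": "infra",   "timespent_h": 2.0},
    {"type": "support", "timespent_h": 0.5},
    {"type": "infra",   "timespent_h": 1.0},
]

def impediment_metrics(impediments):
    """Aggregate tickets into jerem.jira.impediment.TYPE.{count,timespent}."""
    metrics = defaultdict(float)
    for imp in impediments:
        for t in (imp["type"], "total"):  # every impediment also counts in 'total'
            metrics[f"jerem.jira.impediment.{t}.count"] += 1
            metrics[f"jerem.jira.impediment.{t}.timespent"] += imp["timespent_h"]
    return dict(metrics)

print(impediment_metrics(impediments)["jerem.jira.impediment.total.count"])  # → 3.0
```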



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png" alt="" class="wp-image-16965" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-768x325.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1536x650.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment.png 1773w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To start recording impediments, we simply create a new JIRA task, to which we add an &#8216;impediment&#8217; label. We also set its nature, and the actual time spent on it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="819" height="903" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png" alt="" class="wp-image-16967" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png 819w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-768x847.png 768w" sizes="auto, (max-width: 819px) 100vw, 819px" /></figure>



<p>For impediments, we also retrieve the global metrics that Jerem always records:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.total.created</code>: the time spent from the creation date to complete an impediment. This allows us to retrieve a total impediment count. We can also record all impediment actions, even outside sprints.&nbsp;</li></ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>For a single JIRA project, like our example, you can expect around 300 metrics. This number will vary depending on the epics you create and flag in JIRA, and the ones you close.</p></blockquote>



<h3 class="wp-block-heading">Grafana dashboard</h3>



<p>We love building Grafana dashboards! They provide both the team and the manager with a lot of insights into KPIs. The best part of it for me, as a developer, is that I can see why it&#8217;s worth filling in a JIRA task!</p>



<p>In our first <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Grafana dashboard</a>, you will find all the key program management KPIs. Let&#8217;s start with the global overview:</p>



<h4 class="wp-block-heading">Quarter data overview</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png" alt="" class="wp-image-16968" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox.png 1840w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here, you will find the epics currently in progress, as well as the global team KPIs, such as predictability, velocity, and impediment stats. This is where the magic happens! When filled in correctly, this dashboard shows you exactly what your team should deliver in the current quarter. This means you have quick access to all the important current subjects. You will also be able to see if your team is expected to deliver too many subjects, so you can quickly take action and delay some of the new features.</p>



<h4 class="wp-block-heading">Active sprint data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/sprintdata-1024x286.png" alt="" class="wp-image-16969" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-768x214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1536x428.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata.png 1839w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The active sprint data panel is often a great support during our daily meetings. In this panel, we get a quick overview of the team&#8217;s achievements, and can establish the time spent on parallel tasks. </p>



<h4 class="wp-block-heading">Detailed data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/detail-KPI-1024x286.png" alt="" class="wp-image-16970" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-768x215.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1536x429.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI.png 1847w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The last part provides more detailed data. Using the epic Grafana variable, we can check specific epics, along with the completion of more global projects. There is also a <code>velocity chart</code>, which plots past sprints and compares the expected story points to the ones actually completed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The Grafana dashboard is directly available in the Jerem project. You can import it directly into Grafana, provided you have a valid Warp 10 datasource configured. </p><p>To make the dashboard work as required, you have to configure the Grafana project variable in the form of a WarpScript list: <code>[ 'SAN' 'OTHER-PROJECT' ]</code>. If our program manager can do it, I am sure you can! 😉 </p></blockquote>



<p>Setting up Jerem and automatically loading program management data gives us a lot of insights. As a developer, I really enjoy it, and I&#8217;ve quickly become used to tracking far more events in JIRA than I did before. You can directly see the impact of your tasks. For example, you can see how quickly the roadmap is advancing, and identify the bottlenecks that are causing impediments. Those bottlenecks then become epics. In other words, once we start using Jerem, we just auto-fill it! I hope you will enjoy it too! If you have any feedback, we would love to hear it.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Contributing to Apache HBase: custom data balancing</title>
		<link>https://blog.ovhcloud.com/contributing-to-apache-hbase-custom-data-balancing/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Fri, 14 Feb 2020 16:37:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16524</guid>

<description><![CDATA[In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring. The context Have you ever wondered how: we generate the graphs for your OVHcloud server or web hosting package? our internal teams monitor their [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg" alt="Contributing to Apache HBase: custom data balancing" class="wp-image-17086" width="1024" height="537" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">The context</h3>



<p>Have you ever wondered how:</p>



<ul class="wp-block-list"><li>we generate the graphs for your OVHcloud server or web hosting package? </li><li>our internal teams monitor their own servers and applications?</li></ul>



<p><strong>All internal teams are constantly gathering telemetry and monitoring data</strong> and sending them to a <strong>dedicated team,</strong> who are responsible for <strong>handling all the metrics and logs generated by OVHcloud&#8217;s infrastructure</strong>: the Observability team.</p>



<p>We tried a lot of different <strong>Time Series databases</strong>, and eventually chose <a href="https://warp10.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> to handle our workloads. <strong>Warp10</strong> can be integrated with the various <strong>big-data solutions</strong> provided by the <a href="https://www.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Foundation.</a> In our case, we use <a href="http://hbase.apache.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a> as the long-term storage datastore for our metrics. </p>



<p><a href="http://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, a datastore built on top of <a href="http://hadoop.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hadoop</a>, provides <strong>an elastic, distributed, key-ordered map.</strong> As such, one of the key features of Apache HBase for us is the ability to <strong>scan</strong>, i.e. retrieve a range of keys. Thanks to this feature, we can fetch <strong>thousands of datapoints in an optimised way</strong>.</p>



<p>We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:</p>



<ul class="wp-block-list"><li>between 1.6 and 2 million writes per second, 24/7</li><li>between 4 and 6 million reads per second</li><li>around 300TB of telemetry, stored within Apache HBase</li></ul>



<p>As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding repartition, as <strong>every</strong> <strong>bit is hot data, and should be accessible at any time</strong>. Let&#8217;s dive in!</p>



<h3 class="wp-block-heading">How does balancing work in Apache HBase?</h3>



<p>Before diving into the balancer, let&#8217;s take a look at how Apache HBase organises data. Data is split into shards called <code>Regions</code>, and distributed across <code>RegionServers</code>. The number of regions increases as data comes in, and regions are split as a result. This is where the <code>Balancer</code> comes in. It <strong>moves regions</strong> to avoid hotspotting a single <code>RegionServer</code>, and effectively distributes the load.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg" alt="" class="wp-image-17007" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The actual implementation, called <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StochasticBalancer</a>, uses <strong>a cost-based approach:</strong></p>



<ol class="wp-block-list"><li>It first computes the <strong>overall cost</strong> of the cluster, by looping through the <code>cost functions</code>. Every cost function <strong>returns a number between 0 and 1 inclusive</strong>, where 0 is the lowest-cost, best solution, and 1 is the highest-cost, worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, and the number of regions per RegionServer&#8230; The computed costs are <strong>scaled by their respective coefficients, defined in the configuration</strong>. </li><li>Now that the initial cost is computed, we can try to <code>Mutate</code> our cluster. For this, the Balancer creates a random <code>nextAction</code>, which could be something like <strong>swapping two regions</strong>, or <strong>moving one region to another RegionServer</strong>. The action is <strong>applied virtually</strong>, and then the <strong>new cost is calculated</strong>. If the new cost is lower than the previous one, the action is stored. If not, it is skipped. This operation is repeated <code>thousands of times</code>, hence the <code>Stochastic</code>. </li><li>At the end,<strong> the list of valid actions is applied to the actual cluster. </strong></li></ol>
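<p>The loop described above can be sketched in a few lines. This is a toy illustration rather than HBase&#8217;s actual Java implementation: the cluster is reduced to a map of RegionServers to regions, the only mutation is moving a single region, and the single cost function measures region-count skew:</p>

```python
import random
import statistics

def skew_cost(cluster):
    """Region-count skew, normalised to [0, 1]: 0 when perfectly even."""
    counts = [len(regions) for regions in cluster.values()]
    total = sum(counts)
    # worst case: every region piled onto a single server
    worst = statistics.pstdev([total] + [0] * (len(counts) - 1))
    return statistics.pstdev(counts) / worst if worst else 0.0

def balance(cluster, steps=2000, seed=42):
    """Stochastic loop: propose a random move, keep it only if the cost drops."""
    rng = random.Random(seed)
    servers = list(cluster)
    cost = skew_cost(cluster)
    moves = []
    for _ in range(steps):
        src, dst = rng.sample(servers, 2)  # random "nextAction": move one region
        if not cluster[src]:
            continue
        region = cluster[src].pop()        # apply the action virtually
        cluster[dst].append(region)
        new_cost = skew_cost(cluster)
        if new_cost < cost:                # improvement: store the action
            cost = new_cost
            moves.append((region, src, dst))
        else:                              # no improvement: undo it
            cluster[dst].pop()
            cluster[src].append(region)
    return moves

cluster = {"rs1": [f"r{i}" for i in range(9)], "rs2": ["r9"], "rs3": []}
balance(cluster)
print(sorted(len(regions) for regions in cluster.values()))  # → [3, 3, 4]
```

<p>In the real <code>StochasticLoadBalancer</code>, the proposed actions also include region swaps, and the cost is a weighted sum of many functions (locality, load, table skew&#8230;), but the accept-if-cheaper loop is the same idea.</p>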



<h3 class="wp-block-heading">What was not working for us?</h3>



<p>We found out that <strong>for our specific use case</strong>, which involved:</p>



<ul class="wp-block-list"><li>Single table</li><li>Dedicated Apache HBase and Apache Hadoop, <strong>tailored for our requirements</strong></li><li>Good key distribution</li></ul>



<p>&#8230; <strong>the number of regions per RegionServer was the real limit for us</strong>.</p>



<p>Even if the balancing strategy seems simple, <strong>we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital</strong>, especially in cloud environments, because you <strong>may not be able to buy the same server specs again in the future.</strong> In our case, our cluster grew from 80 to ~250 machines in four years. Throughout that time, we bought new dedicated server references, and even tested some special internal references.</p>



<p>We ended up with different groups of hardware: <strong>some servers can handle only 180 regions, whereas the biggest can handle more than 900</strong>. Because of this disparity, we had to disable the Load Balancer to avoid the <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L1194" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">RegionCountSkewCostFunction</a>, which would try to bring all RegionServers to the same number of regions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg" alt="RegionCountSkewCostFunction balancing" class="wp-image-17010" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Two years ago, we developed some internal tools responsible for load balancing regions across RegionServers. The tooling worked really well for our use case, simplifying the day-to-day operation of our cluster. </p>



<p><strong>Open source is in the DNA of OVHcloud</strong>, which means that we build our tools on open-source software, but also that we <strong>contribute</strong> and give back to the community. When we talked to others, we saw that we weren&#8217;t the only ones concerned by the heterogeneous cluster problem. We decided to rewrite our tooling to make it more general, and to <strong>contribute</strong> it <strong>directly upstream</strong> to the HBase project.</p>



<h3 class="wp-block-heading">Our contributions</h3>



<p>The first contribution was pretty simple: the cost function list was a <a href="https://github.com/apache/hbase/blob/8cb531f207b9f9f51ab1509655ae59701b66ac37/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L199-L213" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">constant</a>. We <a href="https://github.com/apache/hbase/commit/836f26976e1ad8b35d778c563067ed0614c026e9" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">added the possibility to load custom cost functions</a>.</p>



<p>The second contribution was about <a href="https://github.com/apache/hbase/commit/42d535a57a75b58f585b48df9af9c966e6c7e46a" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">adding an optional costFunction to balance regions according to a capacity rule</a>.</p>



<h3 class="wp-block-heading">How does it work?</h3>



<p>The balancer will load a file containing lines of rules. <strong>A rule is composed of a regexp for the hostname, and a limit.</strong> For example, we could have:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>RegionServers whose <strong>hostnames match the first rule will have a limit of 200</strong>, and <strong>the others 50</strong>. If there&#8217;s no match, a default is applied.</p>
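<p>The matching logic can be sketched as follows. The parsing details and the default limit of 100 are our assumptions, not the exact HBase code; we also assume a regexp must match the whole hostname, so that <code>rs12</code> falls under the second rule rather than the first:</p>

```python
import re

RULES = """\
rs[0-9] 200
rs1[0-9] 50
"""
DEFAULT_LIMIT = 100  # assumed default when no rule matches

def parse_rules(text):
    """Parse 'regexp limit' lines into (compiled pattern, limit) pairs."""
    rules = []
    for line in text.strip().splitlines():
        pattern, limit = line.rsplit(" ", 1)
        rules.append((re.compile(pattern), int(limit)))
    return rules

def limit_for(hostname, rules, default=DEFAULT_LIMIT):
    for pattern, limit in rules:
        if pattern.fullmatch(hostname):  # whole-hostname match (our assumption)
            return limit
    return default

rules = parse_rules(RULES)
print(limit_for("rs3", rules), limit_for("rs12", rules), limit_for("db1", rules))
# → 200 50 100
```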



<p>Thanks to these rules, we have two key pieces of information:</p>



<ul class="wp-block-list"><li>the <strong>maximum number of regions for this cluster</strong></li><li>the <strong>rule for each server</strong></li></ul>



<p>The <code>HeterogeneousRegionCountCostFunction</code> will try to <strong>balance regions, according to their capacity.</strong></p>



<p>Let&#8217;s take an example&#8230; Imagine that we have 20 RS:</p>



<ul class="wp-block-list"><li>10 RS, named <code>rs0</code> to <code>rs9</code>, loaded with 60 regions each, which can each handle 200 regions.</li><li>10 RS, named <code>rs10</code> to <code>rs19</code>, loaded with 60 regions each, which can each handle 50 regions.</li></ul>



<p>So, based on the following rules:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>&#8230; we can see that the <strong>second group is overloaded</strong>, whereas the first group has plenty of space.</p>



<p>We know that we can handle a maximum of <strong>2,500 regions</strong> (200&#215;10 + 50&#215;10), and we currently have <strong>1,200 regions</strong> (60&#215;20). As such, the <code>HeterogeneousRegionCountCostFunction</code> will understand that the cluster is <strong>48.0% full</strong> (1200/2500). Based on this information, the balancer will then <strong>try to bring every RegionServer to ~48% of its own capacity, according to the rules.</strong></p>
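<p>The arithmetic behind this can be sketched as follows. The per-server targets shown are our way of illustrating what converging towards ~48% means for each group of servers, not the actual cost computation inside HBase:</p>

```java
public class Main {
    // Usage ratio of the cluster: total regions divided by total capacity.
    static double clusterUsage(int[] limits, int[] regionCounts) {
        int capacity = 0;
        int regions = 0;
        for (int limit : limits) capacity += limit;
        for (int count : regionCounts) regions += count;
        return (double) regions / capacity;
    }

    public static void main(String[] args) {
        // The example cluster: 10 servers capped at 200 regions,
        // 10 capped at 50, each currently holding 60 regions.
        int[] limits = new int[20];
        int[] counts = new int[20];
        for (int i = 0; i < 20; i++) {
            limits[i] = (i < 10) ? 200 : 50; // rs0-rs9 vs rs10-rs19
            counts[i] = 60;
        }
        double usage = clusterUsage(limits, counts);
        System.out.println(usage); // 1200 / 2500 = 0.48

        // Converging to ~48% means a per-server target of limit * usage:
        System.out.println(Math.round(limits[0] * usage));  // 96 regions for rs0-rs9
        System.out.println(Math.round(limits[10] * usage)); // 24 regions for rs10-rs19
    }
}
```

<p>Running this prints the 0.48 usage ratio, and per-server targets of 96 and 24 regions for the large and small servers respectively.</p>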



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg" alt="HeterogeneousRegionCountCostFunction balancing" class="wp-image-17084" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Where to next?</h3>



<p>Thanks to Apache HBase&#8217;s contributors, our patches are now <strong>merged</strong> into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This <strong>will allow more automation on our side, and ease operations for the Observability Team.</strong></p>



<p>Contributing was an awesome journey. What I love most about open source is the ability to contribute back, and build stronger software. We <strong>had an opinion</strong> about how a particular issue should be addressed, but <strong>the discussions with the community helped us to refine it</strong>. We spoke with <strong>engineers from other companies, who were struggling with Apache HBase&#8217;s cloud deployments, just as we were</strong>, and thanks to those exchanges, <strong>our contribution became more and more relevant.</strong></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontributing-to-apache-hbase-custom-data-balancing%2F&amp;action_name=Contributing%20to%20Apache%20HBase%3A%20custom%20data%20balancing&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL (or how to query time series databases)</title>
		<link>https://blog.ovhcloud.com/tsl-or-how-to-query-time-series-databases/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 31 Jan 2020 13:41:34 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16734</guid>

					<description><![CDATA[Last year, we released TSL as an open source tool to query a Warp 10 platform, and by extension, the OVHcloud Metrics Data Platform. But how has it evolved since then? Is TSL ready to query other time series databases? What about TSL states on the Warp10 eco-system? TSL to query many time series databases [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Last year, we released <a href="https://github.com/ovh/tsl/)" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a> as an <strong>open source tool</strong> to <strong>query</strong> a<strong> Warp 10</strong> platform, and by extension, the <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude"><strong>OVHcloud Metrics Data Platform</strong></a>. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg" alt="TSL by OVHcloud" class="wp-image-16774" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-768x405.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A.jpeg 1202w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how has it evolved since then? Is TSL ready to query <strong>other time series databases</strong>? And what is the state of TSL within the <strong>Warp 10 ecosystem</strong>?</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">TSL to query many time series databases</h3>



<p>We wanted TSL to be usable in front of <strong>multiple time series databases</strong>. That&#8217;s why we also released a <strong>PromQL query generator</strong>.</p>



<p>One year later, we now know this wasn&#8217;t the way to go. Based on what we learned, we open sourced the <strong><a href="https://github.com/aurrelhebert/TSL-Adaptor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor</a> project</strong> as a proof of concept for how TSL can be used to query a <em>non-Warp 10</em> database. Put simply, TSL-Adaptor allows TSL to <strong>query an InfluxDB</strong>.</p>



<h4 class="wp-block-heading">What is TSL-Adaptor?</h4>



<p>TSL-Adaptor is a <strong><a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> Java REST API</strong> that can be used to query a backend. TSL-Adaptor parses the TSL query, identifies the fetch operation, and natively loads raw data from the backend. It then computes the TSL operations on the data, before returning the result to the user. The main goal of TSL-Adaptor is <strong>to make TSL available</strong> on top of <strong>other TSDBs</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="971" height="702" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png" alt="" class="wp-image-16866" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png 971w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-300x217.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-768x555.png 768w" sizes="auto, (max-width: 971px) 100vw, 971px" /></figure></div>



<p>In concrete terms, we are running a Java REST API that <strong>integrates the WarpScript library</strong> in its runtime. TSL is then used to compile the query into a valid WarpScript one. This is <strong>fully transparent</strong>: users only ever deal with TSL queries.</p>



<p>To load raw data from InfluxDB, we created a WarpScript extension. This extension integrates an abstract class, <code>LOADSOURCERAW</code>, that needs to be implemented to create a TSL-Adaptor data source. This requires only two methods: <code>find</code> and <code>fetch</code>. <code>Find</code> gathers all series selectors matching a query (class names, tags or labels), while <code>fetch</code> actually retrieves the raw data within a time span.</p>
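<p>As an illustration, here is what a minimal <code>find</code>/<code>fetch</code> pair could look like over a toy in-memory store. The names and signatures below are ours, for illustration only, not the actual <code>LOADSOURCERAW</code> API:</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class Main {
    // A raw data point: a timestamp and a value.
    record Point(long ts, double value) {}

    // Toy in-memory store, keyed by series name.
    static final Map<String, List<Point>> STORE = new LinkedHashMap<>();

    static {
        STORE.put("disk.used", List.of(new Point(0, 1.0), new Point(10, 2.0)));
        STORE.put("cpu.idle", List.of(new Point(5, 0.5)));
    }

    // find: list all series whose name matches the selector regexp.
    static List<String> find(String selector) {
        Pattern p = Pattern.compile(selector);
        List<String> names = new ArrayList<>();
        for (String name : STORE.keySet()) {
            if (p.matcher(name).matches()) names.add(name);
        }
        return names;
    }

    // fetch: retrieve the raw points of one series within [start, end].
    static List<Point> fetch(String name, long start, long end) {
        List<Point> points = new ArrayList<>();
        for (Point pt : STORE.getOrDefault(name, List.of())) {
            if (pt.ts() >= start && pt.ts() <= end) points.add(pt);
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(find("disk.*"));           // only the disk series
        System.out.println(fetch("disk.used", 0, 5)); // points with ts in [0, 5]
    }
}
```

<p>Here <code>find</code> resolves the selector to matching series, and <code>fetch</code> keeps only the points whose timestamps fall within the requested time span — the same division of labour the extension expects.</p>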



<h4 class="wp-block-heading">Query Influx with TSL-Adaptor</h4>



<p>To get started, simply run an <a href="https://www.influxdata.com/products/influxdb-overview/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">InfluxDB</a> instance locally on port 8086. Then, start an influx <a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Telegraf</a> agent and record the Telegraf data on the local influx instance.</p>



<p>Next, make sure you have installed TSL-Adaptor locally and updated its config with the path to a <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>tsl.so</code> library</a>.</p>



<p>To specify a custom influx address or databases, update the <a href="https://github.com/aurrelhebert/TSL-Adaptor/blob/master/src/main/resources/application.properties" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor configuration</a> file accordingly.</p>



<p>You can start TSL-Adaptor with the following example command:</p>



<pre class="wp-block-code"><code class="">java -XX:TieredStopAtLevel=1 -Xverify:none -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar build/tsl-adaptor-0.0.1-SNAPSHOT-runner.jar </code></pre>



<p>And that&#8217;s it! You can now query your influx database with TSL and TSL-Adaptor.</p>



<p>Let&#8217;s start by retrieving the time series relating to disk measurements.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk")'</code></pre>



<p>Now let&#8217;s use TSL&#8217;s analytics power!</p>



<p>First, we would like to retrieve only the data containing a mode set to <code>rw</code>.&nbsp;</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw")'</code></pre>



<p>Next, we would like to retrieve the maximum value for every five-minute interval over the last 20 minutes. The TSL query will therefore be:</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw").last(20m).sampleBy(5m,max)'</code></pre>



<p>Now it&#8217;s your turn to have some fun with TSL and InfluxDB. You can find details of all the implemented functions in the <a href="https://github.com/ovh/tsl/blob/master/spec/doc.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL documentation</a>. Enjoy exploring!</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">What&#8217;s new on TSL with Warp10?</h3>



<p>We originally built TSL as a <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GO proxy</a> in front of Warp 10. <strong>TSL</strong> has now joined the Warp 10 ecosystem, as a <strong>Warp 10 extension</strong>, or as a <strong>WASM library</strong>! We have also added some <strong>new native TSL functions</strong> to make the language even richer!</p>



<h4 class="wp-block-heading">TSL as a WarpScript function</h4>



<p>To make TSL work as a Warp 10 function, you need to have the <code>tsl.so</code> library available locally. This library can be found in the <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL GitHub repository</a>. We have also made a <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL WarpScript extension</a> available from <a href="https://warpfleet.senx.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>, the extension repository of the Warp 10 community.</p>



<p>To set up the TSL extension on your Warp 10 instance, simply download the JAR indicated in <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>. You can then configure the extension in the Warp 10 configuration file:</p>



<pre class="wp-block-code"><code class="">warpscript.extensions = io.ovh.metrics.warp10.script.ext.tsl.TSLWarpScriptExtension
warpscript.tsl.libso.path = &lt;PATH_TO_THE_tsl.so_FILE></code></pre>



<p>Once you have rebooted Warp 10, you are ready to go. You can check that it&#8217;s working by running the following query:</p>



<pre class="wp-block-code"><code class="">// You will need to put here a valid Warp10 token when computing a TSL select statement
// '&lt;A_VALID_WARP_10_TOKEN>' 

// A valid TOKEN isn't needed on the create series statement in this example
// You can simply put an empty string
''

// Native TSL create series statement
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])) 
'>
TSL</code></pre>



<p>With the WarpScript TSL function, you can use native WarpScript variables in your script, as shown in the example below:</p>



<pre class="wp-block-code"><code class="">// Set a Warp10 variable

NOW 'test' STORE

'' 

// A Warp10 variable can be reused in TSL script as a native TSL variable
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])).add(test)
'>
TSL</code></pre>



<h4 class="wp-block-heading">TSL WASM</h4>



<p>To expand TSL&#8217;s potential uses, we have also exported it as a Wasm library, so you can use it directly in a browser! The Wasm version of the library parses TSL queries locally and generates the WarpScript. The result can then be used to query a Warp 10 backend. You will find more details on the <a href="https://github.com/ovh/tsl#use-tsl-with-webassembly" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL github</a>.</p>



<h4 class="wp-block-heading">TSL&#8217;s new features</h4>



<p>As TSL has grown in popularity, we have detected and fixed a few bugs, and also added some additional native functions to accommodate new use cases.</p>



<p>We added the <code>setLabelFromName</code> method, to set a new label on a series, based on its name. This label can be the exact series name, or the result of a regular expression.</p>



<p>We also completed the <code>sort</code> method, to allow users to sort their series set based on series metadata (i.e. selector, name or labels).</p>



<p>Finally, we added a <code>filterWithoutLabels</code> function, to filter a series set and remove any series that do not contain specific labels.</p>
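<p>To give an idea of the <code>setLabelFromName</code> behaviour described above — a label that is either the exact series name or the result of a regular expression — here is the extraction logic sketched in plain Java. This is an illustration of the described semantics, not TSL&#8217;s implementation:</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Derive a label value from a series name: if the regexp matches and has a
    // capturing group, use the first group; otherwise fall back to the name.
    static String labelFromName(String seriesName, String regexp) {
        Matcher m = Pattern.compile(regexp).matcher(seriesName);
        if (m.find() && m.groupCount() >= 1) {
            return m.group(1);
        }
        return seriesName;
    }

    public static void main(String[] args) {
        // Extract a host label from a hypothetical name like "cpu.usage.host42"
        System.out.println(labelFromName("cpu.usage.host42", "\\.(host[0-9]+)$")); // host42
        // No capturing group: keep the exact series name as the label
        System.out.println(labelFromName("cpu.usage.host42", "cpu.*")); // cpu.usage.host42
    }
}
```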



<p>Thanks for reading! I hope you will give TSL a try, as I would love to hear your feedback.  </p>



<hr class="wp-block-separator"/>



<h2 class="wp-block-heading">Paris Time Series meetup</h2>



<p>We are delighted to soon be <strong>hosting</strong> the third <strong><a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series meetup</a></strong>, organised by Nicolas Steinmetz, at the <strong>OVHcloud office</strong> in Paris. During this meetup, we will be speaking about TSL, as well as listening to an introduction to the Redis Time Series platform.</p>



<p>If you are available, we will be happy to meet you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>IOT: Pushing data to OVHcloud metrics timeseries from Arduino</title>
		<link>https://blog.ovhcloud.com/iot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino/</link>
		
		<dc:creator><![CDATA[Cyrille Meichel]]></dc:creator>
		<pubDate>Thu, 24 Oct 2019 12:56:04 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Arduino]]></category>
		<category><![CDATA[IOT]]></category>
		<category><![CDATA[Metrics]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16087</guid>

					<description><![CDATA[Last spring, I built a wood oven in my garden. I&#8217;ve wanted to have one for years, and I finally decided to make it. To use it, I make a big fire inside for two hours, remove all the embers, and then it&#8217;s ready for cooking. The oven accumulates the heat during the fire and [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fiot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino%2F&amp;action_name=IOT%3A%20Pushing%20data%20to%20OVHcloud%20metrics%20timeseries%20from%20Arduino&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/IMG_0417-1024x537.jpg" alt="From Arduino to OVHcloud Metrics" class="wp-image-16156" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-1024x537.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-300x157.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417-768x403.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_0417.jpg 1199w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Last spring, I built a wood oven in my garden. I&#8217;ve wanted to have one for years, and I finally decided to make it. To use it, I make a big fire inside for two hours, remove all the embers, and then it&#8217;s ready for cooking. The oven accumulates the heat during the fire and then releases it.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/IMG_20190820_162149.jpg" alt="" class="wp-image-16089" width="252" height="225" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149.jpg 800w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149-300x268.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/IMG_20190820_162149-768x686.jpg 768w" sizes="auto, (max-width: 252px) 100vw, 252px" /></figure></div>



<p>Once the embers are removed, I have to prioritise the dishes I want to cook as the temperature drops:</p>



<ul class="wp-block-list"><li>Pizza: 280°C</li><li>Bread: 250°C</li><li>Rice pudding: 180°C</li><li>Meringues: 100°C</li></ul>



<p>I built a first version of a thermometer with an Arduino, to be able to check the temperature. This thermometer, made of a thermocouple (i.e. a sensor that measures high temperatures), displays the inside temperature on a little LCD screen.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/thermocouple-arduino-1024x324.png" alt="" class="wp-image-16090" width="461" height="145" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-1024x324.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-300x95.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino-768x243.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/thermocouple-arduino.png 1159w" sizes="auto, (max-width: 461px) 100vw, 461px" /></figure></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/diagram-1024x598.png" alt="" class="wp-image-16092" width="338" height="197" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-1024x598.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-300x175.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram-768x449.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/diagram.png 1364w" sizes="auto, (max-width: 338px) 100vw, 338px" /></figure></div>



<p>The next step was to anticipate when to stuff dishes into the oven. Watching the temperature drop for hours was not a good idea. I needed the heat diagram of my oven! A heat diagram is just the chart of the temperature over a given period of time. But writing down the temperature on paper every ten minutes&#8230; wait&#8230; that would last more than 30 hours.</p>



<p><strong>Please, let me sleep!</strong></p>



<p>This needs some automation. Fortunately, OVHcloud has the solution: Metrics Data Platform: <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a></p>



<h2 class="wp-block-heading">The Hardware</h2>



<p>The aim of the project is to plug a sensor onto an Arduino that will send data to <strong>OVHcloud Metrics Data Platform</strong> (<a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a>) via the network. Basically, the Arduino will use the local wifi network to push temperature data to OVHcloud servers.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/buy-esp8266.png" alt="" class="wp-image-16095" width="239" height="86" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266.png 995w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266-300x109.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/buy-esp8266-768x278.png 768w" sizes="auto, (max-width: 239px) 100vw, 239px" /></figure></div>



<p>Do you know ESP8266? It&#8217;s&nbsp;a low-cost (<strong>less than</strong> <strong>2€!</strong>) wifi microchip with full TCP/IP stack and microcontroller capability.</p>



<h3 class="wp-block-heading">ESP8266 functional diagram</h3>



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/esp8266-1.png" alt="" class="wp-image-16096" width="754" height="364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1.png 933w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1-300x145.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/esp8266-1-768x370.png 768w" sizes="auto, (max-width: 754px) 100vw, 754px" /></figure>



<h3 class="wp-block-heading">Implementation: Wemos</h3>



<p>ESP8266 is not quite so easy to use on its own:</p>



<ul class="wp-block-list"><li>Must be powered at 3.3V (not too much, or it will burn)</li><li>No USB</li></ul>



<div class="wp-block-image"><figure class="alignright"><img loading="lazy" decoding="async" width="207" height="250" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/wemos.png" alt="" class="wp-image-16097"/></figure></div>



<p>That&#8217;s why it is better to use a solution that implements ESP8266 for us. Here is the Wemos!</p>



<ul class="wp-block-list"><li>Powered at 5V (6V is still ok)</li><li>USB for serial communication (for debugging)</li><li>Can be programmed via USB</li><li>Can be programmed with Arduino IDE</li><li>Costs less than 3€</li></ul>



<h2 class="wp-block-heading">Prepare your Arduino IDE</h2>



<h3 class="wp-block-heading">Install the integrated development environment</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/download-arduino.png" alt="" class="wp-image-16098" width="423" height="169" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino.png 917w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino-300x120.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/download-arduino-768x307.png 768w" sizes="auto, (max-width: 423px) 100vw, 423px" /></figure></div>



<p>First of all you need to install Arduino IDE. It&#8217;s free, and available for any platform (Mac, Windows, Linux). Go to&nbsp;<a href="https://www.arduino.cc/en/main/software" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.arduino.cc/en/main/software</a>&nbsp;and download the version corresponding to your platform. At the time of writing, the current version is 1.8.10.</p>



<h3 class="wp-block-heading">Additional configuration for ESP8266</h3>



<p>When you install the Arduino IDE, it will only be capable of programming official Arduinos. Let&#8217;s add the firmware and libraries for ESP8266&#8230;&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step06.png" alt="" class="wp-image-16099" width="392" height="226" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06.png 999w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06-300x174.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step06-768x445.png 768w" sizes="auto, (max-width: 392px) 100vw, 392px" /></figure></div>



<p>Start Arduino and open the &#8220;Preferences&#8221; window (<strong>File &gt; Preferences</strong>).</p>



<p>Enter&nbsp;<code><a href="https://arduino.esp8266.com/stable/package_esp8266com_index.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://arduino.esp8266.com/stable/package_esp8266com_index.json</a></code>&nbsp;into the&nbsp;&#8220;Additional Board Manager URLs&#8221;&nbsp;field. You can add multiple URLs, separating them with commas.</p>



<div style="height:115px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step07.png" alt="" class="wp-image-16100" width="393" height="220" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07.png 1002w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07-300x168.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step07-768x431.png 768w" sizes="auto, (max-width: 393px) 100vw, 393px" /></figure></div>



<p>Now open &#8220;Boards Manager&#8221; from the<strong>&nbsp;Tools &gt; Board</strong>&nbsp;menu and install&nbsp;the <em>esp8266</em>&nbsp;platform (don&#8217;t forget to select your ESP8266 board from the <strong>Tools &gt; Board</strong> menu after installation).</p>



<p>You are now ready!</p>



<h2 class="wp-block-heading">Order a Metrics Data Platform</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step00-1024x484.png" alt="" class="wp-image-16101" width="308" height="145" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-1024x484.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-300x142.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00-768x363.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step00.png 1321w" sizes="auto, (max-width: 308px) 100vw, 308px" /></figure></div>



<p></p>



<p>Go&nbsp;to the&nbsp;OVHcloud&nbsp;Metrics&nbsp;Data&nbsp;Platform&nbsp;website:&nbsp;<a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">https://www.ovh.com/fr/data-platforms/metrics/</a>. Click&nbsp;on&nbsp;the&nbsp;free&nbsp;trial, and finalise your order. If you don&#8217;t have an account, just create one. With this trial you will have 12 metrics (i.e. 12 sets of records). In this example, we will only use one.</p>



<h3 class="wp-block-heading">Retrieve your token</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step03.png" alt="" class="wp-image-16102" width="127" height="129" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step03.png 353w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step03-296x300.png 296w" sizes="auto, (max-width: 127px) 100vw, 127px" /></figure></div>



<p>Go to the OVH Control Panel:&nbsp;<a href="https://www.ovh.com/manager/cloud/#/" data-wpel-link="exclude">https://www.ovh.com/manager/cloud/#/</a>. On the left-hand panel, you should have&nbsp;<strong>Metrics</strong>&nbsp;and a new service inside.</p>



<div style="height:65px" aria-hidden="true" class="wp-block-spacer"></div>



<p>In the&nbsp;&#8220;Tokens&#8221;&nbsp;tab, you can copy the<strong>&nbsp;write&nbsp;token</strong>.&nbsp;Keep it, as we will need it later.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step04-1024x295.png" alt="" class="wp-image-16103" width="633" height="182" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-1024x295.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-300x87.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04-768x222.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step04.png 1539w" sizes="auto, (max-width: 633px) 100vw, 633px" /></figure></div>



<p>Note that to configure Grafana, you will need the <strong>read token</strong>.</p>



<h3 class="wp-block-heading">Retrieve the host of the Metrics Data Platform</h3>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step05-1024x439.png" alt="" class="wp-image-16104" width="339" height="146" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-1024x439.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05-768x329.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step05.png 1526w" sizes="auto, (max-width: 339px) 100vw, 339px" /></figure></div>



<p>The host of your Metrics Data Platform is given in your service description.&nbsp;In the &#8220;Platforms&#8221; tab, copy the <strong>opentsdb host</strong>. Keep it, as we will need it later.</p>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">Deeper into the program</h2>



<p>Now let&#8217;s have a look at an example. Here is some code that will push static data to OVHcloud Metrics Data Platform. You can use it with your sensor; you just have to code the sensor measurement. When running, the Wemos will:</p>



<ul class="wp-block-list"><li>Try to connect to your wifi network</li><li>If successful, push data to OVHcloud Metrics Data Platform</li></ul>



<p>The whole source code is available on my GitHub: <a href="https://github.com/landru29/ovh_metrics_wemos" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/landru29/ovh_metrics_wemos</a>.</p>



<p>There are six main files:</p>



<ul class="wp-block-list"><li><em>ovh_metrics_wemos.ino</em>: the main file</li><li><em>wifi.cpp</em>: class that implements the process to connect to wifi via WPS (Wi-Fi Protected Setup)</li><li><em>wifi.h</em>: header file for the wifi</li><li><em>metrics.cpp</em>: class that sends the metric data to OVHcloud Metrics Data Platform via HTTPS</li><li><em>metrics.h</em>: header file for metrics</li><li><em>config.h.sample</em>: model to create your configuration file (see below)</li></ul>



<h3 class="wp-block-heading">Create your configuration file</h3>



<p>If you try to compile the program, you will get errors, as some definitions are missing. We need to declare them in a file:&nbsp;<strong>config.h</strong>.</p>



<ol class="wp-block-list"><li>Copy&nbsp;<strong>config.h.sample</strong>&nbsp;into&nbsp;<strong>config.h</strong></li><li>Copy the write token you got in paragraph 5.1 (#define&nbsp;<strong>TOKEN</strong>&nbsp;&#8220;xxxxxx&#8221;)</li><li>Copy the host you got in paragraph 5.2 (#define&nbsp;<strong>HOST</strong>&nbsp;&#8220;xxxxxx&#8221;)</li></ol>



<h3 class="wp-block-heading">Get the fingerprint of the certificate</h3>



<p>As the Wemos will make its requests over HTTPS, we need the certificate fingerprint. You will need the host you just grabbed from the&nbsp;&#8220;Platforms&#8221;&nbsp;tab, and then:</p>



<h4 class="wp-block-heading">Linux users</h4>



<p>Just run this little script:</p>



<pre class="wp-block-code"><code class="">HOST=opentsdb.gra1.metrics.ovh.net; echo | openssl s_client -showcerts -servername ${HOST} -connect ${HOST}:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1 -inform pem | sed -e "s/.*=//g" | sed -e "s/\:/ /g"</code></pre>



<p>Copy the result in your&nbsp;<strong><code>config.h&nbsp;</code></strong><code>(#define&nbsp;<strong>FINGERPRINT</strong>&nbsp;"xx xx ..")</code>.</p>



<h4 class="wp-block-heading">MAC users</h4>



<p>Just run this little script:</p>



<pre class="wp-block-code"><code class="">HOST=opentsdb.gra1.metrics.ovh.net; echo | openssl s_client -showcerts -servername ${HOST} -connect ${HOST}:443 2>/dev/null | openssl x509 -noout -fingerprint -sha1 -inform pem | sed -e "s/.*=//g" | sed -e "s/\:/ /g"</code></pre>



<p>Copy the result in your&nbsp;<strong><code>config.h&nbsp;</code></strong><code>(#define&nbsp;<strong>FINGERPRINT</strong>&nbsp;"xx xx ..")</code>.</p>



<h4 class="wp-block-heading">Windows users</h4>



<p>In your browser, go to <a href="https://opentsdb.gra1.metrics.ovh.net" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://opentsdb.gra1.metrics.ovh.net</a>. Click on the lock next to the URL to display the fingerprint of the certificate. Replace all &#8216;:&#8217; with one space.</p>
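<p>The same fingerprint can also be computed programmatically, whatever your OS. Here is a minimal Python sketch (the helper names are ours, not part of the project):</p>

```python
import hashlib
import ssl

def format_fingerprint(der_bytes: bytes) -> str:
    """SHA-1 digest as space-separated hex pairs, the format config.h expects."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return " ".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def cert_fingerprint(host: str, port: int = 443) -> str:
    """Fetch the server certificate and return its SHA-1 fingerprint."""
    pem = ssl.get_server_certificate((host, port))
    return format_fingerprint(ssl.PEM_cert_to_DER_cert(pem))

# cert_fingerprint("opentsdb.gra1.metrics.ovh.net")
```

<p>The returned string can be pasted directly into the <code>FINGERPRINT</code> define.</p>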



<h2 class="wp-block-heading">Compile the project and upload it to the Wemos</h2>



<ol class="wp-block-list"><li>Open the&nbsp;<code>.ino</code>&nbsp;file in the Arduino IDE (you should have six tabs in the project)</li><li>Plug the&nbsp;Wemos&nbsp;into your computer</li><li>Select the port from&nbsp;<strong>Tools &gt; Port</strong></li><li>On the top-left side, click on the arrow to upload the program</li><li>Once uploaded, you can open the serial monitor:&nbsp;<strong>Tools &gt; Serial Monitor</strong></li></ol>



<p>Right now, the program should fail, as the&nbsp;Wemos&nbsp;will not be able to connect to your wifi network.</p>



<h2 class="wp-block-heading">Run the program</h2>



<p>As we&#8217;ve already seen, the first run fails, because the Wemos cannot join your wifi network yet. You need to launch a WPS transaction on your internet modem: depending on the model, this is either a physical button on the modem, or a software action to trigger on its console (<a href="https://en.wikipedia.org/wiki/Wi-Fi_Protected_Setup" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://en.wikipedia.org/wiki/Wi-Fi_Protected_Setup</a>).</p>



<p>Once the WPS process is launched on the modem side, you have about 30 seconds to power up the Wemos.</p>



<ol class="wp-block-list"><li>Plug in your&nbsp;Wemos&nbsp;via USB =&gt; the program is running</li><li>Select the port from&nbsp;<strong>Tools &gt; Port</strong>&nbsp;(it may have changed)</li><li>Open the serial monitor:<strong>&nbsp;Tools &gt; Serial Monitor</strong></li></ol>



<p>Now you can follow the process.</p>



<h3 class="wp-block-heading">Wifi connection</h3>



<p>In the serial monitor (set the baud rate to 9600), you should get:</p>



<pre class="wp-block-preformatted"><code>Try to connect</code>
&nbsp;
<code>WPS config start</code>
<code>Trying to connect to &lt;your modem&gt; with saved config ...|SUCCESS</code>
<code>IP address:&nbsp;192.168.xx.xx</code></pre>



<p>If the wifi connection was successful, the serial console should display a local IP address (192.168.xx.xx); otherwise, the connection failed. Try again by triggering WPS on your modem and restarting the Wemos (unplug it and plug it back in).</p>



<h3 class="wp-block-heading">Sending data to OVHcloud Metrics Data Platform</h3>



<p>Now the Wemos POSTs a request to the OVHcloud server. The serial console shows you the JSON it will send:</p>



<pre class="wp-block-preformatted"><code>------------------------------------------------</code>
<code>POST opentsdb.gra1.metrics.ovh.net/api/put</code>
<code>[{"metric":&nbsp;"universe","value":42,"tags":{}}]</code>
<code>------------------------------------------------</code>
<code>beginResult:&nbsp;0</code>
<code>http:&nbsp;204</code>
<code>response: xxxx</code></pre>



<p>If&nbsp;<strong><code>beginResult</code></strong>&nbsp;is negative, connection to the OVHcloud server failed. It could mean that the&nbsp;<code>FINGERPRINT</code>&nbsp;is wrong.</p>



<p>If&nbsp;<strong><code>http</code></strong>&nbsp;is not&nbsp;<code>2xx</code>&nbsp;(it should be&nbsp;<code>204</code>), the server could not process your request. It may mean that the&nbsp;<code>TOKEN</code>&nbsp;is wrong.</p>
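<p>You can reproduce this request from any HTTP client to check your token and host independently of the Wemos. A minimal Python sketch (the helper name is ours; the authentication scheme is an assumption mirroring the Grafana data source settings, i.e. Basic auth with user <code>metrics</code> and the token as password):</p>

```python
import base64
import json

def build_put_request(host: str, token: str, metric: str, value, tags=None):
    """Build the same OpenTSDB /api/put call the Wemos makes.

    Returns (url, headers, body). Basic auth with user "metrics" and the
    write token as password is an assumption mirroring the Grafana setup.
    """
    body = json.dumps([{"metric": metric, "value": value, "tags": tags or {}}])
    auth = base64.b64encode(f"metrics:{token}".encode()).decode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth}",
    }
    return f"https://{host}/api/put", headers, body
```

<p>As with the Wemos, a <code>204</code> response means the datapoint was accepted.</p>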



<p>You got a 204? Great! It&#8217;s a success. Let&#8217;s check that on Grafana&#8230;</p>



<h2 class="wp-block-heading">Configure Grafana</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/step08.png" alt="" class="wp-image-16107" width="280" height="168" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/step08.png 747w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/step08-300x180.png 300w" sizes="auto, (max-width: 280px) 100vw, 280px" /></figure></div>



<p>Go to OVHcloud Grafana:&nbsp;<a href="https://grafana.metrics.ovh.net/login" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://grafana.metrics.ovh.net/login</a>.&nbsp;Log in with your OVHcloud account.</p>



<h3 class="wp-block-heading">Configure a data source</h3>



<p>Click on &#8220;Add data source&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="93" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana00-1024x93.png" alt="" class="wp-image-16108" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-1024x93.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-300x27.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00-768x70.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana00.png 1885w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana01.png" alt="" class="wp-image-16109" width="273" height="224" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01.png 973w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01-300x247.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana01-768x631.png 768w" sizes="auto, (max-width: 273px) 100vw, 273px" /></figure></div>



<ul class="wp-block-list"><li><em>Name</em>: choose one</li><li><em>Type</em>: OpenTSDB</li><li><em>URL</em>: https://&lt;host you got from your manager (see above)&gt;</li><li><em>Access</em>: direct</li><li>Check &#8220;Basic Auth&#8221;</li><li><em>User</em>: metrics</li><li><em>Password</em>: &lt;Read token from your manager (see above)&gt;</li></ul>



<p>Click on the &#8220;Add&#8221; button&#8230;</p>



<p>&#8230; and save it.</p>



<h3 class="wp-block-heading">Create your first chart</h3>



<p>Go back to&nbsp;<a href="https://grafana.metrics.ovh.net/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://grafana.metrics.ovh.net/</a>&nbsp;and click on &#8220;New Dashboard&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="84" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana03-1024x84.png" alt="" class="wp-image-16110" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-1024x84.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-300x24.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03-768x63.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana03.png 1886w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click on &#8220;Graph&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="84" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana04-1024x84.png" alt="" class="wp-image-16111" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-1024x84.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-300x25.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04-768x63.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana04.png 1894w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Click on &#8220;Panel title&#8221;, then &#8220;Edit&#8221;.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="126" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/grafana05-1024x126.png" alt="" class="wp-image-16112" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-1024x126.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-300x37.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05-768x95.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/grafana05.png 1887w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Select your metric in the &#8220;metric name&#8221; field. Grafana should suggest the name <em>universe</em> (the name specified in the Arduino program). If it doesn&#8217;t, the metrics were not correctly sent by the Wemos. Close the &#8220;edit&#8221; panel (click the cross on the right) and save your configuration (top-left of the window).</p>



<h2 class="wp-block-heading">Result analysis</h2>



<h3 class="wp-block-heading">Temperature rise</h3>



<p>The first result to analyse is the temperature rise. The sensor was lying on the bricks of the oven. The yellow chart is the oven temperature, and the green chart is the ambient temperature.</p>



<ol class="wp-block-list"><li>Between 11:05 and 11:10, there is a step at about 85°C. This seems to be the moisture in the oven drying out.</li><li>Then there&#8217;s a temperature drop, because I added some more wood to the oven (i.e. introduced cold stuff).</li><li>At about 11:20, the slope is shallower, and I have no idea why. Fire not strong enough? Moisture deeper in the bricks?</li></ol>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="418" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/four-1024x418.png" alt="" class="wp-image-16113" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-1024x418.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four.png 1871w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Temperature drop</h3>



<p>At this point, I moved all the embers at the back of the oven and put the sensor where the fire was burning. That&#8217;s why the chart begins at 400°C.</p>



<ol class="wp-block-list"><li>The temperature drop seems to follow something like <strong>F(t) = A/t</strong></li><li>At about 15:40, I changed the power supply from a phone charger plugged into 230V to a car battery with a voltage regulator (which turned out to be rather poor)</li><li>The ambient temperature is quite high between 15:00 and 17:00. It was a sunny day, so the sun was directly heating the circuit.</li></ol>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="417" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/four2-1024x417.png" alt="" class="wp-image-16114" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-1024x417.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-300x122.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2-768x313.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/four2.png 1871w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
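<p>The <strong>F(t) = A/t</strong> guess can be checked against the recorded datapoints with a small least-squares fit. A sketch with synthetic values (with real data, export the series from Grafana first):</p>

```python
def fit_inverse_time(ts, ys):
    """Least-squares fit of y = A/t.

    Minimising sum((y_i - A/t_i)^2) over A gives the closed form
    A = sum(y_i/t_i) / sum(1/t_i^2).
    """
    num = sum(y / t for t, y in zip(ts, ys))
    den = sum(1.0 / (t * t) for t in ts)
    return num / den

# Synthetic cool-down sampled at 1, 2 and 4 hours: an exact y = 400/t curve
print(fit_inverse_time([1.0, 2.0, 4.0], [400.0, 200.0, 100.0]))  # 400.0
```

<p>If the fitted curve deviates a lot from the measurements, the cooling is probably not a pure A/t law.</p>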
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fiot-pushing-data-to-ovhcloud-metrics-timeseries-from-arduino%2F&amp;action_name=IOT%3A%20Pushing%20data%20to%20OVHcloud%20metrics%20timeseries%20from%20Arduino&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to monitor your Kubernetes Cluster with OVH Observability</title>
		<link>https://blog.ovhcloud.com/how-to-monitor-your-kubernetes-cluster-with-ovh-observability/</link>
		
		<dc:creator><![CDATA[Adrien Carreira]]></dc:creator>
		<pubDate>Fri, 08 Mar 2019 13:33:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Managed Kubernetes]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14897</guid>

					<description><![CDATA[Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&#160;last week,&#160;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (here&#160;and&#160;here, to get you [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-monitor-your-kubernetes-cluster-with-ovh-observability%2F&amp;action_name=How%20to%20monitor%20your%20Kubernetes%20Cluster%20with%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p">Our colleagues in the K8S team launched the OVH Managed Kubernetes solution&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">last week,</a>&nbsp;in which they manage the Kubernetes master components and spawn your nodes on top of our Public Cloud solution. I will not describe the details of how it works here, but there are already many blog posts about it (<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-and-etcd/" data-wpel-link="exclude">here</a>&nbsp;and&nbsp;<a class="markup--anchor markup--p-anchor" href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" target="_blank" rel="noopener noreferrer" data-href="https://www.ovh.com/fr/blog/kubinception-using-kubernetes-to-run-kubernetes/" data-wpel-link="exclude">here,</a> to get you started).</p>



<p>In the <a href="https://labs.ovh.com/machine-learning-platform" data-wpel-link="exclude">Prescience team</a>, we have used Kubernetes for more than a year now. Our cluster includes 40 nodes, running on top of PCI. We continuously run about 800 pods, and generate a lot of metrics as a result.</p>



<p>Today, we&#8217;ll look at how we handle these metrics to monitor our Kubernetes Cluster, and (equally importantly!) how to do this with your own cluster.</p>



<h3 class="graf graf--h3 wp-block-heading">OVH Metrics</h3>



<p class="graf graf--p">Like any infrastructure, you need to monitor your Kubernetes Cluster. You need to know exactly how your nodes, cluster and applications behave once they have been deployed in order to provide reliable services to your customers. To do this with our own Cluster, we use <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">OVH Observability</a>.</p>



<p class="graf graf--p">OVH Observability is backend-agnostic, so we can push metrics in one format and query them in another. It can handle:</p>



<ul class="postList wp-block-list"><li class="graf graf--li">Graphite</li><li class="graf graf--li">InfluxDB</li><li class="graf graf--li">Metrics2.0</li><li class="graf graf--li">OpenTSDB</li><li class="graf graf--li">Prometheus</li><li class="graf graf--li">Warp10</li></ul>



<p class="graf graf--p">It also incorporates a managed <a class="markup--anchor markup--p-anchor" href="https://grafana.metrics.ovh.net" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.metrics.ovh.net" data-wpel-link="external">Grafana</a>, in order to display metrics and create monitoring dashboards.</p>



<h3 class="graf graf--h3 wp-block-heading">Monitoring Nodes</h3>



<p class="graf graf--p">The first thing to monitor is the health of nodes. Everything else starts from there.</p>



<p class="graf graf--p">In order to monitor your nodes, we will use <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/noderig" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/noderig" data-wpel-link="external">Noderig</a> and <a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/ovh/beamium" data-wpel-link="external">Beamium</a>, as described <a href="/monitoring-guidelines-for-ovh-observability/" data-wpel-link="internal">here</a>. We will also use Kubernetes DaemonSets to start the process on all our nodes.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg" alt="" class="wp-image-15024" width="768" height="578" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0135.jpg 2039w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p class="graf graf--p">So let’s start creating a namespace&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create namespace metrics</code></pre>



<p class="graf graf--p">Next, create a secret containing the Metrics write token, which you can find in the OVH Control Panel.</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl create secret generic w10-credentials --from-literal=METRICS_TOKEN=your-token -n metrics</code></pre>



<p class="graf graf--p">Copy the following <code class="markup--code markup--p-code">metrics.yml</code> into a file:</p>



<pre title="metrics.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># This will configure Beamium to scrape Noderig
# and push the metrics to Warp 10
# We also add the HOSTNAME to the labels of the pushed metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: beamium-config
  namespace: metrics
data:
  config.yaml: |
    scrapers:
      noderig:
        url: http://0.0.0.0:9100/metrics
        period: 30000
        format: sensision
        labels:
          app: noderig

    sinks:
      warp:
        url: https://warp10.gra1.metrics.ovh.net/api/v0/update
        token: $METRICS_TOKEN

    labels:
      host: $HOSTNAME

    parameters:
      log-file: /dev/stdout
---
# This is a custom collector that reports the uptime of the node
apiVersion: v1
kind: ConfigMap
metadata:
  name: noderig-collector
  namespace: metrics
data:
  uptime.sh: |
    #!/bin/sh
    echo 'os.uptime' `date +%s%N | cut -b1-10` `awk '{print $1}' /proc/uptime`
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: metrics-daemon
  namespace: metrics
spec:
  selector:
    matchLabels:
      name: metrics-daemon
  template:
    metadata:
      labels:
        name: metrics-daemon
    spec:
      terminationGracePeriodSeconds: 10
      hostNetwork: true
      volumes:
      - name: config
        configMap:
          name: beamium-config
      - name: noderig-collector
        configMap:
          name: noderig-collector
          defaultMode: 0777
      - name: beamium-persistence
        emptyDir: {}
      containers:
      - image: ovhcom/beamium:latest
        imagePullPolicy: Always
        name: beamium
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: TEMPLATE_CONFIG
          value: /config/config.yaml
        envFrom:
        - secretRef:
            name: w10-credentials
            optional: false
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi
        workingDir: /beamium
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /beamium
          name: beamium-persistence
      - image: ovhcom/noderig:latest
        imagePullPolicy: Always
        name: noderig
        args: ["-c", "/collectors", "--net", "3"]
        volumeMounts:
        - mountPath: /collectors/60/uptime.sh
          name: noderig-collector
          subPath: uptime.sh
        resources:
          limits:
            cpu: "0.05"
            memory: 128Mi
          requests:
            cpu: "0.01"
            memory: 128Mi</code></pre>
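<p class="graf graf--p">The custom <code class="markup--code markup--p-code">uptime.sh</code> collector simply prints one <code class="markup--code markup--p-code">name timestamp value</code> line, which Noderig picks up and exposes on port 9100. For reference, the same line can be produced in Python (a sketch; the field order is taken from the shell script above):</p>

```python
import time

def uptime_line(uptime_seconds, now=None):
    """Format an uptime sample the way uptime.sh does:
    '<metric-name> <unix-timestamp> <value>'."""
    ts = int(time.time() if now is None else now)
    return f"os.uptime {ts} {uptime_seconds}"

# uptime_line(123.45, now=1600000000) -> "os.uptime 1600000000 123.45"
```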



<p class="graf graf--p"><em class="markup--em markup--p-em">Don’t hesitate to change the collector levels if you need more information.</em></p>



<p>Then apply the configuration with kubectl&#8230;</p>



<pre class="wp-block-code console"><code class="">$ kubectl apply -f metrics.yml
# Then, just wait a minute for the pods to start
$ kubectl get all -n metrics
NAME                       READY   STATUS    RESTARTS   AGE
pod/metrics-daemon-2j6jh   2/2     Running   0          5m15s
pod/metrics-daemon-t6frh   2/2     Running   0          5m14s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/metrics-daemon  40        40        40      40           40          &lt;none&gt;          122d</code></pre>



<p class="graf graf--p">You can import our dashboard into your Grafana from <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9876" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9876" data-wpel-link="external">here</a>, and view some metrics about your nodes straight away.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1842" height="631" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png" alt="" class="wp-image-14899" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08.png 1842w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-300x103.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-768x263.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1024x351.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.09.08-1200x411.png 1200w" sizes="auto, (max-width: 1842px) 100vw, 1842px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Kube Metrics</h3>



<p>As the OVH Kube is a managed service, you don&#8217;t need to monitor the apiserver, etcd, or controlplane. The OVH Kubernetes team takes care of this. So we will focus on <a href="https://github.com/google/cadvisor/blob/master/info/v1/container.go" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">cAdvisor</a> metrics and <a href="https://github.com/kubernetes/kube-state-metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kube state metrics</a>.</p>



<p>The most mature solution for dynamically scraping metrics inside the Kube (for now) is <a href="https://github.com/prometheus/prometheus" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a>.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg" alt="Kube metrics" class="wp-image-15033" width="512" height="385" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1024x770.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-300x226.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-768x578.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144-1200x903.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0144.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p class="graf graf--p"><em class="markup--em markup--p-em">In the next Beamium release, we should be able to reproduce the features of the Prometheus scraper.</em></p>



<p class="graf graf--p">To install the Prometheus server, you need to install Helm on the cluster&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller \
    --clusterrole cluster-admin \
    --serviceaccount=kube-system:tiller
helm init --service-account tiller</code></pre>



<p class="graf graf--p">You then need to create the following two files:&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code> and <code class="markup--code markup--p-code">values.yml</code>.</p>



<pre title="prometheus.yml" class="wp-block-code"><code lang="yaml" class="language-yaml"># Based on https://github.com/prometheus/prometheus/blob/release-2.2/documentation/examples/prometheus-kubernetes.yml
serverFiles:
  prometheus.yml:
    remote_write:
    - url: "https://prometheus.gra1.metrics.ovh.net/remote_write"
      remote_timeout: 120s
      bearer_token: $TOKEN
      write_relabel_configs:
      # Filter metrics to keep
      - action: keep
        source_labels: [__name__]
        regex: "eagle.*|\
            kube_node_info.*|\
            kube_node_spec_taint.*|\
            container_start_time_seconds|\
            container_last_seen|\
            container_cpu_usage_seconds_total|\
            container_fs_io_time_seconds_total|\
            container_fs_write_seconds_total|\
            container_fs_usage_bytes|\
            container_fs_limit_bytes|\
            container_memory_working_set_bytes|\
            container_memory_rss|\
            container_memory_usage_bytes|\
            container_network_receive_bytes_total|\
            container_network_transmit_bytes_total|\
            machine_memory_bytes|\
            machine_cpu_cores"

    scrape_configs:
    # Scrape config for Kubelet cAdvisor.
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      
      relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        
      metric_relabel_configs:
      # Only keep systemd important services like docker|containerd|kubelet and kubepods,
      # We also want machine_cpu_cores that don't have id, so we need to add the name of the metric in order to be matched
      # The string will concat id with name and the separator is a ;
      # `/;container_cpu_usage_seconds_total` OK
      # `/system.slice;container_cpu_usage_seconds_total` OK
      # `/system.slice/minion.service;container_cpu_usage_seconds_total` NOK, Useless
      # `/kubepods/besteffort/e2514ad43202;container_cpu_usage_seconds_total` Best Effort POD OK
      # `/kubepods/burstable/e2514ad43202;container_cpu_usage_seconds_total` Burstable POD OK
      # `/kubepods/e2514ad43202;container_cpu_usage_seconds_total` Guaranteed POD OK
      # `/docker/pod104329ff;container_cpu_usage_seconds_total` OK, Container that run on docker but not managed by kube
      # `;machine_cpu_cores` OK, there is no id on these metrics, but we want to keep them also
      - source_labels: [id,__name__]
        regex: "^((/(system.slice(/(docker|containerd|kubelet).service)?|(kubepods|docker).*)?);.*|;(machine_cpu_cores|machine_memory_bytes))$"
        action: keep
      # Remove Useless parents keys like `/kubepods/burstable` or `/docker`
      - source_labels: [id]
        regex: "(/kubepods/burstable|/kubepods/besteffort|/kubepods|/docker)"
        action: drop
        # cAdvisor give metrics per container and sometimes it sum up per pod
        # As we already have the child, we will sum up ourselves, so we drop metrics for the POD and keep containers metrics
        # Metrics for the POD don't have container_name, so we drop if we have just the pod_name
      - source_labels: [container_name,pod_name]
        regex: ";(.+)"
        action: drop
    
    # Scrape config for service endpoints.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    # pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod_name
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: host
      - action: labeldrop
        regex: (pod_template_generation|job|release|controller_revision_hash|workload_user_cattle_io_workloadselector|pod_template_hash)
</code></pre>



<pre title="values.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">alertmanager:
  enabled: false
pushgateway:
  enabled: false
nodeExporter:
  enabled: false
server:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: traefik
      ingress.kubernetes.io/auth-type: basic
      ingress.kubernetes.io/auth-secret: basic-auth
    hosts:
    - prometheus.domain.com
  image:
    tag: v2.7.1
  persistentVolume:
    enabled: false
</code></pre>



<p class="graf graf--p">Don’t forget to replace your token!</p>



<p>The Prometheus scraper is quite powerful. You can relabel your time series, keep a few that match your regex, etc. This config removes a lot of useless metrics, so don’t hesitate to tweak it if you want to see more cAdvisor metrics (for example).</p>
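<p>If you would rather go the other way and whitelist metrics explicitly, a sketch of an additional <code>keep</code> rule on metric names (the metric list below is illustrative, adapt it to what you need) could look like this:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Illustrative sketch: keep only an explicit whitelist of cAdvisor series,
# matched on the metric name itself
- source_labels: [__name__]
  regex: "(container_cpu_usage_seconds_total|container_memory_working_set_bytes)"
  action: keep</code></pre>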



<p class="graf graf--p">&nbsp;Install it with Helm&#8230;</p>



<pre class="wp-block-code"><code lang="bash" class="language-bash">helm install stable/prometheus \
    --namespace=metrics \
    --name=metrics \
    --values=values/values.yaml \
    --values=values/prometheus.yaml</code></pre>



<p class="graf graf--p">Then add a basic-auth secret&#8230;</p>



<pre class="wp-block-code console"><code class="">$ htpasswd -c auth foo
New password: &lt;bar>
Re-type new password: &lt;bar>
Adding password for user foo
$ kubectl create secret generic basic-auth --from-file=auth -n metrics
secret "basic-auth" created</code></pre>



<p class="graf graf--p">You can access the Prometheus server interface through <code class="markup--code markup--p-code">prometheus.domain.com</code>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1876" height="809" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png" alt="" class="wp-image-14933" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21.png 1876w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-300x129.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-768x331.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1024x442.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-10.01.21-1200x517.png 1200w" sizes="auto, (max-width: 1876px) 100vw, 1876px" /></figure></div>



<p class="graf graf--p">You will see all the metrics for your Cluster, although only the ones you have filtered will be pushed to OVH Metrics.</p>



<p>The Prometheus interface is a good way to explore your metrics, as it&#8217;s quite straightforward to display and monitor your infrastructure. You can find our dashboard <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9880" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9880" data-wpel-link="external">here</a>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1843" height="653" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png" alt="" class="wp-image-14900" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20.png 1843w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1024x363.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-16.07.20-1200x425.png 1200w" sizes="auto, (max-width: 1843px) 100vw, 1843px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Resources Metrics</h3>



<p class="graf graf--p">As @<a class="markup--user markup--p-user" href="https://medium.com/u/7dfbd8de8b55" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/u/7dfbd8de8b55" data-anchor-type="2" data-user-id="7dfbd8de8b55" data-action-value="7dfbd8de8b55" data-action="show-user-card" data-action-type="hover" data-wpel-link="external">Martin Schneppenheim</a> said in this <a class="markup--anchor markup--p-anchor" href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://medium.com/@martin.schneppenheim/utilizing-and-monitoring-kubernetes-cluster-resources-more-effectively-using-this-tool-df4c68ec2053" data-wpel-link="external">post</a>, in order to correctly manage a Kubernetes Cluster, you also need to monitor pod resources.</p>



<p>We will install <a class="markup--anchor markup--p-anchor" href="https://github.com/google-cloud-tools/kube-eagle" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://github.com/google-cloud-tools/kube-eagle" data-wpel-link="external">Kube Eagle</a>, which collects and exposes metrics about CPU and RAM requests, limits and usage, so the Prometheus server you just installed can scrape them.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg" alt="Kube Eagle" class="wp-image-15035" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0145.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>Create a file named <code class="markup--code markup--p-code">eagle.yml</code>.</p>



<pre title="eagle.yml" class="wp-block-code"><code lang="yaml" class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - pods
  verbs:
  - get
  - list
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: kube-eagle
  name: kube-eagle
  namespace: kube-eagle
subjects:
- kind: ServiceAccount
  name: kube-eagle
  namespace: kube-eagle
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-eagle
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-eagle
  labels:
    app: kube-eagle
  name: kube-eagle
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: kube-eagle
  name: kube-eagle
  labels:
    app: kube-eagle
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eagle
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
      labels:
        app: kube-eagle
    spec:
      serviceAccountName: kube-eagle
      containers:
      - name: kube-eagle
        image: "quay.io/google-cloud-tools/kube-eagle:1.0.0"
        imagePullPolicy: IfNotPresent
        env:
        - name: PORT
          value: "8080"
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
</code></pre>



<pre class="wp-block-code console"><code class="">$ kubectl create namespace kube-eagle
$ kubectl apply -f eagle.yml</code></pre>



<p class="graf graf--p">Next, import this <a class="markup--anchor markup--p-anchor" href="https://grafana.com/dashboards/9875/revisions" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://grafana.com/dashboards/9875/revisions" data-wpel-link="external">Grafana dashboard</a> (it’s the same dashboard as Kube Eagle, but ported to Warp10).</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1838" height="784" src="/blog/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png" alt="" class="wp-image-14901" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50.png 1838w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-768x328.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1024x437.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-05-at-15.06.50-1200x512.png 1200w" sizes="auto, (max-width: 1838px) 100vw, 1838px" /></figure></div>



<p class="graf graf--p">You now have an easy way of monitoring your pod resources in the Cluster!</p>



<h3 class="graf graf--h3 wp-block-heading">Custom Metrics</h3>



<p>How does Prometheus know that it needs to scrape kube-eagle? If you look at the metadata in <code class="markup--code markup--p-code">eagle.yml</code>, you&#8217;ll see the following:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080" # The port where to find the metrics
  prometheus.io/path: "/metrics" # The path where to find the metrics</code></pre>



<p>These annotations will trigger the Prometheus auto-discovery process (described in <code class="markup--code markup--p-code">prometheus.yml</code> line 114).</p>



<p>This means you can easily add these annotations to pods or services that contain a Prometheus exporter, and then forward these metrics to OVH Observability. <a href="https://prometheus.io/docs/instrumenting/exporters/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">You can find a non-exhaustive list of Prometheus exporters here</a>.</p>
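<p>As a sketch (the service name and port are illustrative), annotating a service that exposes an exporter is all it takes; the <code>kubernetes-service-endpoints</code> job shown above will then discover and scrape it:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: my-app                      # illustrative name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"      # the port serving the metrics
    prometheus.io/path: "/metrics"  # optional, /metrics is the default
spec:
  selector:
    app: my-app
  ports:
  - port: 9090</code></pre>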



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg" alt="" class="wp-image-15027" width="512" height="222" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1024x443.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-300x130.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-768x333.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141-1200x520.jpg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/IMG_0141.jpg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="graf graf--h3 wp-block-heading">Volumetrics Analysis</h3>



<p>As you saw in the&nbsp;<code class="markup--code markup--p-code">prometheus.yml</code>, we&#8217;ve tried to filter out a lot of useless metrics. For example, cAdvisor on a fresh cluster (with only three real production pods, plus the whole kube-system and Prometheus namespaces) produces about 2,600 metrics per node. With a smart cleaning approach, you can reduce this to 126 series.</p>



<p>Here&#8217;s a table to show the approximate number of metrics you will generate, based on the number of nodes&nbsp;<strong>(N)</strong> and the number of production pods <strong>(P) </strong>you have:</p>



<figure class="wp-block-table"><table><tbody><tr><td>&nbsp;</td><td><strong>Noderig</strong></td><td><strong>cAdvisor</strong></td><td><strong>Kube State</strong></td><td><strong>Eagle</strong></td><td><strong>Total</strong></td></tr><tr><td>nodes</td><td>N * 13<sup id="cite_ref-ned_1-3" class="reference">(1)</sup></td><td>N * 2<sup id="cite_ref-ned_1-3" class="reference">(2)</sup></td><td>N * 1<sup id="cite_ref-ned_1-3" class="reference">(3)</sup></td><td>N * 8<sup id="cite_ref-ned_1-3" class="reference">(4)</sup></td><td><strong>N * 24</strong></td></tr><tr><td>system.slice</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(5)</sup> * 16<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>0</td><td><strong>N * 80</strong></td></tr><tr><td>kube-system + kube-proxy + metrics</td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>N * 5<sup id="cite_ref-ned_1-3" class="reference">(9)</sup> * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>N * 160</strong></td></tr><tr><td>Production Pods</td><td>0</td><td>P * 26<sup id="cite_ref-ned_1-3" class="reference">(6)</sup></td><td>0</td><td>P * 6<sup id="cite_ref-ned_1-3" class="reference">(10)</sup></td><td><strong>P * 32</strong></td></tr></tbody></table></figure>



<p>For example, if you run three nodes with 60 pods, you will generate 264 * 3 + 32 * 60 ~= 2,700 metrics.</p>
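<p>The table boils down to a simple linear formula; here is a small Python sketch (using the per-node and per-pod coefficients from the table above) to estimate your own volume:</p>



<pre class="wp-block-code"><code lang="python" class="language-python"># Rough series-count estimate based on the coefficients in the table:
# each node costs N * (24 + 80 + 160) series, each production pod P * (26 + 6).
PER_NODE = 24 + 80 + 160  # Noderig + cAdvisor + Kube State + Eagle, per node
PER_POD = 26 + 6          # cAdvisor + Kube Eagle, per production pod

def estimate_series(nodes: int, pods: int) -> int:
    """Approximate number of series pushed to the platform."""
    return nodes * PER_NODE + pods * PER_POD

print(estimate_series(3, 60))  # 3 nodes, 60 pods -> 2712, i.e. ~2,700 series</code></pre>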



<p><em>NB: A pod has a unique name, so if you redeploy a deployment, you will create 32 new metrics each time.</em></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(1) Noderig metrics: <code class="markup--code markup--p-code">os.mem / os.cpu / os.disk.fs / os.load1 / os.net.dropped (in/out) / os.net.errs (in/out) / os.net.packets (in/out) / os.net.bytes (in/out)/ os.uptime</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(2) cAdvisor nodes metrics: <code class="markup--code markup--p-code">machine_memory_bytes / machine_cpu_cores</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(3) Kube state nodes metrics: <code class="markup--code markup--p-code">kube_node_info</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(4) Kube Eagle nodes metrics: <code class="markup--code markup--p-code">eagle_node_resource_allocatable_cpu_cores / eagle_node_resource_allocatable_memory_bytes / eagle_node_resource_limits_cpu_cores / eagle_node_resource_limits_memory_bytes / eagle_node_resource_requests_cpu_cores / eagle_node_resource_requests_memory_bytes / eagle_node_resource_usage_cpu_cores / eagle_node_resource_usage_memory_bytes</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(5) With our filters, we will monitor around five system.slices&nbsp;</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(6) Metrics are reported per container. A pod is a set of containers (a minimum of two: your container, plus the pause container for the network). So we can consider (2 * 10 + 6) = 26 metrics per pod: 10 metrics from cAdvisor per container, and six for the network (see below). For system.slice we will have 10 + 6 = 16, because it&#8217;s treated as a single container.</sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(7) cAdvisor will provide these metrics for each container</sup><sup id="cite_ref-ned_1-3" class="reference">: </sup><sup id="cite_ref-ned_1-3" class="reference"><code class="markup--code markup--p-code">container_start_time_seconds / container_last_seen / container_cpu_usage_seconds_total / container_fs_io_time_seconds_total / container_fs_write_seconds_total / container_fs_usage_bytes / container_fs_limit_bytes / container_memory_working_set_bytes / container_memory_rss / container_memory_usage_bytes </code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(8) cAdvisor will provide these metrics for each interface: <code class="markup--code markup--p-code">container_network_receive_bytes_total * per interface / container_network_transmit_bytes_total * per interface</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(9) <code class="markup--code markup--p-code">kube-dns / beamium-noderig-metrics / kube-proxy / canal / metrics-server&nbsp;</code></sup></p>



<p><sup id="cite_ref-ned_1-3" class="reference">(10) Kube Eagle pods metrics: <code class="markup--code markup--p-code"> eagle_pod_container_resource_limits_cpu_cores /  eagle_pod_container_resource_limits_memory_bytes / eagle_pod_container_resource_requests_cpu_cores / eagle_pod_container_resource_requests_memory_bytes / eagle_pod_container_resource_usage_cpu_cores / eagle_pod_container_resource_usage_memory_bytes</code></sup></p>



<h3 class="graf graf--h3 wp-block-heading">Conclusion</h3>



<p class="graf graf--p">As you can see, monitoring your Kubernetes Cluster with OVH Observability is easy. You don&#8217;t need to worry about how and where to store your metrics, leaving you free to focus on leveraging your Kubernetes Cluster to handle your business workloads effectively, like we have in the Machine Learning Services Team.</p>



<p class="graf graf--p">The next step will be to add an alerting system, to notify you when your nodes are down (for example). For this, you can use the free&nbsp;<a class="markup--anchor markup--p-anchor" href="https://studio.metrics.ovh.net/" target="_blank" rel="noopener noreferrer nofollow external" data-href="https://studio.metrics.ovh.net/" data-wpel-link="external">OVH Alert Monitoring</a>&nbsp;tool.</p>



<h4 class="graf graf--h4 graf-after--p wp-block-heading" id="a936">Stay in&nbsp;touch</h4>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to&nbsp;<a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join the Observability Gitter</a>&nbsp;or <a href="https://gitter.im/ovh/kubernetes" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kubernetes Gitter!</a><br>Follow us on Twitter: <a href="https://twitter.com/OVH" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">@OVH</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-monitor-your-kubernetes-cluster-with-ovh-observability%2F&amp;action_name=How%20to%20monitor%20your%20Kubernetes%20Cluster%20with%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Monitoring guidelines for OVH Observability</title>
		<link>https://blog.ovhcloud.com/monitoring-guidelines-for-ovh-observability/</link>
		
		<dc:creator><![CDATA[Kevin Georges]]></dc:creator>
		<pubDate>Thu, 07 Mar 2019 11:19:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Beamium]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Noderig]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[OVHcloud Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14929</guid>

					<description><![CDATA[At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second. This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmonitoring-guidelines-for-ovh-observability%2F&amp;action_name=Monitoring%20guidelines%20for%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p class="graf graf--p graf-after--h3">At the OVH Observability (formerly Metrics) team, we collect, process and analyse most of OVH&#8217;s monitoring data. It represents about 500M unique metrics, pushing data points at a steady rate of 5M per second.</p>



<p class="graf graf--p graf-after--p">This data can be classified in two ways: host or application monitoring. Host monitoring is mostly based on hardware counters (CPU, memory, network, disk…) while application monitoring is based on the service and its scalability (requests, processing, business logic…).</p>



<p>We provide this service for internal teams, who enjoy the same experience as our customers. Basically, our Observability service is SaaS with a compatibility layer (supporting InfluxDB, OpenTSDB, Warp10, Prometheus, and Graphite) that allows it to integrate with most of the existing solutions out there. This way, a team that is used to a particular tool, or have already deployed a monitoring solution, won&#8217;t need to invest much time or effort when migrating to a fully managed and scalable service: they just pick a token, use the right endpoint, and they&#8217;re done. Besides, our compatibility layer offers a choice: you can push your data with OpenTSDB, then query it in either PromQL or WarpScript. Combining protocols in this way results in a unique open-source interoperability that delivers more value, with no restrictions created by a solution&#8217;s query capabilities.</p>



<h3 id="816f" class="graf graf--h3 graf-after--p wp-block-heading">Scollector, Snap, Telegraf, Graphite, Collectd…</h3>



<p class="graf graf--p graf-after--h3">Drawing on this experience, we collectively tried most of the collection tools, but we always arrived at the same conclusion: we were witnessing&nbsp;<strong class="markup--strong markup--p-strong">metrics bleeding.</strong> Each tool focused on scraping every reachable bit of data, which is great if you are a graph addict, but can be counterproductive from an operational point of view if you have to monitor thousands of hosts. While it&#8217;s possible to filter them, teams still need to understand the whole metrics set in order to know what needs to be filtered.</p>



<p class="graf graf--p graf-after--p">At OVH, we use laser-cut collections of metrics. Each host has a specific template (web server, database, automation…) that exports a set amount of metrics, which can be used for health diagnostics and monitoring application performance.</p>



<p>This finely-grained management leads to greater understanding for operational teams, since they know what&#8217;s available and can progressively add metrics to manage their own services.</p>



<h3 id="9619" class="graf graf--h3 graf-after--p wp-block-heading">Beamium &amp; Noderig — The Perfect&nbsp;Fit</h3>



<p class="graf graf--p graf-after--h3">Our requirements were rather simple:<br>—&nbsp;<strong class="markup--strong markup--p-strong">Scalable</strong>: Monitor one node in the same way as we&#8217;d monitor thousands<br>—&nbsp;<strong class="markup--strong markup--p-strong">Laser-cut</strong>: Only collect the metrics that are relevant<br>—&nbsp;<strong class="markup--strong markup--p-strong">Reliable</strong>: We want metrics to be available even in the worst conditions<br>—&nbsp;<strong class="markup--strong markup--p-strong">Simple</strong>: Multiple plug-and-play components, instead of intricate ones<br>—&nbsp;<strong class="markup--strong markup--p-strong">Efficient</strong>: We believe in impact-free metrics collection</p>



<h4 id="babf" class="graf graf--h4 graf-after--p wp-block-heading">The first solution was&nbsp;Beamium</h4>



<p class="graf graf--p graf-after--h4"><a class="markup--anchor markup--p-anchor" href="https://github.com/ovh/beamium" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://github.com/runabove/beamium" data-wpel-link="external">Beamium</a>&nbsp;handles two aspects of the monitoring process: application data <strong>scraping</strong> and metrics <strong>forwarding</strong>.</p>



<p class="graf graf--p graf-after--p">Application data is collected in the well-known and widely-used&nbsp;<a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/exposition_formats/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/exposition_formats/" data-wpel-link="external"><strong>Prometheus format</strong></a><strong>.</strong> We chose Prometheus as the community was growing rapidly at the time, and many <a class="markup--anchor markup--p-anchor" href="https://prometheus.io/docs/instrumenting/clientlibs/" target="_blank" rel="nofollow noopener noreferrer external" data-href="https://prometheus.io/docs/instrumenting/clientlibs/" data-wpel-link="external">instrumentation libraries</a> were available for it. There are two key concepts in Beamium: Sources and Sinks.</p>
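<p>For reference, a Prometheus endpoint simply serves plain text in the exposition format; an illustrative counter looks like this:</p>



<pre class="wp-block-code"><code class=""># HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027</code></pre>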



<p>The Sources,&nbsp;where Beamium will scrape data, are just Prometheus HTTP endpoints. This means it&#8217;s as simple as supplying the HTTP endpoint, and eventually adding a few parameters. This data will be routed to Sinks, which allows us to filter them during the routing process between a Source and a Sink. Sinks are Warp 10(R) endpoints, where we can push the data.</p>



<p class="graf graf--p graf-after--p">Once scraped, metrics are first stored on disk, before being routed to a Sink. The Disk Fail-Over (DFO) mechanism allows for recovery from network or remote failures. This way, we retain the Prometheus pull logic locally, but in a decentralised fashion, and reverse it into a push to feed the platform, which has many advantages:</p>



<ul class="wp-block-list"><li>support for a transactional logic over the metrics platform</li><li>recovery from network partitioning or platform unavailability</li><li>dual writes with data consistency (as there&#8217;s otherwise no guarantee that two Prometheus instances would scrape the same data at the same timestamp)</li></ul>
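<p>To make the Source/Sink idea concrete, a minimal Beamium configuration might look like the sketch below. The key names, URL and token here are purely indicative; refer to the Beamium README for the authoritative configuration format:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Indicative sketch of a Beamium configuration (check the README for exact keys)
scrapers:                # Sources: Prometheus HTTP endpoints to scrape
  my-app:
    url: http://127.0.0.1:9100/metrics
    period: 10000        # scrape every 10 seconds
sinks:                   # Sinks: Warp 10 endpoints to push the data to
  observability:
    url: https://warp10.example.net/api/v0/update
    token: YOUR_WRITE_TOKEN</code></pre>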



<p>We have many different customers, some of whom use the Time Series store behind the Observability product to manage their product consumption or transactional changes over licensing. These use cases can&#8217;t be handled with Prometheus instances, which are better suited to metrics-based monitoring.</p>



<div class="wp-block-image graf graf--figure graf-after--p"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg" alt="From Prometheus to OVH Observability with Beamium" class="wp-image-14996" width="512" height="442" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1024x883.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-300x259.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-768x662.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7-1200x1035.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/1ECBCE14-CDDA-4802-9506-A20325B9FFC7.jpeg 1488w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 id="cd7f" class="graf graf--h4 graf-after--figure wp-block-heading">The second was Noderig</h4>



<p class="graf graf--p graf-after--h4">During conversations with some of our customers, we came to the conclusion that the existing tools needed a certain level of expertise if they were to be used at scale. For example, a team with a 20k node cluster with Scollector would end up with more than 10 million metrics, just for the nodes&#8230; In fact, depending on the hardware configuration, Scollector would generate between 350 and 1,000 metrics from a single node.</p>



<p class="graf graf--p graf-after--h4">That&#8217;s the reason behind <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Noderig</a>. We wanted it to be as simple to use as the node-exporter from Prometheus, but with more finely-grained metrics production as the default.</p>



<p>Noderig collects OS metrics (CPU, memory, disk, and network) using a simple level semantic. This allows you to collect the right amount of metrics for any kind of host, which is particularly suitable for containerized environments.</p>
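<p>As an indicative sketch of this level semantic (see the Noderig README for the exact format and level values), each collector is assigned a verbosity level in a small config file, so a containerised host might only need:</p>



<pre class="wp-block-code"><code lang="yaml" class="language-yaml"># Indicative Noderig config sketch: one verbosity level per collector
cpu: 1    # aggregated CPU usage only
mem: 1
load: 1
disk: 1
net: 1</code></pre>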



<p class="graf graf--p graf-after--p">We made it compatible with Scollector&#8217;s custom collectors to ease the migration process, and allow for extensibility. External collectors are simple executables that act as providers for data that is collected by Noderig, as with any other metrics.</p>



<p class="graf graf--p graf-after--p">The collected metrics are available through a simple rest endpoint, allowing you to see your metrics in real-time, and easily integrate them with Beamium.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg" alt="Noderig and Beamium" class="wp-image-14998" width="512" height="364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1024x728.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-300x213.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-768x546.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF-1200x853.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/A5F0A98F-BBAA-4C23-BCA2-7ACD8012D8CF.jpeg 2039w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">Does it work?</h3>



<p class="graf graf--p graf-after--h3">Beamium and Noderig are extensively used at OVH, and support the monitoring of very large infrastructures. At the time of writing, we collect and store hundreds of millions of metrics using these tools. So they certainly seem to work!</p>



<p class="graf graf--p graf-after--h3">In fact, we&#8217;re currently working on the 2.0 release, which will be a rework, incorporating autodiscovery and hot reload.</p>



<h3 id="a936" class="graf graf--h4 graf-after--p wp-block-heading">Stay in&nbsp;touch</h3>



<p class="graf graf--p graf-after--h4 graf--trailing">For any questions, feel free to <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">join our Gitter</a>!<br>Follow us on Twitter: @OVH</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fmonitoring-guidelines-for-ovh-observability%2F&amp;action_name=Monitoring%20guidelines%20for%20OVH%20Observability&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL: a developer-friendly Time Series query language for all our metrics</title>
		<link>https://blog.ovhcloud.com/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 13 Feb 2019 13:11:28 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14445</guid>

					<description><![CDATA[At the Metrics team we have been working on time series for several years. From our experience the data analytics capabilities of a Time Series Database (TSDB) platform is a key factor to create value from your metrics. And these analytics capabilities are mostly defined by the query languages they support. 

TSL stands for Time Series Language. In a few words, TSL is an abstracted way, in the form of an HTTP proxy, to generate queries for different TSDB backends. Currently it supports Warp 10's WarpScript and Prometheus' PromQL query languages, but we aim to extend the support to other major TSDBs.


To better understand why we created TSL, we review some of the TSDB query languages supported on the OVH Metrics Data Platform. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to simplify the querying on our platform, before open-sourcing it for use with any TSDB solution. 

Why did we decide to invest some of our time in such a proxy? Let me tell you the story of the OVH Metrics protocol!<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-a-developer-friendly-time-series-query-language-for-all-our-metrics%2F&amp;action_name=TSL%3A%20a%20developer-friendly%20Time%20Series%20query%20language%20for%20all%20our%20metrics&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>At the Metrics team, we have been working on Time Series for several years now. In our experience, the <strong>data analytics capabilities</strong> of a Time Series Database (TSDB) platform are a key factor in <strong>creating value</strong> from your metrics. These analytics capabilities are mostly defined by the <strong>query languages</strong> they support. </p>



<p>TSL stands for <strong>Time Series Language</strong>. In simple terms, <a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is an abstracted way of generating queries for <strong>different TSDB backends</strong>, in the form of an HTTP proxy. It currently supports Warp 10&#8217;s WarpScript and  Prometheus&#8217; PromQL query languages, but we aim to extend the support to other major TSDBs.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><em>To provide some context around why we created TSL, it began with a review of some of the <strong>TSDB query languages</strong> supported on the&nbsp;<strong>OVH Metrics Data Platform</strong>. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to <strong>simplify the querying</strong> on our platform, before open-sourcing it to use it on any TSDB solution.&nbsp;</em></p>



<p><em>So why did we decide to invest our time in developing such a proxy? Well, let me tell you <strong>the story of the OVH Metrics protocol</strong>!</em></p>



<h3 class="wp-block-heading">From OpenTSDB&#8230;</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="80" src="/blog/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png" alt="OpenTSDB" class="wp-image-14507" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-768x205.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2.png 810w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>The first aim of our platform is to be able to support the OVH infrastructure and application monitoring. When this project started, a lot of people were using <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a>, and were familiar with its query syntax. <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a> is a scalable database for Time Series. The <a href="http://opentsdb.net/docs/build/html/user_guide/query/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenTSDB query</a> syntax is easy to read, as you send a JSON document describing the request. The document below will load all <code>sys.cpu.0</code> metrics of the <code>test</code> datacentre, summing them between the <code>start</code> and <code>end</code> dates:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1356998400,
    "end": 1356998460,
    "queries": [
        {
            "aggregator": "sum",
            "metric": "sys.cpu.0",
            "tags": {
                "host": "*",
                "dc": "test"
            }
        }
    ]
}</code></pre>



<p>This enables the quick retrieval of specific data, in a specific time range. At OVH, this was used for graphing purposes, in conjunction with Grafana, and helped us to spot potential issues in real time, as well as investigate past events. OpenTSDB integrates simple queries, where you can define your own sampling and deal with counter data, as well as filtered and aggregated raw data.</p>



<p>OpenTSDB was the first protocol supported by the Metrics team, and is still widely used today. Internal statistics show that 30-40% of our traffic is based on OpenTSDB queries. A lot of internal use cases can still be entirely resolved with this protocol, and the queries are easy to write and understand.</p>



<p>For example, a query with OpenTSDB to get the max value of <code>usage_system</code> for the CPUs 0 to 9, sampled into 2-minute spans by averaging their values, looks like this:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1535797890,
    "end": 1535818770,
    "queries":  [{
        "metric":"cpu.usage_system",
        "aggregator":"max",
        "downsample":"2m-avg",
        "filters": [{
            "type":"regexp",
            "tagk":"cpu",
            "filter":"cpu[0-9]+",
            "groupBy":false
            }]
        }]
}</code></pre>



<p>However, OpenTSDB quickly shows its limitations, and some specific use cases can&#8217;t be resolved with it. For example, you can&#8217;t apply any operations directly on the back-end. You have to load the data on an external tool and use it to apply any analytics.</p>



<p>One of the main areas where OpenTSDB (version 2.3) is lacking is operators over multiple Time Series sets, which allow operations such as dividing one series by another. Such operators are a useful way to compute the individual time per request, when you have (for example) a series of the total time spent serving requests and a series of the total request count. That’s one of the reasons why the OVH Metrics Data Platform supports other protocols.</p>
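<p><em>As an illustration of the missing operator, here is a minimal Python sketch (not OpenTSDB code) of dividing one series by another to derive a mean time per request; joining only on timestamps present in both series is a simplifying assumption.</em></p>

```python
def divide_series(numerator, denominator):
    """Point-by-point division of two aligned time series.

    Each series is a dict mapping timestamp -> value; only timestamps
    present in both series (with a non-zero denominator) produce a result.
    """
    return {ts: numerator[ts] / denominator[ts]
            for ts in sorted(numerator.keys() & denominator.keys())
            if denominator[ts] != 0}

# Total time spent serving requests (ms) and request counts, per minute
total_time = {60: 1200.0, 120: 2400.0, 180: 900.0}
request_count = {60: 40, 120: 60, 180: 30}

# Mean time per request for each minute
mean_latency = divide_series(total_time, request_count)
```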



<h2 class="wp-block-heading">&#8230; to PromQL</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="150" height="150" src="/blog/wp-content/uploads/2019/02/790C8FFD-E734-475E-BF7E-B93F0708C604-150x150.png" alt="Prometheus" class="wp-image-14503"/></figure></div>



<p>The second protocol we worked on was <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL</a>, the query language of the <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> Time Series database. When we made that choice in 2015, <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> was gaining some traction, and it still has an impressive adoption rate. But if Prometheus is a success, it isn&#8217;t because of its query language, PromQL. This language never took off internally, although it has started to gain some adoption recently, mainly due to the arrival of people who worked with Prometheus at their previous companies. Internally,&nbsp;<strong>PromQL queries represent about 1-2%</strong> of our daily traffic. The main reasons are that a lot of simple use cases can be solved quickly, and with more control of the raw data, with OpenTSDB queries, while many more complex use cases cannot be solved with PromQL. A similar request to the one defined in OpenTSDB would be:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">api/v1/query_range?
query=max(cpu.usage_system{cpu=~"cpu[0-9]%2B"})&amp;
start=1535797890&amp;
end=1535818770&amp;
step=2m</code></pre>



<p>With PromQL, <strong>you lose control of how you sample the data</strong>, as the only operator is <strong>last</strong>. This means that if (for example) you want to downsample your series with a 5-minute duration, you are only able to keep the last value of each 5-minute series span. In contrast, all competitors include a range of operators. For example, with OpenTSDB, you can choose between several operators, including average, count, standard deviation, first, last, percentiles, minimum, maximum, or the sum of all values inside your defined span.</p>
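<p><em>The difference can be sketched in a few lines of Python (a toy model, not Prometheus internals): the bucketing step is the same, but PromQL fixes the aggregator to &#8220;last&#8221;, whereas OpenTSDB lets the caller choose it.</em></p>

```python
from statistics import mean

def downsample(points, span, aggregator):
    """Group (timestamp, value) points into fixed spans, then aggregate each bucket."""
    buckets = {}
    for ts, value in points:  # points are assumed sorted by timestamp
        buckets.setdefault(ts - ts % span, []).append(value)
    return {start: aggregator(vals) for start, vals in sorted(buckets.items())}

points = [(0, 1.0), (60, 5.0), (310, 2.0), (420, 8.0)]

# PromQL-style: only the last value of each 5-minute (300 s) span survives
last_only = downsample(points, 300, lambda vals: vals[-1])

# OpenTSDB-style: any aggregator, e.g. the average of each span
averaged = downsample(points, 300, mean)
```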



<p>In the end, a lot of people choose to use a much more complex method: WarpScript, which is powered by the Warp 10 Analytics Engine we use behind the scenes.</p>



<h2 class="wp-block-heading">Our internal adoption of WarpScript</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="106" src="/blog/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png" alt="WarpScript by SenX" class="wp-image-14505" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9.png 837w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/content/03_Documentation/04_WarpScript/01_Concepts" target="_blank" data-wpel-link="external">WarpScript</a> is the current Time Series language of <a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/" target="_blank" data-wpel-link="external">Warp 10(R)</a>, our underlying backend. WarpScript will help for any complex Time Series use case, and solves numerous real-world problems, as you have full control of all your operations. You have dedicated frameworks of functions to sample raw data and fill missing values. You also have frameworks to apply operations on single-value or window operations. You can apply operations on multiple Time Series sets, and have dedicated functions to manipulate Time Series times, statistics, etc.</p>



<p>It works with a Reverse Polish Notation (like a good, old-fashioned HP48, for those who&#8217;ve got one!), and simple use cases can be easy to express. But when it comes to analytics, while it certainly solves problems, it’s still complex to learn. In particular, Time Series use cases are complex and require a particular way of thinking, and WarpScript helped to solve a lot of hard ones.</p>



<p>This is why it&#8217;s still the main query language used at OVH on the OVH Metrics platform, with <strong>nearly 60%</strong> of internal queries making use of it. The same request that we just computed in OpenTSDB and PromQL would be as follows in WarpScript:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">[ "token" "cpu.average" { "cpu" "~cpu[0–9]+" } NOW 2 h ] FETCH
[ SWAP bucketizer.mean 0 2 m 0 ] BUCKETIZE
[ SWAP [ "host" ] reducer.max ] REDUCE</code></pre>



<p>A lot of users find it hard to learn WarpScript at first, but after solving their initial issues with some (sometimes a lot of) support, it becomes&nbsp;the first step of their Time Series adventure. Later, they figure out some new ideas about how they can gain knowledge from their metrics. They then come back with many demands and questions about their daily issues, some of which can be solved quickly, with their own knowledge and experience.</p>



<p>What we learned from WarpScript is that it’s a fantastic tool with which to build analytics for our Metrics data. We pushed many complex use cases with advanced signal-processing algorithms like <a href="https://www.warp10.io/doc/LTTB#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LTTB</a>, <a href="https://www.warp10.io/tags/outlier" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Outliers</a> or <a href="https://www.warp10.io/doc/PATTERNDETECTION#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Patterns</a> detections, and <a href="https://en.wikipedia.org/wiki/Kernel_smoother" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kernel Smoothing,</a> where it proved to be a real enabler. However, it proved quite expensive to support for basic requirements, and feedback indicated the syntax and overall complexity were big concerns.</p>



<p>A WarpScript can involve dozens (or even hundreds) of lines, and a successful execution is often an accomplishment, with the special feeling that comes from having made full use of one&#8217;s brainpower. In fact, an inside joke amongst our team is that being able to write a WarpScript in a single day earns you a WarpScript Pro Gamer badge! That&#8217;s why we&#8217;ve distributed Metrics t-shirts to users who have achieved significant successes with the Metrics Data Platform.</p>



<p>We liked the WarpScript semantic, but we wanted it to have a significant impact on a broader range of use cases. This is why we started to write TSL, with a few simple goals:</p>



<ul class="wp-block-list"><li>Offer a clear Time Series analytics semantic</li><li>Simplify the writing, making it developer-friendly</li><li>Support data flow queries and ease debugging for complex queries</li><li>Don&#8217;t try to be the ultimate toolbox. Keep it simple.</li></ul>



<p>We know that users will probably have to switch back to WarpScript every so often. However, we hope that using TSL will simplify their learning curve. TSL is simply a new step in the Time Series adventure!</p>



<h3 class="wp-block-heading">The path to&nbsp;TSL</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is the result of three years of Time Series analytics support, and offers a functional Time Series Language. The aim of TSL is to build a Time Series data flow as code. </p>



<p>With TSL, native methods, such as <code>select</code> and <code>where</code>, exist to choose which metrics to work on. Then, as Time Series data is time-related, we have to use a time selector method on the selected metadata. The two available methods are <code>from</code> and <code>last</code>. The vast majority of the other TSL methods take Time Series sets as input and provide Time Series sets as the result. For example, you have methods that only select values above a specific threshold, compute rates, and so on. We have also included specific operations to apply to multiple subsets of Time Series sets, such as additions or multiplications.</p>



<p>In addition, to make the language more readable, you can define variables to store Time Series queries and reuse them in your script any time you wish. For now, we support only a few native types, such as <code>Numbers</code>, <code>Strings</code>, <code>Time durations</code>, <code>Lists</code>, and <code>Time Series</code> (of course!).</p>



<p>Finally, the same query used throughout this article will be as follows in TSL:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">select("cpu.usage_system")
.where("cpu~cpu[0-9]+")
.last(12h)
.sampleBy(2m,mean)
.groupBy(max)</code></pre>



<p>You can also write more complex queries. For example, we condensed our WarpScript hands-on, designed to detect <a href="https://helloexoworld.github.io/hew-hands-on/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">exoplanets from NASA raw data</a>, into a single TSL query:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">sample = select('sap.flux')
 .where('KEPLERID=6541920')
 .from("2009-05-02T00:56:10.000000Z", to="2013-05-11T12:02:06.000000Z")
 .timesplit(6h,100,"record")
 .filterByLabels('record~[2-5]')
 .sampleBy(2h, min, false, "none")

trend = sample.window(mean, 5, 5)

sub(sample,trend)
 .on('KEPLERID','record')
 .lessThan(-20.0)</code></pre>



<p>So what did we do here? First we instantiated a <strong>sample</strong> variable, in which we loaded the ‘sap.flux’ raw data of one star, KEPLERID 6541920. We then cleaned the series using the timesplit function (to split the star series wherever there is a hole in the data longer than 6h), keeping only four records. Finally, we sampled the result, keeping the minimal value of each 2-hour bucket.</p>



<p>We then used this result to compute the series trend, using a moving average of 10 hours.</p>



<p>To conclude, the query returns only the points below -20.0 in the series obtained by subtracting the trend from the sample series.</p>
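<p><em>The same three steps (sample, trend, residual threshold) can be pictured with a short Python sketch; the toy flux values and the centered moving-average window are illustrative assumptions, not the actual Kepler data or the exact TSL semantics.</em></p>

```python
def moving_average(values, before, after):
    """Centered moving average: `before` points back, `after` points forward."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - before): i + after + 1]
        out.append(sum(window) / len(window))
    return out

def dips_below(sample, trend, threshold):
    """Indices where (sample - trend) drops below the threshold."""
    return [i for i, (s, t) in enumerate(zip(sample, trend))
            if s - t < threshold]

# Toy flux series with one sharp dip, mimicking a transit-like event
flux = [100.0] * 10
flux[5] = 60.0  # the dip

trend = moving_average(flux, 5, 5)
transits = dips_below(flux, trend, -20.0)  # indices of candidate transits
```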



<h3 class="wp-block-heading">TSL is Open Source</h3>



<p>Even if our first community of users was mostly inside OVH, we&#8217;re pretty confident that TSL can be used to solve a lot of Time Series use cases.</p>



<p>We are currently beta testing TSL on our OVH Metrics public platform. Furthermore, TSL is <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">open-sourced on Github</a>, so you can also test it on your own platforms.</p>



<p>We would love to get your feedback or comments on TSL, or Time Series in general. We&#8217;re available on the <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVH Metrics gitter,</a> and you can find out more about TSL &nbsp;in <a href="https://labs.ovh.com/metrics-beta-features/" data-wpel-link="exclude">our Beta features documentation</a>.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-a-developer-friendly-time-series-query-language-for-all-our-metrics%2F&amp;action_name=TSL%3A%20a%20developer-friendly%20Time%20Series%20query%20language%20for%20all%20our%20metrics&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Handling OVH&#8217;s alerts with Apache Flink</title>
		<link>https://blog.ovhcloud.com/handling-ovhs-alerts-with-apache-flink/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Thu, 31 Jan 2019 09:01:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Omni]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14337</guid>

					<description><![CDATA[OVH relies extensively on metrics to effectively monitor its entire stack. Whenever they are low-level or business centric, they allow teams to gain insight into how our services are operating on a daily basis. The need to store millions of datapoints per second has produced the need to create a dedicated team to build a operate a product to handle that load: Metrics Data Platform. By relying on Apache Hbase, Apache Kafka and Warp 10, we succeeded in creating a fully distributed platform that is handling all our metrics... and yours!

After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: alerting. 
<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhandling-ovhs-alerts-with-apache-flink%2F&amp;action_name=Handling%20OVH%26%238217%3Bs%20alerts%20with%20Apache%20Flink&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>OVH relies extensively on <strong>metrics</strong> to effectively monitor its entire stack. Whether they are <strong>low-level</strong> or <strong>business</strong> centric, they allow teams to gain <strong>insight</strong> into how our services are operating on a daily basis. The need to store <strong>millions of datapoints per second</strong> has produced the need to create a dedicated team to build and operate a product to handle that load: <strong><a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a>.</strong> By relying on <strong><a href="https://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hbase</a>, <a href="https://kafka.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Kafka</a></strong> and <a href="https://www.warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Warp 10</strong></a>, we succeeded in creating a fully distributed platform that is handling all our metrics&#8230; and yours! </p>



<p>After building the platform to deal with all those metrics, our next challenge was to build one of the most needed features for Metrics: <strong>alerting</strong>.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="885" height="290" src="https://www.ovh.com/blog/wp-content/uploads/2019/01/001-1.png" alt="OVH &amp; Apache Flink" class="wp-image-14367" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-300x98.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/001-1-768x252.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure>



<h3 class="wp-block-heading" id="6c01">Meet OMNI, our alerting&nbsp;layer</h3>



<p>OMNI is our code name for a&nbsp;<strong>fully distributed</strong>,&nbsp;<strong>as-code</strong>,&nbsp;<strong>alerting</strong>&nbsp;system that we developed on top of Metrics. It is split into two components:</p>



<ul class="wp-block-list"><li><strong>The management part</strong>, which takes your alert definitions from a Git repository and represents them as continuous queries,</li><li><strong>The query executor</strong>, which schedules your queries in a distributed way.</li></ul>



<p>The query executor pushes the query results into Kafka, ready to be handled! We now need to perform all the tasks that an alerting system does:</p>



<ul class="wp-block-list"><li>Handling alert&nbsp;<strong>deduplication</strong>&nbsp;and&nbsp;<strong>grouping</strong>, to avoid&nbsp;<a href="https://en.wikipedia.org/wiki/Alarm_fatigue" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">alert fatigue</a>.</li><li>Handling&nbsp;<strong>escalation</strong>&nbsp;steps,&nbsp;<strong>acknowledgement</strong>&nbsp;or&nbsp;<strong>snooze</strong>.</li><li><strong>Notifying</strong>&nbsp;the end user through different&nbsp;<strong>channels</strong>: SMS, mail, push notifications,&nbsp;…</li></ul>
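<p><em>As a toy illustration of the deduplication/grouping step (not Beacon&#8217;s actual code), alerts can be keyed by a few labels so that related alerts collapse into a single notification; the label names below are invented.</em></p>

```python
from collections import OrderedDict

def group_alerts(alerts, group_labels):
    """Group raw alerts by the chosen labels; each group yields one notification."""
    groups = OrderedDict()
    for alert in alerts:
        key = tuple(alert.get(label) for label in group_labels)
        groups.setdefault(key, []).append(alert)
    return groups

alerts = [
    {"name": "high_cpu", "host": "web-1", "team": "metrics"},
    {"name": "high_cpu", "host": "web-2", "team": "metrics"},
    {"name": "disk_full", "host": "db-1", "team": "storage"},
]

# Grouping by (name, team): the two high_cpu alerts collapse into one notification
notifications = group_alerts(alerts, ("name", "team"))
```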



<p>To handle that, we looked at open-source projects, such as&nbsp;<a href="https://github.com/prometheus/alertmanager" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus AlertManager</a>&nbsp;and <a href="https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">LinkedIn Iris</a>, and discovered the&nbsp;<em>hidden</em>&nbsp;truth:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>Handling alerts as streams of data,<br>moving from one operator to&nbsp;another.</p></blockquote>



<p>We embraced it, and decided to leverage <a href="https://flink.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Flink</a> to create&nbsp;<strong>Beacon</strong>. In the next section we are going to describe the architecture of Beacon, and how we built and operate it.</p>



<p>If you want some more information on Apache Flink, we suggest reading the introduction article on the official website:&nbsp;<a href="https://flink.apache.org/flink-architecture.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">What is Apache Flink?</a></p>



<h3 class="wp-block-heading" id="6caa">Beacon architecture</h3>



<p>At its core, Beacon reads events from&nbsp;<strong>Kafka</strong>. Everything is represented as a&nbsp;<strong>message</strong>, from alerts to aggregation rules, snooze orders and so on. The pipeline is divided into two branches:</p>



<ul class="wp-block-list"><li>One that is running the&nbsp;<strong>aggregations</strong>, and triggering notifications based on customer’s rules.</li><li>One that is handling the&nbsp;<strong>escalation steps</strong>.</li></ul>



<p>Then everything is merged to&nbsp;<strong>generate a notification</strong>, which is then forwarded to the right person. A notification message is pushed into Kafka, where it will be consumed by another component called&nbsp;<strong>beacon-notifier</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="470" src="/blog/wp-content/uploads/2019/01/002.png" alt="Beacon architecture" class="wp-image-14349" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/002.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/002-768x408.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>If you are new to streaming architecture, I recommend reading&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/programming-model.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Dataflow Programming Model</a>&nbsp;from Flink official documentation.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="616" src="/blog/wp-content/uploads/2019/01/003.png" alt="Handling state" class="wp-image-14350" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/003.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/003-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /></figure></div>



<p>Everything is merged into a DataStream,&nbsp;<strong>partitioned</strong>&nbsp;(<a href="https://medium.com/r/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.7%2Fdev%2Fstream%2Fstate%2Fstate.html%23keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">keyed by&nbsp;</a>in the Flink API) by users. Here&#8217;s an example:</p>



<pre class="wp-block-code"><code lang="java" class="language-java">final DataStream&lt;Tuple4&lt;PlanIdentifier, Alert, Plan, Operation>> alertStream =

  // Partitioning Stream per AlertIdentifier
  cleanedAlertsStream.keyBy(0)
  // Applying a Map Operation which is setting since when an alert is triggered
  .map(new SetSinceOnSelector())
  .name("setting-since-on-selector").uid("setting-since-on-selector")

  // Partitioning again Stream per AlertIdentifier
  .keyBy(0)
  // Applying another Map Operation which is setting State and Trend
  .map(new SetStateAndTrend())
  .name("setting-state").uid("setting-state");</code></pre>



<ul class="wp-block-list"><li><strong>SetSinceOnSelector</strong>, which sets&nbsp;<strong>since</strong>&nbsp;when the alert has been triggered,</li><li><strong>SetStateAndTrend</strong>, which sets the&nbsp;<strong>state</strong>&nbsp;(ONGOING, RECOVERY or OK) and the&nbsp;<strong>trend</strong> (whether we have more or fewer metrics in error).</li></ul>



<p>Each of these classes is under 120 lines of code, because Flink&nbsp;<strong>handles all the difficulties</strong>. Most of the pipeline is&nbsp;<strong>composed only</strong>&nbsp;of&nbsp;<strong>classic transformations</strong>&nbsp;such as&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Map, FlatMap, Reduce</a>, including their&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/api_concepts.html#rich-functions" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Rich</a>&nbsp;and&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#using-managed-keyed-state" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Keyed</a>&nbsp;versions. We have a few&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/process_function.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Process Functions</a>, which are&nbsp;<strong>very handy</strong>&nbsp;for developing, for example, the escalation timer.</p>
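<p>To give an idea of what such an escalation timer has to compute, here is a plain-Java sketch of the step calculation. In a real Flink Process Function this would be driven by registered timers rather than computed on demand, and the delays below are made-up values.</p>

```java
import java.time.Duration;
import java.util.List;

// Illustrative sketch: given when an alert was triggered and the configured
// escalation delays, compute which escalation step has been reached.
// The real implementation relies on Flink timers; this is plain Java.
public class EscalationSketch {

    public static int escalationStep(long triggeredAtMs, long nowMs, List<Duration> delays) {
        long elapsed = nowMs - triggeredAtMs;
        int step = 0;
        long cumulative = 0;
        for (Duration d : delays) {
            cumulative += d.toMillis();
            if (elapsed >= cumulative) {
                step++; // this escalation level has fired
            }
        }
        return step;
    }

    public static void main(String[] args) {
        List<Duration> delays = List.of(Duration.ofMinutes(5), Duration.ofMinutes(15));
        // 10 minutes after the trigger: the first escalation has fired, the second not yet.
        System.out.println(escalationStep(0, Duration.ofMinutes(10).toMillis(), delays));
    }
}
```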



<h3 class="wp-block-heading" id="a77e">Integration tests</h3>



<p>As the number of classes grew, we needed to test our pipeline. Because it is only wired to Kafka, we wrapped the consumer and producer to create what we call&nbsp;<strong>scenari</strong>: a series of integration tests running different scenarios.</p>
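<p>A minimal sketch of the idea, where a hypothetical harness replaces Kafka with in-memory lists (the real tests wrap actual consumers and producers, and the event types here are simplified to strings):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical scenario harness: feed a list of input events through a
// pipeline transformation and collect the outputs, instead of going
// through Kafka. Purely illustrative of the testing approach.
public class ScenarioSketch {

    public static List<String> runScenario(List<String> inputs, Function<String, String> pipeline) {
        List<String> outputs = new ArrayList<>();
        for (String event : inputs) {
            outputs.add(pipeline.apply(event));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Scenario: every raw event should come out tagged with a state.
        List<String> out = runScenario(List.of("cpu-high", "disk-full"), e -> e + ":ONGOING");
        System.out.println(out);
    }
}
```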



<h3 class="wp-block-heading" id="5f8f">Queryable state</h3>



<p>One killer feature of Apache Flink is the&nbsp;<strong>capability of&nbsp;</strong><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/queryable_state.html" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external"><strong>querying the internal state</strong></a><strong>&nbsp;of an operator</strong>. Even though it is a beta feature, it allows us to get the current state of the different parts of the job:</p>



<ul class="wp-block-list"><li>which escalation step we are on</li><li>whether the alert is snoozed or <em>ack</em>-ed</li><li>which alerts are ongoing</li><li>and so on.</li></ul>
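<p>Conceptually, queryable state behaves like a read-only lookup into the operator&#8217;s keyed state, which Flink exposes over the network. The tiny in-memory model below is only an illustration of that contract; the class and field names are invented and are not OVH&#8217;s actual schema.</p>

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of queryable state: the job keeps per-alert state keyed by an
// identifier, and an external API layer can read (but not modify) it.
public class QueryableStateModel {

    public static class AlertView {
        public final String state;
        public final int escalationStep;
        public final boolean snoozed;

        public AlertView(String state, int escalationStep, boolean snoozed) {
            this.state = state;
            this.escalationStep = escalationStep;
            this.snoozed = snoozed;
        }
    }

    private static final Map<String, AlertView> keyedState = new ConcurrentHashMap<>();

    // The job updates its state as events flow through the operators.
    public static void update(String alertId, AlertView view) {
        keyedState.put(alertId, view);
    }

    // The API layer queries the state without touching the job itself.
    public static Optional<AlertView> query(String alertId) {
        return Optional.ofNullable(keyedState.get(alertId));
    }

    public static void main(String[] args) {
        update("alert-42", new AlertView("ONGOING", 1, false));
        System.out.println(query("alert-42").orElseThrow().state);
    }
}
```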



<div class="wp-block-image size-full wp-image-14361"><figure class="aligncenter"><img loading="lazy" decoding="async" width="885" height="617" src="/blog/wp-content/uploads/2019/01/004-1.png" alt="Queryable state overview" class="wp-image-14361" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1.png 885w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-300x209.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/004-1-768x535.png 768w" sizes="auto, (max-width: 885px) 100vw, 885px" /><figcaption>Queryable state overview</figcaption></figure></div>



<p>Thanks to this, we easily developed an&nbsp;<strong>API</strong>&nbsp;over the queryable state, which powers the&nbsp;<strong>alerting view</strong>&nbsp;in&nbsp;<a href="https://studio.metrics.ovh.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Metrics Studio</a>, our codename for the web UI of the Metrics Data Platform.</p>



<h3 class="wp-block-heading" id="1bc7">Apache Flink deployment</h3>



<p>We deployed the latest version of Flink (<strong>1.7.1</strong>&nbsp;at the time of writing) directly on bare-metal servers, with a dedicated ZooKeeper cluster, using Ansible. Operating Flink has been a really nice surprise for us, with&nbsp;<strong>clear documentation and configuration</strong>&nbsp;and&nbsp;<strong>impressive resilience</strong>. We can&nbsp;<strong>reboot</strong>&nbsp;the whole Flink cluster, and the job&nbsp;<strong>restarts from its last saved state</strong>, as if nothing had happened.</p>



<p>We are using&nbsp;<strong>RocksDB</strong>&nbsp;as a state backend, backed by the OpenStack&nbsp;<strong>Swift storage</strong>&nbsp;provided by OVH Public Cloud.</p>
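<p>In Flink&#8217;s <code>flink-conf.yaml</code>, such a setup boils down to a handful of options. This is a sketch, not our actual configuration: the Swift container, paths and ZooKeeper hosts below are placeholders.</p>

```yaml
# flink-conf.yaml (illustrative values only)
state.backend: rocksdb
state.checkpoints.dir: swift://flink-state.provider/checkpoints
state.savepoints.dir: swift://flink-state.provider/savepoints
high-availability: zookeeper
high-availability.storageDir: swift://flink-state.provider/ha
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
```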



<p>For monitoring, we are relying on the&nbsp;<a href="https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus Exporter</a>&nbsp;with&nbsp;<a href="https://github.com/ovh/beamium" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Beamium</a>&nbsp;to gain&nbsp;<strong>observability</strong>&nbsp;over the job&#8217;s health.</p>



<h3 class="wp-block-heading" id="8d7c">In short, we love Apache&nbsp;Flink!</h3>



<p>If you are used to working with stream-related software, you may have noticed that we did not use any rocket science or tricks. We may only be relying on the basic streaming features offered by Apache Flink, but they allowed us to tackle many business and scalability problems with ease.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png" alt="Apache Flink" class="wp-image-14354" width="437" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77.png 874w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-300x275.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/01/0F28C7F7-9701-4C19-BAFB-E40439FA1C77-768x703.png 768w" sizes="auto, (max-width: 437px) 100vw, 437px" /></figure></div>



<p>As such, we highly recommend that every developer take a look at Apache Flink. I encourage you to go through the <a href="https://medium.com/r/?url=https%3A%2F%2Ftraining.da-platform.com%2F" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache Flink Training</a>, written by Data Artisans.&nbsp;Furthermore, the community has put a lot of effort into making Apache Flink easy to deploy to Kubernetes, so you can easily try Flink using our&nbsp;Managed Kubernetes!</p>



<h3 class="wp-block-heading">What’s next?</h3>



<p>Next week we will come back to Kubernetes, as we explain how we deal with etcd in our OVH <a href="https://www.ovh.com/fr/kubernetes/" data-wpel-link="exclude">Managed Kubernetes service</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
