<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Time series Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/time-series/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/time-series/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Fri, 08 Jan 2021 10:54:57 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Time series Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/time-series/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Erlenmeyer and PromQL compatibility</title>
		<link>https://blog.ovhcloud.com/erlenmeyer-and-promql-compatibility/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 16 Dec 2020 11:03:27 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20123</guid>

					<description><![CDATA[Today in the monitoring world, we see the rise of the Prometheus tool. It&#8217;s a great tool to deploy in your infrastructure, as it allows you to scrape all of your servers or applications to retrieve, store and analyze their metrics. And all you have to do is extract and run it; it does [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Today in the monitoring world, we see the rise of the <a href="https://prometheus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Prometheus</a> tool. It&#8217;s a great tool to deploy in your infrastructure, as it allows you to scrape all of your servers or applications to retrieve, store and analyze their metrics. And all you have to do is extract and run it; it does all the work by itself. Of course, Prometheus comes with some trade-offs (the pull model, handling late ingestion) and some limits, as it only retains your data for a couple of days.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0404-1024x537.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20266" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0404.png 1200w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Context</h2>



<p>How can we handle Prometheus long-term storage? A vast number of time series databases are now advertised as fully compatible with Prometheus. It&#8217;s easy to check that Prometheus ingestion works well; however, how can we validate the PromQL &#8211; or Prometheus query &#8211; part? A few months ago, <a href="https://promlabs.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromLabs</a> released a new tool called &#8220;<a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a>&#8221;. They recently created <a href="https://promlabs.com/promql-compliance-tests" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">this page</a>, where they reference the PromQL compliance test results of several products. In this blog post, we will see how this tool helped us improve our PromQL implementation. </p>



<h3 class="wp-block-heading">Compliance tester</h3>



<p>The <a href="https://github.com/promlabs/promql-compliance-tester" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL compliance tester</a> is open source and ships with a full set of tests. When run, it generates around 500 PromQL queries covering the vast majority of the language, including tests on simple scalars, selectors, time range functions, operators, and so on. The tool executes each query on both a reference Prometheus instance and the tested backend, and expects both to return the same result. It requires an exact match for all series metadata (labels and names). It is more flexible on timestamps, as a parameter lets you round them to the millisecond. Finally, the compliance tool checks that both query values are equal; since many things can affect floating-point reproducibility, it computes an approximate equality. </p>
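<p>As a rough illustration of that approximate equality check, here is a minimal Go sketch (this is not the tester&#8217;s actual algorithm, just the idea of a relative tolerance):</p>

```go
package main

import (
	"fmt"
	"math"
)

// approxEqual reports whether two sample values match within a relative
// tolerance, so large counters and small gauges are judged consistently.
func approxEqual(expected, got, epsilon float64) bool {
	if expected == got {
		return true
	}
	return math.Abs(expected-got) <= epsilon*math.Abs(expected)
}

func main() {
	// A 0.02% difference fails at a 0.00001 tolerance but passes at 0.001.
	fmt.Println(approxEqual(100.0, 100.02, 0.00001)) // false
	fmt.Println(approxEqual(100.0, 100.02, 0.001))   // true
}
```

<p>Using a relative rather than absolute tolerance is what makes values of very different magnitudes comparable with a single epsilon.</p>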



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p>At Metrics, we use the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> TSDB with its own analytics query engine, <a href="https://warp10.io/content/03_Documentation/04_WarpScript" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpScript</a>. We decided to build an open source tool called <a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> to transpile PromQL queries into WarpScript. The compliance tester was a great help in validating parts of our implementation and in detecting which queries were not strictly equivalent.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="573" height="295" src="https://www.ovh.com/blog/wp-content/uploads/2020/12/IMG_0406.png" alt="Erlenmeyer and PromQL compatibility" class="wp-image-20265" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406.png 573w, https://blog.ovhcloud.com/wp-content/uploads/2020/12/IMG_0406-300x154.png 300w" sizes="(max-width: 573px) 100vw, 573px" /></figure></div>



<h3 class="wp-block-heading">Set up</h3>



<p>To start testing our PromQL support, we set up a local Prometheus with a default configuration. This configuration makes Prometheus run and collect some &#8220;demo&#8221; metrics, which we forwarded to one of our Metrics regions using Prometheus remote write. We added a local instance of Erlenmeyer to query the data stored in a distributed Warp10 backend. Then, we iterated on each set of tests of the PromLabs compliance tool to identify all the issues and improve our existing PromQL implementation. </p>



<p>To be compliant, we had to reduce the value precision required by the compliance tool, setting it to <code>0.001</code> instead of <code>0.00001</code>. We also had to remove the Warp10 <code>.app</code> label from the results, since on a Warp10 instance we identify users based on this <code>.app</code> label.</p>



<h3 class="wp-block-heading">A test query</h3>



<p>When running the tests, you get a full report of your failing queries. Let&#8217;s take an example:</p>



<pre class="wp-block-code"><code class="">RESULT: FAILED: Query returned different results:
  model.Matrix{
  	&amp;{
  		Metric: Inverse(DropResultLabels, s`{instance="demo.promlabs.com:10002", job="demo"}`),
  		Values: []model.SamplePair{
  			... // 52 identical elements
  			{Timestamp: s"1606323726.058", Value: Inverse(TranslateFloat64, float64(2.6928936527e+10))},
  			{Timestamp: s"1606323736.058", Value: Inverse(TranslateFloat64, float64(2.691644054725e+10))},
  			{
  				Timestamp: s"1606323746.058",
- 				Value:     Inverse(TranslateFloat64, float64(2.6922272529119648e+10)),
+ 				Value:     Inverse(TranslateFloat64, float64(2.689432207325e+10)),
  			},
  			{Timestamp: s"1606323756.058", Value: Inverse(TranslateFloat64, float64(2.6915188293125e+10))},
  			{Timestamp: s"1606323766.058", Value: Inverse(TranslateFloat64, float64(2.69215848005e+10))},
  			... // 4 identical elements
  		},
  	},
  }</code></pre>



<p>The test report includes all errors that occurred during the run. In this example, we can see that, for a single series, all values but one are correct. The invalid value appears on two lines: the one starting with &#8220;-&#8221; is the expected value, and the one starting with &#8220;+&#8221; is the value returned by the tested instance. In this case, our value isn&#8217;t precise enough (2.6894e+10 instead of 2.6922e+10).</p>



<h2 class="wp-block-heading">Results</h2>



<p>Now that we have a full test set-up running, let&#8217;s see what we improved thanks to its results. If you want the full details of the fixes, you can check the code update made <code><a href="https://github.com/ovh/erlenmeyer/pull/48" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></code>. This tool helped us fix parts of our implementation, sanitize known issues, identify PromQL features we had missed, and detect a few new bugs! Let&#8217;s review the changes.</p>



<h3 class="wp-block-heading">Quick implementation fixes</h3>



<p>Running those tests helped us understand some of the implementation errors we had made when trying to match PromQL behavior. For example, our time range functions sampled the data before computing the operation; reversing those steps gave us a direct match with native queries. It also helped us fix some minor bugs in how we handle comparison operators, and in several functions such as label_replace, holt_winters, predict_linear, and the full set of time functions (hour, minute, month&#8230;). </p>
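<p>The ordering issue mentioned above is easy to reproduce. In this illustrative Go sketch (toy data, not Erlenmeyer code), applying an aggregation after downsampling gives a different answer than aggregating the raw points of the range, which is what PromQL does:</p>

```go
package main

import "fmt"

// avg computes the mean of a slice of sample values.
func avg(vs []float64) float64 {
	sum := 0.0
	for _, v := range vs {
		sum += v
	}
	return sum / float64(len(vs))
}

// lastPerBucket downsamples by keeping the last value of each bucket of n samples.
func lastPerBucket(vs []float64, n int) []float64 {
	var out []float64
	for i := n - 1; i < len(vs); i += n {
		out = append(out, vs[i])
	}
	return out
}

func main() {
	raw := []float64{1, 2, 3, 4, 5, 6}

	// Old (wrong) order: sample first, then compute the operation.
	fmt.Println(avg(lastPerBucket(raw, 3))) // 4.5

	// PromQL order: compute the operation over all raw points in the range.
	fmt.Println(avg(raw)) // 3.5
}
```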



<p>We also improved our handling of the PromQL aggregation modifiers: by and without. </p>



<h3 class="wp-block-heading">Sanitize known issues</h3>



<p>We had recently discovered that we were not matching PromQL behavior on series names: we kept the name for all compute operations, whereas Prometheus only keeps the name when it is still relevant. The compliance tester helped us validate this specific change across all queries. </p>
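<p>Prometheus stores the metric name as the reserved <code>__name__</code> label and drops it after operations where it is no longer meaningful. A minimal sketch of that rule (not Erlenmeyer&#8217;s actual code):</p>

```go
package main

import "fmt"

// dropName removes the reserved __name__ label from a series' label set,
// as Prometheus does after operations where the name is no longer relevant
// (for example, arithmetic between two series).
func dropName(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k == "__name__" {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	series := map[string]string{"__name__": "http_requests_total", "job": "demo"}
	fmt.Println(dropName(series)) // map[job:demo]
}
```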



<p>Because the tool compares each query&#8217;s output against native PromQL output, it also helps us sanitize our query output. We knew that, in the case of missing values or empty series, we were not strictly compliant. We have corrected the part of Erlenmeyer that handles output so it matches all the PromQL cases included in the tests.</p>



<h3 class="wp-block-heading">Unimplemented features</h3>



<p>Running the tests led us to discover some native PromQL features we had missed. As a result, Erlenmeyer now supports the PromQL unary operators and the &#8220;bool&#8221; keyword. Unary support allows expressions such as &#8220;-my_series&#8221;. In PromQL, the bool keyword converts a comparison result into booleans: the series values become 1 or 0 depending on the condition, where 1 stands for true and 0 for false.</p>
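<p>The difference between the filtering and bool forms of a comparison can be sketched in Go (illustrative only; PromQL also handles label matching, which is omitted here):</p>

```go
package main

import "fmt"

// compareGT applies a "greater than" comparison to a series' values.
// Without bool, PromQL filters: values failing the condition are dropped.
// With bool, every value is converted to 1 (true) or 0 (false).
func compareGT(values []float64, threshold float64, boolModifier bool) []float64 {
	var out []float64
	for _, v := range values {
		switch {
		case boolModifier && v > threshold:
			out = append(out, 1)
		case boolModifier:
			out = append(out, 0)
		case v > threshold:
			out = append(out, v) // filter semantics keep the original value
		}
	}
	return out
}

func main() {
	values := []float64{2, 7, 5}
	fmt.Println(compareGT(values, 4, false)) // my_series > 4      -> [7 5]
	fmt.Println(compareGT(values, 4, true))  // my_series > bool 4 -> [0 1 1]
}
```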



<h3 class="wp-block-heading">Open issues</h3>



<p>Running all the compliance tests and improving our code base led us to around 91% success. For the remaining failures, we opened <a href="https://github.com/ovh/erlenmeyer/issues" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">new issues</a> on Erlenmeyer; we detected that: </p>



<ul class="wp-block-list"><li>the handling of the over_time functions is not correct when the range is below the data point frequency;</li><li>for rate, delta, increase and predict_linear, our result isn&#8217;t precise enough to match PromQL output when the range is below 5 minutes;</li><li>some minor bugs remain on the series selectors (!=) and on label_replace (some checks are missing in the parameter validators);</li><li>PromQL subqueries, as well as some functions, are not implemented: ^ and % between two series sets, and the deriv function.</li></ul>



<p>Those are the four remaining gaps before Erlenmeyer covers the full PromQL feature set. Our <a href="https://github.com/ovh/erlenmeyer/blob/master/doc/promql.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">documentation</a> already lists all of the missing implementations. </p>



<h2 class="wp-block-heading">Actions</h2>



<p>This tool was a great help in improving our PromQL compliance, and we are happy with the result. Indeed, we reached 91.68% on the provided test set: </p>



<pre class="wp-block-code"><code class="">General query tweaks:
*  Metrics test
================================================================================
Total: 496 / 541 (91.68%) passed</code></pre>



<p>Our next action is to release these fixes and improvements on all our Metrics regions. We look forward to hearing what you think about our PromQL implementation! </p>



<p>We now see a lot of projects implementing Prometheus writes and reads. These projects bring Prometheus many missing features, such as long-term storage, deletes, late ingestion, historical data analysis and high availability. Being able to validate a PromQL implementation is a big challenge, and a great help in choosing the right backend for your needs. </p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Open Source Metrics family welcomes Catalyst and Erlenmeyer</title>
		<link>https://blog.ovhcloud.com/the-open-source-metrics-family-welcomes-catalyst-and-erlenmeyer/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 20 Mar 2020 09:43:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16747</guid>

					<description><![CDATA[At OVHcloud Metrics, we love open source! Our goal is to provide all of our users with a full experience. We rely on the Warp10 time series database, which enables us to build open source tools for our users&#8217; benefit. Let&#8217;s take a look at some of them in this blog post. Storage tool Our Infrastructure is based [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At <strong>OVHcloud Metrics</strong>, we love open source! Our goal is to provide all of our users with a <strong>full experience</strong>. We rely on the<strong> <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> time series database, which enables us to build <strong>open source tools</strong> for our users&#8217; benefit. Let&#8217;s take a look at some of them in this blog post. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png" alt="" class="wp-image-17633" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/00FFA362-5C7E-41CC-9CFA-E4046F632282.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Storage tool</h3>



<p>Our infrastructure is based on the <strong>open source time series database <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong>. This database comes in two versions: a stand-alone one and a <strong>distributed one</strong>. The distributed version relies on distributed tools such as <strong>Apache Kafka, Apache Hadoop</strong> and <strong>Apache HBase</strong>. </p>



<p>Unsurprisingly, our team makes its own <strong>contributions to the <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a></strong> platform. Due to our unique requirements, we even <strong>contribute</strong> to the <a href="https://www.ovh.com/blog/contributing-to-apache-hbase-custom-data-balancing/" data-wpel-link="exclude">underlying open source database <strong>HBase</strong></a>! </p>



<h3 class="wp-block-heading">Metrics data ingest</h3>



<p>As a matter of fact, we were the first to get stuck in on the <strong>ingest process</strong>! We often build dedicated tools to collect and push <strong>monitoring data</strong> into Warp10 &#8211; this is how <a href="https://github.com/ovh/noderig" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>noderig</strong></a> came to life. Noderig is a tool that <strong>collects</strong> a <strong>simple core</strong> of metrics from <strong>any server or virtual machine</strong>. To send those metrics safely to a backend, <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Beamium</strong></a>, a Rust tool, pushes the Noderig metrics to <strong>one or several Warp 10</strong> backend(s).</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="369" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png" alt="" class="wp-image-17638" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1024x369.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-300x108.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-768x277.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E-1536x553.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/734E67C9-BC8E-4D16-93BD-077E6BF1EE4E.png 1857w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p><em>What if I want to collect my <strong>own custom metrics</strong>?</em> First, you&#8217;ll need to expose them following the Prometheus model. <a href="https://github.com/ovh/beamium/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Beamium</a> is then able to <strong>scrape</strong> applications based on its configuration file and forward all the data to the configured Warp 10 backend(s)! </p>



<p>If you are looking to monitor <strong>specific applications</strong> using the <strong><a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Influx Telegraf</a> agent</strong> (in order to expose the Metrics you require) we have also contributed the <a href="https://github.com/influxdata/telegraf/tree/master/plugins/outputs/warp10" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 Telegraf</a> <strong>connector</strong>, which was recently merged!</p>



<p><em>This looks great so far, but what if I usually push Graphite, Prometheus, Influx or OpenTSDB metrics; <strong>how can I simply migrate to Warp10</strong>? </em>Our answer is <strong>Catalyst</strong>: a proxy layer that is able to parse metrics in those formats and convert them to the Warp10 native format. </p>



<h3 class="wp-block-heading">Catalyst</h3>



<p><a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a> is a Go HTTP proxy that parses the write protocols of multiple open source time series databases. At the moment, it supports OpenTSDB, PromQL, Prometheus-remote_write, Influx and Graphite. Catalyst runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>: the <strong>protocol name</strong> followed by its <strong>native write path</strong>. For example, in order to push Influx data, you simply send a request to <code>influxdb/write</code>. Catalyst will natively parse the <code>influx</code> data and convert it to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a> native ingest format.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="700" height="158" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png" alt="" class="wp-image-17636" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5.png 700w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/9288E49A-F0A9-4A98-982B-9E0D4B40F7F5-300x68.png 300w" sizes="auto, (max-width: 700px) 100vw, 700px" /></figure></div>



<h3 class="wp-block-heading">Metrics queries</h3>



<p>Data collection is an important first step, but we have also considered how existing monitoring query protocols could be used on top of Warp10. This led us to implement <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a>. TSL was discussed at length during the <a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series Meetup</a>, as well as in this <a href="https://www.ovh.com/blog/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/" data-wpel-link="exclude">blog post</a>. </p>



<p>Now let&#8217;s take a user who is running Telegraf and pushing data to Warp10 with Catalyst. They will wish to use the <strong>native Influx Grafana dashboards</strong>, but how? And what about users who <strong>automate</strong> queries with the <strong>OpenTSDB</strong> query protocol? Our answer was to develop another proxy: <strong>Erlenmeyer</strong>.</p>



<h3 class="wp-block-heading">Erlenmeyer</h3>



<p><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>Erlenmeyer</strong></a> is a Go HTTP proxy that enables users to query <strong>Warp 10</strong> using <strong>open source</strong> query protocols. At the moment, it supports multiple open source time series formats, such as PromQL, Prometheus remote-read, InfluxQL, OpenTSDB and Graphite. Erlenmeyer runs an <strong>HTTP server</strong> that listens on a <strong>specific path</strong>: the <strong>protocol name</strong> followed by its <strong>native query path</strong>. For example, to run a PromQL query, the user sends a request to <code>prometheus/api/v0/query</code>. Erlenmeyer will natively parse the <code>promQL</code> request and then build a native WarpScript query that any <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> backend can run. </p>
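<p>To give an idea of what the transpilation involves, here is a toy sketch that turns a PromQL instant selector into a WarpScript <code>FETCH</code> call. The real Erlenmeyer handles the full PromQL grammar; the <code>$token</code> variable and the time window here are placeholders:</p>

```go
package main

import "fmt"

// selectorToWarpScript converts a PromQL-style selector such as
// metric{label="value"} into a WarpScript FETCH over the given window.
func selectorToWarpScript(metric string, labels map[string]string, window string) string {
	labelBlock := "{"
	for k, v := range labels {
		labelBlock += fmt.Sprintf(" '%s' '%s'", k, v)
	}
	labelBlock += " }"
	return fmt.Sprintf("[ $token '%s' %s NOW %s ] FETCH", metric, labelBlock, window)
}

func main() {
	ws := selectorToWarpScript("node_cpu_seconds_total", map[string]string{"job": "demo"}, "5 m")
	fmt.Println(ws)
	// [ $token 'node_cpu_seconds_total' { 'job' 'demo' } NOW 5 m ] FETCH
}
```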



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="687" height="228" src="https://www.ovh.com/blog/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png" alt="" class="wp-image-17635" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E.png 687w, https://blog.ovhcloud.com/wp-content/uploads/2020/03/E25204ED-BBF9-41B5-8F96-592EF3475B8E-300x100.png 300w" sizes="auto, (max-width: 687px) 100vw, 687px" /></figure></div>



<h2 class="wp-block-heading">To be continued</h2>



<p>At first, <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> were quick implementations of native protocols, aimed at helping <strong>internal teams</strong> migrate while still <strong>using a familiar tool</strong>. We have now integrated a lot of the <strong>native functionality</strong> of each protocol, and feel they are ready for sharing. It&#8217;s time to make them <strong>available to the <a href="https://warp10.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> community</strong>, so we can receive <strong>feedback</strong> and continue to work hard on supporting open source protocols. You can find us in the <strong>OVHcloud Metrics <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">gitter room</a></strong>!</p>



<p>Other Warp10 users may require protocols that are not yet implemented. They will be able to use <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> to support them on their <strong>own Warp10 backend</strong>.</p>



<p><strong>Welcome</strong> <strong><a href="https://github.com/ovh/erlenmeyer" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Erlenmeyer</a> and <a href="https://github.com/ovh/catalyst" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Catalyst</a></strong> &#8211; <strong>Metrics Open Source projects</strong>!</p>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Jerem: An Agile Bot</title>
		<link>https://blog.ovhcloud.com/jerem-an-agile-bot/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 16:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Agile Telemetry]]></category>
		<category><![CDATA[Agility]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16943</guid>

					<description><![CDATA[At OVHCloud, we are open sourcing our “Agility Telemetry” project. Jerem, as our data collector, is the main component of this project. Jerem scrapes our JIRA at regular intervals, and extracts specific metrics for each project. It then forwards them to our long-term storage application, the OVHCloud Metrics Data Platform.&#160;&#160; Agility concepts from a developer&#8217;s [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>At OVHCloud, we are open sourcing our “Agility Telemetry” project. <strong><a href="https://github.com/ovh/jerem" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jerem</a></strong>, our data collector, is the main component of this project. Jerem scrapes our <strong>JIRA</strong> at regular intervals and extracts <strong>specific metrics</strong> for each project. It then forwards them to our long-term storage application, the <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>.&nbsp;&nbsp;</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg" alt="Jerem: an agile bot" class="wp-image-17160" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Agility concepts from a developer&#8217;s point of view</h3>



<p>To help you understand our goals for <strong>Jerem</strong>, we need to explain some of the Agility concepts we will be using. First, we establish a <strong>technical quarterly roadmap</strong> for a product, which sets out all the <strong>features</strong> we <strong>plan to release</strong> every three months. Each of these features is what we call an <strong>epic</strong>.&nbsp;</p>



<p>For each epic, we identify the tasks that will need to be completed. For all of those tasks, we then evaluate their complexity using <strong>story points</strong> during a team preparation session. A story point reflects the effort required to complete a specific JIRA task. </p>



<p>Then, to advance our roadmap, we conduct regular <strong>sprints</strong>, each corresponding to a period of <strong>ten days</strong>, during which the team takes on several tasks. The number of story points taken into a sprint should match, or be close to, the <strong>team velocity</strong>: the average number of story points that the team is able to complete each day.</p>



<p>However, other urgent tasks may arise unexpectedly during sprints. That’s what we call an <strong>impediment</strong>. We might, for example, need to factor in helping customers, fixing bugs, or handling urgent infrastructure tasks.</p>



<h3 class="wp-block-heading">How Jerem works </h3>



<p>At OVH we use JIRA to track our activity. Our <strong>Jerem</strong> bot scrapes our <strong>JIRA projects</strong> and exports all the necessary data to our <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>. Jerem can also push data to any Warp 10 compatible database. In Grafana, you simply query the Metrics platform (using the <a href="https://github.com/ovh/ovh-warp10-datasource" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 datasource</a>), for example with our <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">program management dashboard</a>. All your KPIs are now available in a nice dashboard!</p>
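<p>To give an idea of what pushing to a Warp 10 compatible backend looks like, a datapoint is a single line in the Warp 10 GTS input format. The timestamp, <code>project</code> label, token and endpoint URL below are placeholders for the example, not values taken from Jerem:</p>

```shell
# Build one datapoint in Warp 10's GTS input format:
#   TS/LAT:LON/ELEV class{labels} value
# (latitude, longitude and elevation are left empty here)
ts=1581000000000000                 # timestamp in microseconds
class="jerem.jira.epic.storypoint"
labels="project=SAN"                # hypothetical label set
value=14

datapoint="${ts}// ${class}{${labels}} ${value}"
echo "$datapoint"

# Sending it to a Warp 10 ingress endpoint would look like this
# (WRITE_TOKEN and the URL are placeholders):
# curl -H 'X-Warp10-Token: WRITE_TOKEN' --data-binary "$datapoint" \
#      'https://warp10.example.net/api/v0/update'
```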



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="256" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg" alt="" class="wp-image-17164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-300x75.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-768x192.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1536x384.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45.jpeg 1720w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Discover Jerem metrics</h3>



<p>Now that we have an overview of the main Agility concepts involved, let&#8217;s dive into Jerem! How do we convert those Agility concepts into metrics? First of all, we&#8217;ll retrieve all metrics related to epics (i.e. new features). Then, we will have a deep look at the sprint metrics.</p>



<h4 class="wp-block-heading">Epic data</h4>



<p>To explain Jerem epic metrics, we&#8217;ll start by creating a new one. In this example, we called it <code>Agile Telemetry</code>. We add a Q2-20 label, which means that we plan to release it for Q2. To record an epic with Jerem, you need to set a quarter for the final delivery! Next, we&#8217;ll simply add four tasks, as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="651" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Epic-1-1024x651.png" alt="" class="wp-image-16984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-1024x651.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-300x191.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-768x489.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1.png 1182w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To get the metrics, we need to evaluate each individual task. We do this together during preparation sessions. In this example, we have set custom story points for each task. For example, we estimated the <code>write a BlogPost about Jerem</code> task as a 3.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="472" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png" alt="" class="wp-image-16957" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1536x709.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10.png 1697w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a result, Jerem now has everything it needs to start collecting epic metrics. This example provides five metrics:</p>



<ul class="wp-block-list"><li><code>jerem.jira.epic.storypoint</code>: the total number of story points needed to complete this epic. The value here is 14 (the sum of all the epic story points). This metric will evolve whenever the epic is updated by adding or removing tasks.</li><li><code>jerem.jira.epic.storypoint.done</code>: the number of story points completed. In our example, we have already finished the <code>Write Jerem bot</code> and <code>Deploy Jerem Bot</code> tasks, so eight story points are done.</li><li><code>jerem.jira.epic.storypoint.inprogress</code>: the story points of &#8216;in progress&#8217; tasks, such as <code>Write a BlogPost about Jerem</code>.</li><li><code>jerem.jira.epic.unestimated</code>: the number of unestimated tasks, shown as <code>Unestimated Task</code> in our example.</li><li><code>jerem.jira.epic.dependency</code>: the number of tasks that have dependency labels, indicating that they are mandatory for other epics or projects.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="443" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metris-epics-1024x443.png" alt="" class="wp-image-16958" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1024x443.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-300x130.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-768x332.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1536x665.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics.png 1784w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This way, for each epic in a project, Jerem collects five unique metrics.  </p>



<h4 class="wp-block-heading">Sprint data</h4>



<p>To complete epic tasks, we work using a <strong>sprint</strong> process. When doing sprints, we want to provide a lot of <strong>insights</strong> into our <strong>achievements</strong>. That&#8217;s why Jerem collects sprint data too! </p>



<p>So let&#8217;s open a new sprint in JIRA and start working on our task. This gives us the following JIRA view:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="241" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Sprint-ui-1024x241.png" alt="" class="wp-image-16963" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1024x241.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-300x71.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-768x181.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1536x362.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui.png 1804w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Jerem collects the following metrics for each sprint:&nbsp;</p>



<ul class="wp-block-list"><li><code>jerem.jira.sprint.storypoint.total</code>: the total number of story points onboarded into a sprint.</li><li><code>jerem.jira.sprint.storypoint.inprogress</code>: the story points currently in progress within a sprint.</li><li><code>jerem.jira.sprint.storypoint.done</code>: the number of story points currently completed within a sprint.</li><li><code>jerem.jira.sprint.events</code>: the &#8216;start&#8217; and &#8216;end&#8217; dates of sprint events, recorded as Warp10 string values.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png" alt="" class="wp-image-16964" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-300x141.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-768x360.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1536x720.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the Metrics view above, we record every sprint metric twice. We do this to provide a quick view of the active sprint, which is why we use the &#8216;current&#8217; label. This also enables us to query past sprints, using the real sprint name. Of course, an active sprint can also be queried using its name.</p>



<h4 class="wp-block-heading">Impediment data</h4>



<p>Starting a sprint means you need to know all the tasks you will have to work on over the next few days. But how can we track and measure unplanned tasks? For example, the very urgent request from your manager, or the teammate who needs a bit of help?</p>



<p>We can add special tickets in JIRA to keep track of those tasks. That&#8217;s what we call an &#8216;impediment&#8217;. They are labelled according to their nature. If, for example, production requires your attention, then it&#8217;s an &#8216;Infra&#8217; impediment. You will also retrieve metrics for &#8216;Total&#8217; (all kinds of impediments), &#8216;Excess&#8217; (unplanned tasks), &#8216;Support&#8217; (helping teammates), and &#8216;Bug fixes or other&#8217; (all other kinds of impediment).</p>



<p>Each impediment belongs to the active sprint it was closed in. To close an impediment, you only have to flag it as &#8216;Done&#8217; or &#8216;Closed&#8217;.</p>



<p>We also retrieve metrics like:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.TYPE.count</code>: the number of impediments that occurred during a sprint.</li><li><code>jerem.jira.impediment.TYPE.timespent</code>: the amount of time spent on impediments during a sprint.</li></ul>



<p><code>TYPE</code> corresponds to the <strong>kind</strong> of recorded impediment. As we didn&#8217;t open any actual impediments, Jerem collects only the <code>total</code> metrics.</p>
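<p>Since the metric names follow a regular <code>jerem.jira.impediment.TYPE.suffix</code> pattern, you can sketch the full set of names to expect. The type identifiers below simply mirror the categories described above and are illustrative; check the Jerem source for the exact ones:</p>

```shell
# Expand the impediment metric name pattern for every impediment kind.
# Type names (total, excess, support, other) are assumptions based on
# the categories described in the text, not taken from Jerem itself.
names=$(for type in total excess support other; do
  for suffix in count timespent; do
    echo "jerem.jira.impediment.${type}.${suffix}"
  done
done)
echo "$names"
```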



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png" alt="" class="wp-image-16965" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-768x325.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1536x650.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment.png 1773w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To start recording impediments, we simply create a new JIRA task, adding an &#8216;impediment&#8217; label. We also set its nature, and the actual time spent on it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="819" height="903" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png" alt="" class="wp-image-16967" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png 819w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-768x847.png 768w" sizes="auto, (max-width: 819px) 100vw, 819px" /></figure>



<p>For impediments, we also retrieve the global metrics that Jerem always records:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.total.created</code>: the time from the creation date to the completion of an impediment. This allows us to retrieve a total impediment count. We can also record all impediment actions, even outside sprints.</li></ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>For a single JIRA project, like our example, you can expect around 300 metrics. This number will vary depending on the epics you create and flag in JIRA, and the ones you close.</p></blockquote>



<h3 class="wp-block-heading">Grafana dashboard</h3>



<p>We love building Grafana dashboards! They provide both the team and the manager with a lot of insight into KPIs. The best part for me, as a developer, is that I can see why it&#8217;s worth filling in a JIRA task!</p>



<p>In our first <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Grafana dashboard</a>, you will find all the key program management KPIs. Let&#8217;s start with the global overview:</p>



<h4 class="wp-block-heading">Quarter data overview</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png" alt="" class="wp-image-16968" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox.png 1840w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here, you will find the current epics in progress, along with the global team KPIs, such as predictability, velocity, and impediment stats. This is where the magic happens! When filled in correctly, this dashboard shows you exactly what your team should deliver in the current quarter, so you have quick access to all the important current subjects. You will also be able to see if your team is expected to deliver on too many subjects, so you can quickly take action and delay some of the new features.</p>



<h4 class="wp-block-heading">Active sprint data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/sprintdata-1024x286.png" alt="" class="wp-image-16969" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-768x214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1536x428.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata.png 1839w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The active sprint data panel is often a great support during our daily meetings. In this panel, we get a quick overview of the team&#8217;s achievements, and can establish the time spent on parallel tasks. </p>



<h4 class="wp-block-heading">Detailed data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/detail-KPI-1024x286.png" alt="" class="wp-image-16970" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-768x215.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1536x429.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI.png 1847w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The last part provides more detailed data. Using the epic Grafana variable, we can check specific epics, along with the completion of more global projects. There is also a <code>velocity chart</code>, which plots past sprints, comparing the expected story points to those actually completed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The Grafana dashboard is directly available in the Jerem project. You can import it directly into Grafana, provided you have a valid Warp 10 datasource configured. </p><p>To make the dashboard work as required, you have to configure the Grafana project variable in the form of a WarpScript list <code>[ 'SAN' 'OTHER-PROJECT' ]</code>. If our program manager can do it, I am sure you can! 😉 </p></blockquote>



<p>Setting up Jerem and automatically loading program management data gives us a lot of insight. As a developer, I really enjoy it, and I&#8217;ve quickly become used to tracking far more events in JIRA than I did before. You can directly see the impact of your tasks. For example, you see how quickly the roadmap is advancing, and you can identify any bottlenecks that are causing impediments. Those bottlenecks then become epics. In other words, once we started to use Jerem, we just auto-filled it! I hope you will enjoy it too! If you have any feedback, we would love to hear it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fjerem-an-agile-bot%2F&amp;action_name=Jerem%3A%20An%20Agile%20Bot&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL (or how to query time series databases)</title>
		<link>https://blog.ovhcloud.com/tsl-or-how-to-query-time-series-databases/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 31 Jan 2020 13:41:34 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16734</guid>

					<description><![CDATA[Last year, we released TSL as an open source tool to query a Warp 10 platform, and by extension, the OVHcloud Metrics Data Platform. But how has it evolved since then? Is TSL ready to query other time series databases? What about TSL states on the Warp10 eco-system? TSL to query many time series databases [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Last year, we released <a href="https://github.com/ovh/tsl/)" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a> as an <strong>open source tool</strong> to <strong>query</strong> a<strong> Warp 10</strong> platform, and by extension, the <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude"><strong>OVHcloud Metrics Data Platform</strong></a>. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg" alt="TSL by OVHcloud" class="wp-image-16774" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-768x405.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A.jpeg 1202w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how has it evolved since then? Is TSL ready to query <strong>other time series databases</strong>? And what is the state of TSL in the <strong>Warp 10 ecosystem</strong>?</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">TSL to query many time series databases</h3>



<p>We wanted TSL to be usable in front of <strong>multiple time series databases</strong>. That&#8217;s why we also released a <strong>PromQL query generator</strong>.</p>



<p>One year later, we now know this wasn&#8217;t the way to go. Based on what we learned, the <strong><a href="https://github.com/aurrelhebert/TSL-Adaptor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor</a> project</strong> was open sourced, as a proof of concept for how TSL can be used  to query a <em>non-Warp 10</em> database. Put simply, TSL-Adaptor allows TSL to <strong>query an InfluxDB</strong>.</p>



<h4 class="wp-block-heading">What is TSL-Adaptor?</h4>



<p>TSL-Adaptor is a <strong><a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> Java REST API</strong> that can be used to query a backend. TSL-Adaptor parses the TSL query, identifies the fetch operation, and natively loads the raw data from the backend. It then computes the TSL operations on the data, before returning the result to the user. The main goal of TSL-Adaptor is <strong>to make TSL available</strong> on top of <strong>other TSDBs</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="971" height="702" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png" alt="" class="wp-image-16866" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png 971w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-300x217.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-768x555.png 768w" sizes="auto, (max-width: 971px) 100vw, 971px" /></figure></div>



<p>In concrete terms, we are running a Java REST API that <strong>integrates the WarpScript library</strong> in its runtime. TSL is then used to compile the query into a valid WarpScript one. This is <strong>fully transparent</strong>: the user only ever deals with TSL queries.</p>



<p>To load raw data from InfluxDB, we created a WarpScript extension. This extension provides an abstract class, <code>LOADSOURCERAW</code>, that needs to be implemented to create a TSL-Adaptor data source. This requires only two methods: <code>find</code> and <code>fetch</code>. <code>find</code> gathers all series selectors matching a query (class names, tags or labels), while <code>fetch</code> actually retrieves the raw data within a time span.</p>



<h4 class="wp-block-heading">Query Influx with TSL-Adaptor</h4>



<p>To get started, simply run an <a href="https://www.influxdata.com/products/influxdb-overview/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">InfluxDB</a> locally on port 8086. Then start an Influx <a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Telegraf</a> agent, recording Telegraf data on the local Influx instance.</p>



<p>Next, make sure you have locally installed TSL-Adaptor and updated its config with the path to a <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>tsl.so</code> library</a>.</p>



<p>To specify a custom influx address or databases, update the <a href="https://github.com/aurrelhebert/TSL-Adaptor/blob/master/src/main/resources/application.properties" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor configuration</a> file accordingly.</p>



<p>You can start TSL-Adaptor with the following example command:</p>



<pre class="wp-block-code"><code class="">java -XX:TieredStopAtLevel=1 -Xverify:none -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar build/tsl-adaptor-0.0.1-SNAPSHOT-runner.jar </code></pre>



<p>And that&#8217;s it! You can now query your InfluxDB with TSL and TSL-Adaptor.</p>



<p>Let&#8217;s start by retrieving the time series relating to the disk measurements.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk")'</code></pre>



<p>Now let&#8217;s use the analytics power of TSL!</p>



<p>First, we would like to retrieve only the data with a <code>mode</code> set to <code>rw</code>.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw")'</code></pre>



<p>Next, we would like to retrieve the maximum value in every five-minute interval, over the last 20 minutes. The TSL query is therefore:</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw").last(20m).sampleBy(5m,max)'</code></pre>



<p>Now it&#8217;s your turn to have some fun with TSL and InfluxDB. You can find details of all the implemented functions in the <a href="https://github.com/ovh/tsl/blob/master/spec/doc.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL documentation</a>. Enjoy exploring!</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">What&#8217;s new on TSL with Warp10?</h3>



<p>We originally built TSL as a <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Go proxy</a> in front of Warp 10. <strong>TSL</strong> has now joined the Warp 10 ecosystem as a <strong>Warp 10 extension</strong>, or as a <strong>WASM library</strong>! We have also added some <strong>new native TSL functions</strong> to make the language even richer!</p>



<h4 class="wp-block-heading">TSL as a WarpScript function</h4>



<p>To make TSL work as a Warp 10 function, you need the <code>tsl.so</code> library available locally. This library can be found in the <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL GitHub repository</a>. We have also made a <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL WarpScript extension</a> available from <a href="https://warpfleet.senx.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>, the extension repository of the Warp 10 community. </p>



<p>To set up the TSL extension on your Warp 10 instance, simply download the JAR indicated in <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>. You can then configure the extension in the Warp 10 configuration file: </p>



<pre class="wp-block-code"><code class="">warpscript.extensions = io.ovh.metrics.warp10.script.ext.tsl.TSLWarpScriptExtension
warpscript.tsl.libso.path = &lt;PATH_TO_THE_tsl.so_FILE></code></pre>



<p>Once you have rebooted Warp 10, you are ready to go. You can test whether it&#8217;s working by running the following query:</p>



<pre class="wp-block-code"><code class="">// You will need to put here a valid Warp10 token when computing a TSL select statement
// '&lt;A_VALID_WARP_10_TOKEN>' 

// A valid TOKEN isn't needed on the create series statement in this example
// You can simply put an empty string
''

// Native TSL create series statement
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])) 
'>
TSL</code></pre>



<p>With the WarpScript TSL function, you can use native WarpScript variables in your script, as shown in the example below:</p>



<pre class="wp-block-code"><code class="">// Set a Warp10 variable

NOW 'test' STORE

'' 

// A Warp10 variable can be reused in TSL script as a native TSL variable
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])).add(test)
'>
TSL</code></pre>



<h4 class="wp-block-heading">TSL WASM</h4>



<p>To expand TSL&#8217;s potential uses, we have also exported it as a Wasm library, so you can use it directly in a browser! The Wasm version of the library parses TSL queries locally and generates the WarpScript. The result can then be used to query a Warp 10 backend. You will find more details on the <a href="https://github.com/ovh/tsl#use-tsl-with-webassembly" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL github</a>.</p>



<h4 class="wp-block-heading">TSL&#8217;s new features</h4>



<p>As TSL has grown in popularity, we have detected and fixed a few bugs, and also added some additional native functions to accommodate new use cases.</p>



<p>We added the <code>setLabelFromName</code> method, which sets a new label on a series, based on its name. This label can be the exact series name, or the result of a regular expression. </p>



<p>We also enhanced the <code>sort</code> method, to allow users to sort their series set based on series metadata (i.e. selector, name or labels).</p>



<p>Finally, we added a <code>filterWithoutLabels</code> method, to filter a series set and remove any series that do not contain specific labels.</p>
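<p>As a rough sketch of how these new functions compose with the rest of the language, here is a query built in shell. The <code>host</code> label, the regular expression, and the exact argument forms are assumptions for the example; refer to the TSL documentation for the authoritative signatures:</p>

```shell
# Compose a TSL query chaining the new functions described above.
# The label name "host" and the setLabelFromName arguments are
# illustrative only, not verified against the TSL spec.
query='select("disk").last(20m).filterWithoutLabels("host").setLabelFromName("origin", ".*")'
echo "$query"

# Against a running TSL endpoint it would be posted like the earlier examples:
# curl --request POST --url http://u:p@0.0.0.0:8080/api/v0/tsl --data "$query"
```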



<p>Thanks for reading! I hope you will give TSL a try, as I would love to hear your feedback.  </p>



<hr class="wp-block-separator"/>



<h2 class="wp-block-heading">Paris Time Series meetup</h2>



<p>We are delighted that we will soon be <strong>hosting</strong> the third <strong><a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series meetup</a></strong>, organised by Nicolas Steinmetz, at the <strong>OVHcloud office</strong> in Paris. During this meetup, we will be speaking about TSL, as well as listening to an introduction to the Redis Time Series platform.</p>



<p>If you are available, we will be happy to meet you there!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftsl-or-how-to-query-time-series-databases%2F&amp;action_name=TSL%20%28or%20how%20to%20query%20time%20series%20databases%29&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL: a developer-friendly Time Series query language for all our metrics</title>
		<link>https://blog.ovhcloud.com/tsl-a-developer-friendly-time-series-query-language-for-all-our-metrics/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Wed, 13 Feb 2019 13:11:28 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14445</guid>

<description><![CDATA[At the Metrics team we have been working on time series for several years. In our experience, the data analytics capabilities of a Time Series Database (TSDB) platform are a key factor in creating value from your metrics. And these analytics capabilities are mostly defined by the query languages they support.

TSL stands for Time Series Language. In a few words, TSL is an abstracted way, in the form of an HTTP proxy, to generate queries for different TSDB backends. Currently it supports Warp 10's WarpScript and Prometheus' PromQL query languages, but we aim to extend the support to other major TSDBs.


To better understand why we created TSL, let's review some of the TSDB query languages supported on the OVH Metrics Data Platform. When implementing them, we learnt the good, the bad and the ugly of each one. In the end, we decided to build TSL to simplify the querying on our platform, before open-sourcing it to use on any TSDB solution.

Why did we decide to invest some of our time in such a proxy? Let me tell you the story of the OVH Metrics protocol!]]></description>
										<content:encoded><![CDATA[
<p>At the Metrics team, we have been working on Time Series for several years now. In our experience, the <strong>data analytics capabilities</strong> of a Time Series Database (TSDB) platform are a key factor in <strong>creating value</strong> from your metrics. These analytics capabilities are mostly defined by the <strong>query languages</strong> they support. </p>



<p>TSL stands for <strong>Time Series Language</strong>. In simple terms, <a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is an abstracted way of generating queries for <strong>different TSDB backends</strong>, in the form of an HTTP proxy. It currently supports Warp 10&#8217;s WarpScript and  Prometheus&#8217; PromQL query languages, but we aim to extend the support to other major TSDBs.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><em>To provide some context around why we created TSL, it began with a review of some of the <strong>TSDB query languages</strong> supported on the&nbsp;<strong>OVH Metrics Data Platform</strong>. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to <strong>simplify the querying</strong> on our platform, before open-sourcing it to use it on any TSDB solution.&nbsp;</em></p>



<p><em>So why did we decide to invest our time in developing such a proxy? Well, let me tell you <strong>the story of the OVH Metrics protocol</strong>!</em></p>



<h3 class="wp-block-heading">From OpenTSDB&#8230;</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="80" src="/blog/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png" alt="OpenTSDB" class="wp-image-14507" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-300x80.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2-768x205.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/AEEDC522-26CE-44F6-A6F2-6B80480F8FC2.png 810w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p>The first aim of our platform is to be able to support OVH&#8217;s infrastructure and application monitoring. When this project started, a lot of people were using <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a>, and were familiar with its query syntax. <a href="http://opentsdb.net/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">OpenTSDB</a> is a scalable database for Time Series. The <a href="http://opentsdb.net/docs/build/html/user_guide/query/index.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OpenTSDB query</a>&nbsp;syntax is easy to read, as you send a JSON document describing the request. The document below will load all <code>sys.cpu.0</code> metrics of the <code>test</code> datacentre, summing them between the <code>start</code> and <code>end</code> dates:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1356998400,
    "end": 1356998460,
    "queries": [
        {
            "aggregator": "sum",
            "metric": "sys.cpu.0",
            "tags": {
                "host": "*",
                "dc": "test"
            }
        }
    ]
}</code></pre>



<p>This enables the quick retrieval of specific data within a specific time range. At OVH, this was used for graphing purposes, in conjunction with Grafana, and helped us to spot potential issues in real time, as well as investigate past events. OpenTSDB integrates simple queries, where you can define your own sampling and deal with counter data, as well as filtered and aggregated raw data.</p>
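<p>Such a query body is easy to build programmatically before POSTing it to OpenTSDB&#8217;s <code>/api/query</code> endpoint. A short Python sketch (the helper function is our own illustration):</p>

```python
import json

def build_opentsdb_query(start, end, metric, aggregator="sum", tags=None):
    # Build the JSON body for a POST to OpenTSDB's /api/query endpoint.
    return {
        "start": start,
        "end": end,
        "queries": [{
            "aggregator": aggregator,
            "metric": metric,
            "tags": tags or {},
        }],
    }

body = build_opentsdb_query(1356998400, 1356998460, "sys.cpu.0",
                            tags={"host": "*", "dc": "test"})
payload = json.dumps(body)  # ready to send with any HTTP client
```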



<p>OpenTSDB was the first protocol supported by the Metrics team, and it is still widely used today. Internal statistics show that 30-40% of our traffic is based on OpenTSDB queries. A lot of internal use cases can still be entirely resolved with this protocol, and the queries are easy to write and understand.</p>



<p>For example, an OpenTSDB query to get the max value of <code>usage_system</code> for the CPUs 0 to 9, sampled over 2-minute spans by averaging their values, looks like this:</p>



<pre class="wp-block-code"><code lang="json" class="language-json">{
    "start": 1535797890,
    "end": 1535818770,
    "queries":  [{
        "metric":"cpu.usage_system",
        "aggregator":"max",
        "downsample":"2m-avg",
        "filters": [{
            "type":"regexp",
            "tagk":"cpu",
            "filter":"cpu[0-9]+",
            "groupBy":false
            }]
        }]
}</code></pre>



<p>However, OpenTSDB quickly shows its limitations, and some specific use cases can&#8217;t be resolved with it. For example, you can&#8217;t apply any operations directly on the back-end; you have to load the data into an external tool and use it to apply any analytics.</p>



<p>One of the main areas where OpenTSDB (version 2.3) is lacking is operators over multiple Time Series sets, which allow actions like dividing one series by another. Such operators are a useful way to compute the individual query time per request when you have, for example, one set of series for the total time spent in requests and another for the total request count. That’s one of the reasons why the OVH Metrics Data Platform supports other protocols.</p>



<h2 class="wp-block-heading">&#8230; to PromQL</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="150" height="150" src="/blog/wp-content/uploads/2019/02/790C8FFD-E734-475E-BF7E-B93F0708C604-150x150.png" alt="Prometheus" class="wp-image-14503"/></figure></div>



<p>The second protocol we worked on was <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">PromQL</a>, the query language of the <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> Time Series database. When we made that choice in 2015, <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> was gaining some traction, and it still has an impressive adoption rate. But if Prometheus is a success, it isn&#8217;t because of its query language, PromQL. This language never took off internally, although it has started to gain some adoption recently, mainly due to the arrival of people who worked with Prometheus at their previous companies. Internally,&nbsp;<strong>PromQL queries represent about 1-2%</strong> of our daily traffic. The main reasons are that a lot of simple use cases can be solved more quickly, and with more control over the raw data, using OpenTSDB queries, while many of the more complex use cases cannot be solved with PromQL at all. A similar request to the one defined in OpenTSDB would be:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">api/v1/query_range?
query=max(cpu.usage_system{cpu=~"cpu[0-9]%2B"})&amp;
start=1535797890&amp;
end=1535818770&amp;
step=2m</code></pre>
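<p>The <code>%2B</code> above is simply the URL-encoded <code>+</code> of the regular expression. Building the same <code>query_range</code> call in Python makes the encoding explicit (a sketch; send the resulting URL with any HTTP client):</p>

```python
from urllib.parse import urlencode

# The "+" of the regular expression must be URL-encoded as "%2B",
# which is why it appears that way in the raw call above.
params = {
    "query": 'max(cpu.usage_system{cpu=~"cpu[0-9]+"})',
    "start": 1535797890,
    "end": 1535818770,
    "step": "2m",
}
url = "api/v1/query_range?" + urlencode(params)
```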



<p>With PromQL, <strong>you lose control of how you sample the data</strong>, as the only downsampling operator is <strong>last</strong>. This means that if (for example) you want to downsample your series over a 5-minute duration, you can only keep the last value of each 5-minute span. In contrast, all competitors include a range of operators. With OpenTSDB, for example, you can choose between several operators, including average, count, standard deviation, first, last, percentiles, minimum, maximum, or the sum of all values inside your defined span.</p>
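<p>The difference between the two sampling strategies can be sketched in a few lines of Python (an illustration with in-memory points, not either database&#8217;s implementation):</p>

```python
def downsample(points, span, agg):
    # Group (timestamp, value) points into span-sized buckets,
    # then aggregate each bucket with the given operator.
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % span, []).append(value)
    return {bucket: agg(values) for bucket, values in sorted(buckets.items())}

points = [(0, 1.0), (60, 3.0), (299, 10.0), (300, 2.0)]
last = downsample(points, 300, lambda vs: vs[-1])            # PromQL-style
avg = downsample(points, 300, lambda vs: sum(vs) / len(vs))  # OpenTSDB "avg"
# last[0] == 10.0, whereas avg[0] == (1.0 + 3.0 + 10.0) / 3
```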



<p>In the end, a lot of people chose to use a much more complex method: WarpScript, which is powered by the Warp 10 Analytics Engine we use behind the scenes.</p>



<h2 class="wp-block-heading">Our internal adoption of WarpScript</h2>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="106" src="/blog/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png" alt="WarpScript by SenX" class="wp-image-14505" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-300x106.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9-768x272.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/559ED4E1-4AD2-453B-A96B-74BF6490D6B9.png 837w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/content/03_Documentation/04_WarpScript/01_Concepts" target="_blank" data-wpel-link="external">WarpScript</a> is the current Time Series language of <a rel="noreferrer noopener nofollow external" href="https://www.warp10.io/" target="_blank" data-wpel-link="external">Warp 10(R)</a>, our underlying backend. WarpScript will help for any complex Time Series use case, and solves numerous real-world problems, as you have full control of all your operations. You have dedicated frameworks of functions to sample raw data and fill missing values. You also have frameworks to apply operations on single-value or window operations. You can apply operations on multiple Time Series sets, and have dedicated functions to manipulate Time Series times, statistics, etc.</p>



<p>It works with Reverse Polish Notation (like a good, old-fashioned HP48, for those who&#8217;ve got one!), and simple use cases are easy to express. But when it comes to analytics, while WarpScript certainly solves problems, it&#8217;s still complex to learn. Time Series use cases in particular are complex and require a specific way of thinking, and WarpScript helped us solve a lot of hard ones.</p>



<p>This is why it&#8217;s still the main query language used at OVH on the OVH Metrics platform, with <strong>nearly 60%</strong> of internal queries making use of it. The same request that we just computed in OpenTSDB and PromQL would be as follows in WarpScript:</p>



<pre class="wp-block-code"><code lang="c" class="language-c">[ "token" "cpu.average" { "cpu" "~cpu[0–9]+" } NOW 2 h ] FETCH
[ SWAP bucketizer.mean 0 2 m 0 ] BUCKETIZE
[ SWAP [ "host" ] reducer.max ] REDUCE</code></pre>



<p>A lot of users find it hard to learn WarpScript at first, but after solving their initial issues with some (sometimes a lot of) support, they take the first step of their Time Series adventure. Later, they figure out new ideas about how they can gain knowledge from their metrics. They then come back with many demands and questions about their daily issues, some of which they can now solve quickly with their own knowledge and experience.</p>



<p>What we learned from WarpScript is that it’s a fantastic tool with which to build analytics for our Metrics data. We pushed many complex use cases with advanced signal-processing algorithms like <a href="https://www.warp10.io/doc/LTTB#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">LTTB</a>, <a href="https://www.warp10.io/tags/outlier" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Outlier</a> or <a href="https://www.warp10.io/doc/PATTERNDETECTION#sigTitle" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Pattern</a> detection, and <a href="https://en.wikipedia.org/wiki/Kernel_smoother" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Kernel Smoothing</a>, where it proved to be a real enabler. However, it proved quite expensive to support for basic requirements, and feedback indicated that the syntax and overall complexity were big concerns.</p>



<p>A WarpScript can involve dozens (or even hundreds) of lines, and a successful execution is often an accomplishment, with the special feeling that comes from having made full use of one&#8217;s brainpower. In fact, an inside joke amongst our team is that nobody is born able to write a WarpScript in a single day, or to earn a WarpScript Pro Gamer badge! That&#8217;s why we&#8217;ve distributed Metrics t-shirts to users that have achieved significant successes with the Metrics Data Platform.</p>



<p>We liked the WarpScript semantics, but we wanted something with a significant impact on a broader range of use cases. This is why we started to write TSL, with a few simple goals:</p>



<ul class="wp-block-list"><li>Offer a clear Time Series analytics semantic</li><li>Simplify the writing and make it developer-friendly</li><li>Support data-flow queries and ease debugging for complex queries</li><li>Don&#8217;t try to be the ultimate toolbox. Keep it simple.</li></ul>



<p>We know that users will probably have to switch back to WarpScript every so often. However, we hope that using TSL will simplify their learning curve. TSL is simply a new step in the Time Series adventure!</p>



<h3 class="wp-block-heading">The path to&nbsp;TSL</h3>



<div class="wp-block-image"><figure class="alignleft"><img loading="lazy" decoding="async" width="300" height="213" src="/blog/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png" alt="TSL - Time Series Language" class="wp-image-14499" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-300x213.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A-768x545.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/E80CF711-CFD0-45F8-B7E4-4CEBE7E5815A.png 885w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>



<p><a rel="noreferrer noopener nofollow external" href="https://github.com/ovh/tsl" target="_blank" data-wpel-link="external">TSL</a> is the result of three years of Time Series analytics support, and offers a functional Time Series Language. The aim of TSL is to build a Time Series data flow as code. </p>



<p>With TSL, native methods such as <code>select</code> and <code>where</code> exist to choose which metrics to work on. Then, as Time Series data is time-related, we have to apply a time selector method on the selected metadata. The two available methods are <code>from</code> and <code>last</code>. The vast majority of the other TSL methods take Time Series sets as input and provide Time Series sets as the result. For example, there are methods that only select values above a specific threshold, compute rates, and so on. We have also included specific operations, such as additions or multiplications, to apply to multiple subsets of Time Series sets.</p>



<p>For a more readable language, you can also define variables to store Time Series queries and reuse them in your script at any time. For now, we support only a few native types, such as <code>Numbers</code>, <code>Strings</code>, <code>Time durations</code>, <code>Lists</code>, and <code>Time Series</code> (of course!).</p>



<p>Finally, the same query used throughout this article would look like this in TSL:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">select("cpu.usage_system")
.where("cpu~cpu[0-9]+")
.last(12h)
.sampleBy(2m,mean)
.groupBy(max)</code></pre>
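<p>As a toy illustration of this data-flow style, here is a hypothetical Python builder that simply renders such a method chain as TSL text (our own sketch, not an official client):</p>

```python
class TSLQuery:
    # Toy builder: accumulates method calls and renders the TSL text.
    def __init__(self, metric):
        self.calls = ['select("%s")' % metric]

    def add(self, call):
        self.calls.append(call)
        return self

    def render(self):
        return "\n".join([self.calls[0]] + ["." + c for c in self.calls[1:]])

q = (TSLQuery("cpu.usage_system")
     .add('where("cpu~cpu[0-9]+")')
     .add("last(12h)")
     .add("sampleBy(2m,mean)")
     .add("groupBy(max)"))
print(q.render())
```

The rendered text can then be POSTed to a TSL proxy like any other query.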



<p>You can also write more complex queries. For example, we condensed our WarpScript hands-on, designed to detect <a href="https://helloexoworld.github.io/hew-hands-on/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">exoplanets from NASA raw data</a>, into a single TSL query:</p>



<pre class="wp-block-code"><code lang="javascript" class="language-javascript">sample = select('sap.flux')
 .where('KEPLERID=6541920')
 .from("2009-05-02T00:56:10.000000Z", to="2013-05-11T12:02:06.000000Z")
 .timesplit(6h,100,"record")
 .filterByLabels('record~[2-5]')
 .sampleBy(2h, min, false, "none")

trend = sample.window(mean, 5, 5)

sub(sample,trend)
 .on('KEPLERID','record')
 .lessThan(-20.0)</code></pre>



<p>So what did we do here? First, we instantiated a <strong>sample</strong> variable, in which we loaded the &#8216;sap.flux&#8217; raw data of a single star, KEPLERID 6541920. We then cleaned the series using the timesplit function (to split the star&#8217;s series wherever there is a hole in the data longer than 6 hours), keeping only four records. Finally, we sampled the result, keeping the minimal value of each 2-hour bucket.</p>



<p>We then used this result to compute the series trend, using a moving average of 10 hours.</p>



<p>To conclude, the query returns only the points for which the sample minus the trend is less than -20.</p>
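<p>The detrend-and-threshold logic can be sketched in plain Python (assuming simple value lists and a centred moving average; this is an illustration of the idea, not the TSL engine):</p>

```python
def moving_average(values, before, after):
    # Centred moving average over up to before + after + 1 points.
    out = []
    for i in range(len(values)):
        window = values[max(0, i - before):min(len(values), i + after + 1)]
        out.append(sum(window) / len(window))
    return out

def dips_below(sample, threshold):
    # Subtract the trend from the sample and keep the indices
    # where the difference drops below the threshold.
    trend = moving_average(sample, 5, 5)
    return [i for i, (s, t) in enumerate(zip(sample, trend))
            if s - t < threshold]

flux = [100.0] * 20
flux[10] = 40.0                 # a transit-like dip in the star's flux
dips = dips_below(flux, -20.0)  # → [10]
```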



<h3 class="wp-block-heading">TSL is Open Source</h3>



<p>Even if our first community of users was mostly inside OVH, we&#8217;re pretty confident that TSL can be used to solve a lot of Time Series use cases.</p>



<p>We are currently beta testing TSL on our OVH Metrics public platform. Furthermore, TSL is <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">open-sourced on Github</a>, so you can also test it on your own platforms.</p>



<p>We would love to get your feedback or comments on TSL, or Time Series in general. We&#8217;re available on the <a href="https://gitter.im/ovh/metrics" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVH Metrics gitter</a>, and you can find out more about TSL in <a href="https://labs.ovh.com/metrics-beta-features/" data-wpel-link="exclude">our Beta features documentation</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
