TSL: a developer-friendly Time Series query language for all our metrics

At the Metrics team, we have been working on Time Series for several years now. In our experience, the data analytics capabilities of a Time Series Database (TSDB) platform are a key factor in creating value from your metrics, and these capabilities are mostly defined by the query languages it supports.

TSL stands for Time Series Language. In simple terms, TSL is an abstracted way of generating queries for different TSDB backends, in the form of an HTTP proxy. It currently supports Warp 10’s WarpScript and Prometheus’ PromQL query languages, but we aim to extend support to other major TSDBs.

TSL - Time Series Language

To provide some context around why we created TSL: it all began with a review of the TSDB query languages supported on the OVH Metrics Data Platform. When implementing them, we learned the good, the bad and the ugly of each one. In the end, we decided to build TSL to simplify querying on our platform, before open-sourcing it so it can be used with any TSDB solution.

So why did we decide to invest our time in developing such a proxy? Well, let me tell you the story of the OVH Metrics protocol!

From OpenTSDB…

OpenTSDB

The first aim of our platform is to be able to support the OVH infrastructure and application monitoring. When this project started, a lot of people were using OpenTSDB, and were familiar with its query syntax. OpenTSDB is a scalable Time Series database. Its query syntax is easy to read, as you send a JSON document describing the request. The document below will load all sys.cpu.0 metrics of the test datacentre, summing them between the start and end dates:

{
    "start": 1356998400,
    "end": 1356998460,
    "queries": [
        {
            "aggregator": "sum",
            "metric": "sys.cpu.0",
            "tags": {
                "host": "*",
                "dc": "test"
            }
        }
    ]
}

This enables the quick retrieval of specific data in a specific time range. At OVH, this was used for graphing purposes, in conjunction with Grafana, and helped us to spot potential issues in real time, as well as investigate past events. OpenTSDB integrates simple queries, in which you can define your own sampling, deal with counter data, and filter and aggregate raw data.

OpenTSDB was the first protocol supported by the Metrics team, and is still widely used today. Internal statistics show that 30-40% of our traffic is based on OpenTSDB queries. A lot of internal use cases can still be entirely solved with this protocol, and the queries are easy to write and understand.

For example, an OpenTSDB query to get the max value of usage_system across cpus 0 to 9, sampled over 2-minute spans by averaging their values, looks like this:

{
    "start": 1535797890,
    "end": 1535818770,
    "queries": [{
        "metric": "cpu.usage_system",
        "aggregator": "max",
        "downsample": "2m-avg",
        "filters": [{
            "type": "regexp",
            "tagk": "cpu",
            "filter": "cpu[0-9]+",
            "groupBy": false
        }]
    }]
}

However, OpenTSDB quickly shows its limitations, and some specific use cases can’t be solved with it. For example, you can’t apply any operations directly on the backend: you have to load the data into an external tool and apply any analytics there.

One of the main areas where OpenTSDB (version 2.3) is lacking is operators on multiple Time Series sets, which allow operations such as dividing one set of series by another. Those operators are useful when, for example, you have one set of series recording the total time spent serving requests and another recording the total request count, and you want to compute the individual time per request. That’s one of the reasons why the OVH Metrics Data Platform supports other protocols.
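
To give a taste of where this article is heading, here is how that divide-series use case might look in TSL. This is only a sketch: the metric names are invented, and it assumes a div operator that combines two Time Series sets the same way as the sub operator shown later in this article:

requestsTime = select("http.requests.time.total").last(1h).sampleBy(1m, sum)
requestsCount = select("http.requests.count").last(1h).sampleBy(1m, sum)

div(requestsTime, requestsCount).on("host")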

… to PromQL

Prometheus

The second protocol we worked on was PromQL, the query language of the Prometheus Time Series database. When we made that choice in 2015, Prometheus was gaining some traction, and it still has an impressive adoption rate. But if Prometheus is a success, it isn’t because of its query language, PromQL. This language never took off internally, although it has started to gain some adoption recently, mainly due to the arrival of people who worked with Prometheus in their previous companies. Internally, PromQL queries represent about 1-2% of our daily traffic. The main reasons are that many simple use cases can be solved more quickly, and with more control over the raw data, with OpenTSDB queries, while many of the more complex use cases cannot be solved with PromQL at all. A request similar to the one defined in OpenTSDB would be:

api/v1/query_range?
query=max(cpu.usage_system{cpu=~"cpu[0-9]%2B"})&
start=1535797890&
end=1535818770&
step=2m

With PromQL, you lose control over how you sample the data, as the only operator is last. This means that if (for example) you downsample your series over a 5-minute duration, you can only keep the last value of each 5-minute span. In contrast, all competitors include a range of operators. With OpenTSDB, for example, you can choose between several operators, including average, count, standard deviation, first, last, percentiles, minimum, maximum, or the sum of all values inside your defined span.
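
For comparison, here is how that downsampling choice reads in TSL, the language this article introduces below. It is a minimal sketch that reuses only the select, where, last and sampleBy methods shown later on, with max as an explicit sampling operator instead of PromQL’s implicit last:

select("cpu.usage_system")
.where("cpu~cpu[0-9]+")
.last(1h)
.sampleBy(5m, max)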

In the end, a lot of people choose to use a much more complex method: WarpScript, which is powered by the Warp 10 Analytics Engine we use behind the scenes.

Our internal adoption of WarpScript

WarpScript by SenX

WarpScript is the current Time Series language of Warp 10®, our underlying backend. WarpScript helps with any complex Time Series use case, and solves numerous real-world problems, as you have full control over all your operations. You have dedicated frameworks of functions to sample raw data and fill missing values, as well as frameworks to apply single-value or window operations. You can apply operations across multiple Time Series sets, and have dedicated functions to manipulate Time Series timestamps, statistics, and so on.

It works with Reverse Polish Notation (like a good, old-fashioned HP48, for those who’ve got one!), and simple use cases can be easy to express. But when it comes to analytics, while it certainly solves problems, it’s still complex to learn. Time Series use cases in particular are complex and require a proper thinking model, and WarpScript helped to solve a lot of hard ones.

This is why it’s still the main query language used at OVH on the OVH Metrics platform, with nearly 60% of internal queries making use of it. The same request that we just computed in OpenTSDB and PromQL would be as follows in WarpScript:

[ "token" "cpu.usage_system" { "cpu" "~cpu[0-9]+" } NOW 2 h ] FETCH
[ SWAP bucketizer.mean 0 2 m 0 ] BUCKETIZE
[ SWAP [ "host" ] reducer.max ] REDUCE

A lot of users find it hard to learn WarpScript at first, but after solving their initial issues with some (sometimes a lot of) support, it becomes the first step of their Time Series adventure. Later, they figure out new ideas about how to gain knowledge from their metrics. They then come back with many demands and questions about their daily issues, some of which they can quickly solve themselves, with their own knowledge and experience.

What we learned from WarpScript is that it’s a fantastic tool with which to build analytics for our Metrics data. We pushed many complex use cases with advanced signal-processing algorithms, such as LTTB, outlier or pattern detection, and kernel smoothing, where it proved to be a real enabler. However, it proved quite expensive to support for basic requirements, and feedback indicated that the syntax and overall complexity were big concerns.

A WarpScript can involve dozens (or even hundreds) of lines, and a successful execution is often an accomplishment, with the special feeling that comes from having made full use of one’s brainpower. In fact, an inside joke amongst our team is that being able to write a working WarpScript in a single day earns you a WarpScript Pro Gamer badge! That’s why we’ve distributed Metrics t-shirts to users who have achieved significant successes with the Metrics Data Platform.

We liked the WarpScript semantics, but we wanted them to have a significant impact on a broader range of use cases. This is why we started to write TSL, with a few simple goals:

  • Offer a clear Time Series analytics semantic
  • Simplify query writing and make it developer-friendly
  • Support data-flow queries and ease the debugging of complex queries
  • Don’t try to be the ultimate toolbox. Keep it simple.

We know that users will probably have to switch back to WarpScript every so often. However, we hope that using TSL will simplify their learning curve. TSL is simply a new step in the Time Series adventure!

The path to TSL

TSL - Time Series Language

TSL is the result of three years of Time Series analytics support, and offers a functional Time Series Language. The aim of TSL is to build a Time Series data flow as code.

With TSL, native methods such as select and where exist to choose which metrics to work on. Then, as Time Series data is time-related, you have to apply a time selector method to the selected metadata; the two available methods are from and last. The vast majority of the other TSL methods take Time Series sets as input and provide Time Series sets as the result. For example, there are methods that only keep values above a specific threshold, methods that compute rates, and so on. We have also included specific operations, such as addition or multiplication, to apply across multiple subsets of Time Series sets.
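
As a minimal sketch of that data flow, assuming the rate and greaterThan method names from the TSL documentation, and an invented metric name and label:

select("http.requests.count")
.where("dc=gra")
.last(1h)
.rate()
.greaterThan(0.5)

Each line feeds its resulting Time Series set into the next method, which is what makes the flow easy to read and to debug step by step.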

For a more readable language, you can also define variables to store Time Series queries, and reuse them in your script any time you wish. For now, we only support a few native types, such as Numbers, Strings, Time durations, Lists and, of course, Time Series.
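
For instance, here is a small sketch of variable reuse. It assumes that a Time duration can be stored in a variable just like a Time Series query, and it only relies on methods and operators appearing elsewhere in this article:

span = 2m
cpus = select("cpu.usage_system").where("cpu~cpu[0-9]+").last(1h)

maxCpu = cpus.sampleBy(span, max)
minCpu = cpus.sampleBy(span, min)

sub(maxCpu, minCpu).on("cpu")

Here the same selection is sampled twice, and the result is the max-min spread of each CPU over every 2-minute span.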

To wrap up, the same query used throughout this article would be as follows in TSL:

select("cpu.usage_system")
.where("cpu~cpu[0-9]+")
.last(12h)
.sampleBy(2m, mean)
.groupBy(max)

You can also write more complex queries. For example, we condensed our WarpScript hands-on, designed to detect exoplanets from NASA raw data, into a single TSL query:

sample = select('sap.flux')
 .where('KEPLERID=6541920')
 .from("2009-05-02T00:56:10.000000Z", to="2013-05-11T12:02:06.000000Z")
 .timesplit(6h,100,"record")
 .filterByLabels('record~[2-5]')
 .sampleBy(2h, min, false, "none")

trend = sample.window(mean, 5, 5)

sub(sample,trend)
 .on('KEPLERID','record')
 .lessThan(-20.0)

So what did we do here? First, we instantiated a sample variable, in which we loaded the ‘sap.flux’ raw data of one star: the one with KEPLERID 6541920. We then cleaned the series using the timesplit function (to split the star’s series wherever there is a hole in the data longer than 6 hours), keeping only four records. Finally, we sampled the result, keeping the minimal value of each 2-hour bucket.

We then used this result to compute the series trend, using a moving average over a window of 5 points before and 5 points after each data point (that is, 10 hours on each side, given the 2-hour sampling).

To conclude, the query subtracts the trend from the sample series and returns only the points whose value is less than -20.

TSL is Open Source

Even though our first community of users was mostly inside OVH, we’re pretty confident that TSL can be used to solve a lot of Time Series use cases.

We are currently beta testing TSL on our OVH Metrics public platform. Furthermore, TSL is open-sourced on GitHub, so you can also test it on your own platforms.

We would love to get your feedback or comments on TSL, or on Time Series in general. We’re available on the OVH Metrics gitter, and you can find out more about TSL in our Beta features documentation.
