<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observability Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/observability/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/observability/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 17 Apr 2023 14:43:35 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Observability Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/observability/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Picking our Prometheus&#8217; remote storage</title>
		<link>https://blog.ovhcloud.com/picking-our-prometheus-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Mon, 17 Apr 2023 14:43:34 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24588</guid>

					<description><![CDATA[If you are running an IT system, you are most likely running an Observability stack alongside it. Nowadays, the question is no longer whether you need Observability, but how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now. During the last year, we [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpicking-our-prometheus-remote-storage%2F&amp;action_name=Picking%20our%20Prometheus%26%238217%3B%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>If you are running an IT system, you are most likely running an Observability stack alongside it. Nowadays, the question is no longer whether you need Observability, but how you will compose your stack. At OVHcloud, we have been running a scalable timeseries backend for years now. </p>



<p>Over the last year, we had the opportunity to reassess our technical choices. Prometheus is the <em>de facto</em> standard, but that choice is only the beginning of the process. Thanks to open source communities, there are a lot of possible options. </p>



<p>The <a href="https://blog.ovhcloud.com/tag/prometheus/" data-wpel-link="internal">previous posts</a> were about the process we followed to select our new backend; this one concludes the series and shares what we have chosen and why. In case you missed them, this series covers an <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">introduction to Prometheus remote storage</a> and how to bench such a solution from both the <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">write</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">read</a> perspectives, the hard way or <a href="https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">like a pro</a>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg" alt="Picking our Prometheus' remote storage" class="wp-image-25069" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1488.jpg 1199w" sizes="(max-width: 512px) 100vw, 512px" /></figure>



<h3 class="wp-block-heading">And the winner is&#8230; Grafana Mimir!</h3>



<p>After all the experimentation we have done, we have chosen Grafana Mimir. The first reason this solution is a good fit for us is its excellent read/write performance, as well as its scalability. My team, core-observability, has as its main mission to provide a resilient and feature-full observability infrastructure. Many teams rely on us, and each of them has its own particularities. Multitenancy is therefore a must-have for us: we must be able to prevent side effects and &#8220;noisy neighbour&#8221; issues, which is why rate limiting was on our bucket list. Mimir provides a lot of settings, both at the cluster level and at the tenant level, to make sure one tenant does not impact the others or degrade the quality of service.</p>
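<p>As an illustration only (the tenant names and values below are made up, and the exact limit names should be checked against the Mimir documentation for your version), such per-tenant limits are expressed as overrides in Mimir&#8217;s runtime configuration:</p>

```yaml
# Hypothetical per-tenant overrides: tenant names and thresholds are examples.
overrides:
  team-a:
    ingestion_rate: 100000             # accepted samples per second
    max_global_series_per_user: 1000000
  team-b:
    ingestion_rate: 10000
    max_global_series_per_user: 150000
```

<p>Anything not overridden falls back to the cluster-wide defaults, which is how one noisy tenant can be throttled without touching the others.</p>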



<figure class="wp-block-image alignright size-full is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png" alt="Grafana Mimir" class="wp-image-25072" width="265" height="96" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489.png 529w, https://blog.ovhcloud.com/wp-content/uploads/2023/04/IMG_1489-300x108.png 300w" sizes="(max-width: 265px) 100vw, 265px" /></figure>



<p>Like many cloud-native technologies, Mimir relies on an <a href="https://www.ovhcloud.com/en/public-cloud/object-storage/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">object storage</a> where the timeseries are stored. Doing so decouples compute from storage, and therefore avoids having to add more computing power or bigger disks to offer the retention your users need. Data is compacted to achieve the smallest possible storage footprint, and therefore cost efficiency.</p>



<p>As we said in our previous posts, Prometheus is today the <em>de facto</em> standard when it comes to timeseries. We wanted to offer our users the full experience: 100% compliance with <a href="https://promlabs.com/promql-compliance-tests/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">promql</a>, recording and alerting rules. Mimir is fully featured on this side; it&#8217;s even part of a bigger picture with more integrations, which is the icing on the cake. Let&#8217;s start with Grafana, which is of course fully compatible with Mimir; you can also manage your recording and alerting rules directly from its UI. Then comes <a href="https://grafana.com/oss/loki/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Loki</a>, which is like Prometheus but for logs: it allows you to query your logs just like your metrics. And finally <a href="https://grafana.com/oss/tempo/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Tempo</a>, which covers the last observability pillar: distributed tracing.</p>



<p>On the operational side, there is no doubt that Mimir has been built with production stability and resiliency in mind. The default settings are production-ready, the documentation is crystal clear, and you also have the material to facilitate the day-to-day care of Mimir in production. As SREs running Mimir, you can use their knowledge base: you have at your disposal ready-to-use <a href="https://github.com/grafana/mimir/tree/main/operations/mimir-mixin-compiled/dashboards" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/rules.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">recording</a> &amp; <a href="https://github.com/grafana/mimir/blob/main/operations/mimir-mixin-compiled/alerts.yaml" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">alerting</a> rules and <a href="https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">runbooks</a>. Of course, deployments may differ from one another; this is a very good opportunity to contribute back to the vivid open source community around Grafana Labs. No matter the size of the contribution, it is always welcome and reviewed in a timely manner.
Whether you need to adjust the <a href="https://github.com/grafana/mimir/pull/2657" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">dashboards</a>, add a <a href="https://github.com/grafana/mimir/pull/2864" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">feature</a>&nbsp;or <a href="https://github.com/grafana/mimir/pull/1803" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">build deb/rpm packages</a>, you can always <a href="https://github.com/grafana/mimir/tree/main/docs/internal/contributing" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">contribute</a>.</p>



<p>The definitive reason why we have chosen Mimir is the core values of its maintainers. Kudos to them. They are welcoming, easy-going and, more importantly, they take <a href="https://grafana.com/blog/2022/03/30/announcing-grafana-mimir/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open source seriously</a>, just like us at OVHcloud. If you want a glimpse of that, come by their Slack and see how fast they answer.</p>



<p>My team can&#8217;t wait to see all the beautiful things our users will do with this backend. One thing is sure: we&#8217;ll contribute back and make sure Mimir thrives. But let&#8217;s reserve that part for a future blog post.</p>
<img decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fpicking-our-prometheus-remote-storage%2F&amp;action_name=Picking%20our%20Prometheus%26%238217%3B%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus like a pro with k6</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-like-a-pro-with-k6/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Tue, 04 Apr 2023 12:19:05 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24585</guid>

					<description><![CDATA[In our previous posts about choosing a Prometheus remote storage, we have seen how to&#160;set up a benchmarking infrastructure and how to benchmark promql performance. We have been able to obtain results, but the whole benchmark is flawed in many ways. This blog post discusses how we should have benchmarked our remote storage. How to [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-like-a-pro-with-k6%2F&amp;action_name=Benchmarking%20Prometheus%20like%20a%20pro%20with%20k6&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In our previous posts about choosing a Prometheus remote storage, we have seen how to&nbsp;<a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">set up a benchmarking infrastructure</a> and <a href="https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance" data-wpel-link="internal">how to benchmark promql performance</a>. We have been able to obtain results, but the whole benchmark is flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more infrastructure than necessary to assess a particular aspect of your remote storage.</li>



<li>it&#8217;s hard to reproduce exactly the same setup; even with the same configuration and software versions you will get similar results, but not exactly the same ones.</li>



<li>you&#8217;re not always benchmarking what you think you are. We have spent quite some time troubleshooting performance issues which were in fact in the <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or HAProxy configuration.</li>



<li>it focuses mainly on the write path, without stress on the read path, which is not realistic.</li>
</ul>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg" alt="Benchmarking Prometheus like a pro with k6" class="wp-image-24943" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1433.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>This blog post discusses how we should have benchmarked our remote storage.</p>



<h3 class="wp-block-heading">How to do a good benchmark? K6 to the rescue</h3>



<p>A good benchmark needs to be accurate and reproducible. Moreover, for our use case we want a tool that takes into account both Prometheus&#8217;s read and write paths. Finally, we need to be able to remove all unnecessary pieces, so that we can focus on the remote storage only.</p>



<p>Such software could be a project on its own, but fortunately for us there is an open source solution for that: <a href="https://k6.io/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">K6</a>.</p>



<p>K6 is a modern, general-purpose load testing tool which can be extended with modules to support Prometheus remote storage. Sounds interesting, don&#8217;t you think?</p>



<p>In our previous blog post we explained how we built our benchmarking infrastructure, which was rather complex in order to be accurate.</p>



<figure class="wp-block-image aligncenter"><img decoding="async" src="https://github.com/wilfriedroset/remote-storage-wars/blob/master/assets/generic-infrastructure.png?raw=true" alt="generic-infrastructure.png"/></figure>



<p>With k6 as the benchmarking tool, the infrastructure can be greatly simplified:</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png" alt="With k6 as benchmarking tool the infrastructure can be greatly simplified" class="wp-image-24941" width="512" height="218" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-1024x436.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-300x128.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6-768x327.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/k6.png 1127w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>K6 is quite flexible and configurable. Its input is a load testing script: you can either write your own script or reuse an <a href="https://github.com/grafana/mimir/tree/main/operations/k6" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">open-sourced one</a>. As the whole logic lives in the load testing script, the benchmark becomes easily reproducible, which is exactly what we need.</p>
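<p>As a minimal sketch (the endpoint and query below are placeholders, not the Grafana script we actually reused), a k6 load testing script, run with <code>k6 run script.js</code>, looks like this:</p>

```javascript
// Minimal k6 sketch; requires the k6 binary (this is not plain Node.js code).
// The target URL is a made-up placeholder.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 10,        // ten virtual users
  duration: '5m', // sustained load for five minutes
};

export default function () {
  // Exercise the read path with an instant PromQL query
  const res = http.get(
    'https://remote-storage.example.com/prometheus/api/v1/query?query=up'
  );
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

<p>The <code>options</code> object drives the load profile (number of virtual users, duration, or ramp-up stages), while the default function is what each virtual user executes in a loop; the checks it performs are what end up in the report shown below.</p>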



<p>To launch a benchmark you need two pieces of infrastructure:</p>



<ul class="wp-block-list">
<li>Somewhere to run k6, which could be a <a href="https://www.ovhcloud.com/en-ie/public-cloud/prices/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">c2-120 instance on our public cloud</a></li>



<li>A remote storage to benchmark; for a quick start, users are one Helm install away on <a href="https://www.ovhcloud.com/en-ie/public-cloud/kubernetes/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">k8s</a></li>
</ul>



<p>For our use case, we have chosen to reuse the load testing script from Grafana, which does exactly what we are looking for. All the useful information to tune and assess your remote storage is output by k6.</p>



<pre class="wp-block-code"><code class="">     ✓ write worked

     █ instant query high cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     █ range query

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field is 'success' to equal 'success'
       ✓ expected resultType is 'matrix' to equal 'matrix'

     █ instant query low cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     checks............................................................................: 100.00% ✓ 1454     ✗ 0
     ✓ { type:read }...................................................................: 0.00%   ✓ 0        ✗ 0
     ✓ { type:write }..................................................................: 100.00% ✓ 6        ✗ 0
     data_received.....................................................................: 1.0 MB  8.4 kB/s
     data_sent.........................................................................: 277 kB  2.3 kB/s
     group_duration....................................................................: avg=64.61ms min=39.94ms med=60.43ms max=230.05ms p(90)=80.39ms p(95)=107.93ms
     http_req_blocked..................................................................: avg=4.65ms  min=2µs     med=6µs     max=96.84ms  p(90)=11µs    p(95)=58.42ms
     http_req_connecting...............................................................: avg=1.31ms  min=0s      med=0s      max=21.87ms  p(90)=0s      p(95)=16.99ms
     http_req_duration.................................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
       { expected_response:true }......................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
     ✓ { type:read }...................................................................: avg=53.8ms  min=34.23ms med=52.76ms max=164.1ms  p(90)=66.85ms p(95)=71.62ms
     ✓ { url:https://admin:security-matters@remote-storage.poc.ovh.net/api/v1/push }...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_failed...................................................................: 0.00%   ✓ 0        ✗ 368
     http_req_receiving................................................................: avg=92.34µs min=32µs    med=89µs    max=301µs    p(90)=125.3µs p(95)=150µs
     http_req_sending..................................................................: avg=49.05µs min=12µs    med=40µs    max=566µs    p(90)=68µs    p(95)=94.59µs
     http_req_tls_handshaking..........................................................: avg=3.11ms  min=0s      med=0s      max=54.28ms  p(90)=0s      p(95)=39.39ms
     http_req_waiting..................................................................: avg=53.56ms min=33.94ms med=52.56ms max=163.93ms p(90)=66.88ms p(95)=71.66ms
     http_reqs.........................................................................: 368     3.064697/s
     iteration_duration................................................................: avg=64.88ms min=40.34ms med=60.78ms max=230.27ms p(90)=80.87ms p(95)=108.47ms
     iterations........................................................................: 368     3.064697/s
     vus...............................................................................: 26      min=26     max=26
     vus_max...........................................................................: 26      min=26     max=26
</code></pre>



<p>What a time saver! With k6, we have been able to efficiently assess all remote storage solutions. This is a <strong>significant</strong> improvement compared to our previous benchmarking plan.</p>



<p>The next and final post will be about which remote storage we have chosen to be our internal solution.</p>



<p>Stay tuned.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-like-a-pro-with-k6%2F&amp;action_name=Benchmarking%20Prometheus%20like%20a%20pro%20with%20k6&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Benchmarking Prometheus promql performance</title>
		<link>https://blog.ovhcloud.com/benchmarking-prometheus-promql-performance/</link>
		
		<dc:creator><![CDATA[Julien Girard]]></dc:creator>
		<pubDate>Fri, 17 Mar 2023 12:00:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24598</guid>

					<description><![CDATA[Here at OVHcloud, we are trying to replace our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choice, we wanted to apply a bullet-proof test to the challengers. You can do two main things with a storage [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-promql-performance%2F&amp;action_name=Benchmarking%20Prometheus%20promql%20performance&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Here at OVHcloud, we are trying to replace our legacy metrics-oriented infrastructure. This infrastructure matters a lot to us, as internal teams use it to supervise the core services of OVH, so before making any choice, we wanted to apply a bullet-proof test to the challengers.</p>



<p>You can do two main things with a storage backend: you can write to it, or you can read from it. It is the test of this latter part that we are focusing on today. We wanted our test to reproduce a production-oriented scenario; let&#8217;s see how we built it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg" alt="Benchmarking Prometheus promql performance" class="wp-image-24878" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1335.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<p>In this blog post we won&#8217;t cover the building of the underlying TSDB, as what follows could apply to any of them as long as it ensures PromQL compatibility. We will also assume that you can write to the TSDB using the Prometheus remote write protocol.</p>



<p>Now that we have our bench cluster up and running, we need to fill it up with data and this is the subject of the following part.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Let’sfindsome“real”data">Let’s find some <em>&#8220;real&#8221;</em> data</h3>



<p>As a cloud provider, all our solutions use compute instances, whether they are virtual or baremetal. One of our most common use cases is to <em>&#8220;look&#8221;</em> at system server metrics through automatic <a href="https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">recording rules</a> or through Grafana dashboards. All these queries are PromQL expressions.</p>



<p>To emulate our ingestion workflow, we deployed nodes exposing their metrics through <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node exporter</a>. We also tasked a couple of Prometheus servers with scraping them several times each, to emulate a large number of hosts (several thousands of them). Those Prometheus servers are in charge of writing the scraped metrics, using the remote write protocol, to the various backends we are benchmarking.</p>
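<p>As a sketch (hostnames, labels and the remote write endpoint below are made up for the example), scraping the same node exporter several times under different labels looks like this in a Prometheus configuration:</p>

```yaml
# prometheus.yml fragment: the same node exporter is scraped twice,
# with a distinguishing label, so it shows up as two distinct hosts.
scrape_configs:
  - job_name: 'node-emulated-1'
    static_configs:
      - targets: ['node1.example.com:9100']
        labels:
          emulated_host: 'host-0001'
  - job_name: 'node-emulated-2'
    static_configs:
      - targets: ['node1.example.com:9100']
        labels:
          emulated_host: 'host-0002'

remote_write:
  - url: 'https://tsdb.example.com/api/v1/push'  # backend-specific endpoint
```

<p>Multiplying such scrape jobs is how a handful of real machines can stand in for thousands of emulated hosts on the write path.</p>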



<p>After waiting several hours or days, our backends are full of data and we can move on. If you need more info on this subject, we have written another <a href="https://blog.ovhcloud.com/prometheus-remote-storage-playground/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">blog post</a> about it.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimetobench">It’s time to bench</h3>



<p>As we said earlier, our production read workload has two components: automatic recording rules and Grafana dashboards. As our alerting system is not widely distributed, we won&#8217;t discuss it here, so let&#8217;s focus on the Grafana part. A dashboard is a collection of requests to execute against a backend, which is why we have extracted both the range and instant queries from one.</p>



<p>Once we had this first result, we needed a way to execute these requests against the backend. As a PromQL request is mainly an HTTP call, we can use an HTTP benchmark tool to support our test. One of the most widely used is <a href="https://jmeter.apache.org" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Apache JMeter</a>, and this is the one we used.</p>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png" alt="Graphana dashboard" class="wp-image-24880" width="235" height="184" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336.png 469w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1336-300x235.png 300w" sizes="auto, (max-width: 235px) 100vw, 235px" /></figure>



<p>As Apache JMeter is not able to directly execute PromQL requests against a Prometheus-compliant backend, the previously extracted queries have been converted into a test plan. This test plan takes various parameters, three of which are quite important: the timestamp, the interval and the step, which apply to every query forwarded to the backend, just like when you submit a time frame to a dashboard in Grafana.</p>
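<p>Concretely, each such query is a single HTTP call. The sketch below (the base URL and query are placeholders) builds a <code>query_range</code> request for a given time frame and step, the way a Grafana panel would:</p>

```javascript
// Build a Prometheus range-query URL for a given time frame,
// the way a Grafana dashboard panel would. The base URL is a placeholder.
function rangeQueryUrl(baseUrl, promql, endSeconds, rangeSeconds, stepSeconds) {
  const params = new URLSearchParams({
    query: promql,
    start: String(endSeconds - rangeSeconds),
    end: String(endSeconds),
    step: String(stepSeconds),
  });
  return `${baseUrl}/api/v1/query_range?${params.toString()}`;
}

// Example: a 1h time frame with a 15s step
const url = rangeQueryUrl(
  'https://tsdb.example.com',
  'rate(node_cpu_seconds_total[5m])',
  1700000000, // end timestamp (Unix seconds)
  3600,       // 1h range
  15          // 15s step
);
```

<p>Varying the range and step then emulates the different time frames a user can select in Grafana.</p>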



<p>We are now able to emulate the load of a dashboard over various time frames and extract meaningful information from each run, as Apache JMeter is quite a powerful tool. It allows us to use a warm-up period to exploit the benefits of caches, or a ramp-up to study the response of our cluster when the load increases, loading either always the same data or data from new nodes.</p>



<p>For our first bench, we decided to go with <a href="https://grafana.com/grafana/dashboards/1860-node-exporter-full" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">the most widely used node exporter dashboard</a>. We also identified widely used time frames (5m, 15m, 30m, 1h, 6h, 12h, 24h, 2d, 3d, 4d, 5d, 6d, 7d); those are mainly the default time frames proposed by Grafana.</p>



<p>With the set of tools defined above, we identified three tests we wanted to run against each one of those time frames.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Firsttest&quot;Hotandcoldstorage&quot;">First test &#8220;Hot and cold storage&#8221;</h3>



<p>A lot of solutions use hot and cold storage, sometimes also named short-term storage and long-term storage. With this test we want to measure the performance of those various layers.</p>



<p>As the purpose of this test is to check the response time of the various underlying storage layers, you want to make sure to disable any cache that may alter the results.</p>



<p>Moreover, we do not want to test the saturation of the platform, so we will emulate only ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Secondtest&quot;Cachingperformances&quot;">Second test &#8220;Caching performances&#8221;</h3>



<p>This test is quite the opposite of the previous one. Here we want to test the response time of the TSDB in the best possible scenario (data already cached).</p>



<p>To get the best results from this test, we will use a warm-up period that will populate the various caches and then measure the response time of the TSDB.</p>



<p>Once again, in this test, we do not want to test the saturation of the platform, so we will emulate only ten clients.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Thirdtest&quot;Fillingupthecache&quot;">Third test &#8220;Filling up the cache&#8221;</h3>



<p>The purpose of this last bench is to test the saturation of the platform. Here we will use a ramp-up, adding more and more clients to the test over a defined period of time, and check the corresponding errors and response times of the underlying platform.</p>



<p>At a certain point, we should see that the platform is not able to handle any more clients. We assume this number of clients will differ with the lookup time frame.</p>



<h3 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-Conclusion">Conclusion</h3>



<p>The benchmark led to two expected conclusions.</p>



<ul class="wp-block-list">
<li>Some storage media are way faster than others (memory is faster than local disk, which is faster than a distant object storage).</li>



<li>The use of the various caches proposed is a game changer.</li>
</ul>



<h4 class="wp-block-heading" id="Benchmarkingprometheuspromqlperformance-It’stimeforasecondconclusion">It’s time for a second conclusion</h4>



<p>Our approach to the benchmark is quite interesting, as it aims to emulate our production workload as precisely as possible. You may be wondering where we store this wonderful collection of tools. Well, here is the truth: maybe those tools don&#8217;t need to be shared, for several reasons:</p>



<ul class="wp-block-list">
<li>The result of the test largely depends on the data stored inside the TSDB, which is itself the result of another procedure and is difficult to reproduce. That leads to results that are subject to interpretation</li>



<li>The tooling is difficult to use and time-consuming</li>



<li>Just as time flies, the truth of today is not that of tomorrow, and your production reality of today will probably be quite different from the one to come</li>



<li>We like to fight the &#8220;not invented here&#8221; syndrome</li>
</ul>



<p>In consequence, we need a more convenient tool, ideally one used by others, with a more reproducible benchmarking pattern. We will discuss how we should have benchmarked our remote storage in the next blog post.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fbenchmarking-prometheus-promql-performance%2F&amp;action_name=Benchmarking%20Prometheus%20promql%20performance&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Prometheus&#8217; remote storage playground</title>
		<link>https://blog.ovhcloud.com/prometheus-remote-storage-playground/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Sun, 05 Mar 2023 23:49:35 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24583</guid>

					<description><![CDATA[Introduction In the previous post we have discuss how important remote storage are for Prometheus. We have also covered several attention points. In the following post we are covering remote write storage and how to bench them. Context After you have identify one (or more) remote storage who might suit your must bench it. However [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprometheus-remote-storage-playground%2F&amp;action_name=Prometheus%26%238217%3B%20remote%20storage%20playground&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="Remotestorageplayground-Introduction">Introduction</h3>



<p>In the <a href="https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/" target="_blank" rel="noreferrer noopener" data-wpel-link="internal">previous post</a> we discussed how important remote storage is for Prometheus. We also covered several points of attention. In this post, we cover remote <strong>write</strong> storage and how to benchmark it.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg" alt="Prometheus' remote storage playground" class="wp-image-24835" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-1024x538.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-300x158.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324-768x404.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1324.jpg 1199w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure>



<h4 class="wp-block-heading" id="Remotestorageplayground-Context">Context</h4>



<p>After you have identified one (or more) remote storage solutions that might suit your needs, you must benchmark them. However, it is not as straightforward as it seems. Let&#8217;s review what we will need for this experiment:</p>



<ul class="wp-block-list">
<li>A (scalable) remote storage, in our case one that supports remote write</li>



<li>One or more data generator</li>
</ul>



<h3 class="wp-block-heading" id="Remotestorageplayground-IntroducingHachimon">Introducing Hachimon</h3>



<figure class="wp-block-image alignright size-full is-resized"><img loading="lazy" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png" alt="Hachimon path" class="wp-image-24832" width="277" height="213" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322.png 554w, https://blog.ovhcloud.com/wp-content/uploads/2023/03/IMG_1322-300x231.png 300w" sizes="auto, (max-width: 277px) 100vw, 277px" /></figure>



<p>Benchmarking is always fun, but you know what is even more fun? Gamification! With my teammates, we created a short benchmark plan which we called the <a href="https://narutofanon.fandom.com/wiki/Hachimon" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Hachimon path</a>:</p>



<ul class="wp-block-list">
<li>Gate of Opening
<ul class="wp-block-list">
<li>1k targets</li>



<li>1000 series/target</li>



<li>~ 66k datapoints/sec</li>
</ul>
</li>



<li>Gate of Healing
<ul class="wp-block-list">
<li>2k targets</li>



<li>1000 series/target</li>



<li>~133k datapoints/sec</li>
</ul>
</li>



<li>Gate of Life
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec</li>
</ul>
</li>



<li>Gate of Pain
<ul class="wp-block-list">
<li>4k targets</li>



<li>1000 series/target</li>



<li>~266k datapoints/sec after deduplication</li>



<li>dual prometheus to increase pressure on deduplication features</li>
</ul>
</li>



<li>Gate of Limit
<ul class="wp-block-list">
<li>4k targets</li>



<li>2500 series/target to increase pressure on storage</li>



<li>~660k datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of View
<ul class="wp-block-list">
<li>8k targets</li>



<li>2500 series/target</li>



<li>~1.3M datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of Wonder
<ul class="wp-block-list">
<li>10k targets</li>



<li>2500 series/target</li>



<li>~1.6M datapoints/sec</li>



<li>dual prometheus</li>
</ul>
</li>



<li>Gate of Death
<ul class="wp-block-list">
<li>Add as many targets as you can until the backend is almost on fire</li>
</ul>
</li>
</ul>
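<p>The datapoint rates listed for each gate follow from a simple calculation: targets &#215; series per target &#247; scrape interval. The quick sketch below assumes a 15-second scrape interval, which the figures above are consistent with but which we do not state explicitly:</p>

```python
# Sanity-check the gate figures: datapoints/sec = targets * series_per_target / scrape_interval.
# The 15 s scrape interval is an assumption, not stated in the post.
SCRAPE_INTERVAL_S = 15

def datapoints_per_sec(targets: int, series_per_target: int, interval_s: int = SCRAPE_INTERVAL_S) -> float:
    return targets * series_per_target / interval_s

print(datapoints_per_sec(1_000, 1000))   # Gate of Opening: ~66k
print(datapoints_per_sec(4_000, 1000))   # Gate of Life: ~266k
print(datapoints_per_sec(10_000, 2500))  # Gate of Wonder: ~1.6M
```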



<p>To walk the Hachimon path, we&#8217;ve built an infrastructure where only the central piece, the remote storage, changes. Doing so helps us compare results.</p>



<p>The write path is stressed by one or more Prometheus clusters, which scrape the same <a href="https://github.com/prometheus/node_exporter" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">node_exporter</a> many times under different sets of labels. Doing so allows us to emulate an infrastructure bigger than it is. To increase the cardinality, we can tweak the node_exporter configuration to expose more or fewer series. By deploying one or more Prometheus clusters, we can both stress the deduplication feature of the backend and work around the hardware limitations of a given Prometheus.</p>
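<p>As a sketch, such a setup can be expressed in the Prometheus scrape configuration by declaring the same exporter address several times under different label sets. The addresses and label names below are illustrative, not our actual configuration:</p>

```yaml
# Illustrative prometheus.yml fragment: the same node_exporter instance is
# scraped twice, each time under a different label set, to emulate a fleet
# larger than the real one.
scrape_configs:
  - job_name: "emulated-fleet-a"
    static_configs:
      - targets: ["10.0.0.1:9100"]
        labels:
          fleet: "a"
          replica: "1"
  - job_name: "emulated-fleet-b"
    static_configs:
      - targets: ["10.0.0.1:9100"]   # same exporter, different labels
        labels:
          fleet: "b"
          replica: "2"
```

Each distinct label set produces a distinct set of series in the backend, so one small instance can stand in for many.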



<p>This approach is very similar to the one of <a href="https://valyala.medium.com/promscale-vs-victoriametrics-resource-usage-on-production-workload-91c8e3786c03" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Victoriametrics</a> which has inspired us. Kudos!</p>



<p>By the time we reached the end of our tests, the infrastructure we had built looked like the following:</p>



<figure class="wp-block-image"><img decoding="async" src="https://raw.githubusercontent.com/wilfriedroset/remote-storage-wars/master/assets/generic-infrastructure.png" alt=""/></figure>



<p>This is the infrastructure we used to bench both the read and the write paths of the remote storages. There is load balancing on both sides, and multiple pairs of Prometheus servers to put more or less pressure on the write path and the deduplication. Finally, the data comes from small instances exposing node_exporter metrics.</p>



<h3 class="wp-block-heading" id="Remotestorageplayground-Expectation">Expectation</h3>



<p>Thanks to this benchmarking plan, we have been able to differentiate the remote storages from a performance perspective. We got a first understanding of how each remote storage works, how to tune them, and what you can and cannot do with them. It seems to us that ease of operation is just as important as good performance. But most importantly, we learnt a lot while having fun.</p>



<h3 class="wp-block-heading" id="Remotestorageplayground-Conclusion">Conclusion</h3>



<p>This benchmarking plan is obviously flawed in many ways:</p>



<ul class="wp-block-list">
<li>it&#8217;s expensive, as you need to spawn more infrastructure than necessary to assess a particular aspect of your remote storage.</li>



<li>it&#8217;s hard to reproduce 100% the same setup; even with the same configuration and software version, you will get a similar result but not exactly the same one.</li>



<li>you&#8217;re not always benchmarking what you think you are. We spent quite some time troubleshooting performance issues which turned out to be in <a href="https://github.com/prometheus/prometheus/issues/9807" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">Prometheus</a> or HAProxy configuration.</li>



<li>it focuses mainly on the write path, without stress from the read path, which is not realistic.</li>
</ul>



<p>The next two posts of this series continue to focus on benchmarking. The first one focuses on read performance.</p>



<p>The second one focuses on how we should have benchmarked our solution from the beginning.</p>



<p>Stay tuned!</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fprometheus-remote-storage-playground%2F&amp;action_name=Prometheus%26%238217%3B%20remote%20storage%20playground&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Welcome to Prometheus world of remote storage</title>
		<link>https://blog.ovhcloud.com/welcome-to-prometheus-world-of-remote-storage/</link>
		
		<dc:creator><![CDATA[Wilfried Roset]]></dc:creator>
		<pubDate>Thu, 16 Feb 2023 16:29:25 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[prometheus]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=24579</guid>

					<description><![CDATA[At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for on open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwelcome-to-prometheus-world-of-remote-storage%2F&amp;action_name=Welcome%20to%20Prometheus%20world%20of%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p><em>At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for an open source solution. With this blog post, we&#8217;re starting a series of articles to provide feedback on our selection process and what we&#8217;ve learned along the way. Our mission was to find a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. We begin this series with an introduction to Prometheus remote storage&#8230;</em></p>



<p>Over the last decade, Prometheus has become one of the standards for Observability. Its core concept is well suited to today&#8217;s technological use cases, and it makes sense that the open source community loves it. While Prometheus does a lot of things really well, when it comes to long-term storage users must find a solution. This blog post series discusses Prometheus&#8217; remote storages, the technical challenges they aim to solve and, more importantly, how to pick the right one for <strong>you</strong>.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg" alt="Prometheus love remote storage" class="wp-image-24617" width="640" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-1024x542.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-300x159.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2-768x407.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/logo-article-prometheus2.jpg 1194w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading">What is a remote storage?</h2>



<p>Prometheus can be configured to read from or write to a remote storage on top of its local storage. This allows it to support long-term storage of user data. The two features are called&nbsp;<a href="https://prometheus.io/docs/operating/configuration/#remote_read" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_read</a> and <a href="https://prometheus.io/docs/operating/configuration/#remote_write" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">remote_write</a>.</p>



<p>With remote_read configured, Prometheus will answer read queries with data from the remote storage. The remote_write feature is responsible for shipping samples to the remote storage. Both of them are extremely useful and highly configurable. For the rest of this blog post, let&#8217;s focus on remote write.</p>
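<p>As a minimal sketch, both features are enabled in <code>prometheus.yml</code> with a few lines; the endpoint URLs below are placeholders, not a real service:</p>

```yaml
# Minimal sketch: endpoint URLs are placeholders.
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    queue_config:
      max_samples_per_send: 2000   # tune batching for your backend
remote_read:
  - url: "https://remote-storage.example.com/api/v1/read"
    read_recent: false             # serve recent data from the local TSDB
```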



<p>Whether you are a cloud provider or building an in-house Observability stack, it is not always appropriate, nor even possible, to connect to your customers&#8217; infrastructure to extract data.</p>



<p>With a remote write approach, customers can keep strict control over what comes in and out of their infrastructure. We could argue that iptables coupled with authentication is secure enough, but this is still one more door to keep an eye on. With tight security taken into account, we understand that remote write makes a lot of sense from a service provider&#8217;s point of view.</p>



<p>Now that we know that we want a remote-write-compatible storage, we must take into account that not all remote storages are equal. The list of solutions keeps growing every day; let&#8217;s see if we can differentiate them.</p>



<p>When we write metrics to a remote storage, it is because we want to read them back later. Most Observability use cases imply writing down tons of data that will be queried afterwards. PromQL is the query language used to query Prometheus, and therefore the associated remote storages. It would make sense to check how PromQL-compliant the solutions are. Fear not, the Prometheus community is already tackling this question for us with <a href="https://promlabs.com/promql-compliance-tests/" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">PromQL Compliance</a>.</p>
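<p>To give an idea of what such compliance suites exercise, here is a typical PromQL query mixing range selectors, <code>rate()</code> and aggregation; the metric is a standard node_exporter one, used purely as an example:</p>

```promql
# Per-instance CPU usage over the last 5 minutes, excluding idle time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

A backend must evaluate expressions like this with the same semantics and accuracy as Prometheus itself to be considered compliant.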



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="621" src="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png" alt="" class="wp-image-24580" srcset="https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1024x621.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-300x182.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-768x466.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-1536x932.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2023/02/image-2048x1243.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">PromQL Compliance results as of 2021-10-14</figcaption></figure>



<p>As you can see, most remote storages are 100% compliant with Prometheus&#8217; results. Good news: this means users have a plethora of solutions to choose from.</p>



<p>However, readers must not underestimate this point. Indeed, compliance impacts what you can query from the backend, how you can query it and the accuracy of the results. It might not be trivial to reach full compliance and to stay compliant. Maintainers might also choose not to be compliant, and explain <a href="https://medium.com/@romanhavronenko/victoriametrics-promql-compliance-d4318203f51e" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">why</a>.</p>



<p>The Prometheus world is growing in adoption and is under active development. If a solution is compliant today, there is no guarantee it&#8217;ll stay compliant tomorrow.</p>



<p>Which brings us to the second point: the community. How healthy, large and active is the community behind each piece of software?<br>Is it easy to contact them? To discuss issues? To propose features and PRs? We tend to take for granted the fact that PRs will be reviewed, and that we&#8217;ll find someone to help us troubleshoot a bug, but this is not necessarily the case.</p>



<h3 class="wp-block-heading">Feature set</h3>



<p>To better address the technical challenges that are your own, you must pick the solution that has the features you need. If you need multi-tenancy, check that point. If you need to downsample your data, add this to your checklist. Don&#8217;t be shy; dig a little deeper. Test the feature, look for its limitations. Tests are the only way to make an informed decision.</p>



<p>To give you an idea you might want to have a look at the following features:   </p>



<ul class="wp-block-list">
<li>multi tenancy</li>



<li>rate limiting</li>



<li>deduplication</li>



<li>deletion</li>



<li>downsampling</li>
</ul>



<h3 class="wp-block-heading">Scalability</h3>



<p>Nowadays, the word scalability is present almost everywhere. How well does each remote storage scale? Can you write 2M samples/sec? Can you answer 1M queries/sec? Can you have 200M active series in total? <a href="https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">1B active series</a>? Per tenant?</p>



<p>You can get a rough understanding of the bottlenecks by looking at the architecture diagram. But to have a crystal-clear answer there is only one way: you need to make a proof of concept.</p>



<p>By the way, you can even try a remote storage right now on our <a href="https://www.ovhcloud.com/en/public-cloud/kubernetes" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">managed k8s</a>. Most of the open source remote storages offer Helm charts or operators to do so: <a href="https://github.com/VictoriaMetrics/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">VictoriaMetrics</a>, <a href="https://github.com/timescale/helm-charts" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Timescale</a>, <a href="https://github.com/grafana/mimir/tree/main/operations/helm/charts/mimir-distributed" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">Mimir</a>.</p>



<h3 class="wp-block-heading">Cost</h3>



<p>Along with scalability comes <em>TCO</em>, which stands for <em>Total Cost of Ownership</em>. This boils down to how expensive a solution and its infrastructure can be when you take all costs into account. For remote storage, on top of the team operating the infrastructure, we must take into account the infrastructure itself. Any technical solution relies on four categories of cost: trained engineers, compute resources, network and&#8230; storage. It is critical to take into account all aspects of the target solution. Otherwise, be ready for a surprise at the end of the month.</p>



<h2 class="wp-block-heading">Conclusion</h2>



<p>As we have shown, we have a lot of technical solutions to address long-term storage. However, before putting one solution in production, we need to thoroughly identify and assess all the trade-offs. In the next posts, we will have a look at how to get to know your remote storage, bench it, and break it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwelcome-to-prometheus-world-of-remote-storage%2F&amp;action_name=Welcome%20to%20Prometheus%20world%20of%20remote%20storage&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OVHcloud Web Statistics: A new statistics interface for your OVHcloud hosted website</title>
		<link>https://blog.ovhcloud.com/ovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website/</link>
		
		<dc:creator><![CDATA[Matias Hastaran]]></dc:creator>
		<pubDate>Fri, 29 May 2020 12:43:41 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Web Hosting]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18344</guid>

					<description><![CDATA[If you have ever managed or edited a website,&#160;you will likely have experience tracking page views and hit statistics. If this is the case, then this article is for you! Get ready to step into 2020 with a brand new interface! A bit of history There are multiple solutions on the market designed to help [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website%2F&amp;action_name=OVHcloud%20Web%20Statistics%3A%20A%20new%20statistics%20interface%20for%20your%20OVHcloud%20hosted%20website&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>If you have ever managed or edited a website,&nbsp;you will likely have experience tracking page views and hit statistics. </p>



<p>If this is the case, then this article is for you! Get ready to step into 2020 with a brand new interface!</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="538" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-1024x538.jpeg" alt="OVHcloud Web Statistics: A new statistics interface for your OVHcloud hosted website" class="wp-image-18375" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-1024x538.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/E4EAD232-019F-4573-A9AD-64F7DB48B2C6.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Abitofhistory">A bit of history</h2>



<p>There are multiple solutions on the market designed to help you analyse visits to your website.</p>



<p>Two methods exist to help you gather this information:</p>



<ul class="wp-block-list"><li>Embed some code on your website to track your visits and send those results to a third party to render those results</li><li>Keep control of your data: analyse your own raw logs to compute the relevant metrics</li></ul>



<p>In 2004, in order to keep your data safe, we decided to use an on-premise solution called Urchin&#8230; but it&#8217;s time for a change!</p>



<p>Why?</p>



<ul class="wp-block-list"><li>Urchin was bought&nbsp;by Google and&nbsp;the software has been discontinued. It has therefore not evolved since 2012.</li><li>Urchin is Flash Player based. Flash Player is discontinued and will be stopped by Adobe in 2020. There will be no more support for it.</li><li>It doesn&#8217;t offer the best possible experience.</li><li>Urchin doesn&#8217;t allow users to visualize subdomain statistics (for example: app.mydomain.com).</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52.jpeg" alt="" class="wp-image-18378" width="302" height="133" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52.jpeg 603w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/BA17AD2E-CFD1-47C4-A98C-A2A8C23D3E52-300x132.jpeg 300w" sizes="auto, (max-width: 302px) 100vw, 302px" /></figure></div>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Howdoweprovideyourwebsitestats?">How do we provide your website stats?</h2>



<p>Every day, we compute statistics for several million websites. This is a specific requirement, and few solutions exist to fill it.</p>



<p>What are those needs?</p>



<ul class="wp-block-list"><li>Being able to compute the statistics of all the websites as fast as possible</li><li>Aggregating data to show anonymized results</li><li>Not embedding code or trackers on your site</li><li>Having an easy-to-use interface aligned with today&#8217;s standards</li><li>Giving you the ability to see your statistics at the subdomain level</li><li>Migrating your previous statistics from Urchin so as not to lose them</li></ul>



<p>For a long time, we tried to avoid the &#8220;not invented here&#8221; effect, because rebuilding a statistics tool is not our main job. So we tried a lot of solutions on the market, open source or not, free or licensed. And we did not find a solution able to scale to our quantity of logs and compute statistics for all the websites we host!</p>



<p>So, we decided to&nbsp;develop an&nbsp;alternative solution, and proposed it by default for everyone: OVHcloud Web Statistics (or OWStats).</p>



<h3 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-So,what'snew?">So, what&#8217;s new?</h3>



<p>A new user interface to quickly visualise the most relevant statistics.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="496" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-01-1024x496.png" alt="" class="wp-image-18369" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-1024x496.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-300x145.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-768x372.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01-1536x743.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-01.png 1818w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>You can find a few sections:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="184" height="300" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-02.png" alt="" class="wp-image-18370"/></figure></div>



<ul class="wp-block-list"><li><strong>Dashboard</strong>:&nbsp;A summary of the activities on your domain with the dashboard&nbsp;</li><li><strong>Browsers</strong>: More technical information relating to the various browsers and platforms used to visit your domain</li><li><strong>Geolocalization</strong>: Which country/region visits your domain (the data is anonymized,&nbsp;so this is only a high level overview)</li><li><strong>Requests</strong>:&nbsp;Overview of the most viewed pages</li><li><strong>Robots</strong>:&nbsp;Analysis of the bots visiting your domain</li><li><strong>Status</strong>: Status code evolution and which&nbsp;pages are raising errors and should be investigated.</li></ul>



<h3 class="wp-block-heading">It would be simpler if we showed some pictures, wouldn&#8217;t it?</h3>



<p>Well, of course it would! Here you go:</p>



<p>The dashboard:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="543" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-03-1024x543.png" alt="" class="wp-image-18371" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-1024x543.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-300x159.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-768x407.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03-1536x814.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-03.png 1571w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Geolocalization page:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="576" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-04-1024x576.png" alt="" class="wp-image-18372" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-1024x576.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-768x432.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04-1536x865.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-04.png 1588w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>And the status pages:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="476" src="https://www.ovh.com/blog/wp-content/uploads/2020/05/20200529-05-1024x476.png" alt="" class="wp-image-18373" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-1024x476.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-300x139.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-768x357.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05-1536x714.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/05/20200529-05.png 1597w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Somenumbers">Some numbers</h2>



<p>Thanks to this new tool, we can&nbsp;<strong>compute your statistics up to 8x faster.</strong></p>



<p>We also process&nbsp;<strong>2.5 TB of logs per day!</strong></p>



<h2 class="wp-block-heading" id="id-1stblogpost:OVHcloudWebStatistics:AnewstatisticsinterfaceforyourOVHcloudhostedwebsite-Wanttoseemore?">Want more information?</h2>



<p>This post is an early preview of our upcoming OVHcloud Web Statistics service. We will come back to you with more posts about the technical details as the release date approaches!</p>


<div class="sabox-plus-item"></div><img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fovhcloud-web-statistics-a-new-statistics-interface-for-your-ovhcloud-hosted-website%2F&amp;action_name=OVHcloud%20Web%20Statistics%3A%20A%20new%20statistics%20interface%20for%20your%20OVHcloud%20hosted%20website&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Jerem: An Agile Bot</title>
		<link>https://blog.ovhcloud.com/jerem-an-agile-bot/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 21 Feb 2020 16:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Agile Telemetry]]></category>
		<category><![CDATA[Agility]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Time series]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16943</guid>

					<description><![CDATA[At OVHCloud, we are open sourcing our “Agility Telemetry” project. Jerem, as our data collector, is the main component of this project. Jerem scrapes our JIRA at regular intervals, and extracts specific metrics for each project. It then forwards them to our long-time storage application, the OVHCloud Metrics Data Platform.&#160;&#160; Agility concepts from a developer&#8217;s [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fjerem-an-agile-bot%2F&amp;action_name=Jerem%3A%20An%20Agile%20Bot&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>At OVHCloud, we are open sourcing our “Agility Telemetry” project. <strong><a href="https://github.com/ovh/jerem" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Jerem</a></strong>, as our data collector, is the main component of this project. Jerem scrapes our <strong>JIRA</strong> at regular intervals, and extracts <strong>specific metrics</strong> for each project. It then forwards them to our long-term storage application, the <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>.&nbsp;&nbsp;</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="537" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg" alt="Jerem: an agile bot" class="wp-image-17160" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/1FA9BFB1-689F-4D25-A0EC-A65B99909343.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">Agility concepts from a developer&#8217;s point of view</h3>



<p>To help you understand our goals for <strong>Jerem</strong>, we need to explain some Agility concepts we will be using. First, we will establish a <strong>technical quarterly roadmap</strong> for a product, which sets out all <strong>features</strong> we <strong>plan to release</strong> every three months. This is what we call an <strong>epic</strong>.&nbsp;</p>



<p>For each epic, we identify the tasks that will need to be completed. We then evaluate the complexity of each of those tasks using <strong>story points</strong>, during a team preparation session. A story point reflects the effort required to complete the specific JIRA task. </p>



<p>Then, to advance our roadmap, we will conduct regular <strong>sprints</strong>, each corresponding to a period of <strong>ten days</strong>, during which the team will onboard several tasks. The number of story points taken into a sprint should match, or be close to, the <strong>team velocity</strong>; in other words, the average number of story points that the team is able to complete each day.</p>
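<p>To make the arithmetic concrete, here is a minimal sketch of how velocity and sprint capacity relate (the numbers are made up for illustration; this is not part of Jerem):</p>

```python
# Hypothetical sketch: deriving team velocity from past sprints.
# Velocity here is the average number of story points completed per day,
# matching the definition used in this post.
completed_points = [20, 18, 22]   # story points completed in the last three sprints
sprint_length_days = 10

velocity_per_day = sum(completed_points) / (len(completed_points) * sprint_length_days)

# Capacity of the next ten-day sprint: the number of story points the
# team should onboard to stay close to its velocity.
next_sprint_capacity = velocity_per_day * sprint_length_days

print(velocity_per_day)       # 2.0
print(next_sprint_capacity)   # 20.0
```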



<p>However, other urgent tasks may arise unexpectedly during sprints. That’s what we call an <strong>impediment</strong>. We might, for example, need to factor in helping customers, bug fixes, or urgent infrastructure tasks.&nbsp;&nbsp;&nbsp;</p>



<h3 class="wp-block-heading">How Jerem works </h3>



<p>At OVH we use JIRA to track our activity. Our <strong>Jerem</strong> bot scrapes our <strong>projects</strong> <strong>from</strong> <strong>JIRA</strong> and exports all the necessary data to our <strong>OVHCloud <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude">Metrics Data Platform</a></strong>. Jerem can also push data to any Warp 10-compatible database. In Grafana, you simply query the Metrics platform (using the <a href="https://github.com/ovh/ovh-warp10-datasource" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10 datasource</a>), for example with our <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">program management dashboard</a>. All your KPIs are now available in a nice dashboard!</p>
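<p>As a rough illustration of this pipeline, here is a simplified Python sketch of the scrape-and-push loop (Jerem itself is written in Go; the helper names and the epic record are assumptions for the example, only the Warp 10 plain-text ingestion format <code>TIMESTAMP// class{labels} value</code> is real):</p>

```python
import time

# Illustrative sketch, not Jerem's actual code: turn scraped JIRA data
# into Warp 10 ingestion-format lines, ready to POST to /api/v0/update.

def to_warp10_line(ts_us, classname, labels, value):
    """Format one datapoint as 'TIMESTAMP// class{label=value,...} value'."""
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{ts_us}// {classname}{{{label_str}}} {value}"

def collect_epic_metrics(project, epics):
    """Emit one storypoint gauge per scraped epic (epic records are made up)."""
    now_us = int(time.time() * 1_000_000)  # Warp 10 timestamps are in microseconds
    return [
        to_warp10_line(now_us, "jerem.jira.epic.storypoint",
                       {"project": project, "epic": e["key"]}, e["storypoints"])
        for e in epics
    ]

lines = collect_epic_metrics("SAN", [{"key": "TELE-1", "storypoints": 14}])
print(lines[0])
```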



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="256" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg" alt="" class="wp-image-17164" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1024x256.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-300x75.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-768x192.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45-1536x384.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/DD1C99D4-E0B6-4AEC-9BDF-9ACA09CB1D45.jpeg 1720w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Discover Jerem metrics</h3>



<p>Now that we have an overview of the main Agility concepts involved, let&#8217;s dive into Jerem! How do we convert those Agility concepts into metrics? First of all, we&#8217;ll retrieve all metrics related to epics (i.e. new features). Then, we will have a deep look at the sprint metrics.</p>



<h4 class="wp-block-heading">Epic data</h4>



<p>To explain Jerem epic metrics, we&#8217;ll start by creating a new one. In this example, we called it <code>Agile Telemetry</code>. We add a Q2-20 label, which means that we plan to release it for Q2. To record an epic with Jerem, you need to set a quarter for the final delivery! Next, we&#8217;ll simply add four tasks, as shown below:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="651" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Epic-1-1024x651.png" alt="" class="wp-image-16984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-1024x651.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-300x191.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1-768x489.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Epic-1.png 1182w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To get the metrics, we need to evaluate each individual task, which we do together during preparation sessions. In this example, we set story points on each task; for instance, we estimated the <code>write a BlogPost about Jerem</code> task as being a 3.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="472" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png" alt="" class="wp-image-16957" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1024x472.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10-1536x709.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-11.32.10.png 1697w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As a result, Jerem now has everything it needs to start collecting epic metrics. This example provides five metrics:</p>



<ul class="wp-block-list"><li><code>jerem.jira.epic.storypoint</code>: the total number of story points needed to complete this epic. The value here is 14 (the sum of all the epic story points). This metric will evolve whenever the epic is updated by adding or removing tasks.&nbsp;</li><li><code>jerem.jira.epic.storypoint.done</code>: the number of story points completed. In our example, we have already completed <code>Write Jerem bot</code> and <code>Deploy Jerem Bot</code>, so eight story points are done.</li><li><code>jerem.jira.epic.storypoint.inprogress</code>: the number of story points for &#8216;in progress&#8217; tasks, such as <code>Write a BlogPost about Jerem</code>.</li><li><code>jerem.jira.epic.unestimated</code>: the number of unestimated tasks, shown as <code>Unestimated Task</code> in our example.</li><li><code>jerem.jira.epic.dependency</code>: the number of tasks that carry dependency labels, indicating that they are mandatory for other epics or projects.</li></ul>
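<p>The mapping from JIRA tasks to these five gauges can be sketched as follows (an illustrative Python toy loosely mirroring the example above, not Jerem&#8217;s actual Go code; a fifth task is added purely to illustrate the dependency gauge):</p>

```python
# Illustrative sketch: deriving the five epic gauges from a list of JIRA
# tasks. Status values, fields and the extra "dependency" task are
# assumptions for the example.
tasks = [
    {"summary": "Write Jerem bot",              "points": 5,    "status": "Done",        "labels": []},
    {"summary": "Deploy Jerem Bot",             "points": 3,    "status": "Done",        "labels": []},
    {"summary": "Write a BlogPost about Jerem", "points": 3,    "status": "In Progress", "labels": []},
    {"summary": "Unestimated Task",             "points": None, "status": "To Do",       "labels": []},
    {"summary": "Hypothetical dependency task", "points": 3,    "status": "To Do",       "labels": ["dependency"]},
]

metrics = {
    "jerem.jira.epic.storypoint":            sum(t["points"] or 0 for t in tasks),
    "jerem.jira.epic.storypoint.done":       sum(t["points"] or 0 for t in tasks if t["status"] == "Done"),
    "jerem.jira.epic.storypoint.inprogress": sum(t["points"] or 0 for t in tasks if t["status"] == "In Progress"),
    "jerem.jira.epic.unestimated":           sum(1 for t in tasks if t["points"] is None),
    "jerem.jira.epic.dependency":            sum(1 for t in tasks if "dependency" in t["labels"]),
}

print(metrics["jerem.jira.epic.storypoint"])       # 14
print(metrics["jerem.jira.epic.storypoint.done"])  # 8
```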



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="443" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metris-epics-1024x443.png" alt="" class="wp-image-16958" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1024x443.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-300x130.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-768x332.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics-1536x665.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metris-epics.png 1784w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>This way, for each epic in a project, Jerem collects five unique metrics.  </p>



<h4 class="wp-block-heading">Sprint data</h4>



<p>To complete epic tasks, we work using a <strong>sprint</strong> process. When doing sprints, we want to provide a lot of <strong>insights</strong> into our <strong>achievements</strong>. That&#8217;s why Jerem collects sprint data too! </p>



<p>So let&#8217;s open a new sprint in JIRA and start working on our task. This gives us the following JIRA view:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="241" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Sprint-ui-1024x241.png" alt="" class="wp-image-16963" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1024x241.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-300x71.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-768x181.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui-1536x362.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Sprint-ui.png 1804w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Jerem collects the following metrics for each sprint:&nbsp;</p>



<ul class="wp-block-list"><li><code>jerem.jira.sprint.storypoint.total</code>: the total number of story points onboarded into a sprint.</li><li><code>jerem.jira.sprint.storypoint.inprogress</code>: the story points currently in progress within a sprint.</li><li><code>jerem.jira.sprint.storypoint.done</code>: the number of story points currently completed within a sprint.</li><li><code>jerem.jira.sprint.events</code>: the &#8216;start&#8217; and &#8216;end&#8217; dates of sprint events, recorded as Warp10 string values.</li></ul>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="480" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png" alt="" class="wp-image-16964" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1024x480.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-300x141.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-768x360.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints-1536x720.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-sprints.png 1785w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>As you can see in the Metrics view above, we record every sprint metric twice. We do this to provide a quick view of the active sprint, which is why we use the &#8216;current&#8217; label. This also enables us to query past sprints, using the real sprint name. Of course, an active sprint can also be queried using its name.</p>
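<p>This double recording can be sketched like this (an illustrative toy; Jerem&#8217;s actual Go code differs, and the function name is an assumption):</p>

```python
# Illustrative sketch: every sprint gauge is written twice, once under the
# real sprint name and once under the 'current' alias, so the active sprint
# can be queried without knowing its name, while past sprints stay
# addressable by name.
def record_sprint_gauge(classname, sprint_name, value):
    """Return the two labelled datapoints pushed for one sprint gauge."""
    return [
        (classname, {"sprint": sprint_name}, value),
        (classname, {"sprint": "current"}, value),
    ]

points = record_sprint_gauge("jerem.jira.sprint.storypoint.total", "SAN Sprint 1", 14)
for name, labels, value in points:
    print(name, labels["sprint"], value)
```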



<h4 class="wp-block-heading">Impediment data</h4>



<p>Starting a sprint means you need to know all the tasks you will have to work on over the next few days. But how can we track and measure unplanned tasks? For example, a very urgent one for your manager, or a teammate who needs a bit of help?</p>



<p>We can add special tickets in JIRA to keep track of those tasks. That&#8217;s what we call an &#8216;impediment&#8217;. They are labelled according to their nature. If, for example, production requires your attention, then it&#8217;s an &#8216;Infra&#8217; impediment. You will also retrieve metrics for the &#8216;Total&#8217; (all kinds of impediments), &#8216;Excess&#8217; (the unplanned tasks), &#8216;Support&#8217; (helping teammates), and &#8216;Bug fixes or other&#8217; (for all other kinds of impediment).</p>



<p>Each impediment belongs to the active sprint it was closed in. To close an impediment, you only have to flag it as &#8216;Done&#8217; or &#8216;Closed&#8217;.</p>



<p>We also retrieve metrics like:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.TYPE.count</code>: the number of impediments that occurred during a sprint.</li><li><code>jerem.jira.impediment.TYPE.timespent</code>: the amount of time spent on impediments during a sprint.</li></ul>



<p><code>TYPE</code> corresponds to the <strong>kind</strong> of recorded impediment. As we didn&#8217;t open any actual impediments, Jerem collects only the <code>total</code> metrics.</p>
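<p>The per-type aggregation can be sketched as follows (a Python toy with made-up impediment records, not Jerem&#8217;s actual Go code):</p>

```python
from collections import defaultdict

# Illustrative sketch: aggregating closed impediments into per-TYPE count
# and timespent gauges. The impediment records below are made up.
impediments = [
    {"type": "infra",   "timespent_h": 4},
    {"type": "support", "timespent_h": 1},
    {"type": "infra",   "timespent_h": 2},
]

count = defaultdict(int)
timespent = defaultdict(int)
for imp in impediments:
    count[imp["type"]] += 1
    timespent[imp["type"]] += imp["timespent_h"]
    # every impediment also feeds the 'total' series
    count["total"] += 1
    timespent["total"] += imp["timespent_h"]

print(count["infra"], timespent["infra"])   # 2 6
print(count["total"], timespent["total"])   # 3 7
```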



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="433" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png" alt="" class="wp-image-16965" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1024x433.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-768x325.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment-1536x650.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Metrics-impediment.png 1773w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>To start recording impediments, we simply create a new JIRA task, in which we add an &#8216;impediment&#8217; label. We also set its nature, and the actual time spent on it.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="819" height="903" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png" alt="" class="wp-image-16967" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54.png 819w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-272x300.png 272w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/Screen-Shot-2020-02-06-at-14.16.54-768x847.png 768w" sizes="auto, (max-width: 819px) 100vw, 819px" /></figure>



<p>For impediments, we also retrieve a global metric that Jerem always records:</p>



<ul class="wp-block-list"><li><code>jerem.jira.impediment.total.created</code>: the time spent from the creation date to complete an impediment. This allows us to retrieve a total impediment count. We can also record all impediment actions, even outside sprints.&nbsp;</li></ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>For a single JIRA project, like our example, you can expect around 300 metrics. This number might increase depending on the epics you create and flag in JIRA, and the ones you close.</p></blockquote>



<h3 class="wp-block-heading">Grafana dashboard</h3>



<p>We love building Grafana dashboards! They provide both the team and the manager with a lot of insight into KPIs. The best part for me, as a developer, is that I can see why it&#8217;s worth filling in a JIRA task!</p>



<p>In our first <a href="https://github.com/ovh/jerem/blob/master/grafana/program_management.json" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Grafana dashboard</a>, you will retrieve all the best program management KPIs. Let&#8217;s start with the global overview:</p>



<h4 class="wp-block-heading">Quarter data overview</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="421" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png" alt="" class="wp-image-16968" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1024x421.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-768x316.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox-1536x632.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/quarter-data-sandbox.png 1840w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Here, you will find the current epics in progress. You will also find the global team KPIs, such as predictability, velocity, and impediment stats. This is where the magic happens! When filled correctly, this dashboard will show you exactly what your team should deliver in the current quarter. This means you have quick access to all important current subjects. You will also be able to see if your team is expected to deliver on too many subjects, so you can quickly take action and delay some of the new features.</p>



<h4 class="wp-block-heading">Active sprint data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/sprintdata-1024x286.png" alt="" class="wp-image-16969" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-768x214.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata-1536x428.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/sprintdata.png 1839w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The active sprint data panel is often a great support during our daily meetings. In this panel, we get a quick overview of the team&#8217;s achievements, and can establish the time spent on parallel tasks. </p>



<h4 class="wp-block-heading">Detailed data</h4>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="286" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/detail-KPI-1024x286.png" alt="" class="wp-image-16970" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1024x286.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-300x84.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-768x215.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI-1536x429.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/detail-KPI.png 1847w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The last part provides more detailed data. Using the epic Grafana variable, we can check specific epics, along with the completion of more global projects. You also have a <code>velocity chart</code>, which plots past sprints and compares the expected story points to the ones actually completed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>The Grafana dashboard is directly available in the Jerem project. You can import it directly into Grafana, provided you have a valid Warp 10 datasource configured. </p><p>To make the dashboard work as required, you have to configure the Grafana project variable in the form of a WarpScript list <code>[ 'SAN' 'OTHER-PROJECT' ]</code>. If our program manager can do it, I am sure you can! 😉 </p></blockquote>



<p>Setting up Jerem and automatically loading program management data gives us a lot of insight. As a developer, I really enjoy it, and I&#8217;ve quickly become used to tracking a lot more events in JIRA than I did before. You can directly see the impact of your tasks. For example, you see how quickly the roadmap is advancing, and you become able to identify any bottlenecks that are causing impediments. Those bottlenecks then become epics. In other words, once we started using Jerem, it practically filled itself! I hope you will enjoy it too! If you have any feedback, we would love to hear it.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fjerem-an-agile-bot%2F&amp;action_name=Jerem%3A%20An%20Agile%20Bot&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Contributing to Apache HBase: custom data balancing</title>
		<link>https://blog.ovhcloud.com/contributing-to-apache-hbase-custom-data-balancing/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Fri, 14 Feb 2020 16:37:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16524</guid>

					<description><![CDATA[In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring. The context Have you ever wondered how: we generate the graphs for your OVHcloud server or web hosting package? our internal teams monitor their [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontributing-to-apache-hbase-custom-data-balancing%2F&amp;action_name=Contributing%20to%20Apache%20HBase%3A%20custom%20data%20balancing&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg" alt="Contributing to Apache HBase: custom data balancing" class="wp-image-17086" width="1024" height="537" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">The context</h3>



<p>Have you ever wondered how:</p>



<ul class="wp-block-list"><li>we generate the graphs for your OVHcloud server or web hosting package? </li><li>our internal teams monitor their own servers and applications?</li></ul>



<p><strong>All internal teams are constantly gathering telemetry and monitoring data</strong> and sending them to a <strong>dedicated team,</strong> who are responsible for <strong>handling all the metrics and logs generated by OVHcloud&#8217;s infrastructure</strong>: the Observability team.</p>



<p>We tried a lot of different <strong>Time Series databases</strong>, and eventually chose <a href="https://warp10.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> to handle our workloads. <strong>Warp10</strong> can be integrated with the various <strong>big-data solutions</strong> provided by the <a href="https://www.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Foundation.</a> In our case, we use <a href="http://hbase.apache.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a> as the long-term storage datastore for our metrics. </p>



<p><a href="http://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, a datastore built on top of <a href="http://hadoop.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hadoop</a>, provides <strong>an elastic, distributed, key-ordered map.</strong> As such, one of the key features of Apache HBase for us is the ability to <strong>scan</strong>, i.e. retrieve a range of keys. Thanks to this feature, we can fetch <strong>thousands of datapoints in an optimised way</strong>.</p>
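<p>To see why scans matter, here is a toy illustration of the key-ordered-map property in plain Python (the key layout is purely illustrative, not Warp 10&#8217;s actual schema):</p>

```python
import bisect

# Toy illustration of the property the post relies on: a key-ordered map
# lets you fetch a whole range of keys (a "scan") in one ordered pass.
# The <class>#<timestamp> key layout below is made up for the example.
store = {
    "cpu.usage#1000": 0.42,
    "cpu.usage#2000": 0.57,
    "cpu.usage#3000": 0.61,
    "mem.usage#1000": 0.80,
}
keys = sorted(store)

def scan(start, stop):
    """Return (key, value) pairs for start <= key < stop, like an HBase scan."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(k, store[k]) for k in keys[lo:hi]]

print(scan("cpu.usage#1000", "cpu.usage#3000"))
# [('cpu.usage#1000', 0.42), ('cpu.usage#2000', 0.57)]
```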



<p>We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:</p>



<ul class="wp-block-list"><li>between 1.6 and 2 million writes per second, 24/7</li><li>between 4 and 6 million reads per second</li><li>around 300TB of telemetry, stored within Apache HBase</li></ul>



<p>As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding repartition, as <strong>every</strong> <strong>bit is hot data, and should be accessible at any time</strong>. Let&#8217;s dive in!</p>



<h3 class="wp-block-heading">How does balancing work in Apache HBase?</h3>



<p>Before diving into the balancer, let&#8217;s take a look at how it works. In Apache HBase, data is split into shards called <code>Regions</code>, which are distributed across <code>RegionServers</code>. The number of regions increases as data comes in, and regions are split as a result. This is where the <code>Balancer</code> comes in. It <strong>moves regions</strong> to avoid hotspotting a single <code>RegionServer</code> and to effectively distribute the load.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg" alt="" class="wp-image-17007" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The actual implementation, called <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StochasticBalancer</a>, uses <strong>a cost-based approach:</strong></p>



<ol class="wp-block-list"><li>It first computes the <strong>overall cost</strong> of the cluster, by looping through <code>cost functions</code>. Every cost function <strong>returns a number between 0 and 1 inclusive</strong>, where 0 is the lowest-cost, best solution, and 1 is the highest-cost, worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, and the number of regions per RegionServer&#8230; The computed costs are <strong>scaled by their respective coefficients, defined in the configuration</strong>. </li><li>Now that the initial cost is computed, we can try to <code>Mutate</code> our cluster. For this, the Balancer creates a random <code>nextAction</code>, which could be something like <strong>swapping two regions</strong>, or <strong>moving one region to another RegionServer</strong>. The action is <strong>applied virtually</strong>, and then the <strong>new cost is calculated</strong>. If the new cost is lower than the previous one, the action is stored. If not, it is skipped. This operation is repeated <code>thousands of times</code>, hence the <code>Stochastic</code>. </li><li>At the end,<strong> the list of valid actions is applied to the actual cluster. </strong></li></ol>
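<p>The three steps above can be sketched in a few lines (a toy Python illustration of the cost-based search, not HBase&#8217;s actual Java implementation; a single region-count-skew cost function stands in for HBase&#8217;s full weighted set):</p>

```python
import random

# Toy sketch of the StochasticBalancer loop: compute a normalised cost,
# apply random mutations virtually, and keep only those that lower it.

def skew_cost(counts):
    """0 when regions are evenly spread; 1 when they all sit on one server."""
    mean = sum(counts) / len(counts)
    worst = (sum(counts) - mean) ** 2 + (len(counts) - 1) * mean ** 2
    return sum((c - mean) ** 2 for c in counts) / worst

def balance(cluster, steps=10_000, seed=42):
    """Greedy stochastic descent over random single-region moves."""
    rng = random.Random(seed)
    cost = skew_cost(cluster)
    actions = []
    for _ in range(steps):
        src, dst = rng.sample(range(len(cluster)), 2)
        if cluster[src] == 0:
            continue
        cluster[src] -= 1          # apply the mutation virtually...
        cluster[dst] += 1
        new_cost = skew_cost(cluster)
        if new_cost < cost:        # ...keep it if the cluster got cheaper
            cost = new_cost
            actions.append((src, dst))
        else:                      # ...otherwise roll it back
            cluster[src] += 1
            cluster[dst] -= 1
    return cluster, actions

balanced, actions = balance([12, 3, 0, 1])
print(balanced)  # [4, 4, 4, 4]
```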



<h3 class="wp-block-heading">What was not working for us?</h3>



<p>We found out that <strong>for our specific use case</strong>, which involved:</p>



<ul class="wp-block-list"><li>Single table</li><li>Dedicated Apache HBase and Apache Hadoop, <strong>tailored for our requirements</strong></li><li>Good key distribution</li></ul>



<p>&#8230; <strong>the number of regions per RegionServer was the real limit for us</strong>.</p>



<p>Even if the balancing strategy seems simple, <strong>we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital</strong>, especially in cloud environments, because you <strong>may not be able to buy the same server specs again in the future.</strong> In our earlier example, our cluster grew from 80 to ~250 machines in four years. Throughout that time, we bought new dedicated server references, and even tested some special internal references.</p>



<p>We ended up with different groups of hardware: <strong>some servers can handle only 180 regions, whereas the biggest can handle more than 900</strong>. Because of this disparity, we had to disable the Load Balancer to avoid the <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L1194" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">RegionCountSkewCostFunction</a>, which would try to bring all RegionServers to the same number of regions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg" alt="RegionCountSkewCostFunction balancing" class="wp-image-17010" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Two years ago we developed some internal tools responsible for load balancing regions across RegionServers. The tooling worked really well for our use case, simplifying the day-to-day operation of our cluster.</p>



<p><strong>Open source is in the DNA of OVHcloud</strong>: we build our tools on open source software, but we also <strong>contribute</strong> and give back to the community. When we talked to others, we saw that we weren&#8217;t the only ones concerned by the heterogeneous cluster problem. We decided to rewrite our tooling to make it more general, and to <strong>contribute</strong> it <strong>directly upstream</strong> to the HBase project.</p>



<h3 class="wp-block-heading">Our contributions</h3>



<p>The first contribution was pretty simple, the cost function list was a <a href="https://github.com/apache/hbase/blob/8cb531f207b9f9f51ab1509655ae59701b66ac37/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L199-L213" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">constant</a>. We <a href="https://github.com/apache/hbase/commit/836f26976e1ad8b35d778c563067ed0614c026e9" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">added the possibility to load custom cost functions</a>.</p>



<p>The second contribution was about <a href="https://github.com/apache/hbase/commit/42d535a57a75b58f585b48df9af9c966e6c7e46a" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">adding an optional costFunction to balance regions according to a capacity rule</a>.</p>



<h3 class="wp-block-heading">How does it work?</h3>



<p>The balancer will load a file containing lines of rules. <strong>A rule is composed of a regexp for hostname, and a limit.</strong> For example, we could have:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>RegionServers with <strong>hostnames matching the first rule will have a limit of 200</strong>, and <strong>the others 50</strong>. If there&#8217;s no match, a default limit is applied.</p>
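<p>Rule resolution can be sketched in a few lines (illustrative Python, not the actual HBase implementation; it assumes full-string matching so that <code>rs12</code> is caught by the second rule rather than the first, and the default limit of 100 is hypothetical):</p>

```python
import re

# Capacity rules as described above: first matching rule wins.
RULES = [(re.compile(r"rs[0-9]"), 200), (re.compile(r"rs1[0-9]"), 50)]
DEFAULT_LIMIT = 100  # hypothetical default, used when no rule matches

def limit_for(hostname):
    for pattern, limit in RULES:
        if pattern.fullmatch(hostname):   # match against the whole hostname
            return limit
    return DEFAULT_LIMIT

print(limit_for("rs5"), limit_for("rs12"), limit_for("db7"))
```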



<p>Thanks to these rules, we have two key pieces of information:</p>



<ul class="wp-block-list"><li>the<strong> maximum number of regions for this cluster</strong></li><li>the<strong> rule for each server</strong></li></ul>



<p>The <code>HeterogeneousRegionCountCostFunction</code> will try to <strong>balance regions, according to their capacity.</strong></p>



<p>Let&#8217;s take an example&#8230; Imagine that we have 20 RS:</p>



<ul class="wp-block-list"><li>10 RS, named <code>rs0</code> to <code>rs9</code>, loaded with 60 regions each, which can each handle 200 regions.</li><li>10 RS, named <code>rs10</code> to <code>rs19</code>, loaded with 60 regions each, which can each handle 50 regions.</li></ul>



<p>So, based on the following rules:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>&#8230; we can see that the <strong>second group is overloaded</strong>, whereas the first group has plenty of space.</p>



<p>We know that we can handle a maximum of <strong>2,500 regions</strong> (200&#215;10 + 50&#215;10), and that we currently have <strong>1,200 regions</strong> (60&#215;20). As such, the <code>HeterogeneousRegionCountCostFunction</code> will understand that the cluster is <strong>48% full</strong> (1200/2500). Based on this information, the Balancer will then <strong>try to put all the RegionServers at ~48% of their capacity, according to the rules.</strong></p>
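<p>The arithmetic above can be checked with a short sketch that reproduces the example's numbers (not the balancer code itself):</p>

```python
# The 20-RegionServer example above: 10 big servers and 10 small ones,
# all currently holding 60 regions each.
capacity = {f"rs{i}": (200 if i < 10 else 50) for i in range(20)}
regions = {rs: 60 for rs in capacity}

total_capacity = sum(capacity.values())   # 200*10 + 50*10 = 2500
total_regions = sum(regions.values())     # 60*20 = 1200
fill = total_regions / total_capacity     # 0.48 -> cluster is 48% full

# Target: bring every RegionServer to ~48% of its own capacity.
targets = {rs: round(cap * fill) for rs, cap in capacity.items()}
print(f"{fill:.0%} -> rs0: {targets['rs0']}, rs15: {targets['rs15']}")
```

<p>A big server ends up with a target of 96 regions and a small one with 24: both sit at the same 48% relative load, and the per-server targets still add up to the 1,200 regions the cluster holds.</p>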



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg" alt="HeterogeneousRegionCountCostFunction balancing" class="wp-image-17084" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Where to next?</h3>



<p>Thanks to Apache HBase&#8217;s contributors, our patches are now <strong>merged</strong> into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This <strong>will allow more automation on our side, and ease operations for the Observability Team.</strong></p>



<p>Contributing was an awesome journey. What I love most about open source is the opportunity to contribute back, and build stronger software. We <strong>had an opinion</strong> about how a particular issue should be addressed, but <strong>the discussions with the community helped us to refine it</strong>. We spoke with <strong>engineers from other companies, who were struggling with Apache HBase&#8217;s cloud deployments, just as we were</strong>, and thanks to those exchanges, <strong>our contribution became more and more relevant.</strong></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>TSL (or how to query time series databases)</title>
		<link>https://blog.ovhcloud.com/tsl-or-how-to-query-time-series-databases/</link>
		
		<dc:creator><![CDATA[Aurélien Hébert]]></dc:creator>
		<pubDate>Fri, 31 Jan 2020 13:41:34 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Time series]]></category>
		<category><![CDATA[TSL]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16734</guid>

					<description><![CDATA[Last year, we released TSL as an open source tool to query a Warp 10 platform, and by extension, the OVHcloud Metrics Data Platform. But how has it evolved since then? Is TSL ready to query other time series databases? What about TSL states on the Warp10 eco-system? TSL to query many time series databases [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Last year, we released <a href="https://github.com/ovh/tsl/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><strong>TSL</strong></a> as an <strong>open source tool</strong> to <strong>query</strong> a<strong> Warp 10</strong> platform, and by extension, the <a href="https://www.ovh.com/fr/data-platforms/metrics/" data-wpel-link="exclude"><strong>OVHcloud Metrics Data Platform</strong></a>. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg" alt="TSL by OVHcloud" class="wp-image-16774" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-1024x540.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A-768x405.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/135A79BD-555F-4967-96DF-32F0A92E6C8A.jpeg 1202w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>But how has it evolved since then? Is TSL ready to query <strong>other time series databases</strong>? And what is the state of TSL within the <strong>Warp 10 ecosystem</strong>?</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">TSL to query many time series databases</h3>



<p>We wanted TSL to be usable in front of <strong>multiple time series databases</strong>. That&#8217;s why we also released a <strong>PromQL query generator</strong>.</p>



<p>One year later, we now know this wasn&#8217;t the way to go. Based on what we learned, the <strong><a href="https://github.com/aurrelhebert/TSL-Adaptor/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor</a> project</strong> was open sourced, as a proof of concept for how TSL can be used  to query a <em>non-Warp 10</em> database. Put simply, TSL-Adaptor allows TSL to <strong>query an InfluxDB</strong>.</p>



<h4 class="wp-block-heading">What is TSL-Adaptor?</h4>



<p>TSL-Adaptor is a <strong><a href="https://quarkus.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Quarkus</a> Java REST API</strong> that can be used to query a backend. TSL-Adaptor parses the TSL query, identifies the fetch operation, and natively loads the raw data from the backend. It then computes the TSL operations on the data, before returning the result to the user. The main goal of TSL-Adaptor is <strong>to make TSL available</strong> on top of <strong>other TSDBs</strong>.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="971" height="702" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png" alt="" class="wp-image-16866" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197.png 971w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-300x217.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/8834E834-6E98-4567-A2B0-B6FD530B2197-768x555.png 768w" sizes="auto, (max-width: 971px) 100vw, 971px" /></figure></div>



<p>In concrete terms, we are running a Java REST API that <strong>integrates the WarpScript library</strong> in its runtime. TSL is then used to compile the query into a valid WarpScript one. This is <strong>fully transparent</strong>: on the user&#8217;s side, you only deal with TSL queries. </p>



<p>To load raw data from the InfluxDB, we created a WarpScript extension. This extension integrates an abstract class <code>LOADSOURCERAW</code> that needs to be implemented to create a TSL-Adaptor data source. This requires only two methods: <code>find</code> and <code>fetch</code>. <code>Find</code> gathers all series selectors matching a query (class names, tags or labels), while <code>fetch</code> actually retrieves the raw data within a time span.</p>
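<p>A rough, self-contained illustration of that two-method contract might look as follows. The shapes here are hypothetical and chosen for clarity; the real <code>LOADSOURCERAW</code> is a Java abstract class whose exact signatures differ:</p>

```python
import fnmatch

class InMemorySource:
    """Toy data source exposing the two-method contract described above:
    `find` returns the series selectors matching a query, `fetch` returns
    the raw points of one series within a time span."""

    def __init__(self, series):
        # {(class_name, labels): [(timestamp, value), ...]}
        self.series = series

    def find(self, name_pattern):
        return [key for key in self.series
                if fnmatch.fnmatch(key[0], name_pattern)]

    def fetch(self, selector, start, end):
        return [(ts, v) for ts, v in self.series[selector]
                if start <= ts <= end]

src = InMemorySource({
    ("disk", frozenset({"mode=rw"})): [(0, 1.0), (60, 2.0), (120, 3.0)],
    ("cpu", frozenset()): [(0, 10.0)],
})
selector = src.find("disk*")[0]
print(selector[0], src.fetch(selector, 30, 120))
```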



<h4 class="wp-block-heading">Query Influx with TSL-Adaptor</h4>



<p>To get started, simply run an <a href="https://www.influxdata.com/products/influxdb-overview/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">InfluxDB</a> locally on port 8086. Then, let&#8217;s start an influx <a href="https://www.influxdata.com/time-series-platform/telegraf/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Telegraf</a> agent and record Telegraf data on the local influx instance.</p>



<p>Next, make sure you have locally installed TSL-Adaptor and updated its config with the path to a <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><code>tsl.so</code> library</a>.</p>



<p>To specify a custom influx address or databases, update the <a href="https://github.com/aurrelhebert/TSL-Adaptor/blob/master/src/main/resources/application.properties" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL-Adaptor configuration</a> file accordingly.</p>



<p>You can start TSL-Adaptor with the following example command:</p>



<pre class="wp-block-code"><code class="">java -XX:TieredStopAtLevel=1 -Xverify:none -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar build/tsl-adaptor-0.0.1-SNAPSHOT-runner.jar </code></pre>



<p>And that&#8217;s it! You can now query your influx database with TSL and TSL-Adaptor.</p>



<p>Let&#8217;s start with the retrieval of the time series relating to the disk measurements.</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk")'</code></pre>



<p>Now let&#8217;s use the TSL analytics power! </p>



<p>First, we would like to retrieve only the data containing a mode set to <code>rw</code>.&nbsp;</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw")'</code></pre>



<p>We would like to retrieve the maximum value at every five-minute interval, for the last 20 minutes. The TSL query will therefore be:</p>



<pre class="wp-block-code"><code class="">curl --request POST \
  --url http://u:p@0.0.0.0:8080/api/v0/tsl \
  --data 'select("disk").where("mode=rw").last(20m).sampleBy(5m,max)'</code></pre>



<p>Now it&#8217;s your turn to have some fun with TSL and InfluxDB. You can find details of all the implemented functions in the <a href="https://github.com/ovh/tsl/blob/master/spec/doc.md" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL documentation</a>. Enjoy exploring!</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">What&#8217;s new on TSL with Warp10?</h3>



<p>We originally built TSL as a <a href="https://github.com/ovh/tsl" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">GO proxy</a> in front of Warp 10. <strong>TSL</strong> has now been integrated into the Warp 10 ecosystem, as a <strong>Warp 10 extension</strong>, or as a <strong>WASM library</strong>! We have also added some <strong>new native TSL functions</strong> to make the language even richer!</p>



<h4 class="wp-block-heading">TSL as a WarpScript function</h4>



<p>To make TSL work as a Warp 10 function, you need to have the <code>tsl.so</code> library available locally. This library can be found in the <a href="https://github.com/ovh/tsl/releases" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL GitHub repository</a>. We have also made a <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL WarpScript extension</a> available from <a href="https://warpfleet.senx.io/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>, the extension repository of the Warp 10 community. </p>



<p>To set up the TSL extension on your Warp 10, simply download the JAR indicated in <a href="https://warpfleet.senx.io/browse/io.ovh/tslToWarpScript#main" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">WarpFleet</a>. You can then configure the extension in the Warp 10 configuration file: </p>



<pre class="wp-block-code"><code class="">warpscript.extensions = io.ovh.metrics.warp10.script.ext.tsl.TSLWarpScriptExtension
warpscript.tsl.libso.path = &lt;PATH_TO_THE_tsl.so_FILE></code></pre>



<p>Once you reboot Warp 10, you are ready to go. You can test if it&#8217;s working by running the following query:</p>



<pre class="wp-block-code"><code class="">// You will need to put here a valid Warp10 token when computing a TSL select statement
// '&lt;A_VALID_WARP_10_TOKEN>' 

// A valid TOKEN isn't needed on the create series statement in this example
// You can simply put an empty string
''

// Native TSL create series statement
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])) 
'>
TSL</code></pre>



<p>With the WarpScript TSL function, you can use native WarpScript variables in your script, as shown in the example below:</p>



<pre class="wp-block-code"><code class="">// Set a Warp10 variable

NOW 'test' STORE

'' 

// A Warp10 variable can be reused in TSL script as a native TSL variable
 &lt;'
    create(series('test').setValues("now", [-5m, 2], [0, 1])).add(test)
'>
TSL</code></pre>



<h4 class="wp-block-heading">TSL WASM</h4>



<p>To expand TSL&#8217;s potential uses, we have also exported it as a Wasm library, so you can use it directly in a browser! The Wasm version of the library parses TSL queries locally and generates the WarpScript. The result can then be used to query a Warp 10 backend. You will find more details on the <a href="https://github.com/ovh/tsl#use-tsl-with-webassembly" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TSL github</a>.</p>



<h4 class="wp-block-heading">TSL&#8217;s new features</h4>



<p>As TSL has grown in popularity, we have detected and fixed a few bugs, and also added some additional native functions to accommodate new use cases.</p>



<p>We added the <code>setLabelFromName</code> method, to set a new label on a series, based on its name. This label can be the exact series name, or the result of a regular expression. </p>



<p>We also completed the <code>sort</code> method, to allow users to sort their series set based on series metadata (i.e. selector, name or labels).</p>



<p>Finally, we added a <code>filterWithoutLabels</code> function, to filter a series set and remove any series that do not contain specific labels.</p>



<p>Thanks for reading! I hope you will give TSL a try, as I would love to hear your feedback.  </p>



<hr class="wp-block-separator"/>



<h2 class="wp-block-heading">Paris Time Series meetup</h2>



<p>We are delighted to soon be <strong>hosting</strong> the third <strong><a href="https://www.meetup.com/Paris-Time-Series-Meetup/events/266610627/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Paris Time Series meetup</a></strong>, organised by Nicolas Steinmetz, at the <strong>OVHcloud office</strong> in Paris. During this meetup, we will be speaking about TSL, as well as listening to an introduction to the Redis Time Series platform.</p>



<p>If you are available, we will be happy to meet you there!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Alerting based on IPMI data collection</title>
		<link>https://blog.ovhcloud.com/alerting-based-on-ipmi-data-collection/</link>
		
		<dc:creator><![CDATA[Morvan Le Goff]]></dc:creator>
		<pubDate>Fri, 10 May 2019 13:56:55 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Alerting]]></category>
		<category><![CDATA[Data Collection]]></category>
		<category><![CDATA[IPMI]]></category>
		<category><![CDATA[Observability]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14974</guid>

					<description><![CDATA[The problem to solve&#8230; How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them&#160;– this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">The problem to solve&#8230;</h2>



<p>How to continuously monitor the health of all OVH servers, without any impact on their performance, and no intrusion on the operating systems running on them&nbsp;– this was the issue to address. The end goal of this data collection is to allow us to detect and forecast potential hardware failure, in order to improve the quality of service delivered to our customers.</p>



<p>We began by splitting the problem into four general steps:</p>



<ul class="wp-block-list"><li>Data collection</li><li>Data storage</li><li>Data analytics</li><li>Visualisation/actions</li></ul>



<h2 class="wp-block-heading">Data collection</h2>



<p>How did we collect massive amounts of server health data, in a non-intrusive way, within short time intervals?</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg" alt="" class="wp-image-15455" width="768" height="500" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1024x667.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-300x195.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-768x500.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D-1200x782.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CBD51216-1458-45ED-B575-69229AD64E2D.jpeg 1725w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Which data to collect?</h3>



<p>On modern servers, a BMC (Baseboard Management Controller) allows us to control firmware updates, reboots, etc. This controller is independent of the system running on the server. In addition, the BMC gives us access to sensors for all the motherboard components through an I2C bus. The protocol used to communicate with the BMC is the IPMI protocol, which is accessible via LAN (RMCP).</p>



<h4 class="wp-block-heading">What is IPMI?</h4>



<ul class="wp-block-list"><li>Intelligent Platform Management Interface.</li><li>Management and monitoring capabilities independently of the host’s OS.</li><li>Led by INTEL, first published in 1998.</li><li>Supported by more than 200 computer system vendors such as Cisco, DELL, HP, Intel, SuperMicro…</li></ul>



<h4 class="wp-block-heading">Why use IPMI?</h4>



<ul class="wp-block-list"><li>Access to hardware sensors (cpu temp, memory temp, chassis status, power, etc.).</li><li>No dependency on the OS (i.e. an agentless solution)</li><li>IPMI functions accessible after OS/system failure</li><li>Restricted access to IPMI functionalities via user privileges</li></ul>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg" alt="IPMI-poller node" class="wp-image-15456" width="768" height="400" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1024x533.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-300x156.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-768x400.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869-1200x625.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/34BC9464-E831-4E9A-83E2-5CD96B6A0869.jpeg 2000w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">Multi-source data collection</h3>



<p>We needed a scalable and responsive multi-source data collection tool to grab the IPMI data of about 400k servers at fixed intervals.</p>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png" alt="Akka" class="wp-image-15467" width="200" height="71" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4.png 352w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/60E15FD2-69E4-471A-908B-8A06172973B4-300x106.png 300w" sizes="auto, (max-width: 200px) 100vw, 200px" /></figure></div>



<p>We decided to build our IPMI data collector on the <a href="https://github.com/akka/akka" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Akka</a> framework. Akka is an open-source toolkit and runtime, simplifying the construction of concurrent and distributed applications on the JVM.</p>



<p>The Akka framework defines an abstraction built on top of threads, called an &#8216;actor&#8217;. An actor is an entity that handles messages. This abstraction eases the creation of multi-threaded applications, with no need to fight against deadlocks. By selecting the dispatcher policy for a group of actors, you can fine-tune your application to be fully reactive and adaptable to the load. This way, we were able to design an efficient data collector that could adapt to the load, as we intended to grab each sensor value every minute.</p>
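<p>The actor principle can be illustrated with a minimal mailbox-per-actor sketch (plain Python, not Akka; Akka adds dispatchers, supervision and clustering on top of this idea, and names like <code>poll_sensor</code> are illustrative):</p>

```python
import queue, threading

class Actor:
    """Minimal actor: a mailbox plus a single worker thread. The handler
    never runs concurrently with itself, so no locks are needed inside it."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self._handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, message):          # asynchronous, never blocks the sender
        self.mailbox.put(message)

    def _run(self):
        while True:
            self._handler(self.mailbox.get())

results = []
done = threading.Event()

def poll_sensor(msg):
    if msg == "stop":
        done.set()
    else:
        results.append(msg * 2)       # stand-in for one IPMI sensor read

actor = Actor(poll_sensor)
for reading in (1, 2, 3):
    actor.tell(reading)
actor.tell("stop")
done.wait(timeout=2)
print(results)
```

<p>Because each actor processes its mailbox sequentially, messages are handled in order without any explicit synchronisation in the handler itself.</p>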



<p>In addition, the cluster architecture provided by the framework allowed us to handle all the servers in a datacentre with a single cluster. The cluster architecture also helped us to design a resilient system: if a node of the cluster crashes or becomes too slow, it is automatically restarted, and the servers monitored by the failing node are handled by the remaining, valid nodes of the cluster.</p>



<p>On top of the cluster architecture, we implemented a quorum feature, which takes down the whole cluster if the minimal number of started nodes is not reached. This also lets us solve the split-brain problem: if the connection between nodes is broken, the cluster is split into two entities, and the one that does not reach the quorum is automatically shut down.</p>
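<p>Assuming a strict-majority quorum (one common way to configure the minimal node count), the survival rule can be sketched as:</p>

```python
def partition_survives(partition_size, cluster_size):
    """A partition keeps running only if it holds a strict majority."""
    return partition_size >= cluster_size // 2 + 1

# A 5-node cluster split into 3 + 2: only the majority side stays up,
# so the two sides can never both act on the same servers.
print(partition_survives(3, 5), partition_survives(2, 5))
```

<p>At most one partition can hold a strict majority, which is exactly why the quorum resolves split-brain: the minority side shuts itself down.</p>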



<p>A REST API is defined to communicate with the data collector in two ways:</p>



<ul class="wp-block-list"><li>To send the configurations</li><li>To get information on the monitored servers </li></ul>



<p>A cluster node runs on one JVM, and we are able to launch one or more nodes on a dedicated server. Each dedicated server used in the cluster is placed in an OVH vRack. An IPMI gateway pool is used to access the BMC of each server, with the communication between the gateway and the IPMI data collector secured by IPsec connections.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg" alt="IPMI-poller clustering" class="wp-image-15457" width="512" height="436" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1024x872.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-300x256.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-768x654.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349-1200x1022.jpeg 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/2F3F033B-5D0D-4A3B-8F52-829087BF1349.jpeg 1491w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h2 class="wp-block-heading">Data storage</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png" alt="OVH Metrics" class="wp-image-15470" width="199" height="179" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631.png 409w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/1B79F173-0885-44F0-A356-D03D64DB7631-300x268.png 300w" sizes="auto, (max-width: 199px) 100vw, 199px" /></figure></div>



<p>Of course, we use the OVH Metrics service for data storage! Before storing the data, the IPMI data collector unifies the metrics by qualifying each sensor. The final metric name is defined by the entity the sensor belongs to and the base unit of the value. This eases the post-processing and the data visualisation/comparison.</p>



<p>Each datacentre IPMI collector pushes its data to a Metrics live cache server with a limited persistence time. All important information is persisted in the OVH Metrics server.</p>



<h2 class="wp-block-heading">Data analytics</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png" alt="Warp 10" class="wp-image-15468" width="201" height="85" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D-300x127.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/4BF68819-3158-43B2-A4DF-51123521806D.png 450w" sizes="auto, (max-width: 201px) 100vw, 201px" /></figure></div>



<p>We store our metrics in <a href="https://github.com/senx/warp10-platform" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp 10</a>. Warp 10 comes with a time series scripting language, WarpScript, which makes it easy to manipulate and post-process (on the server side) our collected data.</p>



<p>We have defined three levels of analysis to monitor the health of the servers:</p>



<ul class="wp-block-list">
<li>A simple threshold per server metric.</li>
<li>Using the OVH Metrics loops service, we aggregate data per rack and per room and compute a mean. We set a threshold on this mean, which lets us detect rack- or room-wide failures in the cooling or power supply systems.</li>
<li>The OVH MLS service performs anomaly detection on the racks and rooms by forecasting the likely evolution of each metric from its past values. If the actual value falls outside this forecast envelope, an anomaly is raised.</li>
</ul>
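<p>The first two levels can be sketched as follows (the thresholds, server names and data layout are illustrative, not the production implementation, and the MLS forecasting level is omitted):</p>

```python
from statistics import mean

TEMP_THRESHOLD_C = 75.0       # illustrative per-server threshold
RACK_MEAN_THRESHOLD_C = 40.0  # illustrative per-rack mean threshold

def server_alerts(samples: dict[str, float]) -> list[str]:
    """Level 1: flag every server whose metric crosses a fixed threshold."""
    return [srv for srv, temp in samples.items() if temp > TEMP_THRESHOLD_C]

def rack_alerts(racks: dict[str, dict[str, float]]) -> list[str]:
    """Level 2: aggregate per rack and flag racks whose mean is abnormal,
    which hints at a shared cooling or power supply issue."""
    return [rack for rack, samples in racks.items()
            if mean(samples.values()) > RACK_MEAN_THRESHOLD_C]

racks = {
    # Every server is below its own threshold, but the rack mean (41.0)
    # is high: a rack-level issue, not a single faulty server.
    "rack-a": {"srv1": 38.0, "srv2": 41.0, "srv3": 44.0},
    # One hot server (76.0) in an otherwise cool rack (mean 39.0).
    "rack-b": {"srv4": 20.0, "srv5": 76.0, "srv6": 21.0},
}
all_samples = {s: t for r in racks.values() for s, t in r.items()}
```

<p>The point of layering the two checks is visible in the example: the per-server rule catches <code>srv5</code>, while the per-rack mean catches <code>rack-a</code>, whose individual servers all look healthy in isolation.</p>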



<h2 class="wp-block-heading">Visualisation/actions</h2>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/F8551D2C-5386-4754-912B-2A0C0F278684-150x150.png" alt="TAT" class="wp-image-15472" width="100" height="100"/></figure></div>



<p>All the alerts generated by the data analysis are pushed to <a href="https://github.com/ovh/tat" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">TAT</a>, an OVH tool we use to handle the alerting flow.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="alignright is-resized"><img loading="lazy" decoding="async" src="/blog/wp-content/uploads/2019/05/47315BB4-8989-46AB-8882-25E804AFBFC1.png" alt="Grafana" class="wp-image-15473" width="150" height="124"/></figure></div>



<p>Grafana is used to monitor the metrics. We have dashboards to visualise the metrics and aggregations for each rack and room, the detected anomalies, and the evolution of open alerts.</p>



<div style="height:20px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="300" height="163" src="/blog/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png" alt="" class="wp-image-14985" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-300x163.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-768x417.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1024x556.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/03/Capture-d’écran-2019-03-05-à-10.17.16-1-1200x651.png 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></figure></div>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
