Wilfried Roset, Author at OVHcloud Blog

OVHcloud and Epitech Team Up: Building the Future Together!

Wilfried Roset — Wed, 24 Sep 2025 11:48:44 +0000

At OVHcloud, we know great ideas happen when in-school learning meets real-world experience. The future isn’t something that happens to us, it’s something we build — day by day. To do that, we need to support the students who will be tomorrow’s leaders.

That’s why we’re so excited to announce our new partnership with Epitech 🥳, a school renowned for helping talented students learn.

From Classroom to Career

This new partnership is our promise to help students succeed, giving them the tools and experience they need to build great things. Here’s how we’ll be helping the Epitech community:

Fun Hackathons: We’ll host hackathons where students can team up to solve real problems and be creative, turning good ideas into actual projects. 💡

Great Internships: Epitech students can come work with our teams. Our internships offer a real chance to use what they’ve learned in class and to get a solid start on their careers.

Hands-on Training: Our experts will run special workshops for students. They’ll learn the latest tech skills directly from the people who use them firsthand every day.

Help for Students: New ideas often start with research. We’ll give Epitech’s students special access to public cloud services on beneficial terms. We want to help them turn their ideas into reality, faster! 🔬

This Is Just the Beginning

Our partnership with Epitech is a huge first step for us, but we’re just getting started.
OVHcloud aims to team up with more schools in the future, as we believe it’s imperative for companies and schools to work together! By supporting students everywhere, we help them become the strong leaders of tomorrow.

Schools Are Our Future

We’re deeply convinced of this simple truth: the future is in the hands of today’s students.

In an ever-evolving landscape, building a sustainable and innovative future depends on our ability to develop and master our own technologies. And in today’s geopolitical context, where questions of digital independence and data control are more critical than ever, giving young talents the means to innovate locally is not just an opportunity — it’s a necessity.

After all, technical sovereignty doesn’t start in boardrooms, it starts in schools. By providing students with the skills, tools, and mindsets to design and control their own digital solutions, we are actively contributing to a stronger, more resilient ecosystem. By investing in education, we’re investing in our future, ensuring that the next generation of engineers, developers, and visionaries has the opportunity to build a better, more technologically advanced world. In this world of tomorrow, Europe and its partners remain free, innovative, and sovereign.

To everyone at Epitech, we can’t wait to start working with you. Let’s build the future together!

Picking our Prometheus’ remote storage

Wilfried Roset — Mon, 17 Apr 2023 14:43:34 +0000

If you are running an IT system, you are most likely running an Observability stack alongside it. Nowadays, the question is no longer whether you need Observability, but how you will compose your stack. At OVHcloud, we have been running a scalable time series backend for years now.

Over the last year, we had the opportunity to reassess our technical choices. Prometheus is the de facto standard, but choosing it is only the beginning of the process. Thanks to open source communities, there are a lot of possible choices.

The previous posts were about the process we followed to select our new backend. This one concludes the series and shares what we have chosen and why. In case you missed them, the series covers an introduction to Prometheus remote storage and how to bench such a solution from both the write and the read perspective, either the hard way or like a pro.

And the winner is… Grafana Mimir!

After all our experiments, we have chosen Grafana Mimir. The first reason this solution is a good fit for us is that its read/write performance is excellent, as is its scalability. The main mission of my team, core-observability, is to provide a resilient and feature-rich observability infrastructure. Many teams rely on us, and each of them has its own particularities. Multitenancy is a must-have for us, and with it we must be able to prevent side effects between tenants, the “noisy neighbour” problem. This is why rate limiting was on our checklist. Mimir provides a lot of settings, at both the cluster level and the tenant level, to make sure one tenant does not impact the others or degrade the quality of service.
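To make the “noisy neighbour” point concrete, here is a minimal sketch of per-tenant rate limiting using a token bucket. It is a generic illustration and not Mimir’s implementation: the tenant names, the limits and the `ingest` helper are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Allows `rate` samples per second, with bursts of up to `burst` samples."""
    rate: float          # tokens (samples) refilled per second
    burst: float         # bucket capacity
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the last refill, in seconds

    def allow(self, samples: int, now: float) -> bool:
        # Refill according to the time elapsed since the last call, capped at burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if samples <= self.tokens:
            self.tokens -= samples
            return True
        return False


# Hypothetical per-tenant limits: every tenant gets its own bucket.
limits = {
    "team-a": TokenBucket(rate=1000, burst=2000, tokens=2000),
    "team-b": TokenBucket(rate=1000, burst=2000, tokens=2000),
}


def ingest(tenant: str, samples: int, now: float) -> bool:
    """Accept or reject a write of `samples` samples for `tenant`."""
    return limits[tenant].allow(samples, now)
```

Because each tenant draws from its own bucket, a tenant that exhausts its budget is rejected on its own while the other tenants keep writing normally.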

Like many cloud native technologies, Mimir relies on object storage to store the time series. This decouples compute from storage, so you do not need to add computing power or bigger disks to offer the retention your users need. Data is compacted to get the smallest possible storage footprint and therefore achieve cost efficiency.

As we said earlier in this series, Prometheus is today’s de facto standard when it comes to time series. We wanted to offer our users the full experience: 100% compliance with PromQL, plus recording and alerting rules. Mimir is fully featured on this side, and it is even part of a bigger picture with more integrations, which is the icing on the cake. Let’s start with Grafana, which is of course fully compatible with Mimir and also lets you manage your recording and alerting rules directly from the UI. Then comes Loki, which is like Prometheus but for logs: it lets you query your logs just like your metrics. And finally there is Tempo, which covers the last observability pillar: distributed tracing.

On the operational side, there is no doubt that Mimir has been built with production stability and resiliency in mind. The default settings are production ready and the documentation is crystal clear, but you also get material that makes the day-to-day care of Mimir in production easier. As SREs running Mimir, you can use the maintainers’ knowledge base: ready-to-use dashboards, recording and alerting rules, and runbooks. Of course, deployments differ from one another. This is a very good opportunity to contribute back to the vibrant open source community around Grafana Labs. No matter the size of the contribution, it is always welcomed and reviewed in a timely manner. Whether you need to adjust the dashboards, add a feature or build deb/rpm packages, you can always contribute.

The definitive reason why we have chosen Mimir is the core values of its maintainers. Kudos to them. They are welcoming, easy-going and, more importantly, they take open source seriously, just like us at OVHcloud. If you want to get a glimpse of that, come by their Slack to see how fast they answer.

My team can’t wait to see all the beautiful things our users will do with this backend. One thing is sure: we’ll contribute back and make sure Mimir thrives. Let’s save that part for a future blog post.

Benchmarking Prometheus like a pro with k6

Wilfried Roset — Tue, 04 Apr 2023 12:19:05 +0000

In our previous posts about choosing a Prometheus remote storage, we have seen how to set up a benchmarking infrastructure and how to benchmark PromQL performance. We were able to obtain results, but the whole benchmark is flawed in many ways:

it’s expensive, as you need to spawn more resources than necessary to assess a particular aspect of your remote storage.
it’s hard to reproduce exactly the same setup: even with the same configuration and software versions, you will get similar results but not exactly the same ones.
you’re not always benchmarking what you think you are. We spent a fair amount of time troubleshooting performance issues that were actually in the Prometheus or HAProxy configuration.
it focuses mainly on the write path, without stress from the read path, which is not realistic.

This blog post discusses how we should have benchmarked our remote storage.

How to do a good benchmark? K6 to the rescue

A good benchmark needs to be accurate and reproducible. Moreover, for our use case we want a tool that takes into account both Prometheus’s read and write paths. Finally, we need to be able to remove all unnecessary pieces, so that we can focus on the remote storage only.

Such software could be a project on its own, but fortunately for us there is an open source solution for that: k6.

k6 is a modern, general-purpose load testing tool which can be extended with a module to support Prometheus remote storage. Sounds interesting, don’t you think?

In our previous blog post we explained how we built our benchmarking infrastructure, which had to be rather complex to be accurate.

With k6 as the benchmarking tool, the infrastructure can be greatly simplified.

k6 is quite flexible and configurable. Its input is a load testing script: you can either write your own or reuse an open source one. As the whole logic lives in the load testing script, the benchmark becomes easily reproducible, which is exactly what we need.

To launch a benchmark you need two pieces of infrastructure:

Somewhere to run k6, which could be a c2-120 instance on our public cloud
A remote storage to benchmark. For a quick start, users are one Helm install away from running one on k8s

For our use case we have chosen to reuse the load testing script from Grafana, which does exactly what we are looking for. All the information you need to tune and assess your remote storage is output by k6.

     ✓ write worked

     █ instant query high cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     █ range query

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field is 'success' to equal 'success'
       ✓ expected resultType is 'matrix' to equal 'matrix'

     █ instant query low cardinality

       ✓ expected request status to equal 200
       ✓ has valid json body
       ✓ expected status field to equal 'success'
       ✓ expected data.resultType field to equal 'vector'

     checks............................................................................: 100.00% ✓ 1454     ✗ 0
     ✓ { type:read }...................................................................: 0.00%   ✓ 0        ✗ 0
     ✓ { type:write }..................................................................: 100.00% ✓ 6        ✗ 0
     data_received.....................................................................: 1.0 MB  8.4 kB/s
     data_sent.........................................................................: 277 kB  2.3 kB/s
     group_duration....................................................................: avg=64.61ms min=39.94ms med=60.43ms max=230.05ms p(90)=80.39ms p(95)=107.93ms
     http_req_blocked..................................................................: avg=4.65ms  min=2µs     med=6µs     max=96.84ms  p(90)=11µs    p(95)=58.42ms
     http_req_connecting...............................................................: avg=1.31ms  min=0s      med=0s      max=21.87ms  p(90)=0s      p(95)=16.99ms
     http_req_duration.................................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
       { expected_response:true }......................................................: avg=53.7ms  min=34.23ms med=52.71ms max=164.1ms  p(90)=67.02ms p(95)=71.82ms
     ✓ { type:read }...................................................................: avg=53.8ms  min=34.23ms med=52.76ms max=164.1ms  p(90)=66.85ms p(95)=71.62ms
     ✓ { url:https://admin:security-matters@remote-storage.poc.ovh.net/api/v1/push }...: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_failed...................................................................: 0.00%   ✓ 0        ✗ 368
     http_req_receiving................................................................: avg=92.34µs min=32µs    med=89µs    max=301µs    p(90)=125.3µs p(95)=150µs
     http_req_sending..................................................................: avg=49.05µs min=12µs    med=40µs    max=566µs    p(90)=68µs    p(95)=94.59µs
     http_req_tls_handshaking..........................................................: avg=3.11ms  min=0s      med=0s      max=54.28ms  p(90)=0s      p(95)=39.39ms
     http_req_waiting..................................................................: avg=53.56ms min=33.94ms med=52.56ms max=163.93ms p(90)=66.88ms p(95)=71.66ms
     http_reqs.........................................................................: 368     3.064697/s
     iteration_duration................................................................: avg=64.88ms min=40.34ms med=60.78ms max=230.27ms p(90)=80.87ms p(95)=108.47ms
     iterations........................................................................: 368     3.064697/s
     vus...............................................................................: 26      min=26     max=26
     vus_max...........................................................................: 26      min=26     max=26

What a time saver! With k6 we have been able to efficiently assess all remote storage solutions. This is a significant improvement compared to our previous benchmarking plan.

The next and final post will be about which remote storage we have chosen to be our internal solution.

Stay tuned.

Prometheus’ remote storage playground

Wilfried Roset — Sun, 05 Mar 2023 23:49:35 +0000

Introduction

In the previous post we discussed how important remote storage is for Prometheus. We also covered several points of attention. In this post we cover remote write storage and how to bench it.

Context

After you have identified one (or more) remote storage solutions that might suit your needs, you must bench them. However, it is not as straightforward as it seems. Let’s review what we will need for this experiment:

A (scalable) remote storage, in our case one which supports remote write
One or more data generators

Introducing Hachimon

Benchmarking is always fun, but you know what is even more fun? Gamification! With my teammates we created a short benchmark plan which we called the Hachimon path:

Gate of Opening
- 1k targets
- 1000 series/target
- ~ 66k datapoints/sec
Gate of Healing
- 2k targets
- 1000 series/target
- ~133k datapoints/sec
Gate of Life
- 4k targets
- 1000 series/target
- ~266k datapoints/sec
Gate of Pain
- 4k targets
- 1000 series/target
- ~266k datapoints/sec after deduplication
- dual prometheus to increase pressure on deduplication features
Gate of Limit
- 4k targets
- 2500 series/target to increase pressure on storage
- ~660k datapoints/sec
- dual prometheus
Gate of View
- 8k targets
- 2500 series/target
- ~1.3M datapoints/sec
- dual prometheus
Gate of Wonder
- 10k targets
- 2500 series/target
- ~1.6M datapoints/sec
- dual prometheus
Gate of Death
- Add as many targets as you can until the backend is almost on fire
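The datapoints/sec figures in this plan can be sanity-checked with simple arithmetic: targets × series per target ÷ scrape interval. The sketch below assumes a 15-second scrape interval. That value is not stated in this post; it is inferred from the numbers, which it reproduces closely.

```python
def datapoints_per_second(targets: int, series_per_target: int,
                          scrape_interval_s: float = 15.0) -> float:
    """Samples per second produced when every target is scraped once per interval."""
    return targets * series_per_target / scrape_interval_s


# (gate, targets, series per target), taken from the plan above
gates = [
    ("Opening", 1_000, 1_000),
    ("Healing", 2_000, 1_000),
    ("Life", 4_000, 1_000),
    ("Limit", 4_000, 2_500),
    ("View", 8_000, 2_500),
    ("Wonder", 10_000, 2_500),
]

for name, targets, series in gates:
    rate = datapoints_per_second(targets, series)
    print(f"Gate of {name}: {rate:,.0f} datapoints/sec")
```

The plan rounds these values down (for example 66,667 becomes ~66k). For the gates that use dual Prometheus, the backend receives roughly twice this rate before deduplication.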

To walk the Hachimon path we’ve built an infrastructure where only the central piece, the remote storage, changes. Doing so helps us compare results.

The write path is stressed by one or more Prometheus clusters which scrape the same node_exporter many times, each time under a different set of labels. Doing so allows us to emulate an infrastructure bigger than it really is. To increase the cardinality we can tweak the node_exporter configuration to expose more or fewer series. By deploying one or more Prometheus clusters we can both stress the deduplication feature of the backend and work around the hardware limitations of a single Prometheus.
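One possible way to scrape the same exporter many times under different labels is Prometheus’ file-based service discovery, which reads a JSON list of target groups. The sketch below generates such a list. The post does not say how the labels were actually set, so the exporter address, the `fake_host` label name and the `fake_targets` helper are all hypothetical.

```python
import json


def fake_targets(exporter: str, copies: int) -> list[dict]:
    """Build file_sd entries that all point at the same exporter.

    Each entry carries a distinct label value, so Prometheus treats it as a
    separate target and stores a separate set of series for it.
    """
    return [
        {"targets": [exporter], "labels": {"fake_host": f"host-{i:05d}"}}
        for i in range(copies)
    ]


# Hypothetical node_exporter address, repeated 1000 times.
entries = fake_targets("10.0.0.12:9100", copies=1000)

# Show the first two entries; the full list would be saved to a file
# referenced from a file_sd_configs section of the scrape job.
print(json.dumps(entries[:2], indent=2))
```

With 1000 series exposed by the exporter, these 1000 entries would emulate the 1k targets of the Gate of Opening from a single small instance.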

This approach is very similar to the one used by VictoriaMetrics, which inspired us. Kudos!

By the time we reached the end of our tests, the infrastructure we had built looked like the following:

This is the infrastructure we used to bench both the read and the write paths of the remote storages. There is load balancing on both sides, and multiple pairs of Prometheus to put more or less pressure on the write path and the deduplication. Finally, the data comes from small instances exposing node_exporter metrics.

Expectation

Thanks to this benchmarking plan we have been able to differentiate the remote storages from a performance perspective. We’ve been able to get a first understanding of how each remote storage works, how to tune it, and what you can and cannot do with it. It seems to us that ease of operation is just as important as good performance. But most importantly, we learnt a lot of things while having fun.

Conclusion

This benchmarking plan is obviously flawed in many ways:

it’s expensive, as you need to spawn more resources than necessary to assess a particular aspect of your remote storage.
it’s hard to reproduce exactly the same setup: even with the same configuration and software versions, you will get similar results but not exactly the same ones.
you’re not always benchmarking what you think you are. We spent a fair amount of time troubleshooting performance issues that were actually in the Prometheus or HAProxy configuration.
it focuses mainly on the write path, without stress from the read path, which is not realistic.

The next two posts of this series continue to focus on benchmarking. The first one focuses on read performance.

The second one focuses on how we should have benchmarked our solution from the beginning.

Stay tuned

Welcome to Prometheus world of remote storage

Wilfried Roset — Thu, 16 Feb 2023 16:29:25 +0000

At OVHcloud, we recently made a change to our internal Observability stack. After testing and comparing the different solutions on the market, we opted for an open source solution. With this blog post, we’re starting a series of articles to provide feedback on our selection process and what we’ve learned along the way. Our mission was to find a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. We begin this series with an introduction to Prometheus remote storage…

Over the last decade Prometheus has become one of the standards for Observability. Its core concept is well suited to today’s technological use cases, and it makes sense that the open source community loves it. While Prometheus does a lot of things really well, when it comes to long-term storage users must find a solution. This blog post series discusses Prometheus’s remote storages and the technical challenges they aim to solve and, more importantly, how to pick the right one for you.

What is a remote storage?

Prometheus can be configured to read from or write to a remote storage on top of its local storage. This allows it to support long-term storage of users’ data. The two features are called remote_read and remote_write.

With remote_read configured, Prometheus will answer read queries with data from the remote storage. The remote_write feature is responsible for shipping samples to the remote storage. Both of them are extremely useful and highly configurable. For the rest of this blog post, let’s focus on remote write.

Whether you are a cloud provider or building in-house Observability, it is not always appropriate or possible to connect to your customers’ infrastructure to extract data.

With a remote write approach, customers keep strict control over what comes in and out of their infrastructure. We could argue that iptables coupled with authentication is secure enough, but this is still one more door to keep an eye on. With tight security taken into account, we understand that remote write makes a lot of sense from a service provider’s point of view.

Now that we know we want a remote write compatible storage, we must take into account that not all remote storages are equal. The list of solutions keeps growing every day, so let’s see if we can differentiate them.

When we write metrics to a remote storage, it is because we want to read them back later. Most Observability use cases imply writing down tons of data that will be queried afterwards. PromQL is the query language used to query Prometheus and, therefore, the associated remote storage. It makes sense to check how PromQL compliant the solutions are. Fear not, the Prometheus community is already tackling this question for us with PromQL Compliance.

PromQL Compliance results as of 2021-10-14

As you can see, most remote storages are 100% compliant with Prometheus results. Good news: this means users have a plethora of options.

However, readers must not underestimate this point. Indeed, compliance impacts what you can query from the backend, how you can query it, and the accuracy of the results. It might not be trivial to reach full compliance and to stay compliant. Maintainers might also choose not to be compliant and explain why.

The Prometheus world is growing in adoption and is under active development. If a solution is compatible today, there is no guarantee it’ll stay compatible tomorrow.

Which brings us to the second point, the community. How healthy, large and active is the community behind each piece of software?
Is it easy to contact them? Discuss issues? Propose features and PRs? We tend to take for granted that PRs will be reviewed and that we’ll find someone to help us troubleshoot a bug, but this is not necessarily the case.

Feature set

To better address your own technical challenges, you must pick the solution that has the features you need. If you need multi-tenancy, check that point. If you need to downsample your data, add this to your checklist. Don’t be shy, dig a little deeper. Test each feature and look for its limitations. Tests are the only way to make an informed decision.

To give you an idea you might want to have a look at the following features:

multi tenancy
rate limiting
deduplication
deletion
downsampling

Scalability

Nowadays the word scalability is present almost everywhere. How well does each remote storage scale? Can you write 2M samples/sec? Can you answer 1M queries/sec? Can you have 200M active series in total? 1B active series? Per tenant?

You can get a rough understanding of the bottlenecks by looking at the architecture diagram. But to get a crystal clear answer there is only one way: you need to build a proof of concept.

By the way, you can even try a remote storage right now on our managed k8s. Most of the open source remote storages offer Helm charts or operators to do so: VictoriaMetrics, Timescale, Mimir.

Cost

Along with scalability comes TCO, which stands for Total Cost of Ownership. This boils down to how expensive a solution and its infrastructure can be when you take all costs into account. For remote storage, on top of the team operating the infrastructure, we must take into account the infrastructure itself. All technical solutions rely on four categories: trained engineers, compute resources, network and… storage. It is critical to take into account every aspect of the target solution. Otherwise, be ready for a surprise at the end of the month.

Conclusion

As we have demonstrated, there are a lot of technical solutions to address long-term storage. However, before putting one solution in production we need to thoroughly identify and assess all the trade-offs. In the next posts we will look at how to get to know your remote storage, bench it, and break it.

Explaining slow queries to my manager

Wilfried Roset — Fri, 13 Mar 2020 14:46:52 +0000

In our previous blog post about observability, we explained how to build a comprehensive view of how your SQL workloads behave, and the many reasons why it is important to have this view. In this blog post, we will take a closer look into the four types of SQL query, and how they can impact the end user’s experience.

Before going further, let us recapitulate what these four types of queries are:

Select is for reading data
Insert is for adding data
Update is for changing data that already exists
Delete is for deleting data

my-database=# select * from customers where nic = 'XXX';
my-database=# insert into customers values (1, 'user-firstname', 'user-lastname', '21');
my-database=# update customers set address = 'xxx xxx' where nic = 'XXX';
my-database=# delete from customers where nic = 'XXX';
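For readers who want to try the four statement types from application code, here is a minimal, self-contained sketch using Python’s built-in sqlite3 module with an in-memory database. The table layout is an assumption made for the example, and the values are passed as bound parameters rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, only here to make the example runnable.
conn.execute(
    "create table customers (id integer primary key, firstname text, "
    "lastname text, nic text, address text)"
)

# insert: add data
conn.execute(
    "insert into customers (id, firstname, lastname, nic) values (?, ?, ?, ?)",
    (1, "user-firstname", "user-lastname", "XXX"),
)
# update: change data that already exists
conn.execute("update customers set address = ? where nic = ?", ("xxx xxx", "XXX"))
# select: read data
row = conn.execute(
    "select firstname, address from customers where nic = ?", ("XXX",)
).fetchone()
# delete: remove data
conn.execute("delete from customers where nic = ?", ("XXX",))
remaining = conn.execute("select count(*) from customers").fetchone()[0]
```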

When it comes to SQL workloads, there are two different types: OLTP and OLAP.

OLTP workloads

OLTP (for OnLine Transaction Processing) workloads correspond to the “organic” use of databases: the operations behind websites, APIs, e-commerce platforms etc. While OLAP relies exclusively on reads, OLTP workloads rely on all types of queries, including select, insert, update, and delete. With OLTP, we want the queries to reply as fast as possible, generally in under a few hundred milliseconds. The reason behind this need for speed is to reduce the impact queries will have on your application’s user experience. After all, who loves websites that take forever to load? In terms of the number of queries, we usually count them in thousands per second.

my-database=# insert into cart values (...); -- create a new cart
my-database=# insert into cart_content values (...); -- add items
my-database=# update cart_content set ... where ...; -- modify your cart
my-database=# select item_name, count from cart_content where ...; -- check the content
my-database=# insert into sales values (...); -- validate the cart

OLAP workloads

OLAP (for OnLine Analytical Processing) workloads are used to extract and analyse huge volumes of data (hence the name). They are the main tool used by business intelligence software platforms to produce forecasts and reports. In terms of queries, OLAP workloads usually rely exclusively on a few select queries that are executed periodically and can take a long time (from minutes to hours) to complete.

As you can see, OLTP and OLAP workloads are very different. Comparing them is like comparing a racecar (OLTP, hopefully) to a truck (OLAP).
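To contrast with the short OLTP statements shown earlier, here is a toy sketch of what an OLAP-style query looks like: a single read-only select that goes through a whole table and aggregates it. The `sales` table and its columns are assumptions made for the example, and a real analytical workload would run over vastly more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sales table with a handful of rows.
conn.execute("create table sales (item_name text, quantity integer, unit_price real)")
conn.executemany(
    "insert into sales values (?, ?, ?)",
    [("vps", 2, 5.0), ("vps", 1, 5.0), ("domain", 3, 10.0)],
)

# One read-only query that scans the whole table and aggregates it per item.
report = conn.execute(
    "select item_name, sum(quantity) as units, "
    "sum(quantity * unit_price) as revenue "
    "from sales group by item_name order by revenue desc"
).fetchall()

for item_name, units, revenue in report:
    print(item_name, units, revenue)
```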

Classifying queries: the good, the bad, the ugly… and the slow

Now that we have a general understanding of the two types of workloads, let us focus on OLTP, as they are usually the most relevant to the customer-facing parts of your platforms. At the beginning of this post, I described the four different types of SQL queries in terms of their purpose. Now, we’ll classify them according to their behaviour: good, bad, ugly and slow. What does that mean, you ask? Let me explain (spoiler: you want your queries to fall in the “good” category)…

The good

As you would expect, good queries are the ones that run as expected and reply relatively fast. At OVHcloud, we define “fast” as a response time of less than one second for our internal databases. But one second is still a long time to wait for a response, especially when multiple queries are made to load a single web page. Generally, we aim for 10-20ms. You should draw this “fast-line” depending on your setup, resources, and intended use case.

Your backend will query the database and get an answer in, let’s say, 20ms, which will leave plenty of time to process the data and send a result. The faster your queries run, the happier your customers and boss will be.

When I want to explain to my boss why fast queries are good, it is pretty simple: fast queries mean good user experience, fast order, fast checkout, and more profit.

my-database=# select firstname, lastname from customers where id = 123;

The bad

Bad queries, on the other hand, are queries that cannot be executed by the DBMS. There can be multiple reasons for that: a bug in the code, a lack of control somewhere in the process, etc.

Let’s take an example. Your website has a form where users can create an account, in which they provide their age. In the UI, the text box lets the user type in whatever they want and passes the value as a string. But if your schema is well-designed, it should expect an integer in the “age” field. This way, if a user tries to type their age as a string in the box rather than as a number, the DBMS should return an error. The solution is simple: the UI form should check the type of data filled in the box and return an error message, such as “invalid data”, at the UI front-end, instead of waiting for the DBMS to do so. In cases like this, it would be advisable to only allow numbers.

my-database=# insert into users values (1, 'user-firstname', 'user-lastname', 'twenty years old');
ERROR:  invalid input syntax for integer: "twenty years old"

You can fix this type of “bad” query by adding more control through the chain, using the correct type in the UI, with checks in the front-end, middle-ware, back-end, and so on.
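As an illustration of such a check, here is a minimal sketch of a validation step that a backend could run on the “age” field before building any query. The function name and the sanity bounds are hypothetical; the point is simply to reject bad input with a clear message before the DBMS has to.

```python
def parse_age(raw: str) -> int:
    """Turn the text typed in the form into an integer age, or raise ValueError."""
    text = raw.strip()
    # Only plain ASCII digits are accepted: no letters, signs or decimals.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("invalid data: age must be a whole number")
    age = int(text)
    if not 0 < age < 150:  # hypothetical sanity bounds
        raise ValueError("invalid data: age out of range")
    return age


for candidate in ["21", "twenty years old"]:
    try:
        print(candidate, "->", parse_age(candidate))
    except ValueError as err:
        print(candidate, "->", err)
```

The same rule can be mirrored in the UI by using a numeric input, so that the user gets feedback even earlier.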

To my boss, I would explain that bad queries are a hindrance to customers wanting to use your service, and will lead to a loss of profit. However, because of their straightforward nature, they are usually relatively easy to debug.

The ugly

Ugly queries are more problematic. They are queries that sometimes work, sometimes don’t because of deadlocks.

Deadlock is a vast subject, but for now let’s keep it simple. A deadlock happens when multiple queries are waiting for each other to finish. Let’s look at the following example:

-- STEP #1
process 1: BEGIN; -- this will start the transaction
process 2: BEGIN;
 
-- STEP #2
process 1: UPDATE stock SET available = 10 WHERE id = 123; -- lock row 123
process 2: UPDATE stock SET available = 0 WHERE id = 456; -- lock row 456
 
-- STEP #3 The ugly part starts now
process 1: UPDATE stock SET available = 1 WHERE id = 456; -- wait for lock on row 456
process 2: UPDATE stock SET available = 9 WHERE id = 123; -- wait for lock on row 123

We can see that we have two processes trying to update the stock within a transaction. Processes #1 and #2 are each locking a different row. Process #1 is locking row 123, while process #2 is locking row 456 in step 2. In step 3, without releasing the lock on its current row, process #1 tries to acquire a lock on row 456, which is already owned by process #2, and vice versa. To complete their transaction, they are both waiting on each other. I have chosen a simple example with only two queries, but this problem can happen with tens or thousands of queries at the same time. A general rule when working with transactions is to commit them as fast as possible.
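The situation after step 3 can be modelled as a small “who waits for whom” graph: a deadlock is simply a cycle in that graph. The sketch below is a simplified illustration of that idea and not how any particular DBMS implements its detection; the `find_deadlock` helper and its data layout are hypothetical.

```python
def find_deadlock(holds: dict[int, int], waits: dict[int, int]) -> list[int]:
    """Return the processes forming a wait cycle, or [] if there is none.

    holds maps a row id to the process that locked it;
    waits maps a process to the row id it is waiting for.
    """
    for start in waits:
        path = [start]
        current = start
        while current in waits:
            owner = holds.get(waits[current])
            if owner is None:
                break        # the row is free, nobody to wait for
            if owner == start:
                return path  # we are back where we started: deadlock
            if owner in path:
                break        # a cycle that does not include `start`
            path.append(owner)
            current = owner
    return []


# State after step 3 of the example above.
holds = {123: 1, 456: 2}   # row 123 locked by process 1, row 456 by process 2
waits = {1: 456, 2: 123}   # process 1 waits for row 456, process 2 for row 123
cycle = find_deadlock(holds, waits)
print("deadlock between processes:", cycle)
```

If process #2 had committed before step 3, it would hold nothing and wait for nothing, the graph would have no cycle, and process #1 would simply proceed.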

The difficulty is that the query can be perfectly correct and work most of the time, especially in your CI/CD pipeline, where corner cases and rare events are not necessarily identified and tested. But the more your business grows, the likelier these rare events are to happen, as you increase the number of concurrent queries made. And unfortunately, the likeliest time for deadlock issues is during load peaks caused by sales, holidays, etc. In other words, exactly when you need your workflow to work perfectly.

To explain that to my boss, I would say that when a deadlock happens, there is something wrong either in the backend, in the database schema, or in the workflow logic itself. In a worst-case scenario, the problem will hit at the most inconvenient moment, so the customer will be unable to interact with your system and will not even receive an understandable error message. Deadlocks take time to understand, debug and fix. By the time you have pushed the fix to production, your customers will be spending their money elsewhere, or your support will be crumbling under tickets and calls.

process 34099 detected deadlock while waiting for ShareLock on transaction 4026716977 after 1000.042 ms
DETAIL: Process holding the lock: 34239. Wait queue: .
CONTEXT: while locking tuple (259369,24) in relation "table"
STATEMENT: SELECT col from table where ...

Your favourite DBMS will eventually break the deadlock by killing one or more of the blocked queries, but only after a given timeout. And of course, a timeout means your customers will wait for the result before ending up with an error. Yeah, this is ugly…

And the slow…

Finally, as you can probably imagine, slow queries are queries that take time to finish. They are very easy to describe, but not that easy to fix, and improving them should be an ongoing effort. Here are some common causes of slow queries:

Poorly written queries (but you don’t have this in prod, do you?)
Missing indexes
Fetching too many rows
Too much data to go through

For this one, my boss doesn’t need an explanation. Slow queries mean slower API calls, slower UI and fewer customers reaching the checkout stage.

The fix can be straightforward: rewrite your queries, find and add missing indexes, and fetch only what is needed. However, reducing the amount of data your queries have to go through is a little bit more difficult. It can be done through regular purges in your DB, archiving, partitioning, etc. But in practice, you should only keep hot and relevant data in your customer-facing databases to avoid bloat.
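To illustrate the "missing index" case, here is a small sketch using SQLite from the Python standard library, purely as a stand-in because it runs anywhere. The `orders` table, its columns and the index name are invented for the example, and the exact wording of the query plan differs from one DBMS to another.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# Fetch only the columns we need, for a single customer.
sql = "SELECT id, total FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))  # full scan of the table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))  # lookup through the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The query itself did not change; only the plan did. That is why checking the plan of your slowest queries is usually the first thing to do.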

Conclusion

Let’s wrap up and summarise:

Good queries are a sign of a healthy workload. The more you have, the better it is
Bad queries mean that something is broken somewhere
Ugly queries are waiting for the worst possible moment to slap you
Slow queries mean that you have something working, but there is room for improvement

One last tip… this is not a one-time job. You need to keep a close eye on those four query categories.

That’s it folks! You now know enough to dig into your applications and databases to continuously improve your workloads. Ultimately, your customers are impacted by all four categories, so I’m sure that you know why you only want good queries in your information systems!

Improve your SQL workload with observability

Wilfried Roset — Fri, 24 Jan 2020 12:26:19 +0000

Our team of database administrators used to be overwhelmed by our developers’ needs. Picture this: we had only 1.5 database admins for more than 400 tech people (DevOps, developers etc.), with only 24 hours in a day! This was a perfect use-case for observability.

But what is observability, apart from a buzzword? To me, observability is not only a set of tools designed to collect data (logs, metrics …) but also – and more importantly – the value that you extract from this data and how you act on it. This goes far beyond simple dashboards or other visualisation tools.

If you remember our first blog post, you will know that a single cluster can host multiple databases, potentially from different applications. Therefore, an application can have an impact on an unrelated application, by using the SQL service.

Let’s begin with some examples

In the example Situation 1, DB-A and DB-B – which are respectively accessed by APP-A and APP-B – share a cluster named SQL-1. APP-A can indirectly have an impact on APP-B.

So why not just provision one cluster per database? Well, firstly that would be quite expensive, since most databases don’t require that much performance. But that wouldn’t even solve all the issues: now imagine the situation in the example Situation 2, where APP-A accesses both DB-A and DB-B. Then database DB-B itself is shared between APP-A and APP-B.

It can be a nightmare to totally split and isolate everything. That’s why we prefer to fix the root cause: the behaviour of our applications.

Observability to improve the behaviour of the databases

Being a good database administrator and improving how applications behave with databases requires several things: a good understanding of the business use case behind the application, knowledge of the different technical trade-offs behind your application and your database parameters, and information about your database’s health.

And here comes observability: we need to have data about our databases. We have two data sources: logs and metrics. To collect logs, we use a combination of syslog and Filebeat on one side and OVHcloud Logs Data Platform on the other. For metrics, we use Telegraf and OVHcloud Metrics.

Industrialising the analysis

A database administrator will usually perform analysis on demand, but human interventions are difficult to scale. Therefore, we had to industrialise this analysis wherever possible.

Fortunately for us, we have tools at our disposal for this. The first (and easiest) thing we industrialised was the analysis of our logs. pgBadger and pt-query-digest from Percona Toolkit provided us with everything we needed.

I will not focus here on how to get the data itself as that topic is already well covered. I’d rather explain to you what we do with this freshly-collected data.

Dashboards

Obviously, and in spite of what I told you earlier, the first thing we did was to build dashboards. To do that, we used Grafana to display information stored in OVHcloud Logs and Metrics data platforms in a single dashboard. Grafana is extremely powerful for this: it gives us information from both logs and metrics in the same dashboard.

Building dashboards was the easy part. They cover system information (CPU, RAM, load, disk space), I/O (disk and network) and the DBMS itself. In the latter, we can see the number of transactions per second at different granularities, SQL errors, invalid syntax and, more interestingly, deadlocks and slow queries.

Deadlocks and slow queries

Let me explain what I mean exactly by deadlocks and slow queries:

A deadlock happens when two or more queries wait on each other to complete. The DBMS eventually kills them, leaving only one.
Slow queries are exactly what you think: queries that take more than a certain fixed duration to complete.

In our case we consider a query slow if it lasts longer than 1 second. If you are concerned with the behaviour and performance of your application, this is where you should dig.
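As a tiny illustration of that definition, here is how the 1-second rule could be applied to a list of query durations. The function name and the sample data are hypothetical; in practice the durations come from the DBMS logs.

```python
# Threshold taken from the text: a query is slow if it lasts longer than 1 second.
SLOW_QUERY_THRESHOLD_SECONDS = 1.0


def slow_queries(executions, threshold=SLOW_QUERY_THRESHOLD_SECONDS):
    """Keep only the (query, duration_in_seconds) pairs above the threshold."""
    return [(query, duration) for query, duration in executions if duration > threshold]


# Made-up sample, for illustration only.
executions = [
    ("SELECT ... FROM customers WHERE id = ?", 0.004),
    ("SELECT ... FROM orders WHERE ...", 1.0),
    ("SELECT ... FROM invoices JOIN ...", 3.2),
]
print(slow_queries(executions))
```

Only the last query is reported: the second one lasts exactly 1 second, which is not "longer than" the threshold.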

So what do we do with this information? Simple: we share it with every OVHcloud employee on our tech-related mailing lists. And then we could start thinking about how to exploit all this information.

Do it yourself as a DBA

We try to fix the issues that come to our attention through this means. We first focus on slow queries because you can fix them in several ways, either by fixing the schema or fixing the application. Because we – as DBAs – don’t have any control over the application, we do our best to fix the schemas. But this is an endless job that doesn’t scale: we have hundreds of databases for thousands of applications. Then we repeat this process for the other types of queries like deadlocks.

Open issues in the concerned teams’ backlog

After that, we tried to open issues in the backlog of the teams who were responsible for applications suffering from slow queries. Again, that didn’t scale as we can’t spend a significant amount of time opening defects in a backlog we don’t have control of.

Understand that we as DBAs can’t scale if we do it alone

We took a step back and thought about what we wanted to achieve: our goal was not to babysit developers, it was to empower them and help them do their jobs by providing them with documentation and information.

Start documenting how to use these new tools

Writing good and comprehensive documentation is hard, so that’s where we started: how to access the dashboards we created, what they mean, what patterns are negative and should be avoided and so on. While the initial effort has been done, it is a continuous effort and we strive to improve it day after day.

Advertise

After documentation came the second, more difficult step: communication. At first, we took time to discuss with each team when we noticed something was wrong, but it took us too much time and, again, it didn’t scale. But talking to the developers gave us a great idea: what if we could make them actively want to settle these issues rather than going to them one team at a time? And that is when we decided to appeal to their pride and competitive spirit.

So we decided, at the beginning of each week, to publish a report sent to all tech people at OVHcloud summarising the most impressive improvements, the biggest offenders etc… This is the template we have been using for quite some time:

Hello
You will find information which can help you identify your queries in our welcome guide: [link to documentation]
If you need assistance to dig and fix your slow queries, feel free to ask: [link to procedure to ask for help]
Tldr:
* Database 94 has successfully fixed its slow queries
* Database 59 is our #1 producer of slow queries
* Database 75 is our #1 producer of deadlocks
* Database 33 is no more in TOP5, congrats
Please find below last week sql report.
MySQL slow queries/database
PostgreSQL slow queries/database
PostgreSQL error/database
PostgreSQL deadlock/database
…
W.
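The "Tldr" lines of the template above lend themselves to automation. What follows is only a sketch of how such lines could be derived from per-database slow query counts: the function, the rules it applies and the numbers are all invented for illustration, and this is not a description of the tooling we actually run.

```python
def weekly_summary(last_week, this_week, top_n=5):
    """Build "Tldr" lines from slow query counts per database (hypothetical rules)."""

    def top(counts):
        ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        return [name for name, count in ranked[:top_n] if count > 0]

    lines = []
    # Databases that had slow queries last week and have none this week.
    for name in sorted(last_week):
        if last_week[name] > 0 and this_week.get(name, 0) == 0:
            lines.append(f"* {name} has successfully fixed its slow queries")
    # The biggest offender of the week.
    current_top = top(this_week)
    if current_top:
        lines.append(f"* {current_top[0]} is our #1 producer of slow queries")
    # Databases that left the top without being completely fixed yet.
    for name in top(last_week):
        if name not in current_top and this_week.get(name, 0) > 0:
            lines.append(f"* {name} is no more in TOP{top_n}, congrats")
    return lines


# Made-up counts, chosen to reproduce some lines of the template.
last_week = {"Database 94": 12, "Database 59": 300, "Database 33": 80,
             "Database 10": 5, "Database 11": 4, "Database 12": 3}
this_week = {"Database 94": 0, "Database 59": 280, "Database 33": 1,
             "Database 10": 5, "Database 11": 4, "Database 12": 3,
             "Database 13": 2}

summary = weekly_summary(last_week, this_week)
print("\n".join(summary))
```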

At first, we feared that this report would have a rather cold reception. But the opposite occurred: people started to anticipate this weekly mail and started to compete with each other to decrease the number of slow queries. Adding congratulations, gifs and personalised messages helped a lot.

In conclusion: it works!

As I’m writing this post, nearly one year after we started doing these reports, our SQL workload has greatly improved thanks to the hard work of our developers. There are 5 times fewer slow queries than there used to be and we have almost no deadlocks. Even better: we identify performance issues and bugs a lot faster and more easily than before. And last, but certainly not least, trust and communication have been established between developers and database administrators, allowing us to work more closely to improve the responsiveness of our applications. That’s it for today, stay tuned!

Database replication 101

Wilfried Roset — Wed, 13 Nov 2019 13:31:26 +0000

In our tour of OVHcloud’s Internal Databases Infrastructure, we showed you what our infrastructure looks like. This post will be about SQL replication, why it matters, and how we use it at OVHcloud.

Nodes are not enough: they have to work together. To do that, we rely on a feature called replication to keep data in sync across all cluster nodes. With the replication feature enabled, all changes made on the primary node are propagated and applied to the rest of the cluster, so that the same data is stored on every node. When working with replication, there are certain trade-offs you have to be aware of. As there are several types of replication, we should first have a look at how replication works.

Asynchronous replication

The first one is called asynchronous replication. With this replication type, write queries are acknowledged as soon as the primary has executed them and stored the result on disk. The application can continue its work as soon as it receives this acknowledgment. In parallel, the changes are being propagated and applied to the rest of the cluster, which means that those changes won’t be visible on the replicas until they are fully propagated and applied. This also means that if you lose the primary before changes have been propagated, not-yet-propagated data will be lost.

Semi-synchronous replication

The second one is called semi-synchronous replication. With this one, write queries are acknowledged as soon as the changes have been propagated to replicas. Propagated, but not necessarily applied: changes might not yet be visible on the replicas. But if the primary node is lost, replicas have all the data they need to apply the changes.

Synchronous replication

The last one is synchronous replication. The name is self-explanatory, but let us walk through how it works. In this mode, write queries are only acknowledged when changes have been propagated AND applied to the replicas. Obviously, this is the most secure way to replicate data: no data or progress is lost if there is a primary node failure.
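The three modes differ only in the condition that must be met before a write is acknowledged. Here is a minimal sketch of that rule; the names are mine and this is a model of the descriptions above, not the API of any particular DBMS.

```python
from enum import Enum


class Replication(Enum):
    ASYNCHRONOUS = "asynchronous"
    SEMI_SYNCHRONOUS = "semi-synchronous"
    SYNCHRONOUS = "synchronous"


def can_acknowledge(mode, stored_on_primary, propagated, applied):
    """Tell whether a write query can be acknowledged to the application.

    `propagated` and `applied` describe the state of the change on the replicas.
    """
    if not stored_on_primary:
        return False  # nothing is acknowledged before the primary has stored it
    if mode is Replication.ASYNCHRONOUS:
        return True  # the primary alone is enough
    if mode is Replication.SEMI_SYNCHRONOUS:
        return propagated  # received by the replicas, not necessarily applied
    return propagated and applied  # synchronous: received AND applied


# The change has reached the replicas but is not applied there yet.
for mode in Replication:
    print(mode.value, can_acknowledge(mode, True, True, False))
```

With a change that is propagated but not yet applied, the asynchronous and semi-synchronous modes acknowledge the write, while the synchronous mode keeps waiting.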

A real world example

But a real example is worth a thousand technical explanations: imagine you are running a website selling hand-made items. You have two customers and an item with a single piece remaining. Now imagine that the first customer is buying this last piece, and that the second customer checks availability at the exact same time that the first one is completing their purchase.

Buying an item is a write query as you need to update the stock, so you must perform it on the primary node. Displaying the webpage is a read query, so you decide to perform it on a replica node. What happens for the second customer if they display the webpage at the exact moment that the first customer receives the purchase confirmation? Do they see the item as in or out of stock?

As you will probably have guessed by now, it depends on the type of replication you have chosen for your database. With asynchronous or semi-synchronous replication, the second customer would see the item as still available. With synchronous replication, they would see the item as out of stock.

This can be confusing, because in every mode the change takes the same time to reach the replicas and be applied there. The real difference between the modes is that in synchronous mode, the first customer’s purchase takes longer to complete, since it is not confirmed until all replicas have applied the change. In other words, the difference lies not in how long the change takes to be applied, but in when the first customer sees the purchase completed.
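A small timeline model may help here. The delays below (5 ms to propagate, 3 ms to apply) are arbitrary numbers picked for the example, and the function is a simplification with a single replica.

```python
def purchase_timeline(mode, propagate_ms=5.0, apply_ms=3.0):
    """Return (acknowledged_ms, visible_on_replica_ms, seen_in_stock).

    Times are counted from the moment the primary has stored the purchase.
    `seen_in_stock` tells what the second customer sees on the replica at the
    moment the first customer receives the purchase confirmation.
    """
    # The same in every mode: the change must travel, then be applied.
    visible_on_replica_ms = propagate_ms + apply_ms
    if mode == "asynchronous":
        acknowledged_ms = 0.0
    elif mode == "semi-synchronous":
        acknowledged_ms = propagate_ms
    elif mode == "synchronous":
        acknowledged_ms = propagate_ms + apply_ms
    else:
        raise ValueError(f"unknown replication mode: {mode}")
    seen_in_stock = acknowledged_ms < visible_on_replica_ms
    return acknowledged_ms, visible_on_replica_ms, seen_in_stock


for mode in ("asynchronous", "semi-synchronous", "synchronous"):
    print(mode, purchase_timeline(mode))
```

The second value is identical for the three modes. Only the first one moves, and with it the answer to "does the second customer still see the item in stock?".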

Before going all in for synchronous replication, you must take into account a crucial point: latency. Waiting for all replicas to receive and apply the changes takes time. Latency impacts the reactivity of your website or application, and thus potentially your revenue. Indeed, multiple studies show that higher latency on purchase operations directly translates to fewer completed purchases, which you probably don’t want.

You might now get where I’m going with this: asynchronous replication operations take less time to complete and thus make your applications more reactive, but they allow undesirable consequences like items appearing in stock when they are not.

As you can see, you choose either throughput and reactivity, or security. There is no right answer, it mostly depends on your use case. Fortunately some Database Management Systems (DBMS), such as PostgreSQL, allow you to define the level of security you want for a given query. This means you can use synchronous replication when a customer makes a purchase of at least $1000 and use asynchronous replication otherwise.
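In PostgreSQL, this per-query choice is made with the `synchronous_commit` setting, which can be changed for a single transaction with `SET LOCAL`. The helper below is a hypothetical sketch that only builds the statement. The mapping is my assumption: `remote_apply` for the synchronous behaviour described above, `local` for the asynchronous one. It also assumes synchronous standbys are configured on the server, otherwise the setting has no replica to wait for.

```python
def synchronous_commit_statement(amount_usd, threshold_usd=1000):
    """Return the statement to run at the start of the purchase transaction.

    remote_apply: wait until the synchronous standbys have applied the change.
    local: wait only for the primary's own disk, replicas catch up afterwards.
    """
    level = "remote_apply" if amount_usd >= threshold_usd else "local"
    return f"SET LOCAL synchronous_commit TO {level}"


print(synchronous_commit_statement(1500))
print(synchronous_commit_statement(20))
```

Because `SET LOCAL` only lasts until the end of the current transaction, the statement has to be executed inside the transaction that performs the purchase.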

And which method does OVHcloud use?

At OVHcloud, we manage several mission-critical databases, holding everything from banking transactions and DNS parameters for all our domain names to the data behind our Public and Private Cloud Control Panels, information pushed by our APIs, and so on. We opted for asynchronous replication for all our databases. We cope with the disadvantages of asynchronous replication by reducing the latency as much as possible, making it negligible. Moreover, our developers are experienced and familiar with this trade-off, so they are able to make the best design decisions for the application they are building. So, folks, be aware of this trade-off, and think about what you really want from your application before you configure your DBMS.

OVHcloud’s internal databases infrastructure

Wilfried Roset — Wed, 30 Oct 2019 16:08:13 +0000

Today, most applications rely directly or indirectly on databases. I would even take a bet and say that a large portion of those are relational databases. At OVHcloud, we rely on several dozens of clusters hosting hundreds of databases to power thousands of applications. Most of those databases power our API, host billing information and customer details.

As part of the team responsible for this infrastructure, I can say it is a huge responsibility to maintain such a critical part of the beating heart of OVHcloud.

In this new series of blog posts we will take a closer look at OVHcloud’s internal relational databases infrastructure. This first post is about the infrastructure of the internal databases. At OVHcloud, we use 3 major DBMSs (database management systems), PostgreSQL, MariaDB and MySQL, every one of them relying on the same cluster architecture.

But first, what exactly is a cluster? A cluster is a group of nodes (physical or virtual) working together to provide a SQL service.

At OVHcloud we have an open source and “do it yourself” culture. It allows us to control our costs and, more importantly, to master the technologies we rely on.

That’s why during the last 2 years we designed, deployed, improved and ran failure-proof cluster topologies, then industrialised them. To satisfy our reliability, performance and functional requirements, we decided on a common topology for all these clusters. Let’s find out what it looks like!

Each cluster is composed of 3 nodes, each with its own role: primary, replica and backup.

The primary node handles read-write workloads, while the replica(s) only handle read-only queries. When the primary node fails, we promote a replica node to become the primary node. Because in the vast majority of cases databases handle many more read-only than read-write queries, replica nodes can be added to scale the cluster’s read-only capabilities. This is called horizontal scaling. Our last node is dedicated to backup operations. Backups are incredibly important, and we will talk a bit more about them later.
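The roles described above can be summed up in a few lines of code. This is a simplified sketch with hypothetical names: the round-robin choice between replicas and the "promote the first replica" rule are assumptions made for the example, and the backup node is deliberately left out since it does not receive queries.

```python
class Cluster:
    """Route queries according to node roles (simplified model)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0  # round-robin cursor over the replicas

    def node_for(self, read_only):
        """Read-only queries go to a replica, everything else to the primary."""
        if read_only and self.replicas:
            node = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return node
        return self.primary

    def promote(self):
        """After a primary failure, turn a replica into the new primary."""
        if not self.replicas:
            raise RuntimeError("no replica available for promotion")
        self.primary = self.replicas.pop(0)
        return self.primary


cluster = Cluster(primary="node-a", replicas=["node-b", "node-c"])
print(cluster.node_for(read_only=False))  # write query
print(cluster.node_for(read_only=True))   # read query
print(cluster.promote())                  # the primary has failed
```

Adding a replica to the list is all it takes to spread read-only queries over one more node, which is the horizontal scaling mentioned above.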

Because every node in the cluster can be promoted to primary, they need to be able to handle the same workload. Thus, they must have exactly the same resources (CPU, RAM, disk, network …). This is particularly important when we need to promote a replica because it will have to handle the same workload. In this case, having a primary and a replica that are not equally sized can be disastrous for your workload. With our clusters up and running we can start querying them. Each cluster can host one or more databases depending on several factors such as infrastructure cost and workload types (business critical or not, transactional or analytic…).

Thus, a single cluster can host from only one big database to tens of smaller ones. In this context, small and big are not only defined by the quantity of data but also by the expected frequency of queries. For this reason, we carefully tailor each cluster to provision them accordingly. When a database grows and the cluster is no longer appropriately sized, we migrate the database to a new cluster.

Aside from production we have another smaller environment that fulfils two needs. This is our development environment. We use it to test our backups and to provide our developers with a testing environment. We will get back to this matter in just a few lines.

Now let us talk about backups. As I mentioned earlier, backups are a critical part of enterprise-grade databases. To avoid having to maintain different processes for different DBMS flavors, we designed a generic backup process that we apply to all of those.

This allowed us to automate it more efficiently and abstract the complexity behind different software.

As you have probably guessed by now, backups are performed by the backup node. This node is part of the cluster and data is synchronously replicated on it, but it does not receive any query. When a backup is performed, the DBMS process is stopped and a snapshot of the filesystem is taken and sent to a storage server outside of the cluster for archival and resiliency. We use ZFS for this purpose because of its robustness and because its incremental snapshots reduce the bandwidth and storage costs associated with snapshot archival.

But the main reason for having a separate backup node is the following: the cluster is not affected in any way by the backup. Indeed, backing up a full database can have a very visible impact on production (locks, CPU and RAM consumption etc…), and we don’t want that on production nodes.

But backups are useless if they can’t be restored. Therefore, every day, we restore the last snapshot of each cluster on a separate, dedicated node. This allows us to kill two birds with one stone: in addition to making sure we are able to restore backups, this freshly restored backup gives our development teams an almost up-to-date development environment.

To summarise: our database clusters are modular but follow a common topology. Clusters can host a variable number of databases depending on their expected workloads. Each of these databases scales horizontally for read-only operations by offering different connections for read-only and read-write operations. Furthermore, backup nodes are used to take regular backups without impacting the production databases. Internally, these backups are then restored on separate nodes as a fresh development environment.

This completes our tour of OVHcloud’s Internal Database Infrastructure and you are now all set for the next post which will be about replication. Stay tuned!