Our team of database administrators used to be overwhelmed by our developers’ needs. Picture this: we had only 1.5 database administrators for more than 400 tech people (DevOps, developers, etc.) – and only 24 hours in a day! This was a perfect use case for observability.
But what is observability, apart from a buzzword? To me, observability is not only a set of tools designed to collect data (logs, metrics, etc.) but also – and more importantly – the value you extract from that data and how you act on it. This goes far beyond simple dashboards or other visualisation tools.
If you remember our first blog post, you will know that a single cluster can host multiple databases, potentially belonging to different applications. As a result, one application can have an impact on a completely unrelated one, simply because they share the same SQL service.
Let’s begin with some examples
In Situation 1, DB-A and DB-B – accessed by APP-A and APP-B respectively – share a cluster named SQL-1. APP-A can therefore indirectly have an impact on APP-B.
So why not just provision one cluster per database? Well, firstly, that would be quite expensive, since most databases don’t require that much performance. But it wouldn’t even solve all the issues: now imagine Situation 2, where APP-A accesses both DB-A and DB-B. This time the database DB-B itself is shared between APP-A and APP-B.
It can be a nightmare to totally split and isolate everything. That’s why we prefer to fix the root cause: the behaviour of our applications.
Observability to improve the behaviour of our databases
Being a good database administrator and improving how applications behave with databases requires several things: a good understanding of the business use case behind the application, knowledge of the different technical trade-offs behind your application and your databases parameters, and information about your database’s health.
And here comes observability: we need data about our databases. We have two data sources: logs and metrics. To collect logs, we use syslog and Filebeat on one side, and the OVHcloud Logs Data Platform on the other. For the metrics part, we use Telegraf and OVHcloud Metrics.
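To give an idea of what the metrics side looks like: on PostgreSQL, a collector such as Telegraf ultimately polls the statistics views exposed by the DBMS. Here is a minimal sketch of the kind of per-database counters involved (pg_stat_database is a standard PostgreSQL view):

-- The kind of per-database counters a metrics agent typically polls (PostgreSQL).
SELECT datname,
       xact_commit,   -- committed transactions (basis for transactions per second)
       xact_rollback, -- rolled-back transactions
       deadlocks,     -- deadlocks detected in this database
       blks_read,     -- blocks read from disk
       blks_hit       -- blocks served from shared buffers (cache hits)
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');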
Industrialising the analysis
A database administrator will usually perform analysis on demand, but human interventions are difficult to scale. Therefore, we had to industrialise this analysis wherever possible.
Fortunately for us, we have tools at our disposal for this. The first (and easiest) thing we industrialised was the analysis of our logs. PGBadger and pt-query-digest from the Percona Toolkit provided us with everything we needed.
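Both tools work on the logs produced by the DBMS itself, so the databases have to be configured to log slow statements in the first place. A minimal sketch of that configuration, assuming the 1-second threshold we use (described later in this post):

-- PostgreSQL: log every statement running for longer than 1 second.
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- MySQL: enable the slow query log with the same threshold;
-- this is the log pt-query-digest consumes.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;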
I will not focus here on how to get the data itself as that topic is already well covered. I’d rather explain to you what we do with this freshly-collected data.
Dashboards
Obviously – and in spite of what I told you earlier – the first thing we did was to build dashboards. To do that, we used Grafana, which is extremely powerful for this, to display the information stored in the OVHcloud Logs and Metrics data platforms in a single view, so that logs and metrics end up side by side on the same dashboard.
Building the dashboards was the easy part. They cover system information (CPU, RAM, load, disk space), I/O (disk and network) and the DBMS itself: transactions per second at different granularities, SQL errors, invalid syntax and – more interestingly – deadlocks and slow queries.
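These panels aggregate counters over time; when the slow-query panel spikes, the live counterpart on PostgreSQL is to look at what is running right now. A minimal sketch, using the standard pg_stat_activity view and the 1-second threshold described below:

-- Queries that have been running for more than 1 second right now (PostgreSQL).
SELECT pid,
       datname,
       now() - query_start AS duration,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 second'
ORDER BY duration DESC;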
Deadlocks and slow queries
Let me explain exactly what I mean by deadlocks and slow queries:
- A deadlock happens when two or more queries each wait for a lock held by the other, so none of them can complete. The DBMS eventually detects this and kills all but one of them (see the sketch below).
- Slow queries are exactly what you think: queries that take more than a certain fixed duration to complete.
In our case we consider a query slow if it lasts longer than 1 second. If you are concerned with the behaviour and performance of your application, this is where you should dig.
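To make the deadlock case concrete, here is a minimal sketch of the classic pattern: two sessions updating the same two rows in opposite order (the accounts table and its values are purely illustrative):

-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

-- Session 2 (concurrently):
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2

-- Session 1:
UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- waits for session 2

-- Session 2:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- waits for session 1

-- Neither can proceed: the DBMS detects the cycle and aborts one of the
-- transactions so that the other can complete.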
So what do we do with this information? Simple: we share it with every OVHcloud employee on our tech-related mailing lists. And then we could start thinking about how to exploit all this information.
Do it yourself as a DBA
We try to fix the issues that come to our attention this way. We first focus on slow queries, because they can be fixed in several ways: either by fixing the schema or by fixing the application. Since we – as DBAs – don’t have any control over the applications, we do our best to fix the schemas. But this is an endless job that doesn’t scale: we have hundreds of databases for thousands of applications. We then repeat the process for other kinds of issues, such as deadlocks.
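To give an idea of what “fixing the schema” looks like in practice, here is a simplified, purely illustrative PostgreSQL session (the table and index names are made up):

-- Step 1: understand why the query is slow.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- If the plan shows a sequential scan over a large table, an index on the
-- filtered column is often the schema-side fix:
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);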
Open issues in the relevant teams’ backlogs
After that, we tried to open issues in the backlogs of the teams responsible for the applications suffering from slow queries. Again, that didn’t scale, as we can’t spend a significant amount of time opening defects in backlogs we don’t control.
Understand that we as DBAs can’t scale if we do it alone
We took a step back and thought about what we wanted to achieve: our goal was not to babysit developers, it was to empower them and help them do their jobs by providing them with documentation and information.
Start documenting how to use these new tools
Writing good and comprehensive documentation is hard, so that’s where we started: how to access the dashboards we created, what they mean, which patterns are bad and should be avoided, and so on. The initial version is written, but documentation is a continuous effort and we strive to improve it day after day.
Advertise
After documentation came the second, more difficult step: communication. At first, we took the time to talk to each team whenever we noticed something was wrong, but that took too much time and – again – it didn’t scale. Talking to the developers did give us a great idea, though: what if we could make them actively want to fix these issues, rather than going to them one team at a time? That is when we decided to appeal to their pride and competitive spirit.
So we decided to publish, at the beginning of each week, a report sent to all tech people at OVHcloud summarising the most impressive improvements, the biggest offenders, and so on. This is the template we have been using for quite some time:
Hello
You will find information which can help you identify your queries in our welcome guide: [link to documentation]
If you need assistance to dig and fix your slow queries, feel free to ask: [link to procedure to ask for help]
Tldr:
* Database 94 has successfully fixed its slow queries
* Database 59 is our #1 producer of slow queries
* Database 75 is our #1 producer of deadlocks
* Database 33 is no longer in the TOP 5, congrats
Please find below last week's SQL report.
MySQL slow queries/database
PostgreSQL slow queries/database
PostgreSQL error/database
PostgreSQL deadlock/database
…W.
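Producing this summary is itself automated. As a rough sketch, and assuming the slow-query events parsed from the logs end up in a table such as slow_query_events(dbname, logged_at) – a hypothetical schema, not our actual pipeline – the weekly ranking boils down to an aggregation like this:

-- Hypothetical table: one row per slow query extracted from the logs.
-- Ranks databases by the number of slow queries over the past week.
SELECT dbname,
       count(*) AS slow_queries
FROM slow_query_events
WHERE logged_at >= now() - interval '7 days'
GROUP BY dbname
ORDER BY slow_queries DESC
LIMIT 5;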
At first, we feared that this report would get a rather cold reception. But the opposite occurred: people started to look forward to the weekly mail and to compete with each other to decrease the number of slow queries. Adding congratulations, GIFs and personalised messages helped a lot.
In conclusion: it works!
As I’m writing this post, nearly one year after we started sending these reports, our SQL workload has greatly improved thanks to the hard work of our developers. There are five times fewer slow queries than there used to be, and we have almost no deadlocks. Even better: we identify performance issues and bugs a lot faster and more easily than before. And last, but certainly not least, trust and communication have been established between developers and database administrators, allowing us to work more closely together to improve the responsiveness of our applications. That’s it for today, stay tuned!
After 10 years as a sysadmin in High Performance Computing, Wilfried Roset is now part of OVHcloud as Engineering Manager for its Databases Product Unit. He focuses on industrialisation, reliability and performance for both internal and public cluster offerings.