Explaining slow queries to my manager

In our previous blog post about observability, we explained how to build a comprehensive view of how your SQL workloads behave, and the many reasons why it is important to have this view. In this blog post, we will take a closer look at the four types of SQL queries, and how they can impact the end user’s experience.

Before going further, let us recapitulate what these four types of queries are:

  • Select is for reading data
  • Insert is for adding data
  • Update is for changing data that already exists
  • Delete is for deleting data
my-database=# select * from customers where nic = 'XXX';
my-database=# insert into customers values (1, 'user-firstname', 'user-lastname', '21');
my-database=# update customers set address = 'xxx xxx' where nic = 'XXX';
my-database=# delete from customers where nic = 'XXX';

When it comes to SQL workloads, there are two different types: OLTP and OLAP.

OLTP workloads

OLTP (for OnLine Transaction Processing) workloads correspond to the “organic” use of databases: they are the queries behind websites, APIs, e-commerce platforms and so on. While OLAP workloads (more on those below) rely exclusively on reads, OLTP workloads rely on all four types of queries: select, insert, update and delete. With OLTP, we want the queries to reply as fast as possible, generally in under a few hundred milliseconds. The reason behind this need for speed is to reduce the impact queries will have on your application’s user experience. After all, who loves websites that take forever to load? In terms of volume, OLTP queries are usually counted in thousands per second.

my-database=# insert into cart values (...); -- create a new cart
my-database=# insert into cart_content values (...); -- add items
my-database=# update cart_content set ... where ...; -- modify your cart
my-database=# select item_name, count from cart_content where ...; -- check the content
my-database=# insert into sales values (...); -- validate the cart

OLAP workloads

OLAP (for OnLine Analytical Processing) workloads are used to extract and analyse huge volumes of data (hence the name). They are the main tool used by business intelligence software platforms to produce forecasts and reports. In terms of queries, OLAP workloads usually rely exclusively on a handful of select queries that are executed periodically, and can take a long time (from minutes to hours) to complete.
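
To give you an idea, a typical OLAP query could look like the following sketch (the sales table and its sale_date and amount columns are hypothetical):

my-database=# select date_trunc('month', sale_date) as month,
my-database-#        sum(amount) as revenue
my-database-#   from sales
my-database-#  group by 1
my-database-#  order by 1; -- scans the whole sales history to build a monthly report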

As you can see, OLTP and OLAP workloads are very different. Comparing them is like comparing a racecar (OLTP, hopefully) to a truck (OLAP).

Classifying queries: the good, the bad, the ugly… and the slow

Now that we have a general understanding of the two types of workloads, let us focus on OLTP, as these workloads are usually the most relevant to the customer-facing parts of your platforms. At the beginning of this post, I described the four different types of SQL queries in terms of their purpose. Now, we’ll classify them according to their behaviour: good, bad, ugly and slow. What does that mean, you ask? Let me explain (spoiler: you want your queries to fall in the “good” category)…

The good

As you would expect, good queries are the ones that run as expected and reply relatively fast. At OVHcloud, we define “fast” as a response time under one second for our internal databases. But one second is still a long time to wait for a response, especially when multiple queries are made to load a single web page. Generally, we aim for 10-20ms. You should draw this “fast” line depending on your setup, resources and intended use case.

Your backend will query the database and get an answer in, let’s say, 20ms, which leaves plenty of time to process the data and send a result. The faster your queries run, the happier your customers and boss will be.

When I want to explain to my boss why fast queries are good, it is pretty simple: fast queries mean a good user experience, fast ordering, fast checkout and more profit.

my-database=# select firstname, lastname from customers where id = 123;
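
To see where your own queries fall relative to that line, psql can time them for you with its \timing meta-command (a quick sketch; the timing figure below is purely illustrative):

my-database=# \timing on
Timing is on.
my-database=# select firstname, lastname from customers where id = 123;
...
Time: 1.529 ms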

The bad

Bad queries, on the other hand, are queries that cannot be executed by the DBMS. There can be multiple reasons for that: a bug in the code, a missing check somewhere in the process, and so on.

Let’s take an example. Your website has a form where users can create an account, providing their age among other details. In the UI, the text box lets users type in whatever they want and passes the value along as a string. But if your schema is well designed, it should expect an integer in the “age” field, so when a user types their age as words rather than as a number, the DBMS returns an error. The solution is simple: the UI form should check the type of the data filled in the box and display an error message, such as “invalid data”, at the front-end, instead of waiting for the DBMS to do so. In cases like this, it would be advisable to only allow numbers.

my-database=# insert into customers values (1, 'user-firstname', 'user-lastname', 'twenty years old');
ERROR:  invalid input syntax for integer: "twenty years old"

You can fix this type of “bad” query by adding more control throughout the chain: using the correct input type in the UI, with checks in the front-end, middleware, back-end, and so on.
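
The database itself can act as the last line of defence. As a minimal sketch (the table definition and bounds here are hypothetical), a proper column type combined with a CHECK constraint rejects nonsense values no matter which layer lets them through:

my-database=# create table customers (
my-database(#     id        integer primary key,
my-database(#     firstname text not null,
my-database(#     lastname  text not null,
my-database(#     age       integer check (age between 0 and 150) -- '21' coerces fine, 'twenty years old' does not
my-database(# );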

To my boss, I would explain that bad queries are a hindrance to customers wanting to use your service, and will lead to a loss of profit. However, because of their straightforward nature, they are usually relatively easy to debug.

The ugly

Ugly queries are more problematic. These are queries that sometimes work and sometimes fail, because of deadlocks.

Deadlock is a vast subject, but for now let’s keep it simple. A deadlock happens when multiple queries are waiting for each other to finish. Let’s look at the following example:

-- STEP #1
process 1: BEGIN; -- this will start the transaction
process 2: BEGIN;
 
-- STEP #2
process 1: UPDATE stock SET available = 10 WHERE id = 123; -- lock row 123
process 2: UPDATE stock SET available = 0 WHERE id = 456; -- lock row 456
 
-- STEP #3 The ugly part starts now
process 1: UPDATE stock SET available = 1 WHERE id = 456; -- wait for lock on row 456
process 2: UPDATE stock SET available = 9 WHERE id = 123; -- wait for lock on row 123

We can see that we have two processes trying to update the stock within a transaction. In step 2, process #1 locks row 123 while process #2 locks row 456. In step 3, without releasing the lock on its current row, process #1 tries to acquire a lock on row 456, which is already owned by process #2, and vice versa. To complete their transactions, they are both waiting on each other. I have chosen a simple example with only two queries, but this problem can happen with tens or even thousands of queries at the same time. A general rule when working with transactions is to commit them as fast as possible.
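
There is no single cure, but a classic mitigation is to make every transaction acquire its locks in the same order, for example by ascending id. Here is a minimal sketch of our example rewritten that way: the second process may have to wait, but a circular wait can no longer occur:

process 1: BEGIN;
process 1: UPDATE stock SET available = 10 WHERE id = 123; -- always lock row 123 first
process 1: UPDATE stock SET available = 1 WHERE id = 456;  -- then row 456
process 1: COMMIT; -- release both locks as fast as possible

process 2: BEGIN;
process 2: UPDATE stock SET available = 9 WHERE id = 123; -- waits for process 1, but no cycle
process 2: UPDATE stock SET available = 0 WHERE id = 456;
process 2: COMMIT;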

The difficulty is that the query can be perfectly correct and work most of the time, especially in your CI/CD pipeline, where corner cases and rare events are not necessarily identified and tested. But the more your business grows, the likelier these rare events are to happen, as you increase the number of concurrent queries. And unfortunately, the likeliest time for deadlock issues is during load peaks caused by sales, holidays, etc. In other words, exactly when you need your workflow to work perfectly.

To explain that to my boss, I would say that when deadlocks happen, there is something wrong in the backend, the database schema, or in the workflow logic itself. In a worst-case scenario, the problem will hit at the most inconvenient moment: the customer will be unable to interact with your system, and will not even receive an understandable error message. Deadlocks take time to understand, debug and fix. By the time you have shipped the fix to production, your customers will be spending their money elsewhere, or your support will be crumbling under tickets and calls.

process 34099 detected deadlock while waiting for ShareLock on transaction 4026716977 after 1000.042 ms
DETAIL: Process holding the lock: 34239. Wait queue: .
CONTEXT: while locking tuple (259369,24) in relation "table"
STATEMENT: SELECT col from table where ...

Your favourite DBMS will eventually detect the deadlock and kill one of the queries involved, but only after a given timeout. And of course, a timeout means your customers will wait for the result before ending up with an error. Yeah, this is ugly…
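
In PostgreSQL, for instance, that delay is controlled by the deadlock_timeout setting, which defaults to one second; this is why the log above reports a detection after roughly 1000 ms:

my-database=# show deadlock_timeout;
 deadlock_timeout
------------------
 1s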

And the slow…

Finally, as you can probably imagine, slow queries are queries that take a long time to complete. They are very easy to describe, but not that easy to fix, and improving them should be an ongoing effort. Here are some common causes of slow queries:

  • Poorly written queries (but you don’t have this in prod, do you?)
  • Missing indexes
  • Fetching too many rows
  • Too much data to go through

For this one, my boss doesn’t need an explanation. Slow queries mean slower API calls, slower UI and fewer customers reaching the checkout stage.

The fix can be straightforward: rewrite your queries, find and add missing indexes, and fetch only what is needed. However, reducing the amount of data your queries have to go through is a little bit more difficult. It can be done through regular purges in your DB, archiving, partitioning, etc. But in practice, you should only keep hot and relevant data in your customer-facing databases to avoid bloat.
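
To make the missing-index case concrete, here is a sketch (the orders table is hypothetical, and the plan output is abridged and annotated): explain analyze shows what the DBMS actually does, and adding an index turns a full-table scan into a targeted read:

my-database=# explain analyze select * from orders where customer_id = 123;
 Seq Scan on orders ...   -- every row of the table is read
my-database=# create index concurrently orders_customer_id_idx on orders (customer_id);
my-database=# explain analyze select * from orders where customer_id = 123;
 Index Scan using orders_customer_id_idx on orders ...   -- only the matching rows are read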

Conclusion

Let’s wrap up and summarise:

  • Good queries are a sign of a healthy workload. The more you have, the better it is
  • Bad queries mean that something is broken somewhere
  • Ugly queries are waiting for the worst possible moment to slap you
  • Slow queries mean that you have something working, but there is room for improvement

One last tip… this is not a one-time job. You need to keep a close eye on those four query categories.
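
If you run PostgreSQL, a simple sketch of that watch-keeping is the pg_stat_statements extension (assuming it is installed and enabled; the mean_exec_time column goes by this name on recent versions):

my-database=# select query, calls, mean_exec_time
my-database-#   from pg_stat_statements
my-database-#  order by mean_exec_time desc
my-database-#  limit 10; -- your ten slowest queries on average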

That’s it folks! You now know enough to dig into your applications and databases to continuously improve your workloads. Ultimately, your customers are impacted by all four categories, so I’m sure that you know why you only want good queries in your information systems!


After 10 years as a sysadmin in High Performance Computing, Wilfried Roset is now part of OVHcloud as Engineering Manager for their Databases product unit. He focuses on industrialisation, reliability and performance for both internal and public cluster offerings.