Optimizing SQL queries with ChatGPT

Table of contents

SQL schemas, queries and indexes can be hard to get right, especially when most of your time is spent dealing with application logic, not the data storage layer. Luckily, large language models like ChatGPT are capable of figuring out ways to speed up your queries and optimize the way your code interacts with SQL without you having to spend hours reading through documentation and manuals.

A sample scenario

To get started, we need a database schema and sample query. I decided to use a really old personal project of mine that i wrote back when i understood very little about postgresql. It was an application managing image albums, with most metadata stored as JSONB and just the fields relevant to search set up as proper SQL columns. The relevant table definitions were as follows:

CREATE TABLE aio_albums(
   id INT NOT NULL PRIMARY KEY,
   search_data TEXT NOT NULL,
   meta JSONB NOT NULL,
   release_date TIMESTAMP WITH TIME ZONE NOT NULL,
   cover BYTEA NOT NULL
);
CREATE INDEX aio_albums_search_data ON aio_albums(search_data);
CREATE INDEX aio_albums_release_date ON aio_albums(release_date);

The column meta contains the actual information such as titles, artists, page count etc, which was not supposed to be filterable in the application. The main purpose of this table was to page through the entire catalog of albums in reverse order of release_date (ie latest first). A page contained 20 items, so to fetch a query would use an OFFSET value of 20*(page_number-1) and a fixed LIMIT of 20. A sample query to fetch page 3033 would thus cause an offset of 60660 items and look like this:

SELECT meta FROM aio_albums ORDER BY release_date DESC, id DESC OFFSET 60660 LIMIT 20;

This has the obvious disadvantage of still processing (but skipping) the first 60660 results of the query, making it very costly for a query to fetch just 20 items, as evident from the EXPLAIN ANALYZE output:

 Limit (cost=12011.55..12015.51 rows=20 width=632) (actual time=47.541..47.553 rows=20 loops=1)
  -> Incremental Sort (cost=0.46..13868.26 rows=70037 width=632) (actual time=0.045..46.414 rows=60680 loops=1)
        Sort Key: release_date DESC, id DESC
        Presorted Key: release_date
        Full-sort Groups: 1897 Sort Method: quicksort Average Memory: 55kB Peak Memory: 55kB
        -> Index Scan Backward using aio_albums_release_date on aio_albums (cost=0.29..10716.59 rows=70037 width=632) (actual time=0.012..16.608 rows=60681 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 47.580 ms
(8 rows)

Involving ChatGPT

When giving the table definitions, query and EXPLAIN ANALYZE output to ChatGPT and asking it why this query is slow and how it could be optimized, it correctly identifies (among other things), the missed opportunity of using keyset pagination:

[...] 2. Pagination with Keyset Pagination: Instead of using OFFSET and LIMIT, you can consider implementing keyset pagination. Keyset pagination relies on using the last seen value from the previous page as a marker for fetching the next page. This approach avoids the need to skip a large number of rows using OFFSET, resulting in better performance. [...]

It even provides a sample query syntax for it

SELECT meta
FROM aio_albums
WHERE (release_date, id) < (<last_release_date>, <last_id>)
ORDER BY release_date DESC, id DESC
LIMIT 20;

followed by a basic explanation of keyset pagination:

In this modified query, <last_release_date> and <last_id> represent the values of release_date and id from the last record of the previous page. This query fetches the next page of results based on the values of release_date and id, eliminating the need for OFFSET and improving performance.

Putting the advice to the test

Modifying our query according to ChatGPT's suggestion would look something like this:

SELECT meta FROM aio_albums WHERE (release_date, id) < ('2022-07-06 12:53:56.527719+00', '409812') ORDER BY release_date DESC, id DESC LIMIT 20;

and produce a very different EXPLAIN ANALYZE output:

 Limit (cost=0.48..4.87 rows=20 width=631) (actual time=0.048..0.050 rows=20 loops=1)
  -> Incremental Sort (cost=0.48..13730.56 rows=62461 width=631) (actual time=0.047..0.048 rows=20 loops=1)
        Sort Key: release_date DESC, id DESC
        Presorted Key: release_date
        Full-sort Groups: 1 Sort Method: quicksort Average Memory: 44kB Peak Memory: 44kB
        -> Index Scan Backward using aio_albums_release_date on aio_albums (cost=0.29..10919.82 rows=62461 width=631) (actual time=0.018..0.026 rows=21 loops=1)
              Index Cond: (release_date <= '2022-07-06 12:53:56.527719+00'::timestamp with time zone)
              Filter: (ROW(release_date, id) < ROW('2022-07-06 12:53:56.527719+00'::timestamp with time zone, 409812))
              Rows Removed by Filter: 1
 Planning Time: 0.204 ms
 Execution Time: 0.084 ms
(11 rows)

By switching pagination strategies from offset-based to keyset pagination, we have effectively reduced the query duration from 47.580ms to just 0.084ms - that's a 566x faster query producing the same result!

Caveats

At this point i have to note that ChatGPT is by no means perfect and it's suggestions shouldn't be followed blindly. While it identified an issue with the query and correctly pointed out a better way to query for the data we wanted, it also gave incorrect/unhelpful suggestions:

Index on id column: Since the query involves sorting based on the id column in descending order, you can create an index on the id column to speed up the sorting operation. However, keep in mind that creating an index on a primary key column (which is already indexed by default) may not provide a significant performance improvement.

This is wrong because as you can see above, the id column is our primary key, which is automatically an indexed column in postgresql.


Query Optimization: Depending on your specific requirements, you might consider optimizing the query itself. For example, if you only need the meta column, you can modify the query to retrieve only that column, which can reduce the amount of data being processed and improve performance.

This is unhelpful as we are already doing that.



So take the advice you get with a grain of salt and never implement it blindly. Test the suggested improvements thoroughly and also make note of contextual changes like possibly higher cpu/memory usage while testing.

More articles

Avoiding the 3 most common python pitfalls

Learn how to overcome the most notorious coding pitfalls instead of being surprised by bugs

Demystifying /dev: Understanding Linux Device Files

Understanding device files and their uses in linux

Managing users on linux

A guide on users, groups and authentication

Forwarding docker container logs to Grafana Loki

Advanced log processing for local container environments

Brute-forcing logins with hydra: Attack and defense

How an attacker would crack a login, and how to protect against it