SQL schemas, queries and indexes can be hard to get right, especially when most of your time is spent dealing with application logic, not the data storage layer. Luckily, large language models like ChatGPT are capable of figuring out ways to speed up your queries and optimize the way your code interacts with SQL without you having to spend hours reading through documentation and manuals.
A sample scenario
To get started, we need a database schema and a sample query. I decided to use a really old personal project of mine that I wrote back when I understood very little about PostgreSQL. It was an application managing image albums, with most metadata stored as JSONB and only the fields relevant to search set up as proper SQL columns. The relevant table definitions were as follows:
CREATE TABLE aio_albums(
    id INT NOT NULL PRIMARY KEY,
    search_data TEXT NOT NULL,
    meta JSONB NOT NULL,
    release_date TIMESTAMP WITH TIME ZONE NOT NULL,
    cover BYTEA NOT NULL
);
CREATE INDEX aio_albums_search_data ON aio_albums(search_data);
CREATE INDEX aio_albums_release_date ON aio_albums(release_date);
The column meta contains the actual information such as titles, artists, page count etc., which was not supposed to be filterable in the application. The main purpose of this table was to page through the entire catalog of albums in reverse order of release_date (i.e. latest first). A page contained 20 items, so a query would use an OFFSET value of 20*(page_number-1) and a fixed LIMIT of 20. A sample query to fetch page 3034 would thus use an offset of 60660 rows and look like this:
SELECT meta FROM aio_albums ORDER BY release_date DESC, id DESC OFFSET 60660 LIMIT 20;
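For reference, here is a minimal sketch of how the application side might parameterize this pattern, with the page number feeding into the offset calculation (the prepared-statement name page_albums is just an illustrative placeholder, not something from the original project):
-- hypothetical prepared statement: the offset is derived from the page number
PREPARE page_albums(int) AS
    SELECT meta FROM aio_albums
    ORDER BY release_date DESC, id DESC
    OFFSET 20 * ($1 - 1) LIMIT 20;
-- fetch page 3034, i.e. an offset of 60660 rows
EXECUTE page_albums(3034);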
This has the obvious disadvantage of still processing (but skipping) the first 60660 results of the query, making it very costly for a query to fetch just 20 items, as evident from the EXPLAIN ANALYZE output:
Limit (cost=12011.55..12015.51 rows=20 width=632) (actual time=47.541..47.553 rows=20 loops=1)
  ->  Incremental Sort (cost=0.46..13868.26 rows=70037 width=632) (actual time=0.045..46.414 rows=60680 loops=1)
        Sort Key: release_date DESC, id DESC
        Presorted Key: release_date
        Full-sort Groups: 1897 Sort Method: quicksort Average Memory: 55kB Peak Memory: 55kB
        ->  Index Scan Backward using aio_albums_release_date on aio_albums (cost=0.29..10716.59 rows=70037 width=632) (actual time=0.012..16.608 rows=60681 loops=1)
Planning Time: 0.151 ms
Execution Time: 47.580 ms
(8 rows)
Involving ChatGPT
When given the table definitions, the query and the EXPLAIN ANALYZE output and asked why this query is slow and how it could be optimized, ChatGPT correctly identifies (among other things) the missed opportunity of using keyset pagination:
[...] 2. Pagination with Keyset Pagination: Instead of using OFFSET and LIMIT, you can consider implementing keyset pagination. Keyset pagination relies on using the last seen value from the previous page as a marker for fetching the next page. This approach avoids the need to skip a large number of rows using OFFSET, resulting in better performance. [...]
It even provides a sample query for it:
SELECT meta
FROM aio_albums
WHERE (release_date, id) < (<last_release_date>, <last_id>)
ORDER BY release_date DESC, id DESC
LIMIT 20;
followed by a basic explanation of keyset pagination:
In this modified query, <last_release_date> and <last_id> represent the values of release_date and id from the last record of the previous page. This query fetches the next page of results based on the values of release_date and id, eliminating the need for OFFSET and improving performance.
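In practice this means the first page is fetched without the keyset condition, and each subsequent request plugs in the release_date and id of the last row it received. A minimal sketch of that flow (the literal values in the second query are placeholders, not actual data from the project):
-- first page: no keyset condition yet; also select the keyset columns
SELECT meta, release_date, id
FROM aio_albums
ORDER BY release_date DESC, id DESC
LIMIT 20;
-- next page: reuse release_date and id of the last row returned above
SELECT meta, release_date, id
FROM aio_albums
WHERE (release_date, id) < ('2022-01-01 00:00:00+00', 12345)  -- placeholder keyset values
ORDER BY release_date DESC, id DESC
LIMIT 20;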
Putting the advice to the test
Modifying our query according to ChatGPT's suggestion would look something like this:
SELECT meta FROM aio_albums WHERE (release_date, id) < ('2022-07-06 12:53:56.527719+00', '409812') ORDER BY release_date DESC, id DESC LIMIT 20;
and produce a very different EXPLAIN ANALYZE output:
Limit (cost=0.48..4.87 rows=20 width=631) (actual time=0.048..0.050 rows=20 loops=1)
  ->  Incremental Sort (cost=0.48..13730.56 rows=62461 width=631) (actual time=0.047..0.048 rows=20 loops=1)
        Sort Key: release_date DESC, id DESC
        Presorted Key: release_date
        Full-sort Groups: 1 Sort Method: quicksort Average Memory: 44kB Peak Memory: 44kB
        ->  Index Scan Backward using aio_albums_release_date on aio_albums (cost=0.29..10919.82 rows=62461 width=631) (actual time=0.018..0.026 rows=21 loops=1)
              Index Cond: (release_date <= '2022-07-06 12:53:56.527719+00'::timestamp with time zone)
              Filter: (ROW(release_date, id) < ROW('2022-07-06 12:53:56.527719+00'::timestamp with time zone, 409812))
              Rows Removed by Filter: 1
Planning Time: 0.204 ms
Execution Time: 0.084 ms
(11 rows)
By switching pagination strategies from offset-based to keyset pagination, we have effectively reduced the query duration from 47.580ms to just 0.084ms - that's a 566x faster query producing the same result!
Caveats
At this point I have to note that ChatGPT is by no means perfect and its suggestions shouldn't be followed blindly. While it identified an issue with the query and correctly pointed out a better way to query for the data we wanted, it also gave incorrect or unhelpful suggestions:
Index on id column: Since the query involves sorting based on the id column in descending order, you can create an index on the id column to speed up the sorting operation. However, keep in mind that creating an index on a primary key column (which is already indexed by default) may not provide a significant performance improvement.
This is wrong because, as you can see above, the id column is our primary key, which PostgreSQL already indexes automatically.
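If you want to verify which indexes already exist before adding new ones, PostgreSQL lists them in the pg_indexes view (or via \d aio_albums in psql); a quick check could look like this:
-- lists all indexes on the table, including the implicit primary key index
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'aio_albums';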
Query Optimization: Depending on your specific requirements, you might consider optimizing the query itself. For example, if you only need the meta column, you can modify the query to retrieve only that column, which can reduce the amount of data being processed and improve performance.
This is unhelpful as we are already doing that.
So take the advice you get with a grain of salt and never implement it blindly. Test the suggested improvements thoroughly and also make note of contextual changes like possibly higher CPU or memory usage while testing.
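For example, re-running the comparison with buffer statistics included gives a rough idea of the I/O side of a change as well; this is just generic EXPLAIN usage applied to the keyset query from above, not output captured from the project:
-- BUFFERS adds shared/local/temp block counts to the plan output
EXPLAIN (ANALYZE, BUFFERS)
SELECT meta FROM aio_albums
WHERE (release_date, id) < ('2022-07-06 12:53:56.527719+00', 409812)
ORDER BY release_date DESC, id DESC
LIMIT 20;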