Storing sessions or cache data in postgresql


Every now and then, people realize that postgresql is much more flexible than they expected, and can take over large portions of infrastructure that were previously handled by other databases or services. But just because it can serve different tasks doesn't mean it easily lends itself to them. Experience in configuration and database internals is essential for production workloads, especially when using postgresql outside its comfort zone.

Sample scenario

To properly showcase the capabilities of postgresql and the hurdles you may face, we are using one of the worst use cases for relational databases like postgresql: semi-ephemeral cache data. While this could be an actual cache system, we are specifically choosing to store user sessions, which require automatic expiry and keepalive semantics (sessions automatically extend while they are read/used/active).

The dynamically changing expiration field rules out table partitioning and cheaper TRUNCATE-based workarounds, forcing real table tuning for the workload, while the constant expiry updates also generate very high write load that pressures autovacuum beyond its normal capacity.

In terms of planning, we assume that the sessions back a web or mobile app and need to be read on every page load, that many users are active simultaneously, and that a session must include an arbitrary payload, likely a JSON-encoded map or object. Users are handed an automatically generated session ID by the server and provide only that ID when making requests.

How to shoot yourself in the foot

When searching the internet, you will quickly find posts suggesting that cache-like workloads from memcached/valkey/redis can easily be migrated to an existing postgresql instance to save money.


Just create a table:

CREATE UNLOGGED TABLE sessions(
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload JSONB,
  expires_at TIMESTAMPTZ
);

Note the UNLOGGED keyword in the statement, disabling WAL (write-ahead-log) entries for the table, resulting in a roughly 10x speed boost to write queries.

Combined with a background query your app executes periodically to reap expired rows, the feature is supposedly complete:

DELETE FROM sessions WHERE expires_at < NOW();

They often also suggest adding an index on expires_at to improve the performance of the query:

CREATE INDEX idx_session_expires ON sessions(expires_at);

All of this may sound reasonable if you aren't very familiar with postgres. At second glance, you might notice that your database administrator has fainted at the sight of the proposed setup.

While it looks innocent, this configuration makes virtually every mistake you could make when storing sessions in postgres, misses out on real optimizations, and even bottlenecks your infrastructure's ability to scale.

Implementing this in production is a horrible idea, do not ever do this.

There are many reasons why this design is so problematic, so let's go over them carefully one by one.

Table persistence and data integrity

One of the biggest footguns is the idea that UNLOGGED tables are a cure-all for table performance. Creating an UNLOGGED table means that writes are not recorded in the WAL (write-ahead log) for persistence. Some people assume this only risks losing the most recent writes, but the table is actually truncated (emptied) entirely after any crash or unclean shutdown - only a clean shutdown preserves its contents.

What many people don't consider is that the WAL is not only used for data integrity, but also for replication to read-only replicas in a clustered setup (postgres's built-in way of scaling reads horizontally) - no WAL means no replication, so the UNLOGGED table will exist but always be empty on all read replicas, effectively preventing you from scaling your postgres sessions beyond a single machine. Highly available clusters will also lose all table rows if the primary (master) instance fails and a replica is promoted - even if all other data and connections survive, your sessions will be gone, logging all your users out.
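
If you have already deployed an UNLOGGED sessions table, it doesn't have to be recreated from scratch - it can be converted in place. Be aware that this rewrites the table and writes all existing rows to the WAL, so expect a short burst of I/O:

ALTER TABLE sessions SET LOGGED;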


As you can see, UNLOGGED is not a magic switch to make tables faster - it is a tradeoff, and one you likely aren't willing to make considering the scaling issue. Luckily, we don't have to rely on such a destructive feature to increase write performance on our session table. Let's set up a simple (logged, replicated) table for session storage:

CREATE TABLE sessions(
  id BIGINT GENERATED ALWAYS AS IDENTITY (CACHE 100) PRIMARY KEY,
  payload TEXT,
  expires_at TIMESTAMP DEFAULT NOW() + INTERVAL '30 Minutes'
);

Ignore the data type differences compared to the previous table - we will explain them in a later section.


Note the added (CACHE 100) setting in the statement: by default, generated identity columns like id in our example draw their values from a sequence, which uses a lightweight locking mechanism inside postgres to prevent duplicate values. However lightweight it may be, it adds latency you can feel. Setting a cache value of 100 instead lets each database session (backend) pre-allocate the next 100 sequence values and hand them out locally, removing roughly 99% of the internal sequence locking during INSERTs.
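
If the table already exists, the cache size of the sequence behind the identity column can also be changed afterwards (a minimal example, assuming the table from above):

ALTER TABLE sessions ALTER COLUMN id SET CACHE 100;

Since each backend now holds a private block of values, ids will be handed out with gaps and not in strict insertion order - which is harmless for a session id.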


For a performance baseline, I ran a test on my machine against a local docker container running the postgres:15-alpine image in its default configuration. The test inserted 100k rows of randomly generated 250-character dummy payload strings, each as a separate query, just like actual session storage would:

INSERT INTO sessions (payload) VALUES ('<session payload data>');

For an UNLOGGED table, writes took on average 0.03ms/query; the logged table above averaged 0.24ms/query, roughly 8x slower. However, we can remove most of this latency without disabling WAL writes at all, simply by not waiting until the write has been flushed. The setting for this is synchronous_commit = off, which can't be set for a single table directly. It can only be set in four places, two of which are a bad fit: changing it in postgresql.conf or per session with SET synchronous_commit = off; would affect other tables and queries - avoid these. The remaining two options, however, are great for our needs.

You could wrap the write in a transaction, making the setting local to that transaction only:

BEGIN;
  SET LOCAL synchronous_commit = off;
  INSERT INTO sessions (payload) VALUES ('<session payload data>');
COMMIT;

Re-running the 100k row insert test comes in at 0.097ms/query, roughly 2.5x faster than a normal logged table - but still more than 3x as much as an UNLOGGED one.

The transaction wrapper is several statements long, and sending them all to the postgres instance (often as separate round trips) causes a good chunk of the remaining latency. We can solve that problem by moving the logic into a postgres function:

CREATE OR REPLACE FUNCTION new_session(TEXT)
RETURNS void AS $$
BEGIN
  -- disable synchronous commit for the current transaction only
  -- (the third argument of set_config makes the change transaction-local)
  PERFORM set_config('synchronous_commit', 'off', true);
  INSERT INTO sessions (payload) VALUES ($1);
END;
$$ LANGUAGE plpgsql;

Now only a short statement needs to be sent to postgres for a write:

SELECT new_session('<session payload data>');

With both optimizations, queries take only 0.037ms on average, very close to the 0.03ms of an UNLOGGED table - except our table still works with horizontally scaled clusters, survives reboots and failovers, and at worst a crash loses the final moments of not-yet-flushed writes instead of the entire table.


A 23.3% difference in write speed may seem like a lot, so let's put this into perspective: in a best case scenario, where your database server is housed in a high-end datacenter, in the same rack as your application server, with only a single network switch in between, networking alone would add 0.1ms latency. With this in mind, in a perfect world, with best-in-class hardware, that comparison becomes 0.13ms vs 0.137ms - merely a 5.38% difference. Real world deployments are rarely this fortunate. If your database server is in a different nearby datacenter from your application server, network latency will add around 5ms to your query time, shrinking the performance gain to 0.139%. If your servers are further apart for geographic resiliency or global clustering, anywhere from 10-200ms is common.

Picking the correct data types

Write performance includes the cost of validating, decoding and converting incoming data depending on column data type. For maximum performance, you need to use simple types with minimal processing overhead to keep writes lightweight.

For the primary key column id, you should likely use BIGINT, since a numeric value lets postgres generate ids automatically at virtually no cost with an internal monotonic counter. Using INT is a bad fit because it tops out at roughly 2.1 billion values, which write-heavy session data could quickly exhaust.

The payload column is best served as a TEXT field, even if it contains encoded JSON. Relying on JSON/JSONB would add validation and parsing cost on every INSERT, while BYTEA would offer minimal performance gains on write at the cost of more complex application logic and weaker data integrity between clients (possibly ambiguous text encoding).
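
Storing the payload as TEXT doesn't stop you from inspecting it ad hoc: assuming your application writes valid JSON, you can still cast on read (user_id is a hypothetical payload key here):

SELECT payload::jsonb -> 'user_id' FROM sessions WHERE id = 15;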

Lastly, the expires_at column should be a TIMESTAMP. There is no need for the extra timezone information of a TIMESTAMPTZ, since the value is only used by postgres internally (which won't drift from its own timezone). Do not rely on a BIGINT with a unix epoch value here: it is the same size internally as a TIMESTAMP (8 bytes), but computing with it is problematic. In order to delete expired rows, you need the current unix epoch; if your app logic generates it, you introduce problems like clock skew between your application servers and the database. If postgres generates it, you lose performance computing it:

DELETE FROM sessions WHERE expires_at < EXTRACT(EPOCH FROM NOW());

This takes the current timestamp from NOW(), has EXTRACT convert it into a number (NUMERIC on current postgres versions, DOUBLE PRECISION on older ones), which then has to be converted again for the comparison against the BIGINT column. This is much more expensive than simply using a TIMESTAMP field and comparing directly against NOW(), resulting in a shorter query and no type conversions:

DELETE FROM sessions WHERE expires_at < NOW();

Indexes and constraints

Coming from other databases, you might expect that using fixed-size columns improves performance. In postgres, that doesn't hold up the way you expect. For example, a varchar(255) is actually slower than a TEXT column, because postgres stores varchar and TEXT identically under the hood - the size restriction is just an extra constraint that is checked on INSERT, and it affects neither the internal storage layout nor query planning.


Adding any form of constraint to our session table only adds more write latency for virtually no benefit, so refrain from using size constraints, NOT NULL, UNIQUE or foreign key relationships.

You should also not add any indexes on any of the columns: id is already indexed implicitly because it is the PRIMARY KEY field, and payload should never be used for querying or filtering - it is only written once and read via an id comparison condition.

Not adding an index on expires_at might seem counter-intuitive, but we will discuss that case in the next paragraph.

For maximum performance, read only the payload column when looking up sessions, do not SELECT * or include any other columns in read queries.
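
A minimal lookup then only touches the payload column, with the session id supplied as a bind parameter by your database driver:

SELECT payload FROM sessions WHERE id = $1;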

Row expiry semantics

In order to automatically expire sessions, you need a background task that finds expired rows and runs DELETE operations on them. It is enough to run a simple query like this periodically (perhaps once every 5 minutes) from your application - note that postgres has no LIMIT clause on DELETE, hence the subquery:

DELETE FROM sessions
WHERE id IN (
  SELECT id FROM sessions WHERE expires_at < NOW() LIMIT 1000
);

Using the expires_at column in the query condition might tempt you to create an index on it, but you really shouldn't. The field is updated frequently to keep sessions alive while they are active, and an index would amplify that write latency, since every update would then need to write a new data tuple and update the index (it would also prevent the HOT update optimization discussed later). The delete operation is a background job, so making it faster doesn't provide much benefit either way - the index is not worth that tradeoff.


More important is the LIMIT 1000 portion of the query. Since the sessions table is expected to experience high churn (write/update/delete load), a large number of rows may expire within the same time window. Deleting a large number of rows at once has implications for table performance that need a little more explanation.


In postgres, the data for a row is stored as a "tuple" (like an array containing all column values for that row). Many of these tuples are stored together in larger chunks called "pages" for performance reasons. When you delete a row, its tuple is only marked as deleted, commonly referred to as a "dead tuple". This dead tuple remains in the page, since deleting it immediately would mean rearranging the page and all tuples within it, which would be far too expensive and slow to do for every delete.

Dead tuples still impact query performance, because read queries need to skip over the space they occupy in the page. More dead tuples mean lower query performance.

The VACUUM command is used to physically reap dead tuples out of pages and remove their performance penalty. You can run it manually, but postgres will also run it automatically in the background, a feature called autovacuum.
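
You can watch how many dead tuples the sessions table has accumulated, and when autovacuum last cleaned it, through the statistics views:

SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'sessions';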


While autovacuum is reaping dead tuples from pages, it may use a significant portion of resources (CPU, shared memory buffers, disk I/O) and compete with your foreground queries for them, which can be noticeable on a busy table. Deleting many rows at once produces many dead tuples, putting excessive pressure on the next autovacuum run to keep up with dead tuple removal. Capping deletes at a limit like 1000 rows per 5 minutes ensures that autovacuum is never overloaded, thus maintaining query performance. The downside - sessions potentially lingering slightly longer than configured - is mitigated by adding an expires_at > NOW() condition to read queries so expired-but-not-yet-deleted sessions are never returned, so there is no real downside to this approach.


The specific row delete limit depends somewhat on your postgres setup, server hardware and cleaning interval, but 1000-5000 is a good value to start with.
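
If you would rather schedule the cleanup inside the database than in application code, and the pg_cron extension is available (an assumption - it is not part of core postgres), a sketch could look like this:

SELECT cron.schedule(
  'expire-sessions',   -- job name
  '*/5 * * * *',       -- every 5 minutes
  $$DELETE FROM sessions
    WHERE id IN (SELECT id FROM sessions WHERE expires_at < NOW() LIMIT 1000)$$
);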

Keeping sessions alive

Most modern session systems treat expiration as a dynamic timer, resetting back to the initial lifetime value every time it is used. For our session table, that means updating the expires_at column when a session is read/used.


A simple way to manage this is to use UPDATE ... RETURNING syntax:

UPDATE sessions
  SET expires_at = NOW() + INTERVAL '30 Minutes'
  WHERE id = 15 AND expires_at > NOW()
RETURNING payload;

This guarantees that keepalive updates cannot be forgotten when reading session payload data, but it also creates a problem: If reads are bound to writes, then only the primary (master) instance of a cluster can serve our queries, ruining horizontal scaling with read replicas again.


For scalability, use simple SELECT queries for lookup:

SELECT payload FROM sessions WHERE id = 15 AND expires_at > NOW();

and in your application logic, keep an in-memory buffer (or a postgres TEMP table - see the sketch below) of the id values you recently looked up. Every 30-60 seconds, update the session lifetimes in a single batched write:

UPDATE sessions
  SET expires_at = NOW() + INTERVAL '30 Minutes'
  WHERE id = ANY('{15,16,17,18}'::BIGINT[]);

This way, all read queries can still be load balanced across read replicas, and write pressure is reduced to once or twice a minute with a single larger batch write. If your buffer is very large, split the batch query into chunks of roughly 50k rows per update.
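
A minimal sketch of the TEMP table variant, assuming your application holds on to a dedicated connection (TEMP tables are private to a single connection and are not WAL-logged):

-- connection-local buffer of recently used session ids
CREATE TEMP TABLE IF NOT EXISTS touched_sessions(id BIGINT);

-- on every session lookup, remember the id
INSERT INTO touched_sessions VALUES (15);

-- every 30-60 seconds, flush the buffer as one batched keepalive write
UPDATE sessions s
  SET expires_at = NOW() + INTERVAL '30 Minutes'
  FROM (SELECT DISTINCT id FROM touched_sessions) t
  WHERE s.id = t.id;
TRUNCATE touched_sessions;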


For even better performance, consider excluding recently updated sessions with a secondary condition:

UPDATE sessions
  SET expires_at = NOW() + INTERVAL '30 Minutes'
  WHERE id = ANY('{15,16,17,18}'::BIGINT[])
  AND expires_at < NOW() + INTERVAL '20 Minutes';

Now sessions are kept alive at most once every 10 minutes, heavily reducing write load on the sessions table, but at the cost of potentially cutting a session's lifetime short by up to 10 minutes. You could instead increase the session lifetime by 10 minutes to counteract this, potentially making some sessions live slightly longer, which your users are far less likely to be upset about.

Tuning pages and autovacuum

Dead tuples will accumulate fast in the sessions table, both directly from the background queries deleting expired rows and implicitly from the keepalive updates. A row update in postgres effectively writes a new version of the tuple and marks the old one dead, so every change to any row value produces a dead tuple.

Autovacuum needs some help to keep up with the high-churn session table. One quick win is to decrease the fillfactor of its pages. By default a page uses 100% of its space for tuple storage, so an update marks the current tuple dead and has to write the new version to a different page. Decreasing the fillfactor keeps some free space inside each page, so an update can write the new version into the same page the old one lives in, avoiding index updates and writes to other pages. This is called a HOT (Heap-Only Tuple) update, and it is much less expensive than a normal one. You pay for this performance with more disk space, but the sessions table isn't expected to grow extremely large since rows expire after some time, so using 30% more storage shouldn't be a problem.

Reducing the fill factor to 70 is a good starting point, leaving 30% free space in every page:

ALTER TABLE sessions SET (fillfactor = 70);
VACUUM FULL sessions;

Remember to run the VACUUM FULL command so existing pages are rebuilt with the new free space requirement - note that it rewrites the whole table under an exclusive lock, so run it during a quiet period. In extreme cases you could reduce the fill factor down to 50, but that would be overkill for most use cases.
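
To check whether the lower fillfactor is paying off, compare the table's total update count against its HOT update count - the closer the two numbers are, the better:

SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'sessions';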


Autovacuum also needs to be tuned to run more often and more aggressively to keep up with the many dead tuples created by the session workload.

By default, autovacuum only starts reaping dead tuples from a table once the number of dead tuples exceeds a base threshold of 50 (autovacuum_vacuum_threshold = 50) plus 20% of the table's tuples (autovacuum_vacuum_scale_factor = 0.2). While working, the vacuum process also pauses regularly (autovacuum_vacuum_cost_delay, 2ms by default on current versions, 20ms before postgres 12) once it has accumulated enough work, to avoid interfering with active queries.

These default values should be tuned for high-churn tables:

ALTER TABLE sessions SET(
  autovacuum_vacuum_threshold = 2000,
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_cost_delay = 2
);

We don't need aggressive reaping while there are only a few dead tuples, because the fillfactor headroom buys some time, so the threshold can be raised to around 2000. The scale factor on the other hand is far too large for a big table - we want cleanup to kick in long before 20% of the table is dead, so reducing it to 1% (0.01) is a much better baseline. Finally, the cost delay is pinned at 2ms sleep timers (a 10x reduction if you are still on a version older than postgres 12), since the session table may contain a large number of dead tuples that need to be cleaned quickly, before the keepalive background task executes its next batch update for recently used sessions.
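
You can verify which per-table overrides are active (including the fillfactor set earlier) by inspecting the table's storage options:

SELECT relname, reloptions
FROM pg_class
WHERE relname = 'sessions';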


By default postgres only has 3 worker processes that can vacuum tables, and the session table will likely hog one of them most of the time, so you might want to increase the number of globally available worker processes on the postgres instance, so other tables aren't negatively affected:

ALTER SYSTEM SET autovacuum_max_workers = 6;
SELECT pg_reload_conf();

Note that autovacuum_max_workers is one of the settings that can only change at server start: the reload makes postgres register the new value, but it won't actually take effect until the next restart of the instance.
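
After the reload, you can confirm that the new value is staged and waiting for a restart:

SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'autovacuum_max_workers';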

Should you store sessions in postgres?

Congratulations on making it through the article! Now look back and ask yourself: is all that effort really worth it just to remove caching databases from your architecture?

Memcached, valkey or redis are all purpose-built for this kind of workload, and would have given you almost identical performance and capabilities out of the box, with no or minimal tuning.

Yes, removing another database sounds good on paper: lower cloud/server bills, fewer dependencies - but it adds hidden cost in the form of complexity. Your session or cache implementation is now much more difficult to maintain and reason about, and any work or debugging involving it requires highly skilled engineers who know postgres inside and out. These engineers aren't cheap, and their work hours should be spent on business logic (the part that makes money), not internal plumbing like sessions.

So yes, you could store sessions or cache data in postgres. But you really shouldn't.
