The hidden cost of schemaless databases

Schemaless databases have been around for a while, gaining most of their popularity during the NoSQL hype around 2010. The pitch sounded great: build fast, adjust documents as needed. In practice, however, schemaless databases proved to be a nightmare for scaling and for long-lived projects with larger teams. This article focuses on schemaless databases specifically. It is not a comparison of SQL vs NoSQL, nor a smear campaign against a specific database. Most modern NoSQL databases allow you to enforce a schema, and many SQL databases offer schemaless options through JSON columns - any of these databases could be on either side of the argument, depending on how you decide to use them. We are talking about data structure, not storage.

What's wrong with schemaless databases?

Imagine you are a developer who just found out about schemaless document databases. You read up on the claimed benefits of fast development and simplicity, and decide to give it a try in your next project.

No sooner said than done: your first project comes together nicely. Dynamically changing document schemas allows for fast iteration, and your team makes progress very quickly. Everyone is happy.


A few years later, you start working at a new company which uses the same schemaless database you used before. When looking at the database, you can't really tell what kinds of documents are stored: there are user, account and identity documents with different fields and data types, and even documents of the same type have varying fields. The internal documentation was abandoned during the last crunch cycle and refers to documents that either don't look as described or don't exist anymore. The only option is to read through the code in the hope of finding all references for each document, to figure out what data type goes into which document field, which fields are used for querying and so on. Nothing about your work is fast or fun anymore.


Weeks later, you have caught up on the documentation and convinced the team to streamline collections and documents with the next update - but the deployment fails. Some documents were converted, some weren't, and some were even mid-write when the database crashed. There are no rollbacks, no way to undo the change: you are stuck restoring a backup, and explaining the hours of downtime to your manager will be a nightmare.

Where did things go wrong?

This could have been avoided by sticking to some best practices: keeping the internal documentation up to date, keeping a staging environment for deployment testing and the like. The problem is that these are policies, which turn from "solid rules" into "mild suggestions" under the pressure of a deadline. Any rule that isn't rigorously enforced will eventually be ignored - this is true for every part of software development, from quality assurance to formatting, from test coverage to schema documentation.

And right here, we touch on an important point: just because the database doesn't enforce a schema doesn't mean your data doesn't have one. The schema simply shifted from the rigid technical enforcement of the database to the brittle and disconnected internal wiki, where it may drift from actual database state over time, without anyone knowing.

The schema does exist, and at scale you do need to keep data in some kind of common format either way. Database-enforced schemas require developers to think about their data upfront, but guarantee that documentation and database contents can never drift apart.
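As one concrete illustration of a database-enforced schema (MongoDB is used here purely as an example of a document database that supports validation; the field names and types are illustrative assumptions), a collection-level schema might look like this:

```javascript
// Sketch only: a MongoDB collection with a server-side JSON Schema validator.
// Field names and types are illustrative assumptions, not a real data model.
db.createCollection("products", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["type", "name", "price"],
      properties: {
        type:  { enum: ["clothing", "harddrive", "tape"] },
        name:  { bsonType: "string" },
        price: { bsonType: "double" },
        sizes: { bsonType: "array", items: { bsonType: "string" } }
      }
    }
  },
  validationAction: "error" // reject writes that violate the schema
});
```

Any insert or update that violates the declared structure is rejected by the database itself, so the documented schema and the stored data cannot drift apart.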

All data has structure

When talking about data schemas, we really just mean structure. Any data you process or store has some kind of structure, some form of rules you apply and validate. The misconception that schema can somehow be avoided falls apart quickly in reality. Let's take the popular claim that schemaless is a good choice for ecommerce products as an example.


An online shop may offer different kinds of goods with varying properties. For example, it may offer clothes in different sizes and colors, hard drives with different capacities, and tapes of different lengths. At first glance, it would seem that a "product" has no stable properties and schemaless would be an excellent fit here. But in reality, these are different types of products, each with their own inherent schema. You wouldn't want to allow a tape to have a clothing size, or clothes to display a storage capacity - their inherent schemas forbid it. By not enforcing the schema at the database level, you merely move it into the frontend, checking the type of each product as you read it, then processing only the fields it should have. The schema didn't go away - it leaked into the application layer.


The same pattern applies to virtually any data you decide to store in a database. If you think you have schemaless data, ask yourself whether you are going to validate or type check it in your application - if you are, then your data does have a schema, and you are merely considering moving it outside the database.
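A quick self-test for that thought experiment: if your application contains a function like the hypothetical one below, your schema already exists - it just lives in code instead of the database.

```javascript
// Hypothetical application-side check - the function body IS the schema,
// written as imperative code instead of a declarative definition.
function isClothingProduct(doc) {
  return (
    doc !== null &&
    typeof doc === "object" &&
    doc.type === "clothing" &&
    typeof doc.name === "string" &&
    typeof doc.price === "number" &&
    Array.isArray(doc.sizes) &&
    doc.sizes.every(size => typeof size === "string")
  );
}
```

Every rule in that function - field names, field types, element types - is exactly what a database-enforced schema would express declaratively, in one place.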

Moving schema to the application layer

Initially, leaking schema to the application layer may not seem like a big deal, because the side effects only become visible later. To pick up the ecommerce product example again, let's say a sample product looks like this:

{
  "type": "clothing",
  "name": "Fantastic T-Shirt",
  "price": 12.95,
  "sizes": ["xs", "s", "m", "l", "xl"],
  "colors": ["red", "blue", "green"]
}

That data clearly has the schema you expect: a clothing item may have sizes and colors, both expected to be arrays of strings, a price of type float, and a name of type string.

When this is not enforced at the database layer, your application code has to make up for it.


The frontend now becomes a mess of validation and type checks:

const product = readFromDb("prod_123");

if (product.type === "clothing") {
  product.price = parseFloat(product.price);
  if (Number.isNaN(product.price)) {
    throw new Error("Invalid price value");
  }
  product.finalPrice = calculateTaxes(product.price);

  if (Array.isArray(product.colors)) {
    product.colors.forEach(color => {
      console.log(color.toUpperCase());
    });
  }
}

Since the schemaless database makes no guarantees about data types for each field, you have to validate every field before using it. The price field is supposed to be a float, but you can't be sure that's what the database returned for the query. The same validation logic needs to be applied to the colors array. And even this much validation is still too little: the code calls color.toUpperCase(), blindly assuming the array contents are strings, which the database doesn't guarantee either.
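To actually be safe, the element types have to be checked as well - a sketch (using the same hypothetical product document) of what the colors handling needs to grow into:

```javascript
// Without database-level type guarantees, even array elements must be
// checked before use. Sketch only, based on the hypothetical product above.
function safeColorNames(product) {
  if (!Array.isArray(product.colors)) {
    return []; // field missing or not an array
  }
  return product.colors
    .filter(color => typeof color === "string") // drop non-string entries
    .map(color => color.toUpperCase());
}
```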


You can neither trust data when it goes into your schemaless database nor when you read it back out, so these checks now become an appendage to every database query in the entire codebase. The code has to validate the same stored values again and again for every page load or write query. You now need to have the expected schema in your head while writing code, and mistakes are not immediately obvious.

You have now enabled some great footguns to blow up your program logic at runtime:

  • In aggregate queries, incompatible field types are quietly ignored. If a price field was accidentally stored as a string, most frontend code will work fine, but the document will be missing from aggregate analytics queries or filter/search functions. Languages with type coercion produce these accidents much faster than you'd expect. Database schemas allow only one type per field, removing this possibility entirely.

  • A simple typo in the type field effectively orphans the document, making it entirely invisible in the frontend without any errors. Unless someone sees and reports it, you won't even know the problem exists. A database schema would bind the document to an expected structure; trying to create a document of an unknown type is an immediate error.

  • If you decide to rename a field later, you have to go through your codebase blindly and hope you found every line that needs to change. If you missed one, no errors will tell you - the line simply never runs anymore, or breaks unexpectedly at runtime. An enforced schema would immediately throw an error when code touches nonexistent fields during build or tests.
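The first footgun is easy to reproduce in plain JavaScript: a price accidentally stored as a string survives comparisons thanks to type coercion, but corrupts or silently escapes aggregates.

```javascript
// One price was accidentally written to the database as a string.
const prices = [12.95, "3.05", 20.0];

// Type coercion hides the problem in comparisons:
console.log("3.05" < 10); // true - the string is coerced to a number

// ...but a naive sum concatenates instead of adding:
const naiveSum = prices.reduce((a, b) => a + b, 0);
console.log(naiveSum); // "12.953.0520" - a string, not 36

// ...and a "defensive" aggregate silently drops the document:
const filteredSum = prices
  .filter(p => typeof p === "number")
  .reduce((a, b) => a + b, 0);
console.log(filteredSum); // 32.95 - the string price is simply missing
```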

Schemaless means more verbosity

The primary goal of schemaless is to increase development speed by removing the need to think about schema changes in the database. This works great in the beginning, but reverses at scale: moving schema enforcement out of the database increases developer load instead of decreasing it.


Positive effects:

  • No need to worry about changing schemas or migrations. Altering data and changing code is the migration.

Real world cost:

  • Having no migrations means being unable to roll a schema change back when things go wrong.

  • You now have to document the schema in code as comments, or externally in a wiki or docs, which will drift from the real database state over time.

  • Code needs to defensively fence against bad data types and validate fields on every read and write query, bloating the codebase with type switching and field validation that contributes nothing to business logic.

  • Adding new members to the team becomes tedious. A schema-enforcing database would immediately tell them what data format is expected - in a single place, guaranteed to be correct, for all stored data. Instead, they have to read through code and try to memorize not just the structure and business logic of the application, but also the database schema, all in one go, increasing the time it takes to become productive. This is made worse by the code bloat incurred by all the necessary runtime validation.

None of these problems are immediately obvious when starting to use a schemaless database, which is exactly why so many developers get stuck with them.

The price of schemaless

Schemaless provides an initial boost, but the velocity gain isn't free - it is borrowed from the future, where you will pay it back with interest. Payback takes the shape of reduced developer velocity, higher developer load and increased onboarding time, all translating directly into increased development and maintenance cost. And that is assuming your code is perfect: not a single line of validation missed, no typos, and no schema change left undocumented.


Startups often fall into this trap, because their data schemas may evolve with their customers or use cases, causing frequent changes and rewrites. This creates a huge perceived cost for what feels like "meaningless" structure rules that won't survive the next 4 weeks either way. It is very tempting to pick up a schemaless database and use the newly gained free time to focus on features and application logic. But in reality, every schema settles eventually. And when it does, you are stuck with a database and contents that are built for a feature you don't need anymore - plus the technical debt it created along the way by treating structure as optional.
