What are NewSQL Databases?

SQL databases are one of the oldest and most widely used database types in the world. In an ever-changing field of server and service architectures, they have remained a constant viable option for data storage and adapted to new tasks and environments. But while developers and maintainers make great efforts to give their RDBMS offerings all the flexibility they can, some new requirements are so radically different from what the databases were originally designed for, that adapting can be a tedious, or even impossible, task. This problem has created the NewSQL category of SQL databases, aiming to fix some of these problems.

Vocabulary

To follow along and understand the arguments in this post, you will need to know the meaning of some terms often used in scalable or distributed systems:

Scalability: The ability to add resources to an application or service as you go, to increase for example the processing power of a database or add more web servers with an increasing number of client requests.

Cluster: A network of machines or devices carrying out a task as a group, for example multiple servers running a database as a cluster, to ensure the database remains up and usable even when individual servers crash or get disconnected.

Vertical scalability: An approach to scaling by running on increasingly more powerful hardware (e.g. replacing the CPU with a faster one, or adding more RAM to a server)

Horizontal scalability: An approach to scaling by adding more devices to a cluster, for example adding more servers to a cluster to increase disk storage, instead of replacing the current disks with larger ones.

Scaling traditional SQL databases

When talking about traditional RDBMS, we are referring to the most popular ones, namely MySQL and PostgreSQL. These databases have existed for decades and proven their capabilities in countless production environments and companies of all sizes. These databases were created and designed in a time long before terms like Cloud and Big Data were even imaginable, and some design decisions reflect that.

These databases work best for their initially intended use case: running a single database instance on a single server. This is where they shine in all their glory, being easy to get running and keeping maintenance to a minimum. But once your application or data grows, you inevitably discover a different side of these databases: scaling them.

Scaling traditional RDBMS will typically look something like this: You have one server acting as the master of the cluster (aka master node), and any number of slaves. The master server will act as the source of truth, being the only server that's allowed to process write transactions and replicating it's data (and changes to that data) to the slaves. Slave servers act as a live copy of the data on the master server and may handle read queries on their own (like SELECT SQL queries). When requesting a write operation (like an INSERT, UPDATE or DELETE SQL query) from a slave server, they will forward it to the master server internally, as only that is allowed to process them. So no matter how many servers you add to your cluster, the master server's speed will remain a bottleneck.

Distributing data through traditional SQL clusters

Another common issue is storage space: When your data grows to hundreds of terabytes or more, it is not feasible to use slave servers as full copies anymore. Instead, data is split into chunks (often called shards or partitions), that are then distributed to specific servers in the cluster. To ensure data availability, these shards are replicated to other servers (so every chunk of data will have exact copies on other servers, in case one crashes or disconnects). This has to be done manually through SQL queries on table level. An example for postgres could look like this:

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);

You will then have to either write all your SQL queries to target specific table partitions, or write trigger functions to automatically select the correct partition for insert queries. This will affect the entire application's SQL layer in one way or another, causing large rewrites when moving from a single node to a cluster, or when changing partitioning schemes to accommodate larger data sets.

The promise of NewSQL databases

With ever-growing data and services, these problems eventually resulted in a new type of RDBMS: The NewSQL databases. These databases aim to be drop-in replacements for the traditional databases, by supporting their entire SQL dialect, so applications moving from one to another would not need to change a single line of code. The most popular ones are:

YugabyteDB and CockroachDB, supporting the PostgreSQL dialect
TiDB, supporting the MySQL dialect

They are built with the challenges of modern scalable database requirements in mind, so they ship with highly automated clustering built-in: They typically create and distribute shards through the cluster on their own, can add/remove servers without downtime, supports self-healing or automated failover and use consensus mechanisms like RAFT to run with multiple master servers, automatically promoting slaves to masters if masters become unresponsive for any reason.

The biggest selling point of these databases is their ability to scale much more efficiently, with immensely reduced maintenance requirements. Their relatively recent creation also means they include modern legal requirements out of the box, like encrypting data at rest or fencing data within specific geological regions (like ensuring that data from EU customers remains on servers within the EU, as GDPR now requires). All of these features are almost fully automated, with the developers writing the application needing to understand nothing of any of these processes. They just work™.

NewSQL, New Problems

The promise of NewSQL databases may seem tempting at first, but there are drawbacks to adopting them. The first one may seem counter-intuitive at first: They aren't traditional RDBMS. While you may argue that that's the point, this works for and against these databases: Traditional RDBMS enjoy a large portion of their popularity because they are old: They have been used in production extensively, running mission-critical applications for decades. They are tried and tested, survived countless reviews and tests from developers, administrators, researchers and even malicious attackers. NewSQL databases are, as the name suggests, new: They have had a fraction of that exposure at best, and may contain bugs that lead to data loss/corruption or security vulnerabilities in the future. Take this with a grain of salt: There is no guarantee they do suffer from these issues, but it is certainly more reasonable to trust a tried and tested system with your mission critical data than a newcomer with little real-world usage.

Another drawback is feature completion. While these databases aim to be drop-in replacements for their target RDBMS, they aren't quite the same: Postgres users may quickly find out that most extensions, like TRGM for typo correction or PostGIS for geospatial queries won't work, and MySQL replacements will typically not fully support fulltext search or indexes. You may not need these, or care, or be willing to replace them with dedicated microservices - but it is a good thing to keep in mind.

Lastly, while they may appear easier to maintain at first glance, they provide an entire new set of challenges and considerations for operators: Data schemes can have unintended side effects, for example primary keys are commonly used for sharding and patritioning of the table's data. This can lead to problems: some NewSQL databases may use hashes to ensure data is distributed evenly among cluster members (at the cost of higher CPU usage), while others may use the primary key values directly, at the risk of storing data (and thus redirecting query load) unevenly to different servers in the cluster.

Resource usage will be different, even if it's just increased CPU usage from encrypted traffic derived from the consensus protocol, or unexpected spikes in bandwidth usage when replicating data to a crashed server that restarted, or that was newly added to the cluster.

While NewSQL databases have a place in the modern database ecosystem, they are no silver bullet. They were designed to solve some painful real-world problems when scaling RDBMS systems, but also bring an entirely new set of challenges to the table. They may be a perfect replacement for your current database needs, or fall short in terms of mission-critical features or extension support, depending on your application's requirements. Sadly, there is no way to decide if NewSQL is a good fit for your needs, other than carefully evaluating your data and operations, and running some tests on possible replacement databases.