A modern filesystem is the foundation of any modern storage system. In the Linux world, there are two prominent options to choose from: ZFS and Btrfs. But which is right for you? There is no simple answer, so let's look at their features, differences and tradeoffs so your next storage solution uses the filesystem that fits your use case and workload.
Performance and efficiency
Both filesystems use a copy-on-write (CoW) mechanism to store data: a write doesn't alter data in place but instead writes the entire block again somewhere else and updates the metadata to point to the newer version, eventually freeing the old block if no references to it remain (for example from snapshots). CoW systems share some common performance issues, for example write amplification for workloads that make many small updates to file contents, like databases or virtual machine disks, because even a small update requires an entire data block to be rewritten elsewhere. They are also prone to fragmentation, as contiguous data eventually gets scattered across the disk after many writes.
ZFS includes a very efficient in-memory cache (the ARC) that holds both the most recently and the most frequently used data, and it collects writes into batches (transaction groups) to better utilize the sequential write abilities of drives. Block sizes are configurable per dataset to allow tuning for the expected type of data, between 4 KiB and 1 MiB, with a default of 128 KiB. The block size acts as an upper limit: a 16 KiB write with a 128 KiB block size still writes a 16 KiB block, but writing 256 KiB of data results in two 128 KiB blocks. Workloads involving large sequential writes (large files, or many smaller files at once) benefit heavily from larger block sizes. Data is synced from the write cache to disk every 5 seconds by default, and the aggressive caching provides best-in-class performance for large sequential writes and read operations.
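As a minimal sketch of the per-dataset tuning described above (the pool and dataset names tank/vm and tank/media are made up for illustration), the block size is controlled via the recordsize property:

```sh
# Smaller records reduce write amplification for random-I/O workloads
# such as VM images or databases (hypothetical dataset names).
zfs set recordsize=16K tank/vm

# Large media files benefit from bigger records (better sequential
# throughput).
zfs set recordsize=1M tank/media

# Verify the effective settings.
zfs get recordsize tank/vm tank/media
```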
These features come at the cost of significantly higher memory pressure: at least 1 GB of memory per TB of storage is recommended, 2-4 GB to maximize the performance benefits of the cache (and even more with deduplication or compression enabled). Systems should have at least 8-16 GB of total memory to run reliably. ZFS can be configured to run with less, but largely sacrifices its performance benefits in those scenarios. Batched writes and dynamic block sizes mitigate some of the write amplification issues, but for random-I/O-heavy workloads ZFS is still 2-5x slower than non-CoW filesystems like ext4. ZFS also includes no built-in mechanism to fix fragmentation, forcing operators to re-create the entire dataset to remove the performance penalty on aged filesystems.
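On OpenZFS for Linux the cache ceiling can be capped through the zfs_arc_max module parameter; a rough sketch, assuming root access and a value (4 GiB here) that is only an example and should be sized for your own system:

```sh
# Cap the ARC at 4 GiB for the running system (value in bytes).
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Make the cap persistent across reboots.
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
```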
Btrfs also uses batched writes and dynamic block sizes, but block size limits cannot be configured (they depend on the kernel page size, typically 64 KiB). Batched writes are synced to disk every 30 seconds by default. The filesystem has no aggressive caching mechanism like ZFS, making it somewhat slower but also reducing its runtime memory consumption: around 512 MB per 1 TB is recommended, 1 GB with extensive snapshot usage. Systems should have at least 4 GB of total memory to run reliably. The copy-on-write mechanism can be disabled for files or directories that expect frequent small writes (databases, VM disks, etc.) by setting the NOCOW attribute. Writes to files with this attribute happen in place, as in traditional filesystems, which removes the CoW performance penalty and heavily improves raw read/write performance - but also sacrifices checksumming, compression, deduplication and snapshot protection for those files. NOCOW can only be set on files or directories with no compression enabled and no prior snapshots, and it does not affect existing files unless they are rewritten. A snapshot that includes a NOCOW file can therefore observe changes in that file's contents, as it is not immutable like the rest of the snapshot. Fragmented filesystems can be defragmented online, but defragmentation can put heavy load on the storage devices, so operators need to find a balance between defragmenting frequently enough to keep the load per run low and scheduling the maintenance outside of peak usage hours.
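A rough sketch of both mechanisms (the paths are hypothetical): the NOCOW attribute is applied with chattr, and online defragmentation is triggered through the btrfs tool:

```sh
# Create the directory before any data lands in it - the attribute
# only applies to files created (or rewritten) afterwards.
mkdir -p /var/lib/mysql          # hypothetical database directory
chattr +C /var/lib/mysql         # +C sets the NOCOW attribute
lsattr -d /var/lib/mysql         # verify: the 'C' flag should appear

# Online defragmentation of a mounted path; -r recurses into
# directories. Run outside peak hours - it causes heavy disk I/O.
btrfs filesystem defragment -r /mnt/data
```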
tl;dr ZFS offers better peak performance with high memory use, excels at sequential I/O, but lacks native defrag. Btrfs has less peak throughput but uses less memory, can disable CoW for small writes, and has online defrag.
Storage pool management
ZFS creates zpools from devices (disks, partitions or files) as the backbone for individual datasets. Pools can be extended dynamically, but shrinking is impossible, instead forcing admins to create a new, smaller pool, sync data to it and destroy the old one. This also necessarily involves downtime, as the mounted volumes from the old pool need to be changed to the new one, requiring admins to plan their future workloads more carefully with ZFS.
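A sketch of what this looks like in practice, using one common migration pattern; all pool and device names (tank, tank2, /dev/sd*) are hypothetical:

```sh
# Create a pool and later grow it by adding another mirror vdev.
zpool create tank mirror /dev/sda /dev/sdb
zpool add tank mirror /dev/sdc /dev/sdd

# "Shrinking" means migrating: build a smaller pool, replicate a
# recursive snapshot into it, then retire the old pool.
zpool create tank2 mirror /dev/sde /dev/sdf
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F tank2
zpool destroy tank
```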
Btrfs behaves very similarly, but instead combines devices (disks, partitions or files) into a single filesystem, which is then divided into logical subvolumes that can be mounted individually. It is common to mount the root filesystem only when managing subvolumes and to use subvolumes exclusively for data storage itself. Btrfs filesystems can be both extended and shrunk at will - but shrinking requires rebalancing data, potentially putting heavy load on all remaining devices until it completes.
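A minimal sketch of the same operations on Btrfs, with hypothetical devices and mount points:

```sh
# Create a filesystem over two devices and carve out a subvolume.
mkfs.btrfs /dev/sdb /dev/sdc
mount /dev/sdb /mnt/pool
btrfs subvolume create /mnt/pool/@data

# Grow by adding a device, then spread existing data across it.
btrfs device add /dev/sdd /mnt/pool
btrfs balance start /mnt/pool

# Shrink by removing a device; btrfs migrates its data off first.
btrfs device remove /dev/sdd /mnt/pool
```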
tl;dr ZFS pools can expand but not shrink, requiring a new pool and downtime to reduce size. Btrfs combines devices into a filesystem with subvolumes that can expand and shrink, though shrinking needs data rebalancing causing heavy device load.
Data redundancy and integrity
ZFS provides a complete range of RAID options for high availability, including mirroring (RAID1/RAID10), single-parity RAID-Z1 (comparable to RAID5), double-parity RAID-Z2 (RAID6) and triple-parity RAID-Z3. It also supports storing multiple copies of each block on single-disk setups to protect against data corruption without redundancy. It uses data checksums to automatically detect corrupted or damaged blocks when reading data and attempts to find a duplicate that is still intact, returning the correct data for the read and quietly fixing the corrupted block at the same time. A manual scrub can be run to fix corrupted data across an entire pool, not just the portions touched by reads. Checksums can use either the software-optimized fletcher4 algorithm or the cryptographic sha512 hash to reduce collisions (but this is much slower). Checksums are stored in separate metadata blocks to reduce the risk of corrupting both data and checksum at the same time. Redundant configurations expect all disks in the pool to have equal capacities, otherwise every disk is limited to the size of the smallest one.
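A sketch of these options in practice; pool, dataset and device names are made up for illustration:

```sh
# Double-parity pool across four (hypothetical) disks.
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Single-disk-style corruption resistance: keep two copies of every
# block (halves usable capacity for that dataset).
zfs set copies=2 tank/important

# Switch checksumming to sha512 for fewer collisions at higher CPU cost.
zfs set checksum=sha512 tank/important

# Verify and repair the whole pool, then check the result.
zpool scrub tank
zpool status tank
```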
Btrfs supports mirroring for redundancy (RAID1 and RAID10). More advanced RAID5/6 levels are also implemented, but their implementation has been unreliable for a long time; even after fixes for most of the write-hole issue they remain marked as unstable and aren't recommended for production workloads. Similar to ZFS, Btrfs automatically repairs corrupted data on read if an undamaged copy exists and provides a scrubbing tool to fix entire filesystems or subvolumes. Data integrity checksums use the crc32c hash by default, with alternative algorithms only selectable when the filesystem is created. While crc32c is less collision-safe than the options in ZFS, it is optimized for hardware acceleration, making it faster and reducing CPU load at runtime. Checksums are stored in metadata alongside data pointers by default, putting them at risk of losing both data and checksum simultaneously in extreme corruption scenarios. Disks of different capacities can be combined in redundant filesystems without losing storage capacity on the larger drives.
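A minimal sketch of a mirrored Btrfs setup with scrubbing; device names and mount points are hypothetical:

```sh
# Mirrored data and metadata across two (hypothetical) disks.
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
mount /dev/sdb /mnt/pool

# Verify checksums and repair from the mirror copy where possible,
# then check progress and results.
btrfs scrub start /mnt/pool
btrfs scrub status /mnt/pool
```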
tl;dr ZFS can run in multiple redundancy configurations with selectable software-based checksum algorithms, but expects equal disk capacities. Even single disks can gain corruption resistance by storing multiple data copies at the cost of half the usable capacity. Btrfs should only be used with mirroring while RAID5/6 remain unreliable, uses a hardware-accelerated checksum algorithm and allows mixing differently-sized disks.
Disaster recovery
The redundancy levels of ZFS provide well-balanced options for redundancy versus remaining disk space, and storing metadata separately from block data drastically reduces the risk of catastrophic corruption where both a block and its checksums are lost at the same time. Rebuilding a degraded array within its limits (e.g. a single dead drive in RAID-Z1) is largely automatic: simply replacing the pool's broken disk with a new one triggers a rebuild (a "resilver") that writes redundancy data as needed. While redundancy is reliable and well-tested, pushing a pool beyond its parity limit (e.g. losing more than one disk at once in RAID-Z1) can be difficult or even impossible to recover from, potentially making existing snapshots unusable as well. Pools with missing or heavily corrupted drives may refuse to import or force rollbacks, introducing data loss. Additionally, a catastrophic hardware failure that corrupts metadata can break the storage pool entirely, with even recovery tools failing to rescue the data. Careful manual data extraction by experts remains the last hope in these scenarios, together with long downtime and heavy cost. Having backups for ZFS systems, even redundant ones, is highly recommended.
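The replacement step itself is short; a sketch with hypothetical pool and device names:

```sh
# Replace a failed disk; resilvering starts automatically and the
# pool stays online (degraded) while it runs.
zpool replace tank /dev/sdc /dev/sdf

# Watch resilver progress and overall pool health.
zpool status tank
```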
Btrfs, on the other hand, runs a higher risk of losing checksums and block data at the same time since they are located physically close to one another, but offers better support for fixing degraded filesystems. Replacing a drive requires similar steps as in ZFS, but should be followed by a balance operation to ensure data is distributed consistently across all disks in the filesystem. Catastrophically damaged filesystems can often still be mounted in a "degraded" state, even if some parts of the filesystem are physically inaccessible. Unaffected subvolumes continue to function normally, and built-in rescue tooling allows reading large parts of the remaining data off largely broken storage arrays.
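A sketch of the recovery paths mentioned above; device paths and mount points are hypothetical:

```sh
# Replace a failed device, then rebalance so data is spread evenly
# across the remaining disks again.
btrfs replace start /dev/sdc /dev/sdf /mnt/pool
btrfs balance start /mnt/pool

# If the filesystem is too damaged for normal use, try mounting it
# read-only in degraded mode ...
mount -o degraded,ro /dev/sdb /mnt/pool

# ... or, with the filesystem unmounted, copy surviving files off the
# broken device with the built-in rescue tool.
btrfs restore /dev/sdb /mnt/recovery
```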
tl;dr ZFS has more robust data integrity but can be difficult to impossible to recover from catastrophic failure. Btrfs has higher risk of checksum corruption but much better support to partially survive or rescue remaining data from heavily damaged filesystems.
Encryption and deduplication
Encryption and deduplication are natively built into ZFS, offering a streamlined experience without external dependencies. Encryption is applied per dataset, so only one key is needed to unlock the encrypted filesystem even if it spans multiple physical devices, but encryption has to be enabled when creating a dataset and can't be enabled retroactively. Deduplication on the other hand can be freely enabled on existing datasets and automatically removes duplicate data blocks across all devices in a dataset during writes (in-band / inline deduplication). This keeps operational complexity minimal, but adds some write overhead and constant memory pressure, occasionally with potentially large memory usage spikes.
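A minimal sketch; the dataset names (tank/secure, tank/backups) are hypothetical:

```sh
# Encryption must be chosen at dataset creation time; this prompts
# for a passphrase.
zfs create -o encryption=on -o keyformat=passphrase tank/secure

# Inline deduplication can be toggled on an existing dataset, but it
# only affects data written after it is enabled.
zfs set dedup=on tank/backups
```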
Btrfs doesn't provide either of these features natively, instead relying on external tools. Encryption is handled by the standardized LUKS mechanism for disk encryption, which requires encrypting every device in the array first and unlocking/opening each of them individually before the Btrfs filesystem can be mounted. There are workarounds to encrypt only a subvolume using a loop device, but that approach can be unreliable and requires manual management.
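A rough sketch of the LUKS-under-Btrfs layering, with hypothetical device and mapper names:

```sh
# Encrypt each backing device and open them individually.
cryptsetup luksFormat /dev/sdb
cryptsetup luksFormat /dev/sdc
cryptsetup open /dev/sdb crypt_b
cryptsetup open /dev/sdc crypt_c

# Build the filesystem on top of the mapped devices.
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt_b /dev/mapper/crypt_c

# On every boot: open all devices again before mounting.
mount /dev/mapper/crypt_b /mnt/pool
```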
Deduplication is handled by external tools like duperemove or bees, which have to be run manually or at scheduled intervals; they read the written data and remove duplicate extents (variable-size chunks of data) from individual files, subvolumes, or an entire filesystem. Deduplicating data outside of write operations (out-of-band / offline) means writes incur less overhead, but deduplication jobs have to be scheduled and put extra load on the storage layer while running (large read operations, potentially removing many extents at once). The trade-off can be adjusted via the extent (data chunk) size the operation works on, or by only targeting specific filesystem contents, to better balance deduplication gains against operational cost. Tools like bees can instead run continuously to reduce the cost per run at the price of constant overhead, with performance tradeoffs similar to ZFS.
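For illustration, a scheduled duperemove run might look like the following sketch; the paths are hypothetical:

```sh
# Scan a directory tree recursively (-r), hash extents and submit
# duplicate ranges to the kernel for deduplication (-d).
duperemove -dr /mnt/pool/@data

# Keep a hash file so repeated runs only re-hash changed files.
duperemove -dr --hashfile=/var/cache/duperemove.hash /mnt/pool/@data
```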
tl;dr ZFS has built-in encryption and deduplication, but with high memory/CPU overhead on writes. Btrfs lacks native encryption and deduplication, relying on external tools like LUKS, duperemove or bees that can be harder to manage or cause high read load during maintenance.
Compression
Data compression in ZFS is handled at the block level and can be configured per dataset or for an entire pool. Data is chunked into blocks (128 KiB by default) before being written to disk and compressed indiscriminately. Since a block does not know or care how many files are stored in it, compression can be applied across hundreds of tiny files, allowing much better compression ratios, but it may also struggle to efficiently compress large files that span multiple blocks.
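A minimal sketch of per-dataset compression settings; the dataset names are hypothetical:

```sh
# lz4 is cheap, zstd trades more CPU for better ratios.
zfs set compression=lz4 tank/data
zfs set compression=zstd tank/logs

# Only data written after enabling it is compressed; check the
# achieved ratio per dataset.
zfs get compressratio tank/data tank/logs
```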
Btrfs supports compression for an entire filesystem, specific directories or individual files. This allows finer control over compression in environments with changing or unpredictable workloads, for example turning compression off for virtual machine disks or applying heavier compression to large text-based log files. Marking a directory or filesystem merely enables compression implicitly for all contained files, but each file is still compressed individually, so such workloads compress much worse than with block-level compression. Properly tuning compression settings for individual filesystem contents requires careful planning and an understanding of the stored data and directory structure (good for e.g. controlled servers, bad for storing unpredictable upload data).
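A sketch of the different scopes; device paths, mount points and directories are hypothetical:

```sh
# Filesystem-wide compression via mount option (zstd level 3 here).
mount -o compress=zstd:3 /dev/sdb /mnt/pool

# Per-directory compression property; only newly written data is
# affected.
btrfs property set /mnt/pool/logs compression zstd

# Recompress existing files in place if needed.
btrfs filesystem defragment -r -czstd /mnt/pool/logs
```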
tl;dr ZFS compresses data at the block level (default 128KB), good for many small files but less effective for large files. Btrfs offers compression per filesystem/directory/file, can tune compression per file for better ratios but worse performance for many small files.