Common pitfalls running docker in production


Docker made it easy to build and deploy applications, but it also introduced new challenges when deploying multiple long-running containers on a single host. We've compiled a list of the most common issues and how to avoid them.

Set log size constraints

By default, Docker containers use the json-file logging driver. This driver writes all log output from stdout and stderr to JSON files under /var/lib/docker/containers/<container-id>/<container-id>-json.log. There is no default size limit, meaning a container in a crash loop will happily write gigabytes of error messages into its log file until the entire hard drive fills up (which may grind the entire system to a halt). To prevent this, the default logging driver has built-in options to limit the maximum size per log file (max-size) and the maximum number of log files kept (max-file). Running a container with adjusted logging options could look like this:

docker run -it --log-driver json-file --log-opt max-size=10m --log-opt max-file=3 alpine sh
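These flags only apply to a single container. To apply the same limits to all newly created containers, the defaults can also be set in the daemon configuration; a minimal sketch of /etc/docker/daemon.json could look like this:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Restart the docker daemon afterwards (for example with systemctl restart docker). Note that the new defaults only apply to containers created after the change; existing containers keep their old logging settings until they are recreated.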

Container updates and version incompatibilities

Keeping Docker container images updated is a given, and easily automated in minutes using tools like watchtower.

But blindly updating can be a problem. Let's assume you have a PostgreSQL container running from the image postgres:latest. This tag always points to the latest version available, say 14.5. If a new major version 15.0 is released some time later, updating your local container will jump the image from Postgres 14 to 15 – an upgrade Postgres can't perform without migration steps in between, putting your container into a crash loop. For this reason, Postgres exposes major version tags such as 15: a container using the image postgres:15 will happily update to new releases of the v15.x line, but never upgrade to a new major release like v16. Always check the versioning strategy and upgrade path of the software when choosing the image and tag for a container.
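As a sketch, running a container pinned to a major version tag could look like this (the container name and password are placeholders to adapt):

docker run -d --name db -e POSTGRES_PASSWORD=change-me postgres:15 # stays on the 15.x line across updates, never jumps to 16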

Clean unused data

Long running docker daemons accumulate leftover data over time, such as old images and layers, unused networks, stopped containers and unused volumes.

These can take up a significant amount of storage over time and should be cleaned up regularly. To make this process easier, docker ships with the prune command to automatically find and remove these unused parts. A full system cleanup can look as simple as one command:

docker system prune # cleans dangling images, stopped containers and unused networks
docker system prune -a # same as above, but also removes unused images (not only dangling ones)

Note that this skips volumes to prevent accidental data loss; use docker volume prune for those.

You can also clean each part of docker separately:

docker container prune # remove stopped containers
docker network prune # remove unused networks
docker volume prune # remove unused volumes
docker image prune # remove dangling container images
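To see how much space each resource type is actually taking up before and after cleaning, docker ships with a built-in disk usage summary:

docker system df # space used by images, containers, volumes and build cache
docker system df -v # verbose per-object breakdown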

Remember to automate cleanup commands, for example with a cron job that runs them periodically. If you need more control over data retention, the --filter option might be a good place to start.
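As an example, a nightly cron entry combining prune with a retention filter might look like this (the schedule and retention window are assumptions to adapt; depending on your distribution you may need the full path to the docker binary):

0 3 * * * docker system prune -af --filter "until=24h" # every night at 3:00, remove unused data older than 24 hours (-f skips the confirmation prompt)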

Set a restart policy

Deciding what happens when a container stops can be automated using restart policies. The following restart policies are available to containers:

  • no (default): never restart this container automatically
  • always: always restart this container, no matter why it stopped
  • unless-stopped: always restart this container unless it was stopped manually
  • on-failure[:max-retries]: restart this container if it stopped with a non-zero exit code, optionally giving up after max-retries attempts

In production, you want either unless-stopped or on-failure. The reason you don't want always is simple: it will ignore the reason a container was stopped. Assume you run a container with restart policy always, then decide to stop it for some time (maybe a dependency is unavailable or you are debugging something). While the container is stopped, the server crashes (or reboots). The daemon will now restart this container even though you had it stopped before the reboot, as always does not care why a container stopped.
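Applying a policy is a single flag at creation time, and existing containers can be changed in place; for example (my-container is a placeholder name):

docker run -d --restart unless-stopped nginx # create a container with a restart policy
docker update --restart unless-stopped my-container # change the policy of an existing container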
