Modern cloud-native deployments are often billed hourly and configured to scale up and down based on actual traffic, leading to a seemingly reasonable cost optimization: What if you could scale the workload to zero instances when there is no incoming traffic? Unfortunately, the technical reality isn't quite as simple, and a lot of hidden costs are ready to ruin the savings calculation.
Scaling to zero doesn't mean zero cost
Even though cloud workloads are billed by runtime, scaling to zero does not reduce cost proportionally. The gap comes from the implicit added costs of scaling and its side effects, driven by two primary factors: cold starts and increased complexity.
A cold start occurs when a request arrives while a deployment has zero running instances, forcing the platform to scale up to at least one. Software typically performs initialization on startup, and each setup task consumes CPU cycles and delays the response, making the first request handled by a new instance disproportionately more expensive than subsequent ones.
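A minimal sketch of this cost asymmetry, with sleeps standing in for real initialization and business logic (the class name and timings are illustrative, not benchmarks):

```python
import time

class Service:
    """Toy service that pays its initialization cost lazily, on first request."""

    def __init__(self):
        self._initialized = False

    def _initialize(self):
        time.sleep(0.2)  # placeholder for loading config, opening DB connections, ...
        self._initialized = True

    def handle(self, request):
        start = time.perf_counter()
        if not self._initialized:  # cold start: init work lands on the first request
            self._initialize()
        time.sleep(0.01)           # placeholder for the actual business logic
        return time.perf_counter() - start

svc = Service()
cold = svc.handle("first")   # pays init + work
warm = svc.handle("second")  # pays work only
print(f"cold={cold:.3f}s warm={warm:.3f}s")
```

The first request is roughly an order of magnitude slower here, and real initialization (JIT warm-up, connection handshakes, cache priming) is usually far more expensive than a fixed sleep.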
Complexity, on the other hand, adds costs outside of runtime bills. If a deployment can scale to zero, you need some kind of orchestration to manage it. If an incoming request needs to wait until a potentially zero-scaled deployment scales back up, you need to intercept every incoming request and check whether an instance can handle it. If a traffic spike arrives while a deployment is scaled to zero, the cold start delay may cause excessive upscaling, increasing platform cost. To top it off, someone needs to deal with all of this complexity, so you now need skilled employees just to manage the setup, adding their wages to the total cost.
Cold start latency
Let's look a little closer at the cold start problem specifically. What seems like a simple problem requires far more careful planning than most operators expect.
The obvious component is direct startup time from initialization, coming either from application logic (connecting to databases, loading configuration, etc.) or from language runtime setup (interpreter or JIT warm-up, JVM startup, etc.).
There are many more implicit factors in cold start times, for example:
Even if you don't manage the orchestrator yourself, the platform still needs to run scheduling logic to place a new instance, adding control plane requests and processing to the latency, a contribution that is often unpredictable.
Container-based systems may need to pull an image before an instance can start on a node in a cluster like Kubernetes. Even if the image is already available locally, they often have to check whether it is the latest desired version, adding an HTTP request to the startup latency before your logic even starts.
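In Kubernetes specifically, this registry round-trip can be avoided by pinning an image tag and setting the pull policy accordingly; a hypothetical pod spec fragment (image name and tag are illustrative):

```yaml
# With IfNotPresent, the kubelet skips the registry check whenever the image
# is already cached on the node, removing one network round-trip from cold
# starts. Note that :latest tags default to Always, which re-checks the
# registry on every container start.
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.4.2   # pinned tag, not :latest
    imagePullPolicy: IfNotPresent
```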
Starting an application under the pressure of a waiting request means you need to know as quickly as possible when that app is ready to handle connections. Knowing this comes with a tradeoff: either you use aggressive health checks, which add computational overhead during an already resource-intensive startup, or you add custom logic that notifies the platform from within the app after startup, adding operational complexity instead.
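The second option, signaling readiness from inside the app, can be sketched as a flag the startup routine flips once, paired with a probe endpoint that stays cheap enough to poll aggressively (the class and method names here are illustrative):

```python
import threading
import time

class App:
    """Toy app that signals its own readiness after startup work completes."""

    def __init__(self):
        self.ready = threading.Event()

    def start(self):
        # Run startup work off the main thread so probes can respond immediately.
        threading.Thread(target=self._initialize, daemon=True).start()

    def _initialize(self):
        time.sleep(0.05)  # placeholder for config loading, DB connections, ...
        self.ready.set()  # flip the flag exactly once, after init completes

    def probe(self):
        # The probe does no real work, so frequent polling stays affordable.
        return 200 if self.ready.is_set() else 503

app = App()
app.start()
print(app.probe())          # 503 while still initializing
app.ready.wait(timeout=2)   # a platform would poll the probe instead of blocking
print(app.probe())          # 200 once ready
```

The tradeoff from the text shows up directly: the probe itself is cheap, but the application now carries platform-integration logic that has nothing to do with its business domain.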
All of this combined makes up the user-visible latency on first load, and it may well hit limits outside your control, like browser timeouts or, even worse, your users' patience and trust. The reputational damage of users perceiving your software as low quality because of long load times translates directly into missed revenue and an increase in IT support tickets.
Implicit performance regression
Many applications keep some kind of runtime state for performance. Depending on what the software does, that may be anything from caches of frequently accessed data to connection pools for outbound databases and APIs or incoming HTTP connections. When the instance is shut down, this ephemeral state is lost, along with its performance benefits.
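A small sketch of the effect using an in-process cache: the cache amortizes an expensive lookup while the instance lives, but a restart wipes it and the next request pays the full miss cost again (`fetch_price` and its timings are illustrative stand-ins):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def fetch_price(product_id):
    time.sleep(0.05)        # placeholder for a slow database or API call
    return product_id * 2   # dummy result

def timed(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

first = timed(fetch_price, 42)   # cache miss: pays the full lookup cost
second = timed(fetch_price, 42)  # cache hit: served from memory

# A scale-to-zero restart discards the process and its cache with it.
fetch_price.cache_clear()        # simulating instance shutdown
after_restart = timed(fetch_price, 42)  # back to paying the miss cost

print(f"miss={first:.3f}s hit={second:.3f}s post-restart={after_restart:.3f}s")
```

Connection pools behave the same way: the TCP and TLS handshakes they amortize must all be re-established after every cold start.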
The problem is only compounded when viewed from a cost-saving perspective: web applications are often not the components causing idle load; data management systems like databases and caches are. Scaling only the web application to zero offers minimal savings compared to scaling its dependencies down with it, but restarting a database forcibly re-runs expensive integrity checks and cache refreshes. Additionally, databases don't consume resources on idle for fun: they run important background and maintenance tasks that optimize future query performance and storage footprint, so shutting them down degrades database performance over time.
Staying at zero is difficult
Getting a service down to zero instances is manageable on modern platforms, but keeping it there is a different story. While incoming traffic is what causes a workload to scale back up, it may not originate from human users alone: backups, analytics queries, health checks, web scrapers/bots or internal tooling could all access the application and cause an expensive cold start for fairly little value.
More complex architecture can combat these to some extent: filtering web traffic for non-human requests, limiting new version deployments to specific schedules, and building monitoring and analytics systems around sparse data instead of regular polling intervals.
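The first of those mitigations, filtering non-human traffic before it can wake a zero-scaled backend, might look like the sketch below. The token list and function name are illustrative; real bot classification is considerably harder and is itself a source of the overhead discussed next:

```python
# Substrings commonly found in automated clients' User-Agent headers.
BOT_TOKENS = ("bot", "crawler", "spider", "scraper", "monitoring")

def should_wake_backend(user_agent):
    """Return False for requests that should not trigger a cold start."""
    ua = (user_agent or "").lower()
    if not ua:
        return False  # a missing User-Agent is rarely a human browser
    return not any(token in ua for token in BOT_TOKENS)

print(should_wake_backend("Mozilla/5.0 (Windows NT 10.0) Firefox/124.0"))     # True
print(should_wake_backend("Googlebot/2.1 (+http://www.google.com/bot.html)")) # False
print(should_wake_backend(""))                                                # False
```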
Such mitigations add more computing overhead to each request and slow down development productivity significantly, increasing total operational cost.
Intercepting incoming traffic
When an incoming request arrives while a workload has been scaled to zero instances, it needs to be halted until a new instance can handle it. Intercepting requests like this means deploying a reverse proxy service that checks with a database or control plane API for backend availability on every request, no matter how many instances actually exist.
By design, this adds more network hops and computing power for every request, increasing the cost per request permanently for the same business logic.
On a deeper layer, the increased complexity introduces more risks like reload/redirect loop errors, request timeouts before apps become ready and more difficult debugging when tracing errors through the larger component stack.
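The core interception logic can be sketched as follows: every request first consults the control plane, and a request arriving at zero instances triggers a scale-up and waits, risking exactly the timeout failure mode mentioned above. `ControlPlane` is a toy stand-in for a real orchestrator API, and the polling loop is deliberately simplified:

```python
import time

class ControlPlane:
    """Toy control plane; real platforms scale asynchronously."""

    def __init__(self):
        self.replicas = 0

    def scale_to(self, n):
        self.replicas = n

class Proxy:
    def __init__(self, control_plane, timeout=5.0):
        self.cp = control_plane
        self.timeout = timeout

    def forward(self, request):
        # This check runs for *every* request, even when instances already
        # exist: the permanent per-request overhead described above.
        if self.cp.replicas == 0:
            self.cp.scale_to(1)  # trigger the cold start
            deadline = time.monotonic() + self.timeout
            while self.cp.replicas == 0:
                if time.monotonic() > deadline:
                    return 504  # instance never became ready in time
                time.sleep(0.01)
        return 200  # hand the request to a live instance

cp = ControlPlane()
proxy = Proxy(cp)
print(proxy.forward("GET /"))  # 200 after the (instant, toy) scale-up
```

In production this check usually hits a shared database or control plane API over the network, which is where the extra hops and failure modes come from.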
Observability and operational overhead
Modern businesses rely heavily on metrics to plan for future capacity, assess growth and make decisions. But what if those metrics become spotty, unreliable or ambiguous? That's the reality of scaling to zero in many cases.
How do you tell the difference between a deployment scaled to zero and a failed deployment from metrics alone? You can't: the data no longer supports that inference, which breaks straightforward calculations of service uptime and reliability and delays the identification of quality regressions and logic errors.
Receiving metrics only while the application has traffic results in spotty data that is hard to reason about and breaks established analytics formulas, so data scientists need to treat the metrics and any insights derived from such services with extra care.
All these factors slow down decision making, reduce confidence in metrics-based assessments and delay insights, while adding work hours to make sense of the metrics at all.
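The ambiguity above is resolvable, but only by joining in an additional signal, such as the orchestrator's desired replica count, which is exactly the kind of extra metrics plumbing that adds work hours. A sketch of that distinction (all names here are illustrative):

```python
def classify(running_instances, desired_instances):
    """Classify service state from observed plus desired replica counts."""
    if running_instances > 0:
        return "up"
    if desired_instances == 0:
        return "scaled-to-zero"  # intentional: no traffic, nothing desired
    return "down"                # instances desired but none running: a failure

# From running_instances == 0 alone, these two states are indistinguishable:
print(classify(0, 0))  # scaled-to-zero
print(classify(0, 2))  # down
print(classify(3, 3))  # up
```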
When zero-scaling still makes sense
With all these drawbacks in mind, scaling to zero can be ruled out as a cost optimization strategy for the vast majority of workloads. That said, it is not entirely pointless either; you just need to understand the consequences. Instead of "saving money while idle", think of it as "trading idle-time cost for increased runtime cost and decreased user experience".
There are use cases that avoid the drawbacks while enjoying the benefits. Consider low-traffic tooling like wikis, admin or analytics UIs, or recurring batch workloads. They may be accessed once a day or less, so their low request volume saves money during idle times despite the increased cost per request, and reputation loss is a non-factor since only a subset of employees access them.
Internal tooling with higher request volume can instead be scaled to zero on a schedule rather than by traffic, saving runtime cost outside business hours and over the weekend while remaining fully functional when employees need it.
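The scheduling decision itself is simple; a sketch of a business-hours replica policy that a cron-driven job could feed into the platform's scaling API (the hours, weekday cutoffs, and function name are illustrative assumptions):

```python
from datetime import datetime

BUSINESS_START = 8   # 08:00 local time
BUSINESS_END = 18    # 18:00 local time

def desired_replicas(now, baseline=2):
    """Return the replica count the schedule wants at the given time."""
    if now.weekday() >= 5:  # Saturday (5) and Sunday (6): fully off
        return 0
    if BUSINESS_START <= now.hour < BUSINESS_END:
        return baseline
    return 0  # weekday nights: scale to zero

print(desired_replicas(datetime(2024, 3, 4, 10, 30)))  # Monday 10:30 -> 2
print(desired_replicas(datetime(2024, 3, 4, 22, 0)))   # Monday 22:00 -> 0
print(desired_replicas(datetime(2024, 3, 9, 12, 0)))   # Saturday noon -> 0
```

Because the schedule, not traffic, drives scaling, the first employee of the day never hits a cold start, while nights and weekends still cost nothing.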