Scale-to-Zero

Scale-to-Zero automatically scales your Services running on Standard CPU and GPU Instances down to zero when there is no incoming traffic. This lets you optimize your costs by paying only for the compute you actually use.

⚠️ Note: Scale-to-Zero is currently in public preview.

To enable Scale-to-Zero on your Services, you need to use a Standard CPU or GPU Instance and set the minimum number of Instances to zero.

When your Service receives no requests for a given period of time, its active Instances are automatically scaled down to zero and your Deployment is updated to the Sleeping status.

As soon as a new request is received, the Service wakes up and scales up to one or more Instances, depending on your autoscaling criteria.

How scale-to-zero works

Your Service will be scaled down to zero if all of the following conditions are met for a given period of time, called the idle period:

  • No traffic is received from the Internet.
  • No connection (e.g. WebSocket or HTTP/2 stream) is held open from the Internet to your Service.
  • No new deployment has occurred.
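
The three conditions can be read as a single check over the idle window. The following is a minimal sketch of that reading, not Koyeb's actual implementation: the `ServiceActivity` type, its field names, and the `can_scale_to_zero` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ServiceActivity:
    """Hypothetical snapshot of a Service's recent activity (not a Koyeb API type)."""
    last_request_at: datetime            # last traffic received from the Internet
    open_connections: int                # connections currently held open
    last_connection_closed_at: datetime  # when the last held connection ended
    last_deployment_at: datetime         # most recent deployment


def can_scale_to_zero(activity: ServiceActivity, now: datetime,
                      idle_period: timedelta = timedelta(minutes=5)) -> bool:
    """Return True only if all three conditions held for the whole idle period."""
    window_start = now - idle_period
    no_traffic = activity.last_request_at <= window_start
    no_held_connection = (activity.open_connections == 0
                          and activity.last_connection_closed_at <= window_start)
    no_new_deployment = activity.last_deployment_at <= window_start
    return no_traffic and no_held_connection and no_new_deployment
```

In this sketch, a single request, open connection, or deployment inside the window is enough to keep the Service awake.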

Idle period

The default idle period for Standard CPU and GPU Instances is 5 minutes.

If your Organization is on the Pro, Scale, or Enterprise plan, you can override the default idle period based on your needs in your Service configuration. This can be useful for low-traffic applications with slow start times, such as some machine learning workloads.

The Koyeb Free Instance automatically scales down to zero when it doesn’t receive any traffic for 1 hour. Scale-to-zero on this Instance cannot be disabled, and the idle period cannot be customized.
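
The rules above can be summarized as a small lookup. This is an illustrative sketch only: the string labels (`"free"`, `"standard"`, the lowercase plan names) and the function name are assumptions made for the example, and falling back to the default when a plan cannot override is also an assumption about how an ineligible override would be treated.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_STANDARD_IDLE = timedelta(minutes=5)  # Standard CPU and GPU Instances
FREE_INSTANCE_IDLE = timedelta(hours=1)       # Free Instance, fixed
PLANS_WITH_OVERRIDE = {"pro", "scale", "enterprise"}


def effective_idle_period(instance_kind: str, plan: str,
                          override: Optional[timedelta] = None) -> timedelta:
    """Return the idle period that applies to a Service (hypothetical helper)."""
    if instance_kind == "free":
        # The Free Instance idle period cannot be customized.
        return FREE_INSTANCE_IDLE
    if override is not None and plan.lower() in PLANS_WITH_OVERRIDE:
        return override
    # Assumption: other plans simply keep the default.
    return DEFAULT_STANDARD_IDLE
```

For example, a slow-starting machine learning Service on the Pro plan could pass a longer `override` so that short gaps in traffic do not trigger a cold start.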

When to use scale-to-zero?

Scale-to-zero is ideal for a wide range of use cases that involve handling intermittent traffic, like:

  • Inference Efficiency: Inference is compute intensive. You need high-performance GPUs to answer requests quickly, but you might only need them for a couple of minutes every few hours. Scale-to-zero dramatically improves costs and efficiency for inference tasks with intermittent traffic, without infrastructure management.
  • Dedicated Services for Multi-Tenant SaaS and Platforms: Scale-to-zero allows you to deploy dedicated and isolated services per tenant with controlled performance and costs. Operate fleets with thousands of services, paying only for real usage.
  • Infinite Development Environments: Software engineering teams need environments identical to production to run integration tests. Creating dozens of services to replicate your production is now cost-effective thanks to scale-to-zero and our automation tools (API, CLI, Terraform, Pulumi). Every developer in your team can have a full replica of the production setup, billed per second of usage.
  • Compute Efficiency: For apps with high CPU demands but intermittent traffic, scale-to-zero automatically optimizes your infrastructure and costs.
  • Global Deployments: Multi-region deployment can quickly become expensive. With scale-to-zero you can deploy globally without incurring a base fee for each additional region that you add.
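
To get a feel for the savings on intermittent traffic, the arithmetic can be sketched as follows. The traffic pattern (2 minutes of requests every 3 hours) is an invented example, and the sketch assumes that an Instance keeps running, and is therefore billed, for the idle period after each burst; check your actual billing before relying on these numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def billed_seconds_per_day(bursts_per_day: int, burst_seconds: int,
                           idle_period_seconds: int) -> int:
    """Seconds an Instance runs per day with scale-to-zero (hypothetical model).

    Assumption: each burst keeps the Instance up for the burst itself plus
    one idle period before it scales down.
    """
    return bursts_per_day * (burst_seconds + idle_period_seconds)


# Invented example: 8 bursts of 2 minutes each, default 5-minute idle period.
with_scale_to_zero = billed_seconds_per_day(8, 120, 300)  # 3360 seconds
always_on = SECONDS_PER_DAY                               # 86400 seconds
usage_fraction = with_scale_to_zero / always_on           # roughly 0.039
```

Under these assumptions the Service runs for less than 4% of the day compared with an always-on Instance.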

Limitations

  • Inbound requests to a sleeping Service may be slower because of a cold start: creating a new dedicated virtual machine typically takes 1 to 5 seconds.
  • Scale-to-zero works only for Services exposed to the Internet.
  • HTTP/2 requests cannot be used to wake up a sleeping Service.
  • You can wake a Service up using a WebSocket connection, but that connection may only live for a few minutes.
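
On the client side, these limitations suggest waking a sleeping Service over HTTP/1.1 with a timeout comfortably above the cold start time. The sketch below is one possible approach using Python's standard library, whose `urllib.request` speaks HTTP/1.1. The function names, the retry count, and the backoff values are assumptions chosen for illustration, and retrying is a defensive choice rather than something the platform requires.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_http1(url: str, timeout: float) -> int:
    """Send a GET over HTTP/1.1 with urllib and return the status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def request_with_cold_start(url: str,
                            fetch: Callable[[str, float], int] = fetch_http1,
                            attempts: int = 3,
                            timeout: float = 10.0,
                            backoff: float = 1.0) -> int:
    """Call fetch, retrying on timeouts and connection or 5xx errors."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch(url, timeout)
        except urllib.error.HTTPError as error:
            if error.code < 500:
                raise  # client errors are not related to waking up
            last_error = error
        except OSError as error:  # includes URLError and timeouts
            last_error = error
        if attempt < attempts - 1:
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"no response after {attempts} attempts") from last_error
```

A call such as `request_with_cold_start("https://your-service.example")` (a placeholder URL) tolerates the first slow response. The `fetch` parameter exists so the retry logic can be exercised without network access.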