GPU Autoscaling on Kubernetes with KEDA External Scalers
The $10,000-a-Month Idle GPU Problem
A single A100 GPU node on a major cloud provider runs roughly $3–5 per hour. Leave eight of them sitting idle overnight because your autoscaler doesn’t know the inference queue is empty, and you’ve burned $300 before breakfast. Multiply that across a week of unpredictable ML workloads and the bill becomes a board-level conversation.
The root cause is almost always the same: teams reach for the Horizontal Pod Autoscaler (HPA) out of habit, wire it to CPU utilization, and discover too late that GPU inference pods are mostly waiting—not computing. CPU stays near zero while the queue piles up, and HPA never fires.
KEDA (Kubernetes Event-Driven Autoscaling) solves this at the source by letting you scale on the signal that actually matters: queue depth, pending jobs, or any custom metric your platform already produces. Its external scaler interface takes this one step further, letting you build a purpose-fit scaler when no off-the-shelf trigger fits your stack.
How KEDA Works (the 60-Second Version)
KEDA does not replace the standard HPA; it drives it. For each ScaledObject you define, KEDA creates and manages an HPA, polls a scaler for the current metric value (every 30 seconds by default, via pollingInterval), and serves that value to the HPA through the external metrics API. The HPA then adjusts replica counts as usual, KEDA itself handles the step between zero and one replica, and Cluster Autoscaler or Karpenter handles adding and removing the underlying GPU nodes.
KEDA ships with 50+ built-in scalers—Redis lists, SQS queues, Kafka lag, Prometheus queries, and more. When your metric lives somewhere nonstandard—say, a proprietary model-serving orchestrator, a Triton Inference Server queue, or a custom database table—you build an external scaler.
The External Scaler Contract
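An external scaler is a gRPC server you run inside the cluster. For the external trigger type, KEDA calls three methods on it (the gRPC service also defines a fourth, StreamIsActive, which only the external-push trigger uses):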
- IsActive: returns true if there is any work pending. KEDA uses this to decide whether to scale from zero to one replica (the hardest part of cold-start autoscaling).
- GetMetricSpec: returns the metric name and target value per replica (e.g., “process 5 jobs per GPU pod”).
- GetMetrics: returns the current metric value (e.g., 47 jobs currently in queue).
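The HPA that KEDA manages divides the current value by the per-replica target and rounds up to get the desired replica count, then clamps it to the configured minimum and maximum. That’s essentially the entire scaling math.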
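The division above can be sketched in a few lines of Go. This is an illustrative helper rather than anything KEDA or the HPA exposes: the function name and parameters are hypothetical, and the real HPA also applies a tolerance band and scaling policies that are left out here.

```go
package main

import "fmt"

// desiredReplicas is a hypothetical helper that mirrors the calculation
// described in the text: ceil(current / targetPerReplica), clamped to
// [minReplicas, maxReplicas]. It is not a KEDA or HPA API.
func desiredReplicas(current, targetPerReplica, minReplicas, maxReplicas int64) int64 {
	if targetPerReplica <= 0 || current <= 0 {
		// Nothing to do, or an invalid target: fall back to the floor.
		return minReplicas
	}
	// Integer ceiling division, avoiding floating point.
	n := (current + targetPerReplica - 1) / targetPerReplica
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	fmt.Println(desiredReplicas(47, 5, 0, 20))  // 47 jobs at 5 per pod
	fmt.Println(desiredReplicas(0, 5, 0, 20))   // empty queue
	fmt.Println(desiredReplicas(500, 5, 0, 20)) // capped by maxReplicas
}
```

With 47 queued jobs and a target of 5 per replica, the helper asks for 10 replicas; an empty queue falls to the minimum, and a very deep queue is capped at the maximum.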
A minimal Go implementation of GetMetrics looks like this:
    func (s *Server) GetMetrics(ctx context.Context, req *pb.GetMetricsRequest) (*pb.GetMetricsResponse, error) {
        depth, err := s.queue.Len(ctx)
        if err != nil {
            return nil, err
        }
        return &pb.GetMetricsResponse{
            MetricValues: []*pb.MetricValue{{
                MetricName:  "inference-queue-depth",
                MetricValue: depth,
            }},
        }, nil
    }
Deploy it as a Deployment with a Service, then reference it in your ScaledObject:
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: inference-gpu-scaler
    spec:
      scaleTargetRef:
        name: inference-worker
      minReplicaCount: 0
      maxReplicaCount: 20
      triggers:
        - type: external
          metadata:
            scalerAddress: gpu-queue-scaler.ml-platform:9090
            jobsPerReplica: "5"
minReplicaCount: 0 is key—it enables true scale-to-zero when the queue drains.
Wiring in the GPU Node Pool
Scaling pods is only half the story. GPU nodes themselves must come and go. The two main options:
Cluster Autoscaler
Label your GPU node group and configure taints so only GPU-requesting pods land there:
    # Node group taint
    key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
When KEDA scales the inference-worker Deployment up and new pods request nvidia.com/gpu: 1, Cluster Autoscaler sees unschedulable pods and provisions nodes from the GPU group. The pods need a toleration matching the taint above, or they will stay Pending. When pods scale to zero, Cluster Autoscaler drains and terminates the idle nodes once they have been unneeded for the configured --scale-down-unneeded-time (10 minutes by default).
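For the worker pods to land on the tainted GPU nodes, their pod template needs both the GPU request and a matching toleration. The fragment below is a sketch of just those fields; the container name and image are placeholders, not values from this setup.

```yaml
# Fragment of the inference-worker Deployment's pod template (illustrative).
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: worker                                          # placeholder name
          image: registry.example.com/inference-worker:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```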
Karpenter (preferred on AWS)
Karpenter reacts faster and supports mixed instance types. Define a NodePool targeting GPU instances:
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: gpu-inference
    spec:
      template:
        spec:
          # karpenter.sh/v1 requires a nodeClassRef; "default" is a placeholder name.
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default
          requirements:
            - key: karpenter.k8s.aws/instance-family
              operator: In
              values: ["g5", "p3", "p4d"]
            - key: nvidia.com/gpu
              operator: Exists
      limits:
        nvidia.com/gpu: "64"
      disruption:
        consolidationPolicy: WhenEmpty
        consolidateAfter: 5m
consolidateAfter: 5m means Karpenter removes a fully idle GPU node five minutes after its last pod exits—a meaningful lever for controlling your cloud bill.
Handling the Cold-Start Gap
The biggest operational headache with GPU autoscaling is cold-start latency. Spinning up a GPU node from scratch—driver initialization, image pull, model load—can take 3–8 minutes. Strategies to close that gap:
Warm pool / placeholder pods: Keep one low-priority placeholder pod on a GPU node at all times. It requests the GPU but does no work. When real inference traffic arrives, KEDA scales up workers and the node is already warm. Give the placeholder a PriorityClass lower than your workers’ so the scheduler preempts it as soon as a real worker needs the GPU; a PodDisruptionBudget would do the opposite and protect it from eviction.
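DaemonSet-based image pre-pulling: Run a DaemonSet on GPU nodes that pre-pulls the large inference images and, if needed, stages model weights on node-local storage through a hostPath volume. On a node that is already running, first-pod startup can drop from minutes to seconds; this does not shorten node provisioning itself.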
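Scale-down stabilization window: Prevent thrashing by having the HPA act on the highest replica recommendation from the last two minutes before it scales down. This governs scaling between one and N replicas; the final step to zero is controlled separately by the ScaledObject’s cooldownPeriod (300 seconds by default). Set the window under spec.advanced in the ScaledObject: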
    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleDown:
            stabilizationWindowSeconds: 120
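The warm-pool strategy can be sketched with a negative-priority PriorityClass and a one-replica Deployment of pause containers. Everything named here (gpu-placeholder, gpu-warm-placeholder, the priority value, the pause image tag) is an illustrative assumption, and the toleration assumes the nvidia.com/gpu taint shown earlier.

```yaml
# Illustrative only: names, priority value, and image tag are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-placeholder
value: -10            # below the default priority of 0 used by ordinary workers
globalDefault: false
description: "Preemptible placeholder that keeps one GPU node warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-warm-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-warm-placeholder
  template:
    metadata:
      labels:
        app: gpu-warm-placeholder
    spec:
      priorityClassName: gpu-placeholder
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            limits:
              nvidia.com/gpu: 1   # holds the GPU until a real worker preempts it
```

Once preempted, the placeholder goes Pending and can itself trigger a new node, so this pattern trades the cost of roughly one spare GPU for a warm start.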
Metric Design: What to Actually Measure
The external scaler’s power is in metric choice. Three patterns work well for ML inference:
- Raw queue depth / jobs per replica: Simple and reliable. Works for any queue-backed inference system (SQS, Redis Streams, RabbitMQ). Set jobsPerReplica to however many concurrent requests a single GPU pod can handle at acceptable P99 latency.
- Pending-to-running ratio: If your orchestrator tracks job states, scaling on pending / max_concurrency_per_pod avoids over-provisioning during bursty batch arrivals.
- Triton queue time: NVIDIA Triton exposes nv_inference_queue_duration_us via Prometheus. Feed this into a KEDA Prometheus scaler or external scaler to scale on observed wait time rather than raw count, which is a tighter SLO proxy.
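For the Triton pattern, note that nv_inference_queue_duration_us is a cumulative counter, so a scaler needs the change between two scrapes divided by the number of requests in that interval. The sketch below assumes you pair it with a request counter such as nv_inference_request_success; the Sample type and function name are hypothetical.

```go
package main

import "fmt"

// Sample is one scrape of two cumulative Triton counters (hypothetical type).
type Sample struct {
	QueueDurationUs float64 // value of nv_inference_queue_duration_us
	RequestCount    float64 // value of a request counter, e.g. nv_inference_request_success
}

// avgQueueMicros returns the mean queue wait per request between two scrapes.
// ok is false when no requests completed in the interval or when a counter
// went backwards, which usually means the server restarted.
func avgQueueMicros(prev, cur Sample) (avg float64, ok bool) {
	dQueue := cur.QueueDurationUs - prev.QueueDurationUs
	dReq := cur.RequestCount - prev.RequestCount
	if dReq <= 0 || dQueue < 0 {
		return 0, false
	}
	return dQueue / dReq, true
}

func main() {
	prev := Sample{QueueDurationUs: 1_000_000, RequestCount: 200}
	cur := Sample{QueueDurationUs: 1_600_000, RequestCount: 260}
	avg, ok := avgQueueMicros(prev, cur)
	fmt.Println(avg, ok) // 600000 us over 60 requests
}
```

An external scaler could return this average from GetMetrics and use a target such as "at most N microseconds of queue wait" in GetMetricSpec, which ties scaling to latency rather than backlog size.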
What This Looks Like End-to-End
A realistic ML platform flow with this setup:
- A user submits an image-generation job; it lands in a Redis Stream.
- Your external scaler’s GetMetrics reads XLEN inference-jobs and returns depth 47.
- With jobsPerReplica: 5, KEDA targets 10 replicas. Current count: 2. Delta: +8.
- Eight new inference-worker pods are created, each requesting nvidia.com/gpu: 1.
- Karpenter sees unschedulable pods and provisions two g5.12xlarge nodes (4 GPUs each).
- Pods schedule, models load, jobs drain. Queue depth falls to 0.
- IsActive returns false. KEDA scales to zero. Five minutes later, Karpenter consolidates the now-empty nodes.
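Total idle GPU time: about ten minutes with default settings (KEDA’s 300-second cooldownPeriod before the last replica is removed, then Karpenter’s five-minute consolidateAfter), and less if you tune either down. Compare that to a static node pool running 24/7.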
Cost Reality Check
For a team running 100 GPU-hours of actual inference per day on a workload with 14 hours of genuine idle time, eliminating idle nodes saves roughly 56 GPU-hours daily, a figure that corresponds to about four GPUs sitting idle across those 14 hours. At $3.50/hr per A100 equivalent, that’s ~$196/day or ~$5,900/month from a scaler that takes an afternoon to build and deploy.
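The arithmetic behind those figures is easy to rerun with your own numbers. The sketch below treats 56 GPU-hours as four GPUs idle for 14 hours and a month as 30 days; both are assumptions made here for illustration, and the function name is hypothetical.

```go
package main

import "fmt"

// idleSavings returns the GPU-hours per day, dollars per day, and dollars per
// month recovered by removing idle GPUs. All inputs are caller-supplied
// assumptions; nothing here is measured.
func idleSavings(idleGPUs, idleHoursPerDay, ratePerGPUHour float64, daysPerMonth int) (gpuHours, perDay, perMonth float64) {
	gpuHours = idleGPUs * idleHoursPerDay
	perDay = gpuHours * ratePerGPUHour
	perMonth = perDay * float64(daysPerMonth)
	return gpuHours, perDay, perMonth
}

func main() {
	h, d, m := idleSavings(4, 14, 3.50, 30)
	fmt.Printf("%.0f GPU-hours/day, $%.0f/day, $%.0f/month\n", h, d, m)
}
```

Plugging in a larger idle fleet or a higher hourly rate scales the result linearly, which is why the savings grow quickly on bigger static pools.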
The external scaler interface is deliberately simple. The gRPC contract is three methods. The configuration is a YAML stanza. The operational leverage relative to the implementation cost is unusually high, which is why it’s one of the more worthwhile afternoons a platform team can spend.