GPU Autoscaling on Kubernetes with KEDA External Scalers

The $10,000-a-Month Idle GPU Problem

A single A100 GPU node on a major cloud provider runs roughly $3–5 per hour. Leave eight of them sitting idle overnight because your autoscaler doesn’t know the inference queue is empty, and you’ve burned $300 before breakfast. Multiply that across a week of unpredictable ML workloads and the bill becomes a board-level conversation.

The root cause is almost always the same: teams reach for the Horizontal Pod Autoscaler (HPA) out of habit, wire it to CPU utilization, and discover too late that GPU inference pods are mostly waiting—not computing. CPU stays near zero while the queue piles up, and HPA never fires.

KEDA (Kubernetes Event-Driven Autoscaling) solves this at the source by letting you scale on the signal that actually matters: queue depth, pending jobs, or any custom metric your platform already produces. Its external scaler interface takes this one step further, letting you build a purpose-fit scaler when no off-the-shelf trigger fits your stack.

How KEDA Works (the 60-Second Version)

KEDA works with the standard HPA rather than replacing it. It watches the ScaledObject resources you define, creates and manages an HPA for each one, periodically polls a scaler for the current metric value, and serves that value to the HPA as an external metric. The HPA then adjusts replica counts as usual, KEDA itself handles the step to and from zero replicas, and Cluster Autoscaler or Karpenter handles adding and removing the underlying GPU nodes.

KEDA ships with 50+ built-in scalers—Redis lists, SQS queues, Kafka lag, Prometheus queries, and more. When your metric lives somewhere nonstandard—say, a proprietary model-serving orchestrator, a Triton Inference Server queue, or a custom database table—you build an external scaler.

The External Scaler Contract

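An external scaler is a gRPC server you run inside the cluster. For the external trigger type, KEDA calls three methods on it (the service definition also includes StreamIsActive, which only the external-push trigger uses):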

  • IsActive — returns true if there is any work pending. KEDA uses this to decide whether to scale from zero to one replica (the hardest part of cold-start autoscaling).
  • GetMetricSpec — returns the metric name and target value per replica (e.g., “process 5 jobs per GPU pod”).
  • GetMetrics — returns the current metric value (e.g., 47 jobs currently in queue).

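The HPA that KEDA manages divides the current value by the per-replica target and rounds up to get the desired replica count: 47 queued jobs at 5 jobs per pod gives 10 replicas. That’s the entire scaling math.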
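The division above can be sketched in a few lines of Go. This is an illustration only: the function name and signature are made up for this example, and the real HPA also applies a tolerance band and stabilization rules that are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the average-value rule: divide the total metric by
// the per-replica target, round up, then clamp to the configured bounds.
// The name and signature are illustrative, not part of any KEDA or HPA API.
func desiredReplicas(metric, targetPerReplica float64, minReplicas, maxReplicas int) int {
	if targetPerReplica <= 0 {
		return minReplicas // guard against a misconfigured target
	}
	n := int(math.Ceil(metric / targetPerReplica))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	fmt.Println(desiredReplicas(47, 5, 0, 20))  // 47 jobs at 5 per pod: prints 10
	fmt.Println(desiredReplicas(0, 5, 0, 20))   // empty queue: prints 0
	fmt.Println(desiredReplicas(500, 5, 0, 20)) // capped by the maximum: prints 20
}
```

Rounding up rather than down is what keeps the last few jobs from waiting behind a full pod.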

A minimal Go implementation looks like this:

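// GetMetrics reports the current queue depth to KEDA.
// s.queue stands for whatever client your queue backend provides.
func (s *Server) GetMetrics(ctx context.Context, req *pb.GetMetricsRequest) (*pb.GetMetricsResponse, error) {
    depth, err := s.queue.Len(ctx)
    if err != nil {
        // Returning the error lets KEDA log the failure and retry on its next poll.
        return nil, err
    }
    return &pb.GetMetricsResponse{
        MetricValues: []*pb.MetricValue{{
            // Must match the metric name returned by GetMetricSpec.
            MetricName:  "inference-queue-depth",
            // The generated field is an int64, so convert explicitly.
            MetricValue: int64(depth),
        }},
    }, nil
}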

Deploy it as a Deployment with a Service, then reference it in your ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-gpu-scaler
spec:
  scaleTargetRef:
    name: inference-worker
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: external
    metadata:
      scalerAddress: gpu-queue-scaler.ml-platform:9090
      jobsPerReplica: "5"

minReplicaCount: 0 is key—it enables true scale-to-zero when the queue drains.
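The jobsPerReplica value in the trigger metadata reaches the scaler as a string inside the request's ScalerMetadata map. The sketch below shows one way the other two handlers could use it. To stay self-contained it declares small stand-in structs instead of importing the generated protobuf package, and the helper names (targetFromMetadata, metricSpec, isActive) and the fallback default are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// ScaledObjectRef and MetricSpec are stand-ins that mirror only the fields
// used here. In a real scaler they come from the Go code generated from
// KEDA's externalscaler.proto.
type ScaledObjectRef struct {
	Name           string
	Namespace      string
	ScalerMetadata map[string]string
}

type MetricSpec struct {
	MetricName string
	TargetSize int64
}

// defaultJobsPerReplica is a hypothetical fallback for a missing key.
const defaultJobsPerReplica = 5

// targetFromMetadata parses the jobsPerReplica key from the trigger metadata.
func targetFromMetadata(ref ScaledObjectRef) (int64, error) {
	raw, ok := ref.ScalerMetadata["jobsPerReplica"]
	if !ok {
		return defaultJobsPerReplica, nil
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("jobsPerReplica %q: %w", raw, err)
	}
	if n <= 0 {
		return 0, fmt.Errorf("jobsPerReplica must be positive, got %d", n)
	}
	return n, nil
}

// metricSpec is the core of a GetMetricSpec handler: metric name plus the
// per-replica target.
func metricSpec(ref ScaledObjectRef) (MetricSpec, error) {
	target, err := targetFromMetadata(ref)
	if err != nil {
		return MetricSpec{}, err
	}
	return MetricSpec{MetricName: "inference-queue-depth", TargetSize: target}, nil
}

// isActive is the core of an IsActive handler: any pending work keeps the
// workload scaled above zero.
func isActive(queueDepth int64) bool {
	return queueDepth > 0
}

func main() {
	ref := ScaledObjectRef{
		Name:           "inference-gpu-scaler",
		ScalerMetadata: map[string]string{"jobsPerReplica": "5"},
	}
	spec, err := metricSpec(ref)
	fmt.Println(spec.MetricName, spec.TargetSize, err)
	fmt.Println(isActive(47), isActive(0))
}
```

Rejecting a zero or negative target up front avoids a division by zero in the replica calculation later.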

Wiring in the GPU Node Pool

Scaling pods is only half the story. GPU nodes themselves must come and go. The two main options:

Cluster Autoscaler

Label your GPU node group and configure taints so only GPU-requesting pods land there:

# Node group taint
key: nvidia.com/gpu
value: "true"
effect: NoSchedule

When KEDA scales the inference-worker Deployment up and new pods request nvidia.com/gpu: 1 (and tolerate the taint above), Cluster Autoscaler sees unschedulable pods and provisions nodes from the GPU group. When pods scale to zero, Cluster Autoscaler drains and terminates the idle nodes once they have been unneeded for the configured --scale-down-unneeded-time (10 minutes by default).

Karpenter (preferred on AWS)

Karpenter reacts faster and supports mixed instance types. Define a NodePool targeting GPU instances:

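apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      # Assumption: an EC2NodeClass named "default" already exists in the cluster.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["g5", "p3", "p4d"]
      # Restrict to NVIDIA GPU instance types using Karpenter's well-known label.
      - key: karpenter.k8s.aws/instance-gpu-manufacturer
        operator: In
        values: ["nvidia"]
  limits:
    nvidia.com/gpu: "64"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m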

consolidateAfter: 5m means Karpenter removes a fully idle GPU node five minutes after its last pod exits—a meaningful lever for controlling your cloud bill.

Handling the Cold-Start Gap

The biggest operational headache with GPU autoscaling is cold-start latency. Spinning up a GPU node from scratch—driver initialization, image pull, model load—can take 3–8 minutes. Strategies to close that gap:

Warm pool / placeholder pods: Keep one low-priority placeholder pod on a GPU node at all times. It requests the GPU but does no work, which keeps the node provisioned. Give the placeholder a PriorityClass lower than the inference workers so that, when real traffic arrives and KEDA scales up workers, the scheduler preempts the placeholder and the first worker starts on an already-warm node.

DaemonSet-based image pre-caching: Run a DaemonSet on GPU nodes that pre-pulls large inference images into each node’s image cache and stages model weights on node-local storage through hostPath volumes. On a node that is already running, first-pod startup drops from minutes to seconds.

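Scale-down stabilization window: Prevent thrashing by having the HPA that KEDA manages act on the highest replica recommendation from the previous two minutes, so a brief dip in queue depth does not remove pods immediately. The final step from one replica to zero is governed separately by the ScaledObject’s cooldownPeriod. The window is set in the ScaledObject spec: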

advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 120

Metric Design: What to Actually Measure

The external scaler’s power is in metric choice. Three patterns work well for ML inference:

  • Raw queue depth / jobs per replica: Simple and reliable. Works for any queue-backed inference system (SQS, Redis Streams, RabbitMQ). Set jobsPerReplica to however many concurrent requests a single GPU pod can handle at acceptable P99 latency.
  • Pending-to-running ratio: If your orchestrator tracks job states, scaling on pending / max_concurrency_per_pod avoids over-provisioning during bursty batch arrivals.
  • Triton queue time: NVIDIA Triton exposes nv_inference_queue_duration_us via Prometheus. Feed this into a KEDA Prometheus scaler or external scaler to scale on observed wait time rather than raw count—a tighter SLO proxy.
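For the Triton pattern, note that nv_inference_queue_duration_us is a cumulative counter, so a scaler needs the change between two scrapes divided by the number of requests in that interval. A minimal sketch follows, assuming the request count comes from a counter such as nv_inference_request_success. The sample type and function name are hypothetical, and scraping and parsing the metrics endpoint is left out.

```go
package main

import "fmt"

// sample holds two cumulative counters scraped at the same instant.
type sample struct {
	queueMicros float64 // cumulative nv_inference_queue_duration_us
	requests    float64 // cumulative request counter (assumed: nv_inference_request_success)
}

// avgQueueMillis returns the mean time a request spent queued between two
// scrapes, in milliseconds. ok is false when no requests completed in the
// interval or a counter went backwards (for example after a server restart).
func avgQueueMillis(prev, cur sample) (ms float64, ok bool) {
	dReq := cur.requests - prev.requests
	dQueue := cur.queueMicros - prev.queueMicros
	if dReq <= 0 || dQueue < 0 {
		return 0, false
	}
	return dQueue / dReq / 1000.0, true
}

func main() {
	prev := sample{queueMicros: 2_000_000, requests: 100}
	cur := sample{queueMicros: 5_000_000, requests: 160}

	// 3,000,000 us of queueing across 60 requests is 50 ms per request.
	ms, ok := avgQueueMillis(prev, cur)
	fmt.Println(ms, ok) // prints: 50 true
}
```

A scaler built this way would return the average wait from GetMetrics and a latency budget from GetMetricSpec, in place of a job count.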

What This Looks Like End-to-End

A realistic ML platform flow with this setup:

  1. A user submits an image-generation job; it lands in a Redis Stream.
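  2. Your external scaler’s GetMetrics reads XLEN inference-jobs and returns depth 47. (XLEN counts every entry still in the stream, so this assumes workers delete or trim entries once they are processed.)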
  3. With jobsPerReplica: 5, KEDA targets 10 replicas. Current count: 2. Delta: +8.
  4. Eight new inference-worker pods are created, each requesting nvidia.com/gpu: 1.
  5. Karpenter sees unschedulable pods and provisions two g5.12xlarge nodes (4 GPUs each).
  6. Pods schedule, models load, jobs drain. Queue depth falls to 0.
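  7. IsActive returns false. Once the ScaledObject’s cooldownPeriod (300 seconds by default) has passed, KEDA scales to zero. Five minutes later, Karpenter consolidates the now-empty nodes.

Total idle GPU time per burst: roughly ten minutes with these settings, and less if you shorten cooldownPeriod. Compare that to a static node pool running 24/7.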
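The node count in the flow above follows from simple capacity arithmetic, sketched below. The function is a deliberate simplification made for this article: a real provisioner also weighs instance price, availability, and other pods on the node.

```go
package main

import "fmt"

// nodesNeeded returns how many identical nodes are required to place the
// pending pods, given GPUs requested per pod and GPUs available per node.
// It returns 0 when there is nothing to place or a pod cannot fit on a node.
func nodesNeeded(pendingPods, gpusPerPod, gpusPerNode int) int {
	if pendingPods <= 0 || gpusPerPod <= 0 || gpusPerNode < gpusPerPod {
		return 0
	}
	podsPerNode := gpusPerNode / gpusPerPod
	return (pendingPods + podsPerNode - 1) / podsPerNode // ceiling division
}

func main() {
	desired, current := 10, 2
	pending := desired - current // 8 new pods, one GPU each

	fmt.Println(pending, nodesNeeded(pending, 1, 4)) // prints: 8 2
}
```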

Cost Reality Check

For a team whose static pool would keep four A100-class GPUs running through 14 hours of genuine idle time each day, eliminating idle nodes saves roughly 56 GPU-hours daily (4 GPUs × 14 hours). At $3.50/hr per A100 equivalent, that’s ~$196/day or ~$5,900/month, from a scaler that takes an afternoon to build and deploy.
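The same estimate can be reproduced for other inputs with a few lines of Go. The function below is an illustrative helper, and the 30-day month is an assumption of this sketch.

```go
package main

import "fmt"

// idleSavings converts idle GPU-hours avoided per day into daily savings and
// savings over a period of the given number of days.
func idleSavings(idleGPUHoursPerDay, pricePerGPUHour float64, days int) (daily, total float64) {
	daily = idleGPUHoursPerDay * pricePerGPUHour
	return daily, daily * float64(days)
}

func main() {
	daily, monthly := idleSavings(56, 3.50, 30) // figures from the paragraph above
	fmt.Printf("$%.0f/day, $%.0f/month\n", daily, monthly) // prints: $196/day, $5880/month
}
```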

The external scaler interface is deliberately simple. The gRPC contract is three methods. The configuration is a YAML stanza. The operational leverage relative to the implementation cost is unusually high, which is why it’s one of the more worthwhile afternoons a platform team can spend.
