# Autoscaling with KEDA
Practicus AI supports Kubernetes autoscaling out of the box using the Horizontal Pod Autoscaler (HPA).
For many workloads, CPU or memory-based scaling is sufficient and simple.
However, modern AI and event-driven workloads often need more advanced scaling signals—this is where KEDA comes in.
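For reference, plain CPU-based scaling with HPA looks like the sketch below; the Deployment name and utilization target are placeholders, not values from a real Practicus AI deployment.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
  namespace: NAMESPACE
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale out when average CPU exceeds 75%
```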
## What is KEDA?
KEDA (Kubernetes Event-Driven Autoscaling) is a Kubernetes add-on that enables autoscaling based on external and custom metrics, not just CPU or memory.
Instead of manually wiring metric adapters, external metrics APIs, and custom HPA logic yourself, KEDA provides:

- A standard Kubernetes resource (`ScaledObject`)
- Built-in integrations for common metric sources
- Automatic HPA creation and management behind the scenes
When you apply a `ScaledObject`, KEDA creates and owns an HPA for the target workload and continuously feeds it the right metrics.
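You can confirm this after applying a `ScaledObject`; KEDA names the HPA it creates `keda-hpa-<scaledobject-name>`:

```bash
# The HPA that KEDA creates carries the "keda-hpa-" prefix
kubectl get hpa -n <namespace>

# Inspect trigger status and the metrics KEDA is feeding the HPA
kubectl describe scaledobject <name> -n <namespace>
```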
## When should I use KEDA instead of HPA?
Use HPA when:
- CPU or memory utilization is a good proxy for load
- You want the simplest possible setup
- You don’t need scale-to-zero or external metrics
Use KEDA when:
- You need to scale on GPU utilization
- You want to scale based on request rate, queue depth, or custom metrics
- Your metrics come from Prometheus or external systems
- You want event-driven or burst-aware scaling
- You want consistent behavior across different metric types
In short: HPA is resource-based; KEDA is signal-based.
## How KEDA works (high level)
- You deploy KEDA into the cluster (operator + metrics adapter)
- You define a `ScaledObject` that points to a Deployment
- You define one or more triggers (CPU, Prometheus, HTTP, queue, etc.)
- KEDA:
    - Polls those triggers
    - Exposes metrics to Kubernetes
    - Creates and manages the underlying HPA
You do not need to create an HPA yourself when using KEDA.
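If KEDA is not already installed in your cluster, the standard installation from the KEDA project's Helm chart is:

```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```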
## Important behavior notes
- A Deployment should be managed by either HPA or KEDA, not both
- KEDA fully controls the HPA it creates
- You can still fine-tune scale behavior (rate limits, stabilization windows)
- KEDA supports multiple triggers at the same time (e.g. CPU and GPU)
## Typical AI / Model Hosting use cases
KEDA is especially useful for model-serving workloads where:
- GPU utilization is the true bottleneck
- CPU usage stays low while GPUs saturate
- Load comes in bursts
- Different models have different scaling signals
Common setups include:
- CPU utilization + GPU utilization
- Request rate (RPS) + cool-down window
- Queue length + scale-to-zero (see the sketch below)
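To illustrate the queue-length pattern, here is a sketch using KEDA's RabbitMQ scaler for a hypothetical background worker. The Deployment name, queue name, and thresholds are assumptions to replace with your own; other queue scalers (SQS, Kafka, Azure Service Bus, etc.) follow the same shape.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prt-keda-worker-DEPLOYMENT_KEY    # hypothetical worker
  namespace: NAMESPACE
spec:
  scaleTargetRef:
    name: prt-depl-worker-DEPLOYMENT_KEY  # hypothetical Deployment
  minReplicaCount: 0                      # scale-to-zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: inference-jobs         # hypothetical queue
        mode: QueueLength                 # scale on backlog size
        value: "20"                       # target messages per replica
        hostFromEnv: RABBITMQ_HOST        # AMQP connection string from the pod's env
```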
## Example configuration
Below is a manual KEDA `ScaledObject` example that:
- Replaces HPA
- Scales on CPU utilization
- Scales on GPU utilization via Prometheus
- Preserves familiar HPA scale-up behavior
You can apply it directly using `kubectl apply -f`.
```yaml
# -----------------------------
# (Optional) Secret for Prometheus auth (if you need it).
# If Prometheus is open inside the cluster, you can skip this entire section.
# -----------------------------
apiVersion: v1
kind: Secret
metadata:
  name: keda-prometheus-auth
  namespace: NAMESPACE
type: Opaque
stringData:
  bearerToken: "REPLACE_ME"  # or remove if not needed
---
# -----------------------------
# (Optional) TriggerAuthentication referencing the Secret above
# -----------------------------
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: prometheus-trigger-auth
  namespace: NAMESPACE
spec:
  secretTargetRef:
    - parameter: bearerToken
      name: keda-prometheus-auth
      key: bearerToken
---
# -----------------------------
# KEDA ScaledObject (the main object).
# This will create & manage an HPA for your Deployment.
# -----------------------------
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prt-keda-modelhost-DEPLOYMENT_KEY
  namespace: NAMESPACE
spec:
  scaleTargetRef:
    name: prt-depl-modelhost-DEPLOYMENT_KEY
  # Similar semantics to HPA min/max
  minReplicaCount: 1
  maxReplicaCount: 10
  # How often KEDA polls triggers, and how long it waits before scaling down
  pollingInterval: 15  # seconds
  cooldownPeriod: 60   # seconds
  # (Optional) keep your familiar HPA behavior tuning
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 2
              periodSeconds: 30
  triggers:
    # 1) CPU utilization trigger (works like your HPA cpu averageUtilization)
    - type: cpu
      metricType: Utilization
      metadata:
        value: "75"
    # 2) GPU utilization via the Prometheus scaler.
    #
    # You MUST adjust the PromQL query to match your cluster's GPU metrics.
    # Common sources: DCGM exporter metrics scraped by Prometheus.
    #
    # This example expects the query to return a single number (0..100).
    - type: prometheus
      metadata:
        serverAddress: "http://prometheus-server.monitoring.svc.cluster.local:80"
        metricName: "gpu_utilization_percent"
        query: |
          avg(
            DCGM_FI_DEV_GPU_UTIL{namespace="NAMESPACE", pod=~"prt-depl-modelhost-DEPLOYMENT_KEY-.*"}
          )
        threshold: "70"
        # If using the TriggerAuthentication above, also enable the auth mode:
        # authModes: "bearer"
      # If you don't need auth, remove "authenticationRef".
      authenticationRef:
        name: prometheus-trigger-auth
```
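One optional safeguard: if the Prometheus endpoint stops responding, KEDA's `fallback` setting can hold the workload at a fixed replica count instead of leaving it wherever the last metric put it. A small fragment you could merge under `spec` of the `ScaledObject` above; the replica count is an assumption to tune, and fallback applies to external triggers like the Prometheus one, not the CPU trigger.

```yaml
# Merged under spec: of the ScaledObject above
fallback:
  failureThreshold: 3  # consecutive failed metric reads before falling back
  replicas: 4          # hold this many replicas until Prometheus recovers
```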
## Scale from 0 with Istio (wake on incoming traffic)
When `minReplicaCount: 0` is set, there are no Pods, so your app cannot observe traffic or emit its own metrics.
To scale from 0 → 1, you need a demand signal that exists outside the Deployment.
With Istio, there are two practical patterns:
### Option A (recommended for “no dropped requests”): HTTP request buffering in front of the service
Use a component that can accept requests while the backend is at 0, then trigger scale-up and buffer until at least 1 Pod is ready.
Typical approaches:

- KEDA HTTP add-on (interceptor/proxy in front of the service)
- Knative-style activator (if you already use Knative)
Pros:

- Requests don’t immediately fail when replicas = 0
- Better UX for synchronous inference endpoints
Cautions:

- Buffering is not free: you must cap queue size, max pending time, and timeouts
- Cold start can be long for GPU model hosts (image pull + model load)
- Decide what you want to happen under heavy burst (queue, shed load, 429, etc.)
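As a rough sketch of the KEDA HTTP add-on pattern: the add-on's interceptor accepts requests while the backend is at 0 and holds them until a Pod is ready. Note the add-on is installed separately from core KEDA, its `HTTPScaledObject` schema has changed between add-on versions, and the hostname and port below are assumptions.

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: prt-http-modelhost-DEPLOYMENT_KEY
  namespace: NAMESPACE
spec:
  hosts:
    - models.example.com                       # hypothetical external host
  scaleTargetRef:
    name: prt-depl-modelhost-DEPLOYMENT_KEY
    service: prt-svc-modelhost-DEPLOYMENT_KEY  # Service the interceptor forwards to
    port: 80                                   # assumed Service port
  replicas:
    min: 0   # scale-to-zero; the interceptor buffers while waking the Deployment
    max: 10
```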
### Option B (simple, but first requests may fail): scale on Istio/Ingress traffic metrics (Prometheus)
If you have Prometheus scraping Istio metrics, you can scale based on request rate at the mesh/gateway layer (which still exists even when your Pods are at 0).
Pros:

- No extra proxy/buffering component required
- Simple to operate if you already have Istio + Prometheus
Cautions:

- When there are 0 endpoints, the first request(s) can return 503 until a Pod comes up
- You must rely on client retries, or Istio retries, to hide cold starts
- Prometheus scrape + KEDA polling add latency (not instant wake-up)
### YAML sample: KEDA scale-from-zero using Istio request rate (Prometheus)
Adjust namespaces, gateway/workload labels, and the PromQL query to match your telemetry. This example assumes you scrape Istio metrics into Prometheus.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prt-keda-modelhost-DEPLOYMENT_KEY
  namespace: NAMESPACE
spec:
  scaleTargetRef:
    name: prt-depl-modelhost-DEPLOYMENT_KEY
  # Allow scale-to-zero:
  minReplicaCount: 0
  maxReplicaCount: 10
  # Poll & cooldown are critical for traffic-based triggers
  pollingInterval: 15   # seconds
  cooldownPeriod: 120   # seconds (increase for GPU/model cold start)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: "http://prometheus-server.monitoring.svc.cluster.local:80"
        metricName: "istio_rps_modelhost"
        # Example: requests per second to a service inside the mesh.
        # You may need different labels depending on your Istio telemetry setup.
        query: |
          sum(
            rate(istio_requests_total{
              destination_service_name="prt-svc-modelhost-DEPLOYMENT_KEY",
              destination_workload_namespace="NAMESPACE",
              response_code!~"5.."
            }[1m])
          )
        # Normal scaling threshold once active (roughly RPS per replica; tune to your service)
        threshold: "5"
        # Wake-up threshold from 0 (prevents noise from waking Pods)
        activationThreshold: "1"
```
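To partially hide cold starts under Option B, an Istio `VirtualService` retry policy can re-attempt requests that arrive while the first Pod is starting. A sketch under assumed names; note that `attempts` times `perTryTimeout` must cover a realistic cold-start window, which may not be feasible for large GPU models.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: prt-vs-modelhost-DEPLOYMENT_KEY   # hypothetical name
  namespace: NAMESPACE
spec:
  hosts:
    - prt-svc-modelhost-DEPLOYMENT_KEY
  http:
    - route:
        - destination:
            host: prt-svc-modelhost-DEPLOYMENT_KEY
      retries:
        attempts: 5
        perTryTimeout: 10s
        # Retry connection failures and 5xx responses seen during wake-up
        retryOn: "5xx,reset,connect-failure,refused-stream"
```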