Case study
GPU-as-Code on the Edge
Brought GPUs online as code: passthrough-configured, readiness-gated, and reproducible across the fleet.
- GPU passthrough
- ESXi
- DCGM Exporter
- Prometheus
- Bash
- Watchdogs
Problem
GPUs are the most failure-prone part of an edge AI stack: passthrough has to be configured on the hypervisor, the device plugin has to be healthy in the cluster, and the workload has to refuse to start until both are true. Doing that by hand, per site, doesn’t scale.
Constraints
- As-code, not click-ops — GPU passthrough defined in code, not the ESXi UI.
- Fail safe — a not-ready GPU must block the workload, not crash it.
- Observable — GPU health has to be visible alongside the rest of the platform.
Design
GPU passthrough is configured on ESXi in code, with end-state manifests and Helm charts describing the desired node state. In-cluster, Bash readiness probes and Kubernetes watchdogs hold inference pods back until the GPU device plugin is healthy, then manage pod lifecycle from there. DCGM Exporter feeds GPU and container-workload health into Prometheus and AWX job-level reporting, so a degraded GPU surfaces the same way as any other platform signal.
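A readiness gate of the kind described above could look roughly like the following. This is a minimal sketch, not the project's actual probe: it assumes NVIDIA hardware (implied by the use of DCGM) and that `nvidia-smi` is available inside the container. `GPU_LIST_CMD`, `MIN_GPUS`, `count_gpus`, `gpu_ready`, and the `--probe` argument are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Readiness gate sketch: report "ready" only when enough GPUs are visible.
# GPU_LIST_CMD and MIN_GPUS are illustrative knobs, not names from the case study.

GPU_LIST_CMD="${GPU_LIST_CMD:-nvidia-smi -L}"
MIN_GPUS="${MIN_GPUS:-1}"

# Count stdin lines shaped like "GPU 0: <name> (UUID: ...)".
# grep -c exits non-zero on zero matches, so "|| true" keeps the count usable.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Return 0 when at least MIN_GPUS devices are listed, 1 otherwise.
gpu_ready() {
  # A missing or failing listing command means "not ready", never a crash.
  listing="$($GPU_LIST_CMD 2>/dev/null)" || return 1
  count="$(printf '%s\n' "$listing" | count_gpus)"
  [ "$count" -ge "$MIN_GPUS" ]
}

# Run as a probe only when asked to, so the file can also be sourced.
if [ "${1:-}" = "--probe" ]; then
  gpu_ready
  exit $?
fi
```

Invoked with `--probe` from an exec-style readiness probe, the script's exit code carries the verdict: Kubernetes treats exit code 0 as ready and anything else as not ready, so a pod whose GPU is missing is kept out of service rather than restarted.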
Security & reliability decisions
- Readiness gating — pods wait for the hardware; no boot-time races.
- End-state manifests — the node’s GPU config is declarative and reproducible.
- DCGM telemetry — GPU failures are detected, not discovered.
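To make "detected, not discovered" concrete, the sketch below scans DCGM Exporter output for GPUs whose metric exceeds a limit. It is an illustration under assumptions, not the project's alerting: it assumes the exporter's Prometheus text format with a `gpu` label and no trailing timestamp on each sample. `gpus_over` is a hypothetical helper name, and in practice such thresholds would more likely live in Prometheus alert rules.

```shell
#!/usr/bin/env bash
# Threshold check sketch over Prometheus text-format metrics read from stdin.
# Usage: gpus_over <metric_name> <limit>
# Prints the "gpu" label of each sample above the limit; returns 1 if any exist.
gpus_over() {
  awk -v metric="$1" -v limit="$2" '
    # Only consider sample lines for the requested metric, e.g. NAME{...} value
    index($0, metric "{") == 1 {
      # The value is the last whitespace-separated field on the line.
      if (match($0, /gpu="[^"]*"/) && $NF + 0 > limit + 0) {
        # Strip the leading gpu=" and the trailing quote from the match.
        print substr($0, RSTART + 5, RLENGTH - 6)
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}
```

A watchdog could pipe the exporter's metrics endpoint into it, for example `curl -s http://localhost:9400/metrics | gpus_over DCGM_FI_DEV_GPU_TEMP 85`, where the host, port, and the limit of 85 are assumptions that depend on the deployment and the hardware.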
Outcome
GPUs come online predictably across the fleet, the dangerous “pod started before the GPU” class of failure is designed out, and GPU health is a first-class metric.
Future improvements
Roll the readiness contract and DCGM thresholds into a single reusable module so any new GPU workload inherits the same guarantees by default.