Init-gating GPU readiness on Kubernetes
The most common way a GPU workload fails at the edge isn’t the model, the driver, or the
network. It’s timing. Kubernetes is eager: unless the pod declares the GPU as a resource, the
scheduler will happily place your inference pod the moment a node is Ready, which is often
before the NVIDIA device plugin has advertised nvidia.com/gpu. The pod starts, can’t see a
GPU, crash-loops, and now your rollout is poisoned across the fleet.
The fix is to make readiness explicit. Don’t trust node-Ready; gate on the GPU.
Gate the schedule, not just the start
A resource request is the first line of defence. A pod that requests nvidia.com/gpu stays Pending until the device plugin advertises that capacity on a node:
resources:
  limits:
    nvidia.com/gpu: 1
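To see what the scheduler sees, you can ask the API server for the node's allocatable GPU count. The following is a minimal sketch, assuming kubectl is configured for the cluster; the function names gpu_allocatable and node_has_gpu and the node name edge-node-1 are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of any NVIDIA or Kubernetes tooling.

# Print the allocatable nvidia.com/gpu count for a node.
# Prints nothing if the device plugin has not advertised the resource yet.
gpu_allocatable() {
  local node="$1"
  # The dot in "nvidia.com" must be escaped inside a kubectl JSONPath key.
  kubectl get node "$node" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed only if the node currently advertises at least one GPU.
node_has_gpu() {
  local count
  count="$(gpu_allocatable "$1")" || return 1
  case "$count" in
    '' | *[!0-9]*) return 1 ;;   # empty or non-numeric: not advertised
  esac
  [ "$count" -gt 0 ]
}
```

For example, `node_has_gpu edge-node-1 && echo "GPU advertised"` prints the message only once the plugin has registered capacity on that node.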
But on a single-GPU edge node that’s recovering from a reboot, you still want a hard check before the workload does anything expensive. An init container that blocks until the device is real keeps the main container honest:
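#!/usr/bin/env bash
set -euo pipefail

# Block until nvidia-smi can enumerate GPU 0, or fail loudly after a bound
# (30 attempts, 5 s apart). This proves visibility, not full device health.
for i in $(seq 1 30); do
  # Capture the output first so that a failing nvidia-smi, or grep exiting
  # early under pipefail, cannot turn a real match into a false negative.
  if gpus=$(nvidia-smi -L 2>/dev/null) && grep -q '^GPU 0:' <<<"$gpus"; then
    echo "GPU ready"
    exit 0
  fi
  echo "waiting for GPU ($i/30)"
  sleep 5
done

echo "GPU never became ready" >&2
exit 1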
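One way to wire the gate into a pod is sketched below as a shell function that renders a manifest. Everything named here is an assumption for illustration: the function render_gated_pod, the pod and container names, the ConfigMap gpu-gate-script that would hold the script as wait-for-gpu.sh, and the image, which is assumed to contain bash and to have nvidia-smi available. The init container is also assumed to need its own GPU limit so that the device is exposed to it.

```shell
#!/usr/bin/env bash
# Render a Pod manifest whose init container runs the GPU gate before the
# main container starts. All names below are hypothetical placeholders.
render_gated_pod() {
  local image="${1:?usage: render_gated_pod <image>}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: inference-gated
spec:
  initContainers:
    - name: wait-for-gpu
      image: ${image}
      command: ["/bin/bash", "/scripts/wait-for-gpu.sh"]
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: gate-script
          mountPath: /scripts
  containers:
    - name: inference
      image: ${image}
      resources:
        limits:
          nvidia.com/gpu: 1
  volumes:
    - name: gate-script
      configMap:
        name: gpu-gate-script
EOF
}
```

A possible usage is `kubectl create configmap gpu-gate-script --from-file=wait-for-gpu.sh` followed by `render_gated_pod registry.example.com/inference:dev | kubectl apply -f -`. Because init containers run to completion before the main container starts, the pod's effective request is the larger of the init request and the sum of the main containers' requests, so it still asks the scheduler for one GPU rather than two.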
Why this is the win
Once readiness is gated, the whole class of “pod started before the GPU” failures disappears — and it disappears the same way on every node. That consistency is the real prize at the edge, where no one is standing next to the box to nurse a bad rollout.
The principle generalises: at the edge, design the dependency, don’t hope for it. The GPU is just the first dependency worth gating; egress paths and model artifacts are next.
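The same bounded-wait pattern can be factored into a small helper and pointed at other dependencies. This is a sketch only: wait_for is a hypothetical function, and the model path and health URL in the usage comments are placeholders, not values from any real deployment.

```shell
#!/usr/bin/env bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Retry COMMAND until it succeeds; return 1 once ATTEMPTS are exhausted.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "waiting for: $* ($i/$attempts)" >&2
    i=$((i + 1))
    # Skip the sleep after the final failed attempt.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "gave up waiting for: $*" >&2
  return 1
}

# Hypothetical gates (placeholders, uncomment and adapt):
# wait_for 30 5 test -s /models/model.onnx
# wait_for 30 5 curl -fsS --max-time 3 -o /dev/null https://registry.example.internal/healthz
```

Keeping the bound explicit in every call means each gate fails loudly in a known amount of time instead of hanging the rollout.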