NVIDIA Driver/Library Version Mismatch

Problem

The nvidia-device-plugin pod crash-loops with hundreds of restarts. The container fails to start with:

failed to initialize NVML: Driver/library version mismatch

GPU workloads cannot be scheduled until the mismatch is resolved.

Root Cause

Two driver management systems compete on the same host:

GPU Operator manages NVIDIA drivers as a containerized DaemonSet with its own upgrade controller state machine (cordon → evict GPU pods → restart driver pod → validate → uncordon).
Unattended-upgrades (Debian/Ubuntu apt) can automatically upgrade host-level NVIDIA packages (nvidia-headless-*, libnvidia-*, cuda-*) outside the operator's control.

When unattended-upgrades bumps the userspace library while the kernel module remains from the previous version (or vice versa), NVML initialization fails with a version mismatch. The device plugin crash-loops until a node reboot happens to realign both components.

This only manifests on GPU nodes where the Ansible nvidia-container-runtime role installs host NVIDIA packages and the GPU Operator also deploys its own driver container.

How to Diagnose

# Check device plugin restarts and last failure reason
kubectl --context=grigri describe pod -n gpu-operator -l app=nvidia-device-plugin-daemonset | grep -A5 "Last State"

# Compare the loaded kernel module version (first command) with the installed host packages (second command)
ssh prusik "cat /proc/driver/nvidia/version"
ssh prusik "dpkg -l | grep nvidia | head -20"

# Check unattended-upgrades log for recent NVIDIA package upgrades
ssh prusik "grep -i nvidia /var/log/unattended-upgrades/unattended-upgrades.log"
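
The version check above can be scripted so the comparison is not done by eye. The following is a minimal sketch, not part of the original runbook: the helper names extract_nvidia_version and nvidia_versions_match are hypothetical, and it assumes the host package libnvidia-compute-* reflects the userspace library that NVML loads on this node.

```shell
# Sketch: compare the loaded kernel module version with the installed
# userspace library version. Helper names are illustrative only.

# Read text on stdin and print the first dotted version number found,
# e.g. 535.154.05 from the NVRM line or from a dpkg version string.
extract_nvidia_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed (exit 0) only when both versions are non-empty and identical.
# $1 = kernel module version, $2 = userspace library version.
nvidia_versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example usage against the host from this runbook (assumes the
# libnvidia-compute-* package is the one that ships libnvidia-ml):
#   kernel=$(ssh prusik "head -n 1 /proc/driver/nvidia/version" | extract_nvidia_version)
#   library=$(ssh prusik "dpkg-query -W -f='\${Version}\n' 'libnvidia-compute-*'" | extract_nvidia_version)
#   nvidia_versions_match "$kernel" "$library" || echo "mismatch: kernel=$kernel library=$library"
```

If the two values differ, or either one is empty, treat the node as affected and continue with the unattended-upgrades log check.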

Fix

1. Blacklist NVIDIA packages from unattended-upgrades

The 50unattended-upgrades apt config (/etc/apt/apt.conf.d/50unattended-upgrades on the host) must list the NVIDIA packages in its blacklist:

Unattended-Upgrade::Package-Blacklist {
    "nvidia-.*";
    "libnvidia-.*";
    "cuda-.*";
    "xserver-xorg-video-nvidia-.*";
};

Deploy the updated config via Ansible (note: the unattended-upgrades tag lives in the prepare role, not in cluster):

cd metal && ANSIBLE_EXTRA_ARGS="-t unattended-upgrades" make prepare
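
After the deploy, it helps to confirm which installed packages the blacklist actually covers. The sketch below is an illustration rather than part of the Ansible role: the function name blacklisted_packages is hypothetical, and it assumes unattended-upgrades applies each pattern as a regular expression anchored at the start of the package name, which the grep expression mimics.

```shell
# Patterns copied from the Package-Blacklist block above, joined into one
# expression. Assumption: each pattern is matched from the start of the
# package name, so the expression is anchored with ^.
NVIDIA_BLACKLIST='^(nvidia-.*|libnvidia-.*|cuda-.*|xserver-xorg-video-nvidia-.*)'

# Read package names on stdin (one per line) and print those the blacklist
# would hold back. Exits 0 even when nothing matches.
blacklisted_packages() {
    grep -E "$NVIDIA_BLACKLIST" || true
}

# Example usage against the host from this runbook:
#   ssh prusik "dpkg-query -W -f='\${Package}\n'" | blacklisted_packages
```

Every NVIDIA package reported by the dpkg check in the diagnosis section should appear in the output. Running sudo unattended-upgrade --dry-run --debug on the node is a complementary check that exercises the real matcher instead of this approximation.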

2. Route driver upgrades only through the GPU Operator

Use the operator's upgrade controller to roll out new driver versions:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op": "replace", "path": "/spec/driver/version", "value":"<new-version>"}]'

Monitor per-node progress:

kubectl get node -l nvidia.com/gpu.present \
    -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}'
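
To spot nodes that have not finished upgrading, the tab-separated output of the command above can be filtered. This is a minimal sketch: the function name pending_upgrade_nodes is hypothetical, and it assumes upgrade-done is the label value the upgrade controller sets on a node once its upgrade completes.

```shell
# Read "node<TAB>state" lines on stdin and print every node whose state is
# not upgrade-done. Nodes without the label are reported explicitly.
# Assumption: "upgrade-done" marks a finished node.
pending_upgrade_nodes() {
    awk -F '\t' '$2 != "upgrade-done" {
        state = ($2 == "" ? "<no state label>" : $2)
        print $1 "\t" state
    }'
}

# Example usage: pipe the kubectl command shown above into the filter.
#   kubectl get node -l nvidia.com/gpu.present -ojsonpath='...' | pending_upgrade_nodes
```

An empty result means every GPU node reports upgrade-done. A node that stays in any other state for a long time is worth inspecting with kubectl describe node.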

See the NVIDIA GPU Operator documentation, "GPU Driver Upgrades", for the full upgrade state machine and configuration options.