I am trying to get mixed-mode MIG working on a couple of DGX B300 nodes.
I have deployed version 0.19.0 of the device plugin using the Helm chart, with the following values set:
- config.name = nvidia-plugin-configs
- config.default = default
- runtimeClassName = nvidia
- gfd.enabled = true
The nvidia-plugin-configs ConfigMap contains the following item in the default key:
version: v1
flags:
  migStrategy: "mixed"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
This seems to deploy just fine: I see a node-feature-discovery worker pod, a gpu-feature-discovery pod, and an nvidia-device-plugin pod on each node.
I also see labels created on the nodes by both node-feature-discovery and gpu-feature-discovery. What I don't see are any MIG-related labels from gpu-feature-discovery, per this: https://github.com/NVIDIA/k8s-device-plugin/tree/main/docs/gpu-feature-discovery#generated-labels
Here are all the labels that start with "nvidia.com" that are added to the nodes:
"nvidia.com/cuda.driver-version.full": "580.126.20",
"nvidia.com/cuda.driver-version.major": "580",
"nvidia.com/cuda.driver-version.minor": "126",
"nvidia.com/cuda.driver-version.revision": "20",
"nvidia.com/cuda.driver.major": "580",
"nvidia.com/cuda.driver.minor": "126",
"nvidia.com/cuda.driver.rev": "20",
"nvidia.com/cuda.runtime-version.full": "13.0",
"nvidia.com/cuda.runtime-version.major": "13",
"nvidia.com/cuda.runtime-version.minor": "0",
"nvidia.com/cuda.runtime.major": "13",
"nvidia.com/cuda.runtime.minor": "0",
"nvidia.com/gfd.timestamp": "1775237087",
"nvidia.com/gpu.compute.major": "10",
"nvidia.com/gpu.compute.minor": "3",
"nvidia.com/gpu.count": "8",
"nvidia.com/gpu.family": "blackwell",
"nvidia.com/gpu.machine": "DGXB300",
"nvidia.com/gpu.memory": "275040",
"nvidia.com/gpu.mode": "compute",
"nvidia.com/gpu.present": "true",
"nvidia.com/gpu.product": "NVIDIA-B300-SXM6-AC",
"nvidia.com/gpu.replicas": "1",
"nvidia.com/gpu.sharing-strategy": "none",
"nvidia.com/mig.capable": "true",
"nvidia.com/mig.strategy": "mixed",
"nvidia.com/mps.capable": "false",
"nvidia.com/vgpu.present": "false"
And this is added to the capacity of the node as well:
"nvidia.com/gpu": "8",
"nvidia.com/gpu.shared": "0",
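For comparison, if MIG devices were actually being discovered under the mixed strategy, I would expect (per the generated-labels doc linked above) the capacity to gain per-profile entries. The values below are illustrative only, not from my nodes:

```yaml
# Illustrative only: counts depend on how the GPUs are partitioned.
nvidia.com/gpu: "6"            # GPUs left in whole-GPU mode
nvidia.com/mig-1g.34gb: "14"   # one entry per configured MIG profile
```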
I see the following logs from the ctr container in the nvidia-device-plugin pod:
ERROR: init 250 result=11I0403 17:37:39.513185 363 main.go:250] "Starting NVIDIA Device Plugin" version=<
1ae8e5f0-amd64
commit: 1ae8e5f02a47ad80e2b4fcc6c35a757c61ddd81f
>
I0403 17:37:39.513261 363 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0403 17:37:39.513328 363 main.go:260] Starting OS watcher.
I0403 17:37:39.514428 363 main.go:275] Starting Plugins.
I0403 17:37:39.514502 363 main.go:332] Loading configuration.
I0403 17:37:39.517927 363 main.go:358] Updating config with default resource matching patterns.
I0403 17:37:39.724518 363 main.go:369]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "mixed",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": true,
    "gdsEnabled": true,
    "mofedEnabled": true,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "1g.34gb",
        "name": "nvidia.com/mig-1g.34gb"
      },
      {
        "pattern": "2g.67gb",
        "name": "nvidia.com/mig-2g.67gb"
      },
      {
        "pattern": "3g.135gb",
        "name": "nvidia.com/mig-3g.135gb"
      },
      {
        "pattern": "4g.135gb",
        "name": "nvidia.com/mig-4g.135gb"
      },
      {
        "pattern": "7g.269gb",
        "name": "nvidia.com/mig-7g.269gb"
      },
      {
        "pattern": "1g.34gb+me",
        "name": "nvidia.com/mig-1g.34gb.me"
      },
      {
        "pattern": "1g.67gb",
        "name": "nvidia.com/mig-1g.67gb"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0403 17:37:39.724542 363 main.go:372] Retrieving plugins.
I0403 17:37:39.995823 363 server.go:198] Starting GRPC server for 'nvidia.com/gpu'
I0403 17:37:39.998220 363 server.go:142] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0403 17:37:40.004108 363 server.go:149] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0403 17:37:40.054570 363 health.go:64] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]
and the following from the init container in the same pod:
ERROR: init 250 result=11W0403 17:37:36.975661 7 client_config.go:682] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0403 17:37:36.976170 7 main.go:247] Waiting for change to 'nvidia.com/device-plugin.config' label
I0403 17:37:36.976176 7 main.go:249] Label change detected: nvidia.com/device-plugin.config=
I0403 17:37:36.976228 7 main.go:361] No value set. Selecting default name: default
I0403 17:37:36.976232 7 main.go:305] Updating to config: default
I0403 17:37:36.976325 7 main.go:320] Successfully updated to config: default
The only error message that I see in any of the pods that this chart deploys is in the ctr container of the gpu-feature-discovery pod:
ERROR: init 250 result=11I0403 17:24:47.544108 364 main.go:163] Starting OS watcher.
I0403 17:24:47.544839 364 main.go:168] Loading configuration.
I0403 17:24:47.546465 364 main.go:180]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "mixed",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": null,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": true,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": "/driver-root"
    },
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0403 17:24:47.754212 364 factory.go:58] Using NVML manager
I0403 17:24:47.766243 364 main.go:214] Start running
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:1a:00.0 open /sys/bus/pci/devices/0000:1a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:47 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:3c:00.0 open /sys/bus/pci/devices/0000:3c:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:47 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:62:00.0 open /sys/bus/pci/devices/0000:62:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:73:00.0 open /sys/bus/pci/devices/0000:73:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:9a:00.0 open /sys/bus/pci/devices/0000:9a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:bc:00.0 open /sys/bus/pci/devices/0000:bc:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:df:00.0 open /sys/bus/pci/devices/0000:df:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:f0:00.0 open /sys/bus/pci/devices/0000:f0:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
I0403 17:24:48.310539 364 main.go:281] Creating Labels
I0403 17:24:48.318302 364 output.go:165] Updating NodeFeature object nvidia-features-for-ngai-b300-01
I0403 17:24:48.329922 364 output.go:170] NodeFeature object updated: &{{ } {nvidia-features-for-ngai-b300-01 system-gpu-operator 359a11e9-4fb5-437d-99bf-5e068e79960c 2477652 9 2026-04-02 19:07:15 +0000 UTC <nil> <nil> map[nfd.node.kubernetes.io/node-name:ngai-b300-01] map[] [{apps/v1 DaemonSet nvidia-device-plugin-gpu-feature-discovery b9796695-9106-4939-942d-07b1d1513d20 0x1f2869b8209d <nil>} {v1 Pod nvidia-device-plugin-gpu-feature-discovery-l7qpr b6d64518-a336-449c-a59e-6e3cf723fc15 <nil> <nil>}] [] [{gpu-feature-discovery Update nfd.k8s-sigs.io/v1alpha1 2026-04-03 17:24:48 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:nfd.node.kubernetes.io/node-name":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"b6d64518-a336-449c-a59e-6e3cf723fc15\"}":{},"k:{\"uid\":\"b9796695-9106-4939-942d-07b1d1513d20\"}":{}}},"f:spec":{".":{},"f:features":{".":{},"f:attributes":{},"f:flags":{},"f:instances":{}},"f:labels":{".":{},"f:nvidia.com/cuda.driver-version.full":{},"f:nvidia.com/cuda.driver-version.major":{},"f:nvidia.com/cuda.driver-version.minor":{},"f:nvidia.com/cuda.driver-version.revision":{},"f:nvidia.com/cuda.driver.major":{},"f:nvidia.com/cuda.driver.minor":{},"f:nvidia.com/cuda.driver.rev":{},"f:nvidia.com/cuda.runtime-version.full":{},"f:nvidia.com/cuda.runtime-version.major":{},"f:nvidia.com/cuda.runtime-version.minor":{},"f:nvidia.com/cuda.runtime.major":{},"f:nvidia.com/cuda.runtime.minor":{},"f:nvidia.com/gfd.timestamp":{},"f:nvidia.com/gpu.compute.major":{},"f:nvidia.com/gpu.compute.minor":{},"f:nvidia.com/gpu.count":{},"f:nvidia.com/gpu.family":{},"f:nvidia.com/gpu.machine":{},"f:nvidia.com/gpu.memory":{},"f:nvidia.com/gpu.mode":{},"f:nvidia.com/gpu.product":{},"f:nvidia.com/gpu.replicas":{},"f:nvidia.com/gpu.sharing-strategy":{},"f:nvidia.com/mig.capable":{},"f:nvidia.com/mig.strategy":{},"f:nvidia.com/mps.capable":{},"f:nvidia.com/vgpu.present":{}}}} }]} {{map[] map[] map[]} map[nvidia.com/cuda.driver-version.full:580.126.20 
nvidia.com/cuda.driver-version.major:580 nvidia.com/cuda.driver-version.minor:126 nvidia.com/cuda.driver-version.revision:20 nvidia.com/cuda.driver.major:580 nvidia.com/cuda.driver.minor:126 nvidia.com/cuda.driver.rev:20 nvidia.com/cuda.runtime-version.full:13.0 nvidia.com/cuda.runtime-version.major:13 nvidia.com/cuda.runtime-version.minor:0 nvidia.com/cuda.runtime.major:13 nvidia.com/cuda.runtime.minor:0 nvidia.com/gfd.timestamp:1775237087 nvidia.com/gpu.compute.major:10 nvidia.com/gpu.compute.minor:3 nvidia.com/gpu.count:8 nvidia.com/gpu.family:blackwell nvidia.com/gpu.machine:DGXB300 nvidia.com/gpu.memory:275040 nvidia.com/gpu.mode:compute nvidia.com/gpu.product:NVIDIA-B300-SXM6-AC nvidia.com/gpu.replicas:1 nvidia.com/gpu.sharing-strategy:none nvidia.com/mig.capable:true nvidia.com/mig.strategy:mixed nvidia.com/mps.capable:false nvidia.com/vgpu.present:false]}}
I0403 17:24:48.330242 364 main.go:294] Sleeping for 1m0s
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:1a:00.0 open /sys/bus/pci/devices/0000:1a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:3c:00.0 open /sys/bus/pci/devices/0000:3c:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:62:00.0 open /sys/bus/pci/devices/0000:62:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:73:00.0 open /sys/bus/pci/devices/0000:73:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:9a:00.0 open /sys/bus/pci/devices/0000:9a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:bc:00.0 open /sys/bus/pci/devices/0000:bc:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:df:00.0 open /sys/bus/pci/devices/0000:df:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:f0:00.0 open /sys/bus/pci/devices/0000:f0:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
I0403 17:25:48.836876 364 main.go:281] Creating Labels
I0403 17:25:48.842763 364 output.go:161] no changes in NodeFeature object nvidia-features-for-ngai-b300-01
I0403 17:25:48.842787 364 main.go:294] Sleeping for 1m0s
I assume this means it is unable to discover something about the GPUs that it needs in order to create the MIG-related labels.
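One thing I have not ruled out (my assumption, not something the logs confirm): GFD only generates the per-profile MIG labels when MIG devices actually exist on the GPUs, and `nvidia.com/mig.capable: true` only says the hardware supports MIG. A minimal sketch of that check, parsing hypothetical output of `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader` (the sample text is made up, not from these nodes):

```python
# Sketch: determine which GPUs have MIG mode enabled, given the CSV output of
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
# The sample output below is hypothetical.
def mig_enabled_gpus(csv_text: str) -> list[int]:
    enabled = []
    for line in csv_text.strip().splitlines():
        idx, mode = (field.strip() for field in line.split(","))
        if mode == "Enabled":
            enabled.append(int(idx))
    return enabled

sample = "0, Disabled\n1, Disabled\n2, Enabled"
print(mig_enabled_gpus(sample))  # → [2]
```

If every GPU reports Disabled (and no MIG instances show up in `nvidia-smi -L`), the mixed strategy would have nothing to advertise, which could also explain the missing labels.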
With this setup I can create a pod using the following manifest which works just fine:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 4 # requesting 4 GPUs
but when I try the following, it fails to schedule:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.34gb: 1
I assume that is because the nvidia.com/mig-1g.34gb resource is not advertised on any node.
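That assumption is consistent with how extended-resource scheduling works: a pod is only schedulable if some node's allocatable covers every requested resource. A toy sketch of that check, using the capacity shown above:

```python
# Sketch: extended-resource scheduling check. A node can host the pod only if
# its allocatable covers every requested resource quantity.
def schedulable(requests: dict[str, int], allocatable: dict[str, int]) -> bool:
    return all(allocatable.get(name, 0) >= qty for name, qty in requests.items())

# Allocatable as reported by the nodes above (no mig-* entries present).
node = {"nvidia.com/gpu": 8, "nvidia.com/gpu.shared": 0}

print(schedulable({"nvidia.com/gpu": 4}, node))          # → True  (first pod runs)
print(schedulable({"nvidia.com/mig-1g.34gb": 1}, node))  # → False (second pod stays Pending)
```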
The details of this deployment are as follows:
Talos Linux v1.12.6 with:
- Kubernetes API: v1.34
- nvidia-container-toolkit: v1.18.2
- nvidia-fabricmanager: 580.126.20
- nvidia open kernel modules: 580.126.20
- nvidia-gdrdrv-device: v2.5.1
Any help would be appreciated. Thank you.