[Question]: mig labels are not being created #1682

@linkages

Description

I am trying to get mixed-mode MIG working on a couple of DGX B300 nodes.

I have deployed version 0.19.0 of the device plugin using the helm chart. I have set the following items in the chart:

  • config.name = nvidia-plugin-configs
  • config.default = default
  • runtimeClassName = nvidia
  • gfd.enabled = true

The nvidia-plugin-configs ConfigMap contains the following under the default key:

version: v1
flags:
  migStrategy: "mixed"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

This seems to deploy just fine. I see a node-feature-discovery worker pod, a gpu-feature-discovery pod, and an nvidia-device-plugin pod deployed per node.

I also see labels created on the nodes by both node-feature-discovery and gpu-feature-discovery. What I don't see are any MIG labels created by gpu-feature-discovery, per this list: https://github.com/NVIDIA/k8s-device-plugin/tree/main/docs/gpu-feature-discovery#generated-labels

Here are all the labels that start with "nvidia.com" that are added to the nodes:

  "nvidia.com/cuda.driver-version.full": "580.126.20",
  "nvidia.com/cuda.driver-version.major": "580",
  "nvidia.com/cuda.driver-version.minor": "126",
  "nvidia.com/cuda.driver-version.revision": "20",
  "nvidia.com/cuda.driver.major": "580",
  "nvidia.com/cuda.driver.minor": "126",
  "nvidia.com/cuda.driver.rev": "20",
  "nvidia.com/cuda.runtime-version.full": "13.0",
  "nvidia.com/cuda.runtime-version.major": "13",
  "nvidia.com/cuda.runtime-version.minor": "0",
  "nvidia.com/cuda.runtime.major": "13",
  "nvidia.com/cuda.runtime.minor": "0",
  "nvidia.com/gfd.timestamp": "1775237087",
  "nvidia.com/gpu.compute.major": "10",
  "nvidia.com/gpu.compute.minor": "3",
  "nvidia.com/gpu.count": "8",
  "nvidia.com/gpu.family": "blackwell",
  "nvidia.com/gpu.machine": "DGXB300",
  "nvidia.com/gpu.memory": "275040",
  "nvidia.com/gpu.mode": "compute",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-B300-SXM6-AC",
  "nvidia.com/gpu.replicas": "1",
  "nvidia.com/gpu.sharing-strategy": "none",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.strategy": "mixed",
  "nvidia.com/mps.capable": "false",
  "nvidia.com/vgpu.present": "false"

And this is added to the capacity of the node as well:

  "nvidia.com/gpu": "8",
  "nvidia.com/gpu.shared": "0",

I see the following logs from the ctr container in the nvidia-device-plugin pod:

I0403 17:37:39.513185     363 main.go:250] "Starting NVIDIA Device Plugin" version=<
        1ae8e5f0-amd64
        commit: 1ae8e5f02a47ad80e2b4fcc6c35a757c61ddd81f
 >
I0403 17:37:39.513261     363 main.go:253] Starting FS watcher for /var/lib/kubelet/device-plugins
I0403 17:37:39.513328     363 main.go:260] Starting OS watcher.
I0403 17:37:39.514428     363 main.go:275] Starting Plugins.
I0403 17:37:39.514502     363 main.go:332] Loading configuration.
I0403 17:37:39.517927     363 main.go:358] Updating config with default resource matching patterns.
I0403 17:37:39.724518     363 main.go:369]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "mixed",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": true,
    "gdsEnabled": true,
    "mofedEnabled": true,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "1g.34gb",
        "name": "nvidia.com/mig-1g.34gb"
      },
      {
        "pattern": "2g.67gb",
        "name": "nvidia.com/mig-2g.67gb"
      },
      {
        "pattern": "3g.135gb",
        "name": "nvidia.com/mig-3g.135gb"
      },
      {
        "pattern": "4g.135gb",
        "name": "nvidia.com/mig-4g.135gb"
      },
      {
        "pattern": "7g.269gb",
        "name": "nvidia.com/mig-7g.269gb"
      },
      {
        "pattern": "1g.34gb+me",
        "name": "nvidia.com/mig-1g.34gb.me"
      },
      {
        "pattern": "1g.67gb",
        "name": "nvidia.com/mig-1g.67gb"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0403 17:37:39.724542     363 main.go:372] Retrieving plugins.
I0403 17:37:39.995823     363 server.go:198] Starting GRPC server for 'nvidia.com/gpu'
I0403 17:37:39.998220     363 server.go:142] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0403 17:37:40.004108     363 server.go:149] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0403 17:37:40.054570     363 health.go:64] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]

and the following from the init container in the same pod:

W0403 17:37:36.975661       7 client_config.go:682] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0403 17:37:36.976170       7 main.go:247] Waiting for change to 'nvidia.com/device-plugin.config' label
I0403 17:37:36.976176       7 main.go:249] Label change detected: nvidia.com/device-plugin.config=
I0403 17:37:36.976228       7 main.go:361] No value set. Selecting default name: default
I0403 17:37:36.976232       7 main.go:305] Updating to config: default
I0403 17:37:36.976325       7 main.go:320] Successfully updated to config: default

The only error messages that I see in any of the pods that this chart deploys are in the ctr container of the gpu-feature-discovery pod:

I0403 17:24:47.544108     364 main.go:163] Starting OS watcher.
I0403 17:24:47.544839     364 main.go:168] Loading configuration.
I0403 17:24:47.546465     364 main.go:180]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "mixed",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": null,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": true,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": "/driver-root"
    },
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0403 17:24:47.754212     364 factory.go:58] Using NVML manager
I0403 17:24:47.766243     364 main.go:214] Start running
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:1a:00.0 open /sys/bus/pci/devices/0000:1a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:47 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:3c:00.0 open /sys/bus/pci/devices/0000:3c:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:47 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:47 WARNING: unable to detect IOMMU FD for [0000:62:00.0 open /sys/bus/pci/devices/0000:62:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:73:00.0 open /sys/bus/pci/devices/0000:73:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:9a:00.0 open /sys/bus/pci/devices/0000:9a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:bc:00.0 open /sys/bus/pci/devices/0000:bc:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:df:00.0 open /sys/bus/pci/devices/0000:df:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:24:48 WARNING: unable to detect IOMMU FD for [0000:f0:00.0 open /sys/bus/pci/devices/0000:f0:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:24:48 WARNING: unable to get device name: [failed to find device with id '3182']
I0403 17:24:48.310539     364 main.go:281] Creating Labels
I0403 17:24:48.318302     364 output.go:165] Updating NodeFeature object nvidia-features-for-ngai-b300-01
I0403 17:24:48.329922     364 output.go:170] NodeFeature object updated: &{{ } {nvidia-features-for-ngai-b300-01  system-gpu-operator  359a11e9-4fb5-437d-99bf-5e068e79960c 2477652 9 2026-04-02 19:07:15 +0000 UTC <nil> <nil> map[nfd.node.kubernetes.io/node-name:ngai-b300-01] map[] [{apps/v1 DaemonSet nvidia-device-plugin-gpu-feature-discovery b9796695-9106-4939-942d-07b1d1513d20 0x1f2869b8209d <nil>} {v1 Pod nvidia-device-plugin-gpu-feature-discovery-l7qpr b6d64518-a336-449c-a59e-6e3cf723fc15 <nil> <nil>}] [] [{gpu-feature-discovery Update nfd.k8s-sigs.io/v1alpha1 2026-04-03 17:24:48 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:nfd.node.kubernetes.io/node-name":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"b6d64518-a336-449c-a59e-6e3cf723fc15\"}":{},"k:{\"uid\":\"b9796695-9106-4939-942d-07b1d1513d20\"}":{}}},"f:spec":{".":{},"f:features":{".":{},"f:attributes":{},"f:flags":{},"f:instances":{}},"f:labels":{".":{},"f:nvidia.com/cuda.driver-version.full":{},"f:nvidia.com/cuda.driver-version.major":{},"f:nvidia.com/cuda.driver-version.minor":{},"f:nvidia.com/cuda.driver-version.revision":{},"f:nvidia.com/cuda.driver.major":{},"f:nvidia.com/cuda.driver.minor":{},"f:nvidia.com/cuda.driver.rev":{},"f:nvidia.com/cuda.runtime-version.full":{},"f:nvidia.com/cuda.runtime-version.major":{},"f:nvidia.com/cuda.runtime-version.minor":{},"f:nvidia.com/cuda.runtime.major":{},"f:nvidia.com/cuda.runtime.minor":{},"f:nvidia.com/gfd.timestamp":{},"f:nvidia.com/gpu.compute.major":{},"f:nvidia.com/gpu.compute.minor":{},"f:nvidia.com/gpu.count":{},"f:nvidia.com/gpu.family":{},"f:nvidia.com/gpu.machine":{},"f:nvidia.com/gpu.memory":{},"f:nvidia.com/gpu.mode":{},"f:nvidia.com/gpu.product":{},"f:nvidia.com/gpu.replicas":{},"f:nvidia.com/gpu.sharing-strategy":{},"f:nvidia.com/mig.capable":{},"f:nvidia.com/mig.strategy":{},"f:nvidia.com/mps.capable":{},"f:nvidia.com/vgpu.present":{}}}} }]} {{map[] map[] map[]} map[nvidia.com/cuda.driver-version.full:580.126.20 
nvidia.com/cuda.driver-version.major:580 nvidia.com/cuda.driver-version.minor:126 nvidia.com/cuda.driver-version.revision:20 nvidia.com/cuda.driver.major:580 nvidia.com/cuda.driver.minor:126 nvidia.com/cuda.driver.rev:20 nvidia.com/cuda.runtime-version.full:13.0 nvidia.com/cuda.runtime-version.major:13 nvidia.com/cuda.runtime-version.minor:0 nvidia.com/cuda.runtime.major:13 nvidia.com/cuda.runtime.minor:0 nvidia.com/gfd.timestamp:1775237087 nvidia.com/gpu.compute.major:10 nvidia.com/gpu.compute.minor:3 nvidia.com/gpu.count:8 nvidia.com/gpu.family:blackwell nvidia.com/gpu.machine:DGXB300 nvidia.com/gpu.memory:275040 nvidia.com/gpu.mode:compute nvidia.com/gpu.product:NVIDIA-B300-SXM6-AC nvidia.com/gpu.replicas:1 nvidia.com/gpu.sharing-strategy:none nvidia.com/mig.capable:true nvidia.com/mig.strategy:mixed nvidia.com/mps.capable:false nvidia.com/vgpu.present:false]}}
I0403 17:24:48.330242     364 main.go:294] Sleeping for 1m0s
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:1a:00.0 open /sys/bus/pci/devices/0000:1a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:3c:00.0 open /sys/bus/pci/devices/0000:3c:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:62:00.0 open /sys/bus/pci/devices/0000:62:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:73:00.0 open /sys/bus/pci/devices/0000:73:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:9a:00.0 open /sys/bus/pci/devices/0000:9a:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:bc:00.0 open /sys/bus/pci/devices/0000:bc:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:df:00.0 open /sys/bus/pci/devices/0000:df:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
2026/04/03 17:25:48 WARNING: unable to detect IOMMU FD for [0000:f0:00.0 open /sys/bus/pci/devices/0000:f0:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
2026/04/03 17:25:48 WARNING: unable to get device name: [failed to find device with id '3182']
I0403 17:25:48.836876     364 main.go:281] Creating Labels
I0403 17:25:48.842763     364 output.go:161] no changes in NodeFeature object nvidia-features-for-ngai-b300-01
I0403 17:25:48.842787     364 main.go:294] Sleeping for 1m0s

I assume this means it is unable to discover something about the GPUs that it needs in order to create the MIG-related labels.
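One thing worth checking on the nodes themselves (a sketch, and an assumption on my part rather than something visible in the logs above): as far as I understand, gpu-feature-discovery only emits the per-profile MIG labels once MIG mode is actually enabled on the GPUs and MIG devices have been created; nvidia.com/mig.capable=true alone is not enough.

```shell
# Hypothetical on-node check (via SSH or a privileged debug pod).
# Guarded so it exits cleanly where nvidia-smi is unavailable.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Current MIG mode per GPU: "Enabled" or "Disabled"
  nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
  # MIG devices, if any have been created, appear indented under each GPU
  nvidia-smi -L
else
  echo "nvidia-smi not found on this host"
fi
```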

With this setup I can create a pod using the following manifest, which works just fine:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 4 # requesting 4 GPUs

but when I try the following, it fails to schedule:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:13.2.0-cudnn-runtime-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.34gb: 1

I assume that is because the nvidia.com/mig-1g.34gb resource is not advertised on any node.
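For context, this is how I understand Kubernetes to treat extended resources: a pod requesting nvidia.com/mig-1g.34gb can only schedule onto a node whose capacity/allocatable includes that resource, and the node capacity shown above lists only nvidia.com/gpu and nvidia.com/gpu.shared. A toy sketch of that fit check (illustrative only; fits is a made-up name, not scheduler code):

```python
def fits(requests: dict, allocatable: dict) -> bool:
    """A node fits iff every requested extended resource is present
    in its allocatable set with sufficient quantity."""
    return all(allocatable.get(r, 0) >= q for r, q in requests.items())

# Allocatable as reported above: full GPUs only, no MIG resources.
node = {"nvidia.com/gpu": 8, "nvidia.com/gpu.shared": 0}

print(fits({"nvidia.com/gpu": 4}, node))          # -> True (schedules)
print(fits({"nvidia.com/mig-1g.34gb": 1}, node))  # -> False (stays Pending)
```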

The details of this deployment are as follows:

Talos Linux v1.12.6 with:

  • Kubernetes API: v1.34
  • nvidia-container-toolkit: v1.18.2
  • nvidia-fabricmanager: 580.126.20
  • nvidia open kernel modules: 580.126.20
  • nvidia-gdrdrv-device: v2.5.1

Any help would be appreciated. Thank you.
