
GPU Monitoring Fleet Page


Overview

The GPU Fleet page provides a detailed inventory of all of your GPU-accelerated hosts for a specified time frame. Use this view to uncover inefficiencies through resource telemetry, ranging from performance and usage metrics to costs. This page also surfaces Datadog’s built-in provisioning and performance optimization recommendations for your devices, to help you maximize the value of your GPU spend.

Break down your fleet by any tag

Use quick filter dropdowns at the top of the page to filter by a specific Provider, Device Type, Cluster, Region, Service, Data Center, Environment, or Team.

You can also Search or Group by other tags using the search and group-by fields. For example, with Host selected, group by Team to view a table entry for each unique team. Click the > button next to any entry to see the hosts used by that team and the GPU devices accelerating those hosts.

Note: You can only Group by one additional tag.

If you select Cluster or Host, you can click on the > button next to each table entry to view a cluster’s hosts or a host’s devices, respectively.
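
For example, a search like the following (the tag values are hypothetical) narrows the table to one team’s GPU hosts on a single provider:

```
team:ml-inference provider:aws
```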

GPU Fleet table showing services with their device types, with the row expand button highlighted

Note: The Cluster table is only populated if you use Kubernetes.

Filter dropdowns and Group by selector at the top of the GPU Fleet page

Use-case-driven views

Datadog guides you through your provisioning and performance optimization workflows with two dedicated use-case-driven views.

Provisioning

The Provisioning tab shows key recommendations and metrics insights for allocating and managing your capacity.

The Provisioning use-case driven view

Built-in recommendations:

  • Datadog proactively detects thermal throttling and hardware defects, and recommends remediation based on hardware errors such as ECC and XID errors.
  • Datadog detects inactive devices that could be provisioned to workloads, so that devices don’t sit idle (see the sketch below).
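
As an illustration, idleness here can be read from the Graphics Engine Activity metric listed below, which this page expresses as a percentage. A minimal monitor sketch for surfacing idle devices (the 4-hour window, 5% threshold, and gpu_uuid grouping tag are assumptions, not values from this page):

```
avg(last_4h):avg:gpu.gr_engine_active{*} by {host,gpu_uuid} < 5
```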

Metrics relevant for your provisioning workflow:

  • ECC Errors
  • XID Errors
  • Graphics Engine Activity
  • GPU Utilization
  • GPU Memory
  • Allocated Devices (Only available for Kubernetes users)
  • Active Devices
  • Idle Cost

Performance

The Performance tab helps you understand workload execution and tune GPU utilization to use your devices more effectively.

The Performance use-case driven view

Built-in recommendations:

  • If your workloads are CPU-intensive, Datadog flags hosts with CPU saturation and recommends solutions (see the monitor sketch below).
  • If your workloads aren’t effectively using their allocated GPU devices, Datadog provides recommendations for tuning workloads to get more value out of their capacity.
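
For example, a minimal monitor sketch for the CPU saturation check above, built on the system.cpu.user metric listed below (the 15-minute window and 90% threshold are assumptions):

```
avg(last_15m):avg:system.cpu.user{*} by {host} > 90
```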

Metrics relevant for your performance workflow:

  • ECC Errors
  • XID Errors
  • Graphics Engine Activity
  • GPU Utilization
  • GPU Memory
  • Effective Devices
  • Power
  • Temperature
  • PCIe RX Throughput
  • PCIe TX Throughput
  • CPU Utilization

Summary Graph

After selecting Cluster, Host, or Device, the Summary Graph displays key resource telemetry across your entire GPU infrastructure grouped by that selection. Expand the section below to see a table of the available metrics and what they represent.

| Metric | Definition | Metric Name |
| ------ | ---------- | ----------- |
| Provisioned Devices | Breakdown of provisioned devices by active and effective devices. | gpu.device.total |
| Allocated Devices | (Only available if using Kubernetes) Count of devices that have been allocated to a workload. | gpu.device.total |
| Active Devices | Count of devices that are actively used for a workload or are busy. If using Kubernetes: count of allocated devices that are actively used for a workload. | gpu.gr_engine_active |
| Effective Devices | Count of devices that are used and working for more than 50% of the selected time frame. | gpu.sm_active |
| Core Utilization | (Only available if System Probe is enabled) Cores used / cores limit for GPU processes; a measure of temporal core utilization. | gpu_core_utilization |
| GPU Memory | Ratio (%) of GPU memory used to total GPU memory limit. | 100 - (gpu.memory.free / gpu.memory.limit * 100) |
| PCIe RX Throughput | Bytes received through PCIe from the GPU device per second. | gpu.pci.throughput.rx |
| PCIe TX Throughput | Bytes transmitted through PCIe to the GPU device per second. | gpu.pci.throughput.tx |
| Graphics Engine Activity | Fraction of time the GPU was performing any compute work during the interval; a coarse signal of whether the GPU is busy or idle. | gpu.gr_engine_active |
| GPU Utilization | Average % of time each streaming multiprocessor was active (lower values indicate idle time). | gpu.sm_active |
| Power | Power usage for the GPU device. Note: on GA100 and earlier architectures, this is the instantaneous power at that moment; on newer architectures, it is the average power draw (in watts) over one second. | gpu.power.usage |
| Temperature | Temperature of the GPU device. | gpu.temperature |
| SM Clock | SM clock frequency in MHz. | gpu.clock_speed.sm |
| Memory Free | Amount of available (free) GPU memory. | gpu.memory.free |
| GPU Saturation | Measures how fully the GPU’s parallel execution capacity is used during the time frame (average ratio of active warps to the maximum warps supported per streaming multiprocessor, across all SMs). | gpu.sm_occupancy |
| NVLink RX | Total RX of all NVLink links. | gpu.nvlink.throughput.raw.rx |
| NVLink TX | Total TX of all NVLink links. | gpu.nvlink.throughput.raw.tx |
| NVLink Active Links | Number of active NVLink links for the device. | gpu.nvlink.count.active |
| ECC Errors | Total count of uncorrected ECC errors. | gpu.errors.ecc.uncorrected.total |
| XID Errors | Count of NVIDIA XID errors, indicating hardware- or driver-level issues. | gpu.errors.xid.total |
| CPU Utilization | % of time the CPU spent running user space processes. | system.cpu.user |
| Host Uptime | Time since the host was last started. | system.uptime |
| Host I/O Utilization | % of CPU time during which I/O requests were issued to the GPU device. | system.io.util |
| Host Memory | % of usable memory in use. | system.mem.pct_usable |
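
Note that the GPU Memory entry above is a derived value rather than a single metric. A sketch of the equivalent dashboard query, using a hypothetical cluster tag value:

```
100 - (avg:gpu.memory.free{cluster:training-a100} / avg:gpu.memory.limit{cluster:training-a100} * 100)
```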

If you’ve selected an additional tag to group by—for example, team—every unique timeseries in the Summary Graph corresponds to a team’s value for the selected metric.
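
In query terms, that grouping corresponds to adding a by clause to one of the metrics above, for example:

```
avg:gpu.gr_engine_active{*} by {team}
```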

Inventory of your GPU-powered infrastructure

This table breaks down your GPU-powered infrastructure by any tag of your choosing. If you haven’t specified an additional tag in the Group by field, results are grouped by your selected view: Cluster, Host, or Device.

By default, the table of results displays the following columns:

  • Device Name
  • Graphics Engine Activity
  • GPU Utilization
  • Core Utilization (Only if System Probe is enabled)
  • GPU Memory
  • Idle Cost
  • Recommendation

You can click on the gear icon to customize which metrics are displayed within the table. Expand the section below to see a full list of the available metrics.

| Category | Metric | Definition | Metric Name |
| -------- | ------ | ---------- | ----------- |
|  | Device Name | Type of GPU device. | gpu_device |
| Hardware Health | Total Errors | Total count of errors for the resource. | gpu.errors.total |
| Hardware Health | ECC Errors | Total count of uncorrected ECC errors. | gpu.errors.ecc.uncorrected.total |
| Hardware Health | XID Errors | Count of NVIDIA XID errors, indicating hardware- or driver-level issues. | gpu.errors.xid.total |
| Utilization | Graphics Engine Activity | Fraction of time the GPU was performing any compute work during the interval; a coarse signal of whether the GPU is busy or idle. | gpu.gr_engine_active |
| Utilization | GPU Saturation | Measures how fully the GPU’s parallel execution capacity is used during the time frame (average ratio of active warps to the maximum warps supported per streaming multiprocessor, across all SMs). | gpu.sm_occupancy |
| Utilization | Core Utilization | (Only available if System Probe is enabled) Cores used / cores limit for GPU processes; a measure of temporal core utilization. | gpu_core_utilization |
| Utilization | GPU Idle | % of time the GPU device is idle. | 100 - gpu.gr_engine_active |
| Provisioning | Idle Cost | (Only nonzero for time frames longer than 2 days) The cost of GPU resources that are reserved and allocated, but not used. |  |
| Provisioning | Allocated Devices | (Only available if using Kubernetes) Count of devices that have been allocated to a workload. | gpu.device.total |
| Provisioning | Unallocated Devices | Count of devices not allocated and available for use during the time frame. |  |
| Provisioning | Active Devices | Count of devices that are actively used for a workload or are busy. If using Kubernetes: count of allocated devices that are actively used for a workload. | gpu.gr_engine_active |
| Provisioning | Effective Devices | Count of devices that are used and working for more than 50% of the selected time frame. | gpu.sm_active |
| Performance | CPU Utilization | % of time the CPU spent running user space processes. | system.cpu.user |
| Performance | Host Uptime | Time since the host was last started. | system.uptime |
| Performance | Host I/O Utilization | % of CPU time during which I/O requests were issued to the GPU device. | system.io.util |
| Performance | Host Memory | % of usable memory in use. | system.mem.pct_usable |
| Performance | GPU Utilization | Average % of time each streaming multiprocessor was active (lower values indicate idle time). | gpu.sm_active |
| Performance | GPU Memory | Ratio (%) of GPU memory used to total GPU memory limit. | 100 - (gpu.memory.free / gpu.memory.limit * 100) |
| Performance | Power | Power usage for the GPU device. Note: on GA100 and earlier architectures, this is the instantaneous power at that moment; on newer architectures, it is the average power draw (in watts) over one second. | gpu.power.usage |
| Performance | Temperature | Temperature of the GPU device. | gpu.temperature |
| Performance | SM Clock | SM clock frequency in MHz. | gpu.clock_speed.sm |
| Performance | PCIe RX Throughput | Bytes received through PCIe from the GPU device per second. | gpu.pci.throughput.rx |
| Performance | PCIe TX Throughput | Bytes transmitted through PCIe to the GPU device per second. | gpu.pci.throughput.tx |
| Performance | NVLink RX | Total RX of all NVLink links. | gpu.nvlink.throughput.raw.rx |
| Performance | NVLink TX | Total TX of all NVLink links. | gpu.nvlink.throughput.raw.tx |
| Performance | NVLink Active Links | Number of active NVLink links for the device. | gpu.nvlink.count.active |
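
Because the Hardware Health columns feed the remediation recommendations described earlier, you can also alert on them directly. A minimal monitor sketch over the table’s XID error metric (the 1-hour window and grouping are assumptions):

```
sum(last_1h):sum:gpu.errors.xid.total{*} by {host} > 0
```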

Details side panel

Clicking any row in the Fleet table opens a side panel with more details for the selected cluster, host, or device.

Connected Entities

Datadog’s GPU Monitoring doesn’t need to rely on NVIDIA’s DCGM Exporter. It uses the Datadog Agent to observe GPUs directly, providing insight into GPU usage and costs for pods and processes. Under the Connected Entities section in any detail view, you can see SM activity, GPU core utilization (only if System Probe is enabled), and the memory usage of pods, processes, and Slurm jobs. This helps you identify which workloads to cut or optimize to decrease total spend.

Note: The Pods tab is only available if you’re using Kubernetes.

For clusters, the side panel provides a cluster-specific funnel that identifies:

  • Number of Total, Allocated (Kubernetes users only), Active, and Effective devices within that particular cluster

  • Estimated total and idle cost of that cluster

  • Connected entities of that cluster: pods, processes, and Slurm jobs

  • Four key metrics (customizable) for that cluster: Core Utilization (only if System Probe is enabled), Memory Utilization, PCIe Throughput, and Graphics Activity

  • Table of hosts associated with that cluster

    Cluster specific side panel that breaks down idle devices, costs and connected entities

For hosts, the side panel provides a host-specific view that identifies:

  • Host-related metadata such as provider, instance type, CPU utilization, system memory used, system memory total, system IO util, SM activity, and temperature

  • (Only available for Kubernetes users) The specific GPU devices allocated to that host, sorted by Graphics Engine Activity

  • Connected Entities of that host: pods, processes, and Slurm jobs

    Host specific side panel that displays the GPU devices tied to that host and Connected Entities

For devices, the side panel provides a device-specific view that identifies:

  • Recommendations (if any) for how to use this device more effectively

  • Device-related details: device type, SM activity, and temperature

  • Four key metrics tied to GPUs: SM Activity, Memory Utilization, Power, and Graphics Engine Activity

  • Connected Entities of that device: pods and processes

    Device specific side panel that displays recommendations for how to use the device more effectively and other key telemetry.

Installation recommendations

Datadog actively surveys your infrastructure and detects installation gaps that may diminish the value you get from GPU Monitoring. In this modal, you can find installation recommendations for getting the most out of GPU Monitoring: for example, making sure your hosts run the latest version of the Datadog Agent, installing the latest version of the NVIDIA driver, and checking for misconfigured hosts.

To use advanced GPU Monitoring features, such as attribution of GPU resources to related processes or Slurm jobs, you must enable Live Processes and the Slurm integration, respectively.
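
As a minimal sketch, Live Processes is enabled in the host Agent’s main configuration file; the Slurm integration follows the Agent’s standard conf.d layout. Consult the Live Processes and Slurm integration documentation for the authoritative settings:

```yaml
# datadog.yaml: enable Live Processes so GPU usage can be attributed to processes
process_config:
  process_collection:
    enabled: true

# Slurm job attribution additionally requires the Slurm integration,
# configured under conf.d/slurm.d/conf.yaml.
```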

Modal containing installation guidance for a smoother GPU Monitoring user experience
