DCGM Metrics Reference
KempnerPulse reads DCGM metrics through one of two backends: the dcgm-exporter Prometheus HTTP endpoint (--backend prometheus, the default) or direct dcgmi dmon queries (--backend dcgm). This page lists every metric the dashboard consumes, grouped by category.
NVIDIA docs: DCGM API Reference: Field Identifiers
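As a rough illustration of what reading the Prometheus backend involves, the sketch below parses Prometheus text-exposition lines into (name, labels, value) tuples. It is not KempnerPulse's actual parser. The function name `parse_samples`, the sample text, and the host name `node01` are hypothetical and exist only for this example.

```python
import re

# One sample line looks like: NAME{label="value",...} VALUE [TIMESTAMP]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Illustrative input only; the label set and value are made up for this sketch.
SAMPLE_TEXT = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node01"} 87
"""


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_samples(SAMPLE_TEXT)
```

In practice the text would come from an HTTP GET against the exporter's metrics endpoint, and a maintained Prometheus client library would be a sturdier choice than a hand-written regex.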
Profiling Metrics (ratio 0 to 1, displayed as 0 to 100 %)

These come from hardware performance counters and give the most accurate picture of what the GPU is actually doing.

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| SM Active | | Fraction of cycles where at least one warp is active on an SM. Primary compute signal for Real Utilization. | Real Util, Classification, Fleet / Focus / Plot views |
| Tensor Active | | Fraction of cycles the tensor core pipeline is active. Critical for AI/LLM workloads. | Real Util, Classification, Fleet / Focus / Plot views |
| DRAM Active | | Fraction of cycles device memory (HBM) is actively moving data. Practical peak ~80 %. | Real Util, Classification, Fleet / Focus / Plot views |
| GR Engine Active | | Fraction of time the graphics/compute engine is active (profiling-level). Falls back to | Real Util, Classification, Fleet / Focus / Plot views |
| SM Occupancy | | Ratio of resident warps to theoretical maximum on SMs. | Focus view |
| FP16 Pipe | | Fraction of cycles FP16 (half-precision) pipe is active. | Focus view |
| FP32 Pipe | | Fraction of cycles FP32 (single-precision) pipe is active. | Focus view |
| FP64 Pipe | | Fraction of cycles FP64 (double-precision) pipe is active. Used in HPC classification. | Classification, Focus view |
| TC FP16/BF16 (HMMA) | | Tensor core FP16/BF16 HMMA activity (Hopper/Ada+). | Focus view (if available) |
| TC INT8 (IMMA) | | Tensor core INT8 IMMA activity. | Focus view (if available) |
| TC FP64 (DFMA) | | Tensor core FP64 DFMA activity. | Focus view (if available) |
| TC TF32/FP32 (DMMA) | | Tensor core TF32/FP32 DMMA activity. | Focus view (if available) |
| TC FP8 (QMMA) | | Tensor core FP8 QMMA activity. | Focus view (if available) |

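Profiling values arrive as ratios between 0 and 1 and are displayed as percentages. A minimal sketch of that conversion follows. The helper name `ratio_to_percent`, the clamping to the 0 to 1 range, and the handling of missing values are assumptions made for this example, not documented dashboard behavior.

```python
from typing import Optional


def ratio_to_percent(ratio: Optional[float]) -> Optional[float]:
    """Convert a 0-to-1 profiling ratio to a 0-to-100 percentage.

    A missing sample (None) stays None. Clamping out-of-range readings is
    an assumption of this sketch.
    """
    if ratio is None:
        return None
    clamped = max(0.0, min(1.0, ratio))
    return clamped * 100.0
```

For instance, an SM Active reading of 0.5 would be shown as 50 %.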
Device-Level Utilization (0 to 100 %)

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| GPU Util | | Percentage of time a kernel was running (same as | Fleet / Focus / Plot views, fallback for GR Engine |
| Mem Copy Util | | Memory-copy engine utilization; ≥ 40 % triggers I/O classification. | Classification |

Memory (absolute MiB)

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| FB Used | | Frame-buffer (VRAM) in use, MiB. | Memory bar, Fleet / Focus views |
| FB Free | | Frame-buffer free, MiB. | Memory total calculation |
| FB Reserved | | Frame-buffer reserved by the driver, MiB. | Memory total calculation |

Memory total = FB_USED + FB_FREE + FB_RESERVED.
Memory used % = 100 × FB_USED / total.
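The two formulas above can be sketched in a few lines. The function name `memory_summary` and the zero-total guard are assumptions made for this example; the arithmetic itself follows the formulas as stated.

```python
from typing import Tuple


def memory_summary(fb_used_mib: float, fb_free_mib: float,
                   fb_reserved_mib: float) -> Tuple[float, float]:
    """Return (total MiB, used percent) from the three frame-buffer readings."""
    total = fb_used_mib + fb_free_mib + fb_reserved_mib
    # Guard against an all-zero reading (an assumption; the source does not
    # say how the dashboard handles this case).
    used_pct = 100.0 * fb_used_mib / total if total > 0 else 0.0
    return total, used_pct


total_mib, used_pct = memory_summary(40000, 40000, 1920)
```

With 40000 MiB used, 40000 MiB free, and 1920 MiB reserved, the total is 81920 MiB and the used share is about 48.8 %.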
Temperature & Power

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| GPU Temp | | GPU core temperature, °C. | Health check (HOT), Fleet / Focus views |
| Memory Temp | | HBM / VRAM temperature, °C. | Health check (HOT), Focus view |
| Power Usage | | Current power draw, watts. Shown alongside | Fleet / Focus views |

Clocks

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| SM Clock | | Streaming Multiprocessor clock, MHz. | Focus view |
| Mem Clock | | Memory clock, MHz. | Focus view |

PCIe I/O (bytes/sec)

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| PCIe RX | | PCIe receive throughput. ≥ 1 GB/s triggers I/O classification. | Classification, Fleet / Focus / Plot views |
| PCIe TX | | PCIe transmit throughput. ≥ 1 GB/s triggers I/O classification. | Classification, Fleet / Focus / Plot views |

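The I/O triggers listed on this page (Mem Copy Util ≥ 40 %, PCIe RX or TX ≥ 1 GB/s) could be checked roughly as follows. This sketch covers only those three triggers. The real classifier may weigh other signals or apply a precedence order that this page does not describe. The function and constant names are hypothetical, and treating 1 GB as 10^9 bytes is an assumption.

```python
MEM_COPY_UTIL_THRESHOLD_PCT = 40.0
# Assumption: "1 GB/s" means 10**9 bytes per second rather than 2**30.
PCIE_THRESHOLD_BYTES_PER_S = 1e9


def has_io_trigger(mem_copy_util_pct: float,
                   pcie_rx_bytes_per_s: float,
                   pcie_tx_bytes_per_s: float) -> bool:
    """Return True if any of the documented I/O thresholds is met."""
    return (
        mem_copy_util_pct >= MEM_COPY_UTIL_THRESHOLD_PCT
        or pcie_rx_bytes_per_s >= PCIE_THRESHOLD_BYTES_PER_S
        or pcie_tx_bytes_per_s >= PCIE_THRESHOLD_BYTES_PER_S
    )
```

Any single trigger is enough here, which matches the wording "triggers I/O classification" on each row.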
NVLink (bytes, monotonic counter converted to rate)

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| NVLink BW Total | | Cumulative NVLink bytes. Dashboard computes Δ/s and displays GB/s. | Fleet / Focus views |

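Turning a monotonic byte counter into a rate means dividing the change in the counter by the elapsed time between two samples. The sketch below shows one way to do that. The function names are hypothetical, returning None on a counter reset or a non-positive interval is an assumption, and GB is taken as 10^9 bytes.

```python
from typing import Optional


def counter_rate(prev_value: float, prev_time_s: float,
                 curr_value: float, curr_time_s: float) -> Optional[float]:
    """Return the per-second rate between two samples of a monotonic counter.

    Returns None when the interval is not positive or the counter went
    backwards (for example after a driver reset), since no meaningful rate
    can be derived from such a pair.
    """
    elapsed = curr_time_s - prev_time_s
    delta = curr_value - prev_value
    if elapsed <= 0 or delta < 0:
        return None
    return delta / elapsed


def bytes_per_s_to_gb_per_s(rate: float) -> float:
    """Convert bytes/s to GB/s, assuming decimal gigabytes (10**9 bytes)."""
    return rate / 1e9


rate = counter_rate(1_000_000_000, 10.0, 6_000_000_000, 12.0)
```

Here 5 × 10^9 bytes moved in 2 seconds, so the rate is 2.5 × 10^9 bytes/s, which displays as 2.5 GB/s.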
Energy (monotonic counter)

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| Total Energy | | Cumulative energy in millijoules. Displayed as joules. | Focus view |

Health & Error Counters

| Metric | DCGM Field | Description | Used In |
|---|---|---|---|
| PCIe Replay | | Monotonic PCIe replay count. Rate > 0/s triggers WARN health. | Health check |
| Row Remap Failure | | Row-remap failures. > 0 triggers CRIT health. | Health check |
| Uncorrectable Remapped Rows | | Uncorrectable row remaps. > 0 triggers CRIT health. | Health check |

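The three counter rules in this table could be combined as in the sketch below. The function name `counter_health`, the "OK" label for the healthy case, and letting CRIT take precedence over WARN are assumptions made for this example. The HOT temperature check is left out because this page does not give its thresholds.

```python
def counter_health(pcie_replay_rate_per_s: float,
                   row_remap_failure: int,
                   uncorrectable_remapped_rows: int) -> str:
    """Map the documented error-counter rules to a health label."""
    # Either row-remap condition is critical on its own.
    if row_remap_failure > 0 or uncorrectable_remapped_rows > 0:
        return "CRIT"
    # Any ongoing PCIe replays produce a warning.
    if pcie_replay_rate_per_s > 0:
        return "WARN"
    # "OK" is a placeholder label; the source does not name the healthy state.
    return "OK"
```

The PCIe replay input is a rate rather than the raw count, so it would first be derived from two successive samples of the monotonic counter.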