Workload Classification & Health States¶

KempnerPulse classifies each GPU into one of 12 workload categories and one of 4 health states every refresh cycle. The classification uses DCGM profiling counters and follows thresholds recommended by NVIDIA’s DCGM profiling metric guidance.

Real Utilization¶

The dashboard computes a single Real Utilization score per GPU:

Real Util = clamp(0, 100,
              W_sm    × SM_ACTIVE
            + W_tensor × TENSOR_ACTIVE
            + W_dram   × DRAM_ACTIVE
            + W_gr     × GR_ENGINE_ACTIVE)

All four inputs are DCGM profiling-level hardware counters (0 to 1 range, displayed as 0 to 100 %). The weights are configurable via --weights or the convenience preset flags.

Weight Presets¶

Preset	Flag	W_sm	W_tensor	W_dram	W_gr	Best For
AI/ML (default)	`--ai-weights`	0.35	0.35	0.20	0.10	Deep-learning training, LLM inference, transformers
HPC	`--hpc-weights`	0.45	0.15	0.25	0.15	Scientific computing, mixed CUDA, simulations
Memory-bound	`--mem-weights`	0.35	0.10	0.40	0.15	Bandwidth-heavy workloads, stencil codes

Custom weights: --weights W_SM,W_TENSOR,W_DRAM,W_GR (auto-normalized to sum to 1).

What the Components Mean¶

Component	Metric	Meaning
SM Active	`DCGM_FI_PROF_SM_ACTIVE`	Fraction of cycles with work assigned to streaming multiprocessors. Main compute signal.
Tensor Active	`DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`	Fraction of cycles tensor cores are running. Critical for mixed-precision and AI workloads.
DRAM Active	`DCGM_FI_PROF_DRAM_ACTIVE`	Fraction of cycles HBM is moving data. Practical peak ~80 %.
GR Engine Active	`DCGM_FI_PROF_GR_ENGINE_ACTIVE`	Fraction of time the graphics/compute engine is active. Falls back to `GPU_UTIL` when unavailable.

NVIDIA Reference Points¶

The classification thresholds are derived from NVIDIA documentation:

Metric	Threshold	NVIDIA Guidance
SM Active	≥ 80 %	“Necessary, but not sufficient, for effective GPU use”
SM Active	< 50 %	“Likely indicates ineffective GPU usage”
DRAM Active	≥ 50 %	Heavy memory traffic (practical peak ~80 %)
Tensor Active	~93 %	Full saturation as measured by `dcgmproftester`

Workload Classification Table¶

Categories are evaluated in order; the first matching rule wins. This means a GPU running tensor-heavy compute will not also be labeled “compute-heavy”, even if SM Active ≥ 80 %.

#	Status	Bottleneck	Thresholds	Rationale
1	idle	idle	Real Util < 5 %, GR < 5 %, DRAM < 5 %, no I/O	Nothing is running on the GPU.
2	tensor-heavy compute	compute	Tensor ≥ 50 % and SM ≥ 60 %	DL training or large-scale inference at peak tensor throughput.
3	tensor compute	compute	Tensor ≥ 15 % and SM ≥ 40 %	Meaningful tensor-core activity: mixed precision, moderate load.
4	FP64 / HPC compute	compute	FP64 ≥ 20 % and SM ≥ 50 %	Scientific double-precision workload.
5	I/O or data-loading	io	(Memcpy ≥ 40 % or PCIe RX/TX ≥ 1 GB/s) and SM < 30 %	Heavy host ↔ device transfer; SMs mostly idle.
6	memory-bound	memory	DRAM ≥ 50 % and SM < 50 %	Bandwidth limited. NVIDIA says SM < 50 % is likely ineffective.
7	compute-heavy	compute	SM ≥ 80 %	SMs well utilized. NVIDIA says ≥ 80 % is necessary for effective use.
8	compute-active	compute	SM ≥ 50 %	Moderate SM use, no tensor dominance.
9	memory-active	memory	DRAM ≥ 40 %	Significant DRAM traffic with some SM activity.
10	busy, low SM use	mixed	GR ≥ 40 % and SM < 25 %	Engine active but SMs underutilized. Likely overhead, sync, or small kernels.
11	low utilization	mixed	GR < 15 % and SM < 15 % and DRAM < 15 %	Barely any measurable activity.
12	mixed / moderate	mixed	(fallthrough)	No single dominant pattern.

Bottleneck Key¶

The bottleneck key is used for color-coding in the dashboard:

Key	Color	Meaning
`idle`	dim	GPU is not doing work.
`compute`	green	GPU is primarily compute-bound.
`io`	cyan	GPU is transfer/copy-bound.
`memory`	magenta	GPU is memory-bandwidth-bound.
`mixed`	yellow	No single dominant workload pattern.

Metric Profiles¶

Each workload category has a distinctive metric signature across the six axes: SM Active, Tensor Active, DRAM Active, GR Engine Active, Memcpy/IO, and FP64 Active.

Overlay shows how all 12 categories compare on a single chart:

Classification radar overlay

Individual profiles for each category:

Classification radar grid

Health States¶

Health is evaluated independently from workload classification. It checks error counters and temperatures against per-model thresholds.

Health Status Levels¶

Conditions are evaluated in order; the first matching condition wins.

Status	Style	Condition	Action
CRIT	bold red	Row-remap failure > 0 or uncorrectable remapped rows > 0	GPU has hardware memory errors. Remove from production immediately.
WARN	yellow	PCIe replay rate > 0/s	PCIe link quality issue; retransmissions occurring. Monitor closely.
HOT	yellow	GPU temp ≥ warning threshold or memory temp ≥ warning threshold	Thermal throttling zone. Check cooling and airflow.
OK	green	(none of the above)	Normal operating condition.

Temperature Thresholds by GPU Model¶

GPU Model	Normal	Warning	Critical
A100	85 °C	93 °C	95 °C
H100	85 °C	95 °C	105 °C
H200	80 °C	95 °C	105 °C
RTX 6000	85 °C	92 °C	105 °C
Other / unknown	85 °C	93 °C	105 °C

Health Metrics¶

Check	DCGM Metric	Trigger	Health Status
ECC Row Remap Failure	`DCGM_FI_DEV_ROW_REMAP_FAILURE`	> 0	CRIT
Uncorrectable Remapped Rows	`DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`	> 0	CRIT
PCIe Replay Rate	`DCGM_FI_DEV_PCIE_REPLAY_COUNTER` (rate)	> 0/s	WARN
GPU Temperature	`DCGM_FI_DEV_GPU_TEMP`	≥ model warning threshold	HOT
Memory Temperature	`DCGM_FI_DEV_MEMORY_TEMP`	≥ model warning threshold	HOT