kempnerpulse.reader.dcgmi

Layer 1 (Read) — direct DCGM backend (dcgmi dmon).

Streams hardware counters from a local dcgmi dmon subprocess and emits one RawRecord per GPU per sampling tick, keyed by DCGM field names (the source’s own vocabulary). An N/A reading becomes None; it is never coerced to 0. Assigning meaning to these fields (canonical names, units, scaling) is Layer 2’s job, not this module’s.

Entry points:

DcgmiBackend — a persistent dcgmi dmon -c 0 stream for live monitoring.
read_once — a single synchronous dcgmi dmon -c 2 collection, for one-shot queries where a long-lived stream is unnecessary.
parse_dmon_block — pure text -> RawRecord parser (the testable core).
resolve_dcgm_gpu_ids — map process-visible GPUs to physical dcgmi IDs.

Functions

`parse_dmon_block`(text, *[, source_version, ...])	Parse `dcgmi dmon` rows into `RawRecord` objects, one per `GPU <id>` row.
`read_once`([gpu_ids, interval_ms, timeout])	Collect a single tick from `dcgmi dmon` as `RawRecord` objects.
`resolve_dcgm_gpu_ids`(discovery_stdout)	Resolve the physical GPU IDs visible to this process via dcgmi discovery.
`run_dmon_once`([gpu_ids, interval_ms, timeout])	Run one `dcgmi dmon -c 2` collection and return its raw stdout text.

Classes

`DcgmiBackend`	Persistent `dcgmi dmon -c 0` stream emitting `RawRecord` objects.

kempnerpulse.reader.dcgmi.resolve_dcgm_gpu_ids(discovery_stdout)

Resolve the physical GPU IDs visible to this process via dcgmi discovery.

Inside a SLURM cgroup, CUDA_VISIBLE_DEVICES is renumbered so that the visible GPUs start at 0, but dcgmi operates outside the cgroup and uses physical GPU indices. The two numberings are reconciled by matching on GPU UUID.

Parameters: discovery_stdout (str) – stdout from dcgmi discovery -l.
Returns: (physical_ids, physical_to_local_map). physical_to_local_map maps each physical GPU ID to its local (cgroup) GPU ID, so that exports can match dcgmi GPU IDs against nvidia-smi process IDs.
Return type: Tuple[List[str], Dict[str, str]]
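
The UUID reconciliation described above can be sketched as a pure function. This is a minimal illustration, not this module's implementation: `match_by_uuid` and both input dictionaries are hypothetical, and the sketch assumes the UUIDs have already been extracted from `dcgmi discovery -l` (physical side) and from inside the cgroup (local side).

```python
from typing import Dict, List, Tuple


def match_by_uuid(
    physical_uuids: Dict[str, str],
    local_uuids: Dict[str, str],
) -> Tuple[List[str], Dict[str, str]]:
    """Pair physical and local GPU IDs that share a UUID.

    physical_uuids: physical GPU ID -> UUID (assumed parsed from dcgmi discovery).
    local_uuids:    local GPU ID -> UUID (assumed collected inside the cgroup).
    """
    # Invert the local view so each UUID can be looked up directly.
    uuid_to_local = {uuid: local_id for local_id, uuid in local_uuids.items()}

    physical_to_local: Dict[str, str] = {}
    for physical_id, uuid in physical_uuids.items():
        local_id = uuid_to_local.get(uuid)
        if local_id is not None:  # this GPU is visible to the process
            physical_to_local[physical_id] = local_id

    # Sort numerically so "10" follows "9" rather than "1".
    physical_ids = sorted(physical_to_local, key=int)
    return physical_ids, physical_to_local


# A 4-GPU node where the job was allocated physical GPUs 2 and 3.
physical = {"0": "GPU-aaaa", "1": "GPU-bbbb", "2": "GPU-cccc", "3": "GPU-dddd"}
local = {"0": "GPU-cccc", "1": "GPU-dddd"}
ids, mapping = match_by_uuid(physical, local)
```

GPUs that are not visible to the process simply drop out of both return values, which matches the documented return shape of (physical_ids, physical_to_local_map).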

kempnerpulse.reader.dcgmi.parse_dmon_block(text, *, source_version='dcgmi', timestamp=None, wallclock=None)

Parse dcgmi dmon rows into RawRecord objects, one per GPU <id> row.

Each record’s fields map DCGM field names to raw values (None for N/A). Header (#), ID, and non-data lines are skipped. Every record produced from one call shares the same timestamp/wallclock (they belong to one sampling tick).

Layout (columns follow DCGM_DMON_FIELDS order):

#Entity   GPUTL  POWER  GTEMP  MTEMP  ...
ID
GPU 0     72     155.3  65     58     ...

Parameters:

text (str)
source_version (str)
timestamp (float | None)
wallclock (float | None)

Return type:

List[RawRecord]
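
The row handling described above can be illustrated with a simplified parser. This is a sketch under stated assumptions rather than the module's code: `parse_block_sketch` is a hypothetical name, it returns plain dictionaries instead of `RawRecord` objects, it takes column names from the `#` header line, and it keeps values as the raw strings printed by dcgmi.

```python
from typing import Dict, List, Optional


def parse_block_sketch(text: str) -> List[Dict[str, Optional[str]]]:
    """Parse dmon-style text into one dict per `GPU <id>` row."""
    columns: List[str] = []
    rows: List[Dict[str, Optional[str]]] = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("#"):
            # Header line: the first token labels the entity column.
            columns = parts[1:]
            continue
        if parts[0] != "GPU" or len(parts) < 2:
            continue  # the "ID" line and any other non-data line
        # N/A becomes None; it is never coerced to 0.
        fields: Dict[str, Optional[str]] = {
            name: (None if value == "N/A" else value)
            for name, value in zip(columns, parts[2:])
        }
        rows.append({"gpu_id": parts[1], **fields})
    return rows


sample = "\n".join([
    "#Entity   GPUTL  POWER  GTEMP  MTEMP",
    "ID",
    "GPU 0     72     155.3  65     N/A",
    "GPU 1     N/A    98.0   51     47",
])
records = parse_block_sketch(sample)
```

Keeping the parser free of I/O is what makes it easy to test against captured text, which is the role the module assigns to parse_dmon_block.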

kempnerpulse.reader.dcgmi.run_dmon_once(gpu_ids=None, interval_ms=100, timeout=15.0)

Run one dcgmi dmon -c 2 collection and return its raw stdout text.

Two samples are requested because profiling fields (1001-1010) return N/A on the first sample of a cold invocation; the valid second tick lets a downstream last-non-None-wins merge recover the real values.

Parameters:

gpu_ids (List[str] | None)
interval_ms (int)
timeout (float)

Return type:

str
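
A two-sample invocation of this kind might be assembled as follows. This is a hedged sketch: `build_dmon_command` and `run_once_sketch` are hypothetical helpers, the field IDs in the example are illustrative rather than the module's DCGM_DMON_FIELDS, and only the standard `dcgmi dmon` flags `-e` (field IDs), `-d` (delay in ms), `-c` (sample count), and `-i` (GPU IDs) are used.

```python
import subprocess
from typing import List, Optional, Sequence


def build_dmon_command(
    field_ids: Sequence[int],
    gpu_ids: Optional[List[str]] = None,
    interval_ms: int = 100,
    count: int = 2,
) -> List[str]:
    """Build an argv list for one bounded `dcgmi dmon` run."""
    cmd = [
        "dcgmi", "dmon",
        "-e", ",".join(str(field_id) for field_id in field_ids),
        "-d", str(interval_ms),
        "-c", str(count),  # two samples: the first may carry N/A profiling fields
    ]
    if gpu_ids:
        cmd += ["-i", ",".join(gpu_ids)]
    return cmd


def run_once_sketch(cmd: List[str], timeout: float = 15.0) -> str:
    """Run the command and return stdout; raises on timeout or non-zero exit."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout


# Illustrative field IDs only (a utilization field and a profiling field).
example_cmd = build_dmon_command([203, 1002], gpu_ids=["0", "1"])
```

Separating command construction from execution keeps the argument handling checkable on machines without dcgmi installed.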

kempnerpulse.reader.dcgmi.read_once(gpu_ids=None, interval_ms=100, timeout=15.0)

Collect a single tick from dcgmi dmon as RawRecord objects.

Returns the records from both requested samples in order; the caller’s last-non-None-wins merge keeps the valid second-tick values.

Parameters:

gpu_ids (List[str] | None)
interval_ms (int)
timeout (float)

Return type:

List[RawRecord]
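
The caller-side last-non-None-wins merge mentioned above could look like the following. It is a minimal sketch, assuming each record can be reduced to a (gpu_id, fields) pair; `merge_last_non_none` and the field names in the example are hypothetical and not part of this module.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Fields = Dict[str, Optional[Any]]


def merge_last_non_none(samples: Iterable[Tuple[str, Fields]]) -> Dict[str, Fields]:
    """Merge samples per GPU; a later None never overwrites an earlier value."""
    merged: Dict[str, Fields] = {}
    for gpu_id, fields in samples:
        slot = merged.setdefault(gpu_id, {})
        for name, value in fields.items():
            # Take any real value; accept None only if nothing is stored yet.
            if value is not None or name not in slot:
                slot[name] = value
    return merged


# Two samples for GPU 0: the cold first sample lacks the profiling field.
samples = [
    ("0", {"GPUTL": 70, "SMACT": None}),
    ("0", {"GPUTL": 72, "SMACT": 0.41}),
]
merged = merge_last_non_none(samples)
```

A field that is N/A in every sample stays None in the result, so the absence of a reading remains visible downstream.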

class kempnerpulse.reader.dcgmi.DcgmiBackend

Bases: object

Persistent dcgmi dmon -c 0 stream emitting RawRecord objects.

The reader is thread-free: stream_ticks is a blocking generator that iterates the subprocess stdout, groups rows into ticks, and yields one list of records per tick. stream flattens that to the per-record Backend contract. The first tick is dropped (profiling fields are N/A on a cold start). Concurrency, if needed, is the caller’s responsibility.

__init__()

Return type: None

open(config)

Parameters: config (ReaderConfig)
Return type: None

close()

Terminate the subprocess. Safe to call after a failed open.

Return type: None

stream_ticks()

Yield one list of RawRecord objects per sampling tick.

A tick ends when a GPU <id> row repeats an id already buffered for the current tick. The first tick is dropped. Raises DcgmStreamError if the subprocess exits non-zero (unless close was called).

Return type: Iterator[List[RawRecord]]
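
The tick boundary rule can be sketched as a small generator. This is an illustration under assumptions, not the class's code: `group_ticks` is a hypothetical name, rows are plain dictionaries carrying a `gpu_id` key, and the sketch chooses not to yield a trailing buffer, since no repeated id has yet shown that tick to be complete.

```python
from typing import Dict, Iterable, Iterator, List, Set

Row = Dict[str, str]


def group_ticks(rows: Iterable[Row], drop_first: bool = True) -> Iterator[List[Row]]:
    """Group rows into ticks; a repeated GPU id closes the current tick."""
    tick: List[Row] = []
    seen: Set[str] = set()
    is_first = True
    for row in rows:
        if row["gpu_id"] in seen:
            # This id is already buffered, so the row starts a new tick.
            if not (is_first and drop_first):
                yield tick
            is_first = False
            tick, seen = [], set()
        tick.append(row)
        seen.add(row["gpu_id"])
    # Assumption: the unfinished trailing tick is discarded, not yielded.


# Three full ticks for GPUs 0 and 1, then the first row of a fourth.
rows = [{"gpu_id": gpu, "tick": str(n)} for n in (1, 2, 3) for gpu in ("0", "1")]
rows.append({"gpu_id": "0", "tick": "4"})
ticks = list(group_ticks(rows))
tick_labels = [[row["tick"] for row in tick] for tick in ticks]
```

Flattening to a per-record iterator, as stream is described as doing, would then amount to iterating over each yielded list in turn.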

stream()

Return type: Iterator[RawRecord]

property stderr: The subprocess stderr stream (for a consumer that drains it).

property caps: BackendCaps