kempnerpulse.reader.dcgmi

Layer 1 (Read) — direct DCGM backend (dcgmi dmon).

Streams hardware counters from a local dcgmi dmon subprocess and emits one RawRecord per GPU per sampling tick, keyed by DCGM field names (the source’s own vocabulary). An N/A reading becomes None; it is never coerced to 0. Assigning meaning to these fields (canonical names, units, scaling) is Layer 2’s job, not this module’s.

Entry points:
  • DcgmiBackend — a persistent dcgmi dmon -c 0 stream for live monitoring.

  • read_once — a single synchronous dcgmi dmon -c 2 collection, for one-shot queries where a long-lived stream is unnecessary.

  • parse_dmon_block — pure text -> RawRecord parser (the testable core).

  • resolve_dcgm_gpu_ids — map process-visible GPUs to physical dcgmi IDs.

Functions

parse_dmon_block(text, *[, source_version, ...])

Parse dcgmi dmon rows into RawRecord objects, one per GPU <id> row.

read_once([gpu_ids, interval_ms, timeout])

Run one dcgmi dmon -c 2 collection and return its rows as RawRecord objects.

resolve_dcgm_gpu_ids(discovery_stdout)

Resolve the physical GPU IDs visible to this process via dcgmi discovery.

run_dmon_once([gpu_ids, interval_ms, timeout])

Run one dcgmi dmon -c 2 collection and return its raw stdout text.

Classes

DcgmiBackend

Persistent dcgmi dmon -c 0 stream emitting RawRecord objects.

kempnerpulse.reader.dcgmi.resolve_dcgm_gpu_ids(discovery_stdout)

Resolve the physical GPU IDs visible to this process via dcgmi discovery.

Inside a SLURM cgroup, CUDA_VISIBLE_DEVICES is renumbered to start at 0, but dcgmi operates outside the cgroup and uses physical GPU indices. The two numbering schemes are reconciled by matching on GPU UUID.

Parameters:

discovery_stdout (str) – stdout from dcgmi discovery -l.

Returns:

(physical_ids, physical_to_local_map). physical_to_local_map maps each physical GPU ID to its local (cgroup) GPU ID, so that exports can match dcgmi GPU IDs against nvidia-smi process IDs.

Return type:

Tuple[List[str], Dict[str, str]]
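The UUID reconciliation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the helper name match_by_uuid, the local_uuids argument (UUIDs listed in cgroup-local index order, for example gathered from nvidia-smi -L), and the table layout in SAMPLE are all hypothetical. The documented function takes only discovery_stdout, so it must obtain the local UUIDs some other way.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed row shapes for `dcgmi discovery -l`; real output may vary by version.
_ID_ROW = re.compile(r"^\|\s*(\d+)\s*\|")
_UUID_ROW = re.compile(r"Device UUID:\s*(GPU-[0-9A-Fa-f-]+)")

# Hypothetical discovery output listing two physical GPUs.
SAMPLE = """\
+--------+-----------------------------+
| GPU ID | Device Information          |
+--------+-----------------------------+
| 0      | Name: GPU A                 |
|        | Device UUID: GPU-aaaa-0000  |
+--------+-----------------------------+
| 1      | Name: GPU B                 |
|        | Device UUID: GPU-bbbb-1111  |
+--------+-----------------------------+
"""


def match_by_uuid(
    discovery_stdout: str, local_uuids: List[str]
) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical helper: map physical dcgmi GPU IDs to local indices."""
    uuid_to_physical: Dict[str, str] = {}
    current_id: Optional[str] = None
    for line in discovery_stdout.splitlines():
        id_match = _ID_ROW.match(line)
        if id_match:
            current_id = id_match.group(1)  # start of a new GPU entry
        uuid_match = _UUID_ROW.search(line)
        if uuid_match and current_id is not None:
            uuid_to_physical[uuid_match.group(1)] = current_id

    physical_ids: List[str] = []
    physical_to_local: Dict[str, str] = {}
    for local_index, uuid in enumerate(local_uuids):
        physical = uuid_to_physical.get(uuid)
        if physical is None:
            continue  # UUID not reported by dcgmi; skipped in this sketch
        physical_ids.append(physical)
        physical_to_local[physical] = str(local_index)
    return physical_ids, physical_to_local
```

With the sample above, a job that was allocated only the second physical GPU sees it as local index 0, so match_by_uuid(SAMPLE, ["GPU-bbbb-1111"]) yields the physical ID "1" mapped to local ID "0".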

kempnerpulse.reader.dcgmi.parse_dmon_block(text, *, source_version='dcgmi', timestamp=None, wallclock=None)

Parse dcgmi dmon rows into RawRecord objects, one per GPU <id> row.

Each record’s fields map DCGM field names to raw values (None for N/A). Header (#), ID, and non-data lines are skipped. Every record produced from one call shares the same timestamp/wallclock (they belong to one sampling tick).

Layout (columns follow DCGM_DMON_FIELDS order):

#Entity   GPUTL  POWER  GTEMP  MTEMP  ...
ID
GPU 0     72     155.3  65     58     ...

Parameters:
  • text (str)

  • source_version (str)

  • timestamp (float | None)

  • wallclock (float | None)

Return type:

List[RawRecord]
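The row-skipping and N/A handling described above can be illustrated with a small stand-alone parser. This is a sketch under assumptions, not the real parse_dmon_block: SketchRecord is a hypothetical stand-in for RawRecord, parse_rows and its field_names argument are invented for the example, the keys reuse the column labels from the layout above as placeholders, and values are kept as strings because the source does not say whether they are converted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class SketchRecord:
    """Hypothetical stand-in for RawRecord; the real class may differ."""
    gpu_id: str
    fields: Dict[str, Optional[str]] = field(default_factory=dict)


def parse_rows(text: str, field_names: Sequence[str]) -> List[SketchRecord]:
    """Turn 'GPU <id> v1 v2 ...' rows into records, mapping N/A to None."""
    records: List[SketchRecord] = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the '#Entity' header, the 'ID' line, blank lines, and any
        # other line that is not a 'GPU <id> ...' data row.
        if len(tokens) < 2 or tokens[0] != "GPU":
            continue
        values = tokens[2:]
        fields = {
            # N/A becomes None and is never coerced to 0.
            name: (None if value == "N/A" else value)
            for name, value in zip(field_names, values)
        }
        records.append(SketchRecord(gpu_id=tokens[1], fields=fields))
    return records


block = (
    "#Entity   GPUTL  POWER  GTEMP  MTEMP\n"
    "ID\n"
    "GPU 0     72     155.3  65     N/A\n"
)
records = parse_rows(block, ["GPUTL", "POWER", "GTEMP", "MTEMP"])
```

Keeping the parser a pure text-to-records function, as the module does, means it can be tested against captured output without a GPU or a dcgmi installation.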

kempnerpulse.reader.dcgmi.run_dmon_once(gpu_ids=None, interval_ms=100, timeout=15.0)

Run one dcgmi dmon -c 2 collection and return its raw stdout text.

Two samples are requested because profiling fields (1001-1010) return N/A on the first sample of a cold invocation; the valid second tick lets a downstream last-non-None-wins merge recover the real values.

Parameters:
  • gpu_ids – defaults to None

  • interval_ms – defaults to 100

  • timeout – defaults to 15.0

Return type:

str
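A two-sample collection of this kind could be launched roughly as shown below. This is a sketch, not the module's code: build_dmon_argv, run_once_sketch, and the field_ids argument are hypothetical, and the example field IDs are simply taken from the profiling range mentioned above. It uses the dcgmi dmon options -e (field IDs), -d (delay in milliseconds), -c (sample count), and -i (GPU IDs). Raising on a non-zero exit is this sketch's choice; the source does not say how run_dmon_once handles failures.

```python
import subprocess
from typing import List, Optional, Sequence


def build_dmon_argv(
    field_ids: Sequence[int],
    gpu_ids: Optional[Sequence[str]] = None,
    interval_ms: int = 100,
    count: int = 2,
) -> List[str]:
    """Hypothetical helper: assemble a `dcgmi dmon` command line."""
    argv = [
        "dcgmi", "dmon",
        "-e", ",".join(str(f) for f in field_ids),  # fields to sample
        "-d", str(interval_ms),                     # delay between samples
        "-c", str(count),                           # two samples by default
    ]
    if gpu_ids:
        argv += ["-i", ",".join(gpu_ids)]           # restrict to these GPUs
    return argv


def run_once_sketch(
    field_ids: Sequence[int],
    gpu_ids: Optional[Sequence[str]] = None,
    interval_ms: int = 100,
    timeout: float = 15.0,
) -> str:
    """Run the command and return raw stdout (requires dcgmi on PATH)."""
    completed = subprocess.run(
        build_dmon_argv(field_ids, gpu_ids, interval_ms),
        capture_output=True,
        text=True,
        timeout=timeout,  # seconds; raises subprocess.TimeoutExpired
        check=True,       # sketch's choice: raise on a non-zero exit
    )
    return completed.stdout


argv = build_dmon_argv([1001, 1002], gpu_ids=["0", "1"])
```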

kempnerpulse.reader.dcgmi.read_once(gpu_ids=None, interval_ms=100, timeout=15.0)

Run one dcgmi dmon -c 2 collection and return its rows as RawRecord objects.

Returns the records from both requested samples in order; the caller’s last-non-None-wins merge keeps the valid second-tick values.

Parameters:
  • gpu_ids – defaults to None

  • interval_ms – defaults to 100

  • timeout – defaults to 15.0

Return type:

List[RawRecord]
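The caller-side last-non-None-wins merge mentioned above is not part of this module, but its intent can be sketched as follows. The function name merge_last_non_none and the (gpu_id, fields) tuple shape are assumptions made for the example, and the field labels are placeholders.

```python
from typing import Dict, Iterable, Optional, Tuple

Sample = Tuple[str, Dict[str, Optional[str]]]  # (gpu_id, fields)


def merge_last_non_none(
    samples: Iterable[Sample],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Merge samples in order; a later None never overwrites a real value."""
    merged: Dict[str, Dict[str, Optional[str]]] = {}
    for gpu_id, fields in samples:
        slot = merged.setdefault(gpu_id, {})
        for name, value in fields.items():
            # Take the value if it is real, or if nothing is stored yet.
            if value is not None or name not in slot:
                slot[name] = value
    return merged


# First sample: the profiling field is N/A (None) on a cold invocation.
# Second sample: the same field now has a valid reading.
two_samples = [
    ("0", {"GPUTL": "72", "SMACT": None}),
    ("0", {"GPUTL": "70", "SMACT": "0.41"}),
]
merged = merge_last_non_none(two_samples)
```

Because read_once returns both samples in order, a merge like this ends up with the second-tick profiling values while still preserving any field that was only valid in the first sample.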

class kempnerpulse.reader.dcgmi.DcgmiBackend

Bases: object

Persistent dcgmi dmon -c 0 stream emitting RawRecord objects.

The reader is thread-free: stream_ticks is a blocking generator that iterates the subprocess stdout, groups rows into ticks, and yields one list of records per tick. stream flattens that to the per-record Backend contract. The first tick is dropped (profiling fields are N/A on a cold start). Concurrency, if needed, is the caller’s responsibility.

__init__()
Return type:

None

open(config)
Parameters:

config (ReaderConfig)

Return type:

None

close()

Terminate the subprocess. Safe to call after a failed open.

Return type:

None

stream_ticks()

Yield one list of RawRecord objects per sampling tick.

A tick ends when a GPU <id> row repeats an id already buffered for the current tick. The first tick is dropped. Raises DcgmStreamError if the subprocess exits non-zero (unless close was called).

Return type:

Iterator[List[RawRecord]]
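The tick-boundary rule (a repeated GPU id closes the current tick, and the first tick is discarded) can be sketched as a plain generator. This is an illustration under assumptions rather than the actual stream_ticks: group_ticks and the (gpu_id, payload) row shape are hypothetical, and flushing the trailing tick at end of input is this sketch's choice, since the source does not describe end-of-stream behaviour.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple[str, str]  # (gpu_id, payload); a stand-in for a parsed record


def group_ticks(rows: Iterable[Row], drop_first: bool = True) -> Iterator[List[Row]]:
    """Group rows into ticks; a tick ends when a gpu_id repeats."""
    buffered: List[Row] = []
    seen: Set[str] = set()
    is_first_tick = True
    for gpu_id, payload in rows:
        if gpu_id in seen:
            # This id is already buffered, so the current tick is complete.
            if not (is_first_tick and drop_first):
                yield buffered
            is_first_tick = False
            buffered, seen = [], set()
        buffered.append((gpu_id, payload))
        seen.add(gpu_id)
    # Assumption: flush whatever is buffered when the input ends.
    if buffered and not (is_first_tick and drop_first):
        yield buffered


rows = [("0", "a"), ("1", "b"), ("0", "c"), ("1", "d"), ("0", "e"), ("1", "f")]
ticks = list(group_ticks(rows))
```

Detecting the boundary from a repeated id means the grouping needs no knowledge of how many GPUs are being sampled, which suits a blocking generator that simply iterates the subprocess output line by line.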

stream()

Yield RawRecord objects one at a time by flattening the per-tick lists from stream_ticks.

Return type:

Iterator[RawRecord]

property stderr

The subprocess stderr stream (for a consumer that drains it).

property caps: BackendCaps