kempnerpulse.reader.dcgmi
Layer 1 (Read) — direct DCGM backend (dcgmi dmon).
Streams hardware counters from a local dcgmi dmon subprocess and emits one
RawRecord per GPU per sampling tick, keyed by DCGM field names (the source’s
own vocabulary). An N/A reading becomes None; it is never coerced to
0. Assigning meaning to these fields (canonical names, units, scaling) is
Layer 2’s job, not this module’s.
Entry points:
- DcgmiBackend: a persistent dcgmi dmon -c 0 stream for live monitoring.
- read_once: a single synchronous dcgmi dmon -c 2 collection, for one-shot queries where a long-lived stream is unnecessary.
- parse_dmon_block: pure text -> RawRecord parser (the testable core).
- resolve_dcgm_gpu_ids: map process-visible GPUs to physical dcgmi IDs.
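Functions
- parse_dmon_block: Parse dcgmi dmon rows into RawRecord objects, one per GPU <id> row.
- read_once: Collect a single tick from dcgmi dmon as RawRecord objects.
- resolve_dcgm_gpu_ids: Resolve the physical GPU IDs visible to this process via dcgmi discovery.
- run_dmon_once: Run one dcgmi dmon -c 2 collection and return its raw stdout text.
Classes
- DcgmiBackend: Persistent dcgmi dmon -c 0 stream emitting RawRecord objects.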
- kempnerpulse.reader.dcgmi.resolve_dcgm_gpu_ids(discovery_stdout)
Resolve the physical GPU IDs visible to this process via dcgmi discovery.
Inside a SLURM cgroup, CUDA_VISIBLE_DEVICES is remapped to 0, but dcgmi operates outside the cgroup and uses physical GPU indices. The two are reconciled by matching on GPU UUID.
- Parameters:
discovery_stdout (str) – stdout from dcgmi discovery -l.
- Returns:
(physical_ids, physical_to_local_map). physical_to_local_map maps each physical GPU ID to its local (cgroup) GPU ID, so that exports can match dcgmi GPU IDs against nvidia-smi process IDs.
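The UUID reconciliation described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper name match_by_uuid and both of its inputs are hypothetical. It assumes the discovery output has already been parsed into a physical-ID-to-UUID mapping, and that the UUIDs visible inside the cgroup are available in local index order.

```python
from typing import Dict, List, Tuple


def match_by_uuid(
    physical_uuids: Dict[int, str],
    local_uuids: List[str],
) -> Tuple[List[int], Dict[int, int]]:
    """Match physical GPU IDs to local (cgroup) indices by UUID.

    physical_uuids: physical GPU ID -> UUID (assumed pre-parsed from discovery output).
    local_uuids: UUIDs visible to this process, in local index order (assumed input).
    """
    # Local index of each visible UUID: position 0 is local GPU 0, and so on.
    local_index = {uuid: idx for idx, uuid in enumerate(local_uuids)}
    # Keep only physical GPUs whose UUID is visible to this process.
    physical_to_local = {
        phys_id: local_index[uuid]
        for phys_id, uuid in physical_uuids.items()
        if uuid in local_index
    }
    return sorted(physical_to_local), physical_to_local


# Illustrative data: a 4-GPU node where the job was allocated GPUs 2 and 3.
physical = {0: "GPU-aaa", 1: "GPU-bbb", 2: "GPU-ccc", 3: "GPU-ddd"}
visible = ["GPU-ccc", "GPU-ddd"]
ids, mapping = match_by_uuid(physical, visible)
```

Here ids lists the physical IDs to pass to dcgmi, and mapping translates them back to the indices the process itself sees.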
- kempnerpulse.reader.dcgmi.parse_dmon_block(text, *, source_version='dcgmi', timestamp=None, wallclock=None)
Parse dcgmi dmon rows into RawRecord objects, one per GPU <id> row.
Each record’s fields map DCGM field names to raw values (None for N/A). Header (#), ID, and non-data lines are skipped. Every record produced from one call shares the same timestamp/wallclock (they belong to one sampling tick).
Layout (columns follow DCGM_DMON_FIELDS order):
    #Entity   GPUTL  POWER  GTEMP  MTEMP ...
    ID
    GPU 0     72     155.3  65     58    ...
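The row-parsing rules above (skip header, ID, and non-data lines; map N/A to None) can be sketched as below. This is a simplified stand-in, not parse_dmon_block itself: parse_rows and parse_value are hypothetical names, records are plain dicts rather than RawRecord objects, and every non-N/A value is assumed to be numeric.

```python
from typing import Dict, List, Optional


def parse_value(token: str) -> Optional[float]:
    """Return None for an N/A reading; never coerce it to 0."""
    if token == "N/A":
        return None
    return float(token)  # assumption: all other readings are numeric


def parse_rows(text: str, field_names: List[str]) -> List[Dict[str, object]]:
    """Parse 'GPU <id> ...' rows; header, ID, and other lines are skipped."""
    records: List[Dict[str, object]] = []
    for line in text.splitlines():
        parts = line.split()
        # A data row starts with the literal 'GPU' followed by an integer id.
        if len(parts) < 2 or parts[0] != "GPU" or not parts[1].isdigit():
            continue
        values = [parse_value(tok) for tok in parts[2:]]
        records.append(
            {"gpu_id": int(parts[1]), "fields": dict(zip(field_names, values))}
        )
    return records


sample = """#Entity   GPUTL  POWER  GTEMP  MTEMP
ID
GPU 0     72     155.3  65     N/A
GPU 1     N/A    60.1   40     38
"""
rows = parse_rows(sample, ["GPUTL", "POWER", "GTEMP", "MTEMP"])
```

Keeping None distinct from 0 lets later layers tell "no reading" apart from a genuine zero, for example an idle GPU at 0% utilization.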
- kempnerpulse.reader.dcgmi.run_dmon_once(gpu_ids=None, interval_ms=100, timeout=15.0)
Run one dcgmi dmon -c 2 collection and return its raw stdout text.
Two samples are requested because profiling fields (1001-1010) return N/A on the first sample of a cold invocation; the valid second tick lets a downstream last-non-None-wins merge recover the real values.
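As a rough sketch of what a two-sample invocation might look like, the helper below only builds the command line. The name build_dmon_argv and the example field IDs are assumptions; the flags used are the standard dcgmi dmon options (-e field IDs, -d delay in milliseconds, -c sample count, -i GPU IDs), and the module's actual argument handling may differ.

```python
from typing import List, Optional, Sequence


def build_dmon_argv(
    field_ids: Sequence[int],
    gpu_ids: Optional[Sequence[int]] = None,
    interval_ms: int = 100,
    count: int = 2,
) -> List[str]:
    """Build a 'dcgmi dmon' command line (hypothetical helper)."""
    argv = [
        "dcgmi", "dmon",
        "-e", ",".join(str(f) for f in field_ids),  # fields to sample
        "-d", str(interval_ms),                     # delay between samples, ms
        "-c", str(count),                           # number of samples
    ]
    if gpu_ids:
        # Restrict sampling to specific physical GPU IDs.
        argv += ["-i", ",".join(str(g) for g in gpu_ids)]
    return argv


# Field IDs here are illustrative: 203 (GPU utilization), 1001 (a profiling field).
argv = build_dmon_argv([203, 1001], gpu_ids=[0, 1])
```

The resulting list could then be handed to subprocess.run(argv, capture_output=True, text=True, timeout=15.0) on a host where dcgmi is installed.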
- kempnerpulse.reader.dcgmi.read_once(gpu_ids=None, interval_ms=100, timeout=15.0)
Collect a single tick from dcgmi dmon as RawRecord objects.
Returns the records from both requested samples in order; the caller’s last-non-None-wins merge keeps the valid second-tick values.
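A caller-side last-non-None-wins merge could look like the sketch below. It is an illustration only: merge_last_non_none is a hypothetical name, samples are modeled as (gpu_id, fields) tuples rather than RawRecord objects, and the field names are placeholders.

```python
from typing import Dict, Iterable, Optional, Tuple

Sample = Tuple[int, Dict[str, Optional[float]]]


def merge_last_non_none(
    samples: Iterable[Sample],
) -> Dict[int, Dict[str, Optional[float]]]:
    """Merge samples per GPU; a later non-None value overrides an earlier one."""
    merged: Dict[int, Dict[str, Optional[float]]] = {}
    for gpu_id, fields in samples:
        target = merged.setdefault(gpu_id, {})
        for name, value in fields.items():
            # A None reading only fills a slot that has no value yet,
            # so it can never overwrite a real measurement.
            if value is not None or name not in target:
                target[name] = value
    return merged


# First sample: profiling field is N/A (None); second sample carries the value.
samples = [
    (0, {"GPUTL": 70.0, "GRACT": None}),
    (0, {"GPUTL": 72.0, "GRACT": 0.91}),
]
merged = merge_last_non_none(samples)
```

A field that is None in every sample stays None in the result, which preserves the rule that N/A is never coerced to 0.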
- class kempnerpulse.reader.dcgmi.DcgmiBackend
Bases: object
Persistent dcgmi dmon -c 0 stream emitting RawRecord objects.
The reader is thread-free: stream_ticks is a blocking generator that iterates the subprocess stdout, groups rows into ticks, and yields one list of records per tick. stream flattens that to the per-record Backend contract. The first tick is dropped (profiling fields are N/A on a cold start). Concurrency, if needed, is the caller’s responsibility.
- open(config)
- Parameters:
config (ReaderConfig)
- Return type:
None
- stream_ticks()
Yield one list of RawRecord objects per sampling tick.
A tick ends when a GPU <id> row repeats an id already buffered for the current tick. The first tick is dropped. Raises DcgmStreamError if the subprocess exits non-zero (unless close was called).
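The tick-boundary rule can be sketched with a small generator. This is an assumption-laden illustration rather than the class's code: group_ticks is a hypothetical name, rows are modeled as (gpu_id, payload) tuples, and subprocess handling and error reporting are left out.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple[int, str]


def group_ticks(rows: Iterable[Row], drop_first: bool = True) -> Iterator[List[Row]]:
    """Group rows into ticks; a repeated GPU id closes the current tick."""
    tick: List[Row] = []
    seen: Set[int] = set()
    skip_next = drop_first
    for gpu_id, payload in rows:
        if gpu_id in seen:
            # This id is already buffered, so the current tick is complete.
            if skip_next:
                skip_next = False  # discard the cold-start tick
            else:
                yield tick
            tick, seen = [], set()
        tick.append((gpu_id, payload))
        seen.add(gpu_id)
    # A trailing partial tick is not yielded: nothing has closed it yet.


stream = [(0, "a"), (1, "b"), (0, "c"), (1, "d"), (0, "e"), (1, "f"), (0, "g")]
ticks = list(group_ticks(stream))
```

Because a tick is closed only by the arrival of the next one, each yielded tick lags the stream by one row; the benefit is that the generator needs no prior knowledge of how many GPUs are being sampled.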
- property stderr
The subprocess stderr stream (for a consumer that drains it).
- property caps: BackendCaps