kempnerpulse.identification
Cross-cutting tier — startup-once device identity and capability.
Resolves, exactly once at startup, the static facts about the host and its GPUs that the rest of the pipeline treats as constants for the process lifetime: hostname, per-GPU UUID / model, power and bandwidth limits, the PCI bus-id -> index map, the set of accessible GPUs, the dcgmi physical-id mapping, and the SLURM/MPI job metadata.
Every external-command query here is best-effort: a missing command,
permission error, timeout, or non-zero exit degrades to an empty result and is
never raised (in contrast with the reader layer, which raises typed errors). The
top-level identify() likewise never raises — it degrades to empty maps so
the lifecycle always receives a usable Identity.
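The best-effort contract described above can be sketched as a small wrapper around `subprocess.run`. This is only an illustration under stated assumptions: the helper name `run_best_effort` and its default timeout are hypothetical and are not taken from the module.

```python
import subprocess
import sys


def run_best_effort(cmd, timeout=10.0):
    """Return the command's stdout, or "" on any failure (hypothetical helper)."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # a non-zero exit is handled below, not raised
        )
    except (OSError, subprocess.SubprocessError):
        # OSError covers a missing command and permission errors;
        # SubprocessError covers subprocess.TimeoutExpired.
        return ""
    if proc.returncode != 0:
        return ""
    return proc.stdout


# A missing command and a failing command both degrade to an empty result.
missing = run_best_effort(["kempnerpulse-no-such-command-xyz"])
failing = run_best_effort([sys.executable, "-c", "import sys; sys.exit(3)"])
working = run_best_effort([sys.executable, "-c", "print('ok')"])
```

Callers can then parse the returned text unconditionally, since an empty string simply yields an empty map.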
The slurm_metadata produced by gather_slurm_metadata() uses the
canonical-record metadata keys, so it can be passed straight into the translate
layer’s SourceContext.
Runtime dependencies are the standard library only.
Functions
gather_slurm_metadata() | Collect SLURM/MPI job metadata from the environment (best-effort).
identify(config) | Resolve the full startup Identity for the configured backend.
query_accessible_gpus() | Set of GPU index strings the current process can access (best-effort).
query_bus_id_mapping() | Map uppercased PCI bus id -> GPU index via nvidia-smi (best-effort).
query_gpu_models() | Map GPU index -> model name via nvidia-smi -L (best-effort).
query_gpu_uuids() | Map GPU index -> UUID via nvidia-smi (best-effort).
query_nvlink_bandwidth() | Map GPU index -> aggregate NVLink bandwidth in GB/s (best-effort).
query_pcie_bandwidth() | Map GPU index -> max bidirectional PCIe bandwidth (bytes/s) (best-effort).
query_power_limits() | Map GPU index -> max power limit in watts via nvidia-smi (best-effort).
Classes
Identity | Static host/GPU identity and capability, resolved once at startup.
- class kempnerpulse.identification.Identity
Bases: object
Static host/GPU identity and capability, resolved once at startup.
Per-GPU maps are provided both string-keyed (matching nvidia-smi / dcgmi id strings, for downstream lookups) and int-keyed (convenient for the translate layer’s SourceContext). For the dcgm backend, the string-keyed capability maps are re-keyed onto the physical dcgmi indices so per-GPU lookups line up with the ids the reader emits.
- __init__(hostname, gpu_uuid_by_index=<factory>, gpu_model_by_index=<factory>, power_limit_watts_by_index=<factory>, pcie_bw_limit_bytes_per_second_by_index=<factory>, pcie_info='', nvlink_bw_limit_gbps_by_index=<factory>, bus_id_to_index=<factory>, accessible_gpu_ids=None, dcgm_physical_gpu_ids=None, dcgm_phys_to_local=<factory>, slurm_metadata=<factory>, gpu_uuid_by_id=<factory>, gpu_model_by_id=<factory>, power_limit_watts_by_id=<factory>, pcie_bw_limit_bytes_per_second_by_id=<factory>, nvlink_bw_limit_gbps_by_id=<factory>)
- kempnerpulse.identification.query_gpu_uuids()
Map GPU index -> UUID via nvidia-smi (best-effort).
- kempnerpulse.identification.query_gpu_models()
Map GPU index -> model name via nvidia-smi -L (best-effort).
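As a rough illustration of the parsing involved, the sketch below extracts index -> model pairs from text shaped like `nvidia-smi -L` output. The function name `parse_gpu_models`, the regular expression, the sample UUIDs, and the choice of string keys are assumptions made for this example; the module's actual parser is not shown here.

```python
import re

# Matches lines such as: "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: [^)]+\)$")


def parse_gpu_models(text):
    """Return {index string: model name}; unrecognized lines are skipped."""
    models = {}
    for raw in text.splitlines():
        match = _GPU_LINE.match(raw.strip())
        if match:
            models[match.group(1)] = match.group(2)
    return models


# Made-up sample text in the shape nvidia-smi -L prints.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000000)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-bbbbbbbb-0000-0000-0000-000000000000)
not a gpu line
"""

models = parse_gpu_models(SAMPLE)
```

Feeding the parser an empty string yields an empty dict, which matches the best-effort behavior when the command produces no output.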
- kempnerpulse.identification.query_power_limits()
Map GPU index -> max power limit in watts via nvidia-smi (best-effort).
- kempnerpulse.identification.query_pcie_bandwidth()
Map GPU index -> max bidirectional PCIe bandwidth (bytes/s) (best-effort).
Returns (limits, info_string) where info_string summarizes the fastest link, e.g. "Gen5 x16 63.0 GB/s bidir". Bandwidth is lane_rate(gen) * width * 2 (the *2 accounts for full-duplex).
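The bandwidth formula can be written out as a short sketch. The per-generation lane rates used by the module are not listed on this page, so the rate is passed in by the caller here, and the 2e9 bytes/s value in the usage lines is a made-up round number chosen only to keep the arithmetic obvious. The helper names and the assumption that "GB" means 1e9 bytes are likewise hypothetical.

```python
def pcie_bidir_bandwidth(lane_rate_bytes_per_s, width):
    """lane_rate * width * 2, where the factor 2 accounts for full-duplex."""
    return lane_rate_bytes_per_s * width * 2


def format_pcie_info(gen, width, bandwidth_bytes_per_s):
    """Build a summary string in the style of the documented example."""
    return f"Gen{gen} x{width} {bandwidth_bytes_per_s / 1e9:.1f} GB/s bidir"


# Hypothetical per-lane rate (NOT a real PCIe figure), 16 lanes.
bandwidth = pcie_bidir_bandwidth(2e9, 16)
info = format_pcie_info(5, 16, bandwidth)
```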
- kempnerpulse.identification.query_nvlink_bandwidth()
Map GPU index -> aggregate NVLink bandwidth in GB/s (best-effort).
Parses nvidia-smi nvlink -s and sums per-link speeds, doubling the total (each link is full-duplex and DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counts TX+RX). Returns an empty dict if NVLink is unavailable.
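The sum-then-double step described above might look like the following sketch. It assumes output in which each GPU header line is followed by `Link N: <speed> GB/s` lines, with inactive links reported without a numeric speed; the function name, regular expressions, and sample text are illustrative and not taken from the module.

```python
import re

_GPU_RE = re.compile(r"^GPU (\d+):")
_LINK_RE = re.compile(r"^Link \d+: (\d+(?:\.\d+)?) GB/s")


def parse_nvlink_bandwidth(text):
    """Return {index string: 2 * sum of per-link GB/s} for GPUs with active links."""
    totals = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        gpu = _GPU_RE.match(line)
        if gpu:
            current = gpu.group(1)
            continue
        link = _LINK_RE.match(line)
        if link and current is not None:
            totals[current] = totals.get(current, 0.0) + float(link.group(1))
    # Double the per-direction sum: each link is full-duplex (TX + RX).
    return {index: total * 2 for index, total in totals.items()}


# Made-up sample: GPU 0 has two active links, GPU 1 has none.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000000)
\t Link 0: 25 GB/s
\t Link 1: 25 GB/s
\t Link 2: <inactive>
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-bbbbbbbb-0000-0000-0000-000000000000)
\t Link 0: <inactive>
"""

nvlink = parse_nvlink_bandwidth(SAMPLE)
```

In this sketch a GPU with no active links is simply absent from the result, and empty input yields an empty dict.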
- kempnerpulse.identification.query_bus_id_mapping()
Map uppercased PCI bus id -> GPU index via nvidia-smi (best-effort).
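A minimal sketch of building such a map is shown below. It assumes the input is CSV text of the form `index, bus_id`, as produced by `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`; whether the module issues exactly this query is an assumption, and the function name is hypothetical.

```python
def parse_bus_id_mapping(text):
    """Return {UPPERCASED bus id: index string}; malformed rows are skipped."""
    mapping = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1]:
            continue
        # Uppercasing makes lookups insensitive to the hex-digit case
        # used by whichever tool reported the bus id.
        mapping[parts[1].upper()] = parts[0]
    return mapping


# Made-up sample rows, including one malformed line.
SAMPLE = "0, 00000000:0f:00.0\n1, 00000000:8A:00.0\ngarbage\n"
bus_map = parse_bus_id_mapping(SAMPLE)
```

A consumer holding a bus id from another source would uppercase it the same way before looking it up.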
- kempnerpulse.identification.query_accessible_gpus()
Set of GPU index strings the current process can access (best-effort).
nvidia-smi respects cgroup/container restrictions, so this reflects the GPUs actually reachable by the process. Returns None if nvidia-smi is unavailable (the caller then applies no accessibility filtering).
- kempnerpulse.identification.gather_slurm_metadata()
Collect SLURM/MPI job metadata from the environment (best-effort).
Keys are the canonical-record metadata names (record_slurm_job_id, …) so the result drops straight into the translate layer’s SourceContext. Job / step / array ids are kept as strings; restart count, node index, and MPI rank are coerced to int. Any variable that is unset or malformed is omitted.
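The coercion and omission rules can be illustrated with a sketch that reads from an explicit mapping rather than `os.environ`. Only the key `record_slurm_job_id` comes from the text above; the `hypothetical_*` keys, the function name, and the choice of `SLURM_PROCID` as the rank variable are placeholders for this example and are not the module's real names.

```python
# (environment variable, output key) pairs whose values are coerced to int.
# The output keys are HYPOTHETICAL placeholders, not canonical-record names.
_INT_FIELDS = (
    ("SLURM_RESTART_COUNT", "hypothetical_restart_count"),
    ("SLURM_NODEID", "hypothetical_node_index"),
    ("SLURM_PROCID", "hypothetical_mpi_rank"),
)


def gather_metadata_sketch(env):
    """Collect job metadata from a mapping, omitting unset or malformed values."""
    metadata = {}
    job_id = env.get("SLURM_JOB_ID")
    if job_id:
        metadata["record_slurm_job_id"] = job_id  # ids stay strings
    for variable, key in _INT_FIELDS:
        try:
            metadata[key] = int(env.get(variable))
        except (TypeError, ValueError):
            # TypeError: variable unset (None); ValueError: not an integer.
            continue
    return metadata


sample_env = {
    "SLURM_JOB_ID": "12345",
    "SLURM_RESTART_COUNT": "2",
    "SLURM_NODEID": "oops",  # malformed, so it is omitted
}
metadata = gather_metadata_sketch(sample_env)
```

Passing `os.environ` instead of `sample_env` would read the live job environment.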
- kempnerpulse.identification.identify(config)
Resolve the full startup Identity for the configured backend.
For the dcgm backend: probe dcgmi discovery, resolve the physical-id mapping, take the accessible set as those physical ids (falling back to nvidia-smi), and re-key the per-GPU capability maps from local cgroup ids onto physical dcgmi ids. For prometheus: the accessible set comes from nvidia-smi and the dcgmi mapping is left unset (the prometheus bus-id bridge lives elsewhere).
Never raises — any failed query degrades to an empty result.
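The re-keying step for the dcgm backend can be sketched as a dictionary comprehension. The direction of the mapping (physical dcgmi id -> local id) is inferred from the field name `dcgm_phys_to_local`, and the helper name and sample values are hypothetical.

```python
def rekey_to_physical(by_local, phys_to_local):
    """Re-key a per-GPU map from local ids onto physical dcgmi ids.

    Physical ids whose local id has no entry are dropped, so an
    incomplete capability map degrades quietly instead of raising.
    """
    return {
        phys: by_local[local]
        for phys, local in phys_to_local.items()
        if local in by_local
    }


# Made-up example: the process sees local GPUs "0" and "1", which dcgmi
# reports as physical GPUs "4" and "5"; physical "6" has no local data.
power_by_local = {"0": 400.0, "1": 400.0}
phys_to_local = {"4": "0", "5": "1", "6": "2"}
power_by_physical = rekey_to_physical(power_by_local, phys_to_local)
```

With an empty mapping the result is an empty dict, consistent with the never-raises contract.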