kempnerpulse.identification

Cross-cutting tier — startup-once device identity and capability.

Resolves, exactly once at startup, the static facts about the host and its GPUs that the rest of the pipeline treats as constants for the process lifetime: hostname, per-GPU UUID / model, power and bandwidth limits, the PCI bus-id -> index map, the set of accessible GPUs, the dcgmi physical-id mapping, and the SLURM/MPI job metadata.

Every external-command query here is best-effort: a missing command, permission error, timeout, or non-zero exit degrades to an empty result and is never raised (in contrast with the reader layer, which raises typed errors). The top-level identify() likewise never raises — it degrades to empty maps so the lifecycle always receives a usable Identity.
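The best-effort contract described above can be sketched as a thin wrapper around `subprocess.run`. This is a minimal illustration, not the module's own code: the helper name `run_best_effort` and the 5-second default timeout are assumptions introduced here.

```python
import subprocess
from typing import List


def run_best_effort(argv: List[str], timeout: float = 5.0) -> str:
    """Return the command's stdout, or "" on any failure.

    Hypothetical helper: the name and default timeout are assumptions.
    """
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            errors="replace",  # undecodable bytes never raise
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        # OSError covers a missing binary and permission errors;
        # SubprocessError covers TimeoutExpired.
        return ""
    if proc.returncode != 0:
        # A non-zero exit degrades to an empty result as well.
        return ""
    return proc.stdout
```

A caller parsing the returned text then only has to handle one failure shape, the empty string, which naturally yields an empty map.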

The slurm_metadata produced by gather_slurm_metadata() uses the canonical-record metadata keys, so it can be passed straight into the translate layer’s SourceContext.

Runtime dependencies are the standard library only.

Functions

`gather_slurm_metadata`()	Collect SLURM/MPI job metadata from the environment (best-effort).
`identify`(config)	Resolve the full startup `Identity` for the configured backend.
`query_accessible_gpus`()	Set of GPU index strings the current process can access (best-effort).
`query_bus_id_mapping`()	Map uppercased PCI bus id -> GPU index via nvidia-smi (best-effort).
`query_gpu_models`()	Map GPU index -> model name via `nvidia-smi -L` (best-effort).
`query_gpu_uuids`()	Map GPU index -> UUID via nvidia-smi (best-effort).
`query_nvlink_bandwidth`()	Map GPU index -> aggregate NVLink bandwidth in GB/s (best-effort).
`query_pcie_bandwidth`()	Map GPU index -> max bidirectional PCIe bandwidth (bytes/s) (best-effort).
`query_power_limits`()	Map GPU index -> max power limit in watts via nvidia-smi (best-effort).

Classes

`Identity`	Static host/GPU identity and capability, resolved once at startup.

class kempnerpulse.identification.Identity

Bases: object

Static host/GPU identity and capability, resolved once at startup.

Per-GPU maps are provided both string-keyed (matching nvidia-smi / dcgmi id strings, for downstream lookups) and int-keyed (convenient for the translate layer’s SourceContext). For the dcgm backend, the string-keyed capability maps are re-keyed onto the physical dcgmi indices so per-GPU lookups line up with the ids the reader emits.

hostname: str

gpu_uuid_by_index: Dict[int, str]

gpu_model_by_index: Dict[int, str]

power_limit_watts_by_index: Dict[int, float]

pcie_bw_limit_bytes_per_second_by_index: Dict[int, float]

pcie_info: str = ''

nvlink_bw_limit_gbps_by_index: Dict[int, float]

bus_id_to_index: Dict[str, str]

accessible_gpu_ids: Set[str] | None = None

dcgm_physical_gpu_ids: List[str] | None = None

dcgm_phys_to_local: Dict[str, str]

slurm_metadata: Dict[str, object]

gpu_uuid_by_id: Dict[str, str]

gpu_model_by_id: Dict[str, str]

power_limit_watts_by_id: Dict[str, float]

pcie_bw_limit_bytes_per_second_by_id: Dict[str, float]

nvlink_bw_limit_gbps_by_id: Dict[str, float]

__init__(hostname, gpu_uuid_by_index=<factory>, gpu_model_by_index=<factory>, power_limit_watts_by_index=<factory>, pcie_bw_limit_bytes_per_second_by_index=<factory>, pcie_info='', nvlink_bw_limit_gbps_by_index=<factory>, bus_id_to_index=<factory>, accessible_gpu_ids=None, dcgm_physical_gpu_ids=None, dcgm_phys_to_local=<factory>, slurm_metadata=<factory>, gpu_uuid_by_id=<factory>, gpu_model_by_id=<factory>, power_limit_watts_by_id=<factory>, pcie_bw_limit_bytes_per_second_by_id=<factory>, nvlink_bw_limit_gbps_by_id=<factory>)

Parameters:

hostname (str)
gpu_uuid_by_index (Dict[int, str])
gpu_model_by_index (Dict[int, str])
power_limit_watts_by_index (Dict[int, float])
pcie_bw_limit_bytes_per_second_by_index (Dict[int, float])
pcie_info (str)
nvlink_bw_limit_gbps_by_index (Dict[int, float])
bus_id_to_index (Dict[str, str])
accessible_gpu_ids (Set[str] | None)
dcgm_physical_gpu_ids (List[str] | None)
dcgm_phys_to_local (Dict[str, str])
slurm_metadata (Dict[str, object])
gpu_uuid_by_id (Dict[str, str])
gpu_model_by_id (Dict[str, str])
power_limit_watts_by_id (Dict[str, float])
pcie_bw_limit_bytes_per_second_by_id (Dict[str, float])
nvlink_bw_limit_gbps_by_id (Dict[str, float])

Return type:

None

kempnerpulse.identification.query_gpu_uuids()

Map GPU index -> UUID via nvidia-smi (best-effort).

Return type: Dict[str, str]

kempnerpulse.identification.query_gpu_models()

Map GPU index -> model name via nvidia-smi -L (best-effort).

Return type: Dict[str, str]

kempnerpulse.identification.query_power_limits()

Map GPU index -> max power limit in watts via nvidia-smi (best-effort).

Return type: Dict[str, float]

kempnerpulse.identification.query_pcie_bandwidth()

Map GPU index -> max bidirectional PCIe bandwidth (bytes/s) (best-effort).

Returns a tuple (limits, info_string), where limits maps each GPU index to its bandwidth in bytes/s and info_string summarizes the fastest link, e.g. "Gen5 x16 63.0 GB/s bidir". Bandwidth is computed as lane_rate(gen) * width * 2; the factor of 2 accounts for full-duplex operation.

Return type: Tuple[Dict[str, float], str]
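The formula lane_rate(gen) * width * 2 can be sketched as follows. The per-lane rates below are assumptions derived from the published PCIe signalling rates and line encodings; the module's own rate table is not shown in this reference and may differ, so results from this sketch need not match the module's info string. The function name `pcie_bidir_bytes_per_second` is likewise hypothetical.

```python
from typing import Dict

# Approximate usable per-lane, per-direction rates in bytes/s after line
# encoding (8b/10b for Gen1-2, 128b/130b for Gen3-5).
# ASSUMED values for illustration; not taken from kempnerpulse.
LANE_RATE_BYTES_PER_SECOND: Dict[int, float] = {
    1: 2.5e9 * 8 / 10 / 8,
    2: 5.0e9 * 8 / 10 / 8,
    3: 8.0e9 * 128 / 130 / 8,
    4: 16.0e9 * 128 / 130 / 8,
    5: 32.0e9 * 128 / 130 / 8,
}


def pcie_bidir_bytes_per_second(gen: int, width: int) -> float:
    """lane_rate(gen) * width * 2; unknown generations yield 0.0."""
    lane_rate = LANE_RATE_BYTES_PER_SECOND.get(gen, 0.0)
    # The factor of 2 counts both directions of the full-duplex link.
    return lane_rate * width * 2
```

Returning 0.0 for an unrecognized generation keeps the sketch in line with the best-effort style of the module, where an unknown value degrades rather than raises.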

kempnerpulse.identification.query_nvlink_bandwidth()

Map GPU index -> aggregate NVLink bandwidth in GB/s (best-effort).

Parses the output of nvidia-smi nvlink -s, sums the per-link speeds for each GPU, and doubles the total, because each link is full-duplex and DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counts TX+RX. Returns an empty dict if NVLink is unavailable.

Return type: Dict[str, float]
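The sum-and-double step can be sketched as a pure parsing function. The text layout assumed here (a `GPU <n>:` header line followed by indented `Link <k>: <speed> GB/s` lines) reflects typical `nvidia-smi nvlink -s` output, but the exact format can vary by driver version; the function name `parse_nvlink_status` is hypothetical.

```python
import re
from typing import Dict, Optional

_GPU_RE = re.compile(r"^GPU (\d+):")
_LINK_RE = re.compile(r"^\s*Link \d+:\s*(\d+(?:\.\d+)?)\s*GB/s")


def parse_nvlink_status(text: str) -> Dict[str, float]:
    """Sum per-link speeds per GPU, then double for TX+RX (sketch)."""
    totals: Dict[str, float] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        gpu = _GPU_RE.match(line)
        if gpu:
            current = gpu.group(1)
            continue
        link = _LINK_RE.match(line)
        if link and current is not None:
            totals[current] = totals.get(current, 0.0) + float(link.group(1))
    # Lines that do not report a speed (e.g. inactive links) are skipped,
    # so a GPU with no active links does not appear in the result.
    return {index: total * 2 for index, total in totals.items()}
```

Empty input produces an empty dict, which matches the documented behavior when NVLink is unavailable.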

kempnerpulse.identification.query_bus_id_mapping()

Map uppercased PCI bus id -> GPU index via nvidia-smi (best-effort).

Return type: Dict[str, str]
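As an illustration of the uppercased-key mapping, the sketch below parses two-column CSV text of the kind produced by `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader`. Whether the module uses exactly this invocation is an assumption, and `parse_bus_id_csv` is a hypothetical name.

```python
from typing import Dict


def parse_bus_id_csv(text: str) -> Dict[str, str]:
    """Build {UPPERCASED bus id: GPU index} from "index, bus_id" lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not parts[0] or not parts[1]:
            # Best-effort: silently skip anything that is not a clean pair.
            continue
        index, bus_id = parts
        mapping[bus_id.upper()] = index
    return mapping
```

Normalizing the case on the key means a lookup succeeds regardless of whether another tool reports the bus id in lowercase or uppercase hex, provided the caller uppercases its query too.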

kempnerpulse.identification.query_accessible_gpus()

Set of GPU index strings the current process can access (best-effort).

nvidia-smi respects cgroup/container restrictions, so this reflects the GPUs actually reachable by the process. Returns None if nvidia-smi is unavailable (the caller then applies no accessibility filtering).

Return type: Set[str] | None
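The distinction between None and a real set matters on the caller's side. The sketch below shows one way a caller might honor it; `filter_accessible` is a hypothetical helper, and treating an empty set as "nothing is accessible" is an assumption rather than documented behavior.

```python
from typing import Dict, Optional, Set, TypeVar

V = TypeVar("V")


def filter_accessible(
    per_gpu: Dict[str, V], accessible: Optional[Set[str]]
) -> Dict[str, V]:
    """Keep only accessible GPUs; None means "no filtering" (sketch)."""
    if accessible is None:
        # nvidia-smi was unavailable, so nothing is known: pass through.
        return dict(per_gpu)
    return {gpu: value for gpu, value in per_gpu.items() if gpu in accessible}
```

Checking `is None` rather than truthiness keeps an empty set from being mistaken for "unknown".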

kempnerpulse.identification.gather_slurm_metadata()

Collect SLURM/MPI job metadata from the environment (best-effort).

Keys are the canonical-record metadata names (record_slurm_job_id, …) so the result drops straight into the translate layer’s SourceContext. Job / step / array ids are kept as strings; restart count, node index, and MPI rank are coerced to int. Any variable that is unset or malformed is omitted.

Return type: Dict[str, object]
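The keep-as-string, coerce-to-int, and omit-if-malformed rules can be sketched as below. Only the key `record_slurm_job_id` appears in this reference; the other output keys are hypothetical placeholders, and the choice of environment variables (standard Slurm names) is an assumption about what the module reads.

```python
import os
from typing import Callable, Dict, Mapping, Optional, Tuple

# env var -> (output key, coercion).
# Only "record_slurm_job_id" is confirmed by the reference above; the
# "example_*" keys are HYPOTHETICAL placeholders for illustration.
_FIELDS: Dict[str, Tuple[str, Callable[[str], object]]] = {
    "SLURM_JOB_ID": ("record_slurm_job_id", str),
    "SLURM_RESTART_COUNT": ("example_restart_count", int),
    "SLURM_NODEID": ("example_node_index", int),
}


def gather_metadata_sketch(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, object]:
    """Collect job metadata, omitting unset or malformed variables."""
    env = os.environ if environ is None else environ
    out: Dict[str, object] = {}
    for var, (key, coerce) in _FIELDS.items():
        raw = env.get(var)
        if raw is None or not raw.strip():
            continue  # unset or blank: omit
        try:
            out[key] = coerce(raw.strip())
        except ValueError:
            continue  # malformed integer: omit rather than raise
    return out
```

Accepting an optional mapping instead of always reading `os.environ` makes the function easy to exercise in tests without mutating the process environment.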

kempnerpulse.identification.identify(config)

Resolve the full startup Identity for the configured backend.

For the dcgm backend, identify() probes dcgmi discovery, resolves the physical-id mapping, takes those physical ids as the accessible set (falling back to nvidia-smi), and re-keys the per-GPU capability maps from local cgroup ids onto physical dcgmi ids. For the prometheus backend, the accessible set comes from nvidia-smi and the dcgmi mapping is left unset (the prometheus bus-id bridge lives elsewhere).

Never raises: any failed query degrades to an empty result.

Parameters: config (Config)
Return type: Identity
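The re-keying step for the dcgm backend can be sketched as follows, assuming dcgm_phys_to_local maps each physical dcgmi id to the local (cgroup-visible) id, as its name suggests. The helper name `rekey_to_physical` is hypothetical, and dropping physical ids whose local id has no capability entry is an assumption made here to keep the sketch best-effort.

```python
from typing import Dict, TypeVar

V = TypeVar("V")


def rekey_to_physical(
    local_map: Dict[str, V], phys_to_local: Dict[str, str]
) -> Dict[str, V]:
    """Re-key a local-id capability map onto physical dcgmi ids (sketch)."""
    return {
        phys: local_map[local]
        for phys, local in phys_to_local.items()
        if local in local_map  # skip ids with no known capability
    }
```

For example, if a job sees its two GPUs locally as "0" and "1" while dcgmi reports them as "4" and "5", the power limits keyed by "0" and "1" end up keyed by "4" and "5", matching the ids the reader emits.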