kempnerpulse.translate.schema

The canonical schema — the inter-layer contract.

CanonicalRecord is the single internal vocabulary that Layers 3 (Compute) and 4 (Present) depend on. Layer 2 (Translate) is the only layer that knows about source vocabularies (DCGM field IDs, Prometheus names), units, and backend quirks; it emits CanonicalRecord objects and nothing above it ever sees a vendor identifier again.
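Layer 2's translation step can be pictured as a lookup from vendor field names to canonical names, plus a unit conversion. The sketch below is hypothetical: the table, `translate_dcgm_sample`, and the choice of which DCGM field feeds which canonical field are illustrative only. The source units noted in the comments (percent, watts, millijoules, MiB) are assumptions to check against the DCGM field reference.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical source-to-canonical table: DCGM field name -> (canonical field, unit conversion).
# Source units are assumptions; verify them against the DCGM field documentation.
_DCGM_TO_CANONICAL: Dict[str, Tuple[str, Callable[[float], float]]] = {
    "DCGM_FI_PROF_SM_ACTIVE": ("gpu_streaming_multiprocessor_active_cycle_fraction", lambda v: v),  # already 0..1
    "DCGM_FI_DEV_GPU_UTIL": ("gpu_nvml_busy_time_fraction", lambda v: v / 100.0),  # percent -> fraction
    "DCGM_FI_DEV_POWER_USAGE": ("gpu_board_power_draw_watts", lambda v: v),  # watts -> watts
    "DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION": ("gpu_board_total_energy_joules", lambda v: v / 1000.0),  # mJ -> J
    "DCGM_FI_DEV_FB_USED": ("gpu_framebuffer_used_mebibytes", lambda v: v),  # MiB -> MiB
}


def translate_dcgm_sample(sample: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Map one DCGM sample onto canonical field names, converting units on the way."""
    canonical: Dict[str, float] = {}
    for source_name, value in sample.items():
        entry = _DCGM_TO_CANONICAL.get(source_name)
        if entry is None or value is None:
            # Unmapped vendor field or missing reading: the canonical field keeps its None default.
            continue
        canonical_name, convert = entry
        canonical[canonical_name] = convert(float(value))
    return canonical
```

Only the returned canonical names travel upward; the vendor identifiers stay inside this function.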

Field-naming convention (every field follows <scope>_<subsystem>_<aspect>_<unit>):

  • record_* — record-level metadata; entity_* — GPU / MIG identity; gpu_* — per-GPU hardware readings.

  • Ratios are ..._fraction in [0.0, 1.0] — never _pct. A presenter that wants 0–100 multiplies by 100 itself.

  • Throughputs carry _bytes_per_second; cumulative counters use the bare unit (_joules); event counts use _count.

  • None means “the source did not provide this reading” — never coerced to 0.
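Two of these rules meet in the presenter: fractions are scaled to percent only at display time, and None must not be rendered as 0. A minimal sketch; `format_fraction_as_percent` and the "n/a" placeholder are hypothetical choices, not part of `kempnerpulse`.

```python
from typing import Optional


def format_fraction_as_percent(value: Optional[float], decimals: int = 1) -> str:
    """Render a canonical ..._fraction reading as a percentage string.

    The x100 scaling happens here in the presenter, never in the schema, and a
    None reading ("the source did not provide this") is shown as missing, not 0%.
    """
    if value is None:
        return "n/a"
    return f"{value * 100:.{decimals}f}%"
```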

The names are long and explicit on purpose: a reader of gpu_streaming_multiprocessor_active_cycle_fraction needs no glossary.
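The naming rules are mechanical enough to lint. A rough sketch: `naming_violations` is a hypothetical helper, not part of `kempnerpulse`, and it checks only three of the conventions listed above (scope prefix, no `_pct` ratios, throughputs ending in `_bytes_per_second`).

```python
from typing import List

ALLOWED_SCOPES = ("record_", "entity_", "gpu_")  # scopes named in the convention
FORBIDDEN_RATIO_SUFFIX = "_pct"                  # ratios must be ..._fraction instead
THROUGHPUT_SUFFIX = "_bytes_per_second"          # required unit for throughputs


def naming_violations(field_name: str) -> List[str]:
    """Return human-readable problems with a proposed canonical field name."""
    problems: List[str] = []
    if not field_name.startswith(ALLOWED_SCOPES):
        problems.append("missing record_/entity_/gpu_ scope prefix")
    if field_name.endswith(FORBIDDEN_RATIO_SUFFIX):
        problems.append("ratios must be ..._fraction in [0.0, 1.0], not _pct")
    if "_throughput" in field_name and not field_name.endswith(THROUGHPUT_SUFFIX):
        problems.append("throughputs must end in _bytes_per_second")
    return problems
```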

Functions

canonical_field_names()

Every CanonicalRecord field name, in declaration order.

Classes

AggregationMode

How a record's metric values are integrated over time.

CanonicalRecord

One fully-translated reading for one entity, in canonical vocabulary.

Provenance

Where a record came from.

Exceptions

TranslateError

A canonical record violated a schema invariant (see validate).

exception kempnerpulse.translate.schema.TranslateError

Bases: ValueError

A canonical record violated a schema invariant (see validate).
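Because TranslateError subclasses ValueError, callers can catch either. The exact list of invariants validate() enforces is not spelled out here; this sketch assumes the [0.0, 1.0] range rule for ..._fraction fields is one of them, and uses a local stand-in class plus a hypothetical `check_fraction` helper.

```python
from typing import Optional


class TranslateError(ValueError):
    """Local stand-in for kempnerpulse.translate.schema.TranslateError."""


def check_fraction(field_name: str, value: Optional[float]) -> None:
    """Raise TranslateError if a ..._fraction reading falls outside [0.0, 1.0]."""
    if value is None:
        return  # an absent reading is valid: None means the source did not provide it
    # Written as "not (in range)" so NaN, which fails every comparison, is rejected too.
    if not 0.0 <= value <= 1.0:
        raise TranslateError(f"{field_name}={value!r} is outside [0.0, 1.0]")
```

A value such as 87.0 is the typical symptom of a percent that was never divided by 100.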

class kempnerpulse.translate.schema.AggregationMode

Bases: Enum

How a record’s metric values are integrated over time.

POINT = 'point'
WINDOW = 'window'
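Since the members carry string values, a mode stored as text (for example in a capture file, which is an assumption here) can be turned back into a member by calling the enum with the value. The sketch uses a local replica of the enum; the whitespace and case folding in the hypothetical `parse_aggregation_mode` is an added convenience, not documented behavior.

```python
from enum import Enum


class AggregationMode(Enum):
    """Local replica of the documented enum: same member names and values."""
    POINT = "point"
    WINDOW = "window"


def parse_aggregation_mode(raw: str) -> AggregationMode:
    """Turn a stored string back into an AggregationMode member."""
    try:
        # Calling an Enum class with a value looks the member up by that value.
        return AggregationMode(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(m.value for m in AggregationMode)
        raise ValueError(f"unknown aggregation mode {raw!r}; expected one of: {allowed}") from None
```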

class kempnerpulse.translate.schema.Provenance

Bases: Enum

Where a record came from.

DCGMI = 'dcgmi'
PROMETHEUS = 'prometheus'
NVML_FALLBACK = 'nvml_fallback'
REPLAY = 'replay'
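The json module cannot serialize Enum members on its own, so a record containing a Provenance needs a `default=` hook that emits the member's value; reading it back is a call to the enum. A minimal sketch with a local replica of the enum and a hypothetical `enum_to_value` hook.

```python
import json
from enum import Enum
from typing import Any


class Provenance(Enum):
    """Local replica of the documented enum: same member names and values."""
    DCGMI = "dcgmi"
    PROMETHEUS = "prometheus"
    NVML_FALLBACK = "nvml_fallback"
    REPLAY = "replay"


def enum_to_value(obj: Any) -> Any:
    """json.dumps default= hook: serialize any Enum member as its value."""
    if isinstance(obj, Enum):
        return obj.value
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


encoded = json.dumps({"record_provenance": Provenance.NVML_FALLBACK}, default=enum_to_value)
decoded = Provenance(json.loads(encoded)["record_provenance"])  # back to the enum member
```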

class kempnerpulse.translate.schema.CanonicalRecord

Bases: object

One fully-translated reading for one entity, in canonical vocabulary.

All optional metric fields (typed float | None, int | None, or bool | None) default to None and stay None unless the source provided them. The required block (no defaults) is the metadata every record must carry; the optional block holds the per-subsystem readings plus the cluster and reserved metadata that tolerate absence.
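The documented __init__ (required parameters first, then keyword parameters defaulting to None) is the shape a Python dataclass generates. Assuming CanonicalRecord is built that way, a heavily trimmed, hypothetical replica shows the two blocks in action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniRecord:
    """Hypothetical stand-in for CanonicalRecord, keeping only a handful of its fields."""
    # Required block: no defaults, so leaving one out raises TypeError at construction.
    record_hostname: str
    entity_gpu_index: int
    entity_gpu_uuid: str
    # Optional block: readings default to None ("not provided"), never to 0.
    gpu_board_power_draw_watts: Optional[float] = None
    gpu_die_temperature_celsius: Optional[float] = None


record = MiniRecord(
    record_hostname="node01",
    entity_gpu_index=0,
    entity_gpu_uuid="GPU-0000",
    gpu_board_power_draw_watts=312.5,
)
```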

record_schema_version: int
record_timestamp_monotonic_seconds: float
record_timestamp_wallclock_unix_seconds: float
record_aggregation_mode: AggregationMode
record_window_microseconds: int
record_freshness_microseconds: int
record_provenance: Provenance
record_hostname: str
entity_gpu_index: int
entity_gpu_uuid: str
entity_mig_instance_index: int | None = None
entity_process_id: int | None = None
entity_process_command_line_truncated: str | None = None
record_slurm_job_id: str | None = None
record_slurm_step_id: str | None = None
record_slurm_array_job_id: str | None = None
record_slurm_array_task_id: str | None = None
record_slurm_restart_count: int | None = None
record_node_index_in_job: int | None = None
record_mpi_rank: int | None = None
record_capture_clock_offset_microseconds: int | None = None
record_user_annotation_iteration_index: int | None = None
record_user_annotation_phase_label: str | None = None
record_user_annotation_step_count: int | None = None
record_user_annotation_request_id: str | None = None
record_user_annotation_token_count: int | None = None
gpu_streaming_multiprocessor_active_cycle_fraction: float | None = None
gpu_streaming_multiprocessor_warp_occupancy_fraction: float | None = None
gpu_tensor_core_pipe_active_cycle_fraction: float | None = None
gpu_tensor_core_half_precision_mma_active_cycle_fraction: float | None = None
gpu_tensor_core_integer_mma_active_cycle_fraction: float | None = None
gpu_tensor_core_double_precision_fma_active_cycle_fraction: float | None = None
gpu_tensor_core_double_mma_active_cycle_fraction: float | None = None
gpu_tensor_core_quarter_mma_active_cycle_fraction: float | None = None
gpu_cuda_core_floating_point_64bit_pipe_active_cycle_fraction: float | None = None
gpu_cuda_core_floating_point_32bit_pipe_active_cycle_fraction: float | None = None
gpu_cuda_core_floating_point_16bit_pipe_active_cycle_fraction: float | None = None
gpu_graphics_compute_engine_active_cycle_fraction: float | None = None
gpu_dram_controller_active_cycle_fraction: float | None = None
gpu_memory_copy_engine_busy_time_fraction: float | None = None
gpu_pcie_transmit_throughput_bytes_per_second: float | None = None
gpu_pcie_receive_throughput_bytes_per_second: float | None = None
gpu_pcie_replay_count: int | None = None
gpu_nvlink_aggregate_throughput_bytes_per_second: float | None = None
gpu_board_power_draw_watts: float | None = None
gpu_board_total_energy_joules: float | None = None
gpu_board_enforced_power_limit_watts: float | None = None
gpu_board_default_power_limit_watts: float | None = None
gpu_die_temperature_celsius: float | None = None
gpu_memory_die_temperature_celsius: float | None = None
gpu_streaming_multiprocessor_clock_frequency_megahertz: float | None = None
gpu_memory_clock_frequency_megahertz: float | None = None
gpu_framebuffer_used_mebibytes: float | None = None
gpu_framebuffer_free_mebibytes: float | None = None
gpu_framebuffer_reserved_mebibytes: float | None = None
gpu_framebuffer_total_mebibytes: float | None = None
gpu_nvml_busy_time_fraction: float | None = None
gpu_xid_error_count: int | None = None
gpu_uncorrectable_remapped_row_count: int | None = None
gpu_correctable_remapped_row_count: int | None = None
gpu_row_remap_failure_flag: bool | None = None
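Because None (and only None) means "not provided", a consumer that wants just the readings a source actually supplied should filter on `is not None`, not on truthiness, so a genuine 0 count or False flag survives. A sketch assuming dataclass-style fields, shown on a hypothetical trimmed replica with a hypothetical `provided_gpu_readings` helper.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class MiniRecord:
    """Hypothetical trimmed replica of CanonicalRecord."""
    entity_gpu_index: int
    gpu_board_power_draw_watts: Optional[float] = None
    gpu_xid_error_count: Optional[int] = None
    gpu_row_remap_failure_flag: Optional[bool] = None


def provided_gpu_readings(record: Any) -> Dict[str, Any]:
    """Return the gpu_* readings whose value is not None (0 and False are kept)."""
    return {
        f.name: getattr(record, f.name)
        for f in fields(record)
        if f.name.startswith("gpu_") and getattr(record, f.name) is not None
    }
```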

validate()

Raise TranslateError if any single-record invariant is violated.

Single-record invariants only. The cross-record invariant — energy is monotonically non-decreasing per entity — is enforced upstream by the Translate differencer, which is the only component that sees the sequence; it cannot be checked from one record in isolation.

Return type:

None
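The cross-record energy invariant needs state that a single record cannot provide: the previous reading for the same entity. A hypothetical sketch of that kind of differencer; `energy_deltas` and its (entity, joules) input format are illustrative, not the actual Translate differencer. It simply raises on a decrease; a real implementation may also need a policy for counter resets (for example after a driver reload).

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TranslateError(ValueError):
    """Local stand-in for kempnerpulse.translate.schema.TranslateError."""


def energy_deltas(samples: Iterable[Tuple[str, Optional[float]]]) -> List[Optional[float]]:
    """Per-sample energy increase in joules, tracked separately for each entity.

    The first sample of an entity and any None reading yield None; a decrease
    raises, because cumulative energy must be non-decreasing per entity.
    """
    last_seen: Dict[str, float] = {}
    deltas: List[Optional[float]] = []
    for entity, joules in samples:
        if joules is None:
            deltas.append(None)  # no reading: nothing to difference, keep previous baseline
            continue
        previous = last_seen.get(entity)
        if previous is not None and joules < previous:
            raise TranslateError(f"energy for {entity} went backwards: {previous} -> {joules} J")
        deltas.append(None if previous is None else joules - previous)
        last_seen[entity] = joules
    return deltas
```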

__init__(record_schema_version, record_timestamp_monotonic_seconds, record_timestamp_wallclock_unix_seconds, record_aggregation_mode, record_window_microseconds, record_freshness_microseconds, record_provenance, record_hostname, entity_gpu_index, entity_gpu_uuid, entity_mig_instance_index=None, entity_process_id=None, entity_process_command_line_truncated=None, record_slurm_job_id=None, record_slurm_step_id=None, record_slurm_array_job_id=None, record_slurm_array_task_id=None, record_slurm_restart_count=None, record_node_index_in_job=None, record_mpi_rank=None, record_capture_clock_offset_microseconds=None, record_user_annotation_iteration_index=None, record_user_annotation_phase_label=None, record_user_annotation_step_count=None, record_user_annotation_request_id=None, record_user_annotation_token_count=None, gpu_streaming_multiprocessor_active_cycle_fraction=None, gpu_streaming_multiprocessor_warp_occupancy_fraction=None, gpu_tensor_core_pipe_active_cycle_fraction=None, gpu_tensor_core_half_precision_mma_active_cycle_fraction=None, gpu_tensor_core_integer_mma_active_cycle_fraction=None, gpu_tensor_core_double_precision_fma_active_cycle_fraction=None, gpu_tensor_core_double_mma_active_cycle_fraction=None, gpu_tensor_core_quarter_mma_active_cycle_fraction=None, gpu_cuda_core_floating_point_64bit_pipe_active_cycle_fraction=None, gpu_cuda_core_floating_point_32bit_pipe_active_cycle_fraction=None, gpu_cuda_core_floating_point_16bit_pipe_active_cycle_fraction=None, gpu_graphics_compute_engine_active_cycle_fraction=None, gpu_dram_controller_active_cycle_fraction=None, gpu_memory_copy_engine_busy_time_fraction=None, gpu_pcie_transmit_throughput_bytes_per_second=None, gpu_pcie_receive_throughput_bytes_per_second=None, gpu_pcie_replay_count=None, gpu_nvlink_aggregate_throughput_bytes_per_second=None, gpu_board_power_draw_watts=None, gpu_board_total_energy_joules=None, gpu_board_enforced_power_limit_watts=None, gpu_board_default_power_limit_watts=None, gpu_die_temperature_celsius=None, gpu_memory_die_temperature_celsius=None, gpu_streaming_multiprocessor_clock_frequency_megahertz=None, gpu_memory_clock_frequency_megahertz=None, gpu_framebuffer_used_mebibytes=None, gpu_framebuffer_free_mebibytes=None, gpu_framebuffer_reserved_mebibytes=None, gpu_framebuffer_total_mebibytes=None, gpu_nvml_busy_time_fraction=None, gpu_xid_error_count=None, gpu_uncorrectable_remapped_row_count=None, gpu_correctable_remapped_row_count=None, gpu_row_remap_failure_flag=None)

Parameters:
  • record_schema_version (int)

  • record_timestamp_monotonic_seconds (float)

  • record_timestamp_wallclock_unix_seconds (float)

  • record_aggregation_mode (AggregationMode)

  • record_window_microseconds (int)

  • record_freshness_microseconds (int)

  • record_provenance (Provenance)

  • record_hostname (str)

  • entity_gpu_index (int)

  • entity_gpu_uuid (str)

  • entity_mig_instance_index (int | None)

  • entity_process_id (int | None)

  • entity_process_command_line_truncated (str | None)

  • record_slurm_job_id (str | None)

  • record_slurm_step_id (str | None)

  • record_slurm_array_job_id (str | None)

  • record_slurm_array_task_id (str | None)

  • record_slurm_restart_count (int | None)

  • record_node_index_in_job (int | None)

  • record_mpi_rank (int | None)

  • record_capture_clock_offset_microseconds (int | None)

  • record_user_annotation_iteration_index (int | None)

  • record_user_annotation_phase_label (str | None)

  • record_user_annotation_step_count (int | None)

  • record_user_annotation_request_id (str | None)

  • record_user_annotation_token_count (int | None)

  • gpu_streaming_multiprocessor_active_cycle_fraction (float | None)

  • gpu_streaming_multiprocessor_warp_occupancy_fraction (float | None)

  • gpu_tensor_core_pipe_active_cycle_fraction (float | None)

  • gpu_tensor_core_half_precision_mma_active_cycle_fraction (float | None)

  • gpu_tensor_core_integer_mma_active_cycle_fraction (float | None)

  • gpu_tensor_core_double_precision_fma_active_cycle_fraction (float | None)

  • gpu_tensor_core_double_mma_active_cycle_fraction (float | None)

  • gpu_tensor_core_quarter_mma_active_cycle_fraction (float | None)

  • gpu_cuda_core_floating_point_64bit_pipe_active_cycle_fraction (float | None)

  • gpu_cuda_core_floating_point_32bit_pipe_active_cycle_fraction (float | None)

  • gpu_cuda_core_floating_point_16bit_pipe_active_cycle_fraction (float | None)

  • gpu_graphics_compute_engine_active_cycle_fraction (float | None)

  • gpu_dram_controller_active_cycle_fraction (float | None)

  • gpu_memory_copy_engine_busy_time_fraction (float | None)

  • gpu_pcie_transmit_throughput_bytes_per_second (float | None)

  • gpu_pcie_receive_throughput_bytes_per_second (float | None)

  • gpu_pcie_replay_count (int | None)

  • gpu_nvlink_aggregate_throughput_bytes_per_second (float | None)

  • gpu_board_power_draw_watts (float | None)

  • gpu_board_total_energy_joules (float | None)

  • gpu_board_enforced_power_limit_watts (float | None)

  • gpu_board_default_power_limit_watts (float | None)

  • gpu_die_temperature_celsius (float | None)

  • gpu_memory_die_temperature_celsius (float | None)

  • gpu_streaming_multiprocessor_clock_frequency_megahertz (float | None)

  • gpu_memory_clock_frequency_megahertz (float | None)

  • gpu_framebuffer_used_mebibytes (float | None)

  • gpu_framebuffer_free_mebibytes (float | None)

  • gpu_framebuffer_reserved_mebibytes (float | None)

  • gpu_framebuffer_total_mebibytes (float | None)

  • gpu_nvml_busy_time_fraction (float | None)

  • gpu_xid_error_count (int | None)

  • gpu_uncorrectable_remapped_row_count (int | None)

  • gpu_correctable_remapped_row_count (int | None)

  • gpu_row_remap_failure_flag (bool | None)

Return type:

None

kempnerpulse.translate.schema.canonical_field_names()

Every CanonicalRecord field name, in declaration order.

Return type:

tuple
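A declaration-ordered tuple of names is exactly what a tabular exporter needs for a stable column order. Assuming CanonicalRecord is a dataclass, `dataclasses.fields` yields the same ordering; the sketch below applies it to a hypothetical trimmed replica and writes a CSV header plus one row with the standard csv module.

```python
import csv
import io
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class MiniRecord:
    """Hypothetical trimmed replica of CanonicalRecord."""
    record_hostname: str
    entity_gpu_index: int
    gpu_board_power_draw_watts: Optional[float] = None


def field_names() -> Tuple[str, ...]:
    """Field names in declaration order, mirroring what canonical_field_names() documents."""
    return tuple(f.name for f in fields(MiniRecord))


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=field_names())
writer.writeheader()
writer.writerow({"record_hostname": "node01", "entity_gpu_index": 0, "gpu_board_power_draw_watts": None})
```

Note that the csv module writes None as an empty cell, so "not provided" stays distinguishable from 0 in numeric columns.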