kempnerforge.profiling¶

Performance profiling for KempnerForge.

class kempnerforge.profiling.CUDATimer[source]¶

Bases: object

CUDA event-based timer for accurate GPU timing.

Uses CUDA events to measure elapsed time without CPU synchronization overhead (synchronizes only when reading the result).

__init__()[source]¶

Return type:: None

start()[source]¶

Record the start event on the current CUDA stream.

Return type:: None

stop()[source]¶

Record the end event on the current CUDA stream.

Return type:: None

elapsed_ms()[source]¶

Get elapsed time in milliseconds (synchronizes CUDA).

Return type:: float

class kempnerforge.profiling.CUDATimerCollection[source]¶

Bases: object

Collection of named CUDA timers for profiling multiple regions.

Manages timers for distinct training phases (forward, backward, comm, etc.) and reports all elapsed times as a dictionary.

When enabled=False, all operations are no-ops with zero overhead — start/stop calls return immediately without recording CUDA events.

Parameters:

regions – List of region names to track.
enabled – Whether timing is active. When False, all calls are no-ops.

__init__(regions, enabled=True)[source]¶

Parameters:

regions (list[str])
enabled (bool)

Return type:

None

property enabled: bool¶

start(region)[source]¶

Start timing a named region.

Parameters:: region (str)
Return type:: None

stop(region)[source]¶

Stop timing a named region.

Parameters:: region (str)
Return type:: None

elapsed_ms(region)[source]¶

Get elapsed time for a specific region in milliseconds.

Parameters:: region (str)
Return type:: float

elapsed_all()[source]¶

Get elapsed times for all regions in milliseconds.

Returns a dict mapping region name → elapsed_ms. Regions that were never started/stopped return 0.0.

Return type:: dict[str, float]

kempnerforge.profiling.build_profiler(config, rank=0)[source]¶

Build a torch.profiler instance from config.

Returns None if profiling is disabled.

Parameters:

config (ProfilingConfig) – Profiling configuration.
rank (int) – Current rank (for output directory naming).

Returns:

A torch.profiler.profile context manager, or None.

Return type:

torch.profiler.profile | None

kempnerforge.profiling.print_profiler_summary(prof, trace_dir=None)[source]¶

Print kernel-level GPU profiling summary and optionally save to file.

Prints top CUDA kernels by time and FLOPS, an aggregate GPU time breakdown (matmul, communication, memory, other), and achieved TFLOPS vs hardware peak.

If trace_dir is provided, writes a summary.md file alongside the traces.

Parameters:

prof (torch.profiler.profile) – A completed torch.profiler.profile instance.
trace_dir (str | None) – Optional directory to save summary.md report.

Return type:

None

Modules

`cuda_timer`	CUDA event-based timing utilities.
`profiler`	torch.profiler integration for KempnerForge.