kempnerforge.profiling¶
Performance profiling for KempnerForge.
- class kempnerforge.profiling.CUDATimer[source]¶
Bases: object
CUDA event-based timer for accurate GPU timing.
Uses CUDA events to measure elapsed time without CPU synchronization overhead (synchronizes only when the result is read).
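The interface and internals below are assumptions, not the library's actual code: a minimal sketch of the CUDA event-timing pattern this class describes, with time.perf_counter() standing in for CUDA events so the sketch runs without a GPU. Comments note the torch.cuda.Event equivalents.

```python
import time


class EventTimer:
    """Sketch of a start/stop/elapsed timer interface (names assumed).

    A CUDA event-based timer records torch.cuda.Event(enable_timing=True)
    pairs on the current stream and calls synchronize() only when the
    elapsed time is read, so the timed region itself is never stalled.
    Here time.perf_counter() stands in so the sketch runs anywhere.
    """

    def __init__(self):
        self._start = None
        self._elapsed_ms = 0.0

    def start(self):
        # CUDA version: self._start_event.record() -- enqueued on the
        # stream, no host-side blocking.
        self._start = time.perf_counter()

    def stop(self):
        # CUDA version: self._end_event.record() -- still no sync.
        self._elapsed_ms = (time.perf_counter() - self._start) * 1000.0

    def elapsed_ms(self):
        # CUDA version: end_event.synchronize(), then
        # start_event.elapsed_time(end_event) -- the only blocking call.
        return self._elapsed_ms
```

The key design point carried over from the docstring: synchronization is deferred to the read, so timing many regions adds no stalls inside the training step.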
- class kempnerforge.profiling.CUDATimerCollection[source]¶
Bases: object
Collection of named CUDA timers for profiling multiple regions.
Manages timers for distinct training phases (forward, backward, comm, etc.) and reports all elapsed times as a dictionary.
When enabled=False, all operations are no-ops with zero overhead: start/stop calls return immediately without recording CUDA events.
- Parameters:
regions – List of region names to track.
enabled – Whether timing is active. When False, all calls are no-ops.
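The behavior described above can be sketched as follows. Method names and the report shape are assumptions; time.perf_counter() stands in for CUDA events so the sketch runs without a GPU.

```python
import time


class TimerCollection:
    """Sketch of a named-region timer collection (names assumed).

    Mirrors the documented contract: constructed with a list of region
    names, reports elapsed times as a dictionary, and when enabled=False
    every call returns immediately without recording anything.
    """

    def __init__(self, regions, enabled=True):
        self.enabled = enabled
        self._starts = {}
        self._elapsed_ms = {name: 0.0 for name in regions}

    def start(self, region):
        if not self.enabled:
            return  # no-op path: nothing recorded, zero overhead
        self._starts[region] = time.perf_counter()

    def stop(self, region):
        if not self.enabled:
            return
        dt = time.perf_counter() - self._starts.pop(region)
        self._elapsed_ms[region] += dt * 1000.0

    def report(self):
        # Empty dict when disabled, so callers can log unconditionally.
        return dict(self._elapsed_ms) if self.enabled else {}
```

A typical use would wrap each phase of a training step, e.g. start("forward") before the model call and stop("forward") after it, then log report() once per step.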
- kempnerforge.profiling.build_profiler(config, rank=0)[source]¶
Build a torch.profiler instance from config.
Returns None if profiling is disabled.
- Parameters:
config (ProfilingConfig) – Profiling configuration.
rank (int) – Current rank (for output directory naming).
- Returns:
A torch.profiler.profile context manager, or None.
- Return type:
torch.profiler.profile | None
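A sketch of the disabled-returns-None contract. The ProfilingConfig field names here are assumptions for illustration, and a placeholder dict stands in for the torch.profiler.profile object (shown in comments) so the sketch runs without torch installed.

```python
from dataclasses import dataclass


@dataclass
class ProfilingConfig:
    # Field names are illustrative assumptions, not the real schema.
    enabled: bool = False
    wait: int = 1
    warmup: int = 1
    active: int = 3
    trace_dir: str = "traces"


def build_profiler(config, rank=0):
    """Return None when profiling is disabled, per the documented contract."""
    if not config.enabled:
        return None
    # A real implementation would build something like:
    #   torch.profiler.profile(
    #       schedule=torch.profiler.schedule(
    #           wait=config.wait, warmup=config.warmup, active=config.active),
    #       on_trace_ready=torch.profiler.tensorboard_trace_handler(
    #           f"{config.trace_dir}/rank{rank}"),
    #       with_flops=True,
    #   )
    # Placeholder stand-in so the sketch is runnable without torch:
    return {"schedule": (config.wait, config.warmup, config.active),
            "trace_dir": f"{config.trace_dir}/rank{rank}"}
```

Callers can then guard the training loop with `if prof is not None: prof.step()`, keeping the no-profiling path free of profiler calls.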
- kempnerforge.profiling.print_profiler_summary(prof, trace_dir=None)[source]¶
Print kernel-level GPU profiling summary and optionally save to file.
Prints top CUDA kernels by time and FLOPS, an aggregate GPU time breakdown (matmul, communication, memory, other), and achieved TFLOPS vs hardware peak.
If trace_dir is provided, writes a summary.md file alongside the traces.
- Parameters:
prof (torch.profiler.profile) – A completed torch.profiler.profile instance.
trace_dir (str | None) – Optional directory to save summary.md report.
- Return type:
None
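The aggregate GPU time breakdown described above amounts to bucketing kernel times by name. The keyword lists below are illustrative guesses at such a classification; a real summary would be built from a completed profiler's key_averages() table rather than a plain list of pairs.

```python
def bucket_kernel_times(kernel_times):
    """Group (kernel_name, cuda_time_us) pairs into coarse categories.

    Sketch of a matmul / communication / memory / other breakdown;
    the substring heuristics are assumptions for illustration.
    """
    buckets = {"matmul": 0.0, "communication": 0.0, "memory": 0.0, "other": 0.0}
    for name, us in kernel_times:
        lowered = name.lower()
        if "gemm" in lowered or "matmul" in lowered:
            buckets["matmul"] += us        # dense math kernels
        elif "nccl" in lowered or "allreduce" in lowered:
            buckets["communication"] += us  # collective ops
        elif "memcpy" in lowered or "memset" in lowered:
            buckets["memory"] += us         # data movement
        else:
            buckets["other"] += us
    return buckets
```

Dividing the matmul bucket's FLOP count by its time, then by the device's peak throughput, gives the kind of achieved-vs-peak TFLOPS figure the summary reports.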
Modules
- CUDA event-based timing utilities.
- torch.profiler integration for KempnerForge.