kempnerforge.model.hooks

Activation extraction hooks for mechanistic interpretability.

Provides tools for capturing intermediate activations, attention patterns, and hidden states during inference — essential for probing, CKA analysis, SVCCA, and other interpretability research.

Usage:

store = ActivationStore(model, layers=["layers.0.attention", "layers.5.mlp"])
store.enable()
model(input_ids)
act = store.get("layers.0.attention")  # (batch, seq, dim) on CPU
store.disable()

Functions

extract_representations(model, dataset, ...)

Run model over dataset and collect activations from specified layers.

save_activations(activations, path)

Save activations to a .npz file.

Classes

ActivationStore

Register forward hooks on named modules to capture activations.

class kempnerforge.model.hooks.ActivationStore[source]

Bases: object

Register forward hooks on named modules to capture activations.

Captured tensors are moved to CPU to avoid GPU memory pressure. Use enable() / disable() to control when hooks are active.

Parameters:
  • model – The model to instrument.

  • layers – List of module names (dot-separated FQNs) to capture. Example: ["layers.0.attention", "layers.5.mlp", "norm"]

__init__(model, layers=None)[source]

Parameters:
  • model – The model to instrument.

  • layers – List of module names (dot-separated FQNs) to capture.

Return type:

None

property enabled: bool

Whether hooks are currently registered.

property layer_names: list[str]

Names of the layers targeted for capture.

property activations: dict[str, torch.Tensor]

Return a copy of captured activations.

enable()[source]

Register forward hooks on all target layers.

Return type:

None

disable()[source]

Remove all hooks and mark as disabled.

Return type:

None

clear()[source]

Clear captured activations (keeps hooks registered).

Return type:

None

get(name)[source]

Get captured activation for a layer, or None if not captured.

Parameters:

name (str)

Return type:

torch.Tensor | None
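The capture mechanism this class describes can be sketched with plain PyTorch forward hooks. This is an illustrative reimplementation, not kempnerforge's actual code; `MiniActivationStore` and the toy model below are made up for the example:

```python
import torch
import torch.nn as nn

class MiniActivationStore:
    """Minimal sketch of hook-based activation capture."""

    def __init__(self, model, layers):
        self.model = model
        self.layers = list(layers)
        self._handles = []
        self._acts = {}

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach and move to CPU to avoid GPU memory pressure.
            self._acts[name] = output.detach().cpu()
        return hook

    def enable(self):
        for name in self.layers:
            # Resolve the dot-separated FQN to a submodule, then hook it.
            module = self.model.get_submodule(name)
            self._handles.append(module.register_forward_hook(self._make_hook(name)))

    def disable(self):
        for h in self._handles:
            h.remove()
        self._handles = []

    def get(self, name):
        return self._acts.get(name)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
store = MiniActivationStore(model, layers=["0", "2"])
store.enable()
model(torch.randn(3, 4))
store.disable()
print(store.get("0").shape)  # torch.Size([3, 8])
```

Hooks fire on every forward pass while enabled, so each capture overwrites the previous one for that layer; the real class's clear() / disable() split follows the same handle-removal pattern.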

kempnerforge.model.hooks.extract_representations(model, dataset, layers, device, batch_size=32, max_samples=None)[source]

Run model over dataset and collect activations from specified layers.

Returns a dict mapping layer names to tensors, typically of shape (num_samples, seq_len, hidden_dim); the exact shape depends on what each hooked layer outputs.

Parameters:
  • model (torch.nn.Module) – Model to extract from (should already be on device).

  • dataset (torch.utils.data.Dataset) – Map-style dataset returning dicts with "input_ids".

  • layers (list[str]) – Module FQNs to capture (e.g. ["layers.0.attention"]).

  • device (torch.device) – Device to run inference on.

  • batch_size (int) – Batch size for extraction.

  • max_samples (int | None) – Stop after this many samples (None = full dataset).

Returns:

Dict of {layer_name: Tensor} with activations on CPU.

Return type:

dict[str, torch.Tensor]
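The extraction loop can be sketched as follows. This is an illustrative version written against the documented contract, not the library's implementation; `extract_sketch`, `ToyDataset`, and the model are stand-ins:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

def extract_sketch(model, dataset, layers, device, batch_size=32, max_samples=None):
    # Accumulate per-batch captures, then concatenate along the sample dim.
    captured = {name: [] for name in layers}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name].append(output.detach().cpu())
        return hook

    handles = [model.get_submodule(n).register_forward_hook(make_hook(n))
               for n in layers]
    loader = DataLoader(dataset, batch_size=batch_size)
    seen = 0
    model.eval()
    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            model(input_ids)
            seen += input_ids.shape[0]
            if max_samples is not None and seen >= max_samples:
                break
    for h in handles:
        h.remove()
    return {name: torch.cat(chunks, dim=0) for name, chunks in captured.items()}

class ToyDataset(Dataset):
    """Map-style dataset returning dicts with "input_ids", as documented."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return {"input_ids": torch.randn(4)}

model = nn.Sequential(nn.Linear(4, 6), nn.ReLU(), nn.Linear(6, 2))
reps = extract_sketch(model, ToyDataset(), ["0"], torch.device("cpu"), batch_size=4)
print(reps["0"].shape)  # torch.Size([10, 6])
```

Concatenating along dim 0 assumes every batch produces same-shaped trailing dimensions; variable-length sequences would need padding before torch.cat.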

kempnerforge.model.hooks.save_activations(activations, path)[source]

Save activations to a .npz file.

Parameters:
  • activations (dict[str, torch.Tensor]) – Mapping of layer names to captured tensors.

  • path – Destination .npz file path.

Return type:

None
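Since numpy's savez stores one array per keyword, a plausible sketch of this function writes each layer's tensor under its layer name (illustrative only; save_activations' actual on-disk layout may differ):

```python
import os
import tempfile
import numpy as np
import torch

def save_activations_sketch(activations, path):
    # One .npz entry per layer, keyed by the layer's name.
    arrays = {name: t.detach().cpu().numpy() for name, t in activations.items()}
    np.savez(path, **arrays)

# Round trip:
acts = {"layers.0.attention": torch.randn(2, 3, 4)}
path = os.path.join(tempfile.mkdtemp(), "acts.npz")
save_activations_sketch(acts, path)
loaded = np.load(path)
print(loaded["layers.0.attention"].shape)  # (2, 3, 4)
```

Dotted layer names are legal .npz keys, so the FQNs used elsewhere in this module round-trip unchanged through np.load.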