kempnerforge.model.hooks

Activation extraction hooks for mechanistic interpretability.

Provides tools for capturing intermediate activations, attention patterns, and hidden states during inference — essential for probing, CKA analysis, SVCCA, and other interpretability research.

Usage:

store = ActivationStore(model, layers=["layers.0.attention", "layers.5.mlp"])
store.enable()
model(input_ids)
act = store.get("layers.0.attention")  # (batch, seq, dim) on CPU
store.disable()

Functions

extract_representations(model, dataset, ...)

Run model over dataset and collect activations from specified layers.

save_activations(activations, path)

Save activations to a .npz file.

Classes

ActivationStore

Register forward hooks on named modules to capture activations.

class kempnerforge.model.hooks.ActivationStore[source]

Bases: object

Register forward hooks on named modules to capture activations.

Captured tensors are moved to CPU to avoid GPU memory pressure. Use enable() / disable() to control when hooks are active.

Parameters:
  • model – The model to instrument.

  • layers – List of module names (dot-separated FQNs) to capture. Example: ["layers.0.attention", "layers.5.mlp", "norm"]

__init__(model, layers=None)[source]

Parameters:
  • model – The model to instrument.

  • layers – List of module names (dot-separated FQNs) to capture.

Return type:

None

property enabled: bool

Whether hooks are currently registered.

property layer_names: list[str]

Names of the layers targeted for capture.

property activations: dict[str, torch.Tensor]

Return a copy of captured activations.

enable()[source]

Register forward hooks on all target layers.

Return type:

None

disable()[source]

Remove all hooks and mark as disabled.

Return type:

None

clear()[source]

Clear captured activations (keeps hooks registered).

Return type:

None

get(name)[source]

Get captured activation for a layer, or None if not captured.

Parameters:

name (str)

Return type:

torch.Tensor | None
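The capture mechanism this class describes can be sketched with plain PyTorch forward hooks. This is an illustrative reimplementation, not kempnerforge's actual code; `MiniActivationStore` and the toy model below are made up for the example:

```python
import torch
import torch.nn as nn

class MiniActivationStore:
    """Minimal sketch of hook-based activation capture."""

    def __init__(self, model, layers):
        self.model = model
        self.layers = list(layers)
        self._handles = []
        self._acts = {}

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach and move to CPU to avoid GPU memory pressure.
            self._acts[name] = output.detach().cpu()
        return hook

    def enable(self):
        for name in self.layers:
            # Resolve the dot-separated FQN to a submodule, then hook it.
            module = self.model.get_submodule(name)
            self._handles.append(module.register_forward_hook(self._make_hook(name)))

    def disable(self):
        for h in self._handles:
            h.remove()
        self._handles = []

    def get(self, name):
        return self._acts.get(name)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
store = MiniActivationStore(model, layers=["0", "2"])
store.enable()
model(torch.randn(3, 4))
store.disable()
print(store.get("0").shape)  # torch.Size([3, 8])
```

Hooks fire on every forward pass while enabled, so each capture overwrites the previous one for that layer; the real class's clear() / disable() split follows the same handle-removal pattern.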

kempnerforge.model.hooks.extract_representations(model, dataset, layers, device, batch_size=32, max_samples=None)[source]

Run model over dataset and collect activations from specified layers.

Returns a dict mapping layer names to tensors, typically of shape (num_samples, seq_len, hidden_dim); the exact shape depends on what each hooked layer outputs.

Parameters:
  • model (torch.nn.Module) – Model to extract from (should already be on device).

  • dataset (torch.utils.data.Dataset) – Map-style dataset returning dicts with "input_ids".

  • layers (list[str]) – Module FQNs to capture (e.g. ["layers.0.attention"]).

  • device (torch.device) – Device to run inference on.

  • batch_size (int) – Batch size for extraction.

  • max_samples (int | None) – Stop after this many samples (None = full dataset).

Returns:

Dict of {layer_name: Tensor} with activations on CPU.

Return type:

dict[str, torch.Tensor]
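The extraction loop can be sketched as follows. This is an illustrative version written against the documented contract, not the library's implementation; `extract_sketch`, `ToyDataset`, and the model are stand-ins:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

def extract_sketch(model, dataset, layers, device, batch_size=32, max_samples=None):
    # Accumulate per-batch captures, then concatenate along the sample dim.
    captured = {name: [] for name in layers}

    def make_hook(name):
        def hook(module, inputs, output):
            captured[name].append(output.detach().cpu())
        return hook

    handles = [model.get_submodule(n).register_forward_hook(make_hook(n))
               for n in layers]
    loader = DataLoader(dataset, batch_size=batch_size)
    seen = 0
    model.eval()
    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            model(input_ids)
            seen += input_ids.shape[0]
            if max_samples is not None and seen >= max_samples:
                break
    for h in handles:
        h.remove()
    return {name: torch.cat(chunks, dim=0) for name, chunks in captured.items()}

class ToyDataset(Dataset):
    """Map-style dataset returning dicts with "input_ids", as documented."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return {"input_ids": torch.randn(4)}

model = nn.Sequential(nn.Linear(4, 6), nn.ReLU(), nn.Linear(6, 2))
reps = extract_sketch(model, ToyDataset(), ["0"], torch.device("cpu"), batch_size=4)
print(reps["0"].shape)  # torch.Size([10, 6])
```

Concatenating along dim 0 assumes every batch produces same-shaped trailing dimensions; variable-length sequences would need padding before torch.cat.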

kempnerforge.model.hooks.save_activations(activations, path)[source]

Save activations to a .npz file.

Parameters:
  • activations (dict[str, torch.Tensor]) – Mapping of layer names to captured tensors.

  • path – Destination .npz file path.

Return type:

None
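Since numpy's savez stores one array per keyword, a plausible sketch of this function writes each layer's tensor under its layer name (illustrative only; save_activations' actual on-disk layout may differ):

```python
import os
import tempfile
import numpy as np
import torch

def save_activations_sketch(activations, path):
    # One .npz entry per layer, keyed by the layer's name.
    arrays = {name: t.detach().cpu().numpy() for name, t in activations.items()}
    np.savez(path, **arrays)

# Round trip:
acts = {"layers.0.attention": torch.randn(2, 3, 4)}
path = os.path.join(tempfile.mkdtemp(), "acts.npz")
save_activations_sketch(acts, path)
loaded = np.load(path)
print(loaded["layers.0.attention"].shape)  # (2, 3, 4)
```

Dotted layer names are legal .npz keys, so the FQNs used elsewhere in this module round-trip unchanged through np.load.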