kempnerforge.config.data

Data pipeline configuration.

Classes

DataConfig      Data pipeline settings.
DatasetSource   A single data source in a multi-dataset mixture.
TrainingPhase   A training phase with custom dataset weights and LR scaling.

class kempnerforge.config.data.DatasetSource

Bases: object

A single data source in a multi-dataset mixture.

Either path (a pre-tokenized dataset) or hf_name (a HuggingFace dataset) must be set. weight controls the source's relative sampling probability within the mixture.
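To make "relative sampling probability" concrete, here is a minimal sketch that normalizes weights into probabilities. It uses a local stand-in dataclass that mirrors the documented fields rather than importing kempnerforge. The helper sampling_probabilities, the example paths and dataset names, and the choice to raise ValueError are all assumptions made for illustration, not part of the documented API.

```python
from dataclasses import dataclass


@dataclass
class DatasetSource:
    """Local stand-in mirroring the documented fields (not the real class)."""

    path: str = ""
    weight: float = 1.0
    name: str = ""
    hf_name: str = ""
    hf_config: str = ""


def sampling_probabilities(sources):
    """Hypothetical helper: turn relative weights into probabilities."""
    for src in sources:
        # The docs say either path or hf_name must be set; how the real
        # library reports a violation is not documented, so this raises.
        if not (src.path or src.hf_name):
            raise ValueError(f"source {src.name!r} needs path or hf_name")
    total = sum(src.weight for src in sources)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {src.name: src.weight / total for src in sources}


# Example values are made up for illustration.
sources = [
    DatasetSource(path="/data/web_tokens", weight=3.0, name="web"),
    DatasetSource(hf_name="example/code-corpus", weight=1.0, name="code"),
]
probs = sampling_probabilities(sources)
print(probs)
```

Weights of 3.0 and 1.0 give a 3:1 sampling ratio; only the ratio matters, not the absolute values.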

path: str = ''
weight: float = 1.0
name: str = ''
hf_name: str = ''
hf_config: str = ''
__init__(path='', weight=1.0, name='', hf_name='', hf_config='')
Return type: None

class kempnerforge.config.data.TrainingPhase

Bases: object

A training phase with custom dataset weights and LR scaling.

Used for data annealing: at start_step, the mixture sampler switches to dataset_weights and the learning rate is multiplied by lr_scale.
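The switch at start_step can be sketched as a lookup of the latest phase whose start_step has been reached. This is a sketch under stated assumptions: TrainingPhase below is a local stand-in with the documented fields, and active_phase, the step numbers, the weights, and base_lr are hypothetical. How the real trainer selects a phase is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingPhase:
    """Local stand-in mirroring the documented fields (not the real class)."""

    start_step: int = 0
    dataset_weights: dict[str, float] = field(default_factory=dict)
    lr_scale: float = 1.0


def active_phase(phases, step):
    """Hypothetical helper: latest phase with start_step <= step, else None."""
    current = None
    for phase in sorted(phases, key=lambda p: p.start_step):
        if phase.start_step <= step:
            current = phase
    return current


# Example schedule (made-up values): rebalance the mixture and halve the
# learning rate from step 8000 onward.
phases = [
    TrainingPhase(start_step=0, dataset_weights={"web": 3.0, "code": 1.0}),
    TrainingPhase(
        start_step=8000,
        dataset_weights={"web": 1.0, "code": 1.0},
        lr_scale=0.5,
    ),
]

base_lr = 3e-4  # assumed base learning rate
for step in (0, 7999, 8000):
    phase = active_phase(phases, step)
    print(step, phase.dataset_weights, base_lr * phase.lr_scale)
```

In this sketch, steps 0 through 7999 use the first phase unchanged, and step 8000 is the first step that sees the new weights and the scaled learning rate.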

start_step: int = 0
dataset_weights: dict[str, float]
lr_scale: float = 1.0
__init__(start_step=0, dataset_weights=<factory>, lr_scale=1.0)
Return type: None

class kempnerforge.config.data.DataConfig

Bases: object

Data pipeline settings.

dataset_path: str = ''
file_pattern: str = '*.npy'
tokenizer_path: str = ''
num_workers: int = 4
pin_memory: bool = True
prefetch_factor: int = 2
hf_dataset_name: str | None = None
hf_dataset_config: str | None = None
hf_dataset_split: str = 'train'
hf_dataset_text_field: str = 'text'
hf_streaming: bool = False
pack_sequences: bool = False
datasets: list[DatasetSource]
mix_temperature: float = 1.0
phases: list[TrainingPhase]
anneal_start_step: int = 0
anneal_weights: dict[str, float]
__init__(dataset_path='', file_pattern='*.npy', tokenizer_path='', num_workers=4, pin_memory=True, prefetch_factor=2, hf_dataset_name=None, hf_dataset_config=None, hf_dataset_split='train', hf_dataset_text_field='text', hf_streaming=False, pack_sequences=False, datasets=<factory>, mix_temperature=1.0, phases=<factory>, anneal_start_step=0, anneal_weights=<factory>)
Return type: None
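The following sketch shows how the three classes might compose into one configuration. The dataclasses are local stand-ins that copy the documented field names and defaults, so the snippet runs without kempnerforge installed; the paths, dataset names, and values are made up. The `<factory>` defaults in the signatures above are how dataclasses display a default_factory, which is assumed here to be list or dict according to the annotated type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetSource:
    path: str = ""
    weight: float = 1.0
    name: str = ""
    hf_name: str = ""
    hf_config: str = ""


@dataclass
class TrainingPhase:
    start_step: int = 0
    dataset_weights: dict[str, float] = field(default_factory=dict)
    lr_scale: float = 1.0


@dataclass
class DataConfig:
    """Local stand-in mirroring the documented fields (not the real class)."""

    dataset_path: str = ""
    file_pattern: str = "*.npy"
    tokenizer_path: str = ""
    num_workers: int = 4
    pin_memory: bool = True
    prefetch_factor: int = 2
    hf_dataset_name: Optional[str] = None  # documented as str | None
    hf_dataset_config: Optional[str] = None  # documented as str | None
    hf_dataset_split: str = "train"
    hf_dataset_text_field: str = "text"
    hf_streaming: bool = False
    pack_sequences: bool = False
    datasets: list[DatasetSource] = field(default_factory=list)
    mix_temperature: float = 1.0
    phases: list[TrainingPhase] = field(default_factory=list)
    anneal_start_step: int = 0
    anneal_weights: dict[str, float] = field(default_factory=dict)


# A two-source mixture with one later phase; all values are illustrative.
cfg = DataConfig(
    tokenizer_path="/models/tokenizer",
    pack_sequences=True,
    datasets=[
        DatasetSource(path="/data/web_tokens", weight=3.0, name="web"),
        DatasetSource(hf_name="example/code-corpus", weight=1.0, name="code"),
    ],
    phases=[
        TrainingPhase(
            start_step=8000,
            dataset_weights={"web": 1.0, "code": 1.0},
            lr_scale=0.5,
        ),
    ],
)

print([src.name for src in cfg.datasets])
print(cfg.num_workers, cfg.file_pattern)
```

Because the mutable fields use a factory, each instance gets its own list or dict, so editing one config's datasets does not affect another's.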