kempnerforge.config.data
Data pipeline configuration.
Classes
DataConfig | Data pipeline settings.
DatasetSource | A single data source in a multi-dataset mixture.
TrainingPhase | A training phase with custom dataset weights and LR scaling.
- class kempnerforge.config.data.DatasetSource
Bases: object
A single data source in a multi-dataset mixture.
Either path (pre-tokenized) or hf_name (HuggingFace) must be set. weight controls the relative sampling probability.
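The rule above can be sketched with a small stand-in class. This is not the library's implementation: SourceSketch and sampling_probabilities are hypothetical names, and only the field names path, hf_name, and weight come from the description. The default values and the normalization of weights into probabilities are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSketch:
    """Hypothetical stand-in for DatasetSource (field names taken from the docs)."""

    path: str = ""                 # assumed default; location of pre-tokenized data
    hf_name: Optional[str] = None  # assumed default; HuggingFace dataset name
    weight: float = 1.0            # assumed default; relative sampling weight

    def __post_init__(self):
        # Mirror the documented rule: either path or hf_name must be set.
        if not self.path and not self.hf_name:
            raise ValueError("either path or hf_name must be set")


def sampling_probabilities(sources):
    """Normalize relative weights so they sum to 1 (assumed interpretation)."""
    total = sum(s.weight for s in sources)
    return [s.weight / total for s in sources]


# Placeholder path and dataset name, chosen only for the example.
sources = [
    SourceSketch(path="/data/code_tokens", weight=3.0),
    SourceSketch(hf_name="example/web-text", weight=1.0),
]
probs = sampling_probabilities(sources)
```

Under this reading, a source with weight 3.0 is sampled three times as often as one with weight 1.0, regardless of the absolute scale of the weights.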
- class kempnerforge.config.data.TrainingPhase
Bases: object
A training phase with custom dataset weights and LR scaling.
Used for data annealing: at start_step, the mixture sampler switches to dataset_weights and the learning rate is multiplied by lr_scale.
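A minimal sketch of the phase switch described above, under stated assumptions. PhaseSketch, active_phase, and lr_at are hypothetical names, not part of kempnerforge. The sketch assumes that dataset_weights maps dataset names to weights, that the switch is inclusive of start_step, that the most recently started phase wins, and that lr_scale multiplies a fixed base learning rate; the real sampler and scheduler may differ on any of these points.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseSketch:
    """Hypothetical stand-in for TrainingPhase (field names taken from the docs)."""

    start_step: int
    dataset_weights: dict = field(default_factory=dict)  # assumed: name -> weight
    lr_scale: float = 1.0                                 # assumed default


def active_phase(phases, step):
    """Return the latest phase whose start_step has been reached, else None."""
    reached = [p for p in phases if p.start_step <= step]
    return max(reached, key=lambda p: p.start_step) if reached else None


def lr_at(step, base_lr, phases):
    """Multiply the base learning rate by the active phase's lr_scale."""
    phase = active_phase(phases, step)
    return base_lr * phase.lr_scale if phase else base_lr


# Illustrative schedule: shift toward code data while lowering the LR.
phases = [
    PhaseSketch(start_step=1000, dataset_weights={"web": 1.0, "code": 1.0}, lr_scale=0.5),
    PhaseSketch(start_step=2000, dataset_weights={"web": 0.0, "code": 1.0}, lr_scale=0.1),
]
```

Before step 1000 no phase is active, so the base learning rate and the original mixture apply unchanged.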
- class kempnerforge.config.data.DataConfig
Bases: object
Data pipeline settings.
- datasets: list[DatasetSource]
- phases: list[TrainingPhase]
- __init__(dataset_path='', file_pattern='*.npy', tokenizer_path='', num_workers=4, pin_memory=True, prefetch_factor=2, hf_dataset_name=None, hf_dataset_config=None, hf_dataset_split='train', hf_dataset_text_field='text', hf_streaming=False, pack_sequences=False, datasets=<factory>, mix_temperature=1.0, phases=<factory>, anneal_start_step=0, anneal_weights=<factory>)
- Parameters:
dataset_path (str)
file_pattern (str)
tokenizer_path (str)
num_workers (int)
pin_memory (bool)
prefetch_factor (int)
hf_dataset_name (str | None)
hf_dataset_config (str | None)
hf_dataset_split (str)
hf_dataset_text_field (str)
hf_streaming (bool)
pack_sequences (bool)
datasets (list[DatasetSource])
mix_temperature (float)
phases (list[TrainingPhase])
anneal_start_step (int)
- Return type:
None
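This reference does not state how mix_temperature is applied to the dataset weights. One common convention in multi-dataset mixing is to raise each weight to the power 1/T before normalizing, so that T = 1.0 (the default here) leaves the mixture unchanged and larger T flattens it. The sketch below illustrates that convention only; temperature_scaled_probs is a hypothetical helper, and whether kempnerforge uses this formula is an assumption.

```python
def temperature_scaled_probs(weights, temperature=1.0):
    """Raise each weight to 1/temperature, then normalize to sum to 1.

    Assumed convention, not confirmed by the kempnerforge docs.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [w ** (1.0 / temperature) for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]


weights = [9.0, 1.0]
unchanged = temperature_scaled_probs(weights, temperature=1.0)  # same ratio as the weights
flatter = temperature_scaled_probs(weights, temperature=2.0)    # closer to uniform
```

With these inputs, a temperature of 2.0 moves the 9:1 mixture toward 3:1, which shows the flattening effect of a temperature above 1.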