Deterministic JSON Lines export and simple train / test row splits (Phase 2).
This module does not implement tokenizers or model-specific chat templates. Callers align exported text with their trainer’s expected fields.
Functions
- `dataset_to_jsonl`: Serialize each row as one JSON object per line (UTF-8), columns in `column_order`.
- `filter_rows_max_utf8_chars`: Keep only rows whose UTF-8 value in `column` has at most `max_chars` Unicode scalars; other rows are dropped.
- `train_test_row_indices`: Deterministic split: the first `train_count` rows are train, the remaining rows are test, where `train_count = row_count - round(row_count * test_fraction.clamp(0..=1))`.
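The split formula and the scalar-count filter described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the module's actual code: the names `split_indices` and `keep_rows_max_chars`, their signatures, and the use of plain string slices for rows are all hypothetical, and rounding is assumed to follow `f64::round` (halves round away from zero).

```rust
/// Hypothetical helper that mirrors the documented formula:
/// train_count = row_count - round(row_count * clamp(test_fraction, 0, 1)).
/// Returns (train indices, test indices); no shuffling, so the result is deterministic.
fn split_indices(row_count: usize, test_fraction: f64) -> (Vec<usize>, Vec<usize>) {
    // Clamp the fraction into [0, 1] so out-of-range inputs cannot underflow the subtraction.
    let frac = test_fraction.clamp(0.0, 1.0);
    // Assumption: rounding uses f64::round (half away from zero).
    let test_count = (row_count as f64 * frac).round() as usize;
    let train_count = row_count - test_count;
    // The first train_count rows are train; the rest are test.
    ((0..train_count).collect(), (train_count..row_count).collect())
}

/// Hypothetical helper: keep only values with at most `max_chars` Unicode scalar values.
/// `str::chars` iterates over scalars, so the limit counts scalars rather than bytes.
fn keep_rows_max_chars<'a>(rows: &[&'a str], max_chars: usize) -> Vec<&'a str> {
    rows.iter()
        .copied()
        .filter(|s| s.chars().count() <= max_chars)
        .collect()
}
```

Under these assumptions, `split_indices(10, 0.2)` puts rows 0 through 7 in train and rows 8 and 9 in test, and `keep_rows_max_chars(&["héllo", "abcdef"], 5)` keeps only `"héllo"`, which is five scalars even though it occupies six bytes in UTF-8.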