Module export


Deterministic JSON Lines export and simple train / test row splits (Phase 2).

This module does not implement tokenizers or model-specific chat templates. Callers align exported text with their trainer’s expected fields.

Functions

dataset_to_jsonl
Serialize each row as one JSON object per line (UTF-8), columns in column_order.
filter_rows_max_utf8_chars
Keep only rows whose UTF-8 value in column has at most max_chars Unicode scalar values; all other rows are dropped.
train_test_row_indices
Deterministic split: first train_count rows are train, remaining rows are test, where train_count = row_count - round(row_count * test_fraction.clamp(0..=1)).
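
As an illustration of the one-object-per-line format described for dataset_to_jsonl, here is a minimal sketch that uses only the standard library. The signature, the `Row` alias, the `json_string` helper, the all-string values, and the choice to emit `null` for a missing column are assumptions made for this example, not the module's actual API.

```rust
use std::collections::BTreeMap;

// Hypothetical row representation: column name -> string value.
type Row = BTreeMap<String, String>;

// Hypothetical helper: encode a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

// Sketch only: one JSON object per line, keys emitted in `column_order`.
// Assumption: a column missing from a row is written as null.
pub fn dataset_to_jsonl(rows: &[Row], column_order: &[&str]) -> String {
    let mut out = String::new();
    for row in rows {
        out.push('{');
        for (i, column) in column_order.iter().enumerate() {
            if i > 0 {
                out.push(',');
            }
            out.push_str(&json_string(column));
            out.push(':');
            match row.get(*column) {
                Some(value) => out.push_str(&json_string(value)),
                None => out.push_str("null"),
            }
        }
        out.push_str("}\n");
    }
    out
}
```

Writing keys in a caller-supplied order, rather than map iteration order, is what keeps the output byte-for-byte reproducible across runs.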
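
The length filter counts Unicode scalar values rather than bytes, which matters for non-ASCII text. The sketch below shows one way this could look; the signature, the `Row` alias, and the decision to drop rows that lack the column entirely are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

// Hypothetical row representation: column name -> string value.
type Row = BTreeMap<String, String>;

// Sketch only: keep rows whose value in `column` has at most `max_chars`
// Unicode scalar values. `str::chars()` iterates scalar values, so a
// multi-byte character such as 'é' counts once.
// Assumption: rows without the column are dropped.
pub fn filter_rows_max_utf8_chars(rows: Vec<Row>, column: &str, max_chars: usize) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| match row.get(column) {
            Some(value) => value.chars().count() <= max_chars,
            None => false,
        })
        .collect()
}
```

For example, "héllo" occupies six bytes in UTF-8 but counts as five scalar values, so it passes a limit of 5.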
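
The split formula can be sketched as follows. The signature and return type are hypothetical, and treating a NaN fraction as 0 is an assumption of this example. Note that `f64::round` rounds halfway cases away from zero, so 10 rows with a test fraction of 0.25 would give 3 test rows under this sketch.

```rust
// Sketch only: returns (train_indices, test_indices).
// train_count = row_count - round(row_count * clamp(test_fraction, 0, 1)).
pub fn train_test_row_indices(row_count: usize, test_fraction: f64) -> (Vec<usize>, Vec<usize>) {
    // Assumption: NaN is treated as 0 so the cast below is well defined.
    let fraction = if test_fraction.is_nan() {
        0.0
    } else {
        test_fraction.clamp(0.0, 1.0)
    };
    let test_count = ((row_count as f64) * fraction).round() as usize;
    // Guard against any floating-point overshoot before subtracting.
    let train_count = row_count - test_count.min(row_count);
    let train: Vec<usize> = (0..train_count).collect();
    let test: Vec<usize> = (train_count..row_count).collect();
    (train, test)
}
```

Because there is no shuffling, callers who need a random split would have to permute the rows themselves before exporting.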