DataSet
mappingRunnable Python (not “high level only”): docs/python/SFT_PYTHON_EXAMPLES.md
— full scripts for: in-repo examples/sft/sample_alpaca.ndjson
(Alpaca columns), optional
datasets.load_dataset("tatsu-lab/alpaca", …),
databricks/databricks-dolly-15k (§4 in
that doc), a chat messages column as JSON
text, then profile / validate /
export_dataset_jsonl. Those examples are
also indexed from examples/sft/README.md
and included in the pdoc HTML site via
rust_data_processing.examples (see
python-wrapper/rust_data_processing/examples.py).
This library stays format-agnostic: you model
columns with Schema
and export with export::dataset_to_jsonl
or Parquet/CSV ingest paths.
messages (JSON string or
separate role / content columns if
flattened).messages as a JSON array string or nested object per your
trainer (TRL / HF Datasets accept dict columns).instruction, input, output)system.TransformSpec
rename/cast/select to match trainer template.dataset_to_jsonl with an explicit
column_order for stable diffs.export::train_test_row_indices
(deterministic tail holdout) or your own shuffle policy.Chat templates and tokenizers are the trainer’s
responsibility. This repo exports raw UTF-8
fields only; align columns with your target model’s template
before calling Trainer /
SFTTrainer.