Supervised fine-tuning (SFT) — common file shapes and `DataSet` mapping

Supervised fine-tuning (SFT) — common file shapes and DataSet mapping

Runnable Python (not “high level only”): docs/python/SFT_PYTHON_EXAMPLES.md — full scripts for: in-repo examples/sft/sample_alpaca.ndjson (Alpaca columns), optional datasets.load_dataset("tatsu-lab/alpaca", …), databricks/databricks-dolly-15k (§4 in that doc), a chat messages column as JSON text, then profile / validate / export_dataset_jsonl. Those examples are also indexed from examples/sft/README.md and included in the pdoc HTML site via rust_data_processing.examples (see python-wrapper/rust_data_processing/examples.py).

This library stays format-agnostic: you model columns with Schema and export with export::dataset_to_jsonl or Parquet/CSV ingest paths.

Chat / messages (ShareGPT-style)

Alpaca (instruction, input, output)

JSONL export

Bold warning (trainers)

Chat templates and tokenizers are the trainer’s responsibility. This repo exports raw UTF-8 fields only; align columns with your target model’s template before calling Trainer / SFTTrainer.