Running Training
Launch the end-to-end SFT pipeline
Training is driven entirely by config.yaml. The pipeline has three steps, run end to end by scripts/train.sh (or wrapped with archiving by scripts/start.sh).
The pipeline
    STEP 0  Data conversion       raw trajectories → IM JSONL → LF JSON
    STEP 1  Dataset registration  LF JSON → dataset_info.json
    STEP 2  Train                 generate YAML → python -m llamafactory.cli train

- STEP 0 converts the source job_dir for the configured scaffold into LF data under artifacts/data/lf_data/. See Data Pipeline.
- STEP 1 registers the LF file in LLaMA-Factory's dataset_info.json under the dataset key derived from conversion.data_name.
- STEP 2 generates the LLaMA-Factory training YAML from runtime_info.input.training and launches it through torchrun.
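As an illustration of what STEP 1 amounts to, the sketch below adds one entry to a LLaMA-Factory dataset_info.json while preserving any datasets already registered. It is not the pipeline's own code: the function name register_dataset is hypothetical, the dataset key is taken as a parameter because the derivation from conversion.data_name is not shown here, and the sharegpt format with a messages column is an assumption about the LF JSON layout.

```python
import json
from pathlib import Path


def register_dataset(dataset_info_path: Path, dataset_key: str, lf_file_name: str) -> dict:
    """Add or update one entry in a LLaMA-Factory dataset_info.json file."""
    # Load the existing registry if present so other datasets are preserved.
    if dataset_info_path.exists():
        registry = json.loads(dataset_info_path.read_text(encoding="utf-8"))
    else:
        registry = {}

    # Assumption: the LF JSON uses the sharegpt format with a "messages" column.
    registry[dataset_key] = {
        "file_name": lf_file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "messages"},
    }

    dataset_info_path.write_text(json.dumps(registry, indent=2) + "\n", encoding="utf-8")
    return registry[dataset_key]
```

Registering the same key twice overwrites the earlier entry, which keeps reruns idempotent.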
Launch
The recommended entry point runs the dry run first, then the pipeline, and archives the run on exit:
    bash scripts/start.sh

To run the pipeline without the archive wrapper:

    bash scripts/train.sh

Both read all runtime config from config.yaml → runtime_info.input and assume you are already on a node with the GPUs available.
Run inside tmux
Training runs for a long time. Start it inside a tmux session on the GPU node so it survives shell disconnects, and watch progress through the dashboard.
How it launches
scripts/train.sh launches LLaMA-Factory through its torchrun path on the current node, with NPROC_PER_NODE wired from infrastructure.n_gpus_per_node. FORCE_TORCHRUN=1 is set so distributed training is enabled even for a single process group. The block currently targets single-node launches; keep n_gpus_per_node aligned with the actual visible GPU count.
Parallelism uses DeepSpeed ZeRO-3 (artifacts/training_config/deepspeed/ds_z3_config.json) — required for 30B+ models and safe for smaller ones, so it is the default for all 8-GPU runs.
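The single-node launch described above can be sketched in Python as follows. This is a minimal illustration, not the contents of scripts/train.sh: build_launch is a hypothetical helper, and the visible GPU count is passed in by the caller (one way to obtain it would be torch.cuda.device_count()). FORCE_TORCHRUN and NPROC_PER_NODE are the environment variables named in this section.

```python
import os
import sys


def build_launch(n_gpus_per_node: int, train_yaml: str, visible_gpus: int) -> tuple[dict, list]:
    """Return (env, argv) for a single-node LLaMA-Factory torchrun launch."""
    # Fail early if the configured count disagrees with what the node exposes.
    if n_gpus_per_node != visible_gpus:
        raise ValueError(
            f"n_gpus_per_node={n_gpus_per_node} but {visible_gpus} GPUs are visible"
        )

    env = dict(os.environ)
    env["FORCE_TORCHRUN"] = "1"                   # take the torchrun path
    env["NPROC_PER_NODE"] = str(n_gpus_per_node)  # one worker process per GPU

    argv = [sys.executable, "-m", "llamafactory.cli", "train", train_yaml]
    return env, argv
```

The returned pair can be handed to subprocess.run(argv, env=env, check=True). Checking the GPU count before launching turns a misconfigured n_gpus_per_node into an immediate error instead of a failure partway through distributed startup.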
Generated, not checked in
The executed LLaMA-Factory YAML is generated at runtime from config.yaml, not stored as a checked-in file. This keeps the block configuration and the executed training config aligned and prevents a stale YAML from drifting away from the recorded inputs. After a successful run, config.yaml → runtime_info.output is rewritten in place with the results — see Results & Artifacts.
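To make the generate-at-runtime idea concrete, here is a small sketch that renders a flat mapping of training settings as YAML text. It assumes runtime_info.input.training is a flat mapping of scalars, which may not match the real schema. render_training_yaml and its defaults are hypothetical, and the actual pipeline most likely uses a YAML library rather than hand-written output.

```python
import json


def render_training_yaml(training: dict, dataset_key: str, deepspeed_config: str) -> str:
    """Render a flat mapping of training settings as YAML text."""
    merged = {
        "stage": "sft",
        "do_train": True,
        "dataset": dataset_key,
        "deepspeed": deepspeed_config,
        **training,  # values from config.yaml win over the defaults above
    }
    lines = []
    for key, value in merged.items():
        if isinstance(value, (dict, list)):
            raise TypeError(f"{key}: nested values are outside this sketch")
        if isinstance(value, float):
            text = repr(value)
            mantissa, sep, exponent = text.partition("e")
            # YAML 1.1 parsers such as PyYAML read "1e-05" as a string,
            # so keep a dot in the mantissa of exponent-form floats.
            if sep and "." not in mantissa:
                text = f"{mantissa}.0e{exponent}"
        else:
            # JSON scalars are valid YAML scalars, so json.dumps handles quoting.
            text = json.dumps(value)
        lines.append(f"{key}: {text}")
    return "\n".join(lines) + "\n"
```

Because the text is rebuilt from the recorded inputs on every run, there is no second copy of the settings that could go stale.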
Data-only and cleanup
- scripts/dataprep.sh: run STEP 0 only, to prepare and inspect data without training.
- scripts/clean.sh: remove script-local __pycache__. Artifact deletion requires --artifacts --yes; repo-cache cleanup requires --repo-cache --yes.