Running Training

Training is driven entirely by config.yaml. The pipeline has three steps (STEP 0 to STEP 2), run end to end by scripts/train.sh, or by scripts/start.sh, which wraps the same pipeline with a dry run and run archiving.

The pipeline

STEP 0  Data conversion        raw trajectories → IM JSONL → LF JSON
STEP 1  Dataset registration   LF JSON → dataset_info.json
STEP 2  Train                  generate YAML → python -m llamafactory.cli train

STEP 0 converts the source job_dir for the configured scaffold into LF data under artifacts/data/lf_data/. See Data Pipeline.
STEP 1 registers the LF file in LLaMA-Factory's dataset_info.json under the dataset key derived from conversion.data_name.
STEP 2 generates the LLaMA-Factory training YAML from runtime_info.input.training and launches it with python -m llamafactory.cli train, which runs through torchrun.

Launch

The recommended entry point runs the dry run first, then the pipeline, and archives the run on exit:

bash scripts/start.sh

To run the pipeline without the archive wrapper:

bash scripts/train.sh

Both read all runtime config from config.yaml → runtime_info.input and assume you are already on a node with the GPUs available.

Run inside tmux

Training runs for a long time. Start it inside a tmux session on the GPU node so it survives shell disconnects, and watch progress through the dashboard.
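
A minimal sketch of the tmux workflow described above. The session name and the two helper function names are hypothetical; only scripts/start.sh comes from this page.

```shell
# Hypothetical session name; override by exporting SESSION_NAME.
SESSION_NAME="${SESSION_NAME:-train}"

# Start a detached tmux session that runs the recommended entry point.
# The session keeps running if your SSH connection drops.
start_training_session() {
  tmux new-session -d -s "$SESSION_NAME" 'bash scripts/start.sh'
}

# Reattach to check on the run; detach again with Ctrl-b d.
attach_training_session() {
  tmux attach-session -t "$SESSION_NAME"
}
```

Run start_training_session once on the GPU node, then attach_training_session whenever you reconnect.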

How it launches

scripts/train.sh launches LLaMA-Factory through its torchrun path on the current node, with NPROC_PER_NODE set from infrastructure.n_gpus_per_node. FORCE_TORCHRUN=1 is set so distributed training is enabled even for a single process group. The block currently targets single-node launches only, so keep n_gpus_per_node aligned with the number of GPUs actually visible on the node.
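
One way to check that alignment before launching is sketched below. It assumes the value of infrastructure.n_gpus_per_node has already been read from config.yaml and is passed in as an argument. The function names are hypothetical and are not part of scripts/train.sh.

```shell
# Count GPUs visible to this shell.
# Uses CUDA_VISIBLE_DEVICES when it is set and non-empty; otherwise
# falls back to `nvidia-smi -L`, which prints one "GPU N: ..." line per GPU.
visible_gpu_count() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  else
    nvidia-smi -L | grep -c '^GPU '
  fi
}

# Return non-zero when the configured GPU count (first argument)
# disagrees with what is actually visible.
check_gpu_count() {
  gpu_expected="$1"
  gpu_actual="$(visible_gpu_count)"
  if [ "$gpu_expected" -ne "$gpu_actual" ]; then
    echo "n_gpus_per_node=$gpu_expected but $gpu_actual GPU(s) visible" >&2
    return 1
  fi
}
```

For example, check_gpu_count 8 succeeds only when exactly eight GPUs are visible, which is the count NPROC_PER_NODE would then be given.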

Parallelism uses DeepSpeed ZeRO-3 (artifacts/training_config/deepspeed/ds_z3_config.json), which shards parameters, gradients, and optimizer states across GPUs. It is required for 30B+ models and safe for smaller ones, so it is the default for all 8-GPU runs.

Generated, not checked in

The executed LLaMA-Factory YAML is generated at runtime from config.yaml, not stored as a checked-in file. This keeps the block configuration and the executed training config aligned and prevents a stale YAML from drifting away from the recorded inputs. After a successful run, config.yaml → runtime_info.output is rewritten in place with the results — see Results & Artifacts.
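
To make the idea of a generated YAML concrete, the sketch below writes a minimal LLaMA-Factory training file from shell arguments. It is an illustration only: the helper name, the placeholder paths, and the chosen values are assumptions, and the real generator reads its values from runtime_info.input.training rather than from arguments.

```shell
# Hypothetical helper: write a minimal LLaMA-Factory training YAML.
# $1 = output path, $2 = dataset key registered in dataset_info.json.
# Every value except the DeepSpeed path is a placeholder.
write_train_yaml() {
  yaml_out="$1"
  yaml_dataset="$2"
  cat > "$yaml_out" <<EOF
stage: sft
do_train: true
model_name_or_path: /path/to/base-model
dataset: ${yaml_dataset}
template: default
finetuning_type: full
deepspeed: artifacts/training_config/deepspeed/ds_z3_config.json
output_dir: /path/to/output
EOF
}

# A launch would then look roughly like this (not executed here):
#   FORCE_TORCHRUN=1 NPROC_PER_NODE=8 python -m llamafactory.cli train train.yaml
```

Because the file is rebuilt on every run, editing a generated copy by hand has no lasting effect; change config.yaml instead.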

Data-only and cleanup

scripts/dataprep.sh — run STEP 0 only, to prepare and inspect data without training.
scripts/clean.sh — remove script-local __pycache__. Artifact deletion requires --artifacts --yes; repo-cache cleanup requires --repo-cache --yes.
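
The two-flag requirement on destructive cleanup can be pictured with the sketch below. It is not the actual scripts/clean.sh: the function name is hypothetical, and it only prints which targets would be removed instead of deleting anything.

```shell
# Hypothetical sketch of the confirmation guard.
# Prints the cleanup targets that the given flags would select.
plan_clean() {
  want_artifacts=0
  want_repo_cache=0
  confirmed=0
  for arg in "$@"; do
    case "$arg" in
      --artifacts)  want_artifacts=1 ;;
      --repo-cache) want_repo_cache=1 ;;
      --yes)        confirmed=1 ;;
      *) echo "unknown option: $arg" >&2; return 2 ;;
    esac
  done
  # Destructive targets are refused unless --yes is also present.
  if [ "$confirmed" -eq 0 ] && [ $((want_artifacts + want_repo_cache)) -gt 0 ]; then
    echo "destructive options require --yes" >&2
    return 1
  fi
  echo "pycache"   # always cleaned, no confirmation needed
  if [ "$want_artifacts" -eq 1 ]; then echo "artifacts"; fi
  if [ "$want_repo_cache" -eq 1 ]; then echo "repo-cache"; fi
  return 0
}
```

With this guard, plan_clean --artifacts alone fails, while plan_clean --artifacts --yes lists both pycache and artifacts.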
