sft

Getting Started

Set up the block and run your first training job

This page walks through preparing the sft block and running a training job end to end. sft wraps a pinned LLaMA-Factory checkout and the swe_data_process converter package, so most setup is about getting those into a single uv environment.

Prerequisites

  • 8× A100/H100 GPUs — full fine-tuning with DeepSpeed ZeRO-3 targets a single 8-GPU node.
  • uv — used to build and run the training environment.
  • CUDA 12.8 toolchain available on the node (the installer pulls CUDA 12.8 PyTorch wheels).
  • Raw trajectories from trajgen — a Harbor job_dir of agent rollouts (see Core Concepts).
  • A base model — a local path to the model to fine-tune (e.g. Qwen3-8B).
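
Before going further, it can help to confirm that the tools listed above are actually visible on the node. The following is a minimal sketch, not part of the sft scripts: check_tool and gpu_count are hypothetical helper names, the check for nvcc assumes the CUDA toolchain puts it on PATH, and the GPU count assumes nvidia-smi -L prints one line per GPU (which may not hold with MIG enabled).

```shell
# Hypothetical helpers for a quick prerequisite check; not shipped with sft.

# Print whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Count GPUs, assuming `nvidia-smi -L` lists one GPU per line.
gpu_count() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L | wc -l
    else
        echo 0
    fi
}

# Assumption: nvcc on PATH is used here as a proxy for "CUDA toolchain available".
for tool in uv nvidia-smi nvcc; do
    check_tool "$tool"
done
echo "GPUs visible: $(gpu_count)"
```

This only reports what the shell can see; the dry run described later remains the authoritative preflight.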

Where to run

sft trains on the node declared in config.yaml (meta_info.resources.ip). Run its scripts inside a tmux session on that host so a long training job survives shell disconnects.
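
A minimal sketch of that workflow, assuming tmux is installed on the host. The session name sft is an arbitrary choice, and start_or_attach is a hypothetical helper rather than one of the block's scripts.

```shell
# Assumption: "sft" is just an example session name; any name works.
SESSION="sft"

# Attach to the session if it already exists, otherwise create it.
start_or_attach() {
    if tmux has-session -t "$SESSION" 2>/dev/null; then
        tmux attach-session -t "$SESSION"
    else
        tmux new-session -s "$SESSION"
    fi
}
```

Call start_or_attach, run the scripts from inside the session, and detach with Ctrl-b d (the default tmux binding). Calling start_or_attach again later reattaches to the same session.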

1. Build the environment

Create the uv training environment at meta_info.environment.sft_uv (default artifacts/env/lf):

bash scripts/install_env.sh

This installs repos/swe_data_process[llm], the CUDA 12.8 PyTorch 2.8.0 wheels, repos/LLaMA-Factory[torch,metrics,deepspeed,liger-kernel] (with --no-build-isolation), the pinned flash-attn 2.8.3 wheel, and wandb. See Inputs & Outputs for the env layout.
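
After the install finishes, a quick import check can confirm the environment is usable. This is a sketch under assumptions: it assumes the uv environment has the standard virtualenv layout with a bin/python interpreter, and that the packages import as torch and flash_attn. check_env is a hypothetical helper, not part of install_env.sh.

```shell
# Default env location from the text above; override ENV_DIR if yours differs.
ENV_DIR="${ENV_DIR:-artifacts/env/lf}"

# Hypothetical helper: print the installed PyTorch, CUDA, and flash-attn versions.
# Assumption: the environment exposes its interpreter at <env>/bin/python.
check_env() {
    py="$1/bin/python"
    if [ ! -x "$py" ]; then
        echo "no python interpreter at $py" >&2
        return 1
    fi
    "$py" -c 'import torch, flash_attn; print(torch.__version__, torch.version.cuda, flash_attn.__version__)'
}
```

Run check_env "$ENV_DIR" from the block's root directory; the printed versions should line up with the ones listed above.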

2. Point the config at your inputs

Edit the runtime_info.input section of config.yaml:

  • source.type — the dataset source: harbor_job (convert Harbor trajectories, the default), hf_lf (a ready-made LF dataset on the HuggingFace Hub), or local_lf (a local LF json). See Input sources.
  • For harbor_job: source.scaffold + source.job_dir — the trajectory scaffold and the Harbor job directory to convert. For hf_lf: source.hf_hub_url. For local_lf: source.lf_path.
  • conversion.data_name — the dataset name (the LF file and registered dataset key derive from it).
  • model.model_name_or_path — the local base-model path.
  • training.* — hyperparameters (template, cutoff length, batch size, learning rate, epochs, …).
  • infrastructure.n_gpus_per_node — the GPU count on this node.

See Configuration for what each field controls.
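
After editing, one quick way to review what you set is to print only the lines that carry these fields. This is a rough sketch: show_inputs is a hypothetical helper, and it assumes each dotted field above ends in a leaf key that appears literally in the YAML (for example, source.job_dir as a job_dir: line).

```shell
# Hypothetical helper: list the config lines that set the input fields.
# Assumption: the leaf key names (type, job_dir, ...) appear literally in the file.
show_inputs() {
    grep -nE '^[[:space:]]*(type|scaffold|job_dir|hf_hub_url|lf_path|data_name|model_name_or_path|n_gpus_per_node):' "$1"
}
```

Usage: show_inputs config.yaml. The pattern matches those key names anywhere in the file, so treat the output as a reading aid rather than a validator.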

3. Validate the config

Run the dry-run preflight. It checks the pinned repos, the uv environment, the converter modules, the source job dir, the model path, the output directory, and the GPU count — without side effects:

bash scripts/dryrun.sh

Fix anything it reports before launching a run.

4. (Optional) Prepare data only

To convert and inspect the dataset before committing to a training run, run the data-only pipeline:

bash scripts/dataprep.sh

This runs STEP 0 (conversion) and writes the LF dataset under artifacts/data/lf_data/ without registering it or training. See Data Pipeline.
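
Once the conversion finishes, a simple sanity check is to list the output directory and count the converted examples. The sketch below assumes the dataset file is a single top-level JSON array and that python3 is on PATH. count_records is a hypothetical helper, and the file name depends on conversion.data_name, so take it from the directory listing.

```shell
# Hypothetical helper: count the records in a dataset file.
# Assumption: the file holds one top-level JSON array.
count_records() {
    python3 -c 'import json, sys
with open(sys.argv[1]) as f:
    print(len(json.load(f)))' "$1"
}

# List what the conversion wrote (prints nothing if the directory is absent).
ls -lh artifacts/data/lf_data/ 2>/dev/null || true
```

Pass the dataset file from the listing to count_records to see how many examples were produced.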

5. Launch a run

scripts/start.sh runs the dry run, then the full pipeline (conversion → dataset registration → training), and archives the run on exit:

bash scripts/start.sh

To run the pipeline directly without the archive wrapper, use bash scripts/train.sh. See Training for what each step does.

6. Inspect results

Training writes to:

artifacts/model/<run>/      # checkpoints, trainer_log.jsonl, trainer_state.json, *_results.json
artifacts/logs/<run>_<ts>.log   # console log

After a successful run, the runtime_info.output section of config.yaml is updated with the checkpoint path, metrics, and artifact paths. For a live view, open the dashboard.
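
To check on a run from the shell, the sketch below picks the newest console log and prints its tail. It assumes console logs end in .log under artifacts/logs, as in the layout above; latest_log is a hypothetical helper.

```shell
# Hypothetical helper: print the most recently modified .log file in a directory.
latest_log() {
    ls -t "$1"/*.log 2>/dev/null | head -n 1
}

log="$(latest_log artifacts/logs)"
if [ -n "$log" ]; then
    tail -n 20 "$log"
else
    echo "no logs under artifacts/logs yet"
fi
```

The same approach works for the per-step training log: tail -n 1 artifacts/model/<run>/trainer_log.jsonl prints the latest record, with <run> replaced by your run name.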

Operating with the agent plugin

If you operate the block through its Claude plugin, the same lifecycle maps to slash commands:

/root:check sft     # preflight: config, repos, env, source data, GPUs
/sft:setup           # build the uv environment and validate the config
/root:run sft       # execute scripts/start.sh and archive
/sft:run             # sft-specific run procedure + post-run bookkeeping
/sft:dashboard       # launch / summarize the training dashboard
