Core Concepts

sft has the following core concepts:

Trajectory

A trajectory is the complete, replayable log of one agent rollout — the raw material sft trains on. trajgen produces them per task under a Harbor job_dir:

artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl

sft does not author trajectories — it consumes a job_dir of them, wired from the upstream trajgen block via meta_info.dependencies.

A scaffold is the agent harness that produced a trajectory — Claude Code, OpenCode, OpenHands SDK, or Terminus-2. The scaffold determines the raw trajectory shape, which selects the swe_data_process converter used to turn it into training data. See Scaffolds.

IM format (intermediate)

IM is the intermediate representation: OpenAI-style messages with tool_calls, one JSONL row per trajectory, in PangUML v2 shape. Per-instance metadata (_instance_id, _agent_type, _score) is stored under meta_info.unique_info. Conversion to IM auto-invokes rule scoring, so every IM row carries a quality score. See Scoring.

LF format (LLaMA-Factory)

LF is the training-ready representation: ShareGPT-style messages as a JSON array, written to artifacts/data/lf_data/<dataset>.json. This is what LLaMA-Factory loads. The LF file and its dataset_info.json mapping live together so the generated training YAML can point dataset_dir at them.

Dataset registration

Before training, the LF file is registered in LLaMA-Factory's dataset_info.json under a dataset name (derived from conversion.data_name). Registration is what lets the training YAML reference the data by key. This is STEP 1 of scripts/train.sh.

Run

A run is one execution of the pipeline driven by a single config.yaml: convert → register → train. Each run writes its checkpoints and logs to artifacts/model/<run>/, where <run> is training.output_dir (a relative value resolves to artifacts/model/<basename>). Run history is appended to artifacts/index.yaml by scripts/archive_run.sh.

Checkpoint

A checkpoint is a saved model state under artifacts/model/<run>/checkpoint-<step>/, written every save_steps. The latest checkpoint path is recorded in config.yaml → runtime_info.output.checkpoint_path and is the block's primary handoff to downstream consumers (rl, eval).

Repo exclusion filter

artifacts/data/excluded_repos.txt lists owner/repo entries from the SWE-bench benchmark sets (Verified, Pro, Multilingual). All converters filter these out by default so eval repos never leak into training data. Regenerate it with scripts/generate_excluded_repos.py; disable per-experiment with --exclude-repos-file "".

Core Concepts

On this page