Conversion Pipeline

How raw trajectories become LLaMA-Factory datasets

The data pipeline converts raw agent trajectories into a training-ready LLaMA-Factory dataset. It is STEP 0 of scripts/train.sh and the entirety of scripts/dataprep.sh.

Input sources

The runtime_info.input.source.type field in config.yaml selects where the LF dataset comes from. Conversion (the flow below) runs only for harbor_job; the other two types skip it and feed a ready-made LF/ShareGPT dataset straight into training.

| source.type | Input | Conversion | Required fields |
| --- | --- | --- | --- |
| harbor_job (default) | Raw Harbor trajectories | Runs STEP 0 | source.scaffold, source.job_dir |
| hf_lf | An LF/ShareGPT dataset on the HuggingFace Hub | Skipped | source.hf_hub_url (+ optional hf_subset, hf_split) |
| local_lf | An existing local LF/ShareGPT json | Skipped | source.lf_path |
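The required fields per source type can be checked before a run. The following is a minimal sketch, assuming the runtime_info.input.source block has already been parsed into a plain dict; the helper names `missing_source_fields` and `needs_conversion` are hypothetical and not part of the pipeline.

```python
# Required fields per source.type, copied from the table above.
REQUIRED_FIELDS = {
    "harbor_job": ("scaffold", "job_dir"),
    "hf_lf": ("hf_hub_url",),
    "local_lf": ("lf_path",),
}


def missing_source_fields(source: dict) -> list[str]:
    """Return the required fields that are absent or empty in `source`.

    `source` stands for the parsed runtime_info.input.source mapping.
    Hypothetical helper for illustration only.
    """
    # harbor_job is the documented default when no type is given.
    source_type = source.get("type", "harbor_job")
    if source_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown source.type: {source_type!r}")
    return [name for name in REQUIRED_FIELDS[source_type] if not source.get(name)]


def needs_conversion(source: dict) -> bool:
    """Only harbor_job goes through conversion; hf_lf and local_lf skip it."""
    return source.get("type", "harbor_job") == "harbor_job"
```

An empty list from `missing_source_fields` means the block carries everything the table lists for that type; it says nothing about whether the paths or Hub URLs actually resolve.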

For hf_lf, the dataset is registered with LLaMA-Factory's native hf_hub_url and pulled from the Hub at train time; set credentials.hf_token for private datasets. For local_lf, the json at source.lf_path is registered as-is. In both cases conversion.data_name is still used as the registered dataset key, and conversion.max_instances (if > 0) becomes the dataset's num_samples (a random subsample).
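To make the registration described above concrete, here is a sketch of what the resulting dataset_info.json entry might look like. The keys `hf_hub_url`, `file_name`, `subset`, `split`, `formatting`, and `num_samples` are LLaMA-Factory's own dataset_info.json keys; how the pipeline maps source fields onto them is an assumption, and `build_dataset_entry` is a hypothetical helper.

```python
def build_dataset_entry(source: dict, conversion: dict) -> dict:
    """Sketch of a dataset_info.json entry for an hf_lf or local_lf source.

    Assumption: hf_subset/hf_split map onto LLaMA-Factory's subset/split keys,
    and lf_path is passed through unchanged as file_name.
    """
    source_type = source["type"]
    if source_type == "hf_lf":
        entry = {"hf_hub_url": source["hf_hub_url"]}
        if source.get("hf_subset"):
            entry["subset"] = source["hf_subset"]
        if source.get("hf_split"):
            entry["split"] = source["hf_split"]
    elif source_type == "local_lf":
        entry = {"file_name": source["lf_path"]}
    else:
        raise ValueError(f"{source_type!r} is converted first, not registered directly")

    entry["formatting"] = "sharegpt"

    # max_instances > 0 becomes num_samples; 0 (or unset) leaves the dataset uncapped.
    max_instances = conversion.get("max_instances", 0)
    if max_instances > 0:
        entry["num_samples"] = max_instances

    # conversion.data_name is the registered dataset key.
    return {conversion["data_name"]: entry}
```

For example, `build_dataset_entry({"type": "hf_lf", "hf_hub_url": "org/ds"}, {"data_name": "my_sft", "max_instances": 100})` yields a single entry keyed `my_sft` with `num_samples` set to 100.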

Scoring and repo filtering only apply to harbor_job

Quality scoring and eval-repo exclusion happen during conversion. An hf_lf / local_lf dataset is consumed as-is, so make sure it has already been filtered upstream and contains no SWE-bench benchmark repos.

The flow

Harbor job trajectories (job_dir, per-scaffold)
    └─▶  python -m swe_data_process.<subpackage>.convert_*_to_im
              └─▶  Intermediate "IM" format (PangUML v2 JSONL, score in meta_info.unique_info)
                        └─▶  rule_score.py (auto-invoked) + optional llm_score.py
                                  └─▶  LLaMA-Factory "LF" format (ShareGPT JSON)
                                            └─▶  artifacts/data/lf_data/<dataset>.json

Conversion is invoked as python -m swe_data_process.<subpackage>... with PYTHONPATH=repos/swe_data_process/src, using the uv environment at meta_info.environment.sft_uv. The converter is selected from source.scaffold — see Scaffolds.
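The invocation described above can be sketched as a command plus environment. This is an illustration only: the module name is supplied by the caller (the real one depends on source.scaffold), no converter arguments are shown because they are not documented here, and `sys.executable` stands in for the interpreter of the uv environment.

```python
import os
import sys

# Source root that must be on PYTHONPATH, as stated above.
SRC_DIR = "repos/swe_data_process/src"


def converter_invocation(module: str, base_env=None):
    """Return (cmd, env) for running a converter module with `python -m`.

    Hypothetical helper: `module` is whatever converter the scaffold selects.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH")
    # Prepend the source root so it wins over anything already on the path.
    env["PYTHONPATH"] = SRC_DIR if not existing else SRC_DIR + os.pathsep + existing
    cmd = [sys.executable, "-m", module]
    return cmd, env
```

The pair can then be handed to `subprocess.run(cmd, env=env, check=True)` from the repository root, since SRC_DIR is a relative path.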

Run it standalone

scripts/dataprep.sh runs conversion only — no dataset registration, no training — so you can inspect and validate the data before committing GPUs to a run:

bash scripts/dataprep.sh

It reads runtime_info.input.source and runtime_info.input.conversion from config.yaml and writes the IM JSONL and LF JSON under artifacts/data/. For source.type of hf_lf or local_lf there is nothing to convert, so dataprep.sh exits early — run scripts/train.sh directly to register the dataset and train.
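Once dataprep.sh has written the LF JSON, a quick structural check helps before committing GPUs. The sketch below assumes the file uses LLaMA-Factory's default ShareGPT layout (a JSON list of records, each with a `conversations` list whose turns carry `from` and `value`); if the converter emits different column or tag names, adjust accordingly. Both function names are hypothetical.

```python
import json
from collections import Counter


def summarize_sharegpt(records: list) -> dict:
    """Count records, records without a usable conversation, and turns per role."""
    roles = Counter()
    malformed = 0
    for record in records:
        turns = record.get("conversations") if isinstance(record, dict) else None
        if not isinstance(turns, list) or not turns:
            malformed += 1
            continue
        for turn in turns:
            role = turn.get("from", "<missing>") if isinstance(turn, dict) else "<missing>"
            roles[role] += 1
    return {"records": len(records), "malformed": malformed, "roles": dict(roles)}


def summarize_lf_file(path: str) -> dict:
    """Load an LF JSON file and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_sharegpt(json.load(handle))
```

Calling `summarize_lf_file("artifacts/data/lf_data/<dataset>.json")` with the real dataset name substituted gives a first sanity check: a nonzero `malformed` count or an unexpected role name is worth investigating before training.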

Conversion settings

From runtime_info.input.conversion in config.yaml:

| Field | Meaning |
| --- | --- |
| max_instances | Cap on the number of converted training instances |
| exclude_repos_file | Repo exclusion list (default artifacts/data/excluded_repos.txt) |
| data_name | Base name for the LF file and the registered dataset key |
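The effect of exclude_repos_file and max_instances can be illustrated with a small sketch. It assumes each instance is a dict with a `repo` key holding an owner/repo string (a hypothetical field name), that matching is exact, and that the cap keeps the first N instances; the actual converter may match or select differently.

```python
def load_excluded_repos(text: str) -> set:
    """Parse an exclusion list with one owner/repo per line, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_and_cap(instances: list, excluded: set, max_instances: int = 0) -> list:
    """Drop instances from excluded repos, then apply the cap if it is positive.

    Assumption: instances carry a "repo" key; this name is illustrative only.
    """
    kept = [inst for inst in instances if inst.get("repo") not in excluded]
    if max_instances > 0:
        kept = kept[:max_instances]
    return kept
```

Filtering before capping means excluded repos never consume part of the max_instances budget.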

Output layout

artifacts/data/
├── excluded_repos.txt          # eval-repo filter (owner/repo per line)
└── lf_data/
    ├── <dataset>.json          # LF / ShareGPT training data
    └── dataset_info.json       # LLaMA-Factory dataset registration

Both the LF file and dataset_info.json live in artifacts/data/lf_data/; the generated training YAML sets dataset_dir to point there. See Dataset registration for how the data is wired into training.

Partial outputs

If a converter is interrupted, a partial IM-only or LF-only file may remain. Delete the partial file (or restore the missing pair) before rerunning, or the converter may skip work it thinks is already done.
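A leftover half of an IM/LF pair can be detected before rerunning. The sketch below takes both paths as arguments, since the exact IM file location under artifacts/data/ is not spelled out here; `pair_state` is a hypothetical helper. It checks only for existence, so a file truncated mid-write would still count as present.

```python
from pathlib import Path


def pair_state(im_path, lf_path) -> str:
    """Classify an IM/LF output pair as complete, absent, im_only, or lf_only."""
    im_ok = Path(im_path).is_file()
    lf_ok = Path(lf_path).is_file()
    if im_ok and lf_ok:
        return "complete"
    if not im_ok and not lf_ok:
        return "absent"
    # Exactly one file exists: this is the partial state described above.
    return "im_only" if im_ok else "lf_only"
```

An `im_only` or `lf_only` result is the cue to delete the stray file (or restore its partner) before starting the converter again.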
