sft

Data Pipeline

Scaffolds

Supported agent scaffolds and their converters

The agent scaffold that produced a trajectory determines its raw shape, which in turn selects the swe_data_process converter module used to turn it into IM data. Set the active scaffold in config.yamlruntime_info.input.source.scaffold.

Converter matrix

ScaffoldConverter module
claude-codeswe_data_process.claudecode_opencode.convert_cc_to_im
open-codeswe_data_process.claudecode_opencode.convert_oc_to_im
openhands-sdkswe_data_process.openhands.convert_openhands_sdk_to_im
terminus2swe_data_process.terminus2.convert_terminus2_to_im

Only these job-dir converters are wired. The refactored swe_data_process package removed the older source-specific converter scripts, so a scaffold outside this matrix has no conversion path.

IM format

Every converter emits the same intermediate (IM) shape — one JSONL row per trajectory:

{
  "version": "2.0.0",
  "meta_info": {
    "unique_info": {
      "_instance_id": "owner__repo-123",
      "_agent_type": "main",
      "_score": {"composite_score": 0.72, "...": "..."}
    }
  },
  "tools": ["..."],
  "messages": [
    {
      "role": "user/assistant/tool",
      "content": "...",
      "reasoning_content": "...",
      "tool_calls": ["..."]
    }
  ]
}

_instance_id, _agent_type, and _score are stored under meta_info.unique_info on disk. The package's load_jsonl() expands them back to the legacy top-level shape for internal scoring and filtering helpers.

Reasoning content and templates

When a trajectory carries reasoning_content (chain-of-thought), train with the qwen3 chat template to keep it. The default qwen3_nothink template disables thinking mode and drops it. See Configuration.

On this page