
Dashboard

Monitor training runs in real time

sft ships a real-time, read-only web dashboard for monitoring LLaMA-Factory (HuggingFace Trainer) training runs. It consists of a React frontend and a Python backend (server.py), both under dashboard/. It parses the standard training output directly, can compare multiple runs, and can optionally pull metrics from wandb.

Live dashboard: cement-here-cross-quotations.trycloudflare.com

The live URL is a quick tunnel

Unlike a Cloudflare Pages site, this link is a Cloudflare quick tunnel in front of a locally running start_dashboard.sh. The URL changes every time the tunnel restarts and only works while that process is up — re-point this link (and the nav "dashboard" link in docs/src/lib/layout.shared.tsx) whenever the tunnel is restarted. For a stable address, use a named tunnel + Cloudflare Access.

Monitoring only

This dashboard monitors training — it never launches or controls it. To launch training, use the pipeline scripts. To drive training interactively, use the built-in Gradio LLaMA Board (llamafactory-cli webui), which is a separate tool.

What it monitors

For each run it reads, from artifacts/model/<run>/:

File                 Used for
trainer_log.jsonl    live per-step loss, lr, grad norm, timing (primary source)
trainer_state.json   eval points and extra metric keys from log_history
*_results.json       final summaries (loss, runtime, samples/s, FLOPs)

and the raw console logs from artifacts/logs/.
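As a rough illustration of how the primary source can be consumed, the sketch below reads a trainer_log.jsonl file line by line and extracts a loss curve. It is not the dashboard's actual parser: the function names are hypothetical, and the key names `current_steps` and `loss` are assumptions about the log schema that this page does not confirm.

```python
import json
from pathlib import Path


def read_trainer_log(path):
    """Return the JSON records in a trainer_log.jsonl file, skipping bad lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # The trainer may be mid-write on the last line; skip it for now.
            continue
    return records


def loss_curve(records):
    """Extract (step, loss) pairs; records without a training loss are dropped."""
    # "current_steps" and "loss" are assumed key names (not confirmed by this page).
    return [
        (r["current_steps"], r["loss"])
        for r in records
        if isinstance(r, dict) and r.get("loss") is not None and "current_steps" in r
    ]
```

Skipping unparseable lines rather than failing matters for a live monitor, because the file can be read while the trainer is still appending to it.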

Run it

From subblock/sft/:

cd dashboard
./start_dashboard.sh          # builds the frontend (first run) then serves :8091

Open http://localhost:8091. By default it reads ../artifacts/model (runs) and ../artifacts/logs (console logs) — the same locations the training scripts write to.

How a "run" is detected

Any immediate subdirectory of the save dir that contains trainer_log.jsonl or trainer_state.json is treated as a run. Its state is one of: running (trainer_log.jsonl updated less than 3 minutes ago and percentage < 100), finished (all_results.json present or percentage ≈ 100), or unknown (neither condition holds).
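The detection rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in server.py: the function names are hypothetical, "percentage ≈ 100" is read as at least 99.9, the finished check is applied before the running check, and a log with no percentage yet is treated as still in progress.

```python
import json
import time
from pathlib import Path

RUN_MARKERS = ("trainer_log.jsonl", "trainer_state.json")
RUNNING_WINDOW_S = 3 * 60  # "updated < 3 min ago" from the text
DONE_PERCENT = 99.9        # assumed reading of "percentage ≈ 100"


def find_runs(save_dir):
    """Names of immediate subdirectories that contain a run marker file."""
    return sorted(
        child.name
        for child in Path(save_dir).iterdir()
        if child.is_dir() and any((child / m).is_file() for m in RUN_MARKERS)
    )


def last_percentage(log_path):
    """'percentage' from the last parseable record of the log, or None."""
    if not log_path.is_file():
        return None
    for line in reversed(log_path.read_text(encoding="utf-8").splitlines()):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or partially written line
        if isinstance(record, dict) and "percentage" in record:
            return float(record["percentage"])
    return None


def run_state(run_dir, now=None):
    """Classify a run directory as 'running', 'finished', or 'unknown'."""
    run_dir = Path(run_dir)
    now = time.time() if now is None else now
    log_path = run_dir / "trainer_log.jsonl"
    pct = last_percentage(log_path)
    # Assumption: "finished" takes priority over "running".
    if (run_dir / "all_results.json").is_file() or (pct is not None and pct >= DONE_PERCENT):
        return "finished"
    if log_path.is_file() and now - log_path.stat().st_mtime < RUNNING_WINDOW_S:
        return "running"
    return "unknown"
```

A run whose log stopped updating more than three minutes ago without reaching completion falls through to unknown, which is how a crashed or killed job would appear under these rules.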

Panels

  • Overview — step, epoch, progress, train/eval loss, lr, grad norm, step time, ETA
  • Training — loss, learning-rate schedule, gradient norm, epoch progress
  • Evaluation — eval_loss over steps, best checkpoint (empty if no eval split)
  • Performance — samples/s, steps/s, runtime, total FLOPs + per-step timing
  • Compare — overlay metrics from multiple runs on shared charts (PNG/CSV export)
  • AI Analysis — LLM-generated diagnostic report (configure a profile in Settings)
  • Logs — raw train_*.log console viewer
  • Explorer — plot any available metric key
  • Settings — LLM API profiles (stored in browser localStorage only)
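For the Evaluation panel, the eval points come from log_history in trainer_state.json. The sketch below shows one way to pull them out, relying on the `step` and `eval_loss` keys that HuggingFace Trainer writes into log_history entries. The function names are hypothetical, and picking the lowest eval_loss as "best" is an assumption; the dashboard may use a different criterion.

```python
import json
from pathlib import Path


def eval_points(state_path):
    """Return (step, eval_loss) pairs from a trainer_state.json file."""
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    return [
        (entry["step"], entry["eval_loss"])
        for entry in state.get("log_history", [])
        if "eval_loss" in entry and "step" in entry
    ]


def best_eval(points):
    """Pair with the lowest eval_loss, or None when there are no eval points."""
    # Assumption: "best" means lowest eval_loss.
    return min(points, key=lambda p: p[1]) if points else None
```

With no eval split, log_history holds only training entries, so `eval_points` returns an empty list and `best_eval` returns None, matching the empty panel described above.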

The UI supports a Chinese/English toggle and dark/light themes.

Expose externally (Cloudflare quick tunnel)

To reach the dashboard from outside the network, forward the local port through a Cloudflare quick tunnel — no DNS, account, or login required:

TUNNEL=true ./start_dashboard.sh

The script auto-discovers cloudflared from PATH and prints a temporary https://<random>.trycloudflare.com URL once the tunnel is up.
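The two steps described here, locating cloudflared on PATH and picking the temporary URL out of its output, can be illustrated in Python. This is only a sketch of the idea: start_dashboard.sh is a shell script whose actual logic is not shown on this page, and the function names and the URL pattern below are assumptions.

```python
import re
import shutil

# Assumed shape of a quick-tunnel URL: https://<random-words>.trycloudflare.com
TUNNEL_URL_RE = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_cloudflared():
    """Absolute path of cloudflared if it is on PATH, else None."""
    return shutil.which("cloudflared")


def extract_tunnel_url(output):
    """First trycloudflare.com URL found in the given text, or None."""
    match = TUNNEL_URL_RE.search(output)
    return match.group(0) if match else None
```

For example, `extract_tunnel_url` applied to a line of captured tunnel output containing `https://cement-here-cross-quotations.trycloudflare.com` returns exactly that URL.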

Quick tunnels are unauthenticated

Anyone with the URL can read the dashboard (logs, metrics, run names). Treat the URL as a secret and stop the tunnel when done. For persistent, access-controlled exposure, use a named tunnel + Cloudflare Access instead.

wandb integration (optional)

Point the server at a wandb project to pull remote run metrics alongside the local files:

export WANDB_API_KEY=...  WANDB_ENTITY=...  WANDB_PROJECT=llama-factory
python server.py --port 8091 --save-dir ../artifacts/model \
  --wandb-entity "$WANDB_ENTITY" --wandb-project "$WANDB_PROJECT" --static-dir dist

See dashboard/README.md for the full CLI options and environment variables.

Two different sites

This dashboard is a live local server that reads training output. The documentation site you are reading is a separate static fumadocs site under docs/; see Build & deploy the docs site.
