Dashboard
Monitor training runs in real time
sft ships a real-time, read-only web dashboard for monitoring LLaMA-Factory (HuggingFace Trainer) training runs. It consists of a React frontend and a Python server.py backend under dashboard/. It parses the standard training output directly, and it can compare multiple runs or connect to wandb.
Live dashboard: cement-here-cross-quotations.trycloudflare.com
The live URL is a quick tunnel
Unlike a Cloudflare Pages site, this link is a Cloudflare quick tunnel in front of a locally running start_dashboard.sh. The URL changes every time the tunnel restarts and only works while that process is up — re-point this link (and the nav "dashboard" link in docs/src/lib/layout.shared.tsx) whenever the tunnel is restarted. For a stable address, use a named tunnel + Cloudflare Access.
Monitoring only
This dashboard monitors training — it never launches or controls it. To launch training, use the pipeline scripts. To drive training interactively, use the built-in Gradio LLaMA Board (llamafactory-cli webui), which is a separate tool.
What it monitors
For each run it reads, from artifacts/model/<run>/:
| File | Used for |
|---|---|
| trainer_log.jsonl | live per-step loss, lr, grad norm, timing (primary source) |
| trainer_state.json | eval points and extra metric keys from log_history |
| *_results.json | final summaries (loss, runtime, samples/s, FLOPs) |
and the raw console logs from artifacts/logs/.
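To make the primary source concrete, here is a minimal sketch of reading trainer_log.jsonl one JSON record per line. It is not taken from server.py: the function names `read_trainer_log` and `latest_point` are hypothetical, and the sketch makes no assumption about which fields each record holds beyond it being valid JSON.

```python
import json
from pathlib import Path


def read_trainer_log(path):
    """Return all parseable records from a trainer_log.jsonl file.

    The trainer may still be writing the last line, so lines that fail
    to parse are skipped instead of raising.
    """
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # partial line from an in-progress write
    return records


def latest_point(records):
    """Return the most recent record, or None if nothing was logged yet."""
    return records[-1] if records else None
```

Skipping unparseable lines is a design choice for a live reader: a half-written final line should not break the view of a run that is still training.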
Run it
From subblock/sft/:
cd dashboard
./start_dashboard.sh # builds the frontend (first run) then serves :8091
Open http://localhost:8091. By default it reads ../artifacts/model (runs) and ../artifacts/logs (console logs), the same locations the training scripts write to.
How a "run" is detected
Any immediate sub-directory of the save dir containing trainer_log.jsonl or trainer_state.json is treated as a run. Its state is running (trainer_log.jsonl updated less than 3 minutes ago and percentage < 100), finished (all_results.json present or percentage ≈ 100), or unknown otherwise.
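The detection rule can be sketched as follows. This is an illustration, not the code in server.py. It assumes the `percentage` field is read from the last parseable line of trainer_log.jsonl, that "≈ 100" means at least 99.9, that the finished check takes precedence, and that a run with no readable percentage is not reported as running. All function names are hypothetical.

```python
import json
import time
from pathlib import Path

STALE_AFTER_S = 3 * 60  # "updated < 3 min ago" in the rule above


def is_run(path):
    """A run is a directory holding either trainer file."""
    p = Path(path)
    return p.is_dir() and (
        (p / "trainer_log.jsonl").exists() or (p / "trainer_state.json").exists()
    )


def last_percentage(log_path):
    """Read `percentage` from the last parseable record (assumed field name)."""
    pct = None
    for line in log_path.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or partially written line
        if isinstance(rec, dict) and "percentage" in rec:
            pct = float(rec["percentage"])
    return pct


def classify_run(path, now=None):
    """Return 'running', 'finished', or 'unknown' for one run directory."""
    p = Path(path)
    now = time.time() if now is None else now
    log = p / "trainer_log.jsonl"
    pct = last_percentage(log) if log.exists() else None

    # Assumption: "percentage ≈ 100" is interpreted as >= 99.9.
    if (p / "all_results.json").exists() or (pct is not None and pct >= 99.9):
        return "finished"
    if pct is not None and pct < 100:
        fresh = now - log.stat().st_mtime < STALE_AFTER_S
        if fresh:
            return "running"
    return "unknown"


def find_runs(save_dir):
    """Map each immediate sub-directory that looks like a run to its state."""
    return {
        d.name: classify_run(d)
        for d in sorted(Path(save_dir).iterdir())
        if is_run(d)
    }
```

Calling `find_runs("../artifacts/model")` from dashboard/ would list the runs under the default save dir. A crashed run ends up as unknown once its log has been idle for three minutes, which is why the freshness check uses the file's modification time.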
Panels
- Overview — step, epoch, progress, train/eval loss, lr, grad norm, step time, ETA
- Training — loss, learning-rate schedule, gradient norm, epoch progress
- Evaluation — eval_loss over steps, best checkpoint (empty if no eval split)
- Performance — samples/s, steps/s, runtime, total FLOPs + per-step timing
- Compare — overlay metrics from multiple runs on shared charts (PNG/CSV export)
- AI Analysis — LLM-generated diagnostic report (configure a profile in Settings)
- Logs — raw train_*.log console viewer
- Explorer — plot any available metric key
- Settings — LLM API profiles (stored in browser localStorage only)
The UI supports a Chinese/English toggle and dark/light themes.
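As an illustration of what "any available metric key" could mean for the Explorer and Evaluation panels, the sketch below collects numeric keys from the log_history list in trainer_state.json and extracts one series. The helper names are hypothetical and the dashboard's own implementation may differ. The only format assumption is that log_history is a list of dicts carrying `step` alongside metric values, as HuggingFace Trainer writes it.

```python
import json
from pathlib import Path


def load_log_history(run_dir):
    """Return the log_history list from a run's trainer_state.json."""
    text = (Path(run_dir) / "trainer_state.json").read_text(encoding="utf-8")
    return json.loads(text).get("log_history", [])


def metric_keys(log_history):
    """Sorted numeric keys seen in any entry, excluding the x-axis fields."""
    keys = set()
    for entry in log_history:
        for name, value in entry.items():
            if name in ("step", "epoch"):
                continue
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                keys.add(name)
    return sorted(keys)


def series(log_history, key):
    """(step, value) pairs for one metric, in logged order."""
    return [(e["step"], e[key]) for e in log_history if key in e and "step" in e]
```

With no eval split, `series(history, "eval_loss")` is simply an empty list, which matches the empty Evaluation panel described above.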
Expose externally (Cloudflare quick tunnel)
To reach the dashboard from outside the network, forward the local port through a Cloudflare quick tunnel — no DNS, account, or login required:
TUNNEL=true ./start_dashboard.sh
The script auto-discovers cloudflared from PATH and prints a temporary https://<random>.trycloudflare.com URL once the tunnel is up.
Quick tunnels are unauthenticated
Anyone with the URL can read the dashboard (logs, metrics, run names). Treat the URL as a secret and stop the tunnel when done. For persistent, access-controlled exposure, use a named tunnel + Cloudflare Access instead.
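Because the quick-tunnel URL changes on every restart, it can be convenient to pull it out of captured script output instead of copying it by hand. The helper below is a hypothetical sketch that is not part of start_dashboard.sh. It assumes the URL appears in the output in the https://<random>.trycloudflare.com form described above, with a lowercase alphanumeric and hyphenated label.

```python
import re

# A quick-tunnel hostname is a single random label under trycloudflare.com.
TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_tunnel_url(output):
    """Return the first quick-tunnel URL in captured output, or None."""
    match = TUNNEL_URL.search(output)
    return match.group(0) if match else None
```

The returned value is the secret to protect: it should not be written to shared logs or committed anywhere public.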
wandb integration (optional)
Point the server at a wandb project to pull remote run metrics alongside the local files:
export WANDB_API_KEY=... WANDB_ENTITY=... WANDB_PROJECT=llama-factory
python server.py --port 8091 --save-dir ../artifacts/model \
--wandb-entity "$WANDB_ENTITY" --wandb-project "$WANDB_PROJECT" --static-dir dist
See dashboard/README.md for the full CLI options and environment variables.
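For orientation, pulling remote metrics from a wandb project can look roughly like the sketch below, which uses the wandb public API (`wandb.Api().runs(...)` and `Run.scan_history(...)`). How server.py actually queries wandb is not shown here, so treat this as an assumption-laden illustration: the function names are hypothetical, and the metric key `train/loss` assumes the naming used by the HuggingFace Trainer wandb integration.

```python
import os


def fetch_loss_curves(api, entity, project, key="train/loss"):
    """Return {run_name: [values...]} for every run in entity/project.

    `api` is expected to behave like wandb.Api(): it needs runs(path),
    and each run needs .name and .scan_history(keys=[...]).
    """
    curves = {}
    for run in api.runs(f"{entity}/{project}"):
        rows = run.scan_history(keys=[key])
        curves[run.name] = [row[key] for row in rows if row.get(key) is not None]
    return curves


def main():
    # Imported lazily so the helper above has no hard dependency on wandb.
    import wandb

    api = wandb.Api()  # picks up WANDB_API_KEY from the environment
    curves = fetch_loss_curves(
        api, os.environ["WANDB_ENTITY"], os.environ["WANDB_PROJECT"]
    )
    for name, values in curves.items():
        print(name, len(values))
```

Passing the API object in as a parameter keeps the fetching logic testable without network access.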
Two different sites
This dashboard is a live local server that reads training output. The documentation site you are reading is a separate static fumadocs site under docs/; see Build & deploy the docs site.