Arena
Compare two eval runs and see what actually moved.
Arena is an operator tool, not just a leaderboard. Pick a baseline run and a candidate run, inspect the per-dimension score deltas, and open the underlying reports when a comparison needs more evidence.
Loaded runs
0
Compared dimensions
0
Select runs
Run at least two evals before using Arena. The fastest path is to launch a recorded-output eval from the workspace homepage, then come back here to compare revisions.
Baseline
No run selected yet.
Candidate
No run selected yet.
Comparing
Loading run comparison...
Saved comparisons
Compare snapshots
0 snapshots
Save your first comparison and it will appear here as a reusable artifact for your release train.