Aegis: Closed-Loop Intelligence Engine
Ground behavior, improve it, and defend every ship decision with evidence.
Mode: Eval-first
Release: Gate-aware
Reports: Shareable
Access: Accounts enabled
Shell policy
The workspace chrome does not inject sample benchmark rows, synthetic scores, or decorative regression traces. Live evidence belongs in the closed loop, research runs, review queue, and release train after a real workspace is populated.
Closed Loop
Import traces, run the strict loop, and open the dossier.
Research Runs
Measure benchmark deltas and investigate candidate behavior.
Review Queue
Attach ownership, severity, and operator judgment.
Release Train
Persist gate state beside the same artifact lineage.
Launch-grade proof should be grounded in persisted artifacts, not shell placeholders.
surface    | purpose                     | required | owner
dataset    | fixed benchmark contract    | yes      | research
comparison | baseline vs candidate delta | yes      | operator
review     | annotated release judgment  | yes      | human
promotion  | gate outcome + lineage      | yes      | release
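
The surfaces listed above can be read as a checklist that a launch decision has to satisfy. The following is a minimal sketch under that reading, not Aegis's actual implementation: the `Surface` class and the `missing_surfaces` and `launch_ready` helpers are hypothetical names introduced here for illustration, and only the four rows themselves come from the table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    """One evidence surface (hypothetical type; fields mirror the table columns)."""
    name: str
    purpose: str
    required: bool
    owner: str


# Rows copied from the table above.
SURFACES = [
    Surface("dataset", "fixed benchmark contract", True, "research"),
    Surface("comparison", "baseline vs candidate delta", True, "operator"),
    Surface("review", "annotated release judgment", True, "human"),
    Surface("promotion", "gate outcome + lineage", True, "release"),
]


def missing_surfaces(persisted: set[str]) -> list[str]:
    """Return required surfaces that have no persisted artifact yet, in table order."""
    return [s.name for s in SURFACES if s.required and s.name not in persisted]


def launch_ready(persisted: set[str]) -> bool:
    """Assumed rule: launch proof is complete only when no required surface is missing."""
    return not missing_surfaces(persisted)
```

For example, a workspace that has persisted only `dataset` and `comparison` artifacts would still be missing `review` and `promotion` under this sketch.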
Arena

Compare two eval runs and see what actually moved.

Arena should be a real operator tool, not just a leaderboard. Pick a baseline and candidate run, inspect the score deltas, and open the underlying reports when the comparison needs more evidence.
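
As a rough illustration of what inspecting score deltas between a baseline and a candidate could look like, here is a minimal sketch. It assumes each run exposes a mapping from dimension name to numeric score; the function names, the dimension names, and the `threshold` parameter are hypothetical and are not taken from Aegis.

```python
from __future__ import annotations


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every dimension present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {dim: candidate[dim] - baseline[dim] for dim in sorted(shared)}


def moved(deltas: dict[str, float], threshold: float = 0.01) -> dict[str, float]:
    """Keep only dimensions whose absolute change meets the (assumed) threshold."""
    return {dim: d for dim, d in deltas.items() if abs(d) >= threshold}


# Hypothetical scores for two runs; "safety" exists only in the candidate,
# so it is skipped rather than compared against a missing baseline value.
baseline_run = {"accuracy": 0.75, "latency": 0.5}
candidate_run = {"accuracy": 0.875, "latency": 0.5, "safety": 1.0}

deltas = score_deltas(baseline_run, candidate_run)
changed = moved(deltas)
```

Restricting the comparison to shared dimensions is a design choice in this sketch: a dimension scored in only one run has no delta, so it is better surfaced separately than reported as a change.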

Loaded runs: 0
Compared dimensions: 0
Select runs
Run at least two evals before using Arena. The fastest path is to launch a recorded-output eval from the workspace homepage, then come back here to compare revisions.
Baseline: No run selected yet.
Candidate: No run selected yet.
Saved comparisons
Compare snapshots
0 snapshots
Save your first comparison and it will appear here as a reusable release-train artifact.
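
One way a saved comparison could act as a reusable artifact is to give it a stable identifier derived from its contents, so the same baseline, candidate, and deltas always resolve to the same snapshot. The sketch below assumes that approach; `build_snapshot`, the payload fields, the run identifiers, and the 12-character identifier length are hypothetical and do not describe how Aegis stores snapshots.

```python
from __future__ import annotations

import hashlib
import json


def build_snapshot(baseline_id: str, candidate_id: str, deltas: dict[str, float]) -> dict:
    """Build a comparison snapshot with a content-derived identifier (assumed scheme)."""
    payload = {"baseline": baseline_id, "candidate": candidate_id, "deltas": deltas}
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    # regardless of the order in which the deltas were inserted.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload, "snapshot_id": digest[:12]}


# Hypothetical run identifiers and deltas.
first = build_snapshot("run-a", "run-b", {"accuracy": 0.125, "latency": 0.0})
again = build_snapshot("run-a", "run-b", {"latency": 0.0, "accuracy": 0.125})
other = build_snapshot("run-a", "run-c", {"accuracy": 0.125, "latency": 0.0})
```

Because the identifier depends only on the content, saving the same comparison twice yields the same `snapshot_id`, while changing either run produces a different one.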