Aegis: Closed-Loop Intelligence Engine
Ground behavior, improve it, and defend every ship decision with evidence.
Mode: Eval-first
Release: Gate-aware
Reports: Shareable
Access: Accounts enabled
Shell policy
The workspace chrome does not inject sample benchmark rows, synthetic scores, or decorative regression traces. Live evidence belongs in the closed loop, research runs, review queue, and release train after a real workspace is populated.
Closed Loop: Import traces, run the strict loop, and open the dossier.
Research Runs: Measure benchmark deltas and investigate candidate behavior.
Review Queue: Attach ownership, severity, and operator judgment.
Release Train: Persist gate state beside the same artifact lineage.
Launch-grade proof should be grounded in persisted artifacts, not shell placeholders.
surface    | purpose                     | required | owner
-----------|-----------------------------|----------|---------
dataset    | fixed benchmark contract    | yes      | research
comparison | baseline vs candidate delta | yes      | operator
review     | annotated release judgment  | yes      | human
promotion  | gate outcome + lineage      | yes      | release
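The surface contract above can be sketched as a small data model. This is an illustrative sketch only: the `Surface` class, its field names, and the `missing_surfaces` helper are assumptions made for this example, not the Aegis schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Surface:
    """One row of the surface contract (hypothetical model, not the Aegis schema)."""
    name: str       # e.g. "dataset"
    purpose: str    # e.g. "fixed benchmark contract"
    required: bool  # every surface in the table above is required
    owner: str      # e.g. "research"

SURFACES = [
    Surface("dataset", "fixed benchmark contract", True, "research"),
    Surface("comparison", "baseline vs candidate delta", True, "operator"),
    Surface("review", "annotated release judgment", True, "human"),
    Surface("promotion", "gate outcome + lineage", True, "release"),
]

def missing_surfaces(present: set) -> list:
    """List required surfaces not yet populated in the workspace."""
    return [s.name for s in SURFACES if s.required and s.name not in present]
```

With only the dataset and comparison surfaces populated, `missing_surfaces({"dataset", "comparison"})` would flag `review` and `promotion` as outstanding before a release decision can be defended.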
Workspace

Run evals, explain what broke, and make every release decision defensible.

Aegis should feel like a research control room: benchmark the workflow, compare what changed, inspect the evidence behind weak behavior, and hand the team a result that survives release review.

Recent runs: 0 (recent research runs in this workspace)
Average score: 0.0% (current mean across the active research window)
Dimensions reviewed: 0 (scored dimensions flowing into reports and reviews)
Blocked / watch: 0 (runs that still need attention before they ship)
Operating loop
Where the workspace stands right now.
Stage 1 (Start): Benchmark locked. Start with one saved suite so the eval loop has a stable benchmark.
Stage 2 (Pending): Research underway. Launch the first run so the workspace has real evidence to inspect.
Stage 3 (Pending): Human review active. Move from scoring to judgment with templates, notes, and ownership.
Stage 4 (Pending): Release posture. Promote a baseline and run a release gate once the report is trustworthy.
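The four stages above advance strictly in order, which makes the loop behave like a simple linear state machine. The sketch below is illustrative; the stage identifiers and the linear-advance rule are assumptions made for this example.

```python
from typing import Optional

# Stage names mirror the operating-loop checklist above; the identifiers
# themselves are invented for this sketch.
STAGES = [
    "benchmark_locked",
    "research_underway",
    "human_review_active",
    "release_posture",
]

def next_stage(current: str) -> Optional[str]:
    """Return the next stage in the loop, or None once release posture is reached."""
    i = STAGES.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None
```

The key property is that no stage can be skipped: a workspace cannot reach release posture without passing through research and human review first.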
Eval launchpad
Launch checklist
Release posture
The shipping picture in one glance.
Approved: 0 (reports ready to support a release decision)
Baselines: 0 (runs currently promoted as saved release baselines)
Need attention: 0 (candidate runs that need another human pass)
Blocked: 0 (runs that should not ship in their current state)
Recent research runs
Latest evidence from the loop
Launch another eval
System map
Three motions, one workspace
Benchmark and compare

Use saved datasets, snapshot comparisons, and release gates to make model changes measurable instead of anecdotal.
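A baseline-vs-candidate gate can be as small as a per-dimension delta check. The sketch below is a minimal illustration: the `gate` function, the 0.02 regression threshold, and the dimension names are all assumptions made for this example, not Aegis behavior.

```python
def gate(baseline: dict, candidate: dict, max_regression: float = 0.02) -> dict:
    """Block promotion if any scored dimension regresses past the threshold."""
    deltas = {dim: candidate[dim] - baseline[dim] for dim in baseline}
    blocked = [dim for dim, delta in deltas.items() if delta < -max_regression]
    return {"deltas": deltas, "blocked": blocked, "passed": not blocked}

baseline = {"accuracy": 0.81, "faithfulness": 0.74}
candidate = {"accuracy": 0.84, "faithfulness": 0.70}
result = gate(baseline, candidate)
# faithfulness regressed by 0.04, past the 0.02 threshold, so the gate blocks.
```

Because the per-dimension deltas persist beside the run, a blocked promotion carries its own evidence rather than an anecdote.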

Review and defend

Keep the report tied to operator judgment with reusable templates, annotations, and explicit release decisions.

Inspect the runtime path

Open traces, memory, and intervention labs only after the eval tells you what actually failed and why.