Review Queue

The Review Queue closes the loop between automated evaluation and human judgment. Flagged eval runs, sampled passed runs, and manually submitted items all flow into a single queue where humans render verdicts. Those verdicts feed back into the detection rules, improving accuracy over time. This is the learning flywheel: flag, review, improve, repeat.

Verdict Types

When reviewing a queued item, the reviewer assigns one of five verdicts:
| Verdict | Meaning | Effect |
| --- | --- | --- |
| True Positive | Real issue, correctly flagged | Confirms the detection rule is working. No rule change needed. |
| False Positive | No real issue, over-flagged | Detection rule is too sensitive. Feeds into threshold tuning. |
| True Negative | Correctly passed, no issue | Confirms the passing criteria are sound. Validates the eval. |
| False Negative | Missed a real issue | The eval system failed to catch a problem. Triggers new rule creation. |
| Defect Found | New bug category discovered | Neither the eval nor the human expected this. Creates a new detection category. |
Each verdict is recorded with the reviewer’s identity, timestamp, and optional notes. Verdicts are immutable — once submitted, they become part of the audit trail.
Defect Found is the most valuable verdict. It means the review process surfaced something entirely new — a failure mode that no existing rule covers. These verdicts drive the most impactful rule improvements.
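The five verdicts and the follow-up action each one implies can be sketched as a small enum plus a lookup table. This is an illustrative representation, not the platform's actual schema; the names `Verdict` and `RULE_ACTION` are assumptions:

```python
from enum import Enum

class Verdict(Enum):
    TRUE_POSITIVE = "true_positive"    # real issue, correctly flagged
    FALSE_POSITIVE = "false_positive"  # no real issue, over-flagged
    TRUE_NEGATIVE = "true_negative"    # correctly passed, no issue
    FALSE_NEGATIVE = "false_negative"  # missed a real issue
    DEFECT_FOUND = "defect_found"      # new bug category discovered

# Illustrative mapping from each verdict to the rule effect described above.
RULE_ACTION = {
    Verdict.TRUE_POSITIVE: "none",             # rule is working
    Verdict.FALSE_POSITIVE: "tune_threshold",  # rule too sensitive
    Verdict.TRUE_NEGATIVE: "none",             # passing criteria sound
    Verdict.FALSE_NEGATIVE: "create_rule",     # eval missed a problem
    Verdict.DEFECT_FOUND: "create_category",   # entirely new failure mode
}
```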

Review Sources

Items enter the Review Queue from three sources:

Deterministic Flag

When an eval run exceeds a configured threshold (e.g., shift magnitude > 0.5 on a critical scenario), it is automatically flagged and added to the queue. These are the highest-priority items — the automated system detected a potential problem.
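The threshold check can be sketched as a single predicate. The field names (`risk_level`, `shift_magnitude`) and the helper itself are illustrative; the 0.5 threshold mirrors the example above:

```python
def should_flag(run: dict, threshold: float = 0.5) -> bool:
    """Flag a run when its shift magnitude exceeds the configured
    threshold on a critical scenario (values mirror the example above)."""
    return run["risk_level"] == "critical" and run["shift_magnitude"] > threshold

# A critical-risk run with shift magnitude 0.62 exceeds the 0.5 threshold,
# so it is flagged and added to the queue as DETERMINISTIC_FLAG.
should_flag({"risk_level": "critical", "shift_magnitude": 0.62})
```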

Random Sample

A percentage of passed runs are sampled into the queue for human review. This catches false negatives — cases where the eval system said everything was fine, but a human might disagree.

Manual Submission

Humans can manually submit items to the queue at any time. This is useful when an owner notices concerning agent behavior during normal operation that was not captured by an eval run.
| Source | Priority | Purpose |
| --- | --- | --- |
| DETERMINISTIC_FLAG | High | Automated detection caught something |
| RANDOM_SAMPLE | Medium | Spot-check to catch false negatives |
| MANUAL | Varies | Human-initiated review |

Passed-Run Sampling

Passed-run sampling is the mechanism that catches false negatives. Without it, the eval system could silently miss real problems, and you would never know.

How It Works

  1. After each eval batch completes, the system identifies runs that passed (no flag triggered)
  2. A random 5% of passed runs are selected for human review
  3. Selected runs are added to the Review Queue with source RANDOM_SAMPLE
  4. The reviewer examines the run and assigns a verdict
POST /api/v1/review/sample-passed-runs
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "batch_id": "batch_uuid",
  "sample_rate": 0.05,
  "max_samples": 10
}
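The selection logic behind this endpoint can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the server's implementation; the 5% rate and cap of 10 mirror the request body above:

```python
import random

def sample_passed_runs(passed_runs, sample_rate=0.05, max_samples=10, seed=None):
    """Select a random subset of passed runs for human review,
    capped at max_samples per batch."""
    rng = random.Random(seed)
    k = min(max_samples, round(len(passed_runs) * sample_rate))
    return rng.sample(passed_runs, k)

passed = [f"run-{i}" for i in range(400)]
picked = sample_passed_runs(passed, sample_rate=0.05, max_samples=10)
len(picked)  # 5% of 400 is 20, but the cap limits it to 10
```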

Why This Matters

If a reviewer assigns False Negative to a sampled run, it means the eval system missed a real problem. This directly triggers:
  • A new detection rule covering the missed case
  • A review of similar passed runs to estimate the scope
  • An update to the relevant scenario’s expected behavior
The 5% sample rate and max 10 per batch are defaults. For high-risk agents or immediately after rule changes, consider temporarily increasing the sample rate to 10-20% to validate the new rules faster.
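Because the sample is random, False Negative verdicts on sampled runs also give a rough estimate of how often the eval system misses problems overall. A hypothetical helper (the function name and verdict strings are illustrative):

```python
def estimated_miss_rate(sample_verdicts):
    """Fraction of sampled passed runs judged False Negative.
    Since sampling is random, this estimates the rate at which
    the eval system passes runs it should have flagged."""
    fn = sum(1 for v in sample_verdicts if v == "false_negative")
    return fn / len(sample_verdicts)

# 2 misses in 40 sampled passed runs -> roughly a 5% miss rate
estimated_miss_rate(["true_negative"] * 38 + ["false_negative"] * 2)
```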

Ruleset Versioning

Verdicts feed into ruleset improvements. Each time detection rules are updated based on review outcomes, a new ruleset version is created.

Version Tracking

| Field | Description |
| --- | --- |
| version | Monotonically increasing version number |
| created_at | When this version was created |
| changes | Description of what changed |
| tpr | True Positive Rate (sensitivity) |
| fpr | False Positive Rate |
| verdict_basis | Number of verdicts that informed this version |
GET /api/v1/review/rulesets
Authorization: Bearer <jwt>
Response:
{
  "rulesets": [
    {
      "version": 3,
      "created_at": "2026-03-19T10:00:00Z",
      "changes": "Added authority bias detection for deployment scenarios",
      "tpr": 0.94,
      "fpr": 0.08,
      "verdict_basis": 142
    },
    {
      "version": 2,
      "created_at": "2026-03-12T14:30:00Z",
      "changes": "Lowered shift threshold for critical-risk scenarios from 0.5 to 0.4",
      "tpr": 0.91,
      "fpr": 0.12,
      "verdict_basis": 87
    }
  ]
}
Each version tracks its True Positive Rate and False Positive Rate so you can verify that rule changes actually improve detection quality rather than just shifting the trade-off.
A ruleset version with improving TPR but worsening FPR means you are catching more real issues but also generating more noise. Monitor both metrics together.
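TPR and FPR follow directly from the confusion-matrix counts that verdicts accumulate. A worked sketch, with counts invented for illustration (chosen to reproduce version 3's 0.94 / 0.08 from the response above):

```python
def tpr_fpr(tp, fp, tn, fn):
    """True Positive Rate (sensitivity) and False Positive Rate
    from verdict counts."""
    tpr = tp / (tp + fn)  # confirmed flags / all real issues
    fpr = fp / (fp + tn)  # over-flags / all non-issues
    return tpr, fpr

# 47 True Positives, 4 False Positives, 46 True Negatives, 3 False Negatives
tpr, fpr = tpr_fpr(47, 4, 46, 3)  # -> (0.94, 0.08)
```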

Workflow

The day-to-day workflow for the Review Queue is straightforward:
1. Triage pending items. Open the Review Queue. Items are sorted by priority: deterministic flags first, then manual submissions, then random samples. Each item shows the eval scenario, the agent’s output, and the expected behavior.

2. Examine the run. Review the agent’s actual output compared to the expected behavior. For flagged runs, check which threshold was exceeded. For sampled runs, assess whether the pass was legitimate.

3. Submit a verdict. Assign one of the five verdict types. Add notes explaining your reasoning — these notes are invaluable for future rule refinement.

4. Dismiss resolved items. After submitting a verdict, the item moves from pending to reviewed. Periodically archive reviewed items to keep the queue focused.
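The triage order in step 1 can be expressed as a sort key over the three sources. The field names and priority map are illustrative assumptions, not the platform's schema:

```python
# Lower number = reviewed first, matching the priority order in step 1.
PRIORITY = {"DETERMINISTIC_FLAG": 0, "MANUAL": 1, "RANDOM_SAMPLE": 2}

def triage_order(items):
    """Sort pending items: automated flags first, then manual
    submissions, then random samples; oldest first within a source."""
    return sorted(items, key=lambda i: (PRIORITY[i["source"]], i["created_at"]))

items = [
    {"source": "RANDOM_SAMPLE", "created_at": "2026-03-19T09:00:00Z"},
    {"source": "DETERMINISTIC_FLAG", "created_at": "2026-03-19T10:00:00Z"},
    {"source": "MANUAL", "created_at": "2026-03-19T08:00:00Z"},
]
[i["source"] for i in triage_order(items)]
# -> ["DETERMINISTIC_FLAG", "MANUAL", "RANDOM_SAMPLE"]
```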

Submitting a Verdict

POST /api/v1/review/queue/{item_id}/verdicts
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "verdict": "false_positive",
  "notes": "Agent correctly prioritized user safety over speed. The shift was appropriate given the context.",
  "severity_override": null
}

Filtering the Queue

GET /api/v1/review/queue?status=pending&source=DETERMINISTIC_FLAG&risk_level=critical
Authorization: Bearer <jwt>
Filter by status (pending, reviewed, dismissed), source, risk level, or date range to focus your review sessions.

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/v1/review/queue?status=pending | List pending review items with filters |
| POST | /api/v1/review/queue/{id}/verdicts | Submit a verdict for a queued item |
| POST | /api/v1/review/sample-passed-runs | Trigger passed-run sampling for a batch |
| GET | /api/v1/review/rulesets | List ruleset versions with TPR/FPR metrics |