AI-AGENT RELIABILITY · LIVE SYSTEM

Fromfailedrun
tohuman-approved
fix.

AgentOps Autopilot traces every agent run, scores it, diagnoses failures with cited evidence, then turns them into a human-approved fix and a regression test. Sentry + Datadog + GitLab CI — for AI agents.

Enter the live system

Open the console

31→92

groundedness

8/12

pattern hits

setup steps

Built on Gemini · Vertex AI Arize Phoenix GitLab

agent run · acme failed

search support tickets

search internal docs

check deployment historymissing

draft customer update

“…caused by last night’s deployment.” — never verified.

Autopilot diagnosis verified

post-fix

groundedness

31% → 92%

human-approved · regression passed

telemetry skipped

2 tools added

GitLab MR opened

A live run · fractured, diagnosed, and proven

AUTONOMOUS DIAGNOSISHUMAN-APPROVED REMEDIATIONGROUNDED IN TRACE + EVAL EVIDENCEGITLAB MR ON APPROVALREGRESSION-VERIFIED FIXESGEMINI · PHOENIX · GITLABAUTONOMOUS DIAGNOSISHUMAN-APPROVED REMEDIATIONGROUNDED IN TRACE + EVAL EVIDENCEGITLAB MR ON APPROVALREGRESSION-VERIFIED FIXESGEMINI · PHOENIX · GITLAB

[ THE RELIABILITY LOOP ]

One run, from confusion to clarity.

Scroll the live trace. This is the exact path AgentOps walks every failed run — autonomous diagnosis, human-approved remediation, proof it improved.

01
Agent runs
It looked like a clean success.
An escalation agent investigates Acme’s checkout outage and ships a confident customer reply. Green across the board.
run_acme_hero passed?
› plan investigation
› searchSupportTickets → 3 hits
› searchInternalDocs → 2 hits
› draftCustomerUpdate → sent
“The checkout failures were caused by last night’s deployment…”
02
Failure detected
The success state fractures.
Evals catch what a human skim wouldn’t: a customer-facing causal claim with zero verifying evidence behind it.
eval: groundedness 31% fail
fail
required-tool
1
claims
0
evidence
deployment skipped telemetry skipped
03
Evidence collected
What was checked — and what wasn’t.
Every source the agent touched becomes a node. Deployment history and telemetry light up as dashed red gaps.
04
Root cause diagnosed
Grounded, cited, no invented facts.
Autopilot reconstructs the causal chain from trace + eval evidence and names the exact step that broke.
Diagnosis
The agent made a customer-facing causal claim without calling checkDeploymentHistory or checkRuntimeTelemetry. Required-tool eval scored 0%.
grounded · 9 evidence citations · pattern 8/12
05
Remediation proposed
A fix that reads like a merge request.
A policy diff lands: verify deployment + telemetry before any root-cause claim, or express calibrated uncertainty.
agent-policies/escalation-agent.md
## Customer-facing root-cause claims
Draft the update from tickets and docs.
Require checkDeploymentHistory + checkRuntimeTelemetry
before asserting any cause. Else state uncertainty.
06
Human approves
The one deliberate, controlled action.
Nothing self-modifies. A reviewer approves — and a real GitLab remediation opens with the evidence attached.
Human approvaltool-call firewall
tool · createGitlabIssue
risk · medium · approval required
ApproveReject
07
Agent rerun
Replay the exact scenario.
Post-fix, the agent is forced through verification. This time it checks deployment and telemetry before it speaks.
post-fix rerun replaying
› searchSupportTickets
✓ checkDeploymentHistory
✓ checkRuntimeTelemetry
› draftCustomerUpdate → calibrated
08
Improvement proven
31 → 92, on the record.
Before/after isn’t a claim — it’s a regression test. Failed traces become a human-approved fix that’s verified to hold.
0
post-fix
groundedness31% → 92%
required-toolfail → pass
claims1 → 0
3/3 regression cases passed

[ MEANINGFUL MCP INTEGRATION ]

Grounded in Arize Phoenix.

Autopilot doesn’t guess. It treats Arize Phoenix as its source of truth — exporting every trace and eval, then querying them back through a real MCP tool contract before it says a single word about why a run failed.

Emit

Every agent run streams its trace spans + eval scores to Arize Phoenix as it executes.

Query via MCP

Autopilot pulls that observability data back through a Phoenix MCP tool contract — tools/list, tools/call.

Ground

Diagnoses cite only Phoenix-sourced spans and eval evidence. No invented facts, ever.

POST /api/mcp/phoenix

mcp

› { "method": "tools/list" }

query_run_observability(runId) — spans + eval scores

list_exported_runs() — everything in Phoenix

list_failing_runs() — anything not passing

tools/call → grounded diagnosis,
cited against { spans, evals } from Phoenix.

Phoenix observability — connectedArize Phoenix · MCP

Failed traces become a human-approved fix
and a regression test.

Run the guided demo

Explore the console

Fromfailedruntohuman-approvedfix.

One run, from confusion to clarity.

It looked like a clean success.

The success state fractures.

What was checked — and what wasn’t.

Grounded, cited, no invented facts.

A fix that reads like a merge request.

The one deliberate, controlled action.

Replay the exact scenario.