AI-AGENT RELIABILITY · LIVE SYSTEM

Fromfailedrun
tohuman-approved
fix.

AgentOps Autopilot traces every agent run, scores it, diagnoses failures with cited evidence, then turns them into a human-approved fix and a regression test. Sentry + Datadog + GitLab CI — for AI agents.

31→92
groundedness
8/12
pattern hits
0
setup steps
Built on Gemini · Vertex AI Arize Phoenix GitLab
agent run · acme failed
search support tickets
search internal docs
check deployment historymissing
draft customer update

“…caused by last night’s deployment.” — never verified.

Autopilot diagnosis verified
0
post-fix
groundedness
31% → 92%
human-approved · regression passed
telemetry skipped
2 tools added
GitLab MR opened

A live run · fractured, diagnosed, and proven

AUTONOMOUS DIAGNOSISHUMAN-APPROVED REMEDIATIONGROUNDED IN TRACE + EVAL EVIDENCEGITLAB MR ON APPROVALREGRESSION-VERIFIED FIXESGEMINI · PHOENIX · GITLABAUTONOMOUS DIAGNOSISHUMAN-APPROVED REMEDIATIONGROUNDED IN TRACE + EVAL EVIDENCEGITLAB MR ON APPROVALREGRESSION-VERIFIED FIXESGEMINI · PHOENIX · GITLAB
[ THE RELIABILITY LOOP ]

One run, from confusion to clarity.

Scroll the live trace. This is the exact path AgentOps walks every failed run — autonomous diagnosis, human-approved remediation, proof it improved.

  1. 01
    Agent runs

    It looked like a clean success.

    An escalation agent investigates Acme’s checkout outage and ships a confident customer reply. Green across the board.

    run_acme_hero passed?
    › plan investigation
    › searchSupportTickets → 3 hits
    › searchInternalDocs → 2 hits
    › draftCustomerUpdate → sent
    “The checkout failures were caused by last night’s deployment…”
  2. 02
    Failure detected

    The success state fractures.

    Evals catch what a human skim wouldn’t: a customer-facing causal claim with zero verifying evidence behind it.

    eval: groundedness 31% fail
    fail
    required-tool
    1
    claims
    0
    evidence
    deployment skipped telemetry skipped
  3. 03
    Evidence collected

    What was checked — and what wasn’t.

    Every source the agent touched becomes a node. Deployment history and telemetry light up as dashed red gaps.

    Causal claimticketSupport ticketsdocInternal docsdeployDeploy history · skippedtelemetryTelemetry · skippedincidentIncidents
  4. 04
    Root cause diagnosed

    Grounded, cited, no invented facts.

    Autopilot reconstructs the causal chain from trace + eval evidence and names the exact step that broke.

    Diagnosis

    The agent made a customer-facing causal claim without calling checkDeploymentHistory or checkRuntimeTelemetry. Required-tool eval scored 0%.

    grounded · 9 evidence citations · pattern 8/12
  5. 05
    Remediation proposed

    A fix that reads like a merge request.

    A policy diff lands: verify deployment + telemetry before any root-cause claim, or express calibrated uncertainty.

    agent-policies/escalation-agent.md
    ## Customer-facing root-cause claims
    Draft the update from tickets and docs.
    Require checkDeploymentHistory + checkRuntimeTelemetry
    before asserting any cause. Else state uncertainty.
  6. 06
    Human approves

    The one deliberate, controlled action.

    Nothing self-modifies. A reviewer approves — and a real GitLab remediation opens with the evidence attached.

    Human approvaltool-call firewall
    tool · createGitlabIssue
    risk · medium · approval required
    ApproveReject
  7. 07
    Agent rerun

    Replay the exact scenario.

    Post-fix, the agent is forced through verification. This time it checks deployment and telemetry before it speaks.

    post-fix rerun replaying
    › searchSupportTickets
    ✓ checkDeploymentHistory
    ✓ checkRuntimeTelemetry
    › draftCustomerUpdate → calibrated
  8. 08
    Improvement proven

    31 → 92, on the record.

    Before/after isn’t a claim — it’s a regression test. Failed traces become a human-approved fix that’s verified to hold.

    0
    post-fix
    groundedness31% 92%
    required-toolfail pass
    claims1 0
    3/3 regression cases passed
[ MEANINGFUL MCP INTEGRATION ]

Grounded in Arize Phoenix.

Autopilot doesn’t guess. It treats Arize Phoenix as its source of truth — exporting every trace and eval, then querying them back through a real MCP tool contract before it says a single word about why a run failed.

01

Emit

Every agent run streams its trace spans + eval scores to Arize Phoenix as it executes.

02

Query via MCP

Autopilot pulls that observability data back through a Phoenix MCP tool contract — tools/list, tools/call.

03

Ground

Diagnoses cite only Phoenix-sourced spans and eval evidence. No invented facts, ever.

POST /api/mcp/phoenix
mcp
{ "method": "tools/list" }
query_run_observability(runId) — spans + eval scores
list_exported_runs() — everything in Phoenix
list_failing_runs() — anything not passing
tools/call → grounded diagnosis,
cited against { spans, evals } from Phoenix.
Phoenix observability — connectedArize Phoenix · MCP

Failed traces become a human-approved fix and a regression test.