AgentOps Autopilot traces every agent run, scores it, diagnoses failures with cited evidence, then turns them into a human-approved fix and a regression test. Sentry + Datadog + GitLab CI — for AI agents.
“…caused by last night’s deployment.” — never verified.
A live run · fractured, diagnosed, and proven
Scroll the live trace. This is the exact path AgentOps walks every failed run — autonomous diagnosis, human-approved remediation, proof it improved.
An escalation agent investigates Acme’s checkout outage and ships a confident customer reply. Green across the board.
Evals catch what a human skim wouldn’t: a customer-facing causal claim with zero verifying evidence behind it.
Every source the agent touched becomes a node. Deployment history and telemetry light up as dashed red gaps.
Autopilot reconstructs the causal chain from trace + eval evidence and names the exact step that broke.
The agent made a customer-facing causal claim without calling checkDeploymentHistory or checkRuntimeTelemetry. Required-tool eval scored 0%.
A policy diff lands: verify deployment + telemetry before any root-cause claim, or express calibrated uncertainty.
Nothing self-modifies. A reviewer approves — and a real GitLab remediation opens with the evidence attached.
Post-fix, the agent is forced through verification. This time it checks deployment and telemetry before it speaks.
Before/after isn’t a claim — it’s a regression test. Failed traces become a human-approved fix that’s verified to hold.
Autopilot doesn’t guess. It treats Arize Phoenix as its source of truth — exporting every trace and eval, then querying them back through a real MCP tool contract before it says a single word about why a run failed.
Every agent run streams its trace spans + eval scores to Arize Phoenix as it executes.
Autopilot pulls that observability data back through a Phoenix MCP tool contract — tools/list, tools/call.
Diagnoses cite only Phoenix-sourced spans and eval evidence. No invented facts, ever.