Autonomous Pipeline Remediation

Data pipelines break in quiet ways. A schema changes upstream. A connector drops a field. A dbt model references a table that no longer exists. The data keeps flowing, but the numbers are wrong, and nobody notices until a stakeholder asks why the dashboard looks off.

This system was built to catch those failures early and, where possible, fix them without human intervention.

How it works

The system operates in two halves: monitoring and remediation.

Monitoring runs four times daily across 100+ customer-specific marketplace feeds in a multi-tenant BigQuery environment. It generates coverage snapshots at every layer of the pipeline — ingestion, staging, transformation, unified schema, and analytics — and computes adaptive staleness thresholds per customer and schema based on historical update cadence. When something crosses a threshold, the system creates a ticket with full diagnostic context: which accounts are affected, which schemas, how long the data has been stale, and what changed.

Six alert types cover the spectrum: stale data, empty schemas, pipeline failures, freshness drops, operational drops, and API health issues. Each alert type groups related problems into a single issue to avoid alert fatigue.
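The grouping step can be sketched as a simple aggregation, one issue per alert type and schema. Field names here (`alert_type`, `schema`, `account`) are illustrative, not the system's actual payload:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse individual alerts into one issue per (alert_type, schema).

    Each issue lists every affected account, so a schema that goes stale
    across twenty accounts produces one ticket, not twenty.
    """
    buckets = defaultdict(list)
    for alert in alerts:
        buckets[(alert["alert_type"], alert["schema"])].append(alert["account"])
    return [
        {"alert_type": atype, "schema": schema, "accounts": sorted(set(accounts))}
        for (atype, schema), accounts in buckets.items()
    ]
```

The key design choice is that deduplication happens before ticket creation, so downstream agents always investigate one issue per root symptom.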

Remediation picks up those tickets with a team of AI agents built on the Claude Agent SDK.

A Research Agent investigates each issue autonomously — querying BigQuery for coverage data, reading dbt model source code via GitHub, checking recent commits for breaking changes, and following per-alert-type investigation protocols. It produces a structured root cause analysis, classifies whether the issue is agent-fixable, and posts its findings back to the ticket.

For issues it can fix (primarily dbt model bugs), a Resolution Agent takes over. It clones the target repository, creates a branch, implements a minimal fix, and validates it against four checks: dbt parse, dbt compile, SQL linting, and code formatting. If everything passes, it opens a pull request with a structured description: root cause, changes made, validation results, and impact assessment.
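The validation gates can be modeled as an ordered sequence of subprocess checks that short-circuits on the first failure. The exact commands below are assumptions (the real pipeline may invoke different linting and formatting tools); the structure is what matters:

```python
import subprocess

# The four gates named above; commands are illustrative guesses.
VALIDATION_GATES = [
    ("dbt parse", ["dbt", "parse"]),
    ("dbt compile", ["dbt", "compile"]),
    ("sql lint", ["sqlfluff", "lint", "models/"]),
    ("formatting", ["sqlfmt", "--check", "models/"]),
]

def run_gates(gates, cwd="."):
    """Run each gate in order, stopping at the first failure.

    Returns (passed, results) where results maps gate name to exit code.
    A pull request is opened only when every gate passes.
    """
    results = {}
    for name, cmd in gates:
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True)
        results[name] = proc.returncode
        if proc.returncode != 0:
            return False, results
    return True, results
```

Short-circuiting keeps failed runs cheap: a fix that doesn't even parse never reaches the slower compile and lint stages.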

When a data engineer reviews the PR and requests changes, a Revision Agent picks up the feedback and pushes a revised commit — up to two revision cycles per PR.

What made it interesting

The cost discipline was essential for production viability. Each investigation costs roughly $0.75, each fix about $0.25, and each revision around $0.20. Batch-level safeguards cap spending at $50 per run and 10 issues per run, whichever is hit first. The system processes remaining issues on the next cycle.
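The batch safeguard amounts to a loop that stops at whichever cap is hit first and carries the rest forward. A minimal sketch, with the function and parameter names invented for illustration:

```python
def run_batch(issues, investigate, cost_cap=50.0, issue_cap=10):
    """Process issues until the dollar cap or the issue cap is reached.

    `investigate` is a callable that handles one issue and returns its
    dollar cost. Unprocessed issues are returned so the next cycle can
    pick them up.
    """
    spent = 0.0
    handled = []
    for issue in issues:
        if len(handled) >= issue_cap or spent >= cost_cap:
            break  # leftovers wait for the next run
        spent += investigate(issue)
        handled.append(issue)
    remaining = issues[len(handled):]
    return handled, remaining, spent
```

Checking the caps before each issue (rather than after) means one expensive investigation can push the run slightly over the dollar cap, but the batch can never start an issue once either limit is reached.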

The monitoring layer’s adaptive staleness thresholds eliminated the biggest source of false alerts. Instead of a fixed “stale after 3 days” rule, each customer-schema pair gets its own threshold based on its actual update frequency. A table that updates daily triggers after 2 days. A table that updates weekly doesn’t fire until day 14.

The agent architecture enforces a clear separation between read-only investigation and write-capable resolution. The Research Agent can query data and read code but cannot modify anything. The Resolution Agent can write code but operates within a disposable workspace with hard validation gates before any PR is opened.
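The read/write split can be enforced with per-role tool allowlists checked before any tool call. The tool names here are assumptions standing in for the system's actual tool registry:

```python
# Research agents get read-only tools; resolution agents add write access.
RESEARCH_TOOLS = {"bigquery_query", "github_read_file", "github_list_commits"}
RESOLUTION_TOOLS = RESEARCH_TOOLS | {"git_branch", "git_commit", "open_pull_request"}

def check_tool(agent_role, tool):
    """Raise PermissionError if the role is not allowed to call the tool."""
    allowed = RESEARCH_TOOLS if agent_role == "research" else RESOLUTION_TOOLS
    if tool not in allowed:
        raise PermissionError(f"{agent_role} agent may not call {tool}")
    return True
```

Making the boundary a hard allowlist, rather than prompt guidance, means an investigation can never mutate a repository even if the model attempts to.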

Tech stack

  • Python, Claude Agent SDK, Claude Sonnet for agent orchestration
  • BigQuery for pipeline monitoring, coverage snapshots, and audit logging
  • dbt as the transformation layer being monitored and repaired
  • Linear for issue tracking and agent-to-human communication
  • GitHub for code access, PR creation, and revision management
  • Slack for real-time alerting on new issues