Self-Healing RAG Pipeline in n8n — Case Study

Type: Self-monitoring RAG · MLOps loop
Stack: n8n · Neon Postgres + pgvector · OpenAI · Next.js
Scale: 13 workflows · 5 cron schedules · 3 migrations
Proven loop: canary win, +0.36 completeness, promoted on data

Figure: n8n workflow editor showing the P2 Doc-Change-Detector pipeline, with INSERT/CHANGED/DELETE branches converging through a wait-all merge before the document hash is updated. Built by Ali Jawwad.

The problem: RAG silently rots

Answers degrade as the corpus and models age, and nobody notices until a customer complains. Documents get edited while the search index goes stale. Embeddings drift even when nothing is edited. When quality drops, the cause is a mystery — and fixes get applied on faith, which means you can make things worse and never know.

This project applies MLOps thinking to RAG: every silent failure mode gets a watcher, and every fix has to prove itself with data before it's allowed to stay.

Two halves, one database — and not one HTTP call between them

The serving half answers users: embed the question, cosine-search the top chunks in pgvector, answer from a grounded prompt that cites chunk IDs, and log the full query — question, answer, retrieved chunks, latency, model version — before responding. The auditing half (five watcher and agent workflows) never talks to users and never calls the serving half. It reads the logs, judges quality, and writes fixes back to the same tables.
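The retrieval and prompting steps of the serving half can be sketched roughly as follows. This is a minimal illustration, not the project's actual workflow: the `chunks` table, its columns, and every function name here are assumptions. The only pgvector detail relied on is that `<=>` is its cosine-distance operator, so `1 - distance` gives cosine similarity.

```typescript
// Hypothetical row shape; the source does not name its tables or columns.
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
}

// Assumed pgvector query. `<=>` is cosine distance, so 1 - distance is similarity.
// Table and column names are illustrative only.
const TOP_K_SQL = `
  SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
  FROM chunks
  ORDER BY embedding <=> $1::vector
  LIMIT $2`;

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory equivalent of the query above, for illustration only.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, similarity: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Grounded prompt that asks the model to cite chunk IDs.
function buildPrompt(question: string, retrieved: Chunk[]): string {
  const context = retrieved.map((c) => `[${c.id}] ${c.content}`).join("\n");
  return (
    "Answer using only the context below and cite chunk IDs in brackets.\n\n" +
    `Context:\n${context}\n\nQuestion: ${question}`
  );
}
```

In the real pipeline the ranking happens inside Postgres; the in-memory `topK` exists only to make the ordering logic visible.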

Postgres is the API. Even the Next.js operator dashboard reads the observability tables directly and writes feedback back via a server action. Three independent pieces — workflows, agent, dashboard — cooperating purely through the database is what makes each one independently buildable, testable, and replaceable.

The watchers

Every 6 hours, a judge LLM grades sampled answers on relevance, accuracy, and completeness — deliberately a stronger, different model (gpt-4o) than the chat model (gpt-4o-mini), so the grader isn't grading its own work. “Is the system still good?” becomes a SQL query.
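One way the "is the system still good?" check could look is sketched below. The `eval_grades` table, its columns, the 24-hour window, the equal weighting of the three dimensions, and the threshold parameter are all assumptions for illustration; the source states only that answers are graded on relevance, accuracy, and completeness.

```typescript
// Hypothetical grade record written by the judge workflow.
interface Grade {
  relevance: number;
  accuracy: number;
  completeness: number;
}

interface QualitySummary {
  relevance: number;
  accuracy: number;
  completeness: number;
  overall: number;
  sampleSize: number;
}

// Assumed query shape; table name, column names, and window are illustrative.
const RECENT_QUALITY_SQL = `
  SELECT avg(relevance)    AS relevance,
         avg(accuracy)     AS accuracy,
         avg(completeness) AS completeness,
         count(*)          AS sample_size
  FROM eval_grades
  WHERE graded_at > now() - interval '24 hours'`;

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// In-memory equivalent of the aggregate above. Returns null when nothing was graded.
function summarizeGrades(grades: Grade[]): QualitySummary | null {
  if (grades.length === 0) return null;
  const relevance = mean(grades.map((g) => g.relevance));
  const accuracy = mean(grades.map((g) => g.accuracy));
  const completeness = mean(grades.map((g) => g.completeness));
  return {
    relevance,
    accuracy,
    completeness,
    // Assumption: overall is the unweighted mean of the three dimensions.
    overall: (relevance + accuracy + completeness) / 3,
    sampleSize: grades.length,
  };
}

// `floor` is a hypothetical alert threshold chosen by the operator.
function isStillGood(summary: QualitySummary, floor: number): boolean {
  return summary.overall >= floor;
}
```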

Every 5 minutes, a doc-change watcher hashes the corpus and exits in ~500ms when nothing changed. When something did, it diffs per chunk: new chunks are embedded, deleted ones removed, and changed ones pass through a cosine gate — similarity ≥ 0.95 means the edit was cosmetic (a typo, reformatting), so the expensive re-embed is skipped. Weekly, a drift detector compares live embeddings against a frozen baseline and flags anything that moved.
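The fast exit, the per-chunk diff, and the cosine gate might be structured as in the sketch below. All names are hypothetical, and chunk hashes are assumed to be keyed by a stable chunk ID. The source does not say how the old-versus-new similarity is obtained, so the gate simply takes it as an input; only the 0.95 threshold comes from the text.

```typescript
// Chunk ID -> content hash. The ID scheme and hash function are assumptions.
type ChunkHashes = Record<string, string>;

interface ChunkDiff {
  inserted: string[]; // new chunks: embed them
  changed: string[];  // edited chunks: send through the cosine gate
  deleted: string[];  // removed chunks: delete their rows
}

function diffChunks(stored: ChunkHashes, current: ChunkHashes): ChunkDiff {
  const diff: ChunkDiff = { inserted: [], changed: [], deleted: [] };
  for (const id of Object.keys(current)) {
    if (stored[id] === undefined) diff.inserted.push(id);
    else if (stored[id] !== current[id]) diff.changed.push(id);
  }
  for (const id of Object.keys(stored)) {
    if (current[id] === undefined) diff.deleted.push(id);
  }
  return diff;
}

// Fast path: if the whole-corpus hash is unchanged, skip the per-chunk diff.
function detectChanges(
  storedCorpusHash: string,
  currentCorpusHash: string,
  stored: ChunkHashes,
  current: ChunkHashes
): ChunkDiff | null {
  if (storedCorpusHash === currentCorpusHash) return null;
  return diffChunks(stored, current);
}

// Threshold taken from the text: similarity >= 0.95 counts as a cosmetic edit.
const COSMETIC_THRESHOLD = 0.95;

function isCosmeticEdit(similarity: number): boolean {
  return similarity >= COSMETIC_THRESHOLD;
}
```

Returning `null` on the fast path keeps "nothing changed" distinct from "changed, but the diff came back empty", which is useful when logging watcher runs.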

The investigator and the canary

When grades drop, an AI investigator agent — running entirely inside n8n's AI Agent node, in JSON mode — reads the evidence and writes a structured diagnosis: root cause plus a concrete recommended fix. The closed loop was proven end to end on real data: the eval loop flagged a weak answer (3.00/5), the agent recommended “increase k to 8 to retrieve more changelog chunks,” and a canary deployment A/B-tested it 50/50 against control.
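Because the agent runs in JSON mode, its output can be validated before it is written back to the database. The sketch below assumes a two-field shape matching the "root cause plus recommended fix" described above; the field names `rootCause` and `recommendedFix` are hypothetical and not taken from the project.

```typescript
// Hypothetical diagnosis shape; the real schema is not given in the source.
interface Diagnosis {
  rootCause: string;
  recommendedFix: string;
}

// Parse and validate the agent's JSON output. Throws on anything malformed,
// so a bad diagnosis never reaches the tables the dashboard reads.
function parseDiagnosis(raw: string): Diagnosis {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("diagnosis must be a JSON object");
  }
  const record = data as Record<string, unknown>;
  const rootCause = record["rootCause"];
  const recommendedFix = record["recommendedFix"];
  if (typeof rootCause !== "string" || rootCause.trim() === "") {
    throw new Error("rootCause must be a non-empty string");
  }
  if (typeof recommendedFix !== "string" || recommendedFix.trim() === "") {
    throw new Error("recommendedFix must be a non-empty string");
  }
  return { rootCause, recommendedFix };
}
```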

The verdict came from the data, not from hope: control scored 4.65 overall and the canary 4.88 (+0.23), with completeness up 0.36. The operator promoted the new config, accepting 3.3s of added latency for a measured quality gain. Grade → diagnose → A/B test → data-backed promotion, with no human in the loop except the final click.
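A canary comparison of this kind could be sketched as below. The random 50/50 assignment, the type names, and the `minGain` parameter are assumptions; the source does not describe how traffic was split or what margin was required. The function only reports deltas, since the promotion decision stays with the operator.

```typescript
type Arm = "control" | "canary";

// 50/50 split. The random source is injectable so the split can be tested.
function assignArm(rand: () => number = Math.random): Arm {
  return rand() < 0.5 ? "canary" : "control";
}

// Mean judge scores and latency per arm. Field names are hypothetical.
interface ArmStats {
  overall: number;
  completeness: number;
  latencySeconds: number;
}

interface Verdict {
  overallDelta: number;
  completenessDelta: number;
  latencyDeltaSeconds: number;
  qualityImproved: boolean;
}

// Compare canary against control. `minGain` is an assumed minimum overall
// improvement required before the canary is reported as better.
function compareArms(control: ArmStats, canary: ArmStats, minGain: number = 0): Verdict {
  const overallDelta = canary.overall - control.overall;
  return {
    overallDelta,
    completenessDelta: canary.completeness - control.completeness,
    latencyDeltaSeconds: canary.latencySeconds - control.latencySeconds,
    qualityImproved: overallDelta > minGain,
  };
}
```

Keeping latency in the verdict alongside quality makes the trade-off explicit at the moment of promotion, rather than leaving it to be discovered afterwards.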

Why n8n

Cron, webhooks, HTTP, Postgres, and an AI Agent node are all native — a whole backend's worth of plumbing without writing or hosting a custom server. Workflows export as JSON and live in version control. The result is a production feedback loop with zero Python runtime to operate.

Stack

n8n
Postgres
pgvector
OpenAI
Claude
Neon

A RAG pipeline that notices it's getting worse — and fixes itself