Langley: a multi-agent AI that catches Solana token scams — and proves it
Give Langley a Solana token address and it tells you how risky it is — with a clear verdict, a confidence score, and the exact on-chain data behind every claim. Instead of one big prompt, it works like a small intelligence agency: three specialist agents that collaborate behind a trust design where plain code can override the AI, but only ever toward the safer answer.
- Type: Multi-agent AI system
- Stack: Python · OpenAI Agents SDK · GPT-4o · FastAPI
- Evaluation: F1 ≈ 0.94, zero fatal errors (held-out set)
- Quality gates: 49 tests · Pyright strict · Ruff

The problem
Cheap Solana memecoins are full of rug pulls (creators drain the liquidity) and honeypots (the contract lets you buy but never sell). Regular people lose real money to these, and checking a token by hand is slow and needs expert knowledge.
One rule shaped every design decision in this project: the worst possible mistake is telling someone a scam is “safe.” So the system is built to prefer an honest “not sure” over a confident wrong answer.
A team of agents, not one big prompt
Langley splits the job across three specialists built on the OpenAI Agents SDK, each with its own tools and structured output schema:
- The Risk Guardian (the judge) — issues the verdict: safe, caution, unsafe, or abstain. Every risk it names must point at a real data field and value; no evidence, no claim.
- On-Chain Forensics (the investigator) — reports neutral facts (liquidity, holders, mint authority, age) and is structurally forbidden from saying “safe” or “unsafe.”
- The Synthesis Orchestrator (the editor) — runs both agents in parallel and fuses their reports into one briefing. Crucially, the final verdict is copied verbatim from the judge: the fusing model can narrate, but it can never overrule.
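The orchestration pattern described in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the real agents are built on the OpenAI Agents SDK, whereas here `run_guardian`, `run_forensics`, and `narrate` are hypothetical stubs, and the report classes and their fields are invented for the example. What it shows is the shape of the rule: both specialists run concurrently, and the final verdict is read from the judge's report rather than from anything the fusing step produces.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardianReport:
    verdict: str       # "safe", "caution", "unsafe", or "abstain"
    confidence: float
    summary: str


@dataclass(frozen=True)
class ForensicsReport:
    facts: dict[str, object]  # neutral facts only; there is no verdict field


@dataclass(frozen=True)
class Briefing:
    verdict: str
    narrative: str


async def run_guardian(token_address: str) -> GuardianReport:
    # Hypothetical stub standing in for the Risk Guardian agent call.
    await asyncio.sleep(0)
    return GuardianReport("caution", 0.7, "Mint authority is still active.")


async def run_forensics(token_address: str) -> ForensicsReport:
    # Hypothetical stub standing in for the On-Chain Forensics agent call.
    await asyncio.sleep(0)
    return ForensicsReport({"liquidity_usd": 12000.0, "mint_authority_active": True})


def narrate(guardian: GuardianReport, forensics: ForensicsReport) -> str:
    # Hypothetical stub for the fusing model: it only produces prose.
    facts = ", ".join(f"{k}={v}" for k, v in sorted(forensics.facts.items()))
    return f"{guardian.summary} Observed facts: {facts}."


async def synthesize(token_address: str) -> Briefing:
    # Run both specialists concurrently and wait for both reports.
    guardian, forensics = await asyncio.gather(
        run_guardian(token_address), run_forensics(token_address)
    )
    # The verdict is copied from the judge, never derived from the narrative.
    return Briefing(verdict=guardian.verdict, narrative=narrate(guardian, forensics))


briefing = asyncio.run(synthesize("ExampleTokenAddress"))
print(briefing.verdict)
print(briefing.narrative)
```

Because `Briefing.verdict` is assigned directly from `guardian.verdict`, no output of `narrate` can change it, which is the property the editor role is meant to have.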
The three-layer trust design
To guarantee the AI never ships a confident wrong “safe,” three independent layers check each other. The prompt instructs the model to cite evidence and abstain on thin data. Pydantic schemas enforce structure — a verdict must carry evidence, an abstention must carry a reason. And a deterministic safety gate — plain, non-AI code — re-checks the final answer and can override it: forcing “unsafe” on an obvious scam pattern, or forcing “not sure” when the model cited data that wasn't actually there. The gate only ever moves the answer in the safer direction.
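The schema layer's two rules (a verdict must carry evidence, an abstention must carry a reason) can be expressed as a small validation sketch. The project uses Pydantic for this; to keep the example dependency-free, the version below swaps in standard-library dataclasses with a `__post_init__` check, which is a stand-in and not the project's code. The class and field names (`Evidence`, `Verdict`, `abstain_reason`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

VERDICTS = {"safe", "caution", "unsafe", "abstain"}


@dataclass(frozen=True)
class Evidence:
    field_name: str  # name of the data field the claim points at
    value: object    # the value the model says it saw in that field


@dataclass(frozen=True)
class Verdict:
    verdict: str
    confidence: float
    evidence: tuple[Evidence, ...] = ()
    abstain_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.verdict == "abstain":
            # An abstention must explain itself.
            if not self.abstain_reason:
                raise ValueError("an abstention must carry a reason")
        elif not self.evidence:
            # Any other verdict must point at concrete evidence.
            raise ValueError("a verdict must carry evidence")


backed = Verdict("unsafe", 0.9, (Evidence("liquidity_usd", 0.0),))
unsure = Verdict("abstain", 0.2, abstain_reason="holder data unavailable")
```

With this shape, an unsupported claim such as `Verdict("safe", 0.9)` fails at construction time, before any downstream code sees it.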
Think of it like a newsroom: the reporter writes the story, but a fact-checker can pull or correct it before publication.
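The fact-checker in this analogy is the deterministic safety gate. A minimal sketch of such a gate follows, with several assumptions labeled loudly: the caution ordering in `CAUTION_RANK`, the specific "obvious scam" rule (active mint authority plus an empty pool), and the field names are all hypothetical, chosen only to show how plain code can override a verdict while moving it in one direction only.

```python
from typing import Mapping

# Assumed ordering from least to most cautious; the source does not specify one.
CAUTION_RANK = {"safe": 0, "caution": 1, "abstain": 2, "unsafe": 3}


def more_cautious(a: str, b: str) -> str:
    # Return whichever verdict ranks as more cautious.
    return a if CAUTION_RANK[a] >= CAUTION_RANK[b] else b


def safety_gate(
    model_verdict: str,
    cited: Mapping[str, object],
    data: Mapping[str, object],
) -> str:
    final = model_verdict

    # Rule 1 (hypothetical scam pattern): supply can still be minted
    # and the liquidity pool is empty.
    if data.get("mint_authority_active") is True and data.get("liquidity_usd") == 0:
        final = more_cautious(final, "unsafe")

    # Rule 2: every cited field must exist in the data with the cited value.
    for name, value in cited.items():
        if name not in data or data[name] != value:
            final = more_cautious(final, "abstain")
            break

    return final


scam_data = {"mint_authority_active": True, "liquidity_usd": 0.0}
clean_data = {"mint_authority_active": False, "liquidity_usd": 50000.0}

forced_unsafe = safety_gate("safe", {"liquidity_usd": 0.0}, scam_data)
forced_abstain = safety_gate("safe", {"holder_count": 900}, clean_data)
kept_unsafe = safety_gate("unsafe", {"liquidity_usd": 50000.0}, clean_data)
kept_safe = safety_gate("safe", {"liquidity_usd": 50000.0}, clean_data)
```

Every override goes through `more_cautious`, so the gate can raise the caution level of a verdict but has no code path that lowers it.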
Honest evaluation: from a lying 100% to a real 0.94
Early tests on synthetic data scored a perfect 100% — and that number was a lie. The fix was building an outcome-verified dataset: real tokens labeled by what actually happened to them, with a held-out test split the agent was never tuned on. The first honest score was F1 ≈ 0.63, and it exposed a real bug: the agent treated holder concentration as a scam signal, but that metric normally includes exchange and pool wallets, so it was over-warning on healthy tokens.
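To see why raw holder concentration misleads, consider a toy calculation. The balances, wallet labels, and the `top_holder_share` helper below are invented for illustration and are not Langley's data or code. Nor is this how the project fixed the bug: the source attributes the fix to tuning the agent's instructions. The sketch only shows the arithmetic behind the over-warning.

```python
def top_holder_share(
    balances: dict[str, float],
    top_n: int,
    exclude: frozenset[str] = frozenset(),
) -> float:
    # Share of total supply held by the top_n wallets, skipping excluded ones.
    total = sum(balances.values())
    held = sorted(
        (amount for wallet, amount in balances.items() if wallet not in exclude),
        reverse=True,
    )
    return sum(held[:top_n]) / total


# Made-up balances for a healthy token: most supply sits in infrastructure wallets.
balances = {
    "liquidity_pool": 600.0,
    "exchange_wallet": 200.0,
    "carol": 100.0,
    "alice": 50.0,
    "bob": 50.0,
}

raw_share = top_holder_share(balances, top_n=2)
adjusted_share = top_holder_share(
    balances, top_n=2, exclude=frozenset({"liquidity_pool", "exchange_wallet"})
)
print(f"raw top-2 share: {raw_share:.2f}")
print(f"excluding pool and exchange: {adjusted_share:.2f}")
```

In this toy case the top two wallets hold most of the supply only because they are the pool and an exchange; once those are set aside, the largest individual holders control a much smaller share.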
After tuning the instructions on a training split and re-measuring once on the blind set, the agent reached F1 ≈ 0.94 with zero fatal errors — no scam ever called safe. The number is lower than a demo benchmark would claim, and that's the point: it's true.
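For readers unfamiliar with the two metrics, here is a sketch of how F1 and the fatal-error count could be computed from outcome-labeled samples. The mapping from four verdicts to a binary prediction is an assumption (here "unsafe" and "caution" count as flagging a scam, while "abstain" counts as not flagged but is never fatal); the source does not say how Langley's evaluation handles this. The sample list is toy data and does not reproduce the project's results.

```python
def evaluate(samples: list[tuple[bool, str]]) -> dict[str, float]:
    # Each sample is (is_scam, verdict), where is_scam is the verified outcome.
    flagged = {"unsafe", "caution"}  # assumption: these verdicts count as "scam"

    tp = sum(1 for is_scam, v in samples if is_scam and v in flagged)
    fp = sum(1 for is_scam, v in samples if not is_scam and v in flagged)
    fn = sum(1 for is_scam, v in samples if is_scam and v not in flagged)
    # A fatal error is a real scam that was called "safe".
    fatal = sum(1 for is_scam, v in samples if is_scam and v == "safe")

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fatal_errors": fatal}


# Toy samples, not the project's dataset.
toy_samples = [
    (True, "unsafe"),
    (True, "unsafe"),
    (True, "abstain"),   # missed scam, but not fatal: the agent did not say "safe"
    (False, "safe"),
    (False, "caution"),  # over-warning on a healthy token
    (False, "safe"),
]

metrics = evaluate(toy_samples)
print(metrics)
```

Tracking fatal errors separately from F1 matters because F1 treats every miss alike, while the project's governing rule singles out one kind of miss, a scam called safe, as the worst outcome.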
Engineering notes
The codebase is a uv-managed Python monorepo where every agent follows the same assembly line — data providers → tools → agent → service → safety gate — so learning one package means understanding all of them. Market data comes from DexScreener and contract-level data from Helius, both behind a swappable provider interface: Helius was added later with zero changes to the agents. A rate-limited FastAPI app serves the “classified dossier” demo UI. Quality is enforced with 49 tests, strict Pyright type-checking, and Ruff.
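The swappable provider interface can be illustrated with `typing.Protocol`. This is a sketch under assumptions, not the project's actual interface: `DataProvider`, `collect_facts`, and the two stub classes are hypothetical names, and the stubs return fixed values instead of calling DexScreener or Helius.

```python
from typing import Protocol


class DataProvider(Protocol):
    # Hypothetical interface: any object with a name and a fetch method qualifies.
    name: str

    def fetch(self, token_address: str) -> dict[str, object]: ...


class StubMarketProvider:
    # Stands in for a market-data source; returns fixed values, no network call.
    name = "market"

    def fetch(self, token_address: str) -> dict[str, object]:
        return {"liquidity_usd": 12000.0}


class StubContractProvider:
    # Stands in for a contract-level source; returns fixed values, no network call.
    name = "contract"

    def fetch(self, token_address: str) -> dict[str, object]:
        return {"mint_authority_active": False}


def collect_facts(token_address: str, providers: list[DataProvider]) -> dict[str, object]:
    # Merge every provider's fields, prefixing keys with the provider name.
    facts: dict[str, object] = {}
    for provider in providers:
        for key, value in provider.fetch(token_address).items():
            facts[f"{provider.name}.{key}"] = value
    return facts


market_only = collect_facts("ExampleTokenAddress", [StubMarketProvider()])
with_contract = collect_facts(
    "ExampleTokenAddress", [StubMarketProvider(), StubContractProvider()]
)
print(market_only)
print(with_contract)
```

Adding the second provider changes only the list passed in; `collect_facts` and anything that consumes its output stay untouched, which is the property that lets a new data source be added without modifying the agents.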