The Twilio Incident
A code bug triggered a runaway SMS process through Twilio. In five days:
Total messages sent: 1,039,939
Failed: 528,000
Undelivered: 394,000
Actually delivered: 117,000
Still queued when caught: 33,900
The code had a loop that was supposed to send follow-up SMS messages to leads who hadn't responded. The loop didn't have a proper termination condition. It also didn't check whether a message had already been sent. It ran against the production database.
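For contrast, here is a hedged reconstruction in Python of the two guards that loop was missing: a hard termination condition and an already-sent check. Every name here (send_followups, already_sent, MAX_SENDS_PER_RUN) is illustrative; the actual application code isn't shown in this post.

```python
MAX_SENDS_PER_RUN = 500  # hard cap: the termination condition the real loop lacked

def send_followups(leads, already_sent, send_followup):
    """Send at most MAX_SENDS_PER_RUN follow-ups, skipping anyone already messaged.

    `already_sent` is a set of lead IDs; `send_followup` is the actual sender.
    """
    sent = 0
    for lead in leads:
        if sent >= MAX_SENDS_PER_RUN:   # terminate: bounded work per run
            break
        if lead in already_sent:        # idempotency: never double-send
            continue
        send_followup(lead)
        already_sent.add(lead)          # record the send before moving on
        sent += 1
    return sent
```

Either guard alone would have capped the damage; together they make a re-run of the job a no-op instead of a second million messages.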
Over a million messages went out in hours, not days. The database buckled under the write load; the connection pool was overwhelmed. Chase flagged the charges as fraud: first a $100 decline, then a $200 one. But Twilio's queue doesn't stop when your credit card declines. The messages kept queuing and retrying. 33,900 were still sitting in the pipeline when we finally killed the process.
Leads who had opted out received messages. Leads who had already converted received re-engagement texts. Some leads got the same message dozens of times. The TCPA exposure — sending unsolicited messages to people who had explicitly opted out — is the kind of thing that generates class action lawsuits.
The engineering team's response: "The deploy didn't land in production."
It had. The evidence was in a million message logs.
Twilio offered $5,119 back — 75% of the charges. I took it immediately.
That was the moment I stopped trusting humans with production systems. Not because humans are bad at engineering. But because the cost of a mistake in a system processing 30,000 conversations a day is catastrophic, and humans make mistakes at a rate that's incompatible with that scale.
The engineers who manually intervened to kill the queue — the same team I was about to fire — did solid crisis response. They canceled the 33,900 queued messages via the Twilio API. I was grateful. I told them so. They went right back to doing nothing the next week. They fixed the acute crisis but never built anything to prevent the next one. That pattern — heroic firefighting, zero prevention — is why the team no longer exists.
Everything I built after traces back to this incident:
# From bash-guard.sh — the PreToolUse hook born from this incident
#
# THREE jobs, each deterministic:
# 1. BLOCK destructive commands (SQL, kubectl mutations, rm -rf)
# 2. DETECT deploy commands (set IS_DEPLOY flag)
# 3. GATE git push → require fresh-eyes + syntax check
#
# Exit codes: 0 = allow, 2 = BLOCK
# This hook NEVER prints instructions for Claude to follow.
# It either BLOCKS or ALLOWS. That's it.
#
# STATE PHILOSOPHY (Mar 22, 2026):
# All checks are STATELESS or session-scoped.
# No persistent marker files.
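The real hook is bash, but its block-or-allow contract is simple enough to sketch in a few lines of Python. The patterns below are illustrative, not the hook's actual list; the stdin-JSON shape matches how Claude Code delivers tool input to PreToolUse hooks.

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse guard: exit 0 allows the command, exit 2 blocks it.
No third outcome, no instructions printed for the model to follow."""
import json
import re
import sys

# Illustrative destructive patterns (the real hook's list is longer and tighter).
BLOCK_PATTERNS = [
    r"\brm\s+-rf\b",                                   # recursive force-delete
    r"\bpsql\b.*\b(drop|delete|truncate|update)\b",    # raw SQL mutations
    r"\bkubectl\s+(delete|apply|scale)\b",             # cluster mutations
]

def decide(command: str) -> int:
    """Return 0 to allow, 2 to block — the hook's only two outcomes."""
    for pattern in BLOCK_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            return 2
    return 0

def main() -> int:
    # PreToolUse hooks receive a JSON payload on stdin; the Bash tool's
    # command lives under tool_input.command.
    payload = json.load(sys.stdin)
    return decide(payload.get("tool_input", {}).get("command", ""))

# When installed as a hook: sys.exit(main())
```

The point of the design is determinism: a regex either matches or it doesn't, so the same command is blocked every time, regardless of what the model intended.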
But the hook is the seatbelt. The infrastructure limits are the crash barriers on the highway. Twilio's daily send cap, set in the Twilio console, is something no code bug can override. Postgres connection limits and statement timeouts mean the database says no when the application goes insane. The external watchdog — a standalone Python script on the Mac Studio in its own tmux session, querying the database every five minutes via a completely separate Twilio number — texts my personal phone if anything looks wrong.
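The watchdog's essential logic is small enough to sketch. The thresholds, metric names, and environment variable names below are assumptions for illustration; the Messages endpoint and Basic-auth shape come from Twilio's public REST API.

```python
"""Sketch of an external watchdog: poll a metric, decide, alert.
It deliberately shares no code, credentials, or process with the app it watches."""
import base64
import os
import urllib.parse
import urllib.request

ALERT_THRESHOLD = 2_000   # assumed cap: outbound messages per 5-minute window
QUEUE_THRESHOLD = 10_000  # assumed cap: messages sitting in the send queue

def looks_wrong(sent_last_window: int, queue_depth: int) -> bool:
    """The watchdog's whole decision: is the send rate or the queue abnormal?"""
    return sent_last_window > ALERT_THRESHOLD or queue_depth > QUEUE_THRESHOLD

def send_alert(body: str) -> None:
    """Text the on-call phone via the watchdog's own Twilio number,
    using credentials the application never sees."""
    sid = os.environ["WATCHDOG_TWILIO_SID"]
    token = os.environ["WATCHDOG_TWILIO_TOKEN"]
    url = f"https://api.twilio.com/2010-04-01/Accounts/{sid}/Messages.json"
    data = urllib.parse.urlencode({
        "From": os.environ["WATCHDOG_FROM"],   # the separate Twilio number
        "To": os.environ["ONCALL_PHONE"],
        "Body": body,
    }).encode()
    request = urllib.request.Request(url, data=data)
    auth = base64.b64encode(f"{sid}:{token}".encode()).decode()
    request.add_header("Authorization", f"Basic {auth}")
    urllib.request.urlopen(request)

# The real script wraps this in a loop: query the database, call looks_wrong(),
# send_alert() on True, sleep five minutes, repeat.
```

Because the check is a plain threshold comparison on numbers pulled straight from the database, there's nothing for a code bug in the application to subvert.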
The AI that builds my code never sees the watchdog script. Never edits it. Never deploys it. It exists in a different tmux session, on a different code path, using different credentials. Defense in depth, where each layer is physically separate from the others.
Every safety mechanism in my system is a monument to this incident. Every hook is a scar. The CLAUDE.md database safety section, the bottom-up analysis law, the fresh-eyes review gate, the pre-push check — all of it starts here, in a million text messages and a database on its knees.