Deployment
Four containers in one pod, Cloud Build, and 16 Discord channels watching everything.
When you're the only person operating a system that processes 30,000 leads a day, deployment infrastructure isn't optional. It's the difference between a bad deploy costing you an hour and costing you 8 days. I know this because an 8-day doom loop once cost me $4,000-$5,000 per day in lost revenue.
That story is at the bottom of this page. Read the architecture first. Then you'll understand why every piece exists.
The 4-Container GKE Pod
One Kubernetes pod. Four containers. All running the same Docker image with different entrypoints. This is the core of the Landlord Rep agent — the system that operates as the leasing office for 8 buildings with zero human staff.
The pod is named homeeasy-ai-service-v3. Legacy name from a time when this was the third rewrite. It stuck because renaming a GKE deployment mid-production is a great way to lose an afternoon you'll never get back.
gunicorn -w 1 -b 0.0.0.0:5012 app:app --log-level debug --timeout 60
Container 1, the web server. 250m CPU, 512Mi RAM. The lightest container. Receives Twilio webhooks, serves API endpoints. One worker because the real work gets handed off to Celery immediately. The 60-second timeout matters: Twilio will retry if your webhook doesn't respond in time, and duplicate inbound processing is worse than a slow response.
celery -A app.celery beat --loglevel=info
Container 2, Celery beat. 500m CPU request, 2Gi RAM limit. The cron scheduler. Fires periodic tasks: followup checks every 30 minutes, stale context cleanup, SLA nag sequences, daily owner reports. Beat doesn't execute tasks; it drops them into the RabbitMQ queue. But it needs enough CPU to not drift on timing. A beat that fires 3 minutes late compounds into missed followup windows across hundreds of leads.
celery -A app.celery worker --loglevel=info -c 6 --queues=ai_client_service_v3 --time-limit=1700 --soft-time-limit=1600 --max-tasks-per-child=10 --prefetch-multiplier=1
Container 3, the Celery worker. 1000m CPU, 2Gi RAM. The workhorse. 6 concurrent worker processes handling AI conversations in parallel. Every flag on that command line exists because of a production incident.
Container 4, a second Celery worker. Identical configuration to Container 3. Redundancy and throughput. If one worker container OOM-kills, the other keeps processing. Without it, a single container restart means 6 workers go dark simultaneously, and every lead in mid-conversation gets silence.
The service fronting this pod:
apiVersion: v1
kind: Service
metadata:
  name: homeeasy-ai-service-v3
spec:
  selector:
    app: homeeasy-ai-service-v3
  ports:
    - protocol: TCP
      port: 5012
      targetPort: 5012
  type: LoadBalancer
LoadBalancer type means GKE provisions a public IP. Twilio's webhooks hit that IP directly. No ingress controller, no nginx reverse proxy, no API gateway. One fewer thing to break.
Why Those Celery Flags
Every flag on the worker command exists because something went wrong in production without it.
--time-limit=1700 and --soft-time-limit=1600: these are seconds, not milliseconds. The soft limit (1600 seconds, about 26.7 minutes) raises a SoftTimeLimitExceeded exception inside the task so it can clean up: close database connections, flush partial state, send a Discord alert. The hard limit (1700 seconds, about 28.3 minutes) kills the worker process and replaces it. The gap gives 100 seconds for graceful shutdown. Without this pair, a stuck LLM call (Anthropic or Gemini timing out on their end) would hold a worker forever. Six stuck workers means zero capacity.
--max-tasks-per-child=10 — after 10 tasks, the child process restarts. This is the memory leak seatbelt. Python's garbage collector doesn't always reclaim everything, especially with large LLM response objects and 500-message chat histories that get parsed into dictionaries of dictionaries. After 10 conversations, restart the process. Clean slate. The 2Gi limit is tight enough that without this flag, you get OOM kills within hours.
--prefetch-multiplier=1 — each worker only grabs one task at a time from the queue. Default is 4, which means a worker pulls 4 tasks, starts processing one, and holds the other 3 hostage. If that worker OOM-kills during task 1, tasks 2-4 vanish. With multiplier=1, each task is only pulled when the worker is ready. Slower throughput, but no phantom task loss.
-c 6 — six concurrent workers per container. Two containers means 12 total. Each conversation takes 3-15 seconds depending on LLM response time. At peak, I've seen 40+ tasks queued simultaneously during a lead batch drop from the listing aggregator. 12 workers drain that in under a minute.
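The drain-time claim can be checked with back-of-the-envelope arithmetic. This is a minimal sketch, assuming every task takes the same time and a worker picks up the next task the moment it finishes one; the function name drain_seconds is illustrative and not part of the service.

```python
import math


def drain_seconds(queued_tasks: int, workers: int, seconds_per_task: float) -> float:
    """Upper bound on the time to empty the queue when all tasks take equally long."""
    rounds = math.ceil(queued_tasks / workers)  # full passes across the worker pool
    return rounds * seconds_per_task


# 40 queued tasks, 2 containers x 6 workers = 12 workers
best = drain_seconds(40, 12, 3)    # every LLM call returns in 3 seconds
worst = drain_seconds(40, 12, 15)  # every LLM call takes the full 15 seconds
print(best, worst)
```

With the figures quoted above, the estimate ranges from 12 seconds to 60 seconds, so a 40-task burst clears in about a minute even when every call is slow.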
Why These Sizes
Web server at 250m CPU, 512Mi RAM: it receives a Twilio webhook (a few KB of JSON), validates it, drops a task into RabbitMQ, and returns a 200. That's it. No heavy compute. The RAM limit is generous for what it does, but Flask + gunicorn + the imported module tree takes about 180Mi on startup. The remaining 330Mi is headroom for request spikes.
Workers at 1000m CPU, 2Gi RAM: LLM calls take wall-clock time, not CPU. But response parsing — extracting tool calls, building context windows, serializing conversation history — needs CPU. And it needs RAM. A lead with 500 messages has a chat history that, when loaded and parsed into the context window, can consume 800Mi. Multiply by 6 concurrent workers and the 2Gi limit is the floor, not the ceiling.
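A rough memory budget makes the "floor, not the ceiling" point concrete. This sketch assumes the ~180Mi import footprint is counted once for the whole container, which understates real usage because each child process carries its own overhead; max_heavy_conversations is a hypothetical helper, not a function from the codebase.

```python
def max_heavy_conversations(limit_mi: int, baseline_mi: int, per_conversation_mi: int) -> int:
    """How many large conversations fit under the container memory limit at once."""
    return max(0, (limit_mi - baseline_mi) // per_conversation_mi)


LIMIT_MI = 2048     # 2Gi container limit
BASELINE_MI = 180   # import footprint quoted for this image
HEAVY_MI = 800      # one 500-message history, loaded and parsed

print(max_heavy_conversations(LIMIT_MI, BASELINE_MI, HEAVY_MI))
```

Under these assumptions only two 800Mi conversations fit at the same time. A third one landing on the same container is enough to cross the limit, even though six workers are running.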
I've seen OOM kills when a worker handles a lead with 500+ message history. The container restarts. The in-flight task is lost. The lead doesn't get a response. And the only evidence is a few lines in kubectl describe pod:
    State:          Running
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Restart Count:  1
Exit code 137 is 128 + 9: the kernel's OOM killer sent SIGKILL to your process. There is no stack trace. There is no error log. There is no Discord alert. The process is gone. The only way to know it happened is to check restart counts or watch the pod events. That's why the Discord error pipeline exists: to catch everything upstream of the kill.
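Exit codes above 128 follow the usual convention of 128 plus the signal number, so they can be decoded mechanically. A small sketch, assuming a Linux host (signal names differ on other platforms); describe_exit_code is an illustrative helper.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a readable cause."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
            return f"killed by {sig.name}"
        except ValueError:
            pass  # not a known signal number, fall through to the plain status
    return f"exited with status {code}"


print(describe_exit_code(137))  # the OOM case from kubectl describe
print(describe_exit_code(1))    # an ordinary crash, e.g. an import error
```

Running this against the restart history is a quick way to separate kills (137) from application crashes (1) before reading any logs.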
Beat at 500m CPU request, 2Gi RAM limit: the RAM limit looks oversized for a scheduler. It is. But beat imports the same codebase as the workers (same Docker image, same Python modules), so the baseline memory footprint is the same ~180Mi. The CPU request of 500m ensures the scheduler doesn't get starved during pod resource contention. A beat that drifts on timing means followup sequences fire late, SLA checks miss their windows, and the daily owner report arrives at 11am instead of 7am.
Cloud Build Pipeline
Six steps. Push to the test-deployment branch triggers the build. The trigger name is homeeasy-amyservice-v1 — another legacy name. Renaming it would break the trigger ID references in deployment scripts.
steps:
  # Step 1: Fetch ConfigMap values for tests
  - name: 'gcr.io/cloud-builders/kubectl'
    id: FetchConfig
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials \
          [CLUSTER_NAME] --zone=us-central1
        kubectl get configmap [CONFIGMAP_NAME] \
          -o jsonpath='{.data.GOOGLE_API_KEY}' > /workspace/google_api_key.txt
        kubectl get configmap [CONFIGMAP_NAME] \
          -o jsonpath='{.data.MAY_AI_MODEL}' > /workspace/may_ai_model.txt
  # Step 2: Run tests
  - name: 'python:3.11'
    id: Tests
    entrypoint: bash
    args:
      - -lc
      - |
        python -m pip install -U pip
        pip install -r requirements.txt
        pip install pytest pytest-mock requests-mock
        export GOOGLE_API_KEY=$(cat /workspace/google_api_key.txt 2>/dev/null || echo "")
        export MAY_AI_MODEL=$(cat /workspace/may_ai_model.txt 2>/dev/null || echo "gemini-2.5-flash")
        python run_tests.py
  # Step 3: Build Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '--no-cache', '-t',
           'us-central1-docker.pkg.dev/[PROJECT_ID]/homeeasy-repos/homeeasy-ai-service-v3:latest',
           '.', '-f', './Dockerfile.prod']
    id: Build
  # Step 4: Push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push',
           'us-central1-docker.pkg.dev/[PROJECT_ID]/homeeasy-repos/homeeasy-ai-service-v3:latest']
    id: Push
  # Step 5: Stamp IMAGE_TAG in deployment.yaml
  - name: 'alpine'
    entrypoint: 'sh'
    args:
      - '-c'
      - |
        sed -i "s/IMAGE_TAG/latest/g" ./deployment.yaml
  # Step 6: Deploy to GKE
  - name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials \
          [CLUSTER_NAME] --zone=us-central1
        kubectl replace -f ./deployment.yaml || kubectl apply -f ./deployment.yaml
    env:
      - 'CLOUDSDK_COMPUTE_ZONE=us-central1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=[CLUSTER_NAME]'
Step 1 is unusual. Most CI pipelines don't reach into the live cluster during the test phase. Mine does because the tests need real API keys — specifically the Gemini API key to test the LLM fallback path. The keys live in the Kubernetes ConfigMap, not in the repo. So the pipeline pulls them from the cluster, writes them to files in /workspace/ (shared across all Cloud Build steps), and the test step reads them.
When Step 2 fails, the entire pipeline stops. No partial deploys. No "the tests failed but the image built anyway." This is the seatbelt. Before Cloud Build existed, I was doing kubectl apply from my laptop. One bad deploy that passes through a broken test gate is exactly how the 8-day doom loop started.
Step 5 does a sed replacement on the deployment YAML. The image tag in the repo says IMAGE_TAG as a placeholder. Cloud Build replaces it with latest. In a proper setup, this would be a commit SHA. Using latest means I can't pin to a specific image version for rollback — I have to rebuild from the git commit. That's a known weakness. The tradeoff is simplicity: one tag, one image, no tag garbage collection.
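For comparison, here is what stamping a commit SHA instead of latest could look like, written as a Python equivalent of the sed step. This is a sketch and not the pipeline's actual code: stamp_image_tag is a hypothetical helper, the example registry host is made up, and in Cloud Build the tag value would most likely come from the built-in $SHORT_SHA substitution on triggered builds.

```python
import re


def stamp_image_tag(manifest_text: str, tag: str) -> str:
    """Replace every IMAGE_TAG placeholder with a concrete tag, like the sed step."""
    # Image tags allow letters, digits, '_', '.', '-', up to 128 characters,
    # and may not start with '.' or '-'.
    if not re.fullmatch(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}", tag):
        raise ValueError(f"not a valid image tag: {tag!r}")
    return manifest_text.replace("IMAGE_TAG", tag)


manifest = "image: registry.example/repo/homeeasy-ai-service-v3:IMAGE_TAG\n"
print(stamp_image_tag(manifest, "3f9c2ab"), end="")
```

With a SHA in the manifest, rolling back becomes a matter of re-applying an older manifest, at the cost of old tags accumulating in the registry.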
Step 6 runs kubectl replace first, falling back to kubectl apply. Replace is a full object replacement — it overwrites the entire deployment spec. Apply is a merge patch — it only changes what's different. Replace is more predictable: you get exactly what's in the YAML file, no leftover fields from previous applies. The fallback to apply handles the case where the deployment doesn't exist yet (first deploy to a new cluster).
20+ Environment Variables
Every container in the pod gets the same set of environment variables. They come from a Kubernetes ConfigMap named [CONFIGMAP_NAME]. The real name ends in a b4hc suffix. That suffix is a content hash, the kind that tools such as Kustomize append to track ConfigMap versions.
Separation of config from code. The Docker image contains the application. The ConfigMap contains the credentials and feature flags. Changing a flag doesn't require a rebuild. kubectl edit configmap and restart the pods.
# Database
DATABASE_URL                  # Production Postgres (read-write)
NEW_DATABASE_URL              # Newer Postgres instance (read-write)
READONLY_DATABASE_URL         # Read-only replica (analytics, probes)
# Message broker
CELERY_BROKER_URL             # RabbitMQ connection
CELERY_RESULT_BACKEND         # Redis for task results
# SMS
TWILIO_SERVICE_HOMEEASY       # Twilio messaging service SID
# LLM providers
ANTHROPIC_API_KEY             # Claude Opus 4.6 (primary brain)
GOOGLE_API_KEY                # Gemini (fallback, bulk work)
OPENAI_API_KEY                # Legacy, still referenced
MAY_AI_MODEL                  # Which Gemini model for fallback
# Monitoring
DISCORD_BOT_TOKEN             # Bot auth for 16 alert channels
DISCORD_WEBHOOK               # Legacy webhook (being migrated to bot)
# Integrations
FUB_API                       # Follow Up Boss CRM API key
FUB_TEXTING_SERVICE           # FUB texting endpoint
BUILDING_SERVICE_API          # Internal building data API
VOICE_AI_API_KEY              # Voice call provider
ASANA_ACCESS_TOKEN            # Task management (Landlord Rep only)
# Tracing
LANGSMITH_TRACING             # LangSmith trace enabled flag
LANGSMITH_API_KEY             # LangSmith auth
LANGSMITH_ENDPOINT            # LangSmith API URL
LANGSMITH_PROJECT             # LangSmith project name
# Feature flags
USE_AGENT_CORE                # true|false|shadow - V2 agent architecture
HITL_ENABLED                  # Human-in-the-loop gate
# Staff
OWNER_PHONE                   # Escalation phone
BLUELAKE_STAFF_EMAIL          # Staff email
FEEDBACK_FORM_URL             # Discord feedback form URL
HOMEEASY_DIALER_URL           # Outbound dialer endpoint
# Google Drive / Gmail
YGL_DRIVE_PARENT_FOLDER_ID    # Doc storage folder
GDRIVE_SERVICE_ACCOUNT_JSON   # Service account (from K8s Secret)
GMAIL_SERVICE_ACCOUNT_JSON    # Same SA for email monitoring
Important detail: adding a key to the ConfigMap does NOT automatically expose it to the pod. Each variable needs an explicit configMapKeyRef entry in the deployment YAML. I've lost 2 hours twice to "I added the key to the ConfigMap, why isn't the code seeing it?" The variable existed in Kubernetes but the pod spec didn't reference it.
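A check like the following could catch the missing-reference mistake before a deploy. It is a sketch under stated assumptions: the manifest writes name: before key: inside each configMapKeyRef (as the deployment YAML on this page does), the ConfigMap keys are supplied by the caller (for example from kubectl get configmap output), and unreferenced_keys is a hypothetical helper.

```python
import re


def unreferenced_keys(configmap_keys, deployment_yaml: str):
    """Return ConfigMap keys that no configMapKeyRef in the manifest points at."""
    # Capture the value of `key:` in every `configMapKeyRef: / name: / key:` block.
    referenced = set(
        re.findall(r"configMapKeyRef:\s+name:\s*\S+\s+key:\s*(\S+)", deployment_yaml)
    )
    return sorted(set(configmap_keys) - referenced)


manifest = """
env:
  - name: DATABASE_URL
    valueFrom:
      configMapKeyRef:
        name: example-config
        key: DATABASE_URL
"""

# NEW_FLAG exists in the ConfigMap but the pod spec never references it.
print(unreferenced_keys({"DATABASE_URL", "NEW_FLAG"}, manifest))
```

A regex is used here only to keep the sketch dependency-free; a YAML parser would be the sturdier choice for a real pre-deploy check.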
Two variables come from a Kubernetes Secret instead of the ConfigMap: GDRIVE_SERVICE_ACCOUNT_JSON and GMAIL_SERVICE_ACCOUNT_JSON. These are the Google service account credentials for reading doc submissions from Gmail and storing files in Drive. They're marked optional: true because the pod should start even if the secret doesn't exist — the doc processing features just won't work.
Discord Error Logging Pipeline
16 channels. One Discord bot token. Each channel receives a different event type. I watch these from my phone. When something breaks, I know within seconds. Not minutes. Seconds.
Channel ID     Function Name                           What It Receives
-----------------------------------------------------------------------------------------
[CHANNEL_ID]   sendDiscordYNYErrorAlert                Unhandled exceptions, OOM context
[CHANNEL_ID]   sendDiscordYNYEvent                     Typed events: message_received,
                                                       ai_response, sms_sent, error,
                                                       dead_lead, qualification, etc.
[CHANNEL_ID]   sendDiscordMessage                      General system messages
[CHANNEL_ID]   sendDiscordMessageAmy                   Locator agent events
[CHANNEL_ID]   sendDiscordLangAgentAlert               Agent brain reasoning traces
[CHANNEL_ID]   sendDiscordFollowUpMessage              Followup sequence events
[CHANNEL_ID]   sendDiscordDeadClientAlert              Lead death events
[CHANNEL_ID]   sendDiscordTourRequest                  Tour scheduling requests
[CHANNEL_ID]   sendDiscordBuildingOptionsAlert         Building option generation
[CHANNEL_ID]   sendDiscordRequirementsCheck            Requirements gathering events
[CHANNEL_ID]   sendDiscordAgentResponse                Agent responses (all agents)
[CHANNEL_ID]   sendDiscordChatHistoryNotFound          Missing chat history warnings
[CHANNEL_ID]   sendDiscordUnauthorizedMessage          Auth failures
[CHANNEL_ID]   sendDiscordRequirementsNoteAlert        CRM note updates
[CHANNEL_ID]   sendDiscordStageSuggestionNoteAlert     Stage transition suggestions
[CHANNEL_ID]   sendDiscordSanityCheckNoteAlert         Sanity check results
[CHANNEL_ID]   sendDiscordMessageWithFeedbackButton    HITL escalation w/ feedback
The core posting function is the same pattern repeated 16 times:
import os

import requests

# Bot token comes from the DISCORD_BOT_TOKEN environment variable listed above
DISCORD_BOT_TOKEN = os.environ.get("DISCORD_BOT_TOKEN")


def sendDiscordYNYErrorAlert(textContent, channel_id='[CHANNEL_ID]'):
    try:
        url = f"https://discord.com/api/v10/channels/{channel_id}/messages"
        payload = {
            "content": f"YNY Service Error\n{str(textContent)}"
        }
        headers = {
            "Authorization": f"Bot {DISCORD_BOT_TOKEN}",
            "Content-Type": "application/json"
        }
        response = requests.post(url, headers=headers, json=payload)
        return response
    except Exception as e:
        # Never let a failed alert raise into the caller
        print(f"Error sending Discord YNY error message: {e}")
        return None
The error handler in the main service wraps this:
import logging

logger = logging.getLogger(__name__)


def log_yny_error(error_message, context="", notify_discord=True):
    full_message = (f"YNY Service Error - {context}: {error_message}"
                    if context
                    else f"YNY Service Error: {error_message}")
    # Always write to the local log first, so the error survives a Discord outage
    logger.error(full_message)
    if notify_discord:
        try:
            discord_message = (f"**Context:** {context}\n**Error:** {error_message}"
                               if context
                               else f"**Error:** {error_message}")
            sendDiscordYNYErrorAlert(discord_message)
        except Exception as discord_error:
            logger.error(f"Failed to send Discord notification: {discord_error}")
Notice the try/except inside log_yny_error. If Discord itself is down, the error logging shouldn't crash the error handler. Errors about errors are the most dangerous kind — they mask the original problem.
The typed event system (sendDiscordYNYEvent) maps event types to context fields. A message_received event carries client_id, phone, name. A dead_lead event carries the reason. Discord has a 2000-character limit per message, so the function truncates at 1950 characters with a ... (truncated) suffix. I've had production stack traces that exceeded 2000 characters — the truncation prevents the Discord API from rejecting the alert entirely, which would mean the error disappears into silence.
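The truncation rule is simple enough to sketch. This is an illustration of the behavior described above, not the service's actual code; truncate_for_discord and the constant names are assumptions.

```python
DISCORD_LIMIT = 2000        # Discord's per-message character limit
TRUNCATE_AT = 1950          # cut point described above
SUFFIX = "... (truncated)"  # marks that the alert was shortened


def truncate_for_discord(text: str) -> str:
    """Shorten text so the Discord API accepts it, marking that it was cut."""
    if len(text) <= TRUNCATE_AT:
        return text
    return text[:TRUNCATE_AT] + SUFFIX


long_trace = "x" * 5000
short = truncate_for_discord(long_trace)
print(len(short), len(short) <= DISCORD_LIMIT)
```

Cutting at 1950 rather than 2000 leaves room for the suffix and for any short prefix the posting function adds in front of the content.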
Pre/Post Deployment Checklists
Seven steps after every deploy. No exceptions. These are not suggestions.
# 1. Pod is running
kubectl get pods -l app=homeeasy-ai-service-v3
# Look for: 4/4 Running (one READY count per container), 0 restarts, age < 5m
# 2. All containers healthy
kubectl describe pod -l app=homeeasy-ai-service-v3 | grep -A 3 "State:"
# FAIL if any container shows: CrashLoopBackOff, Error, OOMKilled
# 3. LoadBalancer has external IP
kubectl get svc homeeasy-ai-service-v3
# EXTERNAL-IP must not be <pending>
# 4. Webhook endpoint responds
curl -s -o /dev/null -w "%{http_code}" http://<EXTERNAL-IP>:5012/health
# Must return 200
# 5. Celery beat is scheduling
kubectl exec <pod-name> -c homeeasy-ai-service-v3-celery-beat \
-- celery -A app.celery inspect scheduled
# Must show upcoming tasks
# 6. Workers are consuming
kubectl exec <pod-name> -c homeeasy-ai-service-v3-celery-worker \
-- celery -A app.celery inspect active_queues
# Must show worker registered on ai_client_service_v3 queue
# 7. End-to-end test
# Send a test SMS to the Twilio number
# Verify: webhook received -> task queued -> worker processed -> response sent
# Check Discord channels for the event trace
CrashLoopBackOff is the most common failure mode. It means the container started, crashed, restarted, crashed again, and Kubernetes is now backing off on restart attempts. Each restart doubles the wait: 10s, 20s, 40s, 80s, up to 5 minutes. During that backoff, the container is dead. No tasks process. No webhooks respond.
The usual cause: a Python import error. A new module references a dependency that's not in requirements.txt. Or a circular import. The container starts Python, hits the import error in the first 2 seconds, exits with code 1, and Kubernetes restarts it. The fix is always in the build step, never in the cluster. But you won't know it's an import error until you read the logs:
kubectl logs <pod-name> -c homeeasy-ai-service-v3-web --previous
# --previous shows logs from the LAST container instance (before crash)
# Without --previous, you get the current instance, which might be mid-crash
The --previous flag is the one that matters. Without it, you see the current container's logs, which might be 0.5 seconds of startup before the next crash. With it, you see the full output from the container that actually failed. Every crash investigation starts here.
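The CrashLoopBackOff schedule described earlier (10s, doubling, capped at 5 minutes) can be written out to see how quickly the dead time adds up. A minimal sketch; crashloop_delays is an illustrative name, and the base and cap are the values quoted in the text.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300):
    """Wait in seconds before each restart: doubles from base, capped at cap."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = crashloop_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910 seconds spent waiting across seven restarts
```

Seven failed restarts already cost about 15 minutes of backoff alone, which is why reading the --previous logs early beats waiting for the container to recover on its own.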
SLA Monitoring
Five internal SLAs. Not aspirational targets. Hard deadlines that trigger automated nag sequences when breached.
SLA_DOC_REVIEW_H = 24        # Doc review: 24 hours from receipt
SLA_APP_DECISION_H = 48      # Application decision: 48 hours from complete docs
SLA_INVOICE_DAYS = 30        # Invoice delivery: 30 days from move-in
SLA_SHOWING_SCHEDULE_H = 48  # Showing schedule: 48 hours from request
# First response: 5 minutes from inbound lead (handled by Celery task priority)
When an SLA is breached, the system doesn't just log it. It starts nagging. At 50% of the SLA window, a gentle reminder hits the Asana ticket. At 100%, an overdue alert fires to Discord and Asana. Every 8 hours after that, another nag, up to 3 times. Then it escalates to me.
def _sla_check(lead, ctx, now, sla_hours, send_sms, send_discord,
               first_name, db=None, start_key="ticket_created_at",
               lead_msg_gentle=None, lead_msg_overdue=None):
    start = _parse_dt(ctx.get(start_key))
    if not start:
        return None
    hours_waiting = _hours_since(start, now)
    ticket_id = ctx.get("ticket_id", "")
    nag_count = int(ctx.get("staff_nag_count", "0"))
    hours_since_nag = _hours_since(
        _parse_dt(ctx.get("staff_last_nagged_at")), now
    )
    result = {"nagged": 0, "updated_lead": 0}
    # Staff nag at 100% SLA, min 8h between nags
    if hours_waiting >= sla_hours and nag_count < 3 and hours_since_nag >= 8:
        nag = (f"OVERDUE ({hours_waiting:.0f}h): "
               f"{lead.full_name} waiting {hours_waiting:.0f}h "
               f"(SLA: {sla_hours}h)")
        if ticket_id and not ticket_id.startswith("dry-run"):
            add_comment_to_ticket(ticket_id, nag)
        if send_discord:
            send_discord(f"STAFF NAG: {nag}")
        ctx["staff_last_nagged_at"] = now.isoformat()
        ctx["staff_nag_count"] = str(nag_count + 1)
        _save_ctx(lead.id, ctx, db)
        result["nagged"] = 1
    # Gentle nag at 50% SLA
    elif hours_waiting >= sla_hours * 0.5 and nag_count == 0:
        nag = (f"Reminder: {lead.full_name} waiting "
               f"{hours_waiting:.0f}h (SLA: {sla_hours}h)")
        if ticket_id and not ticket_id.startswith("dry-run"):
            add_comment_to_ticket(ticket_id, nag)
        ctx["staff_last_nagged_at"] = now.isoformat()
        ctx["staff_nag_count"] = "1"
        result["nagged"] = 1
    # Update lead if no contact in 24h
    hours_since_update = _hours_since(
        _parse_dt(ctx.get("lead_last_update_at")), now
    )
    if hours_since_update >= 24:
        msg = lead_msg_overdue if hours_waiting >= 48 else lead_msg_gentle
        if msg:
            _send_sms(send_sms, lead.id, lead.phone,
                      msg.format(name=first_name))
            ctx["lead_last_update_at"] = now.isoformat()
            result["updated_lead"] = 1
The dry-run prefix on ticket IDs is a testing safeguard. During simulations, ticket IDs start with "dry-run" so the nag system won't spam real Asana tickets. The code checks for this prefix before writing comments. Without it, every test run would pollute production task threads with fake overdue alerts.
The 8-hour minimum between nags prevents alert fatigue. Three nags and then silence — if three nags didn't work, a fourth won't either. That's when it comes to me.
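To see when nags actually fire, the two branch conditions from _sla_check can be replayed over a timeline. This is a simulation sketch, not production code: it assumes the check runs once per hour, that a lead who has never been nagged counts as infinitely long since the last nag, and the names nag_stage and nag_timeline are hypothetical.

```python
def nag_stage(hours_waiting, sla_hours, nag_count, hours_since_nag):
    """Mirror the branch order of _sla_check: overdue is tested first, then gentle."""
    if hours_waiting >= sla_hours and nag_count < 3 and hours_since_nag >= 8:
        return "overdue"
    if hours_waiting >= sla_hours * 0.5 and nag_count == 0:
        return "gentle"
    return None


def nag_timeline(sla_hours, horizon_hours, step_hours=1):
    """Replay hourly checks and record (hour, stage) for every nag sent."""
    timeline, nag_count, last_nag = [], 0, None
    t = 0
    while t <= horizon_hours:
        since = float("inf") if last_nag is None else t - last_nag
        stage = nag_stage(t, sla_hours, nag_count, since)
        if stage:
            timeline.append((t, stage))
            nag_count += 1
            last_nag = t
        t += step_hours
    return timeline


print(nag_timeline(24, 72))  # [(12, 'gentle'), (24, 'overdue'), (32, 'overdue')]
```

As the excerpt is written, the gentle reminder counts toward the cap of three. A 24-hour SLA therefore produces one reminder at hour 12 and overdue nags at hours 24 and 32, then silence.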
The Full Pod Spec
This is the actual deployment YAML. Not a simplified version. Not pseudocode. The file that kubectl apply reads.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: homeeasy-ai-service-v3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: homeeasy-ai-service-v3
  template:
    metadata:
      labels:
        app: homeeasy-ai-service-v3
    spec:
      containers:
        - name: homeeasy-ai-service-v3-web
          image: us-central1-docker.pkg.dev/[PROJECT_ID]/homeeasy-repos/homeeasy-ai-service-v3:IMAGE_TAG
          command: ["gunicorn"]
          args: ["-w", "1", "-b", "0.0.0.0:5012", "app:app",
                 "--log-level", "debug", "--timeout", "60"]
          ports:
            - containerPort: 5012
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "250m"
          env:
            - name: DATABASE_URL
              valueFrom:
                configMapKeyRef:
                  name: [CONFIGMAP_NAME]
                  key: DATABASE_URL
            # ... 25+ more env vars from ConfigMap ...
        - name: homeeasy-ai-service-v3-celery-beat
          image: us-central1-docker.pkg.dev/[PROJECT_ID]/homeeasy-repos/homeeasy-ai-service-v3:IMAGE_TAG
          command: ["celery"]
          args: ["-A", "app.celery", "beat", "--loglevel=info"]
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "500m"
          env:
            # ... same env vars as web container ...
        - name: homeeasy-ai-service-v3-celery-worker
          image: us-central1-docker.pkg.dev/[PROJECT_ID]/homeeasy-repos/homeeasy-ai-service-v3:IMAGE_TAG
          command: ["celery"]
          args: ["-A", "app.celery", "worker", "--loglevel=info",
                 "-c", "6", "--queues=ai_client_service_v3",
                 "--time-limit=1700", "--soft-time-limit=1600",
                 "--max-tasks-per-child=10", "--prefetch-multiplier=1"]
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            # ... same env vars as web container ...
Replicas: 1. One pod. The second Celery worker runs as a second container within the same pod, not as a second pod. This means all four containers share the same node. If the node goes down, everything goes down. The tradeoff: simplicity. One pod to monitor, one set of logs, one restart command. At 30,000 leads per day, the volume doesn't justify multi-node redundancy yet. It will. When it does, the architecture changes to a Deployment with 2+ replicas and a separate Redis/RabbitMQ StatefulSet.
What 8 Days of Silence Looks Like
February 14-21, 2026.
A deploy went out on the 14th. It broke the lead processing pipeline. Leads came in from listing services. They landed in the database. The system acknowledged receipt. But responses didn't go out. The agent brain wasn't firing. Leads texted us and got nothing back.
The monitoring didn't catch it because the monitoring deployed with the bad code. The Discord alert functions imported from the same codebase that was broken. When the import failed, the alert functions failed. When the alert functions failed, nobody knew the alerts were failing. The snake ate its tail.
So I fixed it. Pushed another deploy on the 15th. That fix introduced a new regression — the CI pipeline had its own bugs, and the "fix" passed tests that weren't testing the right thing. Pushed another fix on the 16th. That one created a duplicate message problem: leads started getting the same SMS 3 times. Fixed the dedup on the 16th. That broke the Gemini timeout handling. Fixed Gemini on the 16th. That exposed a dormant bug in the YGL inbox processing. And so on.
Eight consecutive days. Each fix introduced new breakage. No session knew what previous sessions had tried. The session memory system didn't exist yet. DEPLOYMENT_STATE.md didn't exist yet. Each AI coding session started fresh, read the code, saw something broken, fixed it, and deployed — not knowing that the exact same approach had been tried and failed 48 hours earlier.
# From DEPLOYMENT_STATE.md — written after the doom loop ended:
What Was Tried And Failed:
- 2026-02-14: nightly slop PRD fix
- 2026-02-15: CI fix, codebase decomposition PR7, verification rootcause
- 2026-02-16: duplicate message dedup, Gemini timeout hotfix
- 2026-02-17: YGL audit inbox rescue
- 2026-02-18: 17-lead recovery, inventory recovery overnight,
lead intelligence Gemini fix
- 2026-02-19: overnight system health fix
- 2026-02-20: CRM three features design, overnight full simulation
- 2026-02-21: infinite loop fix
Pattern: Each fix introduced new breakage.
No session knew what previous sessions tried.
$4,000-$5,000 per day. Eight days. That's $32,000 to $40,000. Call it $36,000 in lost revenue.
This is why three things now exist that didn't exist before:
1. DEPLOYMENT_STATE.md — a plain text file at the repo root. It gets read at the start of every AI coding session, regardless of what code is deployed. It says what's running, what's broken, what was tried and failed. The file is outside the application code. If the application code is broken, the state file still works. If every Python file in the repo has an import error, DEPLOYMENT_STATE.md still tells the next session what happened.
2. Session memory — every session auto-saves what it did, what broke, and what to do next. The next session reads the last 2-3 session files before starting work. No more flying blind. No more repeating failed approaches.
3. The 7-step verification checklist — mandatory after every deploy. Not "run if you feel like it." Mandatory. Enforced by hooks. The checklist exists because during the doom loop, I was deploying without verifying. Push code, assume it works, move on. Seven times in a row, it didn't work. Now nothing gets marked "deployed" until all 7 steps pass.
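The point of item 1 is that the state file can be read without touching application code. A minimal sketch of such a reader, using only the standard library so that a broken import anywhere in the app cannot stop it; load_deployment_state is a hypothetical helper, and only the file name comes from the text above.

```python
from pathlib import Path

STATE_FILE = "DEPLOYMENT_STATE.md"  # plain text file at the repo root


def load_deployment_state(repo_root: str) -> str:
    """Read the state file as plain text; never imports application code."""
    path = Path(repo_root) / STATE_FILE
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # A missing file is itself a signal: the state is unknown, not healthy.
        return f"WARNING: {STATE_FILE} not found in {repo_root}; treat deployment state as unknown."


print(load_deployment_state("."))
```

Because the reader depends on nothing but the filesystem, it keeps working in exactly the situation where the monitoring described earlier did not.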
The code is just the latest attempt. The state file is the ground truth. When those two things are the same file, a bad deploy kills both. When they're separate, the ground truth survives.
Infrastructure is scar tissue. Every piece of this deployment system — the Celery flags, the Discord channels, the verification checklist, the state file — is a scar from something that went wrong. Systems don't get built from first principles. They get built from consequences.
This deployment infrastructure runs the Landlord Rep agent — the system that operates as the leasing office for 8 buildings in South Shore, Chatham, and Chicago Heights. The same architectural patterns (GKE pod, Celery workers, Discord monitoring, Cloud Build pipeline) apply to the Locator Agent and Ken Insurance, with different container counts and queue names. The principles are the same: test before you deploy, watch everything, and keep the ground truth outside the blast radius.