Description
Summary
Status: CRITICAL - Complete bot platform outage for 17+ hours
Affected: ALL marketplace bot versions (:20045, :20043, :20021, :30004, :80015)
Impact: 10+ bots in CrashLoopBackOff with 100-224 restarts, ~40,000+ failed connection attempts
Root Cause
Backend GraphQL schema renamed fields from emessage_* → emsg_* without backward compatibility:
Subscription arguments:
- want_emessage_types → want_emsg_types (required argument)
FExternalMessageOutput fields:
- emessage_id → emsg_id
- emessage_persona_id → emsg_persona_id
- emessage_type → emsg_type
- emessage_from → emsg_from
- emessage_to → emsg_to
- emessage_external_id → emsg_external_id
- emessage_payload → emsg_payload
- emessage_created_ts → emsg_created_ts
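Because the rename is purely mechanical, a client-side subscription string can be migrated with a word-boundary substitution. The following is a minimal sketch, assuming the query text is available as a Python string; `RENAMES` and `migrate_query` are hypothetical names used for illustration and are not taken from flexus-client-kit.

```python
import re

# Old -> new names, copied from the rename list in this issue.
RENAMES = {
    "want_emessage_types": "want_emsg_types",
    "emessage_id": "emsg_id",
    "emessage_persona_id": "emsg_persona_id",
    "emessage_type": "emsg_type",
    "emessage_from": "emsg_from",
    "emessage_to": "emsg_to",
    "emessage_external_id": "emsg_external_id",
    "emessage_payload": "emsg_payload",
    "emessage_created_ts": "emsg_created_ts",
}

# \b on both sides so only whole identifiers are rewritten.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_query(query: str) -> str:
    """Replace every old field or argument name with its new name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], query)


if __name__ == "__main__":
    old = "bot_threads_calls_tasks(want_emessage_types: $t) { emessage_id emessage_payload }"
    print(migrate_query(old))
```

Identifiers that merely contain an old name (for example `my_emessage_id`) are left untouched, which keeps the substitution from rewriting unrelated symbols.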
Why This Is A Regression
This was already investigated on 2026-01-30T18:18:00Z with remediation recommendations, but NO ACTION WAS TAKEN. Bots continued failing with identical errors for 17+ hours.
Critical Finding: Bot Build Pipeline Issue
🚨 Even NEWLY BUILT bot images contain the broken code:
- Bot image adspy:80015 was created at 2026-01-31T11:45:02Z
- This is 17 HOURS AFTER the backend breaking change
- Yet it STILL sends old field names in GraphQL subscriptions
- This proves the bot build pipeline is pulling stale code or using cached layers
Error Examples
strawberry.execution [ERROR] Unknown argument 'want_emessage_types' on field 'Subscription.bot_threads_calls_tasks'
Did you mean 'want_emsg_types'?
Cannot query field 'emessage_id' on type 'FExternalMessageOutput'
Did you mean 'emsg_id'?
Affected Pods (Isolated Namespace)
- flexus-pod-bot-admonster-20045-rx - CrashLoopBackOff (152 restarts)
- flexus-pod-bot-boss-20045-rx - CrashLoopBackOff (152 restarts)
- flexus-pod-bot-bob-30004-rx - CrashLoopBackOff (145 restarts)
- flexus-pod-bot-frog-20045-rx - CrashLoopBackOff (142 restarts)
- flexus-pod-bot-karen-20045-rx - CrashLoopBackOff (152 restarts)
- flexus-pod-bot-lawyerrat-20045-rx - CrashLoopBackOff (149 restarts)
- flexus-pod-bot-owl-strategist-20045-rx - CrashLoopBackOff (152 restarts)
- flexus-pod-bot-slonik-20045-rx - CrashLoopBackOff (152 restarts)
- flexus-pod-bot-rick-20021-rx - CrashLoopBackOff (6 restarts)
- flexus-pod-bot-adspy-80015-rx - Intermittent failures
Backend Details
- Pod: backend-v1-deployment-568bb897bb-nxlzz
- Image: europe-west4-docker.pkg.dev/small-storage1/databases-and-such/refact-teams-backend:staging.2026-01-30T17-43-47Z
- Deployed: 2026-01-30T18:06:03Z (unchanged for 17+ hours)
- Revision: 121
Business Impact
- Complete bot platform failure - all marketplace bots non-functional
- 17+ hours of outage with no remediation
- ~40,000+ failed connection attempts (10 bots × 4 attempts/min × 60 min/h × 17 h ≈ 40,800)
- Kubernetes resource thrashing from continuous pod restarts
- Backend logs flooded with validation errors
- Users see bots as completely broken
Investigation Chain
- ✅ Retrieved previous investigation - documented issue and recommendations
- ✅ Checked current pod status - 10/15 pods in CrashLoopBackOff
- ✅ Examined logs - identical GraphQL validation errors
- ✅ Verified backend - no changes since breaking deployment
- 🚨 Critical: Newest bot image (:80015) created 17h after breaking change STILL has old code
- ✅ Backend logs show continuous validation error stream
- ✅ Pod-operator detects failures but has "no callback configured" - no auto-remediation
- ✅ Found related issue: database columns also not migrated (knowledge base NYeuBrlDQN)
Immediate Actions Required
Priority 1: Restore Service (Minutes)
Add deprecated field aliases to backend GraphQL schema:
# In backend GraphQL schema definition
from typing import List, Optional

import strawberry

@strawberry.type
class FExternalMessageOutput:
    # New fields (primary)
    emsg_id: str
    emsg_persona_id: str
    emsg_type: str
    # ... other emsg_* fields

    # Deprecated aliases for backward compatibility
    @strawberry.field(deprecation_reason="Use emsg_id")
    def emessage_id(self) -> str:
        return self.emsg_id

    @strawberry.field(deprecation_reason="Use emsg_persona_id")
    def emessage_persona_id(self) -> str:
        return self.emsg_persona_id
    # ... add aliases for all 8 renamed fields

# For subscription arguments (sketch only: Strawberry declares arguments as
# resolver parameters, so the old name has to be accepted on the
# bot_threads_calls_tasks resolver in the same way)
class BotThreadsCallsTasksArgs:
    want_emsg_types: Optional[List[str]] = None
    # Accept old argument name as alias
    want_emessage_types: Optional[List[str]] = strawberry.field(
        default=None,
        deprecation_reason="Use want_emsg_types"
    )
Why this first: the backend can accept both old and new names immediately, which should restore all bots within minutes without waiting for new bot images.
Priority 2: Fix Bot Build Pipeline (30 min investigation)
Investigate why new bot images contain stale code:
- Check what git ref/branch bot builds pull from
- Verify Docker layer caching isn't using stale flexus-client-kit
- Confirm ckit dependency version in bot requirements.txt
- Check if bot Dockerfiles hardcode old ckit commit/tag
- Review CI/CD pipeline for bot image builds
Evidence of problem: Bot adspy:80015 created 2026-01-31T11:45:02Z still has old GraphQL queries despite being built 17 hours after the backend change.
Priority 3: Rebuild All Bot Images (After fixing build pipeline)
Only after confirming build pipeline pulls latest code:
- Trigger forced rebuild of all marketplace bot images
- Use --no-cache to prevent Docker layer reuse
- Verify new images contain updated GraphQL subscription code
- Deploy updated images to all namespaces
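The verification step above can be partly automated by scanning a bot's source tree for the old names before an image is deployed. This is a minimal sketch, assuming the source files have already been extracted to a local directory; `find_stale_names`, the regular expression, and the default file suffixes are hypothetical choices, not part of the existing pipeline.

```python
import re
from pathlib import Path

# Matches emessage_* identifiers and the old want_emessage_types argument.
STALE = re.compile(r"\b(?:want_)?emessage_\w+")


def find_stale_names(root: str, suffixes=(".py", ".graphql")) -> list[tuple[str, int, str]]:
    """Return (path, line number, identifier) for every old name under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in STALE.finditer(line):
                hits.append((str(path), lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    import sys

    found = find_stale_names(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path, lineno, name in found:
        print(f"{path}:{lineno}: {name}")
    # A non-zero exit code lets a CI job fail the build on stale names.
    sys.exit(1 if found else 0)
```

Run against the extracted contents of a freshly built image, a clean exit would suggest the subscription code was updated; any hit points at the file that still carries the old query.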
Priority 4: Process Improvements
- Add GraphQL schema compatibility testing - automated tests that verify backward compatibility
- Implement coordinated deployments - backend schema changes must coordinate with client updates
- Add bot failure alerting - pod-operator should alert on mass bot failures
- Add automated rollback - when >50% of bots fail, auto-rollback backend deployment
- Fix database migration - per knowledge base NYeuBrlDQN, database columns still named emessage_*
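As an illustration of the schema compatibility testing proposed above, the check below compares two snapshots of type-to-field names and reports anything the new schema dropped. It is a minimal sketch under stated assumptions: the snapshots are hand-written dictionaries here, whereas a real CI job would export them from the deployed and candidate schemas, and `removed_fields` is a hypothetical helper.

```python
def removed_fields(
    old: dict[str, set[str]], new: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return, per type, the fields present in old but missing from new."""
    missing = {}
    for type_name, old_fields in old.items():
        gone = old_fields - new.get(type_name, set())
        if gone:
            missing[type_name] = gone
    return missing


# Snapshot before the rename (subset of fields, for illustration).
OLD = {"FExternalMessageOutput": {"emessage_id", "emessage_type"}}

# Rename without aliases: the old names disappear.
NEW_BREAKING = {"FExternalMessageOutput": {"emsg_id", "emsg_type"}}

# Rename with deprecated aliases kept: old names remain queryable.
NEW_COMPATIBLE = {
    "FExternalMessageOutput": {"emsg_id", "emsg_type", "emessage_id", "emessage_type"}
}

if __name__ == "__main__":
    print(removed_fields(OLD, NEW_BREAKING))
    print(removed_fields(OLD, NEW_COMPATIBLE))
```

A test that asserts `removed_fields(...)` is empty would have flagged this rename before deployment. It covers only removed fields; argument changes and type changes would need additional checks.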
Related Issues
- Database column names: Backend code queries emsg_* fields but database columns are still named emessage_* (affects db_keeper and other services)
- No automated remediation: Pod-operator logs "no callback configured" - cannot auto-rebuild or rollback
- Previous investigation ignored: Same issue documented 2026-01-30, no action taken
Files & References
- Investigation report: investigations/graphql-emessage-regression-2026-01-31.json
- Knowledge base: NYeuBrlDQN (database column mismatch)
- Subscription template: All bots use subscription named KarenThreads (likely in flexus-client-kit)
Timeline
- 2026-01-30T17:43:47Z - Backend image built with breaking change
- 2026-01-30T18:06:03Z - Backend deployed to staging
- 2026-01-30T18:08:07Z - First bot failures detected
- 2026-01-30T18:18:00Z - Initial investigation completed, recommendations made
- 2026-01-30 → 2026-01-31 - NO ACTION TAKEN
- 2026-01-31T11:45:02Z - New bot image (:80015) built, STILL BROKEN
- 2026-01-31T11:52:00Z - Follow-up investigation confirms regression
- Present - 17+ hours of complete outage ongoing
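The "17+ hours" figure can be checked directly from the timestamps in the timeline. A short sketch, using only the values listed above; `parse_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp of the form 2026-01-30T18:06:03Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


deployed = parse_utc("2026-01-30T18:06:03Z")       # backend deployed
first_failure = parse_utc("2026-01-30T18:08:07Z")  # first bot failures
image_built = parse_utc("2026-01-31T11:45:02Z")    # adspy:80015 built
follow_up = parse_utc("2026-01-31T11:52:00Z")      # follow-up investigation

print(image_built - deployed)     # 17:38:59
print(follow_up - first_failure)  # 17:43:53
```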
This issue requires IMMEDIATE attention. Bot platform has been completely non-functional for over 17 hours affecting all users.