CRITICAL REGRESSION: GraphQL emsg field rename breaking all bots for 17+ hours #921

@flexus-teams

Summary

Status: CRITICAL - Complete bot platform outage for 17+ hours
Affected: ALL marketplace bot versions (:20045, :20043, :20021, :30004, :80015)
Impact: 10+ bots in CrashLoopBackOff with 100-224 restarts, ~40,000+ failed connection attempts

Root Cause

Backend GraphQL schema renamed fields from emessage_* → emsg_* without backward compatibility:

Subscription arguments:

  • want_emessage_types → want_emsg_types (required argument)

FExternalMessageOutput fields:

  • emessage_id → emsg_id
  • emessage_persona_id → emsg_persona_id
  • emessage_type → emsg_type
  • emessage_from → emsg_from
  • emessage_to → emsg_to
  • emessage_external_id → emsg_external_id
  • emessage_payload → emsg_payload
  • emessage_created_ts → emsg_created_ts
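
On the client side, the rename list above can be written down as a lookup table and applied to existing query text. The following is a minimal sketch, not the flexus-client-kit implementation: `RENAMES` is copied from the list above, while `migrate_query` and the sample query shape are hypothetical names and data introduced only for illustration.

```python
import re

# Old -> new names, copied from the rename list above.
RENAMES = {
    "want_emessage_types": "want_emsg_types",
    "emessage_id": "emsg_id",
    "emessage_persona_id": "emsg_persona_id",
    "emessage_type": "emsg_type",
    "emessage_from": "emsg_from",
    "emessage_to": "emsg_to",
    "emessage_external_id": "emsg_external_id",
    "emessage_payload": "emsg_payload",
    "emessage_created_ts": "emsg_created_ts",
}

# Match whole identifiers only, so a longer name that merely contains
# an old name (e.g. "other_emessage_id") is left untouched.
_OLD_NAME = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_query(query: str) -> str:
    """Replace every old identifier in a GraphQL document with its new name."""
    return _OLD_NAME.sub(lambda m: RENAMES[m.group(1)], query)


# Hypothetical query text; the real subscription lives in the client code.
old_query = (
    "subscription KarenThreads { bot_threads_calls_tasks("
    'want_emessage_types: ["example"]) { emessage_id emessage_payload } }'
)
print(migrate_query(old_query))
```

A variable such as `$want_emessage_types` would be renamed too, so the keys of the variables payload sent with the query would need the same change.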

Why This Is A Regression

This was already investigated on 2026-01-30T18:18:00Z with remediation recommendations, but NO ACTION WAS TAKEN. Bots continued failing with identical errors for 17+ hours.

Critical Finding: Bot Build Pipeline Issue

🚨 Even NEWLY BUILT bot images contain the broken code:

  • Bot image adspy:80015 was created at 2026-01-31T11:45:02Z
  • This is 17 HOURS AFTER the backend breaking change
  • Yet it STILL sends old field names in GraphQL subscriptions
  • This points to the bot build pipeline pulling stale code or reusing cached layers

Error Examples

strawberry.execution [ERROR] Unknown argument 'want_emessage_types' on field 'Subscription.bot_threads_calls_tasks'
Did you mean 'want_emsg_types'?
Cannot query field 'emessage_id' on type 'FExternalMessageOutput'
Did you mean 'emsg_id'?
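
The "Did you mean" hints in these errors already spell out the old → new mapping, so it can be collected from the logs rather than by hand. This is a rough sketch that assumes the message wording shown above; `renames_from_log` is a hypothetical helper, and hints that offer several alternatives are skipped.

```python
import re

_ERROR = re.compile(r"(?:Unknown argument|Cannot query field) '(?P<old>\w+)'")
_HINT = re.compile(r"Did you mean '(?P<new>\w+)'\?")


def renames_from_log(lines):
    """Pair each validation error with the suggestion that follows it."""
    renames = {}
    pending = None  # old name still waiting for its suggestion
    for line in lines:
        error = _ERROR.search(line)
        if error:
            pending = error["old"]
        hint = _HINT.search(line)
        if hint and pending:
            renames[pending] = hint["new"]
            pending = None
    return renames


log = """\
strawberry.execution [ERROR] Unknown argument 'want_emessage_types' on field 'Subscription.bot_threads_calls_tasks'
Did you mean 'want_emsg_types'?
Cannot query field 'emessage_id' on type 'FExternalMessageOutput'
Did you mean 'emsg_id'?
"""
print(renames_from_log(log.splitlines()))
```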

Affected Pods (Isolated Namespace)

  • flexus-pod-bot-admonster-20045-rx - CrashLoopBackOff (152 restarts)
  • flexus-pod-bot-boss-20045-rx - CrashLoopBackOff (152 restarts)
  • flexus-pod-bot-bob-30004-rx - CrashLoopBackOff (145 restarts)
  • flexus-pod-bot-frog-20045-rx - CrashLoopBackOff (142 restarts)
  • flexus-pod-bot-karen-20045-rx - CrashLoopBackOff (152 restarts)
  • flexus-pod-bot-lawyerrat-20045-rx - CrashLoopBackOff (149 restarts)
  • flexus-pod-bot-owl-strategist-20045-rx - CrashLoopBackOff (152 restarts)
  • flexus-pod-bot-slonik-20045-rx - CrashLoopBackOff (152 restarts)
  • flexus-pod-bot-rick-20021-rx - CrashLoopBackOff (6 restarts)
  • flexus-pod-bot-adspy-80015-rx - Intermittent failures

Backend Details

  • Pod: backend-v1-deployment-568bb897bb-nxlzz
  • Image: europe-west4-docker.pkg.dev/small-storage1/databases-and-such/refact-teams-backend:staging.2026-01-30T17-43-47Z
  • Deployed: 2026-01-30T18:06:03Z (unchanged for 17+ hours)
  • Revision: 121

Business Impact

  • Complete bot platform failure - all marketplace bots non-functional
  • 17+ hours of outage with no remediation
  • ~40,000+ failed connection attempts (10 bots × 4 attempts/min × 60 min/h × 17 h ≈ 40,800)
  • Kubernetes resource thrashing from continuous pod restarts
  • Backend logs flooded with validation errors
  • Users see bots as completely broken
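
The failed-connection estimate in the list above is plain arithmetic on the issue's own figures (10 bots, 4 attempts per minute, 17 hours). Spelled out as a quick check:

```python
bots = 10
attempts_per_minute = 4
outage_hours = 17

failed_attempts = bots * attempts_per_minute * 60 * outage_hours
print(failed_attempts)  # 40800
```

The result is a lower bound as long as the outage continues past the 17-hour mark.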

Investigation Chain

  1. ✅ Retrieved previous investigation - documented issue and recommendations
  2. ✅ Checked current pod status - 10/15 pods in CrashLoopBackOff
  3. ✅ Examined logs - identical GraphQL validation errors
  4. ✅ Verified backend - no changes since breaking deployment
  5. 🚨 Critical: Newest bot image (:80015) created 17h after breaking change STILL has old code
  6. ✅ Backend logs show continuous validation error stream
  7. ✅ Pod-operator detects failures but has "no callback configured" - no auto-remediation
  8. ✅ Found related issue: database columns also not migrated (knowledge base NYeuBrlDQN)

Immediate Actions Required

Priority 1: Restore Service (Minutes)

Add deprecated field aliases to backend GraphQL schema:

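# In backend GraphQL schema definition
from typing import Annotated, AsyncGenerator, List, Optional

import strawberry


@strawberry.type
class FExternalMessageOutput:
    # New fields (primary)
    emsg_id: str
    emsg_persona_id: str
    emsg_type: str
    # ... other emsg_* fields

    # Deprecated aliases for backward compatibility
    @strawberry.field(deprecation_reason="Use emsg_id")
    def emessage_id(self) -> str:
        return self.emsg_id

    @strawberry.field(deprecation_reason="Use emsg_persona_id")
    def emessage_persona_id(self) -> str:
        return self.emsg_persona_id

    # ... add aliases for all 8 renamed fields


# For subscription arguments: Strawberry reads arguments from the resolver
# signature, so the alias is an extra optional parameter.
@strawberry.type
class Subscription:
    @strawberry.subscription
    async def bot_threads_calls_tasks(
        self,
        # Has to be optional while the old name is still accepted
        want_emsg_types: Optional[List[str]] = None,
        # Accept old argument name as alias
        want_emessage_types: Annotated[
            Optional[List[str]],
            strawberry.argument(deprecation_reason="Use want_emsg_types"),
        ] = None,
    ) -> AsyncGenerator[FExternalMessageOutput, None]:  # placeholder: keep the real return type
        wanted = want_emsg_types or want_emessage_types or []
        ...  # existing subscription body, filtering on `wanted`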

Why this first: Backend can accept both old and new field names immediately, restoring all bots to functionality within minutes.

Priority 2: Fix Bot Build Pipeline (30 min investigation)

Investigate why new bot images contain stale code:

  1. Check what git ref/branch bot builds pull from
  2. Verify Docker layer caching isn't using stale flexus-client-kit
  3. Confirm ckit dependency version in bot requirements.txt
  4. Check if bot Dockerfiles hardcode old ckit commit/tag
  5. Review CI/CD pipeline for bot image builds

Evidence of problem: Bot adspy:80015 created 2026-01-31T11:45:02Z still has old GraphQL queries despite being built 17 hours after the backend change.

Priority 3: Rebuild All Bot Images (After fixing build pipeline)

Only after confirming build pipeline pulls latest code:

  1. Trigger forced rebuild of all marketplace bot images
  2. Use --no-cache to prevent Docker layer reuse
  3. Verify new images contain updated GraphQL subscription code
  4. Deploy updated images to all namespaces
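
Step 3 can be partly automated by scanning the source tree that goes into an image (or is copied out of one) for the old identifiers. The sketch below makes some assumptions: `find_stale_names` is a hypothetical helper, and the suffix list is a guess at where the subscription text lives.

```python
import re
from pathlib import Path

# Old argument name plus any emessage_* field name, as whole identifiers.
STALE = re.compile(r"\b(?:want_emessage_types|emessage_[a-z_]+)\b")


def find_stale_names(root, suffixes=(".py", ".graphql")):
    """Return (path, line number, identifier) for every old name under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in STALE.finditer(line):
                hits.append((str(path), lineno, match.group(0)))
    return hits
```

An empty result for a freshly built image is a necessary check before deploying it, though not proof that the subscription works end to end.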

Priority 4: Process Improvements

  1. Add GraphQL schema compatibility testing - automated tests that verify backward compatibility
  2. Implement coordinated deployments - backend schema changes must coordinate with client updates
  3. Add bot failure alerting - pod-operator should alert on mass bot failures
  4. Add automated rollback - when >50% of bots fail, auto-rollback backend deployment
  5. Fix database migration - per knowledge base NYeuBrlDQN, database columns still named emessage_*
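
The compatibility test in item 1 comes down to one rule: nothing a deployed client can name may disappear. The sketch below illustrates that rule on hand-written name snapshots using only the standard library. `removed_names` and the snapshot layout are hypothetical, and the snapshots list only the names mentioned in this issue.

```python
OLD_FIELDS = {
    "emessage_id", "emessage_persona_id", "emessage_type", "emessage_from",
    "emessage_to", "emessage_external_id", "emessage_payload",
    "emessage_created_ts",
}

# Schema snapshots: owner -> set of field or argument names.
OLD = {
    "FExternalMessageOutput": OLD_FIELDS,
    "Subscription.bot_threads_calls_tasks(args)": {"want_emessage_types"},
}
NEW = {
    owner: {name.replace("emessage_", "emsg_") for name in names}
    for owner, names in OLD.items()
}
# What the schema would expose with the Priority 1 aliases in place.
NEW_WITH_ALIASES = {owner: NEW[owner] | OLD[owner] for owner in NEW}


def removed_names(old, new):
    """Return 'owner.name' for every name present in old but missing in new."""
    missing = []
    for owner, old_names in sorted(old.items()):
        kept = new.get(owner, set())
        missing.extend(f"{owner}.{name}" for name in sorted(old_names - kept))
    return missing


print(removed_names(OLD, NEW))               # nine removed names
print(removed_names(OLD, NEW_WITH_ALIASES))  # []
```

A CI check would fail the build whenever the list is non-empty. For a comparison against the full schema, graphql-core (which Strawberry builds on) ships a `find_breaking_changes` utility that may be a better fit than hand-kept snapshots.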

Related Issues

  • Database column names: Backend code queries emsg_* fields but database columns still named emessage_* (affects db_keeper and other services)
  • No automated remediation: Pod-operator logs "no callback configured" - cannot auto-rebuild or rollback
  • Previous investigation ignored: Same issue documented 2026-01-30, no action taken
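
The ">50% of bots fail" rule proposed under Priority 4 could serve as the missing pod-operator callback. A minimal sketch of the decision itself, assuming the operator can supply a mapping of pod name to status; `should_roll_back` and the pod names are hypothetical.

```python
def should_roll_back(pod_states, threshold=0.5):
    """True when more than `threshold` of the pods are in CrashLoopBackOff."""
    if not pod_states:
        return False
    failing = sum(1 for state in pod_states.values() if state == "CrashLoopBackOff")
    return failing / len(pod_states) > threshold


# 10 of 15 pods failing, as in the investigation chain above.
states = {f"bot-{i}": "CrashLoopBackOff" for i in range(10)}
states.update({f"bot-{i}": "Running" for i in range(10, 15)})
print(should_roll_back(states))  # True
```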

Files & References

  • Investigation report: investigations/graphql-emessage-regression-2026-01-31.json
  • Knowledge base: NYeuBrlDQN (database column mismatch)
  • Subscription template: All bots use subscription named KarenThreads (likely in flexus-client-kit)

Timeline

  • 2026-01-30T17:43:47Z - Backend image built with breaking change
  • 2026-01-30T18:06:03Z - Backend deployed to staging
  • 2026-01-30T18:08:07Z - First bot failures detected
  • 2026-01-30T18:18:00Z - Initial investigation completed, recommendations made
  • 2026-01-30 → 2026-01-31 - NO ACTION TAKEN
  • 2026-01-31T11:45:02Z - New bot image (:80015) built, STILL BROKEN
  • 2026-01-31T11:52:00Z - Follow-up investigation confirms regression
  • Present - 17+ hours of complete outage ongoing
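
The intervals behind the "17+ hours" figure can be recomputed directly from the timestamps in the timeline above:

```python
from datetime import datetime, timezone


def parse(ts):
    """Parse a timeline timestamp such as 2026-01-30T18:06:03Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


deployed = parse("2026-01-30T18:06:03Z")
first_failure = parse("2026-01-30T18:08:07Z")
image_built = parse("2026-01-31T11:45:02Z")
follow_up = parse("2026-01-31T11:52:00Z")

print(first_failure - deployed)   # 0:02:04
print(image_built - deployed)     # 17:38:59
print(follow_up - first_failure)  # 17:43:53
```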

This issue requires IMMEDIATE attention. Bot platform has been completely non-functional for over 17 hours affecting all users.
