CRITICAL REGRESSION: GraphQL emsg field rename breaking all bots for 17+ hours

## Summary

**Status**: CRITICAL - Complete bot platform outage for 17+ hours  
**Affected**: ALL marketplace bot versions (:20045, :20043, :20021, :30004, :80015)  
**Impact**: 10+ bots in CrashLoopBackOff with 100-224 restarts, ~40,000+ failed connection attempts

## Root Cause

Backend GraphQL schema renamed fields from `emessage_*` → `emsg_*` without backward compatibility:

**Subscription arguments:**
- `want_emessage_types` → `want_emsg_types` (required argument)

**FExternalMessageOutput fields:**
- `emessage_id` → `emsg_id`
- `emessage_persona_id` → `emsg_persona_id`
- `emessage_type` → `emsg_type`
- `emessage_from` → `emsg_from`
- `emessage_to` → `emsg_to`
- `emessage_external_id` → `emsg_external_id`
- `emessage_payload` → `emsg_payload`
- `emessage_created_ts` → `emsg_created_ts`

### Why This Is A Regression

This was **already investigated** on 2026-01-30T18:18:00Z with remediation recommendations, but **NO ACTION WAS TAKEN**. Bots continued failing with identical errors for 17+ hours.

## Critical Finding: Bot Build Pipeline Issue

🚨 **Even NEWLY BUILT bot images contain the broken code:**
- Bot image `adspy:80015` was created at `2026-01-31T11:45:02Z` 
- This is **17 HOURS AFTER** the backend breaking change
- Yet it STILL sends old field names in GraphQL subscriptions
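- **This strongly suggests the bot build pipeline is pulling stale code or reusing cached layers** (it is also possible the client code was never updated; Priority 2 below should distinguish these)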

## Error Examples

```
strawberry.execution [ERROR] Unknown argument 'want_emessage_types' on field 'Subscription.bot_threads_calls_tasks'
Did you mean 'want_emsg_types'?
```

```
Cannot query field 'emessage_id' on type 'FExternalMessageOutput'
Did you mean 'emsg_id'?
```
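To confirm the full rename mapping from logs rather than by hand, the two error shapes above can be parsed mechanically. This is a minimal sketch: the regular expressions are assumptions derived only from the two samples quoted here, and `extract_renames` is a hypothetical helper, not part of any existing tooling.

```python
import re

# Assumption: these patterns cover only the two error shapes quoted above.
UNKNOWN_NAME = re.compile(
    r"(?:Unknown argument|Cannot query field) '(?P<old>\w+)' "
    r"on (?:field|type) '(?P<owner>[\w.]+)'"
)
SUGGESTION = re.compile(r"Did you mean '(?P<new>\w+)'\?")

SAMPLE_LOG = """\
strawberry.execution [ERROR] Unknown argument 'want_emessage_types' on field 'Subscription.bot_threads_calls_tasks'
Did you mean 'want_emsg_types'?
Cannot query field 'emessage_id' on type 'FExternalMessageOutput'
Did you mean 'emsg_id'?
"""


def extract_renames(log_text: str) -> dict:
    """Map (owner, old_name) -> suggested new name for each error in the log."""
    renames = {}
    pending = None  # the most recent error still waiting for its suggestion
    for line in log_text.splitlines():
        error = UNKNOWN_NAME.search(line)
        if error:
            pending = (error.group("owner"), error.group("old"))
        hint = SUGGESTION.search(line)
        if hint and pending is not None:
            renames[pending] = hint.group("new")
            pending = None
    return renames


if __name__ == "__main__":
    for (owner, old), new in extract_renames(SAMPLE_LOG).items():
        print(f"{owner}: {old} -> {new}")
```

Run against the full backend log, the output should list every old name bots are still sending, which can be checked against the rename table in Root Cause.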

## Affected Pods (Isolated Namespace)

- `flexus-pod-bot-admonster-20045-rx` - CrashLoopBackOff (152 restarts)
- `flexus-pod-bot-boss-20045-rx` - CrashLoopBackOff (152 restarts)  
- `flexus-pod-bot-bob-30004-rx` - CrashLoopBackOff (145 restarts)
- `flexus-pod-bot-frog-20045-rx` - CrashLoopBackOff (142 restarts)
- `flexus-pod-bot-karen-20045-rx` - CrashLoopBackOff (152 restarts)
- `flexus-pod-bot-lawyerrat-20045-rx` - CrashLoopBackOff (149 restarts)
- `flexus-pod-bot-owl-strategist-20045-rx` - CrashLoopBackOff (152 restarts)
- `flexus-pod-bot-slonik-20045-rx` - CrashLoopBackOff (152 restarts)
- `flexus-pod-bot-rick-20021-rx` - CrashLoopBackOff (6 restarts)
- `flexus-pod-bot-adspy-80015-rx` - Intermittent failures

## Backend Details

- **Pod**: `backend-v1-deployment-568bb897bb-nxlzz`
- **Image**: `europe-west4-docker.pkg.dev/small-storage1/databases-and-such/refact-teams-backend:staging.2026-01-30T17-43-47Z`
- **Deployed**: `2026-01-30T18:06:03Z` (unchanged for 17+ hours)
- **Revision**: 121

## Business Impact

- **Complete bot platform failure** - all marketplace bots non-functional
- **17+ hours of outage** with no remediation
- **~40,000+ failed connection attempts** (10 bots × 4 attempts/min × 17h × 60min)
- Kubernetes resource thrashing from continuous pod restarts
- Backend logs flooded with validation errors
- Users see bots as completely broken
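For reference, the failed-attempt estimate above multiplies out as follows. The inputs are exactly the figures used in the estimate (10 bots, 4 attempts per minute per bot, 17 hours); the variable names are illustrative.

```python
# Inputs taken from the estimate in the list above.
bots = 10
attempts_per_minute = 4
outage_hours = 17

failed_attempts = bots * attempts_per_minute * outage_hours * 60
print(failed_attempts)  # 40800
```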

## Investigation Chain

1. ✅ Retrieved previous investigation - documented issue and recommendations
2. ✅ Checked current pod status - 10/15 pods in CrashLoopBackOff
3. ✅ Examined logs - identical GraphQL validation errors
4. ✅ Verified backend - no changes since breaking deployment
5. 🚨 **Critical**: Newest bot image (:80015) created 17h after breaking change STILL has old code
6. ✅ Backend logs show continuous validation error stream
7. ✅ Pod-operator detects failures but has "no callback configured" - no auto-remediation
8. ✅ Found related issue: database columns also not migrated (knowledge base NYeuBrlDQN)

## Immediate Actions Required

### Priority 1: Restore Service (Minutes)

**Add deprecated field aliases to backend GraphQL schema:**

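```python
# In backend GraphQL schema definition
from typing import Annotated, AsyncGenerator, List, Optional

import strawberry


@strawberry.type
class FExternalMessageOutput:
    # New fields (primary)
    emsg_id: str
    emsg_persona_id: str
    emsg_type: str
    # ... other emsg_* fields

    # Deprecated aliases for backward compatibility
    @strawberry.field(deprecation_reason="Use emsg_id")
    def emessage_id(self) -> str:
        return self.emsg_id

    @strawberry.field(deprecation_reason="Use emsg_persona_id")
    def emessage_persona_id(self) -> str:
        return self.emsg_persona_id

    # ... add aliases for all 8 renamed fields


# For subscription arguments: in Strawberry these are resolver parameters,
# so the old name is accepted as a second, deprecated argument.
@strawberry.type
class Subscription:
    @strawberry.subscription
    async def bot_threads_calls_tasks(
        self,
        # Both must be optional in the schema, otherwise old clients that
        # omit want_emsg_types are still rejected at validation time.
        want_emsg_types: Optional[List[str]] = None,
        want_emessage_types: Annotated[
            Optional[List[str]],
            strawberry.argument(deprecation_reason="Use want_emsg_types"),
        ] = None,
        # ... other existing arguments
    ) -> AsyncGenerator[FExternalMessageOutput, None]:  # placeholder: keep the real return type
        # Prefer the new name, fall back to the one old bots still send
        wanted = want_emsg_types if want_emsg_types is not None else want_emessage_types
        if wanted is None:
            raise ValueError("want_emsg_types is required")
        # ... existing subscription body goes here, filtering on `wanted`
        return
        yield  # unreachable; keeps this sketch an async generator
```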

**Why this first:** Backend can accept both old and new field names immediately, restoring all bots to functionality within minutes.

### Priority 2: Fix Bot Build Pipeline (30 min investigation)

**Investigate why new bot images contain stale code:**

1. Check what git ref/branch bot builds pull from
2. Verify Docker layer caching isn't using stale flexus-client-kit
3. Confirm ckit dependency version in bot requirements.txt
4. Check if bot Dockerfiles hardcode old ckit commit/tag
5. Review CI/CD pipeline for bot image builds

**Evidence of problem:** Bot `adspy:80015` created `2026-01-31T11:45:02Z` still has old GraphQL queries despite being built 17 hours after the backend change.

### Priority 3: Rebuild All Bot Images (After fixing build pipeline)

**Only after confirming build pipeline pulls latest code:**

1. Trigger forced rebuild of all marketplace bot images
2. Use `--no-cache` to prevent Docker layer reuse
3. Verify new images contain updated GraphQL subscription code
4. Deploy updated images to all namespaces
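Step 3 can be checked mechanically by scanning the source tree (or an unpacked image filesystem) for the old names before deploying. The sketch below is a hypothetical helper, not an existing script; the file suffixes and the pattern for old names are assumptions based on the rename list in Root Cause.

```python
import re
from pathlib import Path

# Assumption: every old name starts with "emessage_" or "want_emessage_".
OLD_NAME = re.compile(r"\b(?:want_)?emessage_\w+")


def find_stale_names(root, suffixes=(".py", ".graphql")):
    """Return {file path: [(line number, old name), ...]} for files under root."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in OLD_NAME.finditer(line):
                hits.setdefault(str(path), []).append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    import sys

    stale = find_stale_names(sys.argv[1] if len(sys.argv) > 1 else ".")
    for file_name, found in stale.items():
        for lineno, name in found:
            print(f"{file_name}:{lineno}: {name}")
    # Non-zero exit lets a CI step fail the build when old names remain.
    sys.exit(1 if stale else 0)
```

An empty result for a freshly built image would indicate the rebuild picked up the updated subscription code; any hit points at the file that still needs updating.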

### Priority 4: Process Improvements

1. **Add GraphQL schema compatibility testing** - automated tests that verify backward compatibility
2. **Implement coordinated deployments** - backend schema changes must coordinate with client updates
3. **Add bot failure alerting** - pod-operator should alert on mass bot failures
4. **Add automated rollback** - when >50% of bots fail, auto-rollback backend deployment
5. **Fix database migration** - per knowledge base NYeuBrlDQN, database columns are still named `emessage_*`
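The compatibility test in item 1 could start as a simple set comparison between a stored snapshot of the previous schema and the current one. This is a sketch under assumptions: the snapshots are represented as plain dicts of type name to field or argument names, and `removed_names` is a hypothetical helper; how the snapshots are produced (for example from a schema export) is left open.

```python
def removed_names(old, new):
    """Return {type name: names present in `old` but missing from `new`}."""
    removed = {}
    for type_name, old_fields in old.items():
        missing = set(old_fields) - set(new.get(type_name, ()))
        if missing:
            removed[type_name] = missing
    return removed


# Illustrative snapshots using two of the renamed fields from this issue.
OLD_SCHEMA = {"FExternalMessageOutput": {"emessage_id", "emessage_type"}}
NEW_SCHEMA = {"FExternalMessageOutput": {"emsg_id", "emsg_type"}}
# The same new schema with deprecated aliases kept, as in Priority 1.
NEW_SCHEMA_WITH_ALIASES = {
    "FExternalMessageOutput": {"emsg_id", "emsg_type", "emessage_id", "emessage_type"}
}

if __name__ == "__main__":
    print(removed_names(OLD_SCHEMA, NEW_SCHEMA))
    print(removed_names(OLD_SCHEMA, NEW_SCHEMA_WITH_ALIASES))
```

A CI job that fails whenever `removed_names` is non-empty would have flagged this rename before deployment, while additive changes such as deprecated aliases pass.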

## Related Issues

- **Database column names**: Backend code queries `emsg_*` fields but database columns are still named `emessage_*` (affects db_keeper and other services)
- **No automated remediation**: Pod-operator logs "no callback configured" - cannot auto-rebuild or rollback
- **Previous investigation ignored**: Same issue documented 2026-01-30, no action taken

## Files & References

- **Investigation report**: `investigations/graphql-emessage-regression-2026-01-31.json`
- **Knowledge base**: NYeuBrlDQN (database column mismatch)
- **Subscription template**: All bots use a subscription named `KarenThreads` (likely defined in flexus-client-kit)

## Timeline

- `2026-01-30T17:43:47Z` - Backend image built with breaking change
- `2026-01-30T18:06:03Z` - Backend deployed to staging
- `2026-01-30T18:08:07Z` - First bot failures detected
- `2026-01-30T18:18:00Z` - Initial investigation completed, recommendations made
- `2026-01-30 → 2026-01-31` - **NO ACTION TAKEN**
- `2026-01-31T11:45:02Z` - New bot image (:80015) built, STILL BROKEN
- `2026-01-31T11:52:00Z` - Follow-up investigation confirms regression
- Present - **17+ hours of complete outage ongoing**
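The intervals behind the "17+ hours" figure can be recomputed directly from the timestamps in the timeline. The snippet uses only the values listed above; `parse_utc` is an illustrative helper.

```python
from datetime import datetime


def parse_utc(ts):
    """Parse the timeline's timestamp format (all values are UTC)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


deployed = parse_utc("2026-01-30T18:06:03Z")
first_failure = parse_utc("2026-01-30T18:08:07Z")
new_image = parse_utc("2026-01-31T11:45:02Z")
follow_up = parse_utc("2026-01-31T11:52:00Z")

print(first_failure - deployed)  # 0:02:04
print(new_image - deployed)      # 17:38:59
print(follow_up - deployed)      # 17:45:57
```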

---

**This issue requires IMMEDIATE attention.** The bot platform has been completely non-functional for over 17 hours, affecting all users.
