
@Aparnap2 (Owner) commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Launched a multi-agent AI orchestration system for autonomous, context-aware operations decisions.
    • Added resilient LLM integration with a fallback chain, circuit breaker, and rule-based fallbacks (a minimal sketch follows this list).
    • Introduced agent memory persistence using Redis for learning and decision history.
    • Added metrics and evaluation tracking for agent accuracy and performance monitoring.
  • Documentation

    • Comprehensive architecture guide, developer guides, implementation plan, and testing checklist for the AI service.
  • Chores

    • Added extensive test suite for all agent components and expanded test infrastructure.
    • Updated environment configurations and test automation scripts.
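
Taken together, the resilience bullet above describes a fallback chain guarded by a circuit breaker. The following is a self-contained toy sketch of that pattern; it is illustrative only and does not use the PR's ResilientLLMClient, CircuitBreaker, or FallbackStrategy classes.

```python
import asyncio

class ToyCircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures (illustrative only)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

async def call_with_fallbacks(breaker, primary, fallbacks, rule_based):
    """Try the primary LLM, then each fallback, then a deterministic rule-based answer."""
    if not breaker.open:
        for attempt in (primary, *fallbacks):
            try:
                result = await attempt()
                breaker.record(success=True)
                return result
            except Exception:
                breaker.record(success=False)
    # Circuit open, or every provider failed: never raise, answer by rules.
    return rule_based()

# Stub providers standing in for OpenRouter / Ollama / rules.
async def flaky_openrouter():
    raise RuntimeError("rate limited")

async def local_ollama():
    return {"content": "ok from local model", "source": "ollama"}

def rules():
    return {"action": "escalate", "source": "rule_based"}

async def main():
    breaker = ToyCircuitBreaker()
    print(await call_with_fallbacks(breaker, flaky_openrouter, [local_ollama], rules))

asyncio.run(main())
```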


Aparnap2 and others added 9 commits January 24, 2026 11:24
- test_hunter.py: 30 tests for Zombie Hunter (EBS cleanup)
- test_watchman.py: 30 tests for Night Watchman (staging shutdown)
- test_guard.py: 28 tests for Access Guard (IAM revocation)
- Makefile.opsmate: Sequential container testing rig (no docker-compose)
- Add normalize_email() for +suffix handling in Guard
- Add create_initial_*_state() functions for each agent
- Fix should_shutdown quiet_hours edge case
- Fix execute_shutdown error handling

All 380 tests pass (3 skipped - Neo4j unavailable)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Neo4j: neo4j:echoteam123 (port 7474/7687)
Redis: echoteam-redis (port 6380)
PostgreSQL: postgres-pgvector (port 5432)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_e2e_api.py: 23 E2E API tests with FastAPI TestClient
- Add test_llm_eval.py: 12 LLM evaluation tests using DeepEval
- Add test_llm_eval_quick.py: Quick smoke test for minimal hardware load
- Update README with agent docs, test instructions, and new architecture

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add OpenRouter LLM client (httpx-based, OpenAI-compatible)
- Add ExecOpsAgent with context awareness, proactivity, and learning
- Add test_agent_api.py for testing agent decisions

Features:
- Context-aware decision making
- Proactive suggestions
- Learning from human feedback
- Event processing for PR, security, and resource events

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add SentinelAgent for PR compliance analysis
- Add HunterAgent for zombie resource detection
- Add GuardAgent for access control and security
- Add SupervisorAgent for parallel/sequential/hierarchical workflow orchestration
- Implement agent-to-agent messaging with Task, Result, Alert, Query types
- Add memory system for collective learning
- Create comprehensive test suite with 11 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Redis memory store with TTL and pattern recall
- LLM fallback chain (OpenRouter -> Ollama -> Rules)
- Circuit breaker for resilience
- Decision evaluation framework with accuracy tracking
- Real LLM evaluation tests with Ollama

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create docs/ folder with ARCHITECTURE.md, GUIDES.md, IMPLEMENTATION_PLAN.md
- Rename evals/ to deepeval_metrics/ for clarity (DeepEval-based metrics)
- Remove agent/archive/ duplicate tech_debt.py
- Update README with current architecture and project structure
- Remove all __pycache__ folders
- Update test_golden_cases.py import path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Infrastructure (Redis, Neo4j, LocalStack, Ollama)
- Unit tests (40+ test classes, 454 tests)
- Integration tests (212 tests)
- LLM evaluations (5 scenarios, real Ollama tests)
- AWS integration (28 tests with moto)
- Code coverage targets by module
- Performance testing checklist
- Security testing checklist
- Docker environment verification
- CI/CD pipeline quality gates
- Bug fixes required (334 deprecation warnings)
- Recommended improvements roadmap

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- test_evaluation.py: 39 tests (73% coverage, up from 45%)
  - DecisionRecord serialization
  - EvaluationMetrics accuracy calculation
  - DecisionStore Redis operations
  - AccuracyCalculator edge cases
  - MetricsCollector trends
  - EvaluationReporter reports
  - Health check function

- test_vertical_agents.py: 29 tests (88% coverage)
  - ActionProposalState TypedDict
  - Vertical routing (all event types)
  - Graph factory functions
  - Human approval workflow
  - Checkpointer selection

- test_sop.py: 15 tests (100% coverage)
  - SOPLoader initialization
  - Policy loading (found/not found)
  - Policy listing
  - Vector search fallback
  - Error handling

Total: 83 new tests | 525 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

This PR introduces a complete architectural redesign of the AI service from a single Sentinel agent to a multi-agent orchestration system (ExecOps/EchoTeam), adding core agents (ExecOpsAgent, Sentinel, Hunter, Guard, Watchman), LLM resilience patterns (circuit breaker, fallback chains), Redis-based memory storage, evaluation metrics framework, and comprehensive test coverage with documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Test & Environment Configuration**<br>`ai-service/.deepeval/*`, `ai-service/.env.test`, `ai-service/Makefile.opsmate` | Adds DeepEval cache and test run files, test environment config for Docker containers, and comprehensive Makefile orchestration for local test infrastructure with network, Redis, Neo4j, app builds, and multi-suite test targets. |
| **Documentation Updates**<br>`ai-service/README.md`, `ai-service/docs/{ARCHITECTURE,GUIDES,IMPLEMENTATION_PLAN,TESTING_CHECKLIST}.md` | Rebrands the platform from OpsMate to EchoTeam, shifts from a single-Sentinel focus to a multi-agent architecture (Sentinel, Hunter, Guard, Watchman), and extensively documents architecture, developer guides, implementation phases, and the testing roadmap. |
| **Core Agent System**<br>`ai-service/src/ai_service/agents/execops_agent.py`, `ai-service/src/ai_service/agents/multi_agent.py` | Introduces ExecOpsAgent for context-aware decision-making and a comprehensive multi-agent orchestration system (SupervisorAgent, SentinelAgent, HunterAgent, GuardAgent) with messaging, task delegation, and workflow coordination. Removes the archived tech_debt.py agent. |
| **Agent Enhancements**<br>`ai-service/src/ai_service/agents/{guard,hunter,watchman}/*` | Adds email normalization to Guard, plus state initialization helpers and metadata fields (event_id, org_id, urgency, status) to the Guard/Hunter/Watchman agents; fixes a quiet-hours boundary condition and error handling in Watchman. |
| **LLM Infrastructure**<br>`ai-service/src/ai_service/llm/{openrouter,fallback}.py` | Introduces OpenRouterClient for LLM calls via the OpenRouter API and ResilientLLMClient with a circuit breaker pattern, fallback strategy chains, and rule-based fallback logic for PR decisions, resource scans, and user access assessment (see the request sketch after this table). |
| **Memory & Evaluation**<br>`ai-service/src/ai_service/memory/redis_store.py`, `ai-service/src/ai_service/evaluation/metrics.py` | Adds a Redis-backed memory store (MemoryItem, RedisMemoryStore, AgentMemoryMixin) for agent persistence and recall; introduces a comprehensive evaluation metrics framework (DecisionRecord, DecisionStore, AccuracyCalculator, EvaluationReporter) with async Redis integration and decision tracking. |
| **Comprehensive Test Suite**<br>`ai-service/tests/test_{agent_api,e2e_api,evaluation,fallback,guard,hunter,llm_eval*,multi_agent,redis_memory,sop,vertical_agents,watchman}.py` | Adds 20+ test modules covering ExecOpsAgent, multi-agent workflows, LLM evaluation (GEval/Ollama), fallback/circuit-breaker behavior, Guard/Hunter/Watchman agents, Redis memory, evaluation metrics, SOP loading, and vertical agent routing, with extensive fixtures and async test support. |
| **Import Path Update**<br>`ai-service/src/ai_service/deepeval_metrics/test_golden_cases.py` | Updates the metric import path from ai_service.evals.metrics to ai_service.deepeval_metrics.metrics. |
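
As a rough illustration of the LLM Infrastructure row, an httpx-based, OpenAI-compatible call to OpenRouter has the following shape. This is a hedged sketch, not the PR's OpenRouterClient: the endpoint and default model are the values quoted later in this review, while the function name and error handling are invented here.

```python
import os
import httpx

async def chat_once(messages: list[dict], model: str = "liquid/lfm-2.5-1.2b-thinking:free") -> str:
    """One OpenAI-compatible chat completion against OpenRouter (illustrative sketch)."""
    # Env var name and default model are taken from values quoted in this review.
    api_key = os.environ["OPENROUTER_API_KEY"]
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model, "messages": messages, "temperature": 0.7},
        )
        resp.raise_for_status()
        data = resp.json()
        # OpenAI-compatible shape: choices[0].message.content holds the reply text.
        return data["choices"][0]["message"]["content"]
```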

Sequence Diagram(s)

sequenceDiagram
    actor Event
    participant Supervisor as SupervisorAgent
    participant Router as Vertical Router
    participant Agent as Specialized Agent<br/>(Sentinel/Hunter/Guard)
    participant LLM as ResilientLLMClient
    participant Circuit as CircuitBreaker
    participant Fallback as Fallback Chain
    participant Store as DecisionStore<br/>(Redis)
    
    Event->>Supervisor: coordinate_agents(workflow)
    Supervisor->>Router: route_event(event_type)
    Router->>Agent: delegate_task(event)
    Agent->>LLM: chat(messages)
    LLM->>Circuit: allow_request()
    alt Circuit CLOSED
        Circuit-->>LLM: OK
        LLM->>LLM: Call Primary LLM
        alt Primary Success
            LLM-->>Agent: response
        else Primary Fails
            LLM->>Fallback: execute_fallback()
            Fallback->>Fallback: Try Secondary (Ollama)
            Fallback-->>LLM: fallback_response
        end
    else Circuit OPEN
        Circuit-->>LLM: CircuitOpenError
        LLM->>Fallback: rule_based_fallback()
        Fallback-->>LLM: heuristic_response
    end
    LLM-->>Agent: result
    Agent->>Agent: process_result()
    Agent->>Store: record_decision(decision)
    Store-->>Agent: decision_id
    Agent-->>Supervisor: TaskResult
    Supervisor-->>Event: workflow_complete(results)
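
Read as code, the happy path above is a fan-out/fan-in: the supervisor delegates one task per agent, gathers results concurrently, and tolerates individual agent failures. A minimal, framework-agnostic sketch follows; the agent stubs and function names are hypothetical and do not reflect the repository's SupervisorAgent API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

AgentFn = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]

async def coordinate_parallel(agents: Dict[str, AgentFn], event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Fan one event out to every agent concurrently and collect per-agent results."""
    names = list(agents)
    gathered = await asyncio.gather(
        *(agents[name](event) for name in names), return_exceptions=True
    )
    results: List[Dict[str, Any]] = []
    for name, outcome in zip(names, gathered):
        if isinstance(outcome, Exception):
            # One agent failing should not sink the whole workflow.
            results.append({"agent": name, "error": str(outcome)})
        else:
            results.append({"agent": name, **outcome})
    return results

# Hypothetical stand-ins for Sentinel / Hunter / Guard.
async def sentinel(event: Dict[str, Any]) -> Dict[str, Any]:
    return {"decision": "approve"}

async def hunter(event: Dict[str, Any]) -> Dict[str, Any]:
    return {"zombie_volumes": 2}

async def guard(event: Dict[str, Any]) -> Dict[str, Any]:
    raise RuntimeError("IAM API unavailable")

print(asyncio.run(coordinate_parallel(
    {"sentinel": sentinel, "hunter": hunter, "guard": guard},
    {"type": "pr_opened", "pr_number": 42},
)))
```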

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Rationale: The PR introduces multiple interconnected systems (multi-agent orchestration, circuit breaker resilience, Redis memory, evaluation metrics) with 738+ lines in orchestration logic alone, 595+ lines in fallback patterns, 567+ lines in evaluation metrics, and significant async/await patterns across distributed components. While individual test files are relatively homogeneous, the diversity of production logic across agents, LLM infrastructure, memory stores, and orchestration demands separate reasoning for correctness of async flows, state management, and inter-component interactions.

Possibly related PRs

  • Feature/sentinel vertical #5 — Introduces the initial multi-agent architecture concept and README/project-structure rebranding (OpsMate → EchoTeam) that this PR directly implements at the code level with Sentinel/Hunter/Guard/Watchman agents and orchestration system.
  • fix: resolve production issues across agents and webhooks #4 — Modifies Guard agent departure-detection logic and webhook handling with overlapping changes to GuardState, email normalization patterns, and Watchman/Hunter client integration patterns in the same code paths.
  • Feature/execops pivot #1 — Implements the ExecOps vertical-agent pivot across the same core modules (agents/multi_agent, llm fallback/openrouter, memory/redis_store, graphs/vertical_agents) with direct code-level overlap in agent registration and workflow orchestration.

Poem

🐰 Multi-agents hopping in a line,
Sentinel, Hunter, Guard so fine,
Circuit breaker catching falls,
Redis memory recalls them all,
From event to decision in stride,
Orchestration's our AI pride! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'Feature/tests and docs' is generic and does not clearly summarize the primary change. It lists categories ('tests and docs') rather than describing what specific feature or functionality is being added. | Consider using a more descriptive title that highlights the main feature or architectural change, such as 'Add ExecOps agent system with multi-agent orchestration and comprehensive testing' or 'Introduce intelligent agent framework with evaluation metrics and fallback strategies'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 89.29%, which is sufficient. The required threshold is 80.00%. |



@gemini-code-assist (Contributor)

Summary of Changes

Hello @Aparnap2, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ExecOps AI Service by introducing a sophisticated multi-agent architecture, improving its resilience through advanced LLM fallback mechanisms, and providing agents with persistent memory capabilities. These foundational changes are supported by a comprehensive testing suite and detailed documentation, ensuring the system is robust, maintainable, and scalable for future development.

Highlights

  • Multi-Agent Orchestration System: A new multi-agent orchestration system has been introduced, enabling parallel, sequential, and hierarchical workflows for agents like Sentinel, Hunter, and Guard. This allows for more complex and coordinated AI operations.
  • LLM Resilience with Fallback and Circuit Breaker: The system now features a robust LLM fallback strategy (OpenRouter -> Ollama -> Rule-based) and a Circuit Breaker pattern. This ensures continuous operation and prevents cascading failures even if primary LLM services become unavailable.
  • Persistent Agent Memory (Redis-backed): Agents are now equipped with a Redis-backed memory store, allowing them to store and recall context efficiently. This enhances their context-awareness and learning capabilities across interactions (a minimal sketch of the pattern follows these highlights).
  • Comprehensive LLM Evaluation Framework: A new evaluation framework has been added to track agent decisions, record human feedback, calculate accuracy, and generate performance reports. This provides crucial insights into the effectiveness and reliability of LLM-driven actions.
  • Extensive Testing Infrastructure and Documentation: The pull request includes a new Makefile for Docker-based testing, a detailed testing checklist, and numerous new unit, integration, and end-to-end tests. Accompanying architectural and developer documentation has also been significantly expanded.
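
To make the memory highlight concrete, the underlying pattern is a TTL-bounded JSON value per (agent, key) pair. The standalone redis-py sketch below shows that shape only; it is not the PR's RedisMemoryStore or AgentMemoryMixin API, and the key naming is invented.

```python
import json
import redis.asyncio as redis

async def remember(client: redis.Redis, agent: str, key: str, value: dict, ttl_seconds: int = 3600) -> None:
    """Store an agent memory as JSON under a namespaced key with an expiry."""
    await client.setex(f"agent:{agent}:{key}", ttl_seconds, json.dumps(value))

async def recall(client: redis.Redis, agent: str, key: str) -> dict | None:
    """Fetch a previously stored memory, or None if it expired or never existed."""
    raw = await client.get(f"agent:{agent}:{key}")
    return json.loads(raw) if raw else None

async def demo() -> None:
    client = redis.from_url("redis://localhost:6380")  # port taken from this PR's test setup
    await remember(client, "hunter", "last_scan", {"zombie_volumes": 3})
    print(await recall(client, "hunter", "last_scan"))
    await client.close()

# To try it against a live Redis: import asyncio; asyncio.run(demo())
```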
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation and chat, as well as code reviews, directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai bot left a comment

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ai-service/src/ai_service/deepeval_metrics/test_golden_cases.py (1)

11-11: Docstring references outdated file path.

The docstring still references src/ai_service/evals/test_golden_cases.py, but based on the import path change and the file location under deepeval_metrics/, this should be updated.

📝 Suggested fix
-Run with: pytest src/ai_service/evals/test_golden_cases.py -v
+Run with: pytest src/ai_service/deepeval_metrics/test_golden_cases.py -v
🤖 Fix all issues with AI agents
In `@ai-service/docs/GUIDES.md`:
- Around line 143-165: Update the documentation examples to match the actual
API: include the required agent parameter when calling
DecisionStore.record_feedback (i.e., call record_feedback with decision_id,
agent, feedback, reasoning) and replace the non-existent
AccuracyCalculator.calculate(...) example with the real API—either show using
AccuracyCalculator.calculate_from_records(records) which accepts a list of
DecisionRecord objects, or use DecisionStore.get_accuracy(agent, days) for the
documented agent/time-window use case; reference DecisionStore.record_feedback,
AccuracyCalculator.calculate_from_records, and DecisionStore.get_accuracy when
making the changes.

In `@ai-service/Makefile.opsmate`:
- Around line 67-74: The redis Makefile target has a malformed docker run
command: the port variable is mistyped as $(REDIS_PORT9 and the
image/tokenization is corrupted ("-p $(REDIS_PORT9 \n\t):637redis:alpine"). Fix
the redis target (function/target name: redis) to use the correct port variable
$(REDIS_PORT), ensure the -p mapping is "-p $(REDIS_PORT):6379", and restore the
image reference to "redis:alpine"; keep the existing references to
$(REDIS_CONTAINER) and $(NETWORK_NAME) and preserve the 2>/dev/null || echo
fallback behavior.

In `@ai-service/src/ai_service/agents/execops_agent.py`:
- Around line 124-139: The _parse_decision function can raise ValueError when
ActionType(parsed.get("action", "approve").lower()) gets an unexpected value;
update _parse_decision to validate or catch errors when converting to
ActionType: extract the raw_action = parsed.get("action", "approve"), normalize
it (e.g., raw_action.lower()), then attempt to map/convert to ActionType inside
a try/except (catch ValueError or KeyError) and on failure set action =
ActionType.APPROVE (safe fallback); finally construct and return the
AgentDecision with that safe action while keeping the existing reasoning,
confidence, context_summary, and suggestions logic (a generic sketch of this enum guard appears after this block of fix prompts).

In `@ai-service/src/ai_service/agents/multi_agent.py`:
- Around line 592-600: The test calls a non-existent learn_from_feedback API and
feedback_patterns on HunterAgent, causing AttributeError; add these to the
Agent/HunterAgent implementation by introducing a feedback_patterns attribute
(e.g., self.feedback_patterns: Dict[str, List]) on Agent and implement an async
learn_from_feedback(self, decision_id: str, feedback: str, suggestion: str |
None) method that updates feedback_patterns (keyed by decision_id) and persists
or processes feedback as HunterAgent expects; alternatively, adjust
test_collective_memory to stop calling learn_from_feedback and not inspect
feedback_patterns if you prefer not to change Agent.

In `@ai-service/src/ai_service/evaluation/metrics.py`:
- Around line 263-266: In get_feedback_rate, the call to _get_client() is a
NameError because the helper is an instance method; change the call to use
self._get_client() inside async def get_feedback_rate(self, agent: str) -> float
to obtain the client. Locate get_feedback_rate in metrics.py and replace the
bare _get_client() invocation with self._get_client(), ensuring any awaited
usage remains (await self._get_client()) and no other references to the unscoped
function remain in this method.
- Around line 211-218: get_accuracy currently expects a concrete agent name and
builds a key pattern using f"{self._prefix}{agent}:*", so when generate_report
passes the string "all" no real agent keys match and accuracy is 0; change
get_accuracy signature to accept agent: Optional[str] = None and when agent is
None build the pattern as f"{self._prefix}*"(scan all agents) otherwise
f"{self._prefix}{agent}:*"; also update generate_report to pass None instead of
the literal "all" when it intends to aggregate across all agents so the new
pattern will match all decision keys; keep references to get_accuracy and
generate_report when making these changes.

In `@ai-service/src/ai_service/llm/openrouter.py`:
- Around line 66-92: In chat_stream, handle the stream sentinel and fix
attribute access: first check if the line after "data: " equals "[DONE]" and
break/return to end the generator before attempting JSON parsing; otherwise
parse the JSON and access the delta as a dict (e.g., use
data["choices"][0].get("delta", {}) and then .get("content", "")) instead of
data["choices"][0].delta.get(...); keep the existing checks for "choices" and
only yield non-empty content. Ensure these changes are applied inside the
chat_stream async context where the stream lines are iterated.

In `@ai-service/src/ai_service/memory/redis_store.py`:
- Around line 195-202: In retrieve(), preserve the key TTL when incrementing
MemoryItem.access_count: after fetching the item with client.get(full_key) and
before overwriting it, read the remaining TTL (e.g., await client.pttl(full_key)
or await client.ttl(full_key)), update item.access_count, then write the updated
value and restore the TTL—either use client.setex(full_key, seconds, ...) /
client.psetex with the remaining TTL or call client.set(...) followed by
client.pexpire(full_key, remaining_ms) if pttl returned milliseconds; handle
special pttl return values (-1 = no expiry, -2 = key missing) so you only
restore expiry when appropriate. This keeps create()'s setex-based TTL behavior
intact.
- Around line 349-365: The sync wrapper remember() uses asyncio.create_task()
which raises RuntimeError when no running loop; update remember() to follow the
same safe pattern used in recall(): attempt to get the current event loop and
call loop.run_until_complete(self._save_memory(key, value, ttl_seconds)) when a
loop is available, otherwise call asyncio.run(self._save_memory(...)); apply the
same change to forget() to call self._delete_memory(...) via run_until_complete
or asyncio.run as appropriate; reference functions: remember(), forget(),
_save_memory(), _delete_memory(), and recall() to locate and mirror the safe
event-loop handling.

In `@ai-service/tests/test_agent_api.py`:
- Around line 92-100: The curl command construction incorrectly slices the JSON
string (json_body[1:]) which removes the opening '{' and yields malformed JSON;
update the print that builds the curl -d payload to use the full json_body
variable (or otherwise preserve the complete JSON) in the block where json_body
is defined and used in the print statements so the -d argument contains valid
JSON.
- Around line 93-103: The json module is imported inside the if __name__ ==
"__main__" block but used earlier in main() when building json_body, causing a
NameError; move the import json to the top-level imports (so main() and any
top-level code can use it), and remove the trailing import json from the bottom
of the file (keep the existing main(), json_body construction, and the if
__name__ block unchanged aside from removing that bottom import).

In `@ai-service/tests/test_e2e_api.py`:
- Around line 172-208: The first test (test_github_webhook_validates_signature)
uses a too-broad assertion; change it to assert response.status_code == 200 to
reflect test env behavior when GITHUB_WEBHOOK_SECRET is not set and
verify_signature returns True. For the webhook handler, add explicit validation
before calling ainvoke_sentinel: ensure payload contains "pull_request" and that
pull_request.get("number") is present; if missing, log a clear warning and
return a 400 response instead of proceeding with pr_number=None so
ainvoke_sentinel is never called with incomplete data. Apply these changes by
updating the test function and the GitHub webhook handler (the code that reads
pull_request/pr_number and calls ainvoke_sentinel and verify_signature).

In `@ai-service/tests/test_vertical_agents.py`:
- Around line 298-312: The test calls create_vertical_agent_graph with
use_postgres=True which triggers the real Postgres checkpointer; update the test
to mock ai_service.infrastructure.checkpointer.get_sync_checkpointer (the
function used by create_vertical_agent_graph) so it returns a MagicMock
checkpointer, or alternatively set the environment flag
USE_POSTGRES_CHECKPOINTER to "false" before invoking
create_vertical_agent_graph; ensure you still patch get_vertical_graph and/or
langgraph.checkpoint.memory.MemorySaver as needed and return mock objects so the
test remains hermetic.
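
The execops_agent.py item above amounts to guarding an enum conversion with a safe default. A generic illustration follows; only ActionType.APPROVE is taken from the prompt, while the other members and the helper name are hypothetical.

```python
from enum import Enum

class ActionType(str, Enum):
    # APPROVE is referenced in the prompt above; the other members are illustrative.
    APPROVE = "approve"
    ESCALATE = "escalate"
    BLOCK = "block"

def parse_action(raw: object) -> ActionType:
    """Normalize an LLM-supplied action and fall back to APPROVE on anything unexpected."""
    try:
        return ActionType(str(raw).lower())
    except ValueError:
        # Unknown or malformed action: use the safe default instead of raising.
        return ActionType.APPROVE

assert parse_action("ESCALATE") is ActionType.ESCALATE
assert parse_action("do-something-weird") is ActionType.APPROVE
```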
🟡 Minor comments (20)
ai-service/tests/test_sop.py-178-186 (1)

178-186: Avoid mocking Path here; it can mask default regressions.
This patch makes the test pass even if the default path changes, so it won’t catch a regression.

✅ Suggested fix
-        with patch('ai_service.sop.loader.Path') as mock_path:
-            mock_path.return_value = Path("data/sops")
-            loader = get_sop_loader()
-
-            assert loader.sop_dir == Path("data/sops")
+        loader = get_sop_loader()
+        assert loader.sop_dir == Path("data/sops")
ai-service/docs/TESTING_CHECKLIST.md-52-52 (1)

52-52: Typo: describe-voles should be describe-volumes.

Proposed fix
-| EC2 | ✅ Available | `aws ec2 describe-voles --endpoint-url http://localhost:4566` |
+| EC2 | ✅ Available | `aws ec2 describe-volumes --endpoint-url http://localhost:4566` |
ai-service/tests/test_llm_eval_quick.py-6-6 (1)

6-6: Module-level print executes during pytest collection.

This print statement runs at import time, causing output during test discovery even when this specific test isn't executed. Consider moving it inside the test function or using logging with appropriate level.

Proposed fix
 import os
 OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "granite4:1b-h")
 
-print(f"Testing LLM: {OLLAMA_MODEL}")
-
 def test_geval_basic():
     """Basic GEval test with lenient criteria."""
+    print(f"Testing LLM: {OLLAMA_MODEL}")
     from deepeval.test_case import LLMTestCase, LLMTestCaseParams
ai-service/README.md-24-31 (1)

24-31: Surround tables with blank lines
Adds clarity and satisfies MD058 table linting.

Also applies to: 194-200, 229-235

ai-service/README.md-7-20 (1)

7-20: Add language identifiers to fenced code blocks
Markdownlint flags these fences without a language; consider text for diagrams and bash/text for structure blocks.

Also applies to: 70-74, 87-134

ai-service/tests/test_vertical_agents.py-132-139 (1)

132-139: Rename test_route_unknown_returns_none to match actual behavior
This test asserts a ValueError, so the name/docstring imply the wrong outcome.

💡 Suggested rename
-    def test_route_unknown_returns_none(self):
-        """Test routing returns None for unknown events."""
+    def test_route_unknown_raises(self):
+        """Test routing raises ValueError for unknown events."""
ai-service/README.md-162-165 (1)

162-165: Avoid user-specific absolute paths in run instructions
Using a repo-relative placeholder is more portable for readers.

✏️ Suggested edit
-cd /home/aparna/Desktop/founder_os/ai-service
+cd /path/to/ai-service
ai-service/tests/test_fallback.py-78-98 (1)

78-98: Using time.sleep in async test blocks the event loop.

time.sleep(0.2) at Line 89 is a blocking call inside an async function. Use await asyncio.sleep(0.2) instead.

Use async sleep
     async def run():
         await cb.record_failure()  # Opens circuit
         assert cb.state == CircuitState.OPEN
 
         # Wait for timeout
-        time.sleep(0.2)
+        await asyncio.sleep(0.2)
 
         # Try to allow request - should go to half_open
         result = await cb.allow_request()
         assert cb.state == CircuitState.HALF_OPEN
         return True
ai-service/docs/ARCHITECTURE.md-225-246 (1)

225-246: Security concern: Example shows hardcoded credentials in documentation.

The example environment variables include NEO4J_PASSWORD=echoteam123 which might be an actual password. Documentation should use placeholder values like your-password-here.

Use placeholder credentials
 # Database
 NEO4J_URI=bolt://localhost:7687
 NEO4J_USER=neo4j
-NEO4J_PASSWORD=echoteam123
+NEO4J_PASSWORD=your-secure-password
 REDIS_URL=redis://localhost:6380
-DATABASE_URL=postgresql://postgres:postgres@localhost:5432/postgres
+DATABASE_URL=postgresql://user:password@localhost:5432/dbname

 # LLM
-OPENROUTER_API_KEY=sk-or-v1-xxx
+OPENROUTER_API_KEY=your-openrouter-api-key
ai-service/src/ai_service/llm/fallback.py-339-346 (1)

339-346: Bare except silently swallows all exceptions including KeyboardInterrupt.

The bare except: pass in _extract_context hides parsing errors, making debugging difficult. At minimum, catch (json.JSONDecodeError, ValueError) and consider logging.

Improve exception handling
                 try:
                     import json
                     start = content.find("{")
                     if start >= 0:
                         data = json.loads(content[start:])
                         context.update(data)
-                except:
-                    pass
+                except (json.JSONDecodeError, ValueError):
+                    # Content doesn't contain valid JSON - that's OK for fallback
+                    pass
                 break
ai-service/docs/ARCHITECTURE.md-266-269 (1)

266-269: Docker commands use specific passwords that should be placeholder values.

Use placeholder in example commands
 # Start infrastructure
 docker run -d --name echoteam-neo4j -p 7687:7687 -p 7474:7474 \
-  -e NEO4J_AUTH=neo4j/echoteam123 neo4j:5.14
+  -e NEO4J_AUTH=neo4j/your-secure-password neo4j:5.14
ai-service/docs/GUIDES.md-119-141 (1)

119-141: Guide references non-existent API: breaker.call() and @with_circuit_breaker.

The CircuitBreaker class in fallback.py doesn't have a call() method, and the decorator is named with_resilience, not with_circuit_breaker. This will confuse developers.

Fix to match actual API
 ## Circuit Breaker Usage
 
 ### Manual Control
 ```python
-from ai_service.llm.fallback import CircuitBreaker, CircuitState
+from ai_service.llm.fallback import CircuitBreaker, CircuitOpenError
 
 breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30.0)
 
-try:
-    await breaker.call(my_function)
-    await breaker.record_success()
-except Exception:
-    await breaker.record_failure()
+# Check if request is allowed
+try:
+    await breaker.allow_request()
+    result = await my_function()
+    await breaker.record_success()
+except CircuitOpenError:
+    # Circuit is open, use fallback
+    pass
+except Exception:
+    await breaker.record_failure()
+    raise

Using Decorator

-from ai_service.llm.fallback import with_circuit_breaker
+from ai_service.llm.fallback import with_resilience

-@with_circuit_breaker(breaker)
+@with_resilience(fallbacks=[(fallback_func, 2)])
async def risky_operation():
    return await external_service_call()
ai-service/src/ai_service/llm/fallback.py-308-308 (1)

308-308: Type annotation mismatch: parameter typed as str but receives List[Dict].

The _rule_based_fallback method declares messages: str but it's called with List[Dict] from chat() at Line 306. This type mismatch could cause runtime issues or confuse type checkers.

Fix type annotation
-    async def _rule_based_fallback(self, messages: str) -> Dict[str, Any]:
+    async def _rule_based_fallback(self, messages: List[Dict]) -> Dict[str, Any]:
ai-service/src/ai_service/llm/fallback.py-489-518 (1)

489-518: with_resilience decorator assumes all decorated functions are async.

Line 508 unconditionally awaits the decorated function, but the decorator doesn't check if func is a coroutine function. This will fail for sync functions.

Handle both sync and async functions
     def decorator(func):
         circuit = CircuitBreaker()
         strategy = FallbackStrategy(fallbacks or [])
 
         async def wrapper(*args, **kwargs):
             await circuit.allow_request()
 
             try:
-                result = await func(*args, **kwargs)
+                if asyncio.iscoroutinefunction(func):
+                    result = await func(*args, **kwargs)
+                else:
+                    result = func(*args, **kwargs)
                 await circuit.record_success()
                 return result
             except Exception as e:
                 await circuit.record_failure()
                 if strategy.fallbacks:
                     return await strategy.execute()
                 raise
 
         return wrapper
     return decorator
ai-service/src/ai_service/agents/multi_agent.py-148-152 (1)

148-152: Bare except silently swallows JSON parsing errors with default fallback.

Multiple agent methods (Lines 151, 203, 255) use bare except when parsing LLM responses. This hides the actual error. Catch specific exceptions and consider logging for debugging.

Improve exception handling
         try:
             parsed = json.loads(result["content"][result["content"].find("{"):result["content"].rfind("}")+1])
             return parsed
-        except:
+        except (json.JSONDecodeError, KeyError, TypeError) as e:
+            logger.warning(f"Failed to parse LLM response: {e}")
             return {"decision": "approve", "reasoning": "Default approval", "risk_level": "low"}
ai-service/tests/test_agent_api.py-77-81 (1)

77-81: Potential KeyError if decision not found.

Per the learn_from_feedback API, if the decision_id is not found, it returns {"error": "Decision not found"} without a 'learned' key. This could cause a KeyError.

💡 Suggested defensive access
             feedback = await agent.learn_from_feedback(
                 decision_id=decision.decision_id,
                 feedback="approved",
             )
-            print(f"Feedback learned: {feedback['learned']}")
+            print(f"Feedback learned: {feedback.get('learned', False)}")
ai-service/tests/test_multi_agent.py-54-68 (1)

54-68: Unused variable guard.

The guard agent is created but never used. Either remove it or add assertions that verify the guard receives/processes the message:

Proposed fix
 def test_agent_messaging():
     """Test agent-to-agent messaging."""
     sentinel = SentinelAgent()
-    guard = GuardAgent()
 
     message = asyncio.run(sentinel.send_message(
         to_agent=AgentRole.GUARD,

Or, if you want to test message receipt:

     assert message.from_agent == "sentinel"
     assert message.to_agent == "guard"
+    # Verify guard can receive the message
+    response = asyncio.run(guard.receive_message(message))
+    assert response is not None
     print("Messaging: Message sent correctly")
ai-service/tests/test_hunter.py-126-143 (1)

126-143: Timeout mock may not work as intended.

The timeout_mock is defined as async def but aws_client.ec2.describe_volumes is a synchronous boto3 method. The async function assigned to it won't be awaited, so it will return a coroutine object instead of raising TimeoutError.

Looking at the scan_zombies implementation in nodes.py, it calls await aws_client.get_zombie_volumes() which internally uses asyncio.to_thread(self.ec2.describe_volumes, ...). The mock should target the correct level.

Proposed fix using proper patching
     `@pytest.mark.asyncio`
     async def test_scan_zombies_edge_case_timeout(self, aws_client):
         """Test handling API timeout (Edge Case per checklist)."""
-        # Mock a timeout scenario by patching the ec2 client
-        original_describe_volumes = aws_client.ec2.describe_volumes
-
-        async def timeout_mock(*args, **kwargs):
-            raise TimeoutError("API timeout")
-
-        aws_client.ec2.describe_volumes = timeout_mock
-
-        result = await scan_zombies(aws_client)
-
-        # Should handle error gracefully
-        assert "error_message" in result
-        assert "timeout" in result["error_message"].lower() or result["error_message"] is not None
-
-        # Restore for cleanup
-        aws_client.ec2.describe_volumes = original_describe_volumes
+        with patch.object(aws_client, 'get_zombie_volumes', side_effect=TimeoutError("API timeout")):
+            result = await scan_zombies(aws_client)
+
+            # Should handle error gracefully
+            assert "error_message" in result
+            assert "timeout" in result["error_message"].lower()
ai-service/src/ai_service/agents/execops_agent.py-124-124 (1)

124-124: Remove or mark the unused event_data parameter.

Line 124’s event_data isn’t used; Ruff flags this. Consider renaming to _event_data or removing it from the signature and callsite.

ai-service/tests/test_guard.py-610-610 (1)

610-610: Remove or intentionally ignore the unused IAM scan result.

Line 610 assigns iam_users but it’s never used, which will trigger Ruff F841. Use _ = ... or assert on it.

♻️ Proposed fix
-        iam_users = await scan_iam_users(aws_client)
+        _ = await scan_iam_users(aws_client)
🧹 Nitpick comments (33)
ai-service/.deepeval/.temp_test_run_data.json (1)

1-1: Consider excluding auto-generated deepeval artifacts from version control.

This appears to be a temporary file generated by the deepeval framework during test runs. Committing such files can lead to noisy diffs and merge conflicts. Consider adding .deepeval/ to .gitignore.

ai-service/.deepeval/.latest_test_run.json (1)

1-1: Auto-generated test artifact with failing test result - consider excluding from VCS.

This file captures a failed test run (score 0.5 < threshold 0.7). Similar to other .deepeval/ files, this appears to be a runtime-generated artifact. Committing test run results can clutter history and may inadvertently expose test environment details.

Consider adding the entire .deepeval/ directory to .gitignore.

ai-service/docs/TESTING_CHECKLIST.md (2)

484-496: Consider using placeholder credentials in documentation examples.

While these are test credentials, documenting actual passwords (even for local testing) can lead to accidental reuse or copy-paste into production configurations. Consider using generic placeholders like <password> with a note to refer to .env.test.

Proposed change
 # Neo4j
-curl http://localhost:7474 -u neo4j:echoteam123
+curl http://localhost:7474 -u neo4j:<NEO4J_PASSWORD>

199-205: Add language identifier to fenced code block for consistency.

The fallback chain diagram could use a language identifier (e.g., text or plaintext) for consistency with other code blocks in the document.

Proposed fix
-```
+```text
 Primary: OpenRouter API
     ↓ (rate limit / error)
ai-service/tests/test_llm_eval.py (2)

23-47: Surface HTTP errors explicitly in query()
Calling raise_for_status() improves failure visibility before JSON parsing.

🔧 Suggested tweak
             response = await client.post(
                 "http://localhost:11434/api/generate",
                 json={
                     "model": self.model,
@@
                 },
                 timeout=120.0,
             )
+            response.raise_for_status()
             data = response.json()

76-91: Replace bare except in JSON parsing blocks
Catching json.JSONDecodeError keeps parsing fallbacks while letting unexpected errors surface (or be logged).

🔧 Example pattern
-        except:
+        except json.JSONDecodeError:
             # Fallback: extract keywords from text
             content_lower = content.lower()

Also applies to: 132-146, 185-192, 228-235, 272-279

ai-service/src/ai_service/llm/openrouter.py (1)

16-20: Align env var names with docs / OpenRouter conventions
README references OPENROUTER_* vars while the client reads OPENAI_*. A fallback prevents misconfiguration.

🔧 Suggested fallback mapping
-OPENAI_API_KEY = _get_env("OPENAI_API_KEY")
-OPENAI_BASE_URL = _get_env("OPENAI_BASE_URL", "https://openrouter.ai/api/v1/chat/completions")
-OPENAI_MODEL = _get_env("OPENAI_MODEL_NAME", "liquid/lfm-2.5-1.2b-thinking:free")
+OPENAI_API_KEY = _get_env("OPENAI_API_KEY") or _get_env("OPENROUTER_API_KEY")
+OPENAI_BASE_URL = _get_env("OPENAI_BASE_URL") or _get_env(
+    "OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1/chat/completions"
+)
+OPENAI_MODEL = _get_env("OPENAI_MODEL_NAME") or _get_env(
+    "OPENROUTER_MODEL", "liquid/lfm-2.5-1.2b-thinking:free"
+)
ai-service/src/ai_service/memory/redis_store.py (1)

210-233: Consider SCAN over KEYS for pattern recall/clear
KEYS is O(N) and can block Redis under load; scan_iter is safer for production.

🔧 Example pattern for recall
-            keys = await client.keys(full_pattern)
             results = []
-
-            for key in keys:
+            async for key in client.scan_iter(full_pattern):
                 data = await client.get(key)
                 if data:
                     item = MemoryItem.from_dict(json.loads(data))
                     results.append(item.value)

Also applies to: 271-275

ai-service/src/ai_service/llm/fallback.py (6)

105-129: allow_request may throw CircuitOpenError despite docstring promising bool return.

The docstring says "Returns: True if request allowed, raises CircuitOpenError if not" which is accurate but the return type annotation -> bool is misleading. The method can raise an exception, so callers might not expect it. Additionally, Line 125 has an inconsistent error message using self.state.value which will always be "open" at that point.

Suggested improvements
-    async def allow_request(self) -> bool:
+    async def allow_request(self) -> bool:
         """Check if request should be allowed.
 
         Returns:
-            True if request allowed, raises CircuitOpenError if not
+            True if request is allowed
+
+        Raises:
+            CircuitOpenError: If circuit is open and recovery timeout hasn't passed
         """
         async with self._lock:
             if self.state == CircuitState.CLOSED:
                 return True
 
             if self.state == CircuitState.OPEN:
                 # Check if recovery timeout has passed
                 if self.last_failure_time:
                     elapsed = time.time() - self.last_failure_time
                     if elapsed >= self.recovery_timeout:
                         self.state = CircuitState.HALF_OPEN
                         self.success_count = 0
                         logger.info("Circuit breaker HALF_OPEN (testing recovery)")
                         return True
                 raise CircuitOpenError(
-                    f"Circuit breaker OPEN for {self.state.value}"
+                    f"Circuit breaker is OPEN (recovery timeout: {self.recovery_timeout}s)"
                 )
 
             # HALF_OPEN - allow single request
             return True

186-193: Backoff formula is linear, not exponential.

The sleep time 0.1 * (attempt + 1) results in 0.1s, 0.2s, 0.3s... which is linear progression. For exponential backoff, use 0.1 * (2 ** attempt) which gives 0.1s, 0.2s, 0.4s...

Fix exponential backoff
-                    await asyncio.sleep(0.1 * (attempt + 1))  # Exponential backoff
+                    await asyncio.sleep(0.1 * (2 ** attempt))  # Exponential backoff

240-246: Use explicit Optional type annotation per PEP 484.

The model: str = None parameter should use Optional[str] or str | None for clarity.

Fix type annotation
     async def chat(
         self,
         messages: List[Dict[str, str]],
-        model: str = None,
+        model: Optional[str] = None,
         temperature: float = 0.7,
         max_tokens: int = 1000,
         use_fallback: bool = True,
     ) -> Dict[str, Any]:

341-343: Naive JSON extraction may fail on nested structures or multiple JSON objects.

Using content.find("{") without finding the matching closing brace can fail to parse valid JSON that doesn't extend to the end of the string, or may include trailing garbage. Consider finding the matching } or using a more robust extraction.
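
For comparison, a balance-counting extractor is one more robust option. This is a generic sketch, not code from the PR, and it does not account for braces inside string literals.

```python
import json
from typing import Any, Optional

def extract_first_json(text: str) -> Optional[Any]:
    """Return the first balanced {...} object found in free-form LLM output, or None."""
    start = text.find("{")
    if start < 0:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None  # unbalanced braces

print(extract_first_json('LLM said: {"action": "approve", "meta": {"risk": "low"}} thanks'))
```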


475-482: Unused context parameter in rule_based_default.

The context parameter is never used. Either use it or remove it for clarity.

Use context or mark as unused
-def rule_based_default(context: Dict) -> Dict[str, Any]:
+def rule_based_default(context: Dict) -> Dict[str, Any]:  # noqa: ARG001
     """Default rule-based response."""
     return {
         "action": "escalate",
-        "reason": "Unable to process automatically",
+        "reason": f"Unable to process automatically: {context.get('event_type', 'unknown')}",
         "confidence": 0.3,
         "source": "rule_based",
     }

529-532: Bare except hides the cause of Redis health check failures.

Consider catching specific exceptions and logging for observability.

Improve exception handling
     try:
         redis_ok = await RedisClientSingleton.health_check()
-    except:
+    except Exception as e:
+        logger.warning(f"Redis health check failed: {e}")
         redis_ok = False
ai-service/tests/test_fallback.py (2)

6-7: sys.path.insert is fragile; prefer proper package installation.

Manipulating sys.path can cause import issues and is generally discouraged. Consider using pip install -e . (editable install) or configuring pyproject.toml with proper test paths.


150-154: Use pytest.raises or AssertionError instead of assert False.

assert False is removed when Python runs with -O optimization flag. Use raise AssertionError() or preferably pytest.raises.

Use proper assertion pattern
+    import pytest
+
     def test_no_fallback_raises_error(self):
         """Test when all fallbacks fail."""
         from ai_service.llm.fallback import FallbackStrategy, FallbackError
 
         async def fail():
             raise Exception("Always fails")
 
         strategy = FallbackStrategy([(fail, 1)])
 
-        async def run_test():
-            await strategy.execute()
-
-        try:
-            asyncio.run(run_test())
-            assert False, "Should have raised FallbackError"
-        except FallbackError:
-            print("No fallback raises error: PASS")
+        with pytest.raises(FallbackError):
+            asyncio.run(strategy.execute())
+        print("No fallback raises error: PASS")
ai-service/src/ai_service/agents/multi_agent.py (3)

51-51: Use explicit Optional type for nullable fields.

correlation_id: str = None should be correlation_id: Optional[str] = None per PEP 484.

Fix type annotation
 `@dataclass`
 class AgentMessage:
     """Agent-to-agent message."""
     id: str
     from_agent: str
     to_agent: str
     message_type: MessageType
     content: Dict[str, Any]
     timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
-    correlation_id: str = None
+    correlation_id: Optional[str] = None

336-341: Loop variable i is unused; use _ to indicate intentional discard.

Rename unused variable
         parallel_results = await asyncio.gather(*coroutines, return_exceptions=True)
-        for i, r in enumerate(parallel_results):
+        for _i, r in enumerate(parallel_results):
             if isinstance(r, Exception):
                 results.append({"error": str(r)})
             else:
                 results.append(r)

Or simply:

-        for i, r in enumerate(parallel_results):
+        for r in parallel_results:

525-526: f-strings without placeholders; remove f prefix.

Remove extraneous f prefix
-    print(f"Primary agent: sentinel")
-    print(f"Secondary agents: hunter, guard")
+    print("Primary agent: sentinel")
+    print("Secondary agents: hunter, guard")
ai-service/docs/IMPLEMENTATION_PLAN.md (3)

5-27: Add language specifier to ASCII diagram code block.

The fenced code block should specify a language (e.g., text or plaintext) for consistent rendering and to satisfy linters.

Add language specifier
-```
+```text
 ┌─────────────────────────────────────────────────────────────────┐
 │                    FastAPI Application                           │

48-74: Add language specifier to file structure code block.

Add language specifier
-```
+```text
 src/ai_service/
 ├── agents/

84-87: Add language specifier to environment variables code block.

Add language specifier
 ## Environment Variables (already in .env)
+```bash
 - `USE_REDIS_CHECKPOINTER=true`
 - `AGENT_LEARNING_ENABLED=true`
 - `AGENT_PROACTIVE_SCAN_INTERVAL=300`
+```

Or keep as bullet list without code block formatting.

ai-service/tests/test_redis_memory.py (3)

270-304: Consider using pytest for better test infrastructure.

The custom test runner works but lacks features like test isolation, better error reporting, and fixture support. Consider migrating to pytest which would also provide better CI integration.


99-115: Misleading mock setup for setex test.

The test sets up mock_client.set but asserts on mock_client.setex. While AsyncMock auto-creates the method, explicitly mocking setex would be clearer.

💡 Suggested fix
         # Mock Redis client
         mock_client = AsyncMock()
-        mock_client.set = AsyncMock(return_value=True)
+        mock_client.setex = AsyncMock(return_value=True)

         store._client = mock_client

132-147: Test could verify access_count increment.

Per the RedisMemoryStore.retrieve implementation, the access count should be incremented and written back. Consider adding assertions for this behavior for more thorough coverage.

ai-service/tests/test_hunter.py (1)

13-15: Consider using pytest's pythonpath configuration instead of sys.path manipulation.

Direct sys.path manipulation in test files can be fragile and may cause import issues in CI environments. Consider configuring pythonpath in pyproject.toml or pytest.ini:

[tool.pytest.ini_options]
pythonpath = ["src"]

This applies to all test files in this PR that use the same pattern.

ai-service/tests/test_multi_agent.py (2)

6-7: Hardcoded path may be fragile.

Using sys.path.insert(0, 'src') with a hardcoded relative path assumes the test is always run from the repository root. Consider using dynamic path resolution like other test files in this PR:

+from pathlib import Path
-sys.path.insert(0, 'src')
+_src_dir = Path(__file__).resolve().parents[1] / "src"
+sys.path.insert(0, str(_src_dir))

197-219: Broad exception handling in test runner.

While catching bare Exception is flagged by static analysis, in a manual test runner context it's acceptable to catch broadly for reporting purposes. However, consider logging the full traceback for better debugging:

         except Exception as e:
-            print(f"\n{name}: ERROR - {e}")
+            import traceback
+            print(f"\n{name}: ERROR - {e}")
+            traceback.print_exc()
             result = False
ai-service/tests/test_watchman.py (2)

202-214: Unused unpacked variable reason.

The reason variable is unpacked but never used. Use underscore prefix to indicate intentional discard:

-        is_active, reason = await check_activity(
+        is_active, _ = await check_activity(

361-389: Unused fixture sample_context.

The sample_context fixture is declared but not used since the test creates its own context. Remove it from the function signature:

     `@pytest.mark.asyncio`
-    async def test_execute_shutdown_live_mode(self, sample_context, aws_client):
+    async def test_execute_shutdown_live_mode(self, aws_client):
ai-service/tests/test_evaluation.py (1)

6-7: Hardcoded path may be fragile.

Same as noted for test_multi_agent.py - consider using dynamic path resolution:

+from pathlib import Path
-sys.path.insert(0, 'src')
+_src_dir = Path(__file__).resolve().parents[1] / "src"
+sys.path.insert(0, str(_src_dir))
ai-service/tests/test_e2e_api.py (1)

215-308: Minor duplication with test_hunter.py.

These tests overlap with the dedicated test_hunter.py module. While some redundancy in E2E tests is acceptable for sanity checks, consider whether this duplication is intentional or if the tests should be consolidated.

ai-service/src/ai_service/agents/guard/state.py (1)

66-116: Replace deprecated datetime.utcnow() with timezone-aware alternative.

datetime.utcnow() is deprecated since Python 3.12 because it returns a naive datetime. Use datetime.now(timezone.utc) for timezone-aware handling:

Diff
+from datetime import datetime, timezone
 
-        "scan_timestamp": datetime.utcnow().isoformat(),
+        "scan_timestamp": datetime.now(timezone.utc).isoformat(),

This pattern appears across multiple state initializers (Hunter, Watchman, Guard) and integrations throughout the codebase and should be updated consistently.

Comment on lines +143 to +165
## Recording Feedback

### Submit Decision Feedback
```python
from ai_service.evaluation.metrics import DecisionStore

store = DecisionStore()
await store.record_feedback(
    decision_id="dec_123",
    feedback="correct",  # or "incorrect"
    reasoning="Agent made the right call"
)
```

### Calculate Accuracy
```python
from ai_service.evaluation.metrics import AccuracyCalculator

accuracy = await AccuracyCalculator.calculate(
    agent="sentinel",
    time_window_days=30
)
print(f"Accuracy: {accuracy.percentage:.1%}")
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if DecisionStore and AccuracyCalculator exist in the codebase
echo "Searching for DecisionStore..."
rg -n "class DecisionStore" --type=py

echo ""
echo "Searching for AccuracyCalculator..."
rg -n "class AccuracyCalculator" --type=py

echo ""
echo "Searching for record_feedback method..."
rg -n "def record_feedback" --type=py

echo ""
echo "Checking if evaluation/metrics.py exists..."
fd -t f "metrics.py" | grep evaluation

Repository: Aparnap2/ExecOps

Length of output: 702


🏁 Script executed:

#!/bin/bash
# Get the DecisionStore class and its record_feedback method
echo "=== DecisionStore class and record_feedback method ==="
sed -n '122,220p' ai-service/src/ai_service/evaluation/metrics.py

echo ""
echo "=== AccuracyCalculator class and calculate method ==="
sed -n '292,350p' ai-service/src/ai_service/evaluation/metrics.py

Repository: Aparnap2/ExecOps

Length of output: 5537


🏁 Script executed:

#!/bin/bash
# Search for calculate method in AccuracyCalculator
echo "=== Searching for calculate method (not calculate_from_records) ==="
sed -n '292,400p' ai-service/src/ai_service/evaluation/metrics.py | head -100

Repository: Aparnap2/ExecOps

Length of output: 3745


🏁 Script executed:

#!/bin/bash
# Search for all methods in AccuracyCalculator
echo "=== All AccuracyCalculator methods ==="
rg -A 5 "class AccuracyCalculator" --type=py

echo ""
echo "=== Search for async calculate method ==="
rg "async.*def calculate" --type=py

echo ""
echo "=== Search for def calculate in AccuracyCalculator context ==="
awk '/class AccuracyCalculator/,/^class [A-Z]/ {print NR": "$0}' ai-service/src/ai_service/evaluation/metrics.py | head -150

Repository: Aparnap2/ExecOps

Length of output: 723


🏁 Script executed:

#!/bin/bash
# Get the full AccuracyCalculator class
echo "=== Full AccuracyCalculator class ==="
awk '/^class AccuracyCalculator:/,/^class / {print NR": "$0}' ai-service/src/ai_service/evaluation/metrics.py | grep -v "^$" | head -80

Repository: Aparnap2/ExecOps

Length of output: 128


🏁 Script executed:

#!/bin/bash
# Get line count of metrics.py
wc -l ai-service/src/ai_service/evaluation/metrics.py

# Get AccuracyCalculator class starting at line 292
echo ""
echo "=== AccuracyCalculator class (lines 292-370) ==="
sed -n '292,370p' ai-service/src/ai_service/evaluation/metrics.py

Repository: Aparnap2/ExecOps

Length of output: 3024


🏁 Script executed:

#!/bin/bash
# Search for all calculate methods
echo "=== All 'calculate' methods in metrics.py ==="
rg "def calculate|async def calculate" ai-service/src/ai_service/evaluation/metrics.py -n

echo ""
echo "=== DecisionStore.get_accuracy method ==="
sed -n '200,230p' ai-service/src/ai_service/evaluation/metrics.py

Repository: Aparnap2/ExecOps

Length of output: 1435


Fix API examples in documentation - method signatures and parameters do not match implementation.

The DecisionStore.record_feedback() call is missing the required agent parameter:

# Current (incorrect) example:
await store.record_feedback(
    decision_id="dec_123",
    feedback="correct",
    reasoning="Agent made the right call"
)

# Should include agent parameter:
await store.record_feedback(
    decision_id="dec_123",
    agent="sentinel",  # required parameter
    feedback="correct",
    reasoning="Agent made the right call"
)

Additionally, AccuracyCalculator.calculate() does not exist. The actual API is AccuracyCalculator.calculate_from_records(records), which expects a list of DecisionRecord objects, not agent and time_window parameters. Consider using DecisionStore.get_accuracy(agent, days) instead for the documented use case.
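
A corrected version of the documentation snippet, assuming the signatures verified above (record_feedback taking decision_id, agent, feedback, reasoning, and get_accuracy(agent, days) returning a float), could read:

```python
from ai_service.evaluation.metrics import DecisionStore

store = DecisionStore()

# Record feedback; the agent argument is required.
await store.record_feedback(
    decision_id="dec_123",
    agent="sentinel",
    feedback="correct",  # or "incorrect"
    reasoning="Agent made the right call",
)

# Accuracy over a time window comes from the store, not AccuracyCalculator.
# get_accuracy returns a float per the signature above; whether it is a
# fraction or a percentage should be confirmed before formatting.
accuracy = await store.get_accuracy(agent="sentinel", days=30)
print(f"Sentinel accuracy (30d): {accuracy}")
```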

🤖 Prompt for AI Agents
In `@ai-service/docs/GUIDES.md` around lines 143 - 165, Update the documentation
examples to match the actual API: include the required agent parameter when
calling DecisionStore.record_feedback (i.e., call record_feedback with
decision_id, agent, feedback, reasoning) and replace the non-existent
AccuracyCalculator.calculate(...) example with the real API—either show using
AccuracyCalculator.calculate_from_records(records) which accepts a list of
DecisionRecord objects, or use DecisionStore.get_accuracy(agent, days) for the
documented agent/time-window use case; reference DecisionStore.record_feedback,
AccuracyCalculator.calculate_from_records, and DecisionStore.get_accuracy when
making the changes.

Comment on lines +67 to +74
redis:
	@echo "Starting Redis container..."
	@docker run -d \
		--name $(REDIS_CONTAINER) \
		--net $(NETWORK_NAME) \
		-p $(REDIS_PORT9 \
		):637redis:alpine \
		2>/dev/null || echo "Redis may already be running"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "Makefile.opsmate" -type f

Repository: Aparnap2/ExecOps

Length of output: 89


🏁 Script executed:

cat -n ./ai-service/Makefile.opsmate | sed -n '60,80p'

Repository: Aparnap2/ExecOps

Length of output: 935


🏁 Script executed:

grep -n "REDIS_PORT\|REDIS_CONTAINER\|NETWORK_NAME" ./ai-service/Makefile.opsmate | head -20

Repository: Aparnap2/ExecOps

Length of output: 1200


Fix malformed Redis docker run command

The port mapping and image reference are corrupted (lines 72–73). The $(REDIS_PORT9 variable reference is invalid, and the image name is incorrectly concatenated with part of a port number. This will break make redis.

🐛 Proposed fix
 	@docker run -d \
 		--name $(REDIS_CONTAINER) \
 		--net $(NETWORK_NAME) \
-		-p $(REDIS_PORT9 \
-		):637redis:alpine \
+		-p $(REDIS_PORT):6379 \
+		redis:alpine \
 		2>/dev/null || echo "Redis may already be running"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
redis:
	@echo "Starting Redis container..."
	@docker run -d \
		--name $(REDIS_CONTAINER) \
		--net $(NETWORK_NAME) \
		-p $(REDIS_PORT9 \
		):637redis:alpine \
		2>/dev/null || echo "Redis may already be running"
redis:
	@echo "Starting Redis container..."
	@docker run -d \
		--name $(REDIS_CONTAINER) \
		--net $(NETWORK_NAME) \
		-p $(REDIS_PORT):6379 \
		redis:alpine \
		2>/dev/null || echo "Redis may already be running"
🤖 Prompt for AI Agents
In `@ai-service/Makefile.opsmate` around lines 67 - 74, The redis Makefile target
has a malformed docker run command: the port variable is mistyped as
$(REDIS_PORT9 and the image/tokenization is corrupted ("-p $(REDIS_PORT9
\n\t):637redis:alpine"). Fix the redis target (function/target name: redis) to
use the correct port variable $(REDIS_PORT), ensure the -p mapping is "-p
$(REDIS_PORT):6379", and restore the image reference to "redis:alpine"; keep the
existing references to $(REDIS_CONTAINER) and $(NETWORK_NAME) and preserve the
2>/dev/null || echo fallback behavior.

Comment on lines +124 to +139
    def _parse_decision(self, result: Dict, decision_id: str, event_type: str, event_data: Dict) -> AgentDecision:
        content = result.get("content", "")

        try:
            start = content.find("{")
            end = content.rfind("}") + 1
            if start >= 0 and end > start:
                parsed = json.loads(content[start:end])
                return AgentDecision(
                    decision_id=decision_id,
                    action=ActionType(parsed.get("action", "approve").lower()),
                    reasoning=parsed.get("reasoning", content[:200]),
                    confidence=parsed.get("confidence", 0.7),
                    context_summary=f"Event: {event_type}",
                    suggestions=parsed.get("suggestions", []),
                )

⚠️ Potential issue | 🟠 Major

Guard against invalid LLM action values to avoid ValueError.

If the model returns an unexpected action, ActionType(...) raises ValueError and the whole event processing fails. Default to APPROVE (or a safe fallback) when parsing fails.

🐛 Proposed fix
-                return AgentDecision(
-                    decision_id=decision_id,
-                    action=ActionType(parsed.get("action", "approve").lower()),
+                action_value = str(parsed.get("action", "approve")).lower()
+                try:
+                    action = ActionType(action_value)
+                except ValueError:
+                    action = ActionType.APPROVE
+                return AgentDecision(
+                    decision_id=decision_id,
+                    action=action,
                     reasoning=parsed.get("reasoning", content[:200]),
                     confidence=parsed.get("confidence", 0.7),
                     context_summary=f"Event: {event_type}",
                     suggestions=parsed.get("suggestions", []),
                 )
🧰 Tools
🪛 Ruff (0.14.14)

124-124: Unused method argument: event_data

(ARG002)

🤖 Prompt for AI Agents
In `@ai-service/src/ai_service/agents/execops_agent.py` around lines 124 - 139,
The _parse_decision function can raise ValueError when
ActionType(parsed.get("action", "approve").lower()) gets an unexpected value;
update _parse_decision to validate or catch errors when converting to
ActionType: extract the raw_action = parsed.get("action", "approve"), normalize
it (e.g., raw_action.lower()), then attempt to map/convert to ActionType inside
a try/except (catch ValueError or KeyError) and on failure set action =
ActionType.APPROVE (safe fallback); finally construct and return the
AgentDecision with that safe action while keeping the existing reasoning,
confidence, context_summary, and suggestions logic.

Comment on lines +592 to +600
    await hunter.learn_from_feedback(
        decision_id="test-123",
        feedback="approved",
        suggestion="Good find, proceed with cleanup",
    )

    patterns = hunter.feedback_patterns
    print(f"Feedback patterns stored: {list(patterns.keys())}")


⚠️ Potential issue | 🟠 Major

Test calls non-existent learn_from_feedback method on HunterAgent.

test_collective_memory() calls hunter.learn_from_feedback() and accesses hunter.feedback_patterns, but the Agent base class and HunterAgent don't define these. This test will fail with AttributeError.

Options to fix

Either add the missing method to the Agent base class:

class Agent:
    def __init__(self, role: AgentRole, llm_client: OpenRouterClient = None):
        # ... existing init ...
        self.feedback_patterns: Dict[str, List] = {}

    async def learn_from_feedback(self, decision_id: str, feedback: str, suggestion: str = None):
        """Learn from human feedback."""
        # Implementation here
        pass

Or remove/modify the test to not rely on this method:

-    # Test feedback learning
-    await hunter.learn_from_feedback(
-        decision_id="test-123",
-        feedback="approved",
-        suggestion="Good find, proceed with cleanup",
-    )
-
-    patterns = hunter.feedback_patterns
-    print(f"Feedback patterns stored: {list(patterns.keys())}")
+    # Note: feedback learning is handled by ExecOpsAgent, not base Agent
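
If the first option is chosen, a minimal in-memory sketch (no persistence; the shape of the stored entries is illustrative only) could be:

```python
from typing import Dict, List, Optional


class Agent:  # illustrative excerpt of the existing base class
    def __init__(self) -> None:
        # ... existing init ...
        self.feedback_patterns: Dict[str, List[dict]] = {}

    async def learn_from_feedback(
        self,
        decision_id: str,
        feedback: str,
        suggestion: Optional[str] = None,
    ) -> None:
        """Store human feedback so later decisions can consult it."""
        self.feedback_patterns.setdefault(decision_id, []).append(
            {"feedback": feedback, "suggestion": suggestion}
        )
```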
🤖 Prompt for AI Agents
In `@ai-service/src/ai_service/agents/multi_agent.py` around lines 592 - 600, The
test calls a non-existent learn_from_feedback API and feedback_patterns on
HunterAgent, causing AttributeError; add these to the Agent/HunterAgent
implementation by introducing a feedback_patterns attribute (e.g.,
self.feedback_patterns: Dict[str, List]) on Agent and implement an async
learn_from_feedback(self, decision_id: str, feedback: str, suggestion: str |
None) method that updates feedback_patterns (keyed by decision_id) and persists
or processes feedback as HunterAgent expects; alternatively, adjust
test_collective_memory to stop calling learn_from_feedback and not inspect
feedback_patterns if you prefer not to change Agent.

Comment on lines +211 to +218
    async def get_accuracy(self, agent: str, days: int = 30) -> float:
        """Calculate accuracy for an agent over time."""
        client = await self._get_client()

        try:
            pattern = f"{self._prefix}{agent}:*"
            keys = await client.keys(pattern)


⚠️ Potential issue | 🟠 Major

Aggregate accuracy for “all agents” currently returns 0.

generate_report() passes "all" when agent is None, but decision keys are stored by actual agent names; "all" won’t match any keys. Allow agent=None in get_accuracy to scan all keys, and pass None from generate_report.

🐛 Proposed fix (get_accuracy)
-    async def get_accuracy(self, agent: str, days: int = 30) -> float:
+    async def get_accuracy(self, agent: str = None, days: int = 30) -> float:
         """Calculate accuracy for an agent over time."""
         client = await self._get_client()
 
         try:
-            pattern = f"{self._prefix}{agent}:*"
+            pattern = f"{self._prefix}{agent}:*" if agent else f"{self._prefix}*"
             keys = await client.keys(pattern)
🐛 Proposed fix (generate_report)
-            accuracy = await self._store.get_accuracy(agent or "all")
+            accuracy = await self._store.get_accuracy(agent)
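
With both changes applied, callers could scope or aggregate the metric like this (a usage sketch, assuming store is a DecisionStore instance):

```python
overall = await store.get_accuracy()                      # all agents, default 30-day window
sentinel_only = await store.get_accuracy("sentinel", days=7)
```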
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    async def get_accuracy(self, agent: str, days: int = 30) -> float:
        """Calculate accuracy for an agent over time."""
        client = await self._get_client()

        try:
            pattern = f"{self._prefix}{agent}:*"
            keys = await client.keys(pattern)
    async def get_accuracy(self, agent: str = None, days: int = 30) -> float:
        """Calculate accuracy for an agent over time."""
        client = await self._get_client()

        try:
            pattern = f"{self._prefix}{agent}:*" if agent else f"{self._prefix}*"
            keys = await client.keys(pattern)
🤖 Prompt for AI Agents
In `@ai-service/src/ai_service/evaluation/metrics.py` around lines 211 - 218,
get_accuracy currently expects a concrete agent name and builds a key pattern
using f"{self._prefix}{agent}:*", so when generate_report passes the string
"all" no real agent keys match and accuracy is 0; change get_accuracy signature
to accept agent: Optional[str] = None and when agent is None build the pattern
as f"{self._prefix}*"(scan all agents) otherwise f"{self._prefix}{agent}:*";
also update generate_report to pass None instead of the literal "all" when it
intends to aggregate across all agents so the new pattern will match all
decision keys; keep references to get_accuracy and generate_report when making
these changes.

Comment on lines +349 to +365
    def remember(
        self,
        key: str,
        value: Any,
        ttl_seconds: Optional[int] = None,
    ) -> None:
        """Remember something (async wrapper).

        Args:
            key: Memory key
            value: Value to remember
            ttl_seconds: Optional TTL
        """
        import asyncio

        asyncio.create_task(self._save_memory(key, value, ttl_seconds))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd ai-service && find . -name "redis_store.py" -type f

Repository: Aparnap2/ExecOps

Length of output: 98


🏁 Script executed:

cat -n src/ai_service/memory/redis_store.py | head -400

Repository: Aparnap2/ExecOps

Length of output: 128


🏁 Script executed:

cat -n src/ai_service/memory/redis_store.py | sed -n '340,400p'

Repository: Aparnap2/ExecOps

Length of output: 128


🏁 Script executed:

cat -n ai-service/src/ai_service/memory/redis_store.py | sed -n '340,400p'

Repository: Aparnap2/ExecOps

Length of output: 2087


🏁 Script executed:

cat -n ai-service/src/ai_service/memory/redis_store.py | sed -n '1,50p'

Repository: Aparnap2/ExecOps

Length of output: 1605


🏁 Script executed:

cat -n ai-service/src/ai_service/memory/redis_store.py | sed -n '395,420p'

Repository: Aparnap2/ExecOps

Length of output: 883


🏁 Script executed:

rg -n "\.remember\(" ai-service/src --type py -B 2 -A 2

Repository: Aparnap2/ExecOps

Length of output: 1169


🏁 Script executed:

rg -n "\.forget\(" ai-service/src --type py -B 2 -A 2

Repository: Aparnap2/ExecOps

Length of output: 42


🏁 Script executed:

rg -n "def forget_async" ai-service/src --type py -A 5

Repository: Aparnap2/ExecOps

Length of output: 513


🏁 Script executed:

rg -n "_save_memory\|_delete_memory" ai-service/src --type py -B 1 -A 3

Repository: Aparnap2/ExecOps

Length of output: 42


🏁 Script executed:

python3 << 'EOF'
import asyncio

# Test 1: What happens with create_task() without running loop
try:
    asyncio.create_task(asyncio.sleep(0))
except RuntimeError as e:
    print(f"Test 1 - create_task() without loop: {e}")

# Test 2: What asyncio.run() does
async def dummy_coro():
    pass

# asyncio.run() will work (it creates a new loop)
try:
    asyncio.run(dummy_coro())
    print("Test 2 - asyncio.run() works fine")
except Exception as e:
    print(f"Test 2 - asyncio.run() error: {e}")

# Test 3: What get_running_loop() does without a loop
try:
    asyncio.get_running_loop()
except RuntimeError as e:
    print(f"Test 3 - get_running_loop() without loop: {e}")
EOF

Repository: Aparnap2/ExecOps

Length of output: 146


🏁 Script executed:

head -30 ai-service/setup.py ai-service/pyproject.toml 2>/dev/null | grep -i python

Repository: Aparnap2/ExecOps

Length of output: 119


🏁 Script executed:

rg -n "get_event_loop\|get_running_loop\|create_task" ai-service/src --type py

Repository: Aparnap2/ExecOps

Length of output: 42


🏁 Script executed:

cat -n ai-service/src/ai_service/memory/redis_store.py | sed -n '349,365p'

Repository: Aparnap2/ExecOps

Length of output: 580


🏁 Script executed:

cat -n ai-service/src/ai_service/memory/redis_store.py | sed -n '366,394p'

Repository: Aparnap2/ExecOps

Length of output: 968


🏁 Script executed:

python3 << 'EOF'
import asyncio

# Demonstrate the issue: create_task without running loop
async def sample():
    await asyncio.sleep(0)

try:
    # This is what happens in remember() and forget()
    task = asyncio.create_task(sample())
    print("create_task succeeded")
except RuntimeError as e:
    print(f"create_task error: {type(e).__name__}: {e}")

# Now show the recall() pattern works
try:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(sample())
    print("get_event_loop + run_until_complete works")
except Exception as e:
    print(f"Error: {e}")

# Show the suggested pattern works
try:
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        asyncio.run(sample())
    print("get_running_loop + asyncio.run pattern works")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: Aparnap2/ExecOps

Length of output: 357


Fix event loop handling in sync wrappers

asyncio.create_task() raises RuntimeError when called from sync contexts without a running event loop, breaking the remember() and forget() sync wrappers when invoked from non-async code.

The recall() method already handles this correctly with get_event_loop() + run_until_complete(). Apply the same pattern here, or use the suggested approach with get_running_loop() + asyncio.run().

🔧 Suggested safe wrapper
     def remember(
         self,
         key: str,
         value: Any,
         ttl_seconds: Optional[int] = None,
     ) -> None:
         import asyncio
-
-        asyncio.create_task(self._save_memory(key, value, ttl_seconds))
+        try:
+            loop = asyncio.get_running_loop()
+        except RuntimeError:
+            asyncio.run(self._save_memory(key, value, ttl_seconds))
+        else:
+            loop.create_task(self._save_memory(key, value, ttl_seconds))

Apply the same fix to forget() at lines 385-394.
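
For reference, the matching change to forget(), sketched under the assumption that it wraps self._delete_memory(key) as noted in the prompt below:

```python
    def forget(self, key: str) -> None:
        """Forget a memory (sync wrapper); assumes _delete_memory(key) exists."""
        import asyncio

        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            # No running loop: execute the coroutine to completion synchronously.
            asyncio.run(self._delete_memory(key))
        else:
            # Inside a running loop: schedule it without blocking the caller.
            loop.create_task(self._delete_memory(key))
```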

🧰 Tools
🪛 Ruff (0.14.14)

364-364: Store a reference to the return value of asyncio.create_task

(RUF006)

🤖 Prompt for AI Agents
In `@ai-service/src/ai_service/memory/redis_store.py` around lines 349 - 365, The
sync wrapper remember() uses asyncio.create_task() which raises RuntimeError
when no running loop; update remember() to follow the same safe pattern used in
recall(): attempt to get the current event loop and call
loop.run_until_complete(self._save_memory(key, value, ttl_seconds)) when a loop
is available, otherwise call asyncio.run(self._save_memory(...)); apply the same
change to forget() to call self._delete_memory(...) via run_until_complete or
asyncio.run as appropriate; reference functions: remember(), forget(),
_save_memory(), _delete_memory(), and recall() to locate and mirror the safe
event-loop handling.

Comment on lines +92 to +100
    for test in test_events:
        json_body = json.dumps({
            "event_type": test["event_type"],
            "event_data": test["data"],
        }, indent=6)
        print(f"\n# {test['name']}")
        print(f"""curl -X POST http://localhost:8000/api/v1/agent/process \\
  -H "Content-Type: application/json" \\
  -d '{json_body[1:]}'""")

⚠️ Potential issue | 🟠 Major

Incorrect JSON slicing produces malformed curl command.

json_body[1:] removes the first character (the opening {) from the JSON string, resulting in invalid JSON in the curl command output.

🐛 Proposed fix
     for test in test_events:
         json_body = json.dumps({
             "event_type": test["event_type"],
             "event_data": test["data"],
-        }, indent=6)
+        }, indent=2)
         print(f"\n# {test['name']}")
         print(f"""curl -X POST http://localhost:8000/api/v1/agent/process \\
   -H "Content-Type: application/json" \\
-  -d '{json_body[1:]}'""")
+  -d '{json_body}'""")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    for test in test_events:
        json_body = json.dumps({
            "event_type": test["event_type"],
            "event_data": test["data"],
        }, indent=6)
        print(f"\n# {test['name']}")
        print(f"""curl -X POST http://localhost:8000/api/v1/agent/process \\
  -H "Content-Type: application/json" \\
  -d '{json_body[1:]}'""")
    for test in test_events:
        json_body = json.dumps({
            "event_type": test["event_type"],
            "event_data": test["data"],
        }, indent=2)
        print(f"\n# {test['name']}")
        print(f"""curl -X POST http://localhost:8000/api/v1/agent/process \\
  -H "Content-Type: application/json" \\
  -d '{json_body}'""")
🤖 Prompt for AI Agents
In `@ai-service/tests/test_agent_api.py` around lines 92 - 100, The curl command
construction incorrectly slices the JSON string (json_body[1:]) which removes
the opening '{' and yields malformed JSON; update the print that builds the curl
-d payload to use the full json_body variable (or otherwise preserve the
complete JSON) in the block where json_body is defined and used in the print
statements so the -d argument contains valid JSON.

Comment on lines +93 to +103
        json_body = json.dumps({
            "event_type": test["event_type"],
            "event_data": test["data"],
        }, indent=6)
        print(f"\n# {test['name']}")
        print(f"""curl -X POST http://localhost:8000/api/v1/agent/process \\
  -H "Content-Type: application/json" \\
  -d '{json_body[1:]}'""")


if __name__ == "__main__":
    import json

⚠️ Potential issue | 🔴 Critical

json module used before it's imported.

The json module is used at line 93 inside main(), but imported at line 103 inside the if __name__ == "__main__": block. This will cause a NameError when main() executes.

🐛 Proposed fix: Move import to top of file
 import sys
 sys.path.insert(0, 'src')

 import asyncio
+import json
 from ai_service.agents.execops_agent import ExecOpsAgent

And remove the import from the bottom:

 if __name__ == "__main__":
-    import json
     asyncio.run(main())
🤖 Prompt for AI Agents
In `@ai-service/tests/test_agent_api.py` around lines 93 - 103, The json module is
imported inside the if __name__ == "__main__" block but used earlier in main()
when building json_body, causing a NameError; move the import json to the
top-level imports (so main() and any top-level code can use it), and remove the
trailing import json from the bottom of the file (keep the existing main(),
json_body construction, and the if __name__ block unchanged aside from removing
that bottom import).

Comment on lines +172 to +208
    def test_github_webhook_validates_signature(self, client, mock_github_payload):
        """Test that GitHub webhook signature is validated if configured."""
        # Without signature header, should still work (signature optional)
        response = client.post(
            "/api/v1/webhook/github",
            json=mock_github_payload,
            headers={
                "X-GitHub-Event": "pull_request",
                "X-GitHub-Delivery": "test-delivery-125",
            },
        )

        # Should accept request even without signature (depends on config)
        assert response.status_code in [200, 400, 401]

    def test_github_webhook_missing_pr_data(self, client):
        """Test handling of malformed webhook payload."""
        invalid_payload = {
            "action": "opened",
            # Missing pull_request field
        }

        response = client.post(
            "/api/v1/webhook/github",
            json=invalid_payload,
            headers={
                "Content-Type": "application/json",
                "X-GitHub-Event": "pull_request",
                "X-GitHub-Delivery": "test-delivery-126",
            },
        )

        # Webhook logs warning but still returns 200 with status
        # The webhook processes the event but may have partial data
        assert response.status_code == 200
        data = response.json()
        assert data["status"] == "processed"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find ai-service -type f -name "*.py" | xargs grep -l "webhook" | head -20

Repository: Aparnap2/ExecOps

Length of output: 981


🏁 Script executed:

cd ai-service && find . -type f -name "*.py" -path "*/webhooks/*"

Repository: Aparnap2/ExecOps

Length of output: 95


🏁 Script executed:

rg -nA 20 "def.*github.*webhook|async def.*github.*webhook" ai-service/src/

Repository: Aparnap2/ExecOps

Length of output: 3599


🏁 Script executed:

cat -n ai-service/src/ai_service/webhooks/github.py | head -150

Repository: Aparnap2/ExecOps

Length of output: 5691


🏁 Script executed:

cat -n ai-service/src/ai_service/integrations/webhook.py | head -150

Repository: Aparnap2/ExecOps

Length of output: 5696


🏁 Script executed:

cat -n ai-service/src/ai_service/webhooks/github.py | sed -n '120,180p'

Repository: Aparnap2/ExecOps

Length of output: 2513


🏁 Script executed:

cat -n ai-service/src/ai_service/webhooks/github.py | tail -30

Repository: Aparnap2/ExecOps

Length of output: 1144


🏁 Script executed:

cat -n ai-service/tests/test_e2e_api.py | sed -n '172,210p'

Repository: Aparnap2/ExecOps

Length of output: 1758


🏁 Script executed:

cat -n ai-service/tests/test_e2e_api.py | sed -n '1,60p'

Repository: Aparnap2/ExecOps

Length of output: 2212


🏁 Script executed:

rg -n "def.*client|@pytest.fixture.*client" ai-service/tests/test_e2e_api.py | head -10

Repository: Aparnap2/ExecOps

Length of output: 709


🏁 Script executed:

rg -n "ainvoke_sentinel|def ainvoke_sentinel" ai-service/src/ai_service/agents/sentinel/

Repository: Aparnap2/ExecOps

Length of output: 221


🏁 Script executed:

grep -n "GITHUB_WEBHOOK_SECRET" ai-service/src/ai_service/webhooks/github.py

Repository: Aparnap2/ExecOps

Length of output: 213


🏁 Script executed:

cat -n ai-service/src/ai_service/agents/sentinel/graph.py | sed -n '108,130p'

Repository: Aparnap2/ExecOps

Length of output: 983


🏁 Script executed:

grep -A 30 "def create_initial_state" ai-service/src/ai_service/agents/sentinel/ -r

Repository: Aparnap2/ExecOps

Length of output: 42


🏁 Script executed:

cat -n ai-service/src/ai_service/main.py | grep -A 5 "GITHUB_WEBHOOK_SECRET\|app ="

Repository: Aparnap2/ExecOps

Length of output: 272


🏁 Script executed:

rg -n "test.*signature|mock.*secret" ai-service/tests/test_e2e_api.py

Repository: Aparnap2/ExecOps

Length of output: 147


Line 185: Assertion is too broad—should expect 200 only in test environment.

When no GITHUB_WEBHOOK_SECRET is configured (typical for tests), verify_signature() returns True (line 38-43), so signature validation succeeds and status is 200. The assertion accepting [200, 400, 401] masks the actual behavior. Change to assert response.status_code == 200.

Line 206-208: Assertion is correct, but handler lacks validation for required fields.

The handler does return 200 for a payload with missing pull_request field. However, the handler extracts PR data with .get() defaults (lines 121-127), allowing pr_number to be None, which gets passed to ainvoke_sentinel() (line 173). The test comment claims "webhook logs warning" but the code logs with pr_number=None (line 129) without explicit validation or warning. Consider adding explicit validation for required fields (pull_request, number) to return 400 instead of silently processing incomplete data.
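
A hedged sketch of that validation as a small helper (names are illustrative; the actual handler in webhooks/github.py may structure this differently):

```python
from typing import Any, Dict, Optional, Tuple


def validate_pr_payload(payload: Dict[str, Any]) -> Tuple[Optional[int], Optional[str]]:
    """Return (pr_number, error); error is set when required fields are missing."""
    pull_request = payload.get("pull_request")
    if not isinstance(pull_request, dict):
        return None, "payload is missing the pull_request object"
    number = pull_request.get("number")
    if number is None:
        return None, "pull_request is missing the number field"
    return int(number), None
```

The handler could call this before building the Sentinel state and return a 400 response with the error message instead of invoking ainvoke_sentinel when error is not None.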

🤖 Prompt for AI Agents
In `@ai-service/tests/test_e2e_api.py` around lines 172 - 208, The first test
(test_github_webhook_validates_signature) uses a too-broad assertion; change it
to assert response.status_code == 200 to reflect test env behavior when
GITHUB_WEBHOOK_SECRET is not set and verify_signature returns True. For the
webhook handler, add explicit validation before calling ainvoke_sentinel: ensure
payload contains "pull_request" and that pull_request.get("number") is present;
if missing, log a clear warning and return a 400 response instead of proceeding
with pr_number=None so ainvoke_sentinel is never called with incomplete data.
Apply these changes by updating the test function and the GitHub webhook handler
(the code that reads pull_request/pr_number and calls ainvoke_sentinel and
verify_signature).

Comment on lines +298 to +312
    def test_create_with_postgres_checkpointer(self):
        """Test creating graph with postgres checkpointer (mocked)."""
        from ai_service.graphs.vertical_agents import create_vertical_agent_graph

        with patch('ai_service.graphs.vertical_agents.get_vertical_graph') as mock_get:
            mock_graph = MagicMock()
            mock_get.return_value = mock_graph

            with patch('langgraph.checkpoint.memory.MemorySaver') as mock_memory:
                mock_checkpointer = MagicMock()
                mock_memory.return_value = mock_checkpointer
                mock_graph.compile = MagicMock(return_value=MagicMock())

                graph = create_vertical_agent_graph("team_pulse", use_postgres=True)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the relevant files in the codebase
find . -type f -name "*.py" | grep -E "(vertical_agents|checkpointer)" | head -20

Repository: Aparnap2/ExecOps

Length of output: 270


🏁 Script executed:

# Look for the create_vertical_agent_graph function
rg "def create_vertical_agent_graph" -A 5

Repository: Aparnap2/ExecOps

Length of output: 988


🏁 Script executed:

# Check where USE_POSTGRES_CHECKPOINTER is defined and its default value
rg "USE_POSTGRES_CHECKPOINTER" -B 2 -A 2

Repository: Aparnap2/ExecOps

Length of output: 2304


🏁 Script executed:

# Find the get_sync_checkpointer function
rg "def get_sync_checkpointer" -A 10

Repository: Aparnap2/ExecOps

Length of output: 1018


🏁 Script executed:

# Check the test file to understand the current state
sed -n '298,312p' ai-service/tests/test_vertical_agents.py

Repository: Aparnap2/ExecOps

Length of output: 791


🏁 Script executed:

# Get the full implementation of create_vertical_agent_graph
sed -n '1,100p' ai-service/src/ai_service/graphs/vertical_agents.py | cat -n

Repository: Aparnap2/ExecOps

Length of output: 3598


🏁 Script executed:

# See more of the implementation
rg "def create_vertical_agent_graph" -A 30 ai-service/src/ai_service/graphs/vertical_agents.py

Repository: Aparnap2/ExecOps

Length of output: 1302


🏁 Script executed:

# Check the checkpointer implementation to see what actually happens
cat -n ai-service/src/ai_service/infrastructure/checkpointer.py | head -50

Repository: Aparnap2/ExecOps

Length of output: 1800


🏁 Script executed:

# Get the rest of the create_vertical_agent_graph implementation
rg "def create_vertical_agent_graph" -A 50 ai-service/src/ai_service/graphs/vertical_agents.py

Repository: Aparnap2/ExecOps

Length of output: 1936


🏁 Script executed:

# Check the full get_sync_checkpointer implementation
rg "def get_sync_checkpointer" -A 20 ai-service/src/ai_service/infrastructure/checkpointer.py

Repository: Aparnap2/ExecOps

Length of output: 682


🏁 Script executed:

# Check if there are other tests in the file that handle the environment variable
rg "USE_POSTGRES_CHECKPOINTER" ai-service/tests/test_vertical_agents.py -B 3 -A 3

Repository: Aparnap2/ExecOps

Length of output: 406


Mock the Postgres checkpointer to keep this unit test hermetic

With use_postgres=True and USE_POSTGRES_CHECKPOINTER defaulting to "true", this test will attempt to create a real Postgres checkpointer. The current mock of langgraph.checkpoint.memory.MemorySaver is not invoked on this code path. Mock ai_service.infrastructure.checkpointer.get_sync_checkpointer or set the environment variable to false to avoid the external database dependency.
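
One way to keep it hermetic is to patch the checkpointer factory as well; a sketch follows. The correct patch target depends on how vertical_agents imports the function, so it may need to be ai_service.graphs.vertical_agents.get_sync_checkpointer rather than the path shown:

```python
    # Assumes the test module already imports MagicMock and patch (as the original test does).
    def test_create_with_postgres_checkpointer(self):
        """Postgres checkpointer is mocked so no real database is needed."""
        from ai_service.graphs.vertical_agents import create_vertical_agent_graph

        with patch("ai_service.graphs.vertical_agents.get_vertical_graph") as mock_get, \
             patch("ai_service.infrastructure.checkpointer.get_sync_checkpointer") as mock_ckpt:
            mock_graph = MagicMock()
            mock_graph.compile.return_value = MagicMock()
            mock_get.return_value = mock_graph
            mock_ckpt.return_value = MagicMock()

            graph = create_vertical_agent_graph("team_pulse", use_postgres=True)

            assert graph is not None
            mock_ckpt.assert_called_once()
```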

🤖 Prompt for AI Agents
In `@ai-service/tests/test_vertical_agents.py` around lines 298 - 312, The test
calls create_vertical_agent_graph with use_postgres=True which triggers the real
Postgres checkpointer; update the test to mock
ai_service.infrastructure.checkpointer.get_sync_checkpointer (the function used
by create_vertical_agent_graph) so it returns a MagicMock checkpointer, or
alternatively set the environment flag USE_POSTGRES_CHECKPOINTER to "false"
before invoking create_vertical_agent_graph; ensure you still patch
get_vertical_graph and/or langgraph.checkpoint.memory.MemorySaver as needed and
return mock objects so the test remains hermetic.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several new files and modifies an existing one to enhance the AI service with new agents, improved infrastructure, and comprehensive testing. The changes include adding a Makefile for deployment testing, environment configuration, documentation, and evaluation metrics. The most significant change is the introduction of the ExecOpsAgent and multi-agent orchestration, along with fallback mechanisms for LLM calls and Redis-backed memory. The addition of new tests and documentation aims to ensure code quality and maintainability.

Comment on lines +72 to +73
-p $(REDIS_PORT9 \
):637redis:alpine \

high

There appears to be a typo in the port mapping. The intent seems to be mapping $(REDIS_PORT) on the host to port 6379 inside the Redis container, but the variable reference has a stray 9 ($(REDIS_PORT9) and redis:alpine has been spliced into the port argument. This will cause the Redis container to fail to start correctly.

-p $(REDIS_PORT9 \
		):637redis:alpine


 # Check if within quiet hours
-if quiet_hours_start <= 24:
+if quiet_hours_start < 24:

high

This condition should be <= 24 instead of < 24. If quiet_hours_start is exactly 24, the code will not behave as expected, since it will fall through to the next condition.

Suggested change
if quiet_hours_start < 24:
if quiet_hours_start <= 24:

Comment on lines +21 to +43
def normalize_email(email: str) -> str:
    """Normalize email address for comparison.

    Removes +suffixes commonly used for email filtering:
    - "bob+work@gmail.com" -> "bob@gmail.com"
    - "ALICE@EXAMPLE.COM" -> "alice@example.com"

    Args:
        email: Email address to normalize

    Returns:
        Normalized email address
    """
    if not email or "@" not in email:
        return email.lower() if email else email

    local, domain = email.split("@", 1)

    # Remove +suffix from local part
    if "+" in local:
        local = local.split("+")[0]

    return f"{local.lower()}@{domain.lower()}"

medium

This function normalizes email addresses, which is a good practice for comparing emails from different sources. However, it might be beneficial to add a comment explaining why this normalization is necessary, especially the removal of the +suffix part. This will help future developers understand the purpose of this function.
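
For example, the docstring could carry the rationale; the wording below is only a suggestion:

```python
def normalize_email(email: str) -> str:
    """Normalize email address for comparison.

    Different sources can spell the same mailbox differently: "+suffix"
    tags used for filtering (bob+work@gmail.com) and varying case
    (ALICE@EXAMPLE.COM). Stripping the +suffix and lowercasing both parts
    lets the Guard treat these variants as one identity when comparing
    access lists, instead of missing a match on a cosmetic difference.
    """
```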

 return False, "Team offline but outside quiet hours"

-# If no quiet hours configured, always shutdown when offline
+# If no quiet hours configured (quiet_hours_start >= 24), always shutdown when offline

medium

This comment is not accurate. The condition for no quiet hours configured is quiet_hours_start >= 24, not just quiet_hours_start.
