
Conversation

@asimurka (Contributor) commented Jan 12, 2026

Description

This PR adds an implementation of shields using the Moderations API, replacing the old Shields API, which does not work properly.
Input shields are now implemented as a separate API call made before the Responses API call itself. Output shields are not available.
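
A minimal sketch of the resulting flow, assuming the helper names introduced by this PR (run_shield_moderation, append_turn_to_conversation, DEFAULT_VIOLATION_MESSAGE from utils.shields); the responses.create keyword arguments are illustrative, not the endpoints' exact call:

```python
from llama_stack_client import AsyncLlamaStackClient

from utils.shields import (
    DEFAULT_VIOLATION_MESSAGE,
    append_turn_to_conversation,
    run_shield_moderation,
)


async def answer_query(
    client: AsyncLlamaStackClient, conversation_id: str, user_input: str
):
    """Run input shields first; call the Responses API only when nothing is flagged."""
    moderation = await run_shield_moderation(client, user_input)
    if moderation.blocked:
        violation = moderation.message or DEFAULT_VIOLATION_MESSAGE
        # Store the blocked turn so the violation shows up in conversation history.
        await append_turn_to_conversation(client, conversation_id, user_input, violation)
        return violation
    # Illustrative call; the real endpoints also pass instructions, tools, etc.
    return await client.responses.create(model="gpt-4o-mini", input=user_input)
```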

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Pre-request shield moderation now blocks unsafe inputs immediately, returning a concise violation response or structured violation stream.
    • Violations are appended to conversations for clearer audit trails.
  • Chores

    • Default OpenAI model fallback updated to gpt-4o-mini when unset.
    • Default test and runtime configurations updated to include llama-guard shields.
  • Tests

    • Expanded unit and e2e tests to cover moderation and shield behaviors.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai coderabbitai bot commented Jan 12, 2026

Walkthrough

Adds pre-LLM shield moderation that can short‑circuit Responses calls, new shield utilities and a ShieldModerationResult type, moves RateLimitError imports to llama_stack_client, and registers gpt-4o-mini model and llama-guard shields across runtime and test configs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Environment / Compose: docker-compose-library.yaml, docker-compose.yaml | Default E2E_OPENAI_MODEL changed to gpt-4o-mini (fallback ${E2E_OPENAI_MODEL:-gpt-4o-mini} added). |
| Runtime & E2E configs: run.yaml, tests/e2e/configs/run-*.yaml | Added gpt-4o-mini model entries; populated registered_resources.shields with llama-guard -> openai/gpt-4o-mini; added OpenAI remote provider blocks in several e2e configs. |
| Shield moderation utils & types: src/utils/shields.py, src/utils/types.py | New DEFAULT_VIOLATION_MESSAGE, run_shield_moderation(), append_turn_to_conversation(), and ShieldModerationResult model; moderation listing/creation, error handling, metrics increments, and NotFound/BadRequest handling implemented. |
| Endpoints — v2 & streaming: src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query_v2.py | Integrates pre-LLM run_shield_moderation() that short-circuits LLM requests on violation; removes post-response shield checks and available-shields injection; adds create_violation_stream() for streaming violations. |
| Endpoints — v1 & streaming import tweaks: src/app/endpoints/query.py, src/app/endpoints/streaming_query.py | Moved RateLimitError import to llama_stack_client alongside client imports (import grouping change only). |
| Unit & e2e tests: tests/unit/..., tests/e2e/features/steps/info.py, tests/e2e/configs/* | Updated tests to mock shields/models/moderations; switched many mocks to Mock and httpx.Response usage; updated tests to use gpt-4o-mini, adjusted expected messages, and added comprehensive tests for run_shield_moderation() and append_turn_to_conversation(). |
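
For orientation, the ShieldModerationResult type introduced in src/utils/types.py is a small Pydantic model; this sketch mirrors the definition quoted in the review diffs further down:

```python
from pydantic import BaseModel


class ShieldModerationResult(BaseModel):
    """Result of shield moderation check."""

    blocked: bool
    message: str | None = None
    shield_model: str | None = None
```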

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint as Query Endpoint v2
    participant ShieldUtil as Shield Utils
    participant ModelSvc as Model Registry
    participant ModSvc as Moderation Service
    participant LLM as Responses API

    Client->>Endpoint: Send user prompt
    Endpoint->>ShieldUtil: run_shield_moderation(input_text)
    ShieldUtil->>ModelSvc: list models / validate shield model
    ShieldUtil->>ModSvc: moderations.create(input_text, shield_model)
    ModSvc-->>ShieldUtil: moderation result (blocked/allowed)

    alt Blocked
        ShieldUtil-->>Endpoint: ShieldModerationResult(blocked=true, message)
        Endpoint->>Endpoint: append_turn_to_conversation(conversation, violation)
        Endpoint-->>Client: Return violation (single response or violation stream)
    else Allowed
        ShieldUtil-->>Endpoint: ShieldModerationResult(blocked=false)
        Endpoint->>LLM: responses.create(...)
        LLM-->>Endpoint: response / stream
        Endpoint-->>Client: Relay response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tisnik
  • manstis
  • are-ces
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: implementing shields using the Moderations API, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @docker-compose-library.yaml:
- Line 21: The environment variable default E2E_OPENAI_MODEL is set to
"gpt-4-mini", which appears to be a typo; update the default value to
"gpt-4o-mini" so it matches the actual OpenAI model naming and the rest of the
PR/shield configs—i.e., change the assignment for E2E_OPENAI_MODEL from
gpt-4-mini to gpt-4o-mini.

In @docker-compose.yaml:
- Line 19: The docker-compose environment default for E2E_OPENAI_MODEL uses an
invalid model name "gpt-4-mini"; update the default to the correct OpenAI model
"gpt-4o-mini" wherever this appears (e.g., the E2E_OPENAI_MODEL entry in
docker-compose.yaml and docker-compose-library.yaml) so that the environment
variable line (- E2E_OPENAI_MODEL=${E2E_OPENAI_MODEL:-gpt-4-mini}) becomes the
valid default with gpt-4o-mini.

In @src/app/endpoints/streaming_query_v2.py:
- Around line 472-517: In create_violation_stream, the
OpenAIResponseObjectStreamResponseOutputTextDelta yield is missing the
response_id field; update the OpenAIResponseObjectStreamResponseOutputTextDelta
instantiation (the one that currently sets content_index, delta, item_id,
output_index, sequence_number) to also include response_id=response_id so the
delta event matches the ContentPartAdded event and keeps event structure
consistent.

In @tests/e2e/features/steps/info.py:
- Around line 117-118: Update the assertion message for the check involving
found_shield["provider_resource_id"] so it matches the actual expected value
f"{expected_provider}/{expected_model} instead of only expected_model; modify
the failure string to include the full expected value (e.g.,
f\"{expected_provider}/{expected_model}\") and show the actual
found_shield["provider_resource_id"] so test failures report the correct
expected vs actual values.
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query_v2.py (1)

444-454: Potential double normalization of conversation_id.

At line 453, normalize_conversation_id(conversation_id) is called, but conversation_id may already be normalized (see line 437 where it's set via normalize_conversation_id(llama_stack_conv_id) for new conversations). For existing conversations, conversation_id comes from the request and may need normalization.

However, line 469 returns normalize_conversation_id(conversation_id) for the non-blocked path, which is consistent with line 453, so the behavior is uniform. The issue is that for new conversations, conversation_id is already normalized at line 437, so lines 453 and 469 perform redundant normalization.

Consider using llama_stack_conv_id handling more consistently or accepting the minor redundancy.

♻️ Optional: Use already-normalized ID
     if moderation_result.blocked:
         violation_message = moderation_result.message or ""
         await append_turn_to_conversation(
             client, llama_stack_conv_id, input_text, violation_message
         )
         return (
             create_violation_stream(violation_message, moderation_result.shield_model),
-            normalize_conversation_id(conversation_id),
+            conversation_id,  # Already normalized for new convs, or needs normalization for existing
         )
src/utils/shields.py (2)

83-93: Consider caching models list to reduce API calls.

client.models.list() and client.shields.list() are called on every moderation request. For high-traffic scenarios, this could create significant overhead. Consider caching the model list with a TTL, similar to how other configurations are cached in the codebase.

#!/bin/bash
# Check if there's existing caching patterns in the codebase
rg -n "cachetools|TTLCache|lru_cache" --type py -A 3
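
A minimal sketch of what such caching could look like using only the standard library; the TTL value and helper name are illustrative and not part of this PR:

```python
import time

_MODELS_CACHE: dict[str, tuple[float, list]] = {}
_MODELS_TTL_SECONDS = 60.0  # illustrative TTL


async def list_models_cached(client) -> list:
    """Return client.models.list(), refreshing the cached copy once the TTL expires."""
    now = time.monotonic()
    cached = _MODELS_CACHE.get("models")
    if cached is not None and now - cached[0] < _MODELS_TTL_SECONDS:
        return cached[1]
    models = list(await client.models.list())
    _MODELS_CACHE["models"] = (now, models)
    return models
```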

133-157: Missing error handling for conversation append failure.

If client.conversations.items.create fails (network error, invalid conversation_id, etc.), the exception will propagate up, potentially leaving the user without feedback about the shield violation. Consider wrapping this in a try-except with logging.

♻️ Add error handling for conversation append
 async def append_turn_to_conversation(
     client: AsyncLlamaStackClient,
     conversation_id: str,
     user_message: str,
     assistant_message: str,
 ) -> None:
     """
     Append a user/assistant turn to a conversation after shield violation.
     ...
     """
+    try:
         await client.conversations.items.create(
             conversation_id,
             items=[
                 {"type": "message", "role": "user", "content": user_message},
                 {"type": "message", "role": "assistant", "content": assistant_message},
             ],
         )
+    except Exception as exc:
+        logger.warning(
+            "Failed to append shield violation turn to conversation %s: %s",
+            conversation_id,
+            exc,
+        )
+        # Don't re-raise - the violation response should still be returned to user
tests/unit/utils/test_shields.py (1)

316-344: Consider adding test for multiple shields iteration.

The current tests cover single-shield scenarios. Consider adding a test where the first shield passes but the second shield blocks, to verify the iteration logic in run_shield_moderation.

🧪 Optional: Test for multiple shields
@pytest.mark.asyncio
async def test_returns_blocked_when_second_shield_flags_content(
    self, mocker: MockerFixture
) -> None:
    """Test that run_shield_moderation blocks when second shield flags content."""
    mocker.patch("utils.shields.metrics.llm_calls_validation_errors_total")
    mock_client = mocker.Mock()

    # Setup two shields
    shield1 = mocker.Mock()
    shield1.identifier = "shield-1"
    shield1.provider_resource_id = "model-1"
    shield2 = mocker.Mock()
    shield2.identifier = "shield-2"
    shield2.provider_resource_id = "model-2"
    mock_client.shields.list = mocker.AsyncMock(return_value=[shield1, shield2])

    # Setup models
    model1 = mocker.Mock(identifier="model-1")
    model2 = mocker.Mock(identifier="model-2")
    mock_client.models.list = mocker.AsyncMock(return_value=[model1, model2])

    # First shield passes, second blocks
    safe_result = mocker.Mock(flagged=False)
    flagged_result = mocker.Mock(flagged=True, categories=["violence"], user_message="Blocked")
    
    mock_client.moderations.create = mocker.AsyncMock(
        side_effect=[
            mocker.Mock(results=[safe_result]),
            mocker.Mock(results=[flagged_result]),
        ]
    )

    result = await run_shield_moderation(mock_client, "test input")

    assert result.blocked is True
    assert result.shield_model == "model-2"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ac487c and 82f8a81.

📒 Files selected for processing (20)
  • docker-compose-library.yaml
  • docker-compose.yaml
  • run.yaml
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/utils/shields.py
  • src/utils/types.py
  • tests/e2e/configs/run-azure.yaml
  • tests/e2e/configs/run-ci.yaml
  • tests/e2e/configs/run-rhaiis.yaml
  • tests/e2e/configs/run-rhelai.yaml
  • tests/e2e/configs/run-vertexai.yaml
  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/utils/test_shields.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...

Files:

  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/utils/types.py
  • tests/unit/utils/test_shields.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
  • tests/unit/app/endpoints/test_query.py
  • src/utils/shields.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests

Files:

  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/utils/test_shields.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_query.py
tests/e2e/features/steps/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define E2E test steps in tests/e2e/features/steps/ directory

Files:

  • tests/e2e/features/steps/info.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion

Files:

  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/utils/test_shields.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_query.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
🧠 Learnings (6)
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`

Applied to files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`

Applied to files:

  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • src/app/endpoints/query.py
  • tests/unit/app/endpoints/test_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling

Applied to files:

  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_query.py
📚 Learning: 2025-12-18T10:21:09.038Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:09.038Z
Learning: In Llama Stack version 0.3.x, telemetry provider configuration is not supported under the `providers` section in run.yaml configuration files. Telemetry can be enabled with just `telemetry.enabled: true` without requiring an explicit provider block.

Applied to files:

  • tests/e2e/configs/run-rhelai.yaml
📚 Learning: 2025-12-18T10:21:03.056Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:03.056Z
Learning: In run.yaml for Llama Stack 0.3.x, do not add a telemetry provider block under providers. To enable telemetry, set telemetry.enabled: true directly (no provider block required). This pattern applies specifically to run.yaml configuration in this repository.

Applied to files:

  • run.yaml
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling

Applied to files:

  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_query.py
🧬 Code graph analysis (5)
tests/unit/app/endpoints/test_streaming_query_v2.py (2)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
src/models/responses.py (1)
  • model (1732-1745)
tests/unit/utils/test_shields.py (1)
src/utils/shields.py (4)
  • append_turn_to_conversation (133-157)
  • detect_shield_violations (37-60)
  • get_available_shields (19-34)
  • run_shield_moderation (63-130)
src/app/endpoints/streaming_query_v2.py (2)
src/utils/shields.py (2)
  • append_turn_to_conversation (133-157)
  • run_shield_moderation (63-130)
src/utils/suid.py (1)
  • normalize_conversation_id (57-78)
src/app/endpoints/query_v2.py (4)
src/utils/shields.py (2)
  • append_turn_to_conversation (133-157)
  • run_shield_moderation (63-130)
src/utils/types.py (1)
  • TurnSummary (144-263)
src/utils/suid.py (1)
  • normalize_conversation_id (57-78)
src/utils/token_counter.py (1)
  • TokenCounter (18-47)
src/utils/shields.py (2)
src/models/responses.py (2)
  • NotFoundResponse (1532-1588)
  • model (1732-1745)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: library mode / ci
  • GitHub Check: E2E: server mode / ci
🔇 Additional comments (43)
src/app/endpoints/streaming_query.py (2)

932-944: Exception handling for RateLimitError is consistent with the new import.

The usage of RateLimitError in the exception handler correctly uses getattr(e, "model", "") to safely access the model attribute. This pattern handles both cases where the exception may or may not have a model attribute, which is important given the import source change.
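
A hedged sketch of the pattern; only the getattr access and the import location follow this PR, while the wrapper and the re-raised error are illustrative:

```python
from llama_stack_client import AsyncLlamaStackClient, RateLimitError


async def call_responses_safely(client: AsyncLlamaStackClient, **kwargs):
    """Call the Responses API and surface quota errors with the model name when present."""
    try:
        return await client.responses.create(**kwargs)
    except RateLimitError as exc:
        # The exception may or may not carry a `model` attribute, so read it defensively.
        model = getattr(exc, "model", "")
        detail = f" for model {model}" if model else ""
        raise RuntimeError(f"The quota has been exceeded{detail}") from exc
```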


14-18: Import relocation is correct and verified.

The RateLimitError import from llama_stack_client is properly exported and used consistently throughout the codebase. The # type: ignore comment is justified—while the import functions correctly, the type stubs do not fully export this symbol. This pattern matches other endpoint files and aligns with the coding guidelines to use Llama Stack imports.

tests/e2e/configs/run-vertexai.yaml (2)

82-85: OpenAI provider addition enables shield functionality.

The new OpenAI inference provider is required for the llama-guard shield configuration below, which uses openai/gpt-4o-mini as the provider_shield_id.


140-143: Shield configuration aligns with the PR objective.

The llama-guard shield is now configured with openai/gpt-4o-mini as the provider model, enabling the Moderations API implementation. This is consistent with other e2e config files in the PR. The use of gpt-4o-mini for moderation (shields) versus gpt-4-mini for general inference (docker-compose defaults) is intentional and consistently applied across all configurations.

src/app/endpoints/query.py (2)

11-16: Import relocation is consistent with the streaming_query.py changes.

The RateLimitError import has been moved from litellm.exceptions to llama_stack_client, maintaining consistency across endpoint files. This aligns with the coding guidelines to use Llama Stack imports.


443-451: Exception handling for RateLimitError follows the established pattern.

The exception handler correctly uses getattr(e, "model", "") to safely access the model attribute, consistent with the implementation in streaming_query.py.

src/utils/types.py (1)

103-108: LGTM!

The ShieldModerationResult model is well-designed with appropriate type annotations using modern union syntax (str | None). The fields clearly capture the moderation outcome (blocked status, optional message, and shield model identifier).

tests/e2e/features/steps/info.py (1)

109-109: LGTM!

The changes correctly update the shield validation to use context.default_provider and expect the identifier "llama-guard", aligning with the new shield configuration pattern across the PR.

Also applies to: 120-121

tests/e2e/configs/run-azure.yaml (2)

83-86: LGTM!

The OpenAI provider configuration is correctly added to support the shield moderation functionality. The API key is properly sourced from an environment variable.


145-148: LGTM!

The shield configuration correctly maps the llama-guard shield to use openai/gpt-4o-mini as the provider shield, consistent with the pattern established across other configuration files in this PR.

tests/e2e/configs/run-rhelai.yaml (2)

27-30: LGTM!

The OpenAI provider is correctly added to the RHEL AI test configuration, enabling the shield moderation functionality.


139-140: LGTM!

The shield configuration is updated consistently with other configuration files in this PR, mapping llama-guard to openai/gpt-4o-mini.

tests/unit/app/endpoints/test_query.py (2)

12-13: LGTM!

The import changes correctly align with the migration of RateLimitError from litellm to llama_stack_client. The httpx import is needed for constructing mock HTTP responses in the rate limit test.


2423-2428: LGTM!

The test correctly updates the RateLimitError construction to use the llama_stack_client API pattern with explicit response and body parameters. The expected error message is updated to match the new surface message.

Also applies to: 2456-2456

run.yaml (1)

134-142: LGTM!

The model and shield resource registrations are properly structured. The provider_shield_id: openai/gpt-4o-mini correctly references the OpenAI model for shield moderation, which aligns with the new Moderations API implementation.

tests/e2e/configs/run-ci.yaml (1)

135-148: LGTM!

The CI test configuration correctly adds the gpt-4o-mini model and llama-guard shield entries, consistent with the main run.yaml configuration. This ensures CI tests exercise the new shield moderation flow.

src/app/endpoints/query_v2.py (3)

13-13: LGTM!

The AsyncLlamaStackClient import is now clean without the type: ignore comment, indicating proper type support in the library version.


45-48: LGTM!

The new imports for append_turn_to_conversation and run_shield_moderation from utils.shields follow the project's absolute import convention.


373-391: Shield moderation flow is well-implemented.

The pre-LLM moderation check correctly:

  1. Runs moderation before the Responses API call
  2. Records the conversation turn even when blocked (for audit trail)
  3. Returns early with an appropriate TurnSummary containing the violation message
  4. Returns zero token counts since no LLM call was made

One minor note: On line 388, normalize_conversation_id(conversation_id) is called, but when a new conversation is created (lines 363-366), conversation_id is already assigned the normalized value. This double normalization is harmless but redundant in the new-conversation path.

tests/e2e/configs/run-rhaiis.yaml (2)

27-30: OpenAI provider added for shield moderation.

The OpenAI provider is correctly configured to support the llama-guard shield using gpt-4o-mini. Note that this requires the OPENAI_API_KEY environment variable to be set for RHAIIS e2e tests.

Please verify that the CI environment for RHAIIS tests has OPENAI_API_KEY configured, as this is now a required dependency for shield moderation.


137-140: Shield configuration simplified to fixed model.

The shield configuration has been changed from environment-variable-based (${env.SAFETY_MODEL}, ${env.CODE_SCANNER_MODEL}) to a fixed openai/gpt-4o-mini model. This removes the ability to configure different safety models per environment.

Confirm this is intentional - the PR description mentions the prior Shields API "did not work properly," so consolidating to a single working configuration makes sense. However, this removes the code-scanner shield entirely.

tests/unit/app/endpoints/test_streaming_query.py (3)

11-12: LGTM!

The imports correctly reflect the migration of RateLimitError and APIConnectionError from litellm to llama_stack_client. The httpx import is necessary to construct mock HTTP responses for the updated RateLimitError signature.


2142-2143: LGTM!

The error message assertion is updated to match the new response format. The shortened message "The quota has been exceeded" is still descriptive and appropriately indicates the rate limit scenario.


2087-2092: RateLimitError usage is correct and consistent across tests.

The test properly constructs the RateLimitError with the correct arguments: message string, response httpx.Response object with 429 status, and body=None. This pattern is used consistently across all four related test files and correctly simulates the rate-limit scenario.

src/app/endpoints/streaming_query_v2.py (3)

8-17: LGTM! New imports for violation stream support.

The additional OpenAI response types are correctly imported to support the new create_violation_stream function that emits shield violation responses.


62-65: LGTM! Shield moderation utilities imported.

The imports for append_turn_to_conversation and run_shield_moderation align with the new pre-LLM moderation flow.


500-517: LGTM! Completed response structure looks correct.

The OpenAIResponseObjectStreamResponseCompleted correctly wraps an OpenAIResponseObject with the violation message in the output, maintaining compatibility with the normal response processing path.

src/utils/shields.py (2)

16-17: LGTM! Default violation message constant.

The DEFAULT_VIOLATION_MESSAGE provides a user-friendly fallback when moderation blocks content without a specific message.


116-128: LGTM! BadRequestError handling with documented rationale.

The catch for BadRequestError with the comment explaining the Llama Stack bug is appropriate defensive coding. Treating parse failures as blocked content errs on the side of safety.
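
A hedged sketch of this behaviour for a single shield, assuming BadRequestError is importable from llama_stack_client and that moderations.create accepts input and model keyword arguments; the real function also iterates over shields and increments metrics:

```python
from llama_stack_client import AsyncLlamaStackClient, BadRequestError

from utils.shields import DEFAULT_VIOLATION_MESSAGE
from utils.types import ShieldModerationResult


async def moderate_with_shield(
    client: AsyncLlamaStackClient, shield_model: str, input_text: str
) -> ShieldModerationResult:
    """Run one moderation call and treat unparseable violations as blocked."""
    try:
        result = await client.moderations.create(input=input_text, model=shield_model)
    except BadRequestError:
        # Known Llama Stack bug: a violation with a malformed payload surfaces as
        # BadRequestError, so err on the side of blocking.
        return ShieldModerationResult(
            blocked=True, message=DEFAULT_VIOLATION_MESSAGE, shield_model=shield_model
        )
    if any(r.flagged for r in result.results):
        return ShieldModerationResult(
            blocked=True, message=DEFAULT_VIOLATION_MESSAGE, shield_model=shield_model
        )
    return ShieldModerationResult(blocked=False)
```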

tests/unit/utils/test_shields.py (4)

1-16: LGTM! Test imports and setup.

Correct use of pytest, pytest-mock, and httpx for mocking BadRequestError. The import of DEFAULT_VIOLATION_MESSAGE enables testing the fallback message behavior.


108-156: Good coverage of no-shields and moderation-passes scenarios.

The tests correctly verify that run_shield_moderation returns blocked=False when no shields are configured and when moderation passes.


158-196: LGTM! Flagged content test with metric verification.

Good test coverage verifying the blocked result, custom message propagation, shield_model assignment, and metric increment.


277-314: LGTM! BadRequestError handling test.

Correctly tests the documented Llama Stack bug workaround, verifying that BadRequestError results in blocked content with the default message.

tests/unit/app/endpoints/test_streaming_query_v2.py (5)

153-190: LGTM! Mock-based stream events.

The switch from SimpleNamespace to Mock objects for streaming events is appropriate and allows proper attribute access patterns. The explicit setting of name attribute on item_mock correctly handles Mock's special parameter handling.
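
A short illustration of the Mock detail mentioned here: name is consumed by Mock's constructor (it names the mock object itself), so it must be assigned afterwards to behave like an ordinary attribute; the attribute values are illustrative:

```python
from unittest.mock import Mock

# Mock(name="output_text") would name the mock itself rather than create a `.name`
# attribute, so the attribute is set explicitly after construction.
item_mock = Mock(type="message", role="assistant")
item_mock.name = "output_text"

assert item_mock.name == "output_text"
```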


276-324: LGTM! Shield moderation integration test.

The test correctly verifies that when shields are available, moderations.create is called with the user's input and that responses.create proceeds when moderation passes.


362-450: LGTM! Shield violation streaming test.

Good test coverage for the blocked-by-moderation path, verifying that the violation message appears in the streamed response.


558-570: LGTM! Quota exceeded test with proper RateLimitError mock.

The test correctly constructs a RateLimitError with an httpx.Response object, matching the expected interface from llama-stack-client.


10-11: No action needed. RateLimitError is correctly imported from llama_stack_client with proper constructor interface (message, response, and body arguments) and is consistently used across all test files.

tests/unit/app/endpoints/test_query_v2.py (5)

8-9: LGTM! Updated imports for moderation flow.

The imports of httpx, RateLimitError from llama_stack_client, and ShieldModerationResult are correctly added to support the new test scenarios.

Also applies to: 20-20


134-136: Consistent shield moderation mocking pattern.

The pattern of mocking shields.list returning empty list and models.list is consistently applied across tests. This ensures run_shield_moderation executes without blocking for tests not specifically testing moderation.
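
A sketch of that pattern as a small pytest-mock helper; the helper name is illustrative, and the attribute paths follow the mocks used in these tests:

```python
def make_non_blocking_client(mocker):
    """Build a mock Llama Stack client whose shield moderation always passes."""
    client = mocker.Mock()
    client.shields.list = mocker.AsyncMock(return_value=[])  # no shields configured
    client.models.list = mocker.AsyncMock(return_value=[])   # nothing to validate
    return client
```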


723-759: LGTM! Shield violation test verifies early return.

Excellent test coverage verifying that when run_shield_moderation returns blocked=True, the responses.create is NOT called and the violation message is returned in the summary.


596-599: LGTM! Quota test with mocked moderation.

Correctly mocks run_shield_moderation to return blocked=False so the test can proceed to exercise the RateLimitError path from responses.create.


670-675: LGTM! Moderation and response calls verified.

The test correctly asserts that moderations.create is called with the user's query and model, and that responses.create is called after moderation passes.

@asimurka asimurka force-pushed the moderations_api_impl branch from 82f8a81 to 066b0e5 on January 12, 2026 at 09:47
@are-ces (Contributor) commented Jan 12, 2026

LGTM nice work!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/app/endpoints/streaming_query_v2.py (1)

472-517: Consider generating unique identifiers for violation streams.

The hardcoded response_id="resp_shield_violation" and item_id="msg_shield_violation" will be identical across all shield violation responses. While this may work for the current use case, consider if unique IDs are needed for tracing, logging, or client-side deduplication.

Additionally, created_at=0 uses a placeholder timestamp. If consumers rely on this field, consider using int(time.time()) instead.

♻️ Optional: Generate unique IDs and proper timestamp
+import time
+import uuid

 async def create_violation_stream(
     message: str,
     shield_model: str | None = None,
 ) -> AsyncIterator[OpenAIResponseObjectStream]:
     """Create a minimal response stream for shield violations."""
-    response_id = "resp_shield_violation"
-    item_id = "msg_shield_violation"
+    response_id = f"resp_shield_{uuid.uuid4().hex[:12]}"
+    item_id = f"msg_shield_{uuid.uuid4().hex[:12]}"
+    created_at = int(time.time())

     # ... rest of function ...

     # Completed response
     yield OpenAIResponseObjectStreamResponseCompleted(
         response=OpenAIResponseObject(
             id=response_id,
-            created_at=0,
+            created_at=created_at,
             model=shield_model or "shield",
src/utils/shields.py (1)

116-128: BadRequestError handling is a pragmatic workaround.

The comment clearly documents that this is a known Llama Stack bug where violations with unparseable formats raise BadRequestError. Treating this as blocked content with the default message is a safe approach. Consider adding a TODO to revisit once the upstream bug is fixed.

📝 Optional: Add TODO for future cleanup
         # Known Llama Stack bug: BadRequestError is raised when violation is present
         # in the shield LLM response but has wrong format that cannot be parsed.
+        # TODO: Revisit this workaround once upstream Llama Stack bug is fixed.
         except BadRequestError:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82f8a81 and 066b0e5.

📒 Files selected for processing (20)
  • docker-compose-library.yaml
  • docker-compose.yaml
  • run.yaml
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/utils/shields.py
  • src/utils/types.py
  • tests/e2e/configs/run-azure.yaml
  • tests/e2e/configs/run-ci.yaml
  • tests/e2e/configs/run-rhaiis.yaml
  • tests/e2e/configs/run-rhelai.yaml
  • tests/e2e/configs/run-vertexai.yaml
  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/utils/test_shields.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/utils/types.py
  • src/app/endpoints/query.py
  • tests/e2e/configs/run-azure.yaml
  • tests/e2e/configs/run-rhaiis.yaml
  • tests/e2e/configs/run-ci.yaml
  • tests/unit/app/endpoints/test_query.py
  • tests/e2e/features/steps/info.py
  • tests/unit/utils/test_shields.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...

Files:

  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/utils/shields.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests

Files:

  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion

Files:

  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
🧠 Learnings (5)
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`

Applied to files:

  • src/app/endpoints/streaming_query_v2.py
  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`

Applied to files:

  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
📚 Learning: 2025-12-18T10:21:03.056Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:03.056Z
Learning: In run.yaml for Llama Stack 0.3.x, do not add a telemetry provider block under providers. To enable telemetry, set telemetry.enabled: true directly (no provider block required). This pattern applies specifically to run.yaml configuration in this repository.

Applied to files:

  • run.yaml
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling

Applied to files:

  • src/app/endpoints/streaming_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling

Applied to files:

  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
🧬 Code graph analysis (4)
src/app/endpoints/streaming_query_v2.py (2)
src/utils/shields.py (2)
  • append_turn_to_conversation (133-157)
  • run_shield_moderation (63-130)
src/utils/suid.py (1)
  • normalize_conversation_id (57-78)
src/app/endpoints/query_v2.py (4)
src/utils/shields.py (2)
  • append_turn_to_conversation (133-157)
  • run_shield_moderation (63-130)
src/utils/types.py (1)
  • TurnSummary (144-263)
src/utils/suid.py (1)
  • normalize_conversation_id (57-78)
src/utils/token_counter.py (1)
  • TokenCounter (18-47)
src/utils/shields.py (2)
src/models/responses.py (2)
  • NotFoundResponse (1532-1588)
  • model (1732-1745)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
tests/unit/app/endpoints/test_query_v2.py (4)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
src/models/responses.py (1)
  • model (1732-1745)
src/app/endpoints/streaming_query.py (1)
  • retrieve_response (999-1125)
src/app/endpoints/streaming_query_v2.py (1)
  • retrieve_response (370-469)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: library mode / ci
  • GitHub Check: E2E: server mode / ci
🔇 Additional comments (33)
docker-compose.yaml (1)

19-19: LGTM!

Adding a default value for E2E_OPENAI_MODEL prevents runtime failures when the environment variable is not set and aligns with the updated defaults in docker-compose-library.yaml.

docker-compose-library.yaml (1)

21-21: LGTM!

Default model updated to gpt-4o-mini, consistent with the main docker-compose.yaml.

tests/e2e/configs/run-vertexai.yaml (2)

140-143: LGTM!

Shield configuration correctly references the llama-guard safety provider and uses openai/gpt-4o-mini as the backing model for moderation.


82-85: LGTM — OpenAI provider is properly added to enable shield functionality via the Moderations API. This introduces OPENAI_API_KEY as a dependency for VertexAI E2E tests, which is already handled in CI (e2e_tests_providers.yaml provides it as a GitHub secret) and documented in README.md.

src/app/endpoints/query_v2.py (3)

13-13: LGTM!

Import retained from llama_stack_client per coding guidelines.


45-48: LGTM!

New imports for shield moderation utilities enable the pre-LLM moderation flow that replaces the previous Shields API implementation.


373-391: LGTM!

Pre-LLM shield moderation correctly implemented:

  1. Moderation runs before the Responses API call
  2. Blocked content short-circuits the LLM call with proper early return
  3. Conversation history is preserved via append_turn_to_conversation
  4. Returns appropriate empty values for tool calls, RAG chunks, and token usage when blocked

This aligns with the PR objective of implementing input shields using the Moderations API.

src/app/endpoints/streaming_query.py (1)

14-18: Import change is correct. RateLimitError is properly exported from llama_stack_client, validating the consolidation of Llama Stack-related imports. The # type: ignore comment appropriately addresses potential type stub gaps.

run.yaml (1)

134-142: Configuration looks good.

The registered resources properly define the gpt-4o-mini model and llama-guard shield with the correct provider mappings. The shield's provider_shield_id correctly references the OpenAI model that will be used for moderation.

tests/e2e/configs/run-rhelai.yaml (2)

27-30: OpenAI provider addition looks correct.

Adding the OpenAI provider enables the llama-guard shield to use gpt-4o-mini for moderation while the main inference continues to use the vLLM provider for RHEL AI models.


139-140: Shield configuration is consistent.

The shield configuration aligns with the main run.yaml and other test configs, using llama-guard with openai/gpt-4o-mini as the moderation model.

tests/unit/app/endpoints/test_streaming_query_v2.py (7)

5-11: Imports updated correctly for new moderation flow.

The imports now include Mock for simulating streaming events, httpx for constructing the rate limit response, and the llama-stack-client exceptions. This aligns with the updated error handling pattern.


53-55: Shield moderation mocks added appropriately.

Mocking shields.list and models.list is necessary for the run_shield_moderation function. Returning empty lists simulates a scenario with no shields configured.


153-190: Streaming event simulation using Mock is cleaner.

Using Mock objects with explicit attributes provides better control over the simulated streaming events compared to SimpleNamespace. The explicit setting of name attribute (line 179) is correctly handled since name is a special Mock parameter.


276-324: Good coverage for shield moderation pass scenario.

This test properly validates that:

  1. Moderation is called with the user's query
  2. When moderation passes (not flagged), responses.create is invoked

The mock setup for shields, models, and moderation results is comprehensive.


362-450: Shield violation streaming test is comprehensive.

The test correctly simulates a violation stream response with the expected structure: content part added, text delta with violation message, and completed response. The assertion on line 450 properly validates the violation message appears in the stream output.


585-587: ShieldModerationResult mock is correctly applied.

Mocking run_shield_moderation to return ShieldModerationResult(blocked=False) allows the test to focus on the quota exceeded scenario without interference from the moderation flow.


558-570: RateLimitError construction requires external package verification.

The test code at lines 558-570 constructs RateLimitError with parameters: a message string, response=mock_response (httpx.Response object), and body=None. While this pattern is used consistently across multiple test files (test_query.py, test_query_v2.py, test_streaming_query.py, test_streaming_query_v2.py), verification of the exact constructor signature for llama-stack-client==0.3.5 could not be completed. Confirm that the httpx.Response type and body=None parameter align with the actual RateLimitError constructor in the external package.

tests/unit/app/endpoints/test_query_v2.py (5)

8-9: Imports updated consistently with streaming tests.

The imports for httpx and llama-stack-client exceptions align with the pattern established in the streaming query tests.


574-578: RateLimitError mock updated correctly.

The mock response construction using httpx.Response with status code 429 properly simulates the rate limit scenario.


596-599: Moderation bypass for quota test is appropriate.

Mocking run_shield_moderation to return blocked=False allows the test to focus specifically on the quota exceeded behavior.


611-612: Assertion matches updated quota error message.

The response message changed from "The model quota has been exceeded" to "The quota has been exceeded", which aligns with the QuotaExceededResponse model shown in the relevant code snippets.


724-759: Shield violation test correctly validates blocking behavior.

This test is critical for validating the new moderation flow:

  1. Mocks run_shield_moderation to return a blocked result with a custom message
  2. Verifies the violation message is returned in the summary
  3. Correctly asserts that responses.create is NOT called when content is blocked
src/app/endpoints/streaming_query_v2.py (3)

8-16: OpenAI response stream type imports added correctly.

The additional imports provide the necessary types to construct the violation stream response, including content parts, text deltas, and the completed response structure.


62-65: Shield utility imports properly organized.

Importing append_turn_to_conversation and run_shield_moderation from utils.shields provides clean separation of the moderation logic.


444-454: Pre-LLM shield moderation integration is well-structured.

The implementation correctly:

  1. Runs moderation before the LLM call
  2. Appends the turn to the conversation when blocked (maintaining conversation history)
  3. Returns a violation stream instead of calling the Responses API

This ensures users see their blocked query in conversation history with the appropriate violation message.

tests/unit/app/endpoints/test_streaming_query.py (3)

11-12: Imports updated consistently.

The change from litellm.exceptions to llama_stack_client for RateLimitError and APIConnectionError aligns with the broader refactoring across the test suite.


2087-2091: RateLimitError mock construction updated.

The test now correctly constructs the RateLimitError with an httpx.Response object and explicit body=None parameter, matching the updated exception signature.


2142-2143: Assertion updated to match new error format.

The quota exceeded message now correctly expects "The quota has been exceeded" instead of the previous "The model quota has been exceeded".

src/utils/shields.py (4)

4-12: Imports and types are well-organized.

The imports properly include HTTPException for error handling, BadRequestError for the Llama Stack workaround, CreateResponse for type casting, and the custom types NotFoundResponse and ShieldModerationResult.


16-17: Default violation message is user-friendly.

The message "I cannot process this request due to policy restrictions." is appropriately vague to avoid revealing security implementation details while being informative to users.


63-130: Core moderation implementation is well-structured.

The function correctly:

  1. Fetches available models upfront to validate shield configurations
  2. Raises 404 when a shield's model is not found (lines 86-93)
  3. Iterates through all shields and calls moderations.create
  4. Returns early on first violation with appropriate metadata
  5. Handles BadRequestError as a blocked response (documented Llama Stack bug)
  6. Increments metrics on violations
  7. Returns ShieldModerationResult(blocked=False) when all shields pass

One consideration: the validation at lines 86-89 treats both missing provider_resource_id AND model-not-found as the same error. This may mask configuration issues where provider_resource_id is simply not set vs. being set to an invalid model.


133-157: Conversation turn recording is implemented correctly.

The function properly records both the user's original message and the violation response to maintain complete conversation history. This ensures users can see their blocked queries in the conversation context.

@tisnik tisnik left a comment

Just one nit in tests (model name), but LGTM overall. Thank you!

  assert isinstance(detail, dict)
- assert detail["response"] == "The model quota has been exceeded" # type: ignore
+ assert detail["response"] == "The quota has been exceeded" # type: ignore
  assert "gpt-4-turbo" in detail["cause"] # type: ignore

still "gpt-4-turbo" there?

@asimurka asimurka force-pushed the moderations_api_impl branch from 066b0e5 to 710cc33 on January 12, 2026 at 10:24
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/app/endpoints/test_query.py (1)

2415-2457: Quota exceeded message expectation may be inconsistent with QuotaExceededResponse.model(...).
Line 2456 expects "The quota has been exceeded", but src/models/responses.py:QuotaExceededResponse.model() (per provided snippet) returns "The model quota has been exceeded". If you intentionally changed surface messaging, consider updating QuotaExceededResponse (or the OpenAPI examples) to match the user-facing response.

🤖 Fix all issues with AI agents
In @src/app/endpoints/streaming_query_v2.py:
- Around line 62-65: The code currently appends the full input_text (which may
include raw attachment bodies built earlier) to the conversation when a shield
violation is detected; change the blocked-path behavior so
append_turn_to_conversation receives a redacted user message instead of full
input_text — e.g., pass query_request.query or a short redacted attachment
summary created from the attachment loop — and ensure any variable named
input_text (or the place where attachment bodies are concatenated) is not stored
on violation; update the code paths where append_turn_to_conversation is called
after run_shield_moderation to use the redacted string (see the sketch after this list).
- Around line 472-520: create_violation_stream currently mixes item_id,
output_index, and content_index across events which breaks stream correlation;
make these values stable for the same output item: pick a single item_id (e.g.,
"msg_shield_violation_1") and reuse it for the ContentPartAdded,
OutputTextDelta, and the final OpenAIResponseMessage, and keep content_index and
output_index the same (e.g., content_index=0 and output_index=0) across those
yields while still incrementing sequence_number as appropriate; update the
OpenAIResponseObjectStreamResponseOutputTextDelta and the
OpenAIResponseObjectStreamResponseCompleted/OpenAIResponseMessage entries to use
the same item_id, content_index, and output_index used in
OpenAIResponseObjectStreamResponseContentPartAdded.

In @tests/e2e/configs/run-ci.yaml:
- Around line 135-148: The shield entry using shield_id "llama-guard" is
configured with provider_shield_id "openai/gpt-4o-mini", which is an invalid
moderation model; either remove the entire "llama-guard" shields block or change
provider_id/provider_shield_id to a valid moderation model (for OpenAI use
something like "provider_id: openai" with "provider_shield_id:
omni-moderation-latest" or point to the proper Llama Guard moderation model
supported by the provider) and ensure metadata/inference settings match the
moderation provider's requirements.
🧹 Nitpick comments (5)
src/utils/types.py (1)

103-108: LGTM! Well-structured model for shield moderation results.

The ShieldModerationResult class is cleanly defined with appropriate fields and uses modern type annotation syntax (str | None).

Consider adding an Attributes section to the docstring per Google Python docstring conventions to document each field's purpose:

📝 Optional docstring enhancement
 class ShieldModerationResult(BaseModel):
-    """Result of shield moderation check."""
+    """Result of shield moderation check.
+
+    Attributes:
+        blocked: Whether the content was blocked by the shield.
+        message: Optional message explaining the moderation decision.
+        shield_model: Optional identifier of the model that performed moderation.
+    """

     blocked: bool
     message: str | None = None
     shield_model: str | None = None
tests/unit/app/endpoints/test_query_v2.py (1)

615-760: Nice coverage: verifies moderation gates responses.create, but add an assertion that the blocked path records a turn (if that’s a requirement).
Line 724-Line 760 tests “blocked returns early”; it currently asserts responses.create isn’t called, but doesn’t assert conversations.items.create (or append_turn_to_conversation) behavior. If the PR objective includes “store the violation turn,” asserting it here would prevent regressions.
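If recording the violation turn is intended behavior, a hedged sketch of the extra assertion (mock names are illustrative and assume the test already patches run_shield_moderation to return a blocked result):

# Illustrative assertions only; adapt fixture/mock names to the actual test.
mock_client.conversations.items.create.assert_awaited_once()
_, kwargs = mock_client.conversations.items.create.await_args
roles = [item["role"] for item in kwargs["items"]]
assert roles == ["user", "assistant"]  # original query plus the violation message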

tests/unit/app/endpoints/test_streaming_query_v2.py (1)

553-587: Good: quota test now matches new RateLimitError construction, but keep one shared helper to reduce drift across v1/v2 tests.
The 429 setup pattern repeats across multiple test files; a small helper (e.g., make_httpx_429_response() + make_rate_limit_error()) would reduce future churn.
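One possible shape for such a helper, sketched from how the tests reportedly construct the error (the exact RateLimitError constructor accepted by the installed llama_stack_client version should be confirmed):

import httpx
from llama_stack_client import RateLimitError


def make_httpx_429_response(model: str = "gpt-4o-mini") -> httpx.Response:
    """Build a minimal 429 response object for rate-limit tests."""
    request = httpx.Request("POST", "http://localhost/v1/responses")
    return httpx.Response(429, request=request, json={"error": {"model": model}})


def make_rate_limit_error(model: str = "gpt-4o-mini") -> RateLimitError:
    """Build a RateLimitError the way the updated tests do."""
    return RateLimitError(
        "Rate limit exceeded", response=make_httpx_429_response(model), body=None
    )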

src/utils/shields.py (2)

85-94: Missing provider_resource_id is likely a config error; 404 “NotFound” may be the wrong semantic.
Line 86-Line 94 treats missing/invalid provider_resource_id as 404. If this is “misconfigured server” rather than “client requested unknown resource,” consider using a 500-style response (or at least a clearer “shield misconfigured” message).
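A possible adjustment, sketched under the assumption that the surrounding code in shields.py has access to the shield object and a module logger; the actual response model used for 5xx errors in this codebase may differ:

from fastapi import HTTPException, status

# Sketch: report a shield with no backing moderation model as a server-side
# misconfiguration instead of a 404 on the client's request.
if not shield.provider_resource_id:
    logger.error("Shield %s has no provider_resource_id configured", shield.identifier)
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail={
            "response": "Shield is misconfigured",
            "cause": f"Shield {shield.identifier} has no moderation model configured.",
        },
    )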


116-129: Catching only BadRequestError may under-handle other failure modes (timeouts / APIConnectionError).
Line 118-Line 128 treats a known parsing bug as “blocked,” which is fine, but consider whether connection failures should propagate as 503 instead of “blocked.” As per learnings, ensure APIConnectionError is handled somewhere in the call chain.
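A hedged sketch of widening that error handling inside run_shield_moderation (exception classes from llama_stack_client; the variable names and the fail-closed policy on BadRequestError are assumptions based on the review's description, not the PR's exact code):

from llama_stack_client import APIConnectionError, BadRequestError

try:
    moderation = await client.moderations.create(model=shield_model, input=input_text)
except BadRequestError:
    # Known provider-side parsing issue: fail closed and block the request.
    return ShieldModerationResult(
        blocked=True, message=DEFAULT_VIOLATION_MESSAGE, shield_model=shield_model
    )
except APIConnectionError:
    # Let the endpoint's error mapping translate this into a 503 instead of a block.
    raise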

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 066b0e5 and 710cc33.

📒 Files selected for processing (20)
  • docker-compose-library.yaml
  • docker-compose.yaml
  • run.yaml
  • src/app/endpoints/query.py
  • src/app/endpoints/query_v2.py
  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/utils/shields.py
  • src/utils/types.py
  • tests/e2e/configs/run-azure.yaml
  • tests/e2e/configs/run-ci.yaml
  • tests/e2e/configs/run-rhaiis.yaml
  • tests/e2e/configs/run-rhelai.yaml
  • tests/e2e/configs/run-vertexai.yaml
  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/utils/test_shields.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • run.yaml
  • tests/e2e/configs/run-rhelai.yaml
  • src/app/endpoints/query.py
  • docker-compose-library.yaml
  • src/app/endpoints/query_v2.py
  • tests/unit/utils/test_shields.py
  • src/app/endpoints/streaming_query.py
  • tests/e2e/configs/run-rhaiis.yaml
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...

Files:

  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/streaming_query_v2.py
  • src/utils/shields.py
  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • src/utils/types.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests

Files:

  • tests/unit/app/endpoints/test_query.py
  • tests/e2e/features/steps/info.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
tests/unit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion

Files:

  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
src/app/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use FastAPI HTTPException with appropriate status codes for API endpoint error handling

Files:

  • src/app/endpoints/streaming_query_v2.py
tests/e2e/features/steps/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define E2E test steps in tests/e2e/features/steps/ directory

Files:

  • tests/e2e/features/steps/info.py
🧠 Learnings (3)
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling

Applied to files:

  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling

Applied to files:

  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`

Applied to files:

  • tests/unit/app/endpoints/test_query.py
  • src/app/endpoints/streaming_query_v2.py
  • tests/unit/app/endpoints/test_query_v2.py
  • tests/unit/app/endpoints/test_streaming_query.py
🧬 Code graph analysis (5)
tests/unit/app/endpoints/test_query.py (1)
src/models/responses.py (1)
  • model (1732-1745)
src/app/endpoints/streaming_query_v2.py (2)
src/utils/shields.py (2)
  • append_turn_to_conversation (133-157)
  • run_shield_moderation (63-130)
src/utils/suid.py (1)
  • normalize_conversation_id (57-78)
src/utils/shields.py (2)
src/models/responses.py (2)
  • NotFoundResponse (1532-1588)
  • model (1732-1745)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
tests/unit/app/endpoints/test_streaming_query_v2.py (2)
src/utils/types.py (1)
  • ShieldModerationResult (103-108)
src/models/responses.py (1)
  • model (1732-1745)
tests/unit/app/endpoints/test_streaming_query.py (1)
src/models/responses.py (1)
  • model (1732-1745)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: server mode / ci
  • GitHub Check: E2E: library mode / ci
🔇 Additional comments (11)
tests/e2e/features/steps/info.py (1)

108-124: LGTM! Shield validation updated to match new shield structure.

The changes correctly update the validation to expect:

  • provider_resource_id in the format "{provider}/{model}" (e.g., openai/gpt-4o-mini)
  • identifier as "llama-guard" matching the new shield configuration
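For illustration, the described expectation boils down to assertions along these lines in the step implementation (variable names are assumptions, not the actual step code):

# Illustrative only; the real step in tests/e2e/features/steps/info.py may differ.
shield = response_json["shields"][0]
assert shield["identifier"] == "llama-guard"
provider, _, model = shield["provider_resource_id"].partition("/")
assert provider == "openai" and model == "gpt-4o-mini"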
docker-compose.yaml (1)

19-19: LGTM! Sensible default for E2E_OPENAI_MODEL.

Adding gpt-4o-mini as the fallback ensures the container starts correctly when the environment variable is unset, aligning with the model used across other configs in this PR.

tests/e2e/configs/run-vertexai.yaml (2)

140-143: LGTM! Shield configuration follows consistent pattern.

The llama-guard shield is configured correctly, using openai/gpt-4o-mini as the backing model for moderation, consistent with other E2E configs in this PR.


82-85: The OpenAI provider addition is consistent with existing codebase usage. OPENAI_API_KEY is already documented in README.md and multiple guides (docs/providers.md, docs/getting_started.md, docs/deployment_guide.md, docs/byok_guide.md), already configured in CI/CD workflows (e2e_tests.yaml, e2e_tests_providers.yaml, etc. via secrets.OPENAI_API_KEY), and already in use in other configuration files (run.yaml, tests/configuration/run.yaml, examples). No additional verification or documentation is needed—this change follows the established pattern.

tests/e2e/configs/run-azure.yaml (2)

83-86: LGTM! OpenAI provider added for shield moderation.

The OpenAI provider is correctly configured to support the llama-guard shield, which routes moderation calls to OpenAI's Moderations API while the main inference uses Azure.


145-148: LGTM! Shield configuration consistent across configs.

The llama-guard shield is configured identically to other E2E configs, ensuring consistent moderation behavior across different provider backends.

src/app/endpoints/streaming_query_v2.py (1)

444-469: Error handling for new I/O calls is covered by the endpoint's error mapping.

The new calls to run_shield_moderation() and append_turn_to_conversation() in retrieve_response() are not explicitly wrapped in try-except blocks within the function itself. However, they are covered by the base handler's error mapping at streaming_query_endpoint_handler_base() (streaming_query.py, line 920), which catches APIConnectionError and converts it to a 503 ServiceUnavailableResponse using stream_http_error(). Any APIConnectionError raised by these new I/O calls will propagate up and be handled consistently with the rest of the endpoint, meeting the requirement for 503/5xx behavior on API connection failures.

tests/unit/app/endpoints/test_streaming_query.py (1)

11-13: Remove the unfounded .model attribute extraction from RateLimitError.

The getattr(e, "model", "") calls in the RateLimitError handlers assume llama-stack-client exposes a .model field, but the library only provides .status_code and .response. Since the attribute doesn't exist, the fallback is always used, making the model-specific message path unreachable. Either extract the model name from the error message string or remove this dead code path to avoid future maintenance confusion.

Applies to: src/app/endpoints/streaming_query.py and src/app/endpoints/query.py RateLimitError handlers, and corresponding tests.
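If the model-specific wording is worth keeping, one option is to parse the model name out of the provider's 429 message instead of reading a non-existent attribute; a rough sketch (the message wording and regex are assumptions):

import re


def model_from_rate_limit_message(message: str) -> str | None:
    """Best-effort extraction of a model name from a provider rate-limit message."""
    # Assumes wording such as "Rate limit reached for gpt-4o-mini in organization ...".
    match = re.search(r"for ([\w./:-]+)", message)
    return match.group(1) if match else None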

src/utils/shields.py (1)

6-13: cast(CreateResponse, moderation) doesn't validate the response structure at runtime.

The cast() on line 99 is a type hint only—it provides no runtime validation. Lines 101-114 directly access .results[0].flagged, .user_message, and .categories without checking if the response actually contains these attributes. If the API response structure changes or differs from expectations, these attribute accesses will fail with AttributeError at runtime. Consider adding explicit checks before accessing nested attributes, or use a validated Pydantic model instead of cast().
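A hedged sketch of guarding those accesses instead of relying on cast() (field names follow the shape described in this review; whether an empty result should fail open or closed is a policy choice, not something the PR dictates):

# Sketch meant to slot into run_shield_moderation(); names assume that context.
results = getattr(moderation, "results", None) or []
if not results:
    logger.warning("Moderation response had no results; treating input as allowed")
    return ShieldModerationResult(blocked=False)

first = results[0]
if getattr(first, "flagged", False):
    message = getattr(first, "user_message", None) or DEFAULT_VIOLATION_MESSAGE
    return ShieldModerationResult(
        blocked=True, message=message, shield_model=shield_model
    )
return ShieldModerationResult(blocked=False)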

tests/unit/app/endpoints/test_query_v2.py (1)

7-10: No issues found—imports and mocking patterns align with llama-stack-client 0.3.5 API.

httpx, RateLimitError, and moderations response shape are correctly used. The Llama Stack Shields API (not standard OpenAI Moderation) includes the user_message field in violation responses, which the code and tests properly handle. Tests correctly mock results[0].flagged, results[0].categories, and results[0].user_message as documented in the Llama Stack API.

tests/unit/app/endpoints/test_query.py (1)

11-14: Verify constructor signature for RateLimitError when mocking in tests.

The test at lines 2424–2428 constructs RateLimitError(message, response=httpx.Response, body=None). While web documentation confirms RateLimitError exists in llama-stack-client==0.3.5 and exposes .status_code and .response properties, the exact constructor signature and parameter types (particularly whether body is accepted) could not be definitively verified from available sources. Consider checking the library's source code or type stubs to confirm the signature matches your usage.

Comment on lines +62 to +65
from utils.shields import (
append_turn_to_conversation,
run_shield_moderation,
)
Contributor

⚠️ Potential issue | 🟠 Major

Potential PII/secrets leak: the blocked path stores the full input_text (including attachment content) in the conversation.
Lines 448-450 append input_text to the conversation on violation, but input_text includes raw attachment bodies (lines 417-423). This can persist secrets even though the request is blocked. Consider storing only query_request.query (or a redacted attachment summary) in the conversation items.

Proposed mitigation (store redacted user message on violation)
-        await append_turn_to_conversation(
-            client, llama_stack_conv_id, input_text, violation_message
-        )
+        # Avoid persisting raw attachment contents on blocked requests.
+        user_message_for_storage = query_request.query
+        if query_request.attachments:
+            user_message_for_storage += "\n\n[Attachments redacted]"
+        await append_turn_to_conversation(
+            client, llama_stack_conv_id, user_message_for_storage, violation_message
+        )

Also applies to: 444-454

🤖 Prompt for AI Agents
In @src/app/endpoints/streaming_query_v2.py around lines 62 - 65, The code
currently appends the full input_text (which may include raw attachment bodies
built earlier) to the conversation when a shield violation is detected; change
the blocked-path behavior so append_turn_to_conversation receives a redacted
user message instead of full input_text — e.g., pass query_request.query or a
short redacted attachment summary created from the attachment loop — and ensure
any variable named input_text (or the place where attachment bodies are
concatenated) is not stored on violation; update the code paths where
append_turn_to_conversation is called after run_shield_moderation to use the
redacted string.

Comment on lines +472 to +520
async def create_violation_stream(
    message: str,
    shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
    """Generate a minimal streaming response for cases where input is blocked by a shield.

    This yields only the essential streaming events to indicate that the input was rejected.
    Dummy item identifiers are used solely for protocol compliance and are not used later.
    """
    response_id = "resp_shield_violation"

    # Content part added (triggers empty initial token)
    yield OpenAIResponseObjectStreamResponseContentPartAdded(
        content_index=0,
        response_id=response_id,
        item_id="msg_shield_violation_1",
        output_index=0,
        part=OpenAIResponseContentPartOutputText(text=""),
        sequence_number=0,
    )

    # Text delta
    yield OpenAIResponseObjectStreamResponseOutputTextDelta(
        content_index=1,
        delta=message,
        item_id="msg_shield_violation_2",
        output_index=1,
        sequence_number=1,
    )

    # Completed response
    yield OpenAIResponseObjectStreamResponseCompleted(
        response=OpenAIResponseObject(
            id=response_id,
            created_at=0,  # not used
            model=shield_model or "shield",
            output=[
                OpenAIResponseMessage(
                    id="msg_shield_violation_3",
                    content=[
                        OpenAIResponseOutputMessageContentOutputText(text=message)
                    ],
                    role="assistant",
                    status="completed",
                )
            ],
            status="completed",
        )
    )
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

llama-stack 0.3.5 streaming response event schema item_id output_index content_index consistency

💡 Result:

Brief summary for llama-stack (v0.3.5) streaming response event fields — meaning and consistency guarantees:

  • item_id — unique identifier for a single output item (stays the same across all streaming events that update that item). Use this to correlate delta/done/other events for the same output item. [1][3]

  • output_index — integer index of the output item’s position in the response output array (stable for the lifetime of that item in this response). Use this to order/match items inside the response list. [1][2]

  • content_index — integer index of the part within an output item’s content (e.g., for multi-part content or multiple content arrays); used to locate which content element a delta or done event refers to. [1][2]

  • sequence_number — monotonically increasing integer used to order streaming events globally for the response (useful to correctly interleave events when receiving them out of network order). Events for the same item/content will have sequence numbers that reflect the correct event ordering. [1][2]

Consistency notes / invariants to rely on:

  • item_id + output_index together identify the particular output item in a response; item_id is the stable unique id. [1][3]
  • content_index identifies which content element inside that item is being updated; deltas and done events include the content_index they apply to. [1][2]
  • sequence_number provides total ordering for events (clients should apply events in sequence_number order to reconstruct text/function-call arguments/etc.). [1][2]
  • Event types appear under a fixed "type" field (e.g., response.output_text.delta, response.output_text.done, response.content_part.added, response.content_part.done) so clients can branch handling by type. [1][2]

Sources:

  • llama-stack OpenAPI/streaming spec (schema definitions showing item_id, output_index, content_index, sequence_number and event types). [1][2][3]

🏁 Script executed:

# Find and examine the actual code at the specified lines
find . -type f -name "streaming_query_v2.py" | head -5

Repository: lightspeed-core/lightspeed-stack

Length of output: 117


🏁 Script executed:

# Read the actual code at the specified lines
sed -n '472,520p' ./src/app/endpoints/streaming_query_v2.py

Repository: lightspeed-core/lightspeed-stack

Length of output: 1734


create_violation_stream must keep item_id, output_index, and content_index consistent across streaming events.

The current implementation uses different IDs and indices across events: content_part.added (lines 484–492) uses item_id="msg_shield_violation_1", output_index=0, content_index=0, but output_text.delta (lines 494–500) uses item_id="msg_shield_violation_2", output_index=1, content_index=1. According to the llama-stack 0.3.5 streaming schema, item_id must remain stable to correlate events for the same output item, and output_index and content_index must be stable for the lifetime of that item. Inconsistent values will break consumers that reconstruct messages by correlating events via these fields.
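To illustrate why that breaks consumers, a small client-side sketch that rebuilds per-item text by correlating events on item_id and ordering by sequence_number (event type strings follow the schema summary above; this is not code from the PR):

from collections import defaultdict


def reconstruct_text(events: list) -> dict[str, str]:
    """Rebuild output text per item_id from streamed delta events."""
    texts: dict[str, str] = defaultdict(str)
    for event in sorted(events, key=lambda e: e.sequence_number):
        if getattr(event, "type", "") == "response.output_text.delta":
            texts[event.item_id] += event.delta
    return dict(texts)

With the current implementation, the delta accumulates under "msg_shield_violation_2" while the content part was opened for "msg_shield_violation_1", so a consumer like this never associates the violation text with the part it opened.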

Proposed fix (consistent IDs/indices)
 async def create_violation_stream(
     message: str,
     shield_model: str | None = None,
 ) -> AsyncIterator[OpenAIResponseObjectStream]:
@@
     response_id = "resp_shield_violation"
+    item_id = "msg_shield_violation"
+    output_index = 0
+    content_index = 0
 
     # Content part added (triggers empty initial token)
     yield OpenAIResponseObjectStreamResponseContentPartAdded(
-        content_index=0,
+        content_index=content_index,
         response_id=response_id,
-        item_id="msg_shield_violation_1",
-        output_index=0,
+        item_id=item_id,
+        output_index=output_index,
         part=OpenAIResponseContentPartOutputText(text=""),
         sequence_number=0,
     )
 
     # Text delta
     yield OpenAIResponseObjectStreamResponseOutputTextDelta(
-        content_index=1,
+        content_index=content_index,
         delta=message,
-        item_id="msg_shield_violation_2",
-        output_index=1,
+        item_id=item_id,
+        output_index=output_index,
         sequence_number=1,
     )
@@
             output=[
                 OpenAIResponseMessage(
-                    id="msg_shield_violation_3",
+                    id=item_id,
                     content=[
                         OpenAIResponseOutputMessageContentOutputText(text=message)
                     ],
                     role="assistant",
                     status="completed",
                 )
             ],
             status="completed",
         )
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Original:

async def create_violation_stream(
    message: str,
    shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
    """Generate a minimal streaming response for cases where input is blocked by a shield.

    This yields only the essential streaming events to indicate that the input was rejected.
    Dummy item identifiers are used solely for protocol compliance and are not used later.
    """
    response_id = "resp_shield_violation"

    # Content part added (triggers empty initial token)
    yield OpenAIResponseObjectStreamResponseContentPartAdded(
        content_index=0,
        response_id=response_id,
        item_id="msg_shield_violation_1",
        output_index=0,
        part=OpenAIResponseContentPartOutputText(text=""),
        sequence_number=0,
    )

    # Text delta
    yield OpenAIResponseObjectStreamResponseOutputTextDelta(
        content_index=1,
        delta=message,
        item_id="msg_shield_violation_2",
        output_index=1,
        sequence_number=1,
    )

    # Completed response
    yield OpenAIResponseObjectStreamResponseCompleted(
        response=OpenAIResponseObject(
            id=response_id,
            created_at=0,  # not used
            model=shield_model or "shield",
            output=[
                OpenAIResponseMessage(
                    id="msg_shield_violation_3",
                    content=[
                        OpenAIResponseOutputMessageContentOutputText(text=message)
                    ],
                    role="assistant",
                    status="completed",
                )
            ],
            status="completed",
        )
    )

Suggested:

async def create_violation_stream(
    message: str,
    shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
    """Generate a minimal streaming response for cases where input is blocked by a shield.

    This yields only the essential streaming events to indicate that the input was rejected.
    Dummy item identifiers are used solely for protocol compliance and are not used later.
    """
    response_id = "resp_shield_violation"
    item_id = "msg_shield_violation"
    output_index = 0
    content_index = 0

    # Content part added (triggers empty initial token)
    yield OpenAIResponseObjectStreamResponseContentPartAdded(
        content_index=content_index,
        response_id=response_id,
        item_id=item_id,
        output_index=output_index,
        part=OpenAIResponseContentPartOutputText(text=""),
        sequence_number=0,
    )

    # Text delta
    yield OpenAIResponseObjectStreamResponseOutputTextDelta(
        content_index=content_index,
        delta=message,
        item_id=item_id,
        output_index=output_index,
        sequence_number=1,
    )

    # Completed response
    yield OpenAIResponseObjectStreamResponseCompleted(
        response=OpenAIResponseObject(
            id=response_id,
            created_at=0,  # not used
            model=shield_model or "shield",
            output=[
                OpenAIResponseMessage(
                    id=item_id,
                    content=[
                        OpenAIResponseOutputMessageContentOutputText(text=message)
                    ],
                    role="assistant",
                    status="completed",
                )
            ],
            status="completed",
        )
    )
🤖 Prompt for AI Agents
In @src/app/endpoints/streaming_query_v2.py around lines 472 - 520,
create_violation_stream currently mixes item_id, output_index, and content_index
across events which breaks stream correlation; make these values stable for the
same output item: pick a single item_id (e.g., "msg_shield_violation_1") and
reuse it for the ContentPartAdded, OutputTextDelta, and the final
OpenAIResponseMessage, and keep content_index and output_index the same (e.g.,
content_index=0 and output_index=0) across those yields while still incrementing
sequence_number as appropriate; update the
OpenAIResponseObjectStreamResponseOutputTextDelta and the
OpenAIResponseObjectStreamResponseCompleted/OpenAIResponseMessage entries to use
the same item_id, content_index, and output_index used in
OpenAIResponseObjectStreamResponseContentPartAdded.

Comment on lines +135 to +148
  - model_id: gpt-4o-mini
    provider_id: openai
    model_type: llm
    provider_model_id: gpt-4o-mini
  - model_id: sentence-transformers/all-mpnet-base-v2
    model_type: embedding
    provider_id: sentence-transformers
    provider_model_id: sentence-transformers/all-mpnet-base-v2
    metadata:
      embedding_dimension: 768
  - model_id: gpt-4o-mini
    provider_id: openai
    model_type: llm
    provider_model_id: gpt-4o-mini
- shields: []
+ shields:
+ - shield_id: llama-guard
+   provider_id: llama-guard
+   provider_shield_id: openai/gpt-4o-mini
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

llama-stack 0.3.5 Shield object provider_resource_id provider_shield_id

💡 Result:

Short answer:

  • provider_shield_id — the shield identifier defined by the external/provider implementation (the provider's own ID you pass when registering a shield). Use this when you want the provider to look up a specific guard/rail configuration (e.g., "llama-guard-basic" or a Bedrock guardrail id). [1][2]

  • provider_resource_id — the llama‑stack server's internal resource identifier for the Shield resource (the server-side resource id/UUID that ties the registered Shield object to the provider). It's present in the OpenAPI/spec for Shield resources and is the field the server uses to reference the specific registered resource. [3]

Sources:

  • Shield registration / safety docs (examples showing provider_shield_id usage). [1][2]
  • Llama‑stack spec showing Shield schema with provider_resource_id field. [3]

🌐 Web query:

llama-stack client.moderations.create() model parameter openai gpt-4o-mini

💡 Result:

Short answer: You should not pass "gpt-4o-mini" to the moderation endpoint. Use OpenAI's moderation model (e.g. "omni-moderation-latest") as the model parameter.

Example:
client.moderations.create(model="omni-moderation-latest", input="...")

Reason: OpenAI exposes dedicated moderation models (omni-moderation-*) for the v1/moderations endpoint; GPT-4o-mini is a general/chat/generation model, not the moderation model. (openai.com)

Note: Llama Stack / related configs also default to OpenAI's moderation model (omni-moderation-latest) when using OpenAI as the provider. (documentation.open-xchange.com)



🌐 Web query:

llama-stack 0.3.5 shields llama-guard configuration documentation

💡 Result:

Summary: how to configure Llama Guard shields in llama-stack (v0.3.x)

1. Add a safety provider for Llama Guard and an inference provider that serves the Llama-Guard model (if the shield needs model inference). Example run.yaml snippet:
   • providers.safety: inline::llama-guard
   • add a second inference provider for the Llama-Guard model
   • create a model entry for the Llama-Guard model
   • add a shields entry pointing to the llama-guard provider/model. [1]

Example (adapted):

providers:
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  inference:
  - provider_id: vllm-0
    provider_type: remote::vllm
    config:
      url: ${VLLM_URL}
  - provider_id: vllm-1
    provider_type: remote::vllm
    config:
      url: ${SAFETY_VLLM_URL}
models:
- model_id: ${INFERENCE_MODEL}
  provider_id: vllm-0
- model_id: ${SAFETY_MODEL}
  provider_id: vllm-1
shields:
- provider_id: llama-guard
  shield_id: ${SAFETY_MODEL}
  provider_shield_id: null  # [1]

2. Provider-specific config fields
   • inline::llama-guard supports configuration like excluded_categories (list) in its provider config. See the provider page for field descriptions. [2]

3. Registering and using shields at runtime
   • Register shields (client.shields.register) and run them via client.safety.run_shield (or attach shields to agents as input_shields/output_shields). Example registration and runtime usage are shown in the Safety Guardrails / API docs. [3]

Links / sources

  • Configuring a “Stack” / Safety (run.yaml + Llama Guard example). [1]
  • inline::llama-guard provider docs (config fields, sample). [2]
  • Safety Guardrails / Register-a-shield and examples (shield registration & runtime usage). [3]

If you want, I can paste the exact YAML + Python/JS snippets ready for v0.3.5 (or check the v0.3.5-specific docs/releases for any small changes).


Remove or reconfigure the llama-guard shield — using openai/gpt-4o-mini as a moderation model is invalid and will fail at runtime.

The configuration at lines 145–148 wires gpt-4o-mini (a general chat model) as the shield, but llama-stack's moderations API requires a dedicated moderation model. OpenAI exposes separate moderation models (e.g., omni-moderation-latest), not gpt-4o-mini. Standard llama-guard setup uses the Llama Guard model itself, not a general-purpose LLM. Either remove this shield entry or reconfigure it to use a proper moderation model with the correct inference provider.
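For reference, one possible reconfiguration sketched from the suggestion above (whether llama-stack routes openai/omni-moderation-latest correctly through the llama-guard safety provider is an assumption to verify against the provider docs):

shields:
- shield_id: llama-guard
  provider_id: llama-guard
  provider_shield_id: openai/omni-moderation-latest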

🤖 Prompt for AI Agents
In @tests/e2e/configs/run-ci.yaml around lines 135 - 148, The shield entry using
shield_id "llama-guard" is configured with provider_shield_id
"openai/gpt-4o-mini", which is an invalid moderation model; either remove the
entire "llama-guard" shields block or change provider_id/provider_shield_id to a
valid moderation model (for OpenAI use something like "provider_id: openai" with
"provider_shield_id: omni-moderation-latest" or point to the proper Llama Guard
moderation model supported by the provider) and ensure metadata/inference
settings match the moderation provider's requirements.

@tisnik tisnik merged commit f229aa1 into lightspeed-core:main Jan 12, 2026
19 of 23 checks passed