LCORE-849: Shields implementation using Moderations API #979
Conversation
Walkthrough
Adds pre-LLM shield moderation that can short-circuit Responses calls, new shield utilities and a ShieldModerationResult type, and moves RateLimitError imports to llama_stack_client.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Endpoint as Query Endpoint v2
participant ShieldUtil as Shield Utils
participant ModelSvc as Model Registry
participant ModSvc as Moderation Service
participant LLM as Responses API
Client->>Endpoint: Send user prompt
Endpoint->>ShieldUtil: run_shield_moderation(input_text)
ShieldUtil->>ModelSvc: list models / validate shield model
ShieldUtil->>ModSvc: moderations.create(input_text, shield_model)
ModSvc-->>ShieldUtil: moderation result (blocked/allowed)
alt Blocked
ShieldUtil-->>Endpoint: ShieldModerationResult(blocked=true, message)
Endpoint->>Endpoint: append_turn_to_conversation(conversation, violation)
Endpoint-->>Client: Return violation (single response or violation stream)
else Allowed
ShieldUtil-->>Endpoint: ShieldModerationResult(blocked=false)
Endpoint->>LLM: responses.create(...)
LLM-->>Endpoint: response / stream
Endpoint-->>Client: Relay response
end
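To make the short-circuit concrete, here is a minimal illustrative sketch of the gate the diagram describes. It reuses the run_shield_moderation, append_turn_to_conversation, and ShieldModerationResult names from this PR, but the wrapper function, the model literal, and the responses.create keyword names are assumptions, not the actual endpoint code.

```python
# Illustrative sketch of the pre-LLM moderation gate (not the PR's handler code).
async def answer_or_block(client, conversation_id: str, input_text: str):
    moderation = await run_shield_moderation(client, input_text)
    if moderation.blocked:
        violation = moderation.message or "I cannot process this request due to policy restrictions."
        # Record the blocked turn so the violation shows up in conversation history.
        await append_turn_to_conversation(client, conversation_id, input_text, violation)
        return violation  # short-circuit: no Responses API call is made
    # Allowed: proceed with the normal Responses API call.
    return await client.responses.create(model="gpt-4o-mini", input=input_text)
```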
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @docker-compose-library.yaml:
- Line 21: The environment variable default E2E_OPENAI_MODEL is set to
"gpt-4-mini", which appears to be a typo; update the default value to
"gpt-4o-mini" so it matches the actual OpenAI model naming and the rest of the
PR/shield configs—i.e., change the assignment for E2E_OPENAI_MODEL from
gpt-4-mini to gpt-4o-mini.
In @docker-compose.yaml:
- Line 19: The docker-compose environment default for E2E_OPENAI_MODEL uses an
invalid model name "gpt-4-mini"; update the default to the correct OpenAI model
"gpt-4o-mini" wherever this appears (e.g., the E2E_OPENAI_MODEL entry in
docker-compose.yaml and docker-compose-library.yaml) so that the environment
variable line (- E2E_OPENAI_MODEL=${E2E_OPENAI_MODEL:-gpt-4-mini}) becomes the
valid default with gpt-4o-mini.
In @src/app/endpoints/streaming_query_v2.py:
- Around line 472-517: In create_violation_stream, the
OpenAIResponseObjectStreamResponseOutputTextDelta yield is missing the
response_id field; update the OpenAIResponseObjectStreamResponseOutputTextDelta
instantiation (the one that currently sets content_index, delta, item_id,
output_index, sequence_number) to also include response_id=response_id so the
delta event matches the ContentPartAdded event and keeps event structure
consistent (a sketch appears after this list).
In @tests/e2e/features/steps/info.py:
- Around line 117-118: Update the assertion message for the check involving
found_shield["provider_resource_id"] so it matches the actual expected value
f"{expected_provider}/{expected_model} instead of only expected_model; modify
the failure string to include the full expected value (e.g.,
f\"{expected_provider}/{expected_model}\") and show the actual
found_shield["provider_resource_id"] so test failures report the correct
expected vs actual values.
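For the streaming_query_v2.py item above (see the forward reference there), a sketch of the corrected delta event; only the field names come from that item, while the surrounding variables and literal index values are placeholders:

```python
# Inside create_violation_stream: emit the text delta with response_id included,
# so it matches the ContentPartAdded event. Index/sequence values are placeholders.
yield OpenAIResponseObjectStreamResponseOutputTextDelta(
    content_index=0,
    delta=message,
    item_id=item_id,
    output_index=0,
    response_id=response_id,  # previously missing
    sequence_number=sequence_number,
)
```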
🧹 Nitpick comments (4)
src/app/endpoints/streaming_query_v2.py (1)
444-454: Potential double normalization of conversation_id.
At line 453, normalize_conversation_id(conversation_id) is called, but conversation_id may already be normalized (see line 437, where it is set via normalize_conversation_id(llama_stack_conv_id) for new conversations). For existing conversations, conversation_id comes from the request and may need normalization. However, line 469 returns normalize_conversation_id(conversation_id) for the non-blocked path, which is consistent with line 453, so the behavior is uniform. The issue is that for new conversations conversation_id is already normalized at line 437, so lines 453 and 469 perform redundant normalization. Consider using llama_stack_conv_id handling more consistently or accepting the minor redundancy.
♻️ Optional: Use already-normalized ID
```diff
 if moderation_result.blocked:
     violation_message = moderation_result.message or ""
     await append_turn_to_conversation(
         client, llama_stack_conv_id, input_text, violation_message
     )
     return (
         create_violation_stream(violation_message, moderation_result.shield_model),
-        normalize_conversation_id(conversation_id),
+        conversation_id,  # Already normalized for new convs, or needs normalization for existing
     )
```
src/utils/shields.py (2)
83-93: Consider caching models list to reduce API calls.
client.models.list() and client.shields.list() are called on every moderation request. For high-traffic scenarios, this could create significant overhead. Consider caching the model list with a TTL, similar to how other configurations are cached in the codebase.
```bash
#!/bin/bash
# Check if there's existing caching patterns in the codebase
rg -n "cachetools|TTLCache|lru_cache" --type py -A 3
```
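A minimal sketch of one way to do this with cachetools; the helper name, cache key, and 60-second TTL are illustrative choices, not code from the PR:

```python
from cachetools import TTLCache

# Hypothetical module-level cache holding the most recent model list.
_models_cache: TTLCache = TTLCache(maxsize=1, ttl=60)


async def list_models_cached(client):
    """Return the model list, refreshing it at most once per TTL window."""
    if "models" not in _models_cache:
        _models_cache["models"] = await client.models.list()
    return _models_cache["models"]
```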
133-157: Missing error handling for conversation append failure.
If client.conversations.items.create fails (network error, invalid conversation_id, etc.), the exception will propagate up, potentially leaving the user without feedback about the shield violation. Consider wrapping this in a try-except with logging.
♻️ Add error handling for conversation append
```diff
 async def append_turn_to_conversation(
     client: AsyncLlamaStackClient,
     conversation_id: str,
     user_message: str,
     assistant_message: str,
 ) -> None:
     """
     Append a user/assistant turn to a conversation after shield violation.
     ...
     """
+    try:
         await client.conversations.items.create(
             conversation_id,
             items=[
                 {"type": "message", "role": "user", "content": user_message},
                 {"type": "message", "role": "assistant", "content": assistant_message},
             ],
         )
+    except Exception as exc:
+        logger.warning(
+            "Failed to append shield violation turn to conversation %s: %s",
+            conversation_id,
+            exc,
+        )
+        # Don't re-raise - the violation response should still be returned to user
```
tests/unit/utils/test_shields.py (1)
316-344: Consider adding test for multiple shields iteration.
The current tests cover single-shield scenarios. Consider adding a test where the first shield passes but the second shield blocks, to verify the iteration logic in run_shield_moderation.
🧪 Optional: Test for multiple shields
```python
@pytest.mark.asyncio
async def test_returns_blocked_when_second_shield_flags_content(
    self, mocker: MockerFixture
) -> None:
    """Test that run_shield_moderation blocks when second shield flags content."""
    mocker.patch("utils.shields.metrics.llm_calls_validation_errors_total")
    mock_client = mocker.Mock()

    # Setup two shields
    shield1 = mocker.Mock()
    shield1.identifier = "shield-1"
    shield1.provider_resource_id = "model-1"
    shield2 = mocker.Mock()
    shield2.identifier = "shield-2"
    shield2.provider_resource_id = "model-2"
    mock_client.shields.list = mocker.AsyncMock(return_value=[shield1, shield2])

    # Setup models
    model1 = mocker.Mock(identifier="model-1")
    model2 = mocker.Mock(identifier="model-2")
    mock_client.models.list = mocker.AsyncMock(return_value=[model1, model2])

    # First shield passes, second blocks
    safe_result = mocker.Mock(flagged=False)
    flagged_result = mocker.Mock(flagged=True, categories=["violence"], user_message="Blocked")
    mock_client.moderations.create = mocker.AsyncMock(
        side_effect=[
            mocker.Mock(results=[safe_result]),
            mocker.Mock(results=[flagged_result]),
        ]
    )

    result = await run_shield_moderation(mock_client, "test input")

    assert result.blocked is True
    assert result.shield_model == "model-2"
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
docker-compose-library.yaml, docker-compose.yaml, run.yaml, src/app/endpoints/query.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py, src/app/endpoints/streaming_query_v2.py, src/utils/shields.py, src/utils/types.py, tests/e2e/configs/run-azure.yaml, tests/e2e/configs/run-ci.yaml, tests/e2e/configs/run-rhaiis.yaml, tests/e2e/configs/run-rhelai.yaml, tests/e2e/configs/run-vertexai.yaml, tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/utils/test_shields.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use the logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...
Files:
tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_streaming_query_v2.py, src/app/endpoints/streaming_query.py, src/utils/types.py, tests/unit/utils/test_shields.py, tests/unit/app/endpoints/test_streaming_query.py, src/app/endpoints/streaming_query_v2.py, tests/unit/app/endpoints/test_query_v2.py, src/app/endpoints/query.py, src/app/endpoints/query_v2.py, tests/unit/app/endpoints/test_query.py, src/utils/shields.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use the MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests
Files:
tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/utils/test_shields.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_query.py
tests/e2e/features/steps/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define E2E test steps in the tests/e2e/features/steps/ directory
Files:
tests/e2e/features/steps/info.py
tests/unit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion
Files:
tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/utils/test_shields.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_query.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use FastAPI HTTPException with appropriate status codes for API endpoint error handling
Files:
src/app/endpoints/streaming_query.py, src/app/endpoints/streaming_query_v2.py, src/app/endpoints/query.py, src/app/endpoints/query_v2.py
🧠 Learnings (6)
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`
Applied to files:
src/app/endpoints/streaming_query.py, src/app/endpoints/query.py, src/app/endpoints/query_v2.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
Applied to files:
src/app/endpoints/streaming_query.py, tests/unit/app/endpoints/test_streaming_query.py, src/app/endpoints/streaming_query_v2.py, tests/unit/app/endpoints/test_query_v2.py, src/app/endpoints/query.py, tests/unit/app/endpoints/test_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling
Applied to files:
src/app/endpoints/streaming_query.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_query.py
📚 Learning: 2025-12-18T10:21:09.038Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:09.038Z
Learning: In Llama Stack version 0.3.x, telemetry provider configuration is not supported under the `providers` section in run.yaml configuration files. Telemetry can be enabled with just `telemetry.enabled: true` without requiring an explicit provider block.
Applied to files:
tests/e2e/configs/run-rhelai.yaml
📚 Learning: 2025-12-18T10:21:03.056Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:03.056Z
Learning: In run.yaml for Llama Stack 0.3.x, do not add a telemetry provider block under providers. To enable telemetry, set telemetry.enabled: true directly (no provider block required). This pattern applies specifically to run.yaml configuration in this repository.
Applied to files:
run.yaml
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling
Applied to files:
tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_query.py
🧬 Code graph analysis (5)
tests/unit/app/endpoints/test_streaming_query_v2.py (2)
src/utils/types.py (1): ShieldModerationResult (103-108)
src/models/responses.py (1): model (1732-1745)
tests/unit/utils/test_shields.py (1)
src/utils/shields.py (4): append_turn_to_conversation (133-157), detect_shield_violations (37-60), get_available_shields (19-34), run_shield_moderation (63-130)
src/app/endpoints/streaming_query_v2.py (2)
src/utils/shields.py (2): append_turn_to_conversation (133-157), run_shield_moderation (63-130)
src/utils/suid.py (1): normalize_conversation_id (57-78)
src/app/endpoints/query_v2.py (4)
src/utils/shields.py (2): append_turn_to_conversation (133-157), run_shield_moderation (63-130)
src/utils/types.py (1): TurnSummary (144-263)
src/utils/suid.py (1): normalize_conversation_id (57-78)
src/utils/token_counter.py (1): TokenCounter (18-47)
src/utils/shields.py (2)
src/models/responses.py (2): NotFoundResponse (1532-1588), model (1732-1745)
src/utils/types.py (1): ShieldModerationResult (103-108)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E: server mode / ci
🔇 Additional comments (43)
src/app/endpoints/streaming_query.py (2)
932-944: Exception handling for RateLimitError is consistent with the new import.
The usage of RateLimitError in the exception handler correctly uses getattr(e, "model", "") to safely access the model attribute. This pattern handles both cases where the exception may or may not have a model attribute, which is important given the import source change.
14-18: Import relocation is correct and verified.
The RateLimitError import from llama_stack_client is properly exported and used consistently throughout the codebase. The # type: ignore comment is justified: while the import functions correctly, the type stubs do not fully export this symbol. This pattern matches other endpoint files and aligns with the coding guidelines to use Llama Stack imports.
tests/e2e/configs/run-vertexai.yaml (2)
82-85: OpenAI provider addition enables shield functionality.
The new OpenAI inference provider is required for the llama-guard shield configuration below, which uses openai/gpt-4o-mini as the provider_shield_id.
140-143: Shield configuration aligns with the PR objective.
The llama-guard shield is now configured with openai/gpt-4o-mini as the provider model, enabling the Moderations API implementation. This is consistent with other e2e config files in the PR. The use of gpt-4o-mini for moderation (shields) versus gpt-4-mini for general inference (docker-compose defaults) is intentional and consistently applied across all configurations.
src/app/endpoints/query.py (2)
11-16: Import relocation is consistent with the streaming_query.py changes.
The RateLimitError import has been moved from litellm.exceptions to llama_stack_client, maintaining consistency across endpoint files. This aligns with the coding guidelines to use Llama Stack imports.
443-451: Exception handling for RateLimitError follows the established pattern.
The exception handler correctly uses getattr(e, "model", "") to safely access the model attribute, consistent with the implementation in streaming_query.py.
src/utils/types.py (1)
103-108: LGTM!
The ShieldModerationResult model is well-designed with appropriate type annotations using modern union syntax (str | None). The fields clearly capture the moderation outcome (blocked status, optional message, and shield model identifier).
tests/e2e/features/steps/info.py (1)
109-109: LGTM!
The changes correctly update the shield validation to use context.default_provider and expect the identifier "llama-guard", aligning with the new shield configuration pattern across the PR.
Also applies to: 120-121
tests/e2e/configs/run-azure.yaml (2)
83-86: LGTM!
The OpenAI provider configuration is correctly added to support the shield moderation functionality. The API key is properly sourced from an environment variable.
145-148: LGTM!
The shield configuration correctly maps the llama-guard shield to use openai/gpt-4o-mini as the provider shield, consistent with the pattern established across other configuration files in this PR.
tests/e2e/configs/run-rhelai.yaml (2)
27-30: LGTM!
The OpenAI provider is correctly added to the RHEL AI test configuration, enabling the shield moderation functionality.
139-140: LGTM!
The shield configuration is updated consistently with other configuration files in this PR, mapping llama-guard to openai/gpt-4o-mini.
tests/unit/app/endpoints/test_query.py (2)
12-13: LGTM!
The import changes correctly align with the migration of RateLimitError from litellm to llama_stack_client. The httpx import is needed for constructing mock HTTP responses in the rate limit test.
2423-2428: LGTM!
The test correctly updates the RateLimitError construction to use the llama_stack_client API pattern with explicit response and body parameters. The expected error message is updated to match the new surface message.
Also applies to: 2456-2456
run.yaml (1)
134-142: LGTM!
The model and shield resource registrations are properly structured. The provider_shield_id: openai/gpt-4o-mini correctly references the OpenAI model for shield moderation, which aligns with the new Moderations API implementation.
tests/e2e/configs/run-ci.yaml (1)
135-148: LGTM!
The CI test configuration correctly adds the gpt-4o-mini model and llama-guard shield entries, consistent with the main run.yaml configuration. This ensures CI tests exercise the new shield moderation flow.
src/app/endpoints/query_v2.py (3)
13-13: LGTM!
The AsyncLlamaStackClient import is now clean without the type: ignore comment, indicating proper type support in the library version.
45-48: LGTM!
The new imports for append_turn_to_conversation and run_shield_moderation from utils.shields follow the project's absolute import convention.
373-391: Shield moderation flow is well-implemented.
The pre-LLM moderation check correctly:
- Runs moderation before the Responses API call
- Records the conversation turn even when blocked (for audit trail)
- Returns early with an appropriate TurnSummary containing the violation message
- Returns zero token counts since no LLM call was made
One minor note: On line 388, normalize_conversation_id(conversation_id) is called, but when a new conversation is created (lines 363-366), conversation_id is already assigned the normalized value. This double normalization is harmless but redundant in the new-conversation path.
tests/e2e/configs/run-rhaiis.yaml (2)
27-30: OpenAI provider added for shield moderation.
The OpenAI provider is correctly configured to support the llama-guard shield using gpt-4o-mini. Note that this requires the OPENAI_API_KEY environment variable to be set for RHAIIS e2e tests.
Please verify that the CI environment for RHAIIS tests has OPENAI_API_KEY configured, as this is now a required dependency for shield moderation.
137-140: Shield configuration simplified to fixed model.
The shield configuration has been changed from environment-variable-based (${env.SAFETY_MODEL}, ${env.CODE_SCANNER_MODEL}) to a fixed openai/gpt-4o-mini model. This removes the ability to configure different safety models per environment.
Confirm this is intentional - the PR description mentions the prior Shields API "did not work properly," so consolidating to a single working configuration makes sense. However, this removes the code-scanner shield entirely.
tests/unit/app/endpoints/test_streaming_query.py (3)
11-12: LGTM!
The imports correctly reflect the migration of RateLimitError and APIConnectionError from litellm to llama_stack_client. The httpx import is necessary to construct mock HTTP responses for the updated RateLimitError signature.
2142-2143: LGTM!
The error message assertion is updated to match the new response format. The shortened message "The quota has been exceeded" is still descriptive and appropriately indicates the rate limit scenario.
2087-2092: RateLimitError usage is correct and consistent across tests.
The test properly constructs the RateLimitError with the correct arguments: a message string, a response httpx.Response object with 429 status, and body=None. This pattern is used consistently across all four related test files and correctly simulates the rate-limit scenario.
src/app/endpoints/streaming_query_v2.py (3)
8-17: LGTM! New imports for violation stream support.
The additional OpenAI response types are correctly imported to support the new create_violation_stream function that emits shield violation responses.
62-65: LGTM! Shield moderation utilities imported.
The imports for append_turn_to_conversation and run_shield_moderation align with the new pre-LLM moderation flow.
500-517: LGTM! Completed response structure looks correct.
The OpenAIResponseObjectStreamResponseCompleted correctly wraps an OpenAIResponseObject with the violation message in the output, maintaining compatibility with the normal response processing path.
src/utils/shields.py (2)
16-17: LGTM! Default violation message constant.
The DEFAULT_VIOLATION_MESSAGE provides a user-friendly fallback when moderation blocks content without a specific message.
116-128: LGTM! BadRequestError handling with documented rationale.
The catch for BadRequestError with the comment explaining the Llama Stack bug is appropriate defensive coding. Treating parse failures as blocked content errs on the side of safety.
tests/unit/utils/test_shields.py (4)
1-16: LGTM! Test imports and setup.
Correct use of pytest, pytest-mock, and httpx for mocking BadRequestError. The import of DEFAULT_VIOLATION_MESSAGE enables testing the fallback message behavior.
108-156: Good coverage of no-shields and moderation-passes scenarios.
The tests correctly verify that run_shield_moderation returns blocked=False when no shields are configured and when moderation passes.
158-196: LGTM! Flagged content test with metric verification.
Good test coverage verifying the blocked result, custom message propagation, shield_model assignment, and metric increment.
277-314: LGTM! BadRequestError handling test.
Correctly tests the documented Llama Stack bug workaround, verifying that BadRequestError results in blocked content with the default message.
tests/unit/app/endpoints/test_streaming_query_v2.py (5)
153-190: LGTM! Mock-based stream events.
The switch from SimpleNamespace to Mock objects for streaming events is appropriate and allows proper attribute access patterns. The explicit setting of the name attribute on item_mock correctly handles Mock's special parameter handling.
276-324: LGTM! Shield moderation integration test.
The test correctly verifies that when shields are available, moderations.create is called with the user's input and that responses.create proceeds when moderation passes.
362-450: LGTM! Shield violation streaming test.
Good test coverage for the blocked-by-moderation path, verifying that the violation message appears in the streamed response.
558-570: LGTM! Quota exceeded test with proper RateLimitError mock.
The test correctly constructs a RateLimitError with an httpx.Response object, matching the expected interface from llama-stack-client.
10-11: No action needed.
RateLimitError is correctly imported from llama_stack_client with proper constructor interface (message, response, and body arguments) and is consistently used across all test files.
tests/unit/app/endpoints/test_query_v2.py (5)
8-9: LGTM! Updated imports for moderation flow.
The imports of httpx, RateLimitError from llama_stack_client, and ShieldModerationResult are correctly added to support the new test scenarios.
Also applies to: 20-20
134-136: Consistent shield moderation mocking pattern.
The pattern of mocking shields.list returning an empty list and models.list is consistently applied across tests. This ensures run_shield_moderation executes without blocking for tests not specifically testing moderation.
723-759: LGTM! Shield violation test verifies early return.
Excellent test coverage verifying that when run_shield_moderation returns blocked=True, responses.create is NOT called and the violation message is returned in the summary.
596-599: LGTM! Quota test with mocked moderation.
Correctly mocks run_shield_moderation to return blocked=False so the test can proceed to exercise the RateLimitError path from responses.create.
670-675: LGTM! Moderation and response calls verified.
The test correctly asserts that moderations.create is called with the user's query and model, and that responses.create is called after moderation passes.
82f8a81 to 066b0e5
LGTM nice work!
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/app/endpoints/streaming_query_v2.py (1)
472-517: Consider generating unique identifiers for violation streams.
The hardcoded response_id="resp_shield_violation" and item_id="msg_shield_violation" will be identical across all shield violation responses. While this may work for the current use case, consider if unique IDs are needed for tracing, logging, or client-side deduplication. Additionally, created_at=0 uses a placeholder timestamp. If consumers rely on this field, consider using int(time.time()) instead.
♻️ Optional: Generate unique IDs and proper timestamp
```diff
+import time
+import uuid

 async def create_violation_stream(
     message: str,
     shield_model: str | None = None,
 ) -> AsyncIterator[OpenAIResponseObjectStream]:
     """Create a minimal response stream for shield violations."""
-    response_id = "resp_shield_violation"
-    item_id = "msg_shield_violation"
+    response_id = f"resp_shield_{uuid.uuid4().hex[:12]}"
+    item_id = f"msg_shield_{uuid.uuid4().hex[:12]}"
+    created_at = int(time.time())

     # ... rest of function ...

     # Completed response
     yield OpenAIResponseObjectStreamResponseCompleted(
         response=OpenAIResponseObject(
             id=response_id,
-            created_at=0,
+            created_at=created_at,
             model=shield_model or "shield",
```
src/utils/shields.py (1)
116-128: BadRequestError handling is a pragmatic workaround.
The comment clearly documents that this is a known Llama Stack bug where violations with unparseable formats raise BadRequestError. Treating this as blocked content with the default message is a safe approach. Consider adding a TODO to revisit once the upstream bug is fixed.
📝 Optional: Add TODO for future cleanup
```diff
     # Known Llama Stack bug: BadRequestError is raised when violation is present
     # in the shield LLM response but has wrong format that cannot be parsed.
+    # TODO: Revisit this workaround once upstream Llama Stack bug is fixed.
     except BadRequestError:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
docker-compose-library.yaml, docker-compose.yaml, run.yaml, src/app/endpoints/query.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py, src/app/endpoints/streaming_query_v2.py, src/utils/shields.py, src/utils/types.py, tests/e2e/configs/run-azure.yaml, tests/e2e/configs/run-ci.yaml, tests/e2e/configs/run-rhaiis.yaml, tests/e2e/configs/run-rhelai.yaml, tests/e2e/configs/run-vertexai.yaml, tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/utils/test_shields.py
🚧 Files skipped from review as they are similar to previous changes (8)
- src/utils/types.py
- src/app/endpoints/query.py
- tests/e2e/configs/run-azure.yaml
- tests/e2e/configs/run-rhaiis.yaml
- tests/e2e/configs/run-ci.yaml
- tests/unit/app/endpoints/test_query.py
- tests/e2e/features/steps/info.py
- tests/unit/utils/test_shields.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use the logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...
Files:
src/app/endpoints/streaming_query_v2.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py, src/utils/shields.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use FastAPI HTTPException with appropriate status codes for API endpoint error handling
Files:
src/app/endpoints/streaming_query_v2.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use the MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests
Files:
tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py
tests/unit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion
Files:
tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py
🧠 Learnings (5)
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
Applied to files:
src/app/endpoints/streaming_query_v2.py, src/app/endpoints/streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use Llama Stack imports: `from llama_stack_client import AsyncLlamaStackClient`
Applied to files:
src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py
📚 Learning: 2025-12-18T10:21:03.056Z
Learnt from: are-ces
Repo: lightspeed-core/lightspeed-stack PR: 935
File: run.yaml:114-115
Timestamp: 2025-12-18T10:21:03.056Z
Learning: In run.yaml for Llama Stack 0.3.x, do not add a telemetry provider block under providers. To enable telemetry, set telemetry.enabled: true directly (no provider block required). This pattern applies specifically to run.yaml configuration in this repository.
Applied to files:
run.yaml
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling
Applied to files:
src/app/endpoints/streaming_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling
Applied to files:
tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
🧬 Code graph analysis (4)
src/app/endpoints/streaming_query_v2.py (2)
src/utils/shields.py (2): append_turn_to_conversation (133-157), run_shield_moderation (63-130)
src/utils/suid.py (1): normalize_conversation_id (57-78)
src/app/endpoints/query_v2.py (4)
src/utils/shields.py (2): append_turn_to_conversation (133-157), run_shield_moderation (63-130)
src/utils/types.py (1): TurnSummary (144-263)
src/utils/suid.py (1): normalize_conversation_id (57-78)
src/utils/token_counter.py (1): TokenCounter (18-47)
src/utils/shields.py (2)
src/models/responses.py (2): NotFoundResponse (1532-1588), model (1732-1745)
src/utils/types.py (1): ShieldModerationResult (103-108)
tests/unit/app/endpoints/test_query_v2.py (4)
src/utils/types.py (1): ShieldModerationResult (103-108)
src/models/responses.py (1): model (1732-1745)
src/app/endpoints/streaming_query.py (1): retrieve_response (999-1125)
src/app/endpoints/streaming_query_v2.py (1): retrieve_response (370-469)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E: server mode / ci
🔇 Additional comments (33)
docker-compose.yaml (1)
19-19: LGTM!
Adding a default value for E2E_OPENAI_MODEL prevents runtime failures when the environment variable is not set and aligns with the updated defaults in docker-compose-library.yaml.
docker-compose-library.yaml (1)
21-21: LGTM!
Default model updated to gpt-4o-mini, consistent with the main docker-compose.yaml.
tests/e2e/configs/run-vertexai.yaml (2)
140-143: LGTM!
Shield configuration correctly references the llama-guard safety provider and uses openai/gpt-4o-mini as the backing model for moderation.
82-85: LGTM: OpenAI provider is properly added to enable shield functionality via the Moderations API. This introduces OPENAI_API_KEY as a dependency for VertexAI E2E tests, which is already handled in CI (e2e_tests_providers.yaml provides it as a GitHub secret) and documented in README.md.
src/app/endpoints/query_v2.py (3)
13-13: LGTM!
Import retained from llama_stack_client per coding guidelines.
45-48: LGTM!
New imports for shield moderation utilities enable the pre-LLM moderation flow that replaces the previous Shields API implementation.
373-391: LGTM!
Pre-LLM shield moderation correctly implemented:
- Moderation runs before the Responses API call
- Blocked content short-circuits the LLM call with proper early return
- Conversation history is preserved via append_turn_to_conversation
- Returns appropriate empty values for tool calls, RAG chunks, and token usage when blocked
This aligns with the PR objective of implementing input shields using the Moderations API.
src/app/endpoints/streaming_query.py (1)
14-18: Import change is correct.
RateLimitError is properly exported from llama_stack_client, validating the consolidation of Llama Stack-related imports. The # type: ignore comment appropriately addresses potential type stub gaps.
run.yaml (1)
134-142: Configuration looks good.
The registered resources properly define the gpt-4o-mini model and llama-guard shield with the correct provider mappings. The shield's provider_shield_id correctly references the OpenAI model that will be used for moderation.
tests/e2e/configs/run-rhelai.yaml (2)
27-30: OpenAI provider addition looks correct.
Adding the OpenAI provider enables the llama-guard shield to use gpt-4o-mini for moderation while the main inference continues to use the vLLM provider for RHEL AI models.
139-140: Shield configuration is consistent.
The shield configuration aligns with the main run.yaml and other test configs, using llama-guard with openai/gpt-4o-mini as the moderation model.
tests/unit/app/endpoints/test_streaming_query_v2.py (7)
5-11: Imports updated correctly for new moderation flow.
The imports now include Mock for simulating streaming events, httpx for constructing the rate limit response, and the llama-stack-client exceptions. This aligns with the updated error handling pattern.
53-55: Shield moderation mocks added appropriately.
Mocking shields.list and models.list is necessary for the run_shield_moderation function. Returning empty lists simulates a scenario with no shields configured.
153-190: Streaming event simulation using Mock is cleaner.
Using Mock objects with explicit attributes provides better control over the simulated streaming events compared to SimpleNamespace. The explicit setting of the name attribute (line 179) is correctly handled since name is a special Mock parameter.
276-324: Good coverage for shield moderation pass scenario.
This test properly validates that:
- Moderation is called with the user's query
- When moderation passes (not flagged), responses.create is invoked
The mock setup for shields, models, and moderation results is comprehensive.
362-450: Shield violation streaming test is comprehensive.
The test correctly simulates a violation stream response with the expected structure: content part added, text delta with violation message, and completed response. The assertion on line 450 properly validates the violation message appears in the stream output.
585-587: ShieldModerationResult mock is correctly applied.
Mocking run_shield_moderation to return ShieldModerationResult(blocked=False) allows the test to focus on the quota exceeded scenario without interference from the moderation flow.
558-570: RateLimitError construction requires external package verification.
The test code at lines 558-570 constructs RateLimitError with parameters: a message string, response=mock_response (an httpx.Response object), and body=None. While this pattern is used consistently across multiple test files (test_query.py, test_query_v2.py, test_streaming_query.py, test_streaming_query_v2.py), verification of the exact constructor signature for llama-stack-client==0.3.5 could not be completed. Confirm that the httpx.Response type and body=None parameter align with the actual RateLimitError constructor in the external package.
tests/unit/app/endpoints/test_query_v2.py (5)
8-9: Imports updated consistently with streaming tests.The imports for
httpxand llama-stack-client exceptions align with the pattern established in the streaming query tests.
574-578: RateLimitError mock updated correctly.The mock response construction using
httpx.Responsewith status code 429 properly simulates the rate limit scenario.
596-599: Moderation bypass for quota test is appropriate.Mocking
run_shield_moderationto returnblocked=Falseallows the test to focus specifically on the quota exceeded behavior.
611-612: Assertion matches updated quota error message.The response message changed from "The model quota has been exceeded" to "The quota has been exceeded", which aligns with the
QuotaExceededResponsemodel shown in the relevant code snippets.
724-759: Shield violation test correctly validates blocking behavior.This test is critical for validating the new moderation flow:
- Mocks
run_shield_moderationto return a blocked result with a custom message- Verifies the violation message is returned in the summary
- Correctly asserts that
responses.createis NOT called when content is blockedsrc/app/endpoints/streaming_query_v2.py (3)
8-16: OpenAI response stream type imports added correctly.The additional imports provide the necessary types to construct the violation stream response, including content parts, text deltas, and the completed response structure.
62-65: Shield utility imports properly organized.Importing
append_turn_to_conversationandrun_shield_moderationfromutils.shieldsprovides clean separation of the moderation logic.
444-454: Pre-LLM shield moderation integration is well-structured.The implementation correctly:
- Runs moderation before the LLM call
- Appends the turn to the conversation when blocked (maintaining conversation history)
- Returns a violation stream instead of calling the Responses API
This ensures users see their blocked query in conversation history with the appropriate violation message.
tests/unit/app/endpoints/test_streaming_query.py (3)
11-12: Imports updated consistently.
The change from litellm.exceptions to llama_stack_client for RateLimitError and APIConnectionError aligns with the broader refactoring across the test suite.
2087-2091: RateLimitError mock construction updated.
The test now correctly constructs the RateLimitError with an httpx.Response object and explicit body=None parameter, matching the updated exception signature.
2142-2143: Assertion updated to match new error format.
The quota exceeded message now correctly expects "The quota has been exceeded" instead of the previous "The model quota has been exceeded".
src/utils/shields.py (4)
4-12: Imports and types are well-organized.
The imports properly include HTTPException for error handling, BadRequestError for the Llama Stack workaround, CreateResponse for type casting, and the custom types NotFoundResponse and ShieldModerationResult.
16-17: Default violation message is user-friendly.
The message "I cannot process this request due to policy restrictions." is appropriately vague to avoid revealing security implementation details while being informative to users.
63-130: Core moderation implementation is well-structured.
The function correctly:
- Fetches available models upfront to validate shield configurations
- Raises 404 when a shield's model is not found (lines 86-93)
- Iterates through all shields and calls moderations.create
- Returns early on first violation with appropriate metadata
- Handles BadRequestError as a blocked response (documented Llama Stack bug)
- Increments metrics on violations
- Returns ShieldModerationResult(blocked=False) when all shields pass
One consideration: the validation at lines 86-89 treats both missing provider_resource_id AND model-not-found as the same error. This may mask configuration issues where provider_resource_id is simply not set vs. being set to an invalid model.
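If the two failure modes should be distinguished, a hedged sketch of how the check could be split; the variable available_model_ids and the exact status codes and messages are illustrative, and only the shield attribute names come from the PR:

```python
from fastapi import HTTPException

# Illustrative only: separate "shield misconfigured" from "model not found".
if not shield.provider_resource_id:
    raise HTTPException(
        status_code=500,
        detail=f"Shield {shield.identifier} has no provider_resource_id configured",
    )
if shield.provider_resource_id not in available_model_ids:
    raise HTTPException(
        status_code=404,
        detail=f"Model {shield.provider_resource_id} for shield {shield.identifier} was not found",
    )
```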
133-157: Conversation turn recording is implemented correctly.
The function properly records both the user's original message and the violation response to maintain complete conversation history. This ensures users can see their blocked queries in the conversation context.
tisnik
left a comment
Just one nit in tests (model name), but LGTM in overall. Thank you!
```diff
 assert isinstance(detail, dict)
-assert detail["response"] == "The model quota has been exceeded"  # type: ignore
+assert detail["response"] == "The quota has been exceeded"  # type: ignore
 assert "gpt-4-turbo" in detail["cause"]  # type: ignore
```
still "gpt-4-turbo" there?
066b0e5 to 710cc33
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/app/endpoints/test_query.py (1)
2415-2457: Quota exceeded message expectation may be inconsistent with QuotaExceededResponse.model(...).
Line 2456 expects "The quota has been exceeded", but src/models/responses.py:QuotaExceededResponse.model() (per provided snippet) returns "The model quota has been exceeded". If you intentionally changed surface messaging, consider updating QuotaExceededResponse (or the OpenAPI examples) to match the user-facing response.
🤖 Fix all issues with AI agents
In @src/app/endpoints/streaming_query_v2.py:
- Around line 62-65: The code currently appends the full input_text (which may
include raw attachment bodies built earlier) to the conversation when a shield
violation is detected; change the blocked-path behavior so
append_turn_to_conversation receives a redacted user message instead of full
input_text — e.g., pass query_request.query or a short redacted attachment
summary created from the attachment loop — and ensure any variable named
input_text (or the place where attachment bodies are concatenated) is not stored
on violation; update the code paths where append_turn_to_conversation is called
after run_shield_moderation to use the redacted string.
- Around line 472-520: create_violation_stream currently mixes item_id,
output_index, and content_index across events which breaks stream correlation;
make these values stable for the same output item: pick a single item_id (e.g.,
"msg_shield_violation_1") and reuse it for the ContentPartAdded,
OutputTextDelta, and the final OpenAIResponseMessage, and keep content_index and
output_index the same (e.g., content_index=0 and output_index=0) across those
yields while still incrementing sequence_number as appropriate; update the
OpenAIResponseObjectStreamResponseOutputTextDelta and the
OpenAIResponseObjectStreamResponseCompleted/OpenAIResponseMessage entries to use
the same item_id, content_index, and output_index used in
OpenAIResponseObjectStreamResponseContentPartAdded (a sketch appears after this list).
In @tests/e2e/configs/run-ci.yaml:
- Around line 135-148: The shield entry using shield_id "llama-guard" is
configured with provider_shield_id "openai/gpt-4o-mini", which is an invalid
moderation model; either remove the entire "llama-guard" shields block or change
provider_id/provider_shield_id to a valid moderation model (for OpenAI use
something like "provider_id: openai" with "provider_shield_id:
omni-moderation-latest" or point to the proper Llama Guard moderation model
supported by the provider) and ensure metadata/inference settings match the
moderation provider's requirements.
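For the create_violation_stream item above (see the forward reference there), a sketch of keeping the identifiers stable across events; only the event class names and the fields already named in the reviews are used, and the literal values plus the omitted fields are placeholders:

```python
# Sketch only: define the identifiers once and reuse them so ContentPartAdded,
# OutputTextDelta, and the final message all correlate to the same output item.
response_id = "resp_shield_violation"
item_id = "msg_shield_violation_1"  # illustrative stable item id
output_index = 0
content_index = 0

yield OpenAIResponseObjectStreamResponseContentPartAdded(
    response_id=response_id,
    item_id=item_id,
    output_index=output_index,
    content_index=content_index,
    # ... remaining fields as in the PR ...
)
yield OpenAIResponseObjectStreamResponseOutputTextDelta(
    response_id=response_id,
    item_id=item_id,
    output_index=output_index,
    content_index=content_index,
    delta=message,
    sequence_number=1,
)
# The completed event's OpenAIResponseMessage should reuse the same item_id,
# output_index, and content_index.
```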
🧹 Nitpick comments (5)
src/utils/types.py (1)
103-108: LGTM! Well-structured model for shield moderation results.
The ShieldModerationResult class is cleanly defined with appropriate fields and uses modern type annotation syntax (str | None). Consider adding an Attributes section to the docstring per Google Python docstring conventions to document each field's purpose:
📝 Optional docstring enhancement
```diff
 class ShieldModerationResult(BaseModel):
-    """Result of shield moderation check."""
+    """Result of shield moderation check.
+
+    Attributes:
+        blocked: Whether the content was blocked by the shield.
+        message: Optional message explaining the moderation decision.
+        shield_model: Optional identifier of the model that performed moderation.
+    """

     blocked: bool
     message: str | None = None
     shield_model: str | None = None
```
tests/unit/app/endpoints/test_query_v2.py (1)
615-760: Nice coverage: verifies moderation gates responses.create, but add an assertion that the blocked path records a turn (if that's a requirement).
Lines 724-760 test "blocked returns early"; it currently asserts responses.create isn't called, but doesn't assert conversations.items.create (or append_turn_to_conversation) behavior. If the PR objective includes "store the violation turn," asserting it here would prevent regressions.
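A hedged sketch of the extra assertions, assuming mock_client is the mocked Llama Stack client already used in that test:

```python
# Blocked path: no LLM call, but the violation turn should be recorded.
mock_client.responses.create.assert_not_called()
mock_client.conversations.items.create.assert_called_once()
```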
tests/unit/app/endpoints/test_streaming_query_v2.py (1)
553-587: Good: the quota test now matches the new RateLimitError construction, but keep one shared helper to reduce drift across v1/v2 tests.
The 429 setup pattern repeats across multiple test files; a small helper (e.g., make_httpx_429_response() + make_rate_limit_error()) would reduce future churn.
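One possible shape for that shared helper, using the hypothetical names suggested in the comment (they do not exist in the repo); the RateLimitError construction mirrors the pattern already used in these tests:

```python
import httpx
from llama_stack_client import RateLimitError


def make_httpx_429_response() -> httpx.Response:
    """Build a minimal 429 response for constructing RateLimitError in tests."""
    return httpx.Response(429, request=httpx.Request("POST", "http://test"))


def make_rate_limit_error(message: str = "Rate limit exceeded") -> RateLimitError:
    """Construct a RateLimitError the way the endpoint tests expect it."""
    return RateLimitError(message, response=make_httpx_429_response(), body=None)
```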
src/utils/shields.py (2)
85-94: Missing provider_resource_id is likely a config error; 404 "NotFound" may be the wrong semantic.
Lines 86-94 treat missing/invalid provider_resource_id as 404. If this is "misconfigured server" rather than "client requested unknown resource," consider using a 500-style response (or at least a clearer "shield misconfigured" message).
116-129: Catching only BadRequestError may under-handle other failure modes (timeouts / APIConnectionError).
Lines 118-128 treat a known parsing bug as "blocked," which is fine, but consider whether connection failures should propagate as 503 instead of "blocked." As per learnings, ensure APIConnectionError is handled somewhere in the call chain.
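A sketch of how connection failures could be surfaced separately from moderation blocks, shown as a fragment of the moderation loop; the moderations.create keyword names and the 503 wording are assumptions, while the exception imports follow the guidelines cited above:

```python
from fastapi import HTTPException, status
from llama_stack_client import APIConnectionError, BadRequestError

try:
    response = await client.moderations.create(input=input_text, model=shield_model)
except BadRequestError:
    # Known Llama Stack bug: treat unparseable violations as blocked (as in the PR).
    return ShieldModerationResult(blocked=True, shield_model=shield_model)
except APIConnectionError as exc:
    # Propagate infrastructure failures instead of reporting them as "blocked".
    raise HTTPException(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        detail="Unable to reach the moderation backend",
    ) from exc
```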
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
docker-compose-library.yaml, docker-compose.yaml, run.yaml, src/app/endpoints/query.py, src/app/endpoints/query_v2.py, src/app/endpoints/streaming_query.py, src/app/endpoints/streaming_query_v2.py, src/utils/shields.py, src/utils/types.py, tests/e2e/configs/run-azure.yaml, tests/e2e/configs/run-ci.yaml, tests/e2e/configs/run-rhaiis.yaml, tests/e2e/configs/run-rhelai.yaml, tests/e2e/configs/run-vertexai.yaml, tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/utils/test_shields.py
🚧 Files skipped from review as they are similar to previous changes (8)
- run.yaml
- tests/e2e/configs/run-rhelai.yaml
- src/app/endpoints/query.py
- docker-compose-library.yaml
- src/app/endpoints/query_v2.py
- tests/unit/utils/test_shields.py
- src/app/endpoints/streaming_query.py
- tests/e2e/configs/run-rhaiis.yaml
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Use FastAPI dependencies: from fastapi import APIRouter, HTTPException, Request, status, Depends
Use Llama Stack imports: from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules start with descriptive docstrings explaining purpose
Use the logger = logging.getLogger(__name__) pattern for module logging
Type aliases defined at module level for clarity
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types, using typing_extensions.Self for model validators
Use union types with modern syntax: str | int instead of Union[str, int]
Use Optional[Type] for optional parameters
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead
Use async def for I/O operations and external API calls
Handle APIConnectionError from Llama Stack in error handling
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Use ABC for abstract base classes with @abstractmethod decorators
Complete type annotations for all class attributes
Follow Google Python docstring conventions (https://google.github.io/styleguide/pyguide.html) with sections: Args, Returns, Raises, Attributes
Run uv run make format to auto-format code with black and ruff before completion
Run uv run make verify to run all linters (black, pyl...
Files:
tests/unit/app/endpoints/test_query.py, src/app/endpoints/streaming_query_v2.py, src/utils/shields.py, tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py, src/utils/types.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest for all unit and integration tests, not unittest
Use pytest-mock for AsyncMock objects in tests
Use the MOCK_AUTH = ("mock_user_id", "mock_username", False, "mock_token") pattern for authentication mocks in tests
Files:
tests/unit/app/endpoints/test_query.py, tests/e2e/features/steps/info.py, tests/unit/app/endpoints/test_streaming_query_v2.py, tests/unit/app/endpoints/test_query_v2.py, tests/unit/app/endpoints/test_streaming_query.py
tests/unit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/unit/**/*.py: Unit tests require 60% code coverage
Write unit tests covering new functionality before completion
Files:
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query_v2.py
- tests/unit/app/endpoints/test_query_v2.py
- tests/unit/app/endpoints/test_streaming_query.py
src/app/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use FastAPI HTTPException with appropriate status codes for API endpoint error handling
Files:
src/app/endpoints/streaming_query_v2.py
tests/e2e/features/steps/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define E2E test steps in the tests/e2e/features/steps/ directory
Files:
tests/e2e/features/steps/info.py
🧠 Learnings (3)
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to src/app/**/*.py : Use FastAPI `HTTPException` with appropriate status codes for API endpoint error handling
Applied to files:
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_query_v2.py
- tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.768Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.768Z
Learning: Applies to **/*.py : Handle `APIConnectionError` from Llama Stack in error handling
Applied to files:
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_query_v2.py
- tests/unit/app/endpoints/test_streaming_query.py
📚 Learning: 2026-01-11T16:30:41.767Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-11T16:30:41.767Z
Learning: Applies to **/*.py : Use FastAPI dependencies: `from fastapi import APIRouter, HTTPException, Request, status, Depends`
Applied to files:
- tests/unit/app/endpoints/test_query.py
- src/app/endpoints/streaming_query_v2.py
- tests/unit/app/endpoints/test_query_v2.py
- tests/unit/app/endpoints/test_streaming_query.py
🧬 Code graph analysis (5)
tests/unit/app/endpoints/test_query.py (1)
  src/models/responses.py (1)
    model (1732-1745)
src/app/endpoints/streaming_query_v2.py (2)
  src/utils/shields.py (2)
    append_turn_to_conversation (133-157)
    run_shield_moderation (63-130)
  src/utils/suid.py (1)
    normalize_conversation_id (57-78)
src/utils/shields.py (2)
  src/models/responses.py (2)
    NotFoundResponse (1532-1588)
    model (1732-1745)
  src/utils/types.py (1)
    ShieldModerationResult (103-108)
tests/unit/app/endpoints/test_streaming_query_v2.py (2)
  src/utils/types.py (1)
    ShieldModerationResult (103-108)
  src/models/responses.py (1)
    model (1732-1745)
tests/unit/app/endpoints/test_streaming_query.py (1)
  src/models/responses.py (1)
    model (1732-1745)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E: library mode / ci
🔇 Additional comments (11)
tests/e2e/features/steps/info.py (1)
108-124: LGTM! Shield validation updated to match the new shield structure. The changes correctly update the validation to expect:
- provider_resource_id in the format "{provider}/{model}" (e.g., openai/gpt-4o-mini)
- identifier as "llama-guard", matching the new shield configuration
docker-compose.yaml (1)
19-19: LGTM! Sensible default for E2E_OPENAI_MODEL. Adding gpt-4o-mini as the fallback ensures the container starts correctly when the environment variable is unset, aligning with the model used across other configs in this PR.
tests/e2e/configs/run-vertexai.yaml (2)
140-143: LGTM! Shield configuration follows a consistent pattern. The llama-guard shield is configured correctly, using openai/gpt-4o-mini as the backing model for moderation, consistent with other E2E configs in this PR.
82-85: The OpenAI provider addition is consistent with existing codebase usage. OPENAI_API_KEY is already documented in README.md and multiple guides (docs/providers.md, docs/getting_started.md, docs/deployment_guide.md, docs/byok_guide.md), already configured in CI/CD workflows (e2e_tests.yaml, e2e_tests_providers.yaml, etc. via secrets.OPENAI_API_KEY), and already in use in other configuration files (run.yaml, tests/configuration/run.yaml, examples). No additional verification or documentation is needed; this change follows the established pattern.
tests/e2e/configs/run-azure.yaml (2)
83-86: LGTM! OpenAI provider added for shield moderation. The OpenAI provider is correctly configured to support the llama-guard shield, which routes moderation calls to OpenAI's Moderations API while the main inference uses Azure.
145-148: LGTM! Shield configuration consistent across configs. The llama-guard shield is configured identically to other E2E configs, ensuring consistent moderation behavior across different provider backends.
src/app/endpoints/streaming_query_v2.py (1)
444-469: Error handling for the new I/O calls is covered by the endpoint's error mapping. The new calls to run_shield_moderation() and append_turn_to_conversation() in retrieve_response() are not explicitly wrapped in try-except blocks within the function itself. However, they are covered by the base handler's error mapping at streaming_query_endpoint_handler_base() (streaming_query.py, line 920), which catches APIConnectionError and converts it to a 503 ServiceUnavailableResponse using stream_http_error(). Any APIConnectionError raised by these new I/O calls will propagate up and be handled consistently with the rest of the endpoint, meeting the requirement for 503/5xx behavior on API connection failures.
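For readers unfamiliar with that pattern, here is a minimal sketch of mapping an APIConnectionError raised mid-stream to a 503-style error event. The wrapper name and the SSE payload shape are illustrative assumptions, not the repository's actual stream_http_error() implementation.

# Illustrative sketch only: the wrapper name and SSE payload are assumptions.
import json
from typing import AsyncIterator

from llama_stack_client import APIConnectionError  # top-level export is assumed


async def stream_with_error_mapping(inner: AsyncIterator[str]) -> AsyncIterator[str]:
    """Relay SSE chunks, converting connection failures into a 503-style error event."""
    try:
        async for chunk in inner:
            yield chunk
    except APIConnectionError as exc:  # raised by any Llama Stack call inside `inner`
        payload = {"status_code": 503, "detail": f"Service unavailable: {exc}"}
        yield f"data: {json.dumps({'event': 'error', 'data': payload})}\n\n"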
tests/unit/app/endpoints/test_streaming_query.py (1)
11-13: Remove the unfounded .model attribute extraction from RateLimitError. The getattr(e, "model", "") calls in the RateLimitError handlers assume llama-stack-client exposes a .model field, but the library only provides .status_code and .response. Since the attribute doesn't exist, the fallback is always used, making the model-specific message path unreachable. Either extract the model name from the error message string or remove this dead code path to avoid future maintenance confusion.
Applies to: the src/app/endpoints/streaming_query.py and src/app/endpoints/query.py RateLimitError handlers, and corresponding tests.
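If the first remedy (parsing the message) is chosen, a rough sketch of what that could look like; the message format matched here is an assumption, since the wording of provider rate-limit errors is not guaranteed.

# Best-effort sketch: the "model ..." phrasing matched below is assumed, not a
# documented llama-stack-client message format.
import re


def extract_model_from_error(message: str) -> str:
    """Return a best-effort model name parsed from a rate-limit message."""
    match = re.search(r"model[`'\" ]+([\w./:-]+)", message, flags=re.IGNORECASE)
    return match.group(1) if match else ""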
src/utils/shields.py (1)
6-13: cast(CreateResponse, moderation) doesn't validate the response structure at runtime. The cast() on line 99 is a type hint only; it provides no runtime validation. Lines 101-114 directly access .results[0].flagged, .user_message, and .categories without checking whether the response actually contains these attributes. If the API response structure changes or differs from expectations, these attribute accesses will fail with AttributeError at runtime. Consider adding explicit checks before accessing nested attributes, or use a validated Pydantic model instead of cast().
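As a sketch of the defensive-access option, assuming the response shape described above (results[0].flagged, .user_message, .categories); this is illustrative rather than a verified schema.

# Sketch: tolerate missing or unexpected moderation-response shapes instead of
# letting attribute access raise AttributeError. Attribute names are assumptions.
from typing import Any


def interpret_moderation(moderation: Any) -> tuple[bool, str]:
    """Return (blocked, message); treat missing or empty results as not blocked."""
    results = getattr(moderation, "results", None)
    if not results:
        return False, ""
    first = results[0]
    if not bool(getattr(first, "flagged", False)):
        return False, ""
    message = getattr(first, "user_message", None) or "Input blocked by moderation."
    return True, message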
tests/unit/app/endpoints/test_query_v2.py (1)
7-10: No issues found; imports and mocking patterns align with the llama-stack-client 0.3.5 API. httpx, RateLimitError, and the moderations response shape are correctly used. The Llama Stack Shields API (not the standard OpenAI Moderation API) includes the user_message field in violation responses, which the code and tests properly handle. Tests correctly mock results[0].flagged, results[0].categories, and results[0].user_message as documented in the Llama Stack API.
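For illustration, a hedged pytest-style sketch of that mocking pattern; the client surface (client.moderations.create) and attribute names mirror the review text, and the pytest-asyncio marker is an assumption about the test setup.

# Test sketch: fakes a moderation response with the attributes mentioned above.
# Requires pytest-asyncio for the asyncio marker (an assumption here).
from types import SimpleNamespace
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_blocked_input_is_flagged() -> None:
    """A flagged moderation result should be read back as blocked."""
    fake_result = SimpleNamespace(
        flagged=True,
        categories={"violence": True},
        user_message="I cannot answer that.",
    )
    fake_client = SimpleNamespace(
        moderations=SimpleNamespace(
            create=AsyncMock(return_value=SimpleNamespace(results=[fake_result]))
        )
    )

    moderation = await fake_client.moderations.create(model="llama-guard", input="bad prompt")

    assert moderation.results[0].flagged is True
    assert moderation.results[0].user_message == "I cannot answer that."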
tests/unit/app/endpoints/test_query.py (1)
11-14: Verify the constructor signature for RateLimitError when mocking in tests. The test at lines 2424–2428 constructs RateLimitError(message, response=httpx.Response, body=None). While web documentation confirms RateLimitError exists in llama-stack-client==0.3.5 and exposes .status_code and .response properties, the exact constructor signature and parameter types (particularly whether body is accepted) could not be definitively verified from available sources. Consider checking the library's source code or type stubs to confirm the signature matches your usage.
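One quick way to settle the constructor question is to inspect the installed class directly; a small sketch, assuming llama-stack-client is installed and exports RateLimitError at the top level.

# Prints the constructor signature and public attributes of the installed
# RateLimitError, so the test's keyword arguments can be checked against reality.
import inspect

from llama_stack_client import RateLimitError  # top-level export is assumed

print(inspect.signature(RateLimitError.__init__))
print([name for name in dir(RateLimitError) if not name.startswith("_")])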
from utils.shields import (
    append_turn_to_conversation,
    run_shield_moderation,
)
Potential PII/secrets leak: blocked-path stores full input_text (including attachment content) into conversation.
Lines 448-450 append input_text to the conversation on violation, but input_text includes raw attachment bodies (lines 417-423). This can persist secrets even though the request is blocked. Consider storing only query_request.query (or a redacted attachment summary) in the conversation items.
Proposed mitigation (store redacted user message on violation)
-        await append_turn_to_conversation(
-            client, llama_stack_conv_id, input_text, violation_message
-        )
+        # Avoid persisting raw attachment contents on blocked requests.
+        user_message_for_storage = query_request.query
+        if query_request.attachments:
+            user_message_for_storage += "\n\n[Attachments redacted]"
+        await append_turn_to_conversation(
+            client, llama_stack_conv_id, user_message_for_storage, violation_message
+        )

Also applies to: 444-454
🤖 Prompt for AI Agents
In @src/app/endpoints/streaming_query_v2.py around lines 62 - 65, The code
currently appends the full input_text (which may include raw attachment bodies
built earlier) to the conversation when a shield violation is detected; change
the blocked-path behavior so append_turn_to_conversation receives a redacted
user message instead of full input_text — e.g., pass query_request.query or a
short redacted attachment summary created from the attachment loop — and ensure
any variable named input_text (or the place where attachment bodies are
concatenated) is not stored on violation; update the code paths where
append_turn_to_conversation is called after run_shield_moderation to use the
redacted string.
async def create_violation_stream(
    message: str,
    shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
    """Generate a minimal streaming response for cases where input is blocked by a shield.

    This yields only the essential streaming events to indicate that the input was rejected.
    Dummy item identifiers are used solely for protocol compliance and are not used later.
    """
    response_id = "resp_shield_violation"

    # Content part added (triggers empty initial token)
    yield OpenAIResponseObjectStreamResponseContentPartAdded(
        content_index=0,
        response_id=response_id,
        item_id="msg_shield_violation_1",
        output_index=0,
        part=OpenAIResponseContentPartOutputText(text=""),
        sequence_number=0,
    )

    # Text delta
    yield OpenAIResponseObjectStreamResponseOutputTextDelta(
        content_index=1,
        delta=message,
        item_id="msg_shield_violation_2",
        output_index=1,
        sequence_number=1,
    )

    # Completed response
    yield OpenAIResponseObjectStreamResponseCompleted(
        response=OpenAIResponseObject(
            id=response_id,
            created_at=0,  # not used
            model=shield_model or "shield",
            output=[
                OpenAIResponseMessage(
                    id="msg_shield_violation_3",
                    content=[
                        OpenAIResponseOutputMessageContentOutputText(text=message)
                    ],
                    role="assistant",
                    status="completed",
                )
            ],
            status="completed",
        )
    )
🧩 Analysis chain
🌐 Web query:
llama-stack 0.3.5 streaming response event schema item_id output_index content_index consistency
💡 Result:
Brief summary for llama-stack (v0.3.5) streaming response event fields — meaning and consistency guarantees:
- item_id: unique identifier for a single output item (stays the same across all streaming events that update that item). Use this to correlate delta/done/other events for the same output item. [1][3]
- output_index: integer index of the output item's position in the response output array (stable for the lifetime of that item in this response). Use this to order/match items inside the response list. [1][2]
- content_index: integer index of the part within an output item's content (e.g., for multi-part content or multiple content arrays); used to locate which content element a delta or done event refers to. [1][2]
- sequence_number: monotonically increasing integer used to order streaming events globally for the response (useful to correctly interleave events when receiving them out of network order). Events for the same item/content will have sequence numbers that reflect the correct event ordering. [1][2]
Consistency notes / invariants to rely on:
- item_id + output_index together identify the particular output item in a response; item_id is the stable unique id. [1][3]
- content_index identifies which content element inside that item is being updated; deltas and done events include the content_index they apply to. [1][2]
- sequence_number provides total ordering for events (clients should apply events in sequence_number order to reconstruct text/function-call arguments/etc.). [1][2]
- Event types appear under a fixed "type" field (e.g., response.output_text.delta, response.output_text.done, response.content_part.added, response.content_part.done) so clients can branch handling by type. [1][2]
Sources:
- llama-stack OpenAPI/streaming spec (schema definitions showing item_id, output_index, content_index, sequence_number and event types). [1][2][3]
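To make those invariants concrete, a small consumer sketch, assuming events are objects exposing the fields listed above (type, item_id, content_index, delta, sequence_number); this is illustrative and not the llama-stack client API.

# Illustrative reconstruction of streamed output text: apply events in
# sequence_number order and group deltas by (item_id, content_index).
from collections import defaultdict
from typing import Any, Iterable


def rebuild_output_text(events: Iterable[Any]) -> dict[tuple[str, int], str]:
    """Accumulate output_text deltas per (item_id, content_index)."""
    buffers: dict[tuple[str, int], list[str]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.sequence_number):
        if getattr(event, "type", "") == "response.output_text.delta":
            buffers[(event.item_id, event.content_index)].append(event.delta)
    return {key: "".join(parts) for key, parts in buffers.items()}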
🏁 Script executed:
# Find and examine the actual code at the specified lines
find . -type f -name "streaming_query_v2.py" | head -5
Repository: lightspeed-core/lightspeed-stack
Length of output: 117
🏁 Script executed:
# Read the actual code at the specified lines
sed -n '472,520p' ./src/app/endpoints/streaming_query_v2.py
Repository: lightspeed-core/lightspeed-stack
Length of output: 1734
create_violation_stream must keep item_id, output_index, and content_index consistent across streaming events.
The current implementation uses different IDs and indices across events: content_part.added (lines 484–492) uses item_id="msg_shield_violation_1", output_index=0, content_index=0, but output_text.delta (lines 494–500) uses item_id="msg_shield_violation_2", output_index=1, content_index=1. According to the llama-stack 0.3.5 streaming schema, item_id must remain stable to correlate events for the same output item, and output_index and content_index must be stable for the lifetime of that item. Inconsistent values will break consumers that reconstruct messages by correlating events via these fields.
Proposed fix (consistent IDs/indices)
async def create_violation_stream(
message: str,
shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
@@
response_id = "resp_shield_violation"
+ item_id = "msg_shield_violation"
+ output_index = 0
+ content_index = 0
# Content part added (triggers empty initial token)
yield OpenAIResponseObjectStreamResponseContentPartAdded(
- content_index=0,
+ content_index=content_index,
response_id=response_id,
- item_id="msg_shield_violation_1",
- output_index=0,
+ item_id=item_id,
+ output_index=output_index,
part=OpenAIResponseContentPartOutputText(text=""),
sequence_number=0,
)
# Text delta
yield OpenAIResponseObjectStreamResponseOutputTextDelta(
- content_index=1,
+ content_index=content_index,
delta=message,
- item_id="msg_shield_violation_2",
- output_index=1,
+ item_id=item_id,
+ output_index=output_index,
sequence_number=1,
)
@@
output=[
OpenAIResponseMessage(
- id="msg_shield_violation_3",
+ id=item_id,
content=[
OpenAIResponseOutputMessageContentOutputText(text=message)
],
role="assistant",
status="completed",
)
],
status="completed",
)
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def create_violation_stream(
    message: str,
    shield_model: str | None = None,
) -> AsyncIterator[OpenAIResponseObjectStream]:
    """Generate a minimal streaming response for cases where input is blocked by a shield.

    This yields only the essential streaming events to indicate that the input was rejected.
    Dummy item identifiers are used solely for protocol compliance and are not used later.
    """
    response_id = "resp_shield_violation"
    item_id = "msg_shield_violation"
    output_index = 0
    content_index = 0

    # Content part added (triggers empty initial token)
    yield OpenAIResponseObjectStreamResponseContentPartAdded(
        content_index=content_index,
        response_id=response_id,
        item_id=item_id,
        output_index=output_index,
        part=OpenAIResponseContentPartOutputText(text=""),
        sequence_number=0,
    )

    # Text delta
    yield OpenAIResponseObjectStreamResponseOutputTextDelta(
        content_index=content_index,
        delta=message,
        item_id=item_id,
        output_index=output_index,
        sequence_number=1,
    )

    # Completed response
    yield OpenAIResponseObjectStreamResponseCompleted(
        response=OpenAIResponseObject(
            id=response_id,
            created_at=0,  # not used
            model=shield_model or "shield",
            output=[
                OpenAIResponseMessage(
                    id=item_id,
                    content=[
                        OpenAIResponseOutputMessageContentOutputText(text=message)
                    ],
                    role="assistant",
                    status="completed",
                )
            ],
            status="completed",
        )
    )
🤖 Prompt for AI Agents
In @src/app/endpoints/streaming_query_v2.py around lines 472 - 520,
create_violation_stream currently mixes item_id, output_index, and content_index
across events which breaks stream correlation; make these values stable for the
same output item: pick a single item_id (e.g., "msg_shield_violation_1") and
reuse it for the ContentPartAdded, OutputTextDelta, and the final
OpenAIResponseMessage, and keep content_index and output_index the same (e.g.,
content_index=0 and output_index=0) across those yields while still incrementing
sequence_number as appropriate; update the
OpenAIResponseObjectStreamResponseOutputTextDelta and the
OpenAIResponseObjectStreamResponseCompleted/OpenAIResponseMessage entries to use
the same item_id, content_index, and output_index used in
OpenAIResponseObjectStreamResponseContentPartAdded.
- model_id: gpt-4o-mini
  provider_id: openai
  model_type: llm
  provider_model_id: gpt-4o-mini
- model_id: sentence-transformers/all-mpnet-base-v2
  model_type: embedding
  provider_id: sentence-transformers
  provider_model_id: sentence-transformers/all-mpnet-base-v2
  metadata:
    embedding_dimension: 768
- model_id: gpt-4o-mini
  provider_id: openai
  model_type: llm
  provider_model_id: gpt-4o-mini
shields: []
shields:
  - shield_id: llama-guard
    provider_id: llama-guard
    provider_shield_id: openai/gpt-4o-mini
🧩 Analysis chain
🌐 Web query:
llama-stack 0.3.5 Shield object provider_resource_id provider_shield_id
💡 Result:
Short answer:
- provider_shield_id: the shield identifier defined by the external/provider implementation (the provider's own ID you pass when registering a shield). Use this when you want the provider to look up a specific guard/rail configuration (e.g., "llama-guard-basic" or a Bedrock guardrail id). [1][2]
- provider_resource_id: the llama-stack server's internal resource identifier for the Shield resource (the server-side resource id/UUID that ties the registered Shield object to the provider). It's present in the OpenAPI/spec for Shield resources and is the field the server uses to reference the specific registered resource. [3]
Sources:
- Shield registration / safety docs (examples showing provider_shield_id usage). [1][2]
- Llama‑stack spec showing Shield schema with provider_resource_id field. [3]
🌐 Web query:
llama-stack client.moderations.create() model parameter openai gpt-4o-mini
💡 Result:
Short answer: You should not pass "gpt-4o-mini" to the moderation endpoint. Use OpenAI's moderation model (e.g. "omni-moderation-latest") as the model parameter.
Example:
client.moderations.create(model="omni-moderation-latest", input="...")
Reason: OpenAI exposes dedicated moderation models (omni-moderation-*) for the v1/moderations endpoint; GPT-4o-mini is a general/chat/generation model, not the moderation model. (openai.com)
Note: Llama Stack / related configs also default to OpenAI's moderation model (omni-moderation-latest) when using OpenAI as the provider. (documentation.open-xchange.com)
Citations:
- 1: https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/?utm_source=openai
- 2: https://documentation.open-xchange.com/ai-service/4.4.1/models.html?utm_source=openai
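For reference, a minimal sketch of calling the dedicated moderation model directly with the OpenAI Python SDK; it assumes OPENAI_API_KEY is set and is separate from how llama-stack would route the call.

# Minimal moderation call using the OpenAI SDK; assumes OPENAI_API_KEY is exported.
from openai import OpenAI

client = OpenAI()

result = client.moderations.create(
    model="omni-moderation-latest",
    input="How do I pick a lock?",
)
print(result.results[0].flagged)      # True if the input is flagged
print(result.results[0].categories)   # per-category booleans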
🌐 Web query:
llama-stack 0.3.5 shields llama-guard configuration documentation
💡 Result:
Summary — how to configure Llama Guard shields in llama-stack (v0.3.x)
- Add a safety provider for Llama Guard and an inference provider that serves the Llama‑Guard model (if the shield needs model inference). Example run.yaml snippet:
- providers.safety: inline::llama-guard
- add a second inference provider for the Llama‑Guard model
- create a model entry for the Llama‑Guard model
- add a shields entry pointing to the llama-guard provider/model. [1]
Example (adapted):
providers:
  safety:
    - provider_id: llama-guard
      provider_type: inline::llama-guard
      config: {}
  inference:
    - provider_id: vllm-0
      provider_type: remote::vllm
      config:
        url: ${VLLM_URL}
    - provider_id: vllm-1
      provider_type: remote::vllm
      config:
        url: ${SAFETY_VLLM_URL}
models:
  - model_id: ${INFERENCE_MODEL}
    provider_id: vllm-0
  - model_id: ${SAFETY_MODEL}
    provider_id: vllm-1
shields:
  - provider_id: llama-guard
    shield_id: ${SAFETY_MODEL}
    provider_shield_id: null
[1]
- Provider-specific config fields
- inline::llama-guard supports configuration like excluded_categories (list) in its provider config. See the provider page for field descriptions. [2]
- Registering and using shields at runtime
- Register shields (client.shields.register) and run them via client.safety.run_shield (or attach shields to agents as input_shields/output_shields). Example registration and runtime usage are shown in the Safety Guardrails / API docs. [3]
Links / sources
- Configuring a “Stack” / Safety (run.yaml + Llama Guard example). [1]
- inline::llama-guard provider docs (config fields, sample). [2]
- Safety Guardrails / Register-a-shield and examples (shield registration & runtime usage). [3]
If you want, I can paste the exact YAML + Python/JS snippets ready for v0.3.5 (or check the v0.3.5-specific docs/releases for any small changes).
Remove or reconfigure the llama-guard shield — using openai/gpt-4o-mini as a moderation model is invalid and will fail at runtime.
The configuration at lines 145–148 wires gpt-4o-mini (a general chat model) as the shield, but llama-stack's moderations API requires a dedicated moderation model. OpenAI exposes separate moderation models (e.g., omni-moderation-latest), not gpt-4o-mini. Standard llama-guard setup uses the Llama Guard model itself, not a general-purpose LLM. Either remove this shield entry or reconfigure it to use a proper moderation model with the correct inference provider.
🤖 Prompt for AI Agents
In @tests/e2e/configs/run-ci.yaml around lines 135 - 148, The shield entry using
shield_id "llama-guard" is configured with provider_shield_id
"openai/gpt-4o-mini", which is an invalid moderation model; either remove the
entire "llama-guard" shields block or change provider_id/provider_shield_id to a
valid moderation model (for OpenAI use something like "provider_id: openai" with
"provider_shield_id: omni-moderation-latest" or point to the proper Llama Guard
moderation model supported by the provider) and ensure metadata/inference
settings match the moderation provider's requirements.
Description
This PR adds an implementation of shields using the Moderations API to replace the old Shields API, which does not work properly.
Input shields are now implemented as a separate moderation call made before the Responses API call itself. Output shields are not available.
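As a rough illustration of that flow (not the exact code in this PR), a sketch of running the moderation call first and short-circuiting before the Responses call; the client surface and model names are placeholders.

# Placeholder sketch of the "moderate first, then call Responses" flow described above.
from typing import Any


async def answer_query(client: Any, model: str, shield_model: str, prompt: str) -> str:
    """Run the input shield before the Responses API and short-circuit on violation."""
    moderation = await client.moderations.create(model=shield_model, input=prompt)
    first = moderation.results[0]
    if getattr(first, "flagged", False):
        # Blocked: return the shield's refusal message instead of calling the LLM.
        return getattr(first, "user_message", None) or "Input blocked by shield."

    response = await client.responses.create(model=model, input=prompt)
    # output_text is the OpenAI SDK convenience accessor; adapt for other clients.
    return response.output_text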
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.