fix: KServe propagate error to client in stream infer #5263
Conversation
Signed-off-by: Guan Luo <gluo@nvidia.com>
Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>
@CodeRabbit review

✅ Actions performed: Review triggered.
Walkthrough

The pull request adds explicit error handling to gRPC streaming paths in the service layer, converting stream errors into observable responses. Test coverage is extended with failure-scenario testing for both single-shot and streaming inference modes, including malformed responses and exception cases.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @tests/frontend/grpc/test_tensor_mocker_engine.py:
- Around lines 213-218: The else branch in the exception assertion block uses `assert False, ...`, which is stripped when Python runs with `-O`. Replace it with an unconditional `raise AssertionError('Expected exception was not raised')` so the failure is always raised regardless of optimization. Update the block that checks `if 'malformed_response' in request_params ... elif 'raise_exception' in request_params ... else` to raise `AssertionError` instead of using `assert False`, keeping the same message and the local variables `request_params` and `excinfo`.
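A minimal sketch of the suggested pattern (the branch conditions come from the test; the function shape and placeholder comments here are illustrative, not the actual test code):

```python
def assert_expected_failure(request_params: dict, raised: bool) -> None:
    """Verify the parametrized failure actually surfaced; never silently pass."""
    if "malformed_response" in request_params and raised:
        return  # the malformed-response error reached the client
    if "raise_exception" in request_params and raised:
        return  # the worker exception reached the client
    # `assert False, msg` disappears under `python -O`; raising an
    # AssertionError directly always fires.
    raise AssertionError("Expected exception was not raised")
```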
In @tests/frontend/grpc/triton_echo_client.py:
- Around lines 113-116: The blocking call `user_data._completed_requests.get()` can hang forever. Give it a timeout (e.g. `user_data._completed_requests.get(timeout=10)`), catch the `queue.Empty` exception from that call, and fail the test with an explicit assertion or error message indicating a timeout (e.g. "Stream inference timed out after Xs"), so the test fails fast instead of hanging. Keep the existing check that the returned `data_item` is not an `Exception`, and make sure the appropriate `Empty` exception class is imported and handled.
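A hedged sketch of the suggested change (the queue attribute name matches the review; the helper name and default timeout are illustrative):

```python
import queue

def wait_for_stream_item(completed_requests: queue.Queue, timeout_s: float = 10):
    """Fetch the next streamed item, failing fast instead of hanging forever."""
    try:
        data_item = completed_requests.get(timeout=timeout_s)
    except queue.Empty:
        raise AssertionError(f"Stream inference timed out after {timeout_s}s")
    # Keep the existing check: callbacks enqueue an Exception on error.
    assert not isinstance(data_item, Exception), data_item
    return data_item
```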
🧹 Nitpick comments (3)
tests/frontend/grpc/triton_echo_client.py (2)
87-101: Consider extracting `UserData` and `callback` to a shared test utility.

The `UserData` class and `callback` function are duplicated between this file and `test_tensor_mocker_engine.py`. Consider extracting them to a shared module to reduce duplication and ensure consistent behavior.
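A sketch of what such a shared module might look like (the module path and the exact callback signature are assumptions inferred from the duplicated helpers, not the actual code):

```python
# e.g. tests/frontend/grpc/stream_utils.py (hypothetical location)
import queue

class UserData:
    """Holds completed stream results (or errors) delivered by the callback."""
    def __init__(self) -> None:
        self._completed_requests: queue.Queue = queue.Queue()

def callback(user_data: UserData, result, error) -> None:
    """Triton-style async callback: enqueue the error if present, else the result."""
    if error is not None:
        user_data._completed_requests.put(error)
    else:
        user_data._completed_requests.put(result)
```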
138-143: Consider adding `run_stream_infer()` to the main block.

The main block runs `check_health`, `run_infer`, and `get_config` for manual testing, but omits the new `run_stream_infer()` method. Consider adding it for complete coverage when running the script directly.

```diff
 client.check_health()
 client.run_infer()
+client.run_stream_infer()
 client.get_config()
 print("Triton echo client ran successfully.")
```

tests/frontend/grpc/test_tensor_mocker_engine.py (1)
163-168: Update docstrings to describe failure testing.

The docstrings for both `test_model_infer_failure` and `test_model_stream_infer_failure` describe echo/identity behavior, but the tests actually verify error propagation for malformed responses and exceptions. Consider updating them to reflect the actual test intent.

```python
"""Test that gRPC streaming inference properly propagates errors to the client.

Verifies that both malformed_response and raise_exception scenarios result in
appropriate error messages being received by the client through the stream.
"""
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- lib/llm/src/grpc/service/kserve.rs
- tests/frontend/grpc/echo_tensor_worker.py
- tests/frontend/grpc/test_tensor_mocker_engine.py
- tests/frontend/grpc/triton_echo_client.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/llm/src/grpc/service/kserve.rs
🧬 Code graph analysis (2)
tests/frontend/grpc/test_tensor_mocker_engine.py (1)
tests/frontend/grpc/triton_echo_client.py (3)
run_stream_infer (69-128), UserData (87-89), callback (96-101)
tests/frontend/grpc/triton_echo_client.py (1)
tests/frontend/grpc/test_tensor_mocker_engine.py (2)
UserData (177-181), callback (188-193)
🪛 Ruff (0.14.10)
tests/frontend/grpc/test_tensor_mocker_engine.py
218-218: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
tests/frontend/grpc/echo_tensor_worker.py
87-87: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (5)
lib/llm/src/grpc/service/kserve.rs (2)
425-435: Same pattern as above - verify `delta.ok()` semantics.

This block mirrors the error handling pattern at lines 364-374. The same concern about the `delta.ok()` method applies here. Ensure consistency in how errors are handled across both streaming paths.
364-374: No changes needed. The code is correct.

The pattern `match delta.ok()` is intentional. The `Annotated::ok(self) -> Result<Self, String>` method is a custom implementation that extracts error information from annotated responses: if the annotation has an `event="error"` field, it returns `Err(message)`, otherwise `Ok(self)`. This design properly propagates streaming errors as `ModelStreamInferResponse` with `error_message` set, allowing the stream to continue processing subsequent items after an error.

tests/frontend/grpc/echo_tensor_worker.py (1)
78-87: LGTM - Test helper for error scenarios.

The implementation correctly adds two test paths:
- `malformed_response`: Corrupts tensor data to trigger downstream parsing errors
- `raise_exception`: Raises an intentional exception to test exception propagation

These align with the PR's goal of testing error propagation in streaming inference. The static analysis hint about TRY003 (long exception message) is acceptable in test code.
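A hedged sketch of how such a worker-side helper might look (the parameter names come from the review; the function shape, error message, and corruption strategy are assumptions):

```python
def maybe_inject_failure(request_params: dict, payload: bytes) -> bytes:
    """Optionally sabotage the echo response to exercise error propagation."""
    if "raise_exception" in request_params:
        # Exercise exception propagation through the streaming path.
        raise RuntimeError("intentional exception for testing")
    if "malformed_response" in request_params:
        # Truncate the tensor payload so the client-side parser fails.
        return payload[: len(payload) // 2]
    return payload
```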
tests/frontend/grpc/test_tensor_mocker_engine.py (2)
112-113: LGTM - Extended test_echo to cover streaming.Good addition to ensure the streaming path is exercised in the happy-path test.
116-148: LGTM - Good parameterized test for single-shot inference failures.

The test correctly verifies that both `malformed_response` and `raise_exception` scenarios propagate meaningful error messages to the client.
rmccorm4 left a comment
LGTM other than:
- Address the logging comment
- Please add a before/after example of an error response to the PR description for better context
Overview:
Before this change, an error occurring during response streaming was not properly returned to the client; it was silently ignored. As a result, a malformed response produced no visible failure beyond error logging on the worker side, whereas the non-streaming path surfaced the error properly.

After this change, the client receives error messages (in the case of the Triton client, the exception is captured in the response callback).

See `test_tensor_mocker_engine.py::test_model_stream_infer_failure`, where an exception is expected: before the change, `user_data._completed_requests.get(timeout=5)` would only raise `Empty` from the get timeout, indicating that the worker error was not propagating back to the client.

Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.