Skip to content

Conversation

@jskswamy
Copy link
Contributor

What this PR does

Fixes the bug where minReplicas: 0 on a component (Engine or Router) ignores the explicit deploymentMode: RawDeployment annotation.

Changes

File Change
deployment.go Added shouldInferServerless() helper to centralize logic; updated DetermineEngineDeploymentMode() and DetermineDeploymentModes() signatures
controller.go Pass global deploymentMode to DetermineDeploymentModes()
ingress/reconciler.go Get global mode from annotations and pass it
components/engine.go Updated call to new function signature
utils_test.go Updated existing test for new signature
deployment_test.go Added 12 new tests for the fix

Key Implementation

Added a centralized helper to avoid duplicate logic:

// shouldInferServerless checks if minReplicas=0 should trigger Serverless mode.
// Returns false if globalDeploymentMode explicitly blocks it (e.g., RawDeployment).
func shouldInferServerless(minReplicas *int, globalDeploymentMode constants.DeploymentModeType) bool {
    if globalDeploymentMode == constants.RawDeployment {
        return false
    }
    return minReplicas != nil && *minReplicas == 0
}

Why we need it

When users explicitly set deploymentMode: RawDeployment via annotation, they expect that mode to be respected. However, setting minReplicas: 0 on Engine or Router would override this to Serverless mode.

This caused:

  1. Unexpected Knative Service creation instead of Deployment
  2. Confusing errors when Knative is not installed: no kind is registered for the type v1.Service in scheme
  3. User's explicit configuration being silently ignored
  4. Inability to use external autoscalers (like KEDA) with scale-to-zero

Before

annotations:
  ome.io/deployment-mode: RawDeployment # Ignored!
spec:
  predictor:
    engine:
      minReplicas: 0 # Triggers Serverless anyway

After

Global deploymentMode: RawDeployment is respected. minReplicas: 0 only triggers Serverless when the global mode is not explicitly set to RawDeployment.

Fixes #445

Checklist

  • Tests added/updated (if applicable)
  • Docs updated (if applicable) - N/A, internal bug fix
  • make fmt passes locally
  • make vet passes locally
  • make tidy passes locally

slin1237 and others added 30 commits June 27, 2025 18:13
* clean up deps

* update router image

* add router config to sglang runtimes

* add pd runtimes

* add decoderconfig to pd runtimes

* add -pd as suffix to runtime names

* use --model-path for consistancy

* replace --tp with --tp-size for consistency

* use latest sgl image in runtimes

* fix image repo

* fix ome manager img name

* cleanup dep

* add pd to runtime kustomize
* [doc] add missing benchmark in readme

* [misc] add pr labeler
Bumps [actions/configure-pages](https://github.com/actions/configure-pages) from 4 to 5.
- [Release notes](https://github.com/actions/configure-pages/releases)
- [Commits](actions/configure-pages@v4...v5)

---
updated-dependencies:
- dependency-name: actions/configure-pages
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 1 to 2.
- [Release notes](https://github.com/softprops/action-gh-release/releases)
- [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md)
- [Commits](softprops/action-gh-release@v1...v2)

---
updated-dependencies:
- dependency-name: softprops/action-gh-release
  dependency-version: '2'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* remove llama 3.3 70B FP8 from models

* change llama 3.2 90B rt to bf16 version

* remove 11b 90b pd rt

* add llama 3.3 pd isvc sample

* sgl runtime fixes

* update sample isvc

* fix inference service status propagation

* fix API version

* Revert "fix API version"

This reverts commit 326f4b6.

* Revert "fix inference service status propagation"

This reverts commit 46fe309.
…ng (sgl-project#121)

- Fix progress logging by reordering HubConfig options to prevent logger override
  - Fix model deletion using consistent path calculation with getDestPath
  - Add proper nil checks for progressManager.logger to prevent panics
  - Add HTTP 429 rate limiting handling with exponential backoff
  - Enable detailed progress logging for model downloads
  - Fix RBAC permissions to allow model-agent to get secrets
* [release] release 0.1.2
…gl-project#127)

- Add support for AWS S3, Azure Blob, GCS, and GitHub storage providers
  - Extend storage package with parse and validation functions for each provider
  - Update BuildStorageArgs to generate appropriate CLI arguments for genai-bench
  - Add comprehensive test coverage for all storage types
  - Update benchmark documentation with multi-cloud examples and best practices
sgl-project#128)

[misc] Add support for 20+ new model architectures to HuggingFace model config parser
Replace deprecated beta.kubernetes.io/instance-type label with the GA
node.kubernetes.io/instance-type label for node instance type detection.

The code now tries the newer label first using the ok value from map
lookup and falls back to the deprecated beta label for backward
compatibility. This ensures the code works with both old and new
Kubernetes versions while preferring the non-deprecated label when
available.

The deprecated label constant has been moved to pkg/constants for
better maintainability.
Xinyue Zhang and others added 12 commits December 13, 2025 04:56
…ject#430)

* [chart] Update InferenceService template for PD mode support

- Remove runtime spec (auto-selected by operator)
- Add PD mode auto-detection (model names ending with "-pd")
- PD mode: requires engine, decoder, and router
- Non-PD mode: requires engine, optional router (no decoder)
- Support both flat (minReplicas) and nested (engine.minReplicas) config

* [chart] Replace -pd model entries with pdMode configuration option

- Remove 10 -pd model entries from values.yaml (they're not separate models)
- Remove 10 -pd entries from model registry in _helpers.tpl
- Add pdMode field: when true, InferenceService includes decoder and router
- Update InferenceService to use explicit pdMode flag (not model name suffix)
- Update README with pdMode documentation and supported PD models list
- Update model count from 176 to 165

PD mode supported models: kimi-k2-instruct, deepseek-rdma, llama-3-1-70b-instruct,
llama-3-2-1b-instruct, llama-3-2-3b-instruct, llama-3-3-70b-instruct,
llama-4-maverick-17b-128e-instruct-fp8, llama-4-scout-17b-16e-instruct,
mistral-7b-instruct, mixtral-8x7b-instruct
…ce creation (sgl-project#441)

- Add createModel option to control ClusterBaseModel/BaseModel creation
- Add createRuntime option to control ClusterServingRuntime creation
- Both default to true for backward compatibility
- Allows creating runtime-only or model-only deployments
- Update README with examples for both use cases
- Remove fake models (deepseek-rdma)
- Remove variant models (-grpc, -bf16, -pd-grpc)
- Rename models to match config/models metadata names
- Add missing config/models entries for gemma-2-27b-it and Qwen1.5 models
…ject#444)

- Add 40 missing models to values.yaml from config/models
- Remove obsolete gpt-oss-120b-bf16 and gpt-oss-20b-bf16 configs
- Total: 196 models now in sync
…#411)

* Add model capabilities for the supported endpoints

* Add new model capacities and endpoints

* Use ModelAPICapability as the name
Fix bug where minReplicas=0 on Engine or Router components would
incorrectly trigger Serverless deployment mode, ignoring the user's
explicit deploymentMode: RawDeployment annotation.

Changes:
- Add shouldInferServerless() helper to centralize the logic
- Pass globalDeploymentMode through DetermineDeploymentModes()
- Update DetermineEngineDeploymentMode() signature
- Add comprehensive tests for the fix

When deploymentMode: RawDeployment is explicitly set, minReplicas=0
now correctly creates a Deployment instead of a Knative Service.
@github-actions github-actions bot added inferenceservice InferenceService controller changes controller Controller changes ingress Ingress/networking changes tests Test changes labels Dec 15, 2025
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @jskswamy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where user-defined RawDeployment mode was being silently ignored if a component specified minReplicas: 0, leading to unexpected Serverless deployments. The changes ensure that explicit deployment mode annotations are always honored, preventing misconfigurations, confusing errors, and enabling proper use of external autoscalers with scale-to-zero capabilities.

Highlights

  • Bug Fix: Deployment Mode Override: Resolved a bug where minReplicas: 0 on an Engine or Router component would incorrectly override an explicitly set deploymentMode: RawDeployment annotation, forcing Serverless mode.
  • Centralized Logic for Serverless Inference: Introduced a new helper function, shouldInferServerless(), to centralize the logic for determining if minReplicas=0 should trigger Serverless mode, ensuring it respects the global deployment mode.
  • Updated Function Signatures: Modified the signatures of DetermineEngineDeploymentMode() and DetermineDeploymentModes() to accept a globalDeploymentMode parameter, allowing the system to properly consider explicit global settings.
  • Enhanced Test Coverage: Added 12 new tests in deployment_test.go to thoroughly validate the fix and ensure that deployment mode inference correctly respects explicit global settings.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses the bug where minReplicas: 0 incorrectly overrides an explicit deploymentMode: RawDeployment annotation. The introduction of the shouldInferServerless helper function is a great improvement, centralizing the decision logic and making the code cleaner and more maintainable. The function signatures have been updated appropriately to propagate the global deployment mode, and the changes are well-tested with a comprehensive new test suite. I've found one issue in the new test file that will cause a compilation failure, which I've commented on. Otherwise, this is a solid fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

controller Controller changes inferenceservice InferenceService controller changes ingress Ingress/networking changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] minReplicas=0 ignores explicit deploymentMode annotation