[WIP] Feat: update existing resources instead of wholesale recreation by jhcipar · Pull Request #92 · runpod/flash

jhcipar · 2025-09-23T20:31:36Z

Overview: update existing resources instead of wholesale recreation

Objective: Enable in-place resource updates instead of costly redeploys when configuration changes are detected.

Previously, any configuration change would trigger a complete resource teardown and redeploy cycle.

Solution Overview

This PR introduces an update system that:

Detects configuration changes using content-based hashing
Updates resources in-place via platform APIs when possible
Handles complex updates that may require both template and endpoint modifications

Key Changes

Durable Resource Identity System

resource_id: Now represents a logical, human-readable identifier (ResourceType_name)

Provides stable identity across configuration changes
Enables resource reuse and update tracking
Replaces the previous hash-based approach for resource identification

resource_hash: Content-based hash for change detection

Built from _hashed_fields - only mutable configuration parameters
Excludes platform state (IDs, deployment metadata) to focus on user-controllable config
Triggers update flow when hash changes between runs
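
The split between a stable identity and a content hash can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class name `ServerlessResource`, the fields `workers_max`, `gpu_ids`, and `endpoint_id`, and the choice of SHA-256 over sorted JSON are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass
class ServerlessResource:
    """Hypothetical resource model; field names are illustrative only."""

    name: str
    workers_max: int = 3
    gpu_ids: str = "any"
    # Platform state: assigned at deploy time, deliberately not hashed.
    endpoint_id: Optional[str] = None

    # Only user-controllable configuration takes part in change detection.
    _hashed_fields: ClassVar[tuple] = ("workers_max", "gpu_ids")

    @property
    def resource_id(self) -> str:
        # Logical identity (ResourceType_name): stable across config changes.
        return f"{type(self).__name__}_{self.name}"

    @property
    def resource_hash(self) -> str:
        # Content hash built from the hashed fields only.
        payload = {f: getattr(self, f) for f in self._hashed_fields}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()


old = ServerlessResource(name="worker", workers_max=3, endpoint_id="abc123")
new = ServerlessResource(name="worker", workers_max=5)
```

Here `old` and `new` share a `resource_id` but have different `resource_hash` values, which is the condition that would route the second run into the update flow rather than a redeploy.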

Update Logic

# New update path in ResourceManager.get_or_deploy_resource()
if existing.resource_hash != config.resource_hash:
    # Identify specific changed fields
    for field in existing.__class__._hashed_fields:
        if getattr(existing, field) != getattr(config, field):
            config.fields_to_update.add(field)
    
    # Sync deployment state and update in-place
    await config.sync_config_with_deployed_resource(existing)
    deployed_resource = await config.update()
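
The field comparison in the snippet above can be exercised in isolation. The following is a small standalone sketch: `changed_fields`, `HASHED_FIELDS`, and the `SimpleNamespace` stand-ins are hypothetical names used only for illustration, not part of the PR.

```python
from types import SimpleNamespace


def changed_fields(existing, config, hashed_fields):
    """Return the hashed fields whose values differ between two configs."""
    return {
        name
        for name in hashed_fields
        if getattr(existing, name) != getattr(config, name)
    }


# Hypothetical hashed fields; platform state such as `id` is not compared.
HASHED_FIELDS = ("workers_max", "env")

existing = SimpleNamespace(workers_max=3, env={"MODE": "prod"}, id="abc123")
config = SimpleNamespace(workers_max=3, env={"MODE": "debug"}, id=None)

fields_to_update = changed_fields(existing, config, HASHED_FIELDS)
```

Because `id` is not among the hashed fields, the differing platform ID does not count as a configuration change; only `env` ends up in `fields_to_update`.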

Platform Integration

New GraphQL operations: update_endpoint() and update_template() mutations
Granular updates: System determines whether template, endpoint, or both need updating
State preservation: Maintains platform IDs and deployment metadata across updates
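
One way to picture the granular update decision is a small planning function over `fields_to_update`. This is a sketch under assumptions: the membership of `TEMPLATE_FIELDS` is invented for the example (the PR only mentions env vars as a template-only variable), and `plan_update` is a hypothetical helper. The returned strings simply name the `update_template()` and `update_endpoint()` operations listed above.

```python
# Hypothetical partition of config fields; which fields live on the template
# versus the endpoint is an assumption made for this example.
TEMPLATE_FIELDS = frozenset({"env", "image_name"})


def plan_update(fields_to_update):
    """Return the ordered list of mutations needed for the changed fields."""
    changed = set(fields_to_update)
    steps = []
    if changed & TEMPLATE_FIELDS:
        steps.append("update_template")
    if changed - TEMPLATE_FIELDS:
        steps.append("update_endpoint")
    return steps


template_only = plan_update({"env"})
endpoint_only = plan_update({"workers_max"})
both = plan_update({"env", "workers_max"})
```

A template-only change such as an env var edit would then need a single mutation, while a mixed change would update the template first and the endpoint second.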

Enhanced Resource Model

_hashed_fields: Class-level definition of configuration fields that trigger updates
fields_to_update: Runtime tracking of specific changes to optimize update operations
sync_config_with_deployed_resource(): Transfers deployment state between resource instances
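
As a rough illustration of the state transfer, the sketch below copies platform-assigned values from the deployed instance onto the new config while leaving user configuration untouched. The `Resource` class, its fields, and `_platform_state` are assumptions for the example, and the method is written synchronously for brevity even though the PR awaits it.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass
class Resource:
    """Hypothetical resource; only the method name comes from the PR."""

    name: str
    workers_max: int = 3
    # Platform state, populated once the resource has been deployed.
    id: Optional[str] = None
    template_id: Optional[str] = None

    # Assumed list of attributes that count as platform state.
    _platform_state: ClassVar[tuple] = ("id", "template_id")

    def sync_config_with_deployed_resource(self, deployed: "Resource") -> None:
        # Carry platform IDs over so the update targets the existing
        # endpoint and template instead of creating new ones.
        for attr in self._platform_state:
            setattr(self, attr, getattr(deployed, attr))


deployed = Resource(name="worker", workers_max=3, id="ep-1", template_id="tpl-1")
config = Resource(name="worker", workers_max=5)
config.sync_config_with_deployed_resource(deployed)
```

After the call, `config` keeps its new `workers_max` but carries the IDs of the deployed resource.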

Bug Fixes

GPU configuration persistence: Fixed issue where gpuIds wasn't being properly stored in pickled resource state
Template ID tracking: Ensures template relationships are maintained through update cycles

Logic flow for resource update/creation

flowchart TD
    A[get_or_deploy_resource called] --> B[Acquire resource lock]
    B --> C{Resource exists?}
    
    C -->|No| D[Deploy new resource]
    D --> E[Add to manager & save]
    E --> F[Return deployed resource]
    
    C -->|Yes| G{Is resource deployed?}
    G -->|No| H[Remove invalid resource]
    H --> I[Deploy new resource]
    I --> J[Add to manager & save]
    J --> K[Return deployed resource]
    
    G -->|Yes| L{resource_hash changed?}
    L -->|No| M[Resource unchanged]
    M --> N[Return existing resource]
    
    L -->|Yes| O[Config change detected]
    O --> P[Compare _hashed_fields]
    P --> Q[Identify changed fields]
    Q --> R[Add to fields_to_update set]
    R --> S[sync_config_with_deployed_resource]
    S --> T[Call resource.update]
    
    T --> U{Pod template needs update?}
    U -->|Yes| V[Update template via GraphQL]
    V --> W{Template-only changes?}
    W -->|Yes| X[Return updated resource]
    W -->|No| Y[Update endpoint via GraphQL]
    
    U -->|No| Y
    Y --> Z[Remove old resource]
    Z --> AA[Add updated resource]
    AA --> BB[Return updated resource]
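
The branches of the flowchart can be condensed into a toy, in-memory version. This is a sketch under stated assumptions rather than the actual `ResourceManager`: `FakeResource` stands in for a real resource, a single `asyncio.Lock` replaces whatever locking the PR uses, nothing is persisted, and the method returns an extra action label purely so the chosen branch is visible.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResource:
    """Hypothetical stand-in for a deployable resource."""

    resource_id: str
    resource_hash: str
    platform_id: Optional[str] = None

    def is_deployed(self) -> bool:
        return self.platform_id is not None

    async def deploy(self) -> "FakeResource":
        self.platform_id = f"ep-{self.resource_id}"
        return self

    async def sync_config_with_deployed_resource(self, existing) -> None:
        self.platform_id = existing.platform_id

    async def update(self) -> "FakeResource":
        return self  # a real update would call the platform API here


class ResourceManager:
    def __init__(self) -> None:
        self._resources = {}
        self._lock = asyncio.Lock()

    async def get_or_deploy_resource(self, config):
        async with self._lock:
            existing = self._resources.get(config.resource_id)
            if existing is None or not existing.is_deployed():
                # Missing or invalid entry: drop it and deploy fresh.
                self._resources.pop(config.resource_id, None)
                resource, action = await config.deploy(), "deployed"
            elif existing.resource_hash == config.resource_hash:
                return existing, "unchanged"
            else:
                # Config changed: keep platform state, update in place.
                await config.sync_config_with_deployed_resource(existing)
                resource, action = await config.update(), "updated"
            self._resources[config.resource_id] = resource
            return resource, action


async def main():
    manager = ResourceManager()
    results = []
    for config_hash in ("h1", "h1", "h2"):
        config = FakeResource("Endpoint_demo", config_hash)
        results.append(await manager.get_or_deploy_resource(config))
    return results


results = asyncio.run(main())
actions = [action for _, action in results]
```

Running the same logical resource three times (same hash twice, then a new hash) exercises the deploy, unchanged, and update branches in turn, and the platform ID assigned on first deploy survives the update.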

In the future, we'll have to integrate with durable Tetra state on the server side.

Changed the default resource "identifier" to a resource id that includes an input config name rather than one generated from config args, so a logical resource isn't defined by its config (and we can change config for the same resource)
Added a "resource hash" to the base sls resource class so we can detect changes to input config args and update resources in place instead of redeploying
Added update methods to serverless resources
Changed the deploy method to add platform-related state (e.g. durable resource ids) back to pickled state and config objects at runtime, so we can fetch and interact with runpod sls endpoints created via tetra
Added update template methods to the sls resource so we can update template-only variables via gql (e.g. env vars)
Changed the defaults for some sls resource configs to reflect existing defaults in runpod
Added an update path to the resource manager class for when the existing and new config have different resource hashes
Changed the behavior of the sync gpu and gpuIds fields because there was a bug where gpus would always get created and pickled as the ANY gpu group

jhcipar added 2 commits September 19, 2025 17:22

fix: exclude additional template fields for endpoint update path

1c585f9

jhcipar requested review from deanq and pandyamarut September 23, 2025 20:31

jhcipar changed the title ~~Jhcipar/ae 1196/update existing resources~~ Feat: update existing resources instead of wholesale recreation Sep 23, 2025

jhcipar changed the title ~~Feat: update existing resources instead of wholesale recreation~~ [WIP] Feat: update existing resources instead of wholesale recreation Sep 23, 2025

jhcipar wants to merge 2 commits into main from jhcipar/ae-1196/update-existing-resources
