Question about the choice of the baseline model #4

@Henry-Bi

Hi Jianyi @IceClear, thank you for your excellent work on SeedVR! I've been studying your project and have a question regarding the model's initialization. I noticed that SeedVR initializes its parameters from Stable Diffusion 3 Medium. This made me curious about the design choice. For a task like low-level video enhancement, my initial thought would be to use a pre-trained video latent diffusion model as the baseline.

I have a hypothesis and would love to know if I'm on the right track:

Is the choice of an image model deliberate because low-level enhancement tasks require a more deterministic mapping from the low-quality input to the high-quality output? Perhaps the strong image priors from SD3-M for restoring textures and details within each frame are more critical than the temporal generation capabilities of a native video model. In this context, maybe the randomness and complex temporal dynamics inherent in video generation models are not the primary focus for an enhancer, or could even be undesirable.

Thank you for your time and for sharing this fantastic project with the community!
