Conversation

@slin1237 slin1237 commented Jan 2, 2026

Move P2P coordination lease creation from model-agent to the basemodel
controller to eliminate race conditions when multiple agents try to
create/acquire the same lease simultaneously.

Changes:

  • Controller now creates P2P leases using resource UUID as lease name
  • Model-agent only acquires existing leases (no longer creates them)
  • Updated RBAC: controller gets create/delete, agent gets only get/update
  • Lease cleanup added to model deletion flow

This ensures only one lease is created per model (by controller) and
agents simply race to acquire it, reducing API pressure from 13+ agents
all trying to create the same lease.
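For illustration, a minimal sketch of the controller-side creation using client-go's coordination/v1 types. The namespace literal and label key are assumptions, the signature is simplified relative to the ensureP2PLease quoted in the review below, and GetP2PLeaseName is the project helper referenced later in the diff:

```go
package controller

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	// plus the project's constants package for GetP2PLeaseName
)

// ensureP2PLease creates the coordination lease for a model if it does not
// already exist; deriving the name from the resource UID keeps it stable
// across reconciles and identical for the controller and every agent.
func ensureP2PLease(ctx context.Context, c client.Client, uid types.UID, ownerName string) error {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      constants.GetP2PLeaseName(uid),
			Namespace: "ome",                                            // namespace assumed
			Labels:    map[string]string{"ome.io/p2p-model": ownerName}, // label key assumed
		},
		// Spec is left empty: the agent that wins the acquisition race fills in
		// HolderIdentity and renews the lease while it downloads from HuggingFace.
	}
	if err := c.Create(ctx, lease); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```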

Add BitTorrent library dependency (anacrolix/torrent) and define
constants for P2P model distribution:
- Lease coordination constants (prefix, labels, durations)
- Default configuration values (ports, rates, timeouts)
- Environment variable keys for P2P configuration
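For orientation, the constants could look roughly like this. P2PLeasePrefix, GetP2PLeaseName, and the 120s/30s timings appear elsewhere in this PR; the prefix value, the remaining names, and the 6881/8081 defaults mirror the environment-variable table further down and are otherwise assumptions:

```go
package constants

import "k8s.io/apimachinery/pkg/types"

const (
	// Lease coordination.
	P2PLeasePrefix              = "p2p-download-" // prefix value assumed
	P2PDefaultLeaseSeconds      = 120             // lease duration (seconds)
	P2PDefaultLeaseRenewSeconds = 30              // renewal interval (seconds)

	// Default ports (names assumed, values from the env-var table below).
	P2PDefaultTorrentPort  = 6881
	P2PDefaultMetainfoPort = 8081
)

// GetP2PLeaseName derives the lease name from the resource UID so the
// controller and every agent compute the same, stable name.
func GetP2PLeaseName(uid types.UID) string {
	return P2PLeasePrefix + string(uid)
}
```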

Introduce the P2P distributor package with:
- Config struct with validation and defaults
- ConfigFromEnv for environment-based configuration
- Comprehensive Prometheus metrics for P2P operations
  - Download metrics (total, duration, failures)
  - Peer discovery and connection metrics
  - Lease and seeding metrics
  - Metainfo server metrics
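A rough sketch of the configuration shape and one metric in the style listed above; all field and metric names are assumptions, and only LeaseDurationSeconds is confirmed as required by a later commit message:

```go
package distributor

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Config holds the P2P distributor settings.
type Config struct {
	DataDir              string // where model files and cached .torrent files live
	PeersService         string // headless Service DNS name used for peer discovery
	TorrentPort          int
	MetainfoPort         int
	MaxDownloadRate      int64 // bytes/s, 0 = unlimited
	MaxUploadRate        int64 // bytes/s, 0 = unlimited
	LeaseDurationSeconds int
}

// Validate rejects incomplete or out-of-range settings.
func (c *Config) Validate() error {
	if c.DataDir == "" {
		return fmt.Errorf("DataDir is required")
	}
	if c.TorrentPort <= 0 || c.TorrentPort > 65535 {
		return fmt.Errorf("invalid torrent port %d", c.TorrentPort)
	}
	if c.LeaseDurationSeconds <= 0 {
		return fmt.Errorf("LeaseDurationSeconds is required")
	}
	return nil
}

// Example Prometheus metric in the style described above (name illustrative).
var p2pDownloadsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{Name: "p2p_model_downloads_total", Help: "P2P model downloads by outcome."},
	[]string{"outcome"},
)
```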

Implement ModelDistributor for P2P model distribution:
- BitTorrent client management with rate limiting
- Peer discovery via Kubernetes headless service DNS
- Metainfo fetching from peers for torrent coordination
- Model seeding and download operations
- Active torrent tracking with proper cleanup
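Peer discovery via the headless Service reduces to a DNS lookup that returns one address per agent pod; a sketch with assumed function and parameter names:

```go
package distributor

import (
	"context"
	"fmt"
	"net"
)

// discoverPeers resolves the headless Service name and returns the metainfo
// HTTP endpoint of every peer pod.
func discoverPeers(ctx context.Context, peersService string, metainfoPort int) ([]string, error) {
	addrs, err := net.DefaultResolver.LookupIPAddr(ctx, peersService)
	if err != nil {
		return nil, fmt.Errorf("peer DNS lookup for %q failed: %w", peersService, err)
	}
	peers := make([]string, 0, len(addrs))
	for _, a := range addrs {
		// Each peer serves its torrent metainfo over HTTP on the metainfo port.
		peers = append(peers, fmt.Sprintf("http://%s:%d", a.IP.String(), metainfoPort))
	}
	return peers, nil
}
```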

Fix API compatibility with anacrolix/torrent v1.57.1:
- Use bencode.Marshal instead of info.MarshalBencode
- Use t.Complete().Bool() instead of t.Complete.Bool
- Handle PeerRemoteAddr type assertion for peer addresses

Implement MetainfoServer to enable peers to discover available models:
- GET /metainfo/{modelHash} - serve torrent metainfo
- GET /health - health check endpoint
- GET /stats - P2P distribution statistics
- GET /models - list available models with seeding status
- Graceful shutdown support
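The server wiring might look roughly like this; the handler method names are assumptions (handleMetainfo is the only one visible later in the CodeQL finding):

```go
package distributor

import (
	"context"
	"net/http"
	"time"
)

// MetainfoServer exposes the endpoints listed above on the metainfo port.
type MetainfoServer struct {
	srv *http.Server
}

func NewMetainfoServer(addr string, d *ModelDistributor) *MetainfoServer {
	mux := http.NewServeMux()
	mux.HandleFunc("/metainfo/", d.handleMetainfo) // GET /metainfo/{modelHash}
	mux.HandleFunc("/health", d.handleHealth)
	mux.HandleFunc("/stats", d.handleStats)
	mux.HandleFunc("/models", d.handleModels)
	return &MetainfoServer{srv: &http.Server{Addr: addr, Handler: mux}}
}

func (s *MetainfoServer) Start() error { return s.srv.ListenAndServe() }

// Shutdown drains in-flight requests before exiting (graceful shutdown).
func (s *MetainfoServer) Shutdown(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return s.srv.Shutdown(ctx)
}
```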

Fix API compatibility with anacrolix/torrent v1.57.1:
- Add exists() helper function
- Use bencode.Marshal instead of info.MarshalBencode

Add comprehensive tests for the distributor package:
- Config validation tests (valid, missing fields, invalid ports)
- ConfigWithDefaults tests
- Metrics recording tests
- Stats struct tests
- Test helper functions for integration tests

Fix test config to include required LeaseDurationSeconds field.

Implement P2PLeaseManager for coordinating model downloads:
- Lease acquisition with expired lease takeover
- Lease renewal for long-running downloads
- Complete/release lifecycle management
- Ensures only one node downloads from HuggingFace

Tests cover:
- Lease acquisition (new, existing, expired)
- Lease expiration detection
- Lease name generation with hash truncation
- Renewal and holder verification
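A sketch of acquisition with expired-lease takeover on top of client-go's coordination/v1 API; struct fields and helper names are assumed, and the real manager also exposes renew, complete, and release as listed above:

```go
package distributor

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type P2PLeaseManager struct {
	client               kubernetes.Interface
	namespace            string
	identity             string // node or pod name
	leaseDurationSeconds int
}

// TryAcquire attempts to take the controller-created lease. It only updates
// the Lease object; creation is the controller's job in this PR.
func (m *P2PLeaseManager) TryAcquire(ctx context.Context, leaseName string) (bool, error) {
	lease, err := m.client.CoordinationV1().Leases(m.namespace).Get(ctx, leaseName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	holder := lease.Spec.HolderIdentity
	if holder != nil && *holder != "" && *holder != m.identity && !m.isExpired(lease) {
		return false, nil // another node holds a live lease
	}
	now := metav1.NewMicroTime(time.Now())
	dur := int32(m.leaseDurationSeconds)
	lease.Spec.HolderIdentity = &m.identity
	lease.Spec.LeaseDurationSeconds = &dur
	lease.Spec.AcquireTime = &now
	lease.Spec.RenewTime = &now
	if _, err := m.client.CoordinationV1().Leases(m.namespace).Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
		if apierrors.IsConflict(err) {
			return false, nil // lost the update race to another agent
		}
		return false, err
	}
	return true, nil
}

// isExpired reports whether the holder stopped renewing the lease (e.g. crashed).
func (m *P2PLeaseManager) isExpired(l *coordinationv1.Lease) bool {
	if l.Spec.RenewTime == nil || l.Spec.LeaseDurationSeconds == nil {
		return true
	}
	return time.Since(l.Spec.RenewTime.Time) > time.Duration(*l.Spec.LeaseDurationSeconds)*time.Second
}
```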

Integrate P2P model distribution into the Gopher download workflow:
- Add P2P fields to Gopher struct (distributor, lease manager, timeout)
- EnableP2P() and SetP2PTimeout() configuration methods
- computeModelHash() for consistent model identification
- downloadWithP2P() orchestrates P2P-first download strategy
- downloadWithLeaseHeld() handles HF download with lease coordination
- waitForP2PAvailability() with exponential backoff for waiting nodes
- startSeeding() begins seeding after successful download

Flow: Check P2P peers → Try P2P download → Acquire lease → HF download → Seed
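The waiting side can be pictured as a capped exponential-backoff poll. This is a simplified sketch of the method body: the actual method also receives the task, model info, and lease name, while HasPeers and TryP2PDownload match the calls visible in the quoted diff further down:

```go
// waitForP2PAvailability polls until the lease holder starts seeding, then
// downloads via P2P; on timeout the caller falls back to HuggingFace.
func (s *Gopher) waitForP2PAvailability(ctx context.Context, modelHash, destPath string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	backoff := 5 * time.Second
	for time.Now().Before(deadline) {
		if s.p2pDistributor.HasPeers(ctx, modelHash) {
			return s.p2pDistributor.TryP2PDownload(ctx, modelHash, destPath, s.p2pTimeout)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2 // exponential backoff, capped at one minute
		}
	}
	return fmt.Errorf("timed out waiting for a peer to seed model %s", modelHash)
}
```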

Add deployment configuration for P2P model distribution:
- Headless Service for peer discovery via DNS
- DaemonSet with P2P-enabled model-agent
- Documentation with architecture overview and usage instructions

Integrate P2P model distribution into the model-agent entry point:
- Add P2P configuration fields (ports, rates, encryption, timeout)
- Read P2P settings from environment variables
- Create ModelDistributor when P2P_ENABLED=true
- Create P2PLeaseManager for download coordination
- Call gopher.EnableP2P() to activate P2P download flow
- Start MetainfoServer for peer discovery
- Add graceful shutdown for P2P resources

Environment variables:
- P2P_ENABLED: Enable/disable P2P (default: false)
- PEERS_SERVICE: Headless service DNS for peer discovery
- P2P_TORRENT_PORT: BitTorrent port (default: 6881)
- P2P_METAINFO_PORT: Metainfo HTTP port (default: 8081)
- P2P_MAX_DOWNLOAD_RATE: Max download rate in bytes/s
- P2P_MAX_UPLOAD_RATE: Max upload rate in bytes/s
- P2P_ENCRYPTION_ENABLED: Enable BitTorrent encryption
- P2P_DOWNLOAD_TIMEOUT: P2P download timeout in seconds
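A sketch of the environment parsing for a couple of these variables, reusing the Config shape sketched earlier; note a later commit in this PR removes ConfigFromEnv in favor of --p2p-* CLI flags, and the defaults and error handling here are illustrative:

```go
import (
	"fmt"
	"os"
	"strconv"
)

// ConfigFromEnv builds a Config from the environment variables listed above
// (only a subset shown).
func ConfigFromEnv() (*Config, error) {
	cfg := &Config{
		PeersService: os.Getenv("PEERS_SERVICE"),
		TorrentPort:  6881,
		MetainfoPort: 8081,
	}
	if v := os.Getenv("P2P_TORRENT_PORT"); v != "" {
		p, err := strconv.Atoi(v)
		if err != nil {
			return nil, fmt.Errorf("invalid P2P_TORRENT_PORT %q: %w", v, err)
		}
		cfg.TorrentPort = p
	}
	if v := os.Getenv("P2P_MAX_DOWNLOAD_RATE"); v != "" {
		r, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid P2P_MAX_DOWNLOAD_RATE %q: %w", v, err)
		}
		cfg.MaxDownloadRate = r
	}
	return cfg, cfg.Validate()
}
```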

- Add dedicated deleteTaskChan for delete tasks to ensure deletions
  are never blocked by downloads (even with 100 concurrent downloads)
- Scout sends delete tasks to deleteChan instead of gopherChan
- Gopher runs dedicated runDeleteWorker for immediate delete processing
- Skip download if already seeding the model (prevents infinite loop
  when update events trigger DownloadOverride for already-downloaded models)

- Add --p2p-* CLI flags to model-agent (replaces P2P_* env vars)
- Fix pod restart: check if model exists on disk before P2P/lease logic
- Update daemonset to use CLI args instead of env vars
- Increase default rate limits from 500 MB/s to 2 GB/s
- Remove unused ConfigFromEnv() and P2P env var constants
- Update README documentation

@github-actions github-actions bot added the documentation, helm, rbac, model-agent, controller, tests, config, and dependencies labels on Jan 2, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the P2P model distribution mechanism by centralizing the creation and management of coordination leases within the basemodel controller. This change aims to eliminate race conditions and significantly reduce the load on the Kubernetes API server by streamlining how model-agents interact with leases. It also introduces a robust BitTorrent-based P2P distribution system, complete with peer discovery, metainfo serving, and comprehensive monitoring, to optimize large model file transfers across the cluster.

Highlights

  • P2P Lease Creation Shift: The responsibility for creating P2P coordination leases has been moved from the model-agent to the basemodel controller. This change centralizes lease creation, ensuring that only one lease is generated per model, which helps prevent race conditions.
  • Model-Agent Role Update: Model-agents will now only acquire existing P2P leases and no longer create them. This reduces the API pressure on the Kubernetes API server, as multiple agents will not simultaneously attempt to create the same lease.
  • RBAC Adjustments: Role-Based Access Control (RBAC) permissions have been updated. The controller now has create and delete permissions for leases resources, while the model-agent is restricted to get, list, update, and patch permissions.
  • Lease Cleanup on Model Deletion: A mechanism has been added to ensure that P2P leases are properly cleaned up when a model is deleted, preventing orphaned lease resources.
  • P2P Distribution Implementation: New components for P2P model distribution using BitTorrent have been introduced, including a ModelDistributor for peer discovery and metainfo handling, a MetainfoServer to serve torrent files, and Prometheus metrics for monitoring P2P activity.
  • Dedicated Delete Worker: A dedicated worker channel (deleteTaskChan) has been added to the Gopher in the model-agent to process deletion tasks immediately, preventing them from being blocked by long-running download operations.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a significant and well-executed change that introduces a P2P model distribution mechanism to alleviate pressure on HuggingFace and improve download times in a cluster. The core logic is robust, handling various edge cases like pod restarts, lease holder crashes, and download cancellations. Moving lease creation to the controller and having agents only acquire them is a solid architectural decision that correctly addresses the race condition problem. The introduction of a dedicated channel for delete tasks is another great improvement that enhances responsiveness. The new distributor package is well-designed, with good optimizations like parallel hashing and metainfo caching. The documentation and tests are also comprehensive. I have a few suggestions to further improve the code, mostly around configuration and removing dead code.

Comment on lines +395 to +396
LeaseDurationSeconds: 120, // 2 minutes
LeaseRenewIntervalSeconds: 30, // renew every 30 seconds

medium

The lease duration and renewal interval are currently hardcoded. It would be beneficial to make these values configurable via command-line flags, similar to other P2P settings. This would provide greater flexibility for tuning the P2P coordination behavior in different cluster environments without requiring code changes.
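For example (using the standard flag package here; the agent's actual flag wiring may differ), the two values could be registered alongside the other --p2p-* flags:

```go
var (
	p2pLeaseDurationSeconds = flag.Int("p2p-lease-duration-seconds", 120,
		"How long the P2P coordination lease stays valid without renewal")
	p2pLeaseRenewIntervalSeconds = flag.Int("p2p-lease-renew-interval-seconds", 30,
		"How often the lease holder renews the lease while downloading")
)
```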

// ensureP2PLease creates a P2P coordination lease for the model if it doesn't exist.
// The lease is used to coordinate which node downloads from HuggingFace while others
// wait for P2P availability. Using the resource UID ensures a unique, stable lease name.
func ensureP2PLease(ctx context.Context, kubeClient client.Client, log logr.Logger, uid types.UID, ownerName string, isClusterScope bool) error {

medium

The isClusterScope parameter in the ensureP2PLease function signature is not used within the function body. To improve code clarity and simplify the function, this parameter should be removed from both the function definition and its call sites.

Suggested change
func ensureP2PLease(ctx context.Context, kubeClient client.Client, log logr.Logger, uid types.UID, ownerName string, isClusterScope bool) error {
func ensureP2PLease(ctx context.Context, kubeClient client.Client, log logr.Logger, uid types.UID, ownerName string) error {

Comment on lines +193 to +214
// runDeleteWorker is a dedicated worker for processing delete tasks.
// This worker runs separately from download workers to ensure deletions
// are never blocked by downloads (even with 100 concurrent downloads).
// Cancellation of active downloads is handled inside processTask.
func (s *Gopher) runDeleteWorker() {
for {
select {
case task, ok := <-s.deleteChan:
if ok {
err := s.processTask(task)
if err != nil {
s.logger.Errorf("Delete task failed with error: %s", err.Error())
}
} else {
s.logger.Info("delete channel closed, delete worker exits.")
return
}
default:
time.Sleep(500 * time.Millisecond)
}
}
}

medium

This dedicated worker for delete tasks is a great architectural improvement. It ensures that model deletions are processed promptly and are not blocked by long-running download tasks, which significantly improves the responsiveness and reliability of the cleanup process. The separation of concerns between download and delete workers is excellent.

Comment on lines +1299 to +1434
// downloadWithP2P orchestrates the model download with P2P support.
// The flow is:
// 1. If already seeding (in-memory state) → skip, we have it
// 2. Try to acquire lease first (determines if we're the HF downloader)
// 3. If lease acquired → download from HF (handles partial/resume) → start seeding
// 4. If lease not acquired → wait for P2P from lease holder
// 5. If P2P wait fails → fallback to HF download
//
// Key insight: The lease holder ALWAYS downloads from HF because HF handles
// partial file recovery and resume. We don't skip HF based on files existing
// on disk because they might be incomplete from an interrupted download.
func (s *Gopher) downloadWithP2P(ctx context.Context, task *GopherTask, baseModelSpec v1beta1.BaseModelSpec,
hfComponents *storage.HuggingFaceStorageComponents, destPath, modelHash, modelInfo, modelType, namespace, name string) error {

// If P2P is not enabled, go directly to HuggingFace
if !s.p2pEnabled || s.p2pDistributor == nil {
return s.downloadFromHuggingFace(ctx, task, baseModelSpec, hfComponents, destPath, modelInfo, modelType, namespace, name)
}

// Check if we're already seeding this model - means we already have it locally
// and verified (seeding state is only set after successful download + metainfo creation).
// This prevents re-downloading a model we just downloaded (e.g., from update events).
if s.p2pDistributor.IsSeeding(modelHash) {
s.logger.Infof("Already seeding model %s (hash: %s), skipping download", modelInfo, modelHash[:16])
return nil
}

// Check ConfigMap - if model is marked Ready, files are complete (pod restart recovery).
// ConfigMap is only updated to Ready AFTER successful download, so this is safe.
// We still need to start seeding since in-memory state was lost on restart.
if s.configMapReconciler.IsModelReady(task.BaseModel, task.ClusterBaseModel) {
if stat, err := os.Stat(destPath); err == nil && stat.IsDir() {
s.logger.Infof("Model %s marked Ready in ConfigMap (pod restart recovery), starting seeding", modelInfo)
s.startSeeding(destPath, modelHash, modelInfo)
return nil
}
// ConfigMap says Ready but files don't exist - this is unexpected, proceed with download
s.logger.Warnf("Model %s marked Ready in ConfigMap but files not found at %s, will re-download", modelInfo, destPath)
}

// Check if model already exists on disk - handles cases where ConfigMap cache is not populated
// (e.g., pod restart before cache was synced). This is a safety net for restart recovery.
// We check the hash directory path to see if it exists (either as symlink or directory).
// os.Stat follows symlinks, so IsDir() will be true if hash path is a symlink to a directory.
hashPath := filepath.Join(s.modelRootDir, modelHash)
if stat, err := os.Stat(hashPath); err == nil && stat.IsDir() {
s.logger.Infof("Model %s already exists at hash path %s (disk recovery), starting seeding", modelInfo, hashPath)
// Use destPath for seeding to maintain consistency with the rest of the codebase
s.startSeeding(destPath, modelHash, modelInfo)
return nil
}

s.logger.Infof("P2P enabled for model %s (hash: %s), acquiring lease", modelInfo, modelHash[:16])

// Step 1: Try to acquire lease FIRST
// The lease determines who downloads from HuggingFace vs who waits for P2P.
// We acquire lease before checking for peers because:
// - Lease holder always downloads from HF (handles partial recovery)
// - Non-lease holders wait for P2P from the lease holder
if s.p2pLeaseManager == nil {
// No lease manager, just download directly from HF
s.logger.Infof("No lease manager configured, downloading directly from HuggingFace for model %s", modelInfo)
if err := s.downloadFromHuggingFace(ctx, task, baseModelSpec, hfComponents, destPath, modelInfo, modelType, namespace, name); err != nil {
return err
}
if ctx.Err() != nil {
return ctx.Err()
}
s.startSeeding(destPath, modelHash, modelInfo)
return nil
}

s.logger.Infof("Successfully downloaded HuggingFace model %s to %s",
modelInfo, downloadPath)
// Use resource UID for lease name (matches what controller creates)
resourceUID := getModelUID(task)
leaseName := constants.GetP2PLeaseName(types.UID(resourceUID))
acquired, err := s.p2pLeaseManager.TryAcquire(ctx, leaseName)
if err != nil {
s.logger.Warnf("Failed to acquire P2P lease for model %s: %v, will try P2P or fallback", modelInfo, err)
}

// Step 2: If we acquired the lease, we're the designated HF downloader
if acquired {
s.logger.Infof("Acquired lease for model %s, downloading from HuggingFace", modelInfo)
return s.downloadWithLeaseHeld(ctx, task, baseModelSpec, hfComponents, destPath, modelHash, modelInfo, modelType, namespace, name, leaseName)
}

// Step 3: Lease held by another node - first check if P2P is already available
// (the lease holder might have already finished downloading and started seeding)
if s.p2pDistributor.HasPeers(ctx, modelHash) {
s.logger.Infof("Peers found for model %s, attempting P2P download", modelInfo)
if err := s.p2pDistributor.TryP2PDownload(ctx, modelHash, destPath, s.p2pTimeout); err == nil {
if ctx.Err() != nil {
return ctx.Err()
}
s.logger.Infof("Successfully downloaded model %s via P2P", modelInfo)
// Start seeding so we can serve other nodes
s.startSeeding(destPath, modelHash, modelInfo)
return nil
}
s.logger.Warnf("P2P download failed for model %s: %v, will wait for P2P availability", modelInfo, err)
}

// Step 4: Wait for P2P availability from the lease holder
s.logger.Infof("Lease held by another node for model %s, waiting for P2P availability", modelInfo)
if err := s.waitForP2PAvailability(ctx, task, modelHash, modelInfo, leaseName, destPath); err == nil {
if ctx.Err() != nil {
return ctx.Err()
}
s.logger.Infof("Model %s downloaded via P2P", modelInfo)
// Start seeding so we can serve other nodes
s.startSeeding(destPath, modelHash, modelInfo)
return nil
} else {
s.logger.Warnf("Wait for P2P failed for model %s: %v, falling back to HuggingFace download", modelInfo, err)
}

// Step 5: Final fallback - download directly from HuggingFace
// This happens when:
// - P2P wait timed out (lease holder took too long or crashed)
// - Lease expired (holder crashed before completing)
if ctx.Err() != nil {
return ctx.Err()
}

s.logger.Infof("Fallback: downloading model %s directly from HuggingFace", modelInfo)
if err := s.downloadFromHuggingFace(ctx, task, baseModelSpec, hfComponents, destPath, modelInfo, modelType, namespace, name); err != nil {
return err
}

if ctx.Err() != nil {
return ctx.Err()
}

s.startSeeding(destPath, modelHash, modelInfo)
return nil
}

medium

The P2P download logic is very robust. It correctly handles various scenarios, including pod restarts (by checking IsModelReady and disk state), lease acquisition races, and fallback to direct HuggingFace download if P2P fails. This resilience is crucial for a production system. The flow of trying to acquire a lease first, then deciding whether to download from HF or wait for P2P, is well-designed.

Comment on lines +46 to +53
// GetLeaseName returns the lease name for a model hash.
func (m *P2PLeaseManager) GetLeaseName(modelHash string) string {
// Truncate hash to fit Kubernetes name constraints
if len(modelHash) > 16 {
modelHash = modelHash[:16]
}
return constants.P2PLeasePrefix + modelHash
}

medium

This GetLeaseName method appears to be dead code, as the lease name is generated in gopher.go using constants.GetP2PLeaseName. Furthermore, its truncation logic is different from what's actually used and could potentially lead to hash collisions. To avoid confusion and prevent future bugs, this unused method should be removed.

renewCtx, cancel := context.WithCancel(ctx)

go func() {
ticker := time.NewTicker(time.Duration(constants.P2PDefaultLeaseRenewSeconds) * time.Second)

medium

The lease renewal interval is hardcoded using a constant. For better configurability and to keep the P2PLeaseManager self-contained, consider adding a leaseRenewInterval field to the struct, initialized from the distributor's configuration. This would be consistent with how leaseDurationSeconds is handled.
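Concretely, the manager could carry the interval as a field set from configuration, and the renewal goroutine would read it instead of the package constant (a sketch; Renew and the constructor wiring are assumed):

```go
type P2PLeaseManager struct {
	// ...existing fields...
	leaseRenewInterval time.Duration // from config, defaulting to the current constant
}

func (m *P2PLeaseManager) startRenewal(ctx context.Context, leaseName string) (stop func()) {
	renewCtx, cancel := context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(m.leaseRenewInterval)
		defer ticker.Stop()
		for {
			select {
			case <-renewCtx.Done():
				return
			case <-ticker.C:
				if err := m.Renew(renewCtx, leaseName); err != nil {
					return // let the lease expire so another node can take over
				}
			}
		}
	}()
	return cancel
}
```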

Comment on lines +83 to +90
1. **Pod starts**: Model-agent initializes P2P distributor
2. **Model request**: Scout detects new BaseModel/ClusterBaseModel
3. **Check local**: If model exists on hostPath, seed it
4. **Try P2P**: Query peers for model via metainfo HTTP
5. **Lease coordination**: If no peers have it, try to acquire lease
6. **HF download**: Lease holder downloads from HuggingFace
7. **Seed**: Downloaded model is seeded to other nodes
8. **Complete**: All nodes have the model

medium

The workflow described in the 'How It Works' section does not fully match the implementation in gopher.go. The implementation first attempts to acquire a lease before checking for P2P peers, whereas the documentation suggests the opposite. To avoid confusion, please update this section to reflect the actual execution order:

  1. Pod starts
  2. Model request detected
  3. Check local (if already seeding or files exist)
  4. Try to acquire lease
  5. If lease acquired: Download from HuggingFace, then seed.
  6. If lease not acquired: Try to download from P2P peers. If peers are not ready, wait for them. If P2P fails, fallback to HuggingFace download.

// Returns nil, nil if file doesn't exist.
func (d *ModelDistributor) LoadMetainfoFromFile(modelHash string) (*metainfo.MetaInfo, error) {
filePath := d.metainfoFilePath(modelHash)
f, err := os.Open(filePath)

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression (High)

This path depends on a user-provided value.

Copilot Autofix

AI 23 days ago

In general, the fix is to validate or constrain the user-controlled modelHash before using it to construct a filesystem path. There are two common approaches: (1) enforce that modelHash is a single “filename-like” component (no path separators or ..), or (2) after joining it with the base directory, resolve the absolute path and verify that it still lies within that base directory. Here, the cleanest, least invasive fix is to ensure the value used in the path is a single component that cannot traverse directories.

The best targeted fix without changing existing functionality is to (a) validate modelHash in handleMetainfo so it cannot contain /, \, or .., and (b) defensively constrain how metainfoFilePath uses modelHash by stripping any directory components and using only the final base name. This preserves current behavior for legitimate hash strings (which do not contain separators), but prevents an attacker-supplied path from escaping d.dataDir. Concretely:

  • In pkg/distributor/metainfo_server.go, after computing modelHash from r.URL.Path, add a check that rejects any value containing /, \, or .. with 400 Bad Request.
  • In pkg/distributor/distributor.go, update metainfoFilePath to normalize modelHash to its last path element using filepath.Base before joining with d.dataDir. This ensures that even if other callers ever pass a path-like string, only the last component is used in the filename.

No new imports are required: both files already import path/filepath and strings.


Suggested changeset 2
pkg/distributor/distributor.go

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/pkg/distributor/distributor.go b/pkg/distributor/distributor.go
--- a/pkg/distributor/distributor.go
+++ b/pkg/distributor/distributor.go
@@ -517,7 +517,8 @@
 
 // metainfoFilePath returns the path to the cached .torrent file for a model.
 func (d *ModelDistributor) metainfoFilePath(modelHash string) string {
-	return filepath.Join(d.dataDir, modelHash+".torrent")
+	safeName := filepath.Base(modelHash)
+	return filepath.Join(d.dataDir, safeName+".torrent")
 }
 
 // saveMetainfoToFile caches the metainfo to a .torrent file for fast serving.
EOF
pkg/distributor/metainfo_server.go
Outside changed files

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/pkg/distributor/metainfo_server.go b/pkg/distributor/metainfo_server.go
--- a/pkg/distributor/metainfo_server.go
+++ b/pkg/distributor/metainfo_server.go
@@ -75,8 +75,11 @@
 	modelHash := strings.TrimPrefix(r.URL.Path, "/metainfo/")
 	modelHash = filepath.Clean(modelHash)
 
-	if modelHash == "" || modelHash == "." {
-		http.Error(w, "Model hash required", http.StatusBadRequest)
+	// Ensure modelHash is a single path component without traversal
+	if modelHash == "" || modelHash == "." ||
+		strings.Contains(modelHash, "/") || strings.Contains(modelHash, "\\") ||
+		strings.Contains(modelHash, "..") {
+		http.Error(w, "Invalid model hash", http.StatusBadRequest)
 		return
 	}
 
EOF