
Conversation

@Ultimate-Storm
Contributor

This pull request introduces several updates across multiple files to enhance configurability, improve logging, and add new functionality for federated learning workflows. The most significant changes include updating controller paths for consistency, improving logging in the training workflow, and adding a new job configuration for the STAMP protocol.

Updates to Controller Paths:

  • Updated the controller paths in the config_fed_client.conf and config_fed_server.conf files of both the ODELIA_ternary_classification and minimal_training_pytorch_cnn jobs to use nvflare.app_common.ccwf, keeping them consistent with the updated NVFlare package structure (see the import sketch below).
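
For orientation, here is a minimal sketch of what the relocated import path looks like from Python. The specific controller class names are an assumption based on NVFlare's client-controlled workflow (ccwf) package, not something confirmed by this PR's configs, which reference the dotted paths as strings:

# Swarm-learning controllers live under nvflare.app_common.ccwf
# (class names assumed from NVFlare's ccwf package, not taken from this PR)
from nvflare.app_common.ccwf import SwarmClientController, SwarmServerController

# A config_fed_server.conf workflow entry would reference the dotted path as a
# string, e.g. path = "nvflare.app_common.ccwf.SwarmServerController"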

Enhancements to Logging:

  • Enhanced logging in the main.py file of the minimal_training_pytorch_cnn job to report the received model's parameters and state during training, which improves debugging and monitoring of the training process.

New Job Configuration for STAMP Protocol:

  • Added a new federated learning job configuration for the STAMP protocol, including config_fed_client.conf and config_fed_server.conf, which define the executors, components, and workflows for training and aggregation, with support for PyTorch-based model training and evaluation (a sketch of the client-side loop these configs drive follows after this list).
  • Introduced a comprehensive README.md for the STAMP job, detailing installation, usage, and citation information for the protocol.
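
To make the client-side behavior concrete, here is a minimal sketch of the NVFlare Lightning client API pattern that the training script follows (it calls flare.patch(trainer), as quoted further down in this review). The module and data below are toy placeholders, not the actual STAMP components:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

import nvflare.client.lightning as flare

class TinyModule(pl.LightningModule):  # toy stand-in for the STAMP model
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

loader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))), batch_size=8)
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
flare.patch(trainer)  # wires NVFlare weight exchange into fit()/validate()

model = TinyModule()
while flare.is_running():
    input_model = flare.receive()  # FLModel carrying global params and round metadata
    print(f"current_round={input_model.current_round}")
    trainer.fit(model, train_dataloaders=loader)  # patched trainer loads received weights and sends updates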

Commits:

  • …P integration (Signed-off-by: GitHub CI <ci@github.com>)
  • …d configurations (Signed-off-by: GitHub CI <ci@github.com>)
  • …AMP integration (Signed-off-by: GitHub CI <ci@github.com>)
@Ultimate-Storm Ultimate-Storm requested a review from Copilot July 30, 2025 09:33
@Ultimate-Storm Ultimate-Storm self-assigned this Jul 30, 2025
@Ultimate-Storm
Contributor Author

Ultimate-Storm commented Jul 30, 2025

WIP checklist:

  • Works with simulator, with STAMP 2.2.0
  • Merge production release of STAMP
  • Remove code other than training
  • Reconstruct persister config in .config files
  • Reconstruct the Docker script
  • Startup kits
  • Test across 3 sites

Contributor

Copilot AI left a comment


Pull Request Overview

This pull request introduces comprehensive updates to enhance the MediSwarm federated learning platform by adding support for the STAMP protocol, improving Docker configuration capabilities, and updating existing job configurations. The changes enable broader compatibility with different pathology analysis workflows while maintaining consistency across the platform.

Key changes include:

  • Addition of the STAMP protocol for pathology analysis with complete job configuration, custom modeling components, and federated learning integration
  • Enhanced Docker build system with support for multiple Dockerfile configurations (ODELIA and STAMP)
  • Updated controller paths in existing federated learning job configurations for consistency with the NVFlare package structure

Reviewed Changes

Copilot reviewed 28 out of 30 changed files in this pull request and generated 10 comments.

Summary per file:

  • docker_config/pyproject.toml: Python project configuration for STAMP with dependencies and optional feature extractors
  • docker_config/Dockerfile_stamp: Multi-stage Docker configuration for the STAMP environment with CUDA support and NVFlare integration
  • buildDockerImageAndStartupKits.sh: Enhanced build script supporting multiple Docker configurations and conditional model downloads
  • application/jobs/stamp/: Complete STAMP job implementation including custom models, data handling, and federated learning integration
  • application/provision/project_Odelia_allsites.yml: Updated project configuration with version number placeholders replaced

import modeling.lightning_model
from modeling.registry import MODEL_REGISTRY

print("Training script is using:", modeling.lightning_model.__file__)

Copilot AI Jul 30, 2025


[nitpick] Consider using the logger instead of print() for consistency with the rest of the logging setup in this file.

Suggested change
print("Training script is using:", modeling.lightning_model.__file__)
logger.info("Training script is using: %s", modeling.lightning_model.__file__)

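As context for the suggestion: logger.info("... %s", value) defers string interpolation until the record is actually emitted, and keeps the call consistent with the module's logging setup. A minimal sketch of such a setup, with names assumed rather than taken from the repository:

import logging

# Module-level logger, as typically set up near the top of a training script
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s: %(message)s")

# Lazy %s formatting: the argument is only interpolated if INFO records pass the filter
logger.info("Training script is using: %s", "modeling/lightning_model.py")
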
flare.patch(trainer) # Patch trainer to enable swarm learning
logger.info(f"Site name: {SITE_NAME}")

logger.info(" About to enter flare.is_running loop")

Copilot AI Jul 30, 2025


Remove the leading space in the log message for consistency.

Suggested change
logger.info(" About to enter flare.is_running loop")
logger.info("About to enter flare.is_running loop")

Comment on lines 206 to 230
'''
if input_model is not None:
    logger.info("==== Swarm model received ====")
    logger.info(
        f"input_model.params.keys() = {list(input_model.params.keys())[:10]} ... total = {len(input_model.params)}")
    # log input_model.params
    logger.info(f"input_model.params.keys() = {input_model.params.keys()}")
    logger.info(
        f"model.state_dict().keys() = {list(model.state_dict().keys())[:10]} ... total = {len(model.state_dict())}")
    try:
        model.load_state_dict(input_model.params)
    except Exception as e:
        logger.error("load_state_dict failed:", exc_info=True)
        raise

logger.info(f"[DEBUG] Got input model: {input_model}")
if input_model is None:
    logger.info("[DEBUG] no swarm_start received in 10s")
else:
    model.load_state_dict(input_model.params)

logger.info("[DEBUG] received swarm_start:", input_model)
logger.info(f"Current round: {input_model.current_round}")
'''


Copilot AI Jul 30, 2025


Large block of commented code should be removed to improve code maintainability. Consider using version control history if this code needs to be referenced later.

Suggested change: delete the commented-out block (lines 206 to 230) entirely.

model_info = MODEL_REGISTRY['vit']

ModelClass = model_info["model_class"]
'''

Copilot AI Jul 30, 2025


Large block of commented code with hardcoded values should be removed. This appears to be debugging or test code that should not be in production.

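For readers unfamiliar with the pattern: MODEL_REGISTRY is evidently a dictionary mapping a model name to its class and construction metadata. A hedged sketch of what such a registry might look like; the class and keys here are illustrative, not the actual STAMP registry:

import torch.nn as nn

class VisionTransformer(nn.Module):  # illustrative; not the actual STAMP class
    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.head(x)

# A registry maps a string key to the model class plus its default settings,
# so training code can select architectures by name.
MODEL_REGISTRY = {
    "vit": {"model_class": VisionTransformer, "defaults": {"dim": 512}},
}

model_info = MODEL_REGISTRY["vit"]
ModelClass = model_info["model_class"]
model = ModelClass(**model_info["defaults"])
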
Comment on lines 530 to 583
''''
output_dir = os.getenv("TRAINING_OUTPUT_DIR")
print('output_dir:', output_dir)
#_add_file_handle_(_logger, output_dir=Path(output_dir))
#_logger.info("Using training configuration from environment variables.")

train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path(os.getenv("TRAINING_CLINI_TABLE")),
    slide_table=Path(os.getenv("TRAINING_SLIDE_TABLE")),
    feature_dir=Path(os.getenv("TRAINING_FEATURE_DIR")),
    patient_label=os.getenv("TRAINING_PATIENT_LABEL"),
    ground_truth_label=os.getenv("TRAINING_GROUND_TRUTH_LABEL"),
    filename_label=os.getenv("TRAINING_FILENAME_LABEL"),
    categories=os.getenv("TRAINING_CATEGORIES").split(","),
    # Dataset and loader parameters
    bag_size=int(os.getenv("TRAINING_BAG_SIZE")),
    num_workers=int(os.getenv("TRAINING_NUM_WORKERS")),
    # Training parameters
    batch_size=int(os.getenv("TRAINING_BATCH_SIZE")),
    max_epochs=int(os.getenv("TRAINING_MAX_EPOCHS")),
    patience=int(os.getenv("TRAINING_PATIENCE")),
    accelerator=os.getenv("TRAINING_ACCELERATOR"),
    # Experimental features
    use_vary_precision_transform=os.getenv("USE_VARY_PRECISION_TRANSFORM", "False").lower() == "true",
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)
'''
output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"
#logger.info('output_dir:', output_dir)
# _add_file_handle_(_logger, output_dir=Path(output_dir))
#_logger.info("Using training configuration from environment variables.")

train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/TCGA-CRC-DX_clini.xlsx"),
    slide_table=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/TCGA-CRC-DX_slide_h5.csv"),
    feature_dir=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/uni2-02627079"),
    patient_label="PATIENT",
    ground_truth_label="isMSIH",
    filename_label="FILENAME",
    categories=None,
    # Dataset and loader parameters
    bag_size=512,
    num_workers=15,
    # Training parameters
    batch_size=64,
    max_epochs=64,  # 64
    patience=16,
    accelerator="gpu",
    # Experimental features
    use_vary_precision_transform='true',
    use_alibi='false',
)

Copilot AI Jul 30, 2025


The main() function contains large blocks of commented code and hardcoded paths. Consider cleaning up the commented sections and using configuration files or environment variables for paths.

Suggested change (replace both the commented-out block and the hardcoded invocation with an environment-variable-driven call):
output_dir = os.getenv("TRAINING_OUTPUT_DIR")
if not output_dir:
    raise ValueError("Environment variable TRAINING_OUTPUT_DIR is not set.")
train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path(os.getenv("TRAINING_CLINI_TABLE", "default_clini_table.xlsx")),
    slide_table=Path(os.getenv("TRAINING_SLIDE_TABLE", "default_slide_table.csv")),
    feature_dir=Path(os.getenv("TRAINING_FEATURE_DIR", "default_feature_dir")),
    patient_label=os.getenv("TRAINING_PATIENT_LABEL", "PATIENT"),
    ground_truth_label=os.getenv("TRAINING_GROUND_TRUTH_LABEL", "isMSIH"),
    filename_label=os.getenv("TRAINING_FILENAME_LABEL", "FILENAME"),
    categories=os.getenv("TRAINING_CATEGORIES", "").split(",") if os.getenv("TRAINING_CATEGORIES") else None,
    # Dataset and loader parameters
    bag_size=int(os.getenv("TRAINING_BAG_SIZE", 512)),
    num_workers=int(os.getenv("TRAINING_NUM_WORKERS", 4)),
    # Training parameters
    batch_size=int(os.getenv("TRAINING_BATCH_SIZE", 64)),
    max_epochs=int(os.getenv("TRAINING_MAX_EPOCHS", 64)),
    patience=int(os.getenv("TRAINING_PATIENCE", 16)),
    accelerator=os.getenv("TRAINING_ACCELERATOR", "gpu"),
    # Experimental features
    use_vary_precision_transform=os.getenv("USE_VARY_PRECISION_TRANSFORM", "False").lower() == "true",
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)

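If the repeated os.getenv/cast plumbing above feels noisy, one common refactor is a small typed helper. This is only a sketch of that idea, not code from the PR:

import os
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def env(name: str, cast: Callable[[str], T], default: Optional[T] = None) -> T:
    """Read an environment variable, cast it, and fail loudly when required."""
    raw = os.getenv(name)
    if raw is None:
        if default is None:
            raise ValueError(f"Environment variable {name} is not set.")
        return default
    return cast(raw)

output_dir = env("TRAINING_OUTPUT_DIR", str)                       # required: raises if unset
bag_size = env("TRAINING_BAG_SIZE", int, 512)                      # optional with default
use_alibi = env("USE_ALIBI", lambda s: s.lower() == "true", False)
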
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)
'''
output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"

Copilot AI Jul 30, 2025


Hardcoded file paths should be moved to configuration files or environment variables to improve portability and maintainability.

Suggested change
output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"
output_dir = os.getenv("TRAINING_OUTPUT_DIR", "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new")

RUN /usr/bin/python3.11 -m pip install /workspace/controller && rm -rf /workspace/controller

# ----------- Copy MediSwarm Source and Link -----------
#COPY ./MediSwarm /MediSwarm originally

Copilot AI Jul 30, 2025


Remove commented line that appears to be leftover from development.

Suggested change
#COPY ./MediSwarm /MediSwarm originally

Comment on lines 50 to 53
WORKDIR /workspace/

#docker build -f docker_config/Dockerfile_stamp -t stamp-image .
#docker run --rm -it stamp-image:latest-image bash

Copilot AI Jul 30, 2025


Remove commented Docker commands that appear to be leftover from development. These should be in documentation if needed for reference.

Suggested change
WORKDIR /workspace/
#docker build -f docker_config/Dockerfile_stamp -t stamp-image .
#docker run --rm -it stamp-image:latest-image bash
WORKDIR /workspace/

Comment on lines +11 to +12
self.running_mean = nn.Buffer(torch.ones(1, dtype=dtype))
self.items_so_far = nn.Buffer(torch.ones(1, dtype=dtype))

Copilot AI Jul 30, 2025


nn.Buffer should be register_buffer. The current syntax is incorrect and will likely cause runtime errors.

Suggested change
self.running_mean = nn.Buffer(torch.ones(1, dtype=dtype))
self.items_so_far = nn.Buffer(torch.ones(1, dtype=dtype))
self.register_buffer("running_mean", torch.ones(1, dtype=dtype))
self.register_buffer("items_so_far", torch.ones(1, dtype=dtype))

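Two notes on this: register_buffer is the portable spelling, and torch.nn.Buffer does exist but only in recent PyTorch releases (2.5, to my knowledge), so on an older pinned version the original lines would indeed fail. A minimal sketch of the portable form follows; the class is an illustrative stand-in, not the module under review:

import torch
import torch.nn as nn

class RunningMean(nn.Module):  # illustrative stand-in for the module under review
    def __init__(self, dtype: torch.dtype = torch.float32):
        super().__init__()
        # Buffers travel with .to()/.cuda() and are saved in state_dict,
        # but are not returned by parameters() and receive no gradients.
        self.register_buffer("running_mean", torch.ones(1, dtype=dtype))
        self.register_buffer("items_so_far", torch.ones(1, dtype=dtype))

    def update(self, batch: torch.Tensor) -> None:
        total = self.running_mean * self.items_so_far + batch.sum()
        self.items_so_far += batch.numel()
        self.running_mean.copy_(total / self.items_so_far)
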
# Conflicts:
#	buildDockerImageAndStartupKits.sh
Signed-off-by: GitHub CI <ci@github.com>
        slide for slide in slide_to_patient.keys() if not slide.exists()
    }:
        _logger.warning(
            f"some feature files could not be found: {slides_without_features}"

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High)
This expression logs sensitive data (private) as clear text.

Copilot Autofix (AI, 5 months ago):

To fix the problem, we should avoid logging sensitive patient identifiers in clear text. The log message on line 593 currently outputs the full set of missing feature files, which may include patient IDs. The best way to address this is to either (a) remove patient identifiers from the log message, (b) log only non-sensitive information (e.g., the count of missing files), or (c) mask the identifiers if logging is necessary for operational reasons.

The recommended fix is to log only the number of missing feature files, not their full paths or associated patient IDs. This preserves the utility of the log message for monitoring data completeness without exposing sensitive information. The change should be made in the _log_patient_slide_feature_inconsistencies function in application/jobs/stamp/app/custom/data.py, specifically replacing the log message on line 593.

No new imports or dependencies are required.


Suggested changeset 1: application/jobs/stamp/app/custom/data.py
Autofix patch; run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/application/jobs/stamp/app/custom/data.py b/application/jobs/stamp/app/custom/data.py
--- a/application/jobs/stamp/app/custom/data.py
+++ b/application/jobs/stamp/app/custom/data.py
@@ -590,7 +590,7 @@
         slide for slide in slide_to_patient.keys() if not slide.exists()
     }:
         _logger.warning(
-            f"some feature files could not be found: {slides_without_features}"
+            f"some feature files could not be found: {len(slides_without_features)} missing files"
         )
 
 
EOF
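Option (c) from the explanation above, masking the identifiers, could look like the following sketch. The hashing scheme and function names are assumptions for illustration, not part of the autofix; note that hashing pseudonymizes rather than fully anonymizes:

import hashlib
import logging
from pathlib import Path

_logger = logging.getLogger(__name__)

def _mask(identifier: str) -> str:
    """Collapse a potentially sensitive identifier to a short, stable digest."""
    return hashlib.sha256(identifier.encode()).hexdigest()[:8]

def log_missing_features(slides_without_features: set[Path]) -> None:
    # Log the count plus a few masked examples instead of raw patient-linked paths
    masked = sorted(_mask(p.name) for p in slides_without_features)
    _logger.warning(
        "some feature files could not be found: %d missing (sample of masked ids: %s)",
        len(slides_without_features),
        ", ".join(masked[:5]),
    )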

    if missing_features:
        _logger.warning(
            f"Some patients have no feature file in {feature_dir}: {missing_features}"

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High)
This expression logs sensitive data (private) as clear text.

Copilot Autofix (AI, 5 months ago):

To fix the problem, we should avoid logging the actual patient IDs in clear text. Instead, the log message can indicate the number of patients missing feature files, or log only non-sensitive, aggregate information. If more detail is needed for debugging, consider logging a truncated or hashed version of the patient IDs, or log them only in a secure, access-controlled environment (not shown here). The best fix is to change the log message on line 216 to avoid including missing_features directly. For example, log only the count of missing patients, or a generic message.

Required changes:

  • In file application/jobs/stamp/app/custom/data.py, replace the log message on line 216 to avoid logging the actual patient IDs. Instead, log the count of missing patients.
  • No new imports or methods are needed.
Suggested changeset 1: application/jobs/stamp/app/custom/data.py
Autofix patch; run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/application/jobs/stamp/app/custom/data.py b/application/jobs/stamp/app/custom/data.py
--- a/application/jobs/stamp/app/custom/data.py
+++ b/application/jobs/stamp/app/custom/data.py
@@ -213,7 +213,7 @@
 
     if missing_features:
         _logger.warning(
-            f"Some patients have no feature file in {feature_dir}: {missing_features}"
+            f"{len(missing_features)} patients have no feature file in {feature_dir}."
         )
 
     return patient_to_data
EOF