WIP Dev stamp integration #100

base: main

Conversation
Commits:
- …P integration (Signed-off-by: GitHub CI <ci@github.com>)
- …d configurations (Signed-off-by: GitHub CI <ci@github.com>)
- …AMP integration (Signed-off-by: GitHub CI <ci@github.com>)
WIP checklist:
Pull Request Overview
This pull request introduces comprehensive updates to enhance the MediSwarm federated learning platform by adding support for the STAMP protocol, improving Docker configuration capabilities, and updating existing job configurations. The changes enable broader compatibility with different pathology analysis workflows while maintaining consistency across the platform.
Key changes include:
- Addition of the STAMP protocol for pathology analysis with complete job configuration, custom modeling components, and federated learning integration
- Enhanced Docker build system with support for multiple Dockerfile configurations (ODELIA and STAMP)
- Updated controller paths in existing federated learning job configurations for consistency with the NVFlare package structure
Reviewed Changes
Copilot reviewed 28 out of 30 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| docker_config/pyproject.toml | Python project configuration for STAMP with dependencies and optional feature extractors |
| docker_config/Dockerfile_stamp | Multi-stage Docker configuration for STAMP environment with CUDA support and NVFlare integration |
| buildDockerImageAndStartupKits.sh | Enhanced build script supporting multiple Docker configurations and conditional model downloads |
| application/jobs/stamp/ | Complete STAMP job implementation including custom models, data handling, and federated learning integration |
| application/provision/project_Odelia_allsites.yml | Updated project configuration with version number placeholders replaced |
```python
import modeling.lightning_model
from modeling.registry import MODEL_REGISTRY

print("Training script is using:", modeling.lightning_model.__file__)
```
Copilot AI commented on Jul 30, 2025:
[nitpick] Consider using the logger instead of print() for consistency with the rest of the logging setup in this file.
```diff
-print("Training script is using:", modeling.lightning_model.__file__)
+logger.info("Training script is using: %s", modeling.lightning_model.__file__)
```
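As a quick illustration of the suggested pattern (a hedged sketch; the logger name and setup here are assumptions, not taken from the PR):

```python
import logging

# Illustrative module-level logger; the actual file's logging setup may differ.
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

# Lazy %-style formatting defers string interpolation until the record is
# actually emitted, unlike an eager f-string or print().
logger.info("Training script is using: %s", "modeling/lightning_model.py")
```

Besides consistency, logger calls carry level, timestamp, and origin metadata that plain print() output lacks.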
```python
flare.patch(trainer)  # Patch trainer to enable swarm learning
logger.info(f"Site name: {SITE_NAME}")

logger.info(" About to enter flare.is_running loop")
```
Copilot AI commented on Jul 30, 2025:
Remove the leading space in the log message for consistency.
```diff
-logger.info(" About to enter flare.is_running loop")
+logger.info("About to enter flare.is_running loop")
```
```python
'''
if input_model is not None:
    logger.info("==== Swarm model received ====")
    logger.info(
        f"input_model.params.keys() = {list(input_model.params.keys())[:10]} ... total = {len(input_model.params)}")
    # log input_model.params
    logger.info(f"input_model.params.keys() = {input_model.params.keys()}")
    logger.info(
        f"model.state_dict().keys() = {list(model.state_dict().keys())[:10]} ... total = {len(model.state_dict())}")
    try:
        model.load_state_dict(input_model.params)
    except Exception as e:
        logger.error("load_state_dict failed:", exc_info=True)
        raise

logger.info(f"[DEBUG] Got input model: {input_model}")
if input_model is None:
    logger.info("[DEBUG] no swarm_start received in 10s")
else:
    model.load_state_dict(input_model.params)

logger.info("[DEBUG] received swarm_start:", input_model)
logger.info(f"Current round: {input_model.current_round}")
'''
```
Copilot AI commented on Jul 30, 2025:
Large block of commented code should be removed to improve code maintainability. Consider using version control history if this code needs to be referenced later.
Suggested change: delete the entire commented-out block quoted above.
```python
model_info = MODEL_REGISTRY['vit']

ModelClass = model_info["model_class"]
'''
```
Copilot AI commented on Jul 30, 2025:
Large block of commented code with hardcoded values should be removed. This appears to be debugging or test code that should not be in production.
```python
''''
output_dir = os.getenv("TRAINING_OUTPUT_DIR")
print('output_dir:', output_dir)
#_add_file_handle_(_logger, output_dir=Path(output_dir))
#_logger.info("Using training configuration from environment variables.")

train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path(os.getenv("TRAINING_CLINI_TABLE")),
    slide_table=Path(os.getenv("TRAINING_SLIDE_TABLE")),
    feature_dir=Path(os.getenv("TRAINING_FEATURE_DIR")),
    patient_label=os.getenv("TRAINING_PATIENT_LABEL"),
    ground_truth_label=os.getenv("TRAINING_GROUND_TRUTH_LABEL"),
    filename_label=os.getenv("TRAINING_FILENAME_LABEL"),
    categories=os.getenv("TRAINING_CATEGORIES").split(","),
    # Dataset and loader parameters
    bag_size=int(os.getenv("TRAINING_BAG_SIZE")),
    num_workers=int(os.getenv("TRAINING_NUM_WORKERS")),
    # Training parameters
    batch_size=int(os.getenv("TRAINING_BATCH_SIZE")),
    max_epochs=int(os.getenv("TRAINING_MAX_EPOCHS")),
    patience=int(os.getenv("TRAINING_PATIENCE")),
    accelerator=os.getenv("TRAINING_ACCELERATOR"),
    # Experimental features
    use_vary_precision_transform=os.getenv("USE_VARY_PRECISION_TRANSFORM", "False").lower() == "true",
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)
'''
output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"
#logger.info('output_dir:', output_dir)
# _add_file_handle_(_logger, output_dir=Path(output_dir))
#_logger.info("Using training configuration from environment variables.")

train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/TCGA-CRC-DX_clini.xlsx"),
    slide_table=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/TCGA-CRC-DX_slide_h5.csv"),
    feature_dir=Path("/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/uni2-02627079"),
    patient_label="PATIENT",
    ground_truth_label="isMSIH",
    filename_label="FILENAME",
    categories=None,
    # Dataset and loader parameters
    bag_size=512,
    num_workers=15,
    # Training parameters
    batch_size=64,
    max_epochs=64,  # 64
    patience=16,
    accelerator="gpu",
    # Experimental features
    use_vary_precision_transform='true',
    use_alibi='false',
)
```
Copilot AI commented on Jul 30, 2025:
The main() function contains large blocks of commented code and hardcoded paths. Consider cleaning up the commented sections and using configuration files or environment variables for paths.
Suggested change: drop the commented-out block and the hardcoded call quoted above, and drive the configuration from environment variables instead:

```python
output_dir = os.getenv("TRAINING_OUTPUT_DIR")
if not output_dir:
    raise ValueError("Environment variable TRAINING_OUTPUT_DIR is not set.")
train_categorical_model_(
    output_dir=Path(output_dir),
    clini_table=Path(os.getenv("TRAINING_CLINI_TABLE", "default_clini_table.xlsx")),
    slide_table=Path(os.getenv("TRAINING_SLIDE_TABLE", "default_slide_table.csv")),
    feature_dir=Path(os.getenv("TRAINING_FEATURE_DIR", "default_feature_dir")),
    patient_label=os.getenv("TRAINING_PATIENT_LABEL", "PATIENT"),
    ground_truth_label=os.getenv("TRAINING_GROUND_TRUTH_LABEL", "isMSIH"),
    filename_label=os.getenv("TRAINING_FILENAME_LABEL", "FILENAME"),
    categories=os.getenv("TRAINING_CATEGORIES", "").split(",") if os.getenv("TRAINING_CATEGORIES") else None,
    # Dataset and loader parameters
    bag_size=int(os.getenv("TRAINING_BAG_SIZE", 512)),
    num_workers=int(os.getenv("TRAINING_NUM_WORKERS", 4)),
    # Training parameters
    batch_size=int(os.getenv("TRAINING_BATCH_SIZE", 64)),
    max_epochs=int(os.getenv("TRAINING_MAX_EPOCHS", 64)),
    patience=int(os.getenv("TRAINING_PATIENCE", 16)),
    accelerator=os.getenv("TRAINING_ACCELERATOR", "gpu"),
    # Experimental features
    use_vary_precision_transform=os.getenv("USE_VARY_PRECISION_TRANSFORM", "False").lower() == "true",
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)
```
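The guard around the `TRAINING_CATEGORIES` split matters because splitting an empty string does not give an empty list. A small sketch of the pitfall (variable names are illustrative):

```python
import os

# "".split(",") yields [""], not [], so an unset or empty variable would
# otherwise produce a spurious one-element category list.
assert "".split(",") == [""]

os.environ.pop("TRAINING_CATEGORIES", None)
raw = os.getenv("TRAINING_CATEGORIES")
categories = raw.split(",") if raw else None
assert categories is None  # unset variable -> no categories

os.environ["TRAINING_CATEGORIES"] = "MSIH,nonMSIH"
raw = os.getenv("TRAINING_CATEGORIES")
categories = raw.split(",") if raw else None
assert categories == ["MSIH", "nonMSIH"]
```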
```python
    use_alibi=os.getenv("USE_ALIBI", "False").lower() == "true",
)
'''
output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"
```
Copilot AI commented on Jul 30, 2025:
Hardcoded file paths should be moved to configuration files or environment variables to improve portability and maintainability.
```diff
-output_dir = "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new"
+output_dir = os.getenv("TRAINING_OUTPUT_DIR", "/mnt/swarm_alpha/ECDP2025/pathology_data/TCGA/TCGA-CRC/STAMP_crossval_new")
```
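A minimal sketch of the env-var-with-fallback pattern the comment suggests (the default path here is a placeholder, not the real mount point):

```python
import os
from pathlib import Path

# Hypothetical fallback; in the PR the fallback would be the currently
# hardcoded path, kept only as a default for local runs.
DEFAULT_OUTPUT_DIR = "/data/stamp_output"
output_dir = Path(os.getenv("TRAINING_OUTPUT_DIR", DEFAULT_OUTPUT_DIR))
print(output_dir)
```

Deployments then override the location via `TRAINING_OUTPUT_DIR` without any code change.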
```dockerfile
RUN /usr/bin/python3.11 -m pip install /workspace/controller && rm -rf /workspace/controller

# ----------- Copy MediSwarm Source and Link -----------
#COPY ./MediSwarm /MediSwarm originally
```
Copilot AI commented on Jul 30, 2025:
Remove commented line that appears to be leftover from development.
```diff
-#COPY ./MediSwarm /MediSwarm originally
```
docker_config/Dockerfile_stamp (outdated)
```dockerfile
WORKDIR /workspace/

#docker build -f docker_config/Dockerfile_stamp -t stamp-image .
#docker run --rm -it stamp-image:latest-image bash
```
Copilot AI commented on Jul 30, 2025:
Remove commented Docker commands that appear to be leftover from development. These should be in documentation if needed for reference.
```diff
 WORKDIR /workspace/
-#docker build -f docker_config/Dockerfile_stamp -t stamp-image .
-#docker run --rm -it stamp-image:latest-image bash
```
```python
self.running_mean = nn.Buffer(torch.ones(1, dtype=dtype))
self.items_so_far = nn.Buffer(torch.ones(1, dtype=dtype))
```
Copilot AI commented on Jul 30, 2025:
nn.Buffer should be register_buffer. The current syntax is incorrect and will likely cause runtime errors.
```diff
-self.running_mean = nn.Buffer(torch.ones(1, dtype=dtype))
-self.items_so_far = nn.Buffer(torch.ones(1, dtype=dtype))
+self.register_buffer("running_mean", torch.ones(1, dtype=dtype))
+self.register_buffer("items_so_far", torch.ones(1, dtype=dtype))
```
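To see why this matters for swarm learning, a small sketch (the class name and usage are illustrative, assuming a recent PyTorch): `register_buffer` attaches a non-trainable tensor that is included in `state_dict()`, so it travels with the exchanged model parameters.

```python
import torch
import torch.nn as nn

class RunningMean(nn.Module):  # illustrative class, not from the PR
    def __init__(self, dtype: torch.dtype = torch.float32):
        super().__init__()
        # Buffers are saved in state_dict() and moved by .to(device),
        # but are not returned by parameters() and receive no gradients.
        self.register_buffer("running_mean", torch.ones(1, dtype=dtype))
        self.register_buffer("items_so_far", torch.ones(1, dtype=dtype))

m = RunningMean()
print(sorted(m.state_dict().keys()))  # ['items_so_far', 'running_mean']
print(len(list(m.parameters())))      # 0
```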
Copilot AI flagged the same `nn.Buffer` issue at a second occurrence in the file, with the identical `register_buffer` fix.
Merge conflict resolved in buildDockerImageAndStartupKits.sh (Signed-off-by: GitHub CI <ci@github.com>)
```python
    slide for slide in slide_to_patient.keys() if not slide.exists()
}:
    _logger.warning(
        f"some feature files could not be found: {slides_without_features}"
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High) — sensitive data (private)

Copilot Autofix (AI, 5 months ago):
To fix the problem, we should avoid logging sensitive patient identifiers in clear text. The log message on line 593 currently outputs the full set of missing feature files, which may include patient IDs. The best way to address this is to either (a) remove patient identifiers from the log message, (b) log only non-sensitive information (e.g., the count of missing files), or (c) mask the identifiers if logging is necessary for operational reasons.
The recommended fix is to log only the number of missing feature files, not their full paths or associated patient IDs. This preserves the utility of the log message for monitoring data completeness without exposing sensitive information. The change should be made in the _log_patient_slide_feature_inconsistencies function in application/jobs/stamp/app/custom/data.py, specifically replacing the log message on line 593.
No new imports or dependencies are required.
```diff
@@ -590,7 +590,7 @@
     slide for slide in slide_to_patient.keys() if not slide.exists()
 }:
     _logger.warning(
-        f"some feature files could not be found: {slides_without_features}"
+        f"some feature files could not be found: {len(slides_without_features)} missing files"
     )
```
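The autofix pattern can be sketched in isolation (variable names and paths here are illustrative stand-ins, not data from the PR): log an aggregate count so that no file names, and therefore no patient identifiers, reach the log.

```python
import logging

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.WARNING)

# Hypothetical stand-ins for the real slide feature paths.
slides_without_features = {"slide_a.h5", "slide_b.h5"}

if slides_without_features:
    # Only the count is logged, never the identifiers themselves.
    logger.warning(
        "some feature files could not be found: %d missing files",
        len(slides_without_features),
    )
```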
Signed-off-by: GitHub CI <ci@github.com>
```python
if missing_features:
    _logger.warning(
        f"Some patients have no feature file in {feature_dir}: {missing_features}"
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High) — sensitive data (private)

Copilot Autofix (AI, 5 months ago):
To fix the problem, we should avoid logging the actual patient IDs in clear text. Instead, the log message can indicate the number of patients missing feature files, or log only non-sensitive, aggregate information. If more detail is needed for debugging, consider logging a truncated or hashed version of the patient IDs, or log them only in a secure, access-controlled environment (not shown here). The best fix is to change the log message on line 216 so it does not include missing_features directly, for example by logging only the count of missing patients or a generic message.
Required changes:
- In application/jobs/stamp/app/custom/data.py, replace the log message on line 216 so it no longer contains the actual patient IDs; log the count of missing patients instead.
- No new imports or methods are needed.
```diff
@@ -213,7 +213,7 @@

 if missing_features:
     _logger.warning(
-        f"Some patients have no feature file in {feature_dir}: {missing_features}"
+        f"{len(missing_features)} patients have no feature file in {feature_dir}."
     )

 return patient_to_data
```
This pull request introduces several updates across multiple files to enhance configurability, improve logging, and add new functionality for federated learning workflows. The most significant changes include updating controller paths for consistency, improving logging in the training workflow, and adding a new job configuration for the STAMP protocol.
Updates to Controller Paths:
- Updated the `config_fed_client.conf` and `config_fed_server.conf` files to use `nvflare.app_common.ccwf` for both the `ODELIA_ternary_classification` and `minimal_training_pytorch_cnn` jobs. This ensures consistency with the updated NVFlare package structure. [1] [2] [3] [4]

Enhancements to Logging:
- Added logging statements in the `main.py` file of the `minimal_training_pytorch_cnn` job to provide detailed information about the received model's parameters and state during training. This improves debugging and monitoring of the training process. [1] [2]

New Job Configuration for STAMP Protocol:
- Added new job configuration files `config_fed_client.conf` and `config_fed_server.conf`, which define executors, components, and workflows for training and aggregation. This includes support for PyTorch-based model training and evaluation. [1] [2]
- Added a `README.md` for the STAMP job, detailing installation, usage, and citation information for the protocol.