Skip to content

Conversation

@dltamayo
Copy link
Collaborator

Moving towards de-hardcoding of metadata fields

dltamayo and others added 5 commits October 2, 2025 11:19
- using metadata in JSON format instead of hard-coded list access
  - still hard-coding metadata access though
- switched to pandas for outputting instead of csv
- removed old code

Eventual goal is to have any combination of clinical metadata on a patient/sample level be used for clustering/grouping
Header will now be included in aggregated .csvs, instead of hard coded within notebook.
@dltamayo dltamayo requested a review from dimalvovs October 14, 2025 19:06
@dimalvovs dimalvovs requested a review from Copilot October 14, 2025 19:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Refactors sample calculation workflow to produce per-sample CSVs with richer metadata, then aggregates them for plotting, moving away from hard‑coded column definitions. Introduces a new SAMPLE_AGGREGATE process and updates downstream notebook to align with renamed metadata fields; also adds dynamic CPU allocation to TCRDIST3.

  • Introduce per-sample output file naming (stats/, vdj/) and new aggregation step replacing prior collectFile usage.
  • Update analysis notebook to rely on CSV headers instead of hard-coded column name lists and rename metadata fields (patient_id -> subject_id, sample_id -> sample).
  • Add dynamic CPU allocation logic to TCRDIST3 process.

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
subworkflows/local/sample.nf Replaces collectFile outputs with collection + aggregation processes; updates inputs to plotting step.
notebooks/sample_stats_template.qmd Changes CSV reading strategy and metadata field names; adjusts plotting hover fields and grouping.
modules/local/sample/tcrdist3.nf Adds dynamic cpus directive based on task.memory.
modules/local/sample/sample_calc.nf Changes output file paths/name patterns and serializes sample metadata as JSON.
modules/local/sample/sample_aggregate.nf New process to concatenate multiple per-sample CSVs into aggregated files.

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +6 to +14
if (task.memory > 256.GB)
return 16 * task.attempt
else if (task.memory > 64.GB)
return 8 * task.attempt
else if (task.memory > 4.GB)
return 4 * task.attempt
else
return 2 * task.attempt
}
Copy link

Copilot AI Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpus directive depends on task.memory, which itself is derived later by the memory block; this circular dependency can result in task.memory being undefined or default when cpus is evaluated. Derive both cpu and memory from the same underlying metric (e.g., count_table.size()) or compute memory first in a variable and base cpus on that variable instead of task.memory.

Suggested change
if (task.memory > 256.GB)
return 16 * task.attempt
else if (task.memory > 64.GB)
return 8 * task.attempt
else if (task.memory > 4.GB)
return 4 * task.attempt
else
return 2 * task.attempt
}
def sz = count_table.size()
def mb = 1024 * 1024
if (sz > 26 * mb)
return 16 * task.attempt
else if (sz > 20 * mb)
return 8 * task.attempt
else if (sz > 10 * mb)
return 4 * task.attempt
else
return 2 * task.attempt
}

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@dimalvovs dimalvovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • setting to 'Request changes' avoid being unnoticed, but feel free to create additional issues to address if we need it merged now;
  • nf-test test minimal example runs without errors, nice!
  • nextflow lint main.nf returns "❌ 4 files had 21 errors", most errors are really simple to fix, the benefit besides having better code formatting and standardization are beautiful flow charts such as below. Can be done now or if preferred, with a dedicated refactoring feature.
    dag-SAMPLE
  • in the same refactoring category, I could not find the versions.yml output in the minimal example test results;
  • skimmed through Copilot comments and left mine where applicable
  • regarding the resource allocation, it does seem to work, run this gist https://gist.github.com/dimalvovs/3b1853effe9105c96ca40ac3e8886332
    that produces these outputs:
 N E X T F L O W   ~  version 25.04.8

Launching `main.nf` [loquacious_kare] DSL2 - revision: 39f7d81a10

executor >  local (12)
[d4/613c6c] MAKE_TEST_FILES (3) [100%] 6 of 6 ✔
[47/10b6db] TCRDIST3_MATRIX (6) [100%] 6 of 6 ✔
cpus:8 memory:1.3 GB size:20480
cpus:2 memory:163.8 MB size:4096
cpus:4 memory:655.4 MB size:10240
cpus:8 memory:5.1 GB size:30720
cpus:2 memory:409.6 MB size:2048
cpus:4 memory:655.4 MB size:6144

and also is susceptible to restricting resources from the nextflow.config:

 N E X T F L O W   ~  version 25.04.8
Launching `main.nf` [friendly_angela] DSL2 - revision: 39f7d81a10

[17/8f8b22] MAKE_TEST_FILES (6) [100%] 6 of 6 ✔
[77/703366] TCRDIST3_MATRIX (6) [100%] 6 of 6 ✔
cpus:1 memory:409.6 MB size:2048
cpus:1 memory:1 GB size:20480
cpus:1 memory:1 GB size:30720
cpus:1 memory:163.8 MB size:4096
cpus:1 memory:655.4 MB size:10240
cpus:1 memory:655.4 MB size:6144

@dimalvovs dimalvovs self-requested a review November 7, 2025 22:17
@dltamayo dltamayo merged commit 5692e2b into main Dec 4, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants