
Conversation

@keiran-rowell-unsw

@keiran-rowell-unsw keiran-rowell-unsw commented Jan 16, 2026

A precis of the pipeline structure and metrics, to guide the addition of any new proteinfold `--mode`s.

Designed to quickly orient new contributors to the main places to edit when adding a newly released protein structure prediction program, rather than to provide fine-grained implementation details.

@keiran-rowell-unsw keiran-rowell-unsw added the documentation Improvements or additions to documentation label Jan 16, 2026
@github-actions

github-actions bot commented Jan 16, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 371a4be

+| ✅ 327 tests passed       |+
#| ❔   4 tests were ignored |#
#| ❔   1 tests had warnings |#
!| ❗  33 tests had warnings |!
Details

❗ Test warnings:

  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: conf/igenomes_ignored.config
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here
  • schema_description - No description provided in schema for parameter: rosettafold2na_uniref30_link
  • schema_description - No description provided in schema for parameter: rosettafold2na_bfd_link
  • schema_description - No description provided in schema for parameter: rosettafold2na_pdb100_link
  • schema_description - No description provided in schema for parameter: rosettafold2na_weights_link
  • schema_description - No description provided in schema for parameter: rfam_full_region_link
  • schema_description - No description provided in schema for parameter: rfam_cm_link
  • schema_description - No description provided in schema for parameter: rnacentral_rfam_annotations_link
  • schema_description - No description provided in schema for parameter: rnacentral_id_mapping_link
  • schema_description - No description provided in schema for parameter: rnacentral_sequences_link
  • schema_description - No description provided in schema for parameter: rosettafold2na_uniref30_path
  • schema_description - No description provided in schema for parameter: rosettafold2na_bfd_path
  • schema_description - No description provided in schema for parameter: rosettafold2na_pdb100_path
  • schema_description - No description provided in schema for parameter: rosettafold2na_weights_path
  • local_component_structure - post_processing.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_rosettafold_all_atom_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_rosettafold2na_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_alphafold3_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_colabfold_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - aria2_uncompress.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_esmfold_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_helixfold3_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_boltz_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
  • local_component_structure - prepare_alphafold2_dbs.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure

❔ Tests ignored:

❔ Tests fixed:

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.1
  • Run at 2026-01-16 04:56:21

@keiran-rowell-unsw keiran-rowell-unsw marked this pull request as ready for review January 16, 2026 04:39
@keiran-rowell-unsw
Author

@JoseEspinosa happy to edit however or knock back, just would have found this helpful getting up to speed on nf-core/proteinfold structure

@JoseEspinosa
Member

@JoseEspinosa happy to edit however or knock back, just would have found this helpful getting up to speed on nf-core/proteinfold structure

Will take a look tomorrow. Thanks!

@JoseEspinosa JoseEspinosa self-requested a review January 18, 2026 18:18
Member

@JoseEspinosa JoseEspinosa left a comment

Just added some suggestions and corrected a few "tyops" 😛
Feel free to reject the ones you think are not ok.
This is really awesome @keiran-rowell-unsw ! 🚀

@@ -0,0 +1,128 @@
## Guidance on how to add a new --mode (i.e. structure prediction software) to ProteinFold
Member

Suggested change
## Guidance on how to add a new --mode (i.e. structure prediction software) to ProteinFold
## Adding structure prediction modes to nf-core/proteinfold
This section provides guidance on adding new structure prediction modes, implemented via the `--mode` option, to nf-core/proteinfold.


### Contributing

One of the great advantages of an `nf-core` pipeline is the community can add new protein structure prediction modules as they are released, while still leveraging the workflow infrastructure and reports developed for `proteinfold`.
Member

Suggested change
One of the great advantages of an `nf-core` pipeline is the community can add new protein structure prediction modules as they are released, while still leveraging the workflow infrastructure and reports developed for `proteinfold`.
One of the great advantages of an `nf-core` pipeline is that the community can extend workflows to add new functionalities. In nf-core/proteinfold, this allows adding new protein structure prediction modules as they are released, while still leveraging the existing workflow infrastructure and reporting.


Please consider writing some code to become an [nf-core contributor](https://nf-co.re/contributors) and expand the pipeline! Reach out to a maintainer or contributor for guidance :)
Member

Maybe also mention the #proteinfold_dev Slack channel? I think it would be easy for people to just write a message on Slack, and we are all there.

- `"""script block"""`:
- `program`: the script block calls the program from the Nextflow shell with the program's typical `--flags`, in whatever form (`binary` or `script.py`) the program is distributed from its codebase repository.
- `extract_metrics.py`: accesses the canonical data output formats from the structure prediction program and returns a core set of plain text `.tsv` metric files.
- `bin/extract_metrics.py`: a globally accessible program to go from serialised data -> `.tsv` plaintext. Currently runs particular extraction logic functions based upon file format (`.pkl`, `.json`, `.npz`). However, as the commnity adds more `--mode`s to the pipeline, different programs could use the same compressed output format. In which case `extract_metrics.py` should be refactored to match based on the passing the `--mode` to `extract_metrics.py`.
Member

Suggested change
- `bin/extract_metrics.py`: a globally accessible program to go from serialised data -> `.tsv` plaintext. Currently runs particular extraction logic functions based upon file format (`.pkl`, `.json`, `.npz`). However, as the commnity adds more `--mode`s to the pipeline, different programs could use the same compressed output format. In which case `extract_metrics.py` should be refactored to match based on the passing the `--mode` to `extract_metrics.py`.
- `bin/extract_metrics.py`: a globally accessible program to go from serialised data into `.tsv` plaintext. It currently applies format-specific extraction logic for `.pkl`, `.json` and `.npz` files. However, as the community adds more `--mode`s to the pipeline, different programs could use the same compressed output format. In that case, `extract_metrics.py` should be refactored to select the extraction logic based on the `--mode` passed to it.
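
The refactor described in that last bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: the mode names (`mode_a`, `mode_b`, `mode_c`), the `EXTRACTORS` registry, and the `extract()` helper are hypothetical and are not taken from the current `bin/extract_metrics.py`. The point is that two modes sharing a `.json` suffix can still get different extraction logic once the lookup is keyed on the mode.

```python
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Dict

# Hypothetical registry: maps a --mode name to the function that reads that
# program's raw output. None of these names come from the pipeline itself.
EXTRACTORS: Dict[str, Callable[[Path], Any]] = {}


def register(mode: str) -> Callable[[Callable[[Path], Any]], Callable[[Path], Any]]:
    """Decorator that files an extractor function under a mode name."""
    def wrap(func: Callable[[Path], Any]) -> Callable[[Path], Any]:
        EXTRACTORS[mode] = func
        return func
    return wrap


@register("mode_a")
def read_mode_a(path: Path) -> Any:
    # Assumption: mode_a stores its metrics at the top level of a JSON file.
    with open(path) as handle:
        return json.load(handle)


@register("mode_b")
def read_mode_b(path: Path) -> Any:
    # Assumption: mode_b also writes JSON, but nests metrics under "confidence".
    with open(path) as handle:
        return json.load(handle)["confidence"]


@register("mode_c")
def read_mode_c(path: Path) -> Any:
    # Assumption: mode_c writes a pickle file.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def extract(mode: str, path: Path) -> Any:
    """Choose the extractor from the mode rather than from the file suffix."""
    try:
        extractor = EXTRACTORS[mode]
    except KeyError:
        raise ValueError(f"no extractor registered for mode {mode!r}") from None
    return extractor(path)
```

With a layout like this, adding a new mode would mean adding one decorated function, leaving the existing extractors untouched.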


### Process labelling

At the top of a module's `RUN_[MODE_NAME]`{} process there are a series of labels that allow the `nextflow.config` to pass the job to the approriate resources on the compute cluster. `label 'process_gpu'` is very useful to specify this is the AI inference stage requiring GP-GPU grunt -- whereas other processes can have default labels that request CPU resources and, once finished, will naturally cascade onto GPUs due to Nextflow's dataflow paradigm.
Member

Suggested change
At the top of a module's `RUN_[MODE_NAME]`{} process there are a series of labels that allow the `nextflow.config` to pass the job to the approriate resources on the compute cluster. `label 'process_gpu'` is very useful to specify this is the AI inference stage requiring GP-GPU grunt -- whereas other processes can have default labels that request CPU resources and, once finished, will naturally cascade onto GPUs due to Nextflow's dataflow paradigm.
At the top of a module's `RUN_[MODE_NAME]`{} process, there is a series of labels that allow the `nextflow.config` to pass the job to the appropriate resources on the compute cluster. `label 'process_gpu'` is very useful for marking the AI inference stages that require GPU-intensive computation. Other processes can use default labels that request CPU resources and, once finished, will naturally cascade onto GPU-enabled steps due to Nextflow's dataflow paradigm.


### Processable structure prediction metrics

Metrics from AlphaFold-inspired protein strucutre prediction programs are structured in two ways: tabular or as a matrix (PAE values)
Member

Suggested change
Metrics from AlphaFold-inspired protein strucutre prediction programs are structured in two ways: tabular or as a matrix (PAE values)
Metrics from AlphaFold-inspired protein structure prediction programs are structured in two ways: tabular or as a matrix (PAE values).


When contributing a new mode to `proteinfold`, functionality should be added to `extract_metrics.py` to access the canonical output files of the new program, and extract data into compliant `.tsv` files that can be easily processed by downstream plotting and MultiQC functions.

Metrics files are **0 indexed**.
Member

Suggested change
Metrics files are **0 indexed**.
> [!WARNING]
> Metrics files are **0 indexed**.


#### pLDDT (`{meta.id}_plddt.tsv`)

Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column. [For all-atom modules, atomic token confidences are processed to a naive mean value across the residue]
Member

Suggested change
Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column. [For all-atom modules, atomic token confidences are processed to a naive mean value across the residue]
Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column (for all-atom modules, atomic token confidences are processed to a naive mean value across the residue).
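
As a rough illustration of that pLDDT description, the sketch below collapses per-atom confidences to a naive per-residue mean and writes one row per 0-indexed residue with one column per ranked result, formatted to 2 decimal places. The function names and the header labels (`position`, `rank_0`, ...) are assumptions made for this example; the actual column names used by `extract_metrics.py` may differ.

```python
import csv
import io
from statistics import fmean
from typing import Dict, List, Sequence


def residue_means(atom_plddt: Sequence[float],
                  atom_residue_index: Sequence[int]) -> List[float]:
    """Collapse per-atom confidences to one naive mean per residue."""
    grouped: Dict[int, List[float]] = {}
    for value, residue in zip(atom_plddt, atom_residue_index):
        grouped.setdefault(residue, []).append(value)
    # Residues are emitted in ascending index order.
    return [fmean(grouped[residue]) for residue in sorted(grouped)]


def plddt_tsv(ranked: Sequence[Sequence[float]]) -> str:
    """Render per-residue pLDDT as TSV text, one column per ranked result."""
    if len({len(model) for model in ranked}) > 1:
        raise ValueError("all ranked results must cover the same residues")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header labels are hypothetical, chosen only for this sketch.
    writer.writerow(["position"] + [f"rank_{i}" for i in range(len(ranked))])
    # enumerate() starts at 0, matching the 0-indexed metrics files.
    for position, values in enumerate(zip(*ranked)):
        writer.writerow([position] + [f"{value:.2f}" for value in values])
    return out.getvalue()
```

For an all-atom mode, `residue_means()` would run once per ranked result before the values are handed to `plddt_tsv()`.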


#### (i)pTM (`{meta.id}_[i]ptm.tsv`)

(i)pTM scores, rounded to 3 decimal places, listed by the rank number. [Currently unsorted]
Member

Suggested change
(i)pTM scores, rounded to 3 decimal places, listed by the rank number. [Currently unsorted]
(i)pTM scores, rounded to 3 decimal places, listed by the rank number (currently unsorted).
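
A minimal sketch of the (i)pTM layout described above, assuming a headerless two-column file of rank number and score. The `ptm_tsv` name and the absence of a header row are assumptions for illustration only. Rows are written in the order the scores arrive, which mirrors the "currently unsorted" behaviour.

```python
from typing import Dict


def ptm_tsv(scores_by_rank: Dict[int, float]) -> str:
    """Render (i)pTM scores as TSV text, one 'rank<TAB>score' row per result."""
    # Dicts preserve insertion order, so rows are deliberately left unsorted.
    return "".join(
        f"{rank}\t{score:.3f}\n" for rank, score in scores_by_rank.items()
    )
```

Sorting by rank or by score would be a one-line change (for example wrapping `scores_by_rank.items()` in `sorted(...)`), should the pipeline later settle on a fixed order.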

Labels

documentation Improvements or additions to documentation

2 participants