Doc - How to contribute new modes #443
base: dev
@JoseEspinosa happy to edit however or knock back, just would have found this helpful getting up to speed on |
Will take a look tomorrow. Thanks! |
Just added some suggestions and corrected few "tyops" 😛
Feel free to reject the ones you think are not ok.
This is really awesome @keiran-rowell-unsw ! 🚀
| @@ -0,0 +1,128 @@ | |||
| ## Guidance on how to add a new --mode (i.e. structure prediction software) to ProteinFold | |||
| ## Guidance on how to add a new --mode (i.e. structure prediction software) to ProteinFold | |
| ## Adding structure prediction modes to nf-core/proteinfold | |
| This section provides guidance on adding new structure prediction modes, implemented via the `--mode` option, to nf-core/proteinfold. |
| ### Contributing | ||
| One of the great advantages of an `nf-core` pipeline is the community can add new protein structure prediction modules as they are released, while still leveraging the workflow infrastructure and reports developed for `proteinfold`. |
| One of the great advantages of an `nf-core` pipeline is the community can add new protein structure prediction modules as they are released, while still leveraging the workflow infrastructure and reports developed for `proteinfold`. | |
| One of the great advantages of an `nf-core` pipeline is that the community can extend workflows to add new functionalities. In nf-core/proteinfold, this allows adding new protein structure prediction modules as they are released, while still leveraging the existing workflow infrastructure and reporting. |
| Please consider writing some code to become an [nf-core contributor](https://nf-co.re/contributors) and expand the pipeline! Reach out to a maintainer or contributor for guidance :) |
Maybe also mention the #proteinfold_dev Slack channel? I think it would be easy for people to just write a message on Slack, and we are all there.
| - `"""script block"""`: | ||
| - `program`: the script block calls the program from the Nextflow shell with the programs typical `--flags`, in whatever form (`binary` or `script.py`) the program is distributed from its codebase repository. | ||
| - `extract_metrics.py`: accesses the canonical data output formats from the structure prediction program and returns a core set of plain text `.tsv` metric files. | ||
| - `bin/extract_metrics.py`: a globally accessible program to go from serialised data -> `.tsv` plaintext. Currently runs particular extraction logic functions based upon file format (`.pkl`, `.json`, `.npz`). However, as the commnity adds more `--mode`s to the pipeline, different programs could use the same compressed output format. In which case `extract_metrics.py` should be refactored to match based on the passing the `--mode` to `extract_metrics.py`. |
| - `bin/extract_metrics.py`: a globally accessible program to go from serialised data -> `.tsv` plaintext. Currently runs particular extraction logic functions based upon file format (`.pkl`, `.json`, `.npz`). However, as the commnity adds more `--mode`s to the pipeline, different programs could use the same compressed output format. In which case `extract_metrics.py` should be refactored to match based on the passing the `--mode` to `extract_metrics.py`. | |
| - `bin/extract_metrics.py`: a globally accessible program to go from serialised data into `.tsv` plaintext. It currently applies format-specific extraction logic for `.pkl`, `.json` and `.npz` files. However, as the community adds more `--mode`s to the pipeline, different programs could use the same compressed output format, in which case `extract_metrics.py` should be refactored to select the extraction logic based on a `--mode` value passed to it. |
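
To illustrate the dispatch described in this bullet, here is a minimal Python sketch. It is not the actual contents of `bin/extract_metrics.py`: the names `READERS`, `pick_reader` and `by_mode` are hypothetical. It assumes the reader is chosen by file suffix, with an optional per-mode override for the case where two programs share an output format.

```python
import json
import pickle
from pathlib import Path


def read_json(path):
    with open(path) as fh:
        return json.load(fh)


def read_pkl(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


def read_npz(path):
    import numpy as np  # imported lazily so the other readers work without NumPy

    with np.load(path) as data:
        return {key: data[key] for key in data.files}


# Hypothetical registry: file suffix -> reader function.
READERS = {".json": read_json, ".pkl": read_pkl, ".npz": read_npz}


def pick_reader(path, mode=None, by_mode=None):
    """Return a reader for `path`, preferring a mode-specific one if registered."""
    if mode is not None and by_mode and mode in by_mode:
        return by_mode[mode]
    suffix = Path(path).suffix
    if suffix not in READERS:
        raise ValueError(f"unsupported metrics format: {suffix!r}")
    return READERS[suffix]
```

With this shape, supporting a new program whose output suffix is already taken only requires adding an entry to the per-mode mapping, rather than changing the suffix-based fallback.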
| ### Process labelling | ||
| At the top of a module's `RUN_[MODE_NAME]`{} process there are a series of labels that allow the `nextflow.config` to pass the job to the approriate resources on the compute cluster. `label 'process_gpu'` is very useful to specify this is the AI inference stage requiring GP-GPU grunt -- whereas other processes can have default labels that request CPU resources and, once finished, will naturally cascade onto GPUs due to Nextflow's dataflow paradigm. |
| At the top of a module's `RUN_[MODE_NAME]`{} process there are a series of labels that allow the `nextflow.config` to pass the job to the approriate resources on the compute cluster. `label 'process_gpu'` is very useful to specify this is the AI inference stage requiring GP-GPU grunt -- whereas other processes can have default labels that request CPU resources and, once finished, will naturally cascade onto GPUs due to Nextflow's dataflow paradigm. | |
| At the top of a module's `RUN_[MODE_NAME]`{} process, there are a series of labels that allow the `nextflow.config` to pass the job to the appropriate resources on the compute cluster. `label 'process_gpu'` is very useful to specify the AI inference stages requiring GPU-intensive computation. Other processes can use default labels that request CPU resources and, once finished, will naturally cascade onto GPU-enabled steps due to Nextflow's dataflow paradigm. |
| ### Processable structure prediction metrics | ||
| Metrics from AlphaFold-inspired protein strucutre prediction programs are structured in two ways: tabular or as a matrix (PAE values) |
| Metrics from AlphaFold-inspired protein strucutre prediction programs are structured in two ways: tabular or as a matrix (PAE values) | |
| Metrics from AlphaFold-inspired protein structure prediction programs are structured in two ways: tabular or as a matrix (PAE values) |
| When contributing a new mode to `proteinfold`, functionality should be added to `extract_metrics.py` to access the canonical output files of the new program, and extract data into compliant `.tsv` files that can be easily processed by downstream plotting and MultiQC functions. | ||
| Metrics files are **0 indexed**. |
| Metrics files are **0 indexed**. | |
| > [!WARNING] | |
| > Metrics files are **0 indexed**. |
| #### pLDDT (`{meta.id}_plddt.tsv`) | ||
| Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column. [For all-atom modules, atomic token confidences are processed to a naive mean value across the residue] |
| Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column. [For all-atom modules, atomic token confidences are processed to a naive mean value across the residue] | |
| Confidence values per residue, rounded to 2 decimal places. Each ranked result gets its own column (for all-atom modules, atomic token confidences are processed to a naive mean value across the residue). |
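
As a rough illustration of the pLDDT layout described here, the following Python sketch writes one 0-indexed row per residue and one column per ranked result, rounded to 2 decimal places. The header names (`Positions`, `rank_0`, ...) and both function names are assumptions for illustration, not the pipeline's actual schema. `residue_means` shows the naive per-residue mean for all-atom outputs.

```python
import csv
import io
from collections import defaultdict


def residue_means(atom_plddt, atom_residue_index):
    """Naive mean of per-atom confidences within each residue."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, residue in zip(atom_plddt, atom_residue_index):
        sums[residue] += value
        counts[residue] += 1
    return [sums[r] / counts[r] for r in sorted(sums)]


def write_plddt_tsv(handle, per_rank):
    """Write per-residue pLDDT; `per_rank` holds one list of values per ranked result."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header names are hypothetical placeholders.
    writer.writerow(["Positions"] + [f"rank_{i}" for i in range(len(per_rank))])
    for position, values in enumerate(zip(*per_rank)):  # positions start at 0
        writer.writerow([position] + [f"{v:.2f}" for v in values])


buffer = io.StringIO()
write_plddt_tsv(buffer, [[90.123, 80.5], [70.0, 60.456]])
print(buffer.getvalue())
```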
| #### (i)pTM (`{meta.id}_[i]ptm.tsv`) | ||
| (i)pTM scores, rounded to 3 decimal places, listed by the rank number. [Currently unsorted] |
| (i)pTM scores, rounded to 3 decimal places, listed by the rank number. [Currently unsorted] | |
| (i)pTM scores, rounded to 3 decimal places, listed by the rank number (currently unsorted). |
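
A comparable sketch for the (i)pTM file, assuming one row per rank number (0-indexed, in the order the scores are supplied, with no sorting) and a score rounded to 3 decimal places. The function name `format_ptm_rows` and the two-column layout are hypothetical.

```python
def format_ptm_rows(scores):
    """Return tab-separated `rank<TAB>score` rows, keeping the input order."""
    return [f"{rank}\t{score:.3f}" for rank, score in enumerate(scores)]


for row in format_ptm_rows([0.81234, 0.9]):
    print(row)
```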
A pipeline structure and metrics precis, to guide the addition of any new proteinfold `--mode`s. Designed to quickly orient new contributors to the main places to edit when adding a newly released protein structure prediction program, not to provide fine-grained implementation details.