-
Notifications
You must be signed in to change notification settings - Fork 7
Use augur subsample #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,7 +9,7 @@ REQUIRED INPUTS: | |
|
|
||
| OUTPUTS: | ||
|
|
||
| metadata = results/{build}/filtered.tsv | ||
| metadata = results/{build}/subsampled.tsv | ||
| alignment = results/{build}/aligned.fasta | ||
|
|
||
| This part of the workflow usually includes the following steps: | ||
|
|
@@ -51,9 +51,9 @@ rule decompress: | |
| zstd -d -c {input.metadata} > {output.metadata} | ||
| """ | ||
|
|
||
| rule filter: | ||
| rule subsample: | ||
| """ | ||
| Filtering to | ||
| Subsampling to | ||
| - various criteria based on the auspice JSON target | ||
| - from {params.min_date} onwards | ||
| - excluding strains in {input.exclude} | ||
|
|
@@ -63,35 +63,31 @@ rule filter: | |
| input: | ||
| sequences = "data/sequences.fasta", | ||
| metadata = "data/metadata.tsv", | ||
| exclude = resolve_config_path(config["filter"]["exclude"]), | ||
| include = resolve_config_path(config["filter"]["include"]), | ||
| config = "results/run_config.yaml", | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [non-blocking commentary] Perhaps this is a moot point — I’m not sure how often people simply update the exclude list and expect to rerun from cache, versus just forcing a rerun each time anyway.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes it will no longer detect changes, this is a good point. I mentioned this in nextstrain/pathogen-repo-guide#98, and I think the functionality could potentially be brought back with a workflow-level helper function that dynamically adds |
||
| output: | ||
| sequences = "results/{build}/filtered.fasta", | ||
| metadata = "results/{build}/filtered.tsv", | ||
| sequences = "results/{build}/subsampled.fasta", | ||
| metadata = "results/{build}/subsampled.tsv", | ||
| log: | ||
| "logs/{build}/filtered.txt", | ||
| "logs/{build}/subsample.txt", | ||
| benchmark: | ||
| "benchmarks/{build}/filtered.txt", | ||
| "benchmarks/{build}/subsample.txt", | ||
| params: | ||
| min_length = config['filter']['min_length'], | ||
| group_by = config['filter']['group_by'], | ||
| filter_params = lambda wildcard: config['filter']['specific'][wildcard.build], | ||
| config_section = lambda w: ["custom_subsample" if config.get("custom_subsample") else "subsample", w.build], | ||
| strain_id = config.get("strain_id_field", "strain"), | ||
| threads: workflow.cores | ||
| shell: | ||
| r""" | ||
| exec &> >(tee {log:q}) | ||
|
|
||
| augur filter \ | ||
| augur subsample \ | ||
| --sequences {input.sequences:q} \ | ||
| --metadata {input.metadata:q} \ | ||
| --metadata-id-columns {params.strain_id:q} \ | ||
| --exclude {input.exclude:q} \ | ||
| --include {input.include:q} \ | ||
| --config {input.config:q} \ | ||
| --config-section {params.config_section:q} \ | ||
| --nthreads {threads:q} \ | ||
| --output-sequences {output.sequences:q} \ | ||
| --output-metadata {output.metadata:q} \ | ||
| --min-length {params.min_length:q} \ | ||
| --group-by {params.group_by} \ | ||
| {params.filter_params} | ||
| --output-metadata {output.metadata:q} | ||
| """ | ||
|
|
||
| rule align: | ||
|
|
@@ -100,7 +96,7 @@ rule align: | |
| - filling gaps with N | ||
| """ | ||
| input: | ||
| sequences = "results/{build}/filtered.fasta", | ||
| sequences = "results/{build}/subsampled.fasta", | ||
| reference = resolve_config_path(config['reference']), | ||
| output: | ||
| alignment = "results/{build}/aligned.fasta", | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[non-blocking commentary]
Okay, I can see how this approach makes every filter parameter explicit in the config.yaml. I'm still not a fan of the duplication (e.g., identical settings for
min_length,group_by,max_sequences, etc.) under each build scope (north-americaandglobal). But since there are only two builds here (compared to dengue with 10, norovirus with 14, and flu potentially with 3×8), I can live with it for mumps.I could see an improvement by using something like YAML anchors at some point in the future:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could imagine someone adding "SH" gene builds (would double the builds to 4) but as of now, we only do full genome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After using YAML anchors in nextstrain/rsv#103, I don't think it's great and am seeking alternatives in nextstrain/public#27.