Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling#80
Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling#80
Conversation
…g and optimized processing
…necessary wrapper
There was a problem hiding this comment.
Pull request overview
This pull request enhances the robustness and monitoring of dataset building operations in scripts/datasets/seq_for_mut_prob.py. It introduces fallback mechanisms for BioMart metadata downloads and adds progress tracking for Ensembl CDS retrieval, improving reliability when external services are unavailable or slow.
Changes:
- Refactored
download_biomart_metadatato include retry logic, fallback from archive to latest Ensembl server, and Python-based downloader when wget is unavailable - Added progress monitoring with tqdm for Ensembl CDS sequence retrieval
- Simplified multiprocessing by removing unnecessary wrapper function
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…andling and logging
…andling and fallback mechanism
…s before fallback
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
scripts/datasets/seq_for_mut_prob.py:956
- This branch is effectively unreachable and
sys.exit()is risky inside a library-style helper (and especially problematic under multiprocessing).r.raise_for_status()will raise for 4xx/5xx, sostatus = "ERROR"; sys.exit()won’t run; if the logic changes later it could unexpectedly terminate the whole process. Consider removing this block and handling non-OK responses via exceptions/retries and returningnp.nanon terminal failure.
if not r.ok:
r.raise_for_status()
status = "ERROR"
sys.exit()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…are_samplesheet.py
…wait time in get_ref_dna_from_ensembl_batch;
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
scripts/datasets/get_pae.py:63
- If the response is HTTP 200 but the body doesn’t match the expected JSON pattern (
content.endswith(b'}]')),statusremainsINIT, which causes the loop to immediately retry without the 30s backoff. Setstatus = "ERROR"(or otherwise sleep) when content validation fails to avoid tight retry loops and rate-limiting.
content = response.content
if content.endswith(b'}]') and not content.endswith(b'</Error>'):
with open(file_path, 'wb') as output_file:
output_file.write(content)
status = "FINISHED"
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…QRES already exists
…ce and reliability
…ing for improved reliability
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…te metadata_for_outputs assignment logic
…iomart_metadata to support archive usage
Summary
This PR makes the build pipeline faster and more reliable under current Ensembl/BioMart/AF DB constraints. Ensembl CDS is now batched, BioMart downloads have fallbacks and clearer logs, and PAE handling is robust to missing AF DB versions with a new
--custom_pae_dir. MANE builds are explicitly pinned to AF DB v4 for consistency, and several edge‑case crashes plus doc/debug ergonomics were cleaned up. Default AF DB version for non‑MANE builds is now v6 (latest at the time of this PR). Backtranseq batching/retry/timeouts were hardened to avoid multi‑hour hangs.Issue 1: Ensembl CDS retrieval speed
Issue 2: BioMart metadata download instability
--no-hsts, and SSL verify for HTTPS fallback.download_single_file().Issue 3: PAE availability and custom input
--custom_pae_dir; skip download when 10 consecutive 404/410s detected.pae/; probe first 10 IDs sequentially, then parallel download.Issue 4: MANE + AF version consistency
af_version=4when--mane/--mane_onlyis used.merge_af_fragmentsnow receivesaf_version(fixes v4 hardcode when default is v6).--af_versionis now6.Issue 5: Backtranseq robustness
Issue 6: Sequence/PDB hygiene & mapping
Issue 7: Developer UX & tooling
prepare_samplesheet.pyso it runs directly from its folder (adds repo root tosys.path, avoidingModuleNotFoundError: scripts).update_samplesheet_and_structures.pyalways adds symbol to final bundles;samplesheet.csvis kept clean (no symbol/CGC/length).--include-metadatanow only adds CGC/length.build_datasets.pyand kept CLI‑only guard.