feat: Add `tirmite search` command for ensemble search with hit merging by Copilot · Pull Request #37 · Adamtaranto/TIRmite

Copilot · 2026-01-21T12:58:05Z

Reviewer feedback addressed:

Security Summary

No security vulnerabilities were detected by CodeQL analysis.

Original prompt

This section details the original issue you should resolve.

<issue_title>Feature: ensemble search</issue_title>
<issue_description>## Feature description

When constructing an HMM with hmmbuild, we use an aligned set of blast hits obtained from an initial seed sequence. If the sequence diversity in this alignment is too high, then hits between the HMM and any of the individual component sequences used to build it may become weak or insignificant.

We can solve this issue by clustering initial seed hits at a high identity level (e.g. 90%) and then building a separate HMM from each cluster. Taken together, hits from the collection of component models should detect a greater diversity of matches.

Our challenge then is to merge the overlapping hits from the separate component models into one set of hits that can be paired with tirmite pair.

We will create a new entrypoint called tirmite search that will take multiple fasta records and/or HMM models (representing the termini of one or more transposable element families) and run the appropriate search tool (blast or nhmmer) against a user provided genome (or pre-built blast database).

If provided with a cluster mapping file (linking query features into clustered groups), then tirmite search should apply an ensemble search strategy, whereby overlapping hits, on the same strand, to features that belong to the same cluster are merged and given a new cluster name. Hit metadata will be inherited from the highest scoring hit in the cluster.
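A minimal sketch of the merge step described above, not the code in this PR. The `Hit` record and `merge_cluster_hits` function are hypothetical names, and it assumes a merged hit spans the union of its overlapping components while every other field comes from the highest-scoring component (the issue does not say which coordinates the merged hit should carry).

```python
from dataclasses import dataclass, replace
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    query: str    # component feature name (cluster name after merging)
    target: str   # contig name
    start: int    # 1-based inclusive, start <= end
    end: int
    strand: str   # "+" or "-"
    score: float  # bitscore


def merge_cluster_hits(hits, feature_to_cluster):
    """Merge overlapping same-strand hits whose queries share a cluster."""

    def group_key(hit):
        return (feature_to_cluster[hit.query], hit.target, hit.strand)

    # Features missing from the map are ignored, as the issue requests.
    kept = [h for h in hits if h.query in feature_to_cluster]
    kept.sort(key=lambda h: (group_key(h), h.start, h.end))

    merged = []
    for (cluster, _target, _strand), group in groupby(kept, key=group_key):
        best, span_start, span_end = None, 0, 0
        for hit in group:
            if best is not None and hit.start <= span_end:
                # Overlaps the open span: extend it, keep the best scorer.
                span_end = max(span_end, hit.end)
                if hit.score > best.score:
                    best = hit
            else:
                if best is not None:
                    merged.append(
                        replace(best, query=cluster, start=span_start, end=span_end)
                    )
                best, span_start, span_end = hit, hit.start, hit.end
        if best is not None:
            merged.append(replace(best, query=cluster, start=span_start, end=span_end))
    return merged
```

Sorting by cluster, contig, strand and start position first means each group can be merged in a single left-to-right pass.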

Output of the tirmite search tool should be a blast formatted table of merged hits that can be passed to tirmite pair.

Consider existing functions from tirmite.cli.hmm_build module that could be reused or modified.

Elements with asymmetrical termini, that is a distinct terminus model on the left and right ends, may share some limited homology between the left and right termini. We want to remove weak or short hits from one terminal model when those hits are nested within a longer higher-scoring hit to the corresponding member of a pair. After clustering overlapping hits from equivalent sub-models and assigning the merged hit a new cluster name, we should perform a check for nested weak hits between left and right pair members and remove those hits. This will require the user to provide a --pairing_map file that links left and right feature names (post clustering names).
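One possible reading of the nested-hit check, as a hedged sketch rather than the PR's implementation. `Hit`, `drop_nested_partner_hits` and the `pairing` dict (left name mapped to right name, as it might be loaded from `--pairing_map`) are illustrative. It assumes "nested" means fully contained on the same contig, that strand is ignored, and that a hit is removed only when the containing partner hit is both longer and higher scoring.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # post-clustering feature name
    target: str   # contig name
    start: int    # 1-based inclusive, start <= end
    end: int
    score: float  # bitscore


def drop_nested_partner_hits(hits, pairing):
    """Remove hits nested inside a longer, higher-scoring hit to the paired feature."""
    # Make the left/right relationship symmetric.
    partner = {}
    for left, right in pairing.items():
        partner[left] = right
        partner[right] = left

    def is_nested_weak(hit):
        mate = partner.get(hit.query)
        if mate is None:
            return False
        return any(
            other.query == mate
            and other.target == hit.target
            and other.start <= hit.start
            and hit.end <= other.end
            and (other.end - other.start) > (hit.end - hit.start)
            and other.score > hit.score
            for other in hits
        )

    return [hit for hit in hits if not is_nested_weak(hit)]
```

The all-against-all comparison is quadratic in the number of hits; an interval index per contig would be the obvious refinement for large genomes.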

Tasks:

Take as input one or more fasta files and/or HMM models and run blastn and/or nhmmer.

Alternatively, take as input multiple pre-computed blastn and/or nhmmer output files.

Load hit records and apply user-specified quality filters (e-value, query coverage, identity, etc.).

Report all query feature names present in the blastn and/or nhmmer files (or query names from input fasta and HMM files). Report number of hits for each query before and after filtering.

Optional input: User provided mapping of feature names (corresponding to sequence names in fasta file OR HMM model names) to be merged into a single new feature cluster. Format: First column is the new feature name, all subsequent columns in row are the names of distinct fasta records or HMM models whose hits will be merged into the new feature name.

Validate cluster mapping file: component feature names should only belong to one cluster, component feature names should be unique, cluster names should be unique.

Warn if any feature names from the input are not assigned to a cluster. If a cluster mapping file is provided any features that are not assigned to a cluster should be ignored.

For each cluster feature, composed of component features, merge overlapping hits for component features under new clustered feature names.

Merged hits should inherit the properties of the highest scoring component match.

Warn user if hits from components of one cluster overlap with components belonging to another cluster.</issue_description>
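The cluster mapping validation rules in the task list could be sketched as follows. This is illustrative only: `parse_cluster_map` and `unassigned_features` are hypothetical names, and the file is assumed to be tab-delimited, since the issue describes columns but does not name a delimiter.

```python
import io


def parse_cluster_map(handle):
    """Return {component_feature: cluster_name}, enforcing uniqueness rules."""
    feature_to_cluster = {}
    clusters = set()
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        cluster, *members = line.split("\t")
        members = [name for name in members if name]
        if not members:
            raise ValueError(f"line {lineno}: cluster {cluster!r} has no components")
        if cluster in clusters:
            raise ValueError(f"line {lineno}: duplicate cluster name {cluster!r}")
        clusters.add(cluster)
        for name in members:
            # Catches repeats within one row and across clusters.
            if name in feature_to_cluster:
                raise ValueError(
                    f"line {lineno}: feature {name!r} is already assigned to "
                    f"cluster {feature_to_cluster[name]!r}"
                )
            feature_to_cluster[name] = cluster
    return feature_to_cluster


def unassigned_features(observed, feature_to_cluster):
    """Feature names seen in the input that belong to no cluster."""
    return sorted(set(observed) - set(feature_to_cluster))


if __name__ == "__main__":
    mapping = parse_cluster_map(io.StringIO("A\tA1\tA2\nB\tB1\n"))
    print(mapping)
    print(unassigned_features(["A1", "B1", "X"], mapping))
```

Names returned by `unassigned_features` are the ones a caller would warn about and then ignore.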

Comments on the Issue (you are @copilot in this section)

Fixes Feature: ensemble search #33


Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>

Adamtaranto · 2026-01-27T04:58:57Z

@copilot TODO: Add an info logging message showing the full command for each external command-line tool before it is called, e.g. makeblastdb, blastn, nhmmer, and any other external tools.

TODO: Add separate, independent '--max-evalue' options for use when running nhmmer and blastn, or when filtering pre-calculated results from nhmmer and blastn, i.e. add '--blast-max-evalue' and '--hmm-max-evalue'.

TODO: `tirmite search` fails to generate or import blast results if --fasta and --genome are provided: "ERROR Failed to load BLAST file. invalid literal for int() with base 10: '100.000' ensemble_search.py:293"

TODO: Review the format used for output hits (after cluster merging). The format should mimic blast fmt 6 (currently the column count appears different). Add a header line to the blast output. Ensure that `tirmite search` and `tirmite pair` ignore header lines beginning with '#' when loading blast results as an input file.
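For the command logging request, a small wrapper along these lines would cover it. This is a sketch under assumptions: `run_logged` and the logger name are hypothetical and are not taken from the TIRmite code base. `shlex.join` renders the argument list as a copy-pasteable shell command.

```python
import logging
import shlex
import subprocess
import sys

logger = logging.getLogger("tirmite.search")  # logger name is illustrative


def run_logged(cmd):
    """Log the full command line at INFO level, then run it."""
    logger.info("Running: %s", shlex.join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # A harmless stand-in for blastn or nhmmer so the sketch runs anywhere.
    result = run_logged([sys.executable, "-c", "print('ok')"])
    print(result.stdout.strip())
```

Passing the command as a list avoids shell quoting problems with genome paths that contain spaces.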

- Add separate --blast-max-evalue and --hmm-max-evalue options
- Fix BLAST output format parsing by using standard format 6
- Add info logging with full command strings for BLAST and nhmmer
- Update output format to BLAST tabular format 6 for tirmite pair compatibility

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>

Copilot · 2026-01-27T05:03:17Z

@copilot # TODO: Add info logging message with full commands for external cmd line tools before calling: i.e makeblastdb, blastn, nhmmer, any other external tools.

TODO: Add separate independent '--...

Addressed all items in commit 796f78f:

Command logging: Added info logging with full command strings before calling blastn and nhmmer
Separate e-value options: Added --blast-max-evalue and --hmm-max-evalue options (kept --max-evalue for filtering precomputed results)
BLAST parsing fix: Changed to use standard BLAST format 6 (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) which import_blast expects
Output format: Updated write_hits_table to output BLAST format 6 with header comments starting with # (which are properly ignored by import_blast)
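To illustrate the parsing fix, here is a hedged sketch of a format 6 reader. It is not the project's `import_blast`; `read_blast6` is a hypothetical stand-in. It uses the twelve columns listed above, skips lines starting with `#`, and parses `pident`, `evalue` and `bitscore` as floats, so a value such as `100.000` no longer reaches `int()`.

```python
import io

BLAST6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
STR_COLUMNS = {"qseqid", "sseqid"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def read_blast6(handle):
    """Parse BLAST tabular (format 6) rows into dicts with typed values."""
    records = []
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        fields = line.split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            raise ValueError(
                f"line {lineno}: expected {len(BLAST6_COLUMNS)} columns, "
                f"found {len(fields)}"
            )
        row = {}
        for name, value in zip(BLAST6_COLUMNS, fields):
            if name in STR_COLUMNS:
                row[name] = value
            elif name in FLOAT_COLUMNS:
                row[name] = float(value)
            else:
                row[name] = int(value)
        records.append(row)
    return records


if __name__ == "__main__":
    text = (
        "# " + "\t".join(BLAST6_COLUMNS) + "\n"
        "q1\tchr1\t100.000\t50\t0\t0\t1\t50\t1000\t1049\t1e-20\t93.5\n"
    )
    print(read_blast6(io.StringIO(text)))
```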

Adamtaranto · 2026-01-27T05:32:43Z

@copilot

- [ ] Add a '--keep-temp' option so that results in the tmp dir are not cleaned up at the end of a run.
- [ ] Add a '--logfile' option to all tirmite entry points to write logging messages to a file as well as the terminal. The option should be used with the existing function 'init_logging' from 'tirmite.utils.logs'.
- [ ] Support a pre-built blast database as an alternative input to --genome. This should be used to run blastn and/or to extract sequences from specific contig coords using blastdbcmd. If nhmmer needs to be run to search for '--hmm' queries then the '--genome' option is still required.
- [ ] The length of query sequences or query hmm models is required to calculate hit coverage of the query. By default these values should be calculated directly from the sequence records imported from '--fasta' or from a hmm model provided via '--hmm'. However, if blast or nhmmer results are provided directly (and query sequences or hmm models are not seen) then require the user to provide a "--lengthsFile" option with the path to a tab-delimited file with model_name and model_length columns.
- [ ] Add an option to set the blastn option "-word_size" with default value "4".
- [ ] Add a genome list option to provide a single-column list of genome fasta (or gzipped fasta) paths. If multiple genomes are provided, blastn or nhmmer should be run on each independently if those tools are invoked (i.e. if query sequences are provided via --fasta or --hmm). Ensure that gzipped --genome fasta files are supported, and check where an unzipped fasta is required (e.g. makeblastdb, pyfaidx access, nhmmer search).

- Add --keep-temp option to retain temp files after run
- Add --logfile option for writing logs to file
- Add --blast-db option for pre-built BLAST databases
- Add --lengths-file option for query lengths with precomputed results
- Add --word-size option for BLAST word size (default: 4)
- Add --genome-list option for multiple genome inputs
- Support gzipped genome files via prepare_genome_file
- Auto-extract query lengths from FASTA and HMM files

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>

Copilot · 2026-01-27T05:38:07Z

@copilot - [ ] Add '--keep-temp' option so that results in tmp dir are not cleaned up at end of run.

Add '--logfile' option to all tirmite entry points to also write logging messages to file as ...

Implemented all requested features in commit 96076d3:

--keep-temp: Retains temp directory after run completion
--logfile: Writes log messages to {prefix}_tirmite_search.log in output directory
--blast-db: Supports pre-built BLAST database as alternative to --genome for BLAST searches
--lengths-file: Tab-delimited file for query lengths when using precomputed results. Lengths are also auto-extracted from --fasta and --hmm files when provided.
--word-size: BLAST word size option with default value "4"
--genome-list: File with one genome path per line. When multiple genomes provided, searches run on each independently. Gzipped genome files are supported via existing prepare_genome_file utility.
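As a rough illustration of how a lengths file feeds the query coverage filter: the helper names below are hypothetical, and the file is assumed to hold two tab-separated columns (model name, model length) with no header row.

```python
import io


def load_lengths(handle):
    """Read {model_name: model_length} from a two-column tab-delimited file."""
    lengths = {}
    for lineno, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, found {len(fields)}")
        name, length = fields
        lengths[name] = int(length)
    return lengths


def query_coverage(qstart, qend, query_length):
    """Fraction of the query covered by a hit (1-based inclusive coords)."""
    low, high = sorted((qstart, qend))  # tolerate reversed coordinates
    return (high - low + 1) / query_length


if __name__ == "__main__":
    lengths = load_lengths(io.StringIO("TIR_L\t120\nTIR_R\t80\n"))
    print(query_coverage(1, 60, lengths["TIR_L"]))
```

When `--fasta` or `--hmm` inputs are available the same dictionary could be filled from the sequence or model lengths instead, so the coverage calculation does not depend on where the lengths came from.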

Initial plan

0903491

Copilot AI assigned Copilot and Adamtaranto Jan 21, 2026

Copilot started work on behalf of Adamtaranto January 21, 2026 12:58

Copilot AI and others added 2 commits January 21, 2026 13:11

feat: Add tirmite search command for ensemble search functionality

dd1b070

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>

refactor: Address code review feedback for ensemble search

ae665b8

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add ensemble search feature for HMM construction~~ feat: Add tirmite search command for ensemble search with hit merging Jan 21, 2026

Copilot AI requested a review from Adamtaranto January 21, 2026 13:18

Copilot finished work on behalf of Adamtaranto January 21, 2026 13:18

Copilot started work on behalf of Adamtaranto January 27, 2026 04:59

Copilot finished work on behalf of Adamtaranto January 27, 2026 05:03

Copilot started work on behalf of Adamtaranto January 27, 2026 05:33

Copilot finished work on behalf of Adamtaranto January 27, 2026 05:39

Copilot wants to merge 5 commits into main from copilot/add-ensemble-search-feature


2 participants
