Skip to content

Comments

feat: Add tirmite search command for ensemble search with hit merging#37

Draft
Copilot wants to merge 5 commits intomainfrom
copilot/add-ensemble-search-feature
Draft

feat: Add tirmite search command for ensemble search with hit merging#37
Copilot wants to merge 5 commits intomainfrom
copilot/add-ensemble-search-feature

Conversation

Copy link
Contributor

Copilot AI commented Jan 21, 2026

  • Implement tirmite search CLI entrypoint for ensemble search functionality
  • Create new module src/tirmite/cli/ensemble_search.py with:
    • CLI argument parser for fasta files, HMM models, and precomputed result files
    • Support for cluster mapping file input (mapping component features to cluster names)
    • Quality filters (evalue, query coverage, identity)
    • Optional pairing map file for nested hit removal
  • Implement hit loading and parsing:
    • Load hits from BLAST and/or nhmmer output files
    • Apply quality filters with user-specified thresholds
    • Report query feature names and hit counts before/after filtering
  • Implement cluster mapping functionality:
    • Parse and validate cluster mapping file
    • Validate uniqueness of component names and cluster names
    • Warn about unassigned features
  • Implement ensemble search logic:
    • Merge overlapping hits from same-cluster components on same strand
    • Inherit properties from highest-scoring component hit
    • Warn about cross-cluster hit overlaps
  • Implement optional nested hit removal for asymmetric termini:
    • Parse pairing map linking left/right cluster names
    • Remove weak nested hits between paired terminals
  • Output merged hits in BLAST-compatible tabular format
  • Add comprehensive unit tests for new functionality (22 tests)
  • Integrate with existing CLI in cli.py
  • Run code review and security checks

Reviewer feedback addressed:

  • Add info logging with full commands for external tools (blastn, nhmmer)
  • Add separate --blast-max-evalue and --hmm-max-evalue options
  • Fix BLAST format parsing issue (was using wrong output format)
  • Output format now matches BLAST fmt 6 with header comments
  • Add --keep-temp option to retain temp directory after run
  • Add --logfile option for logging to file
  • Support pre-built BLAST database via --blast-db option
  • Add --lengths-file option for query lengths when using precomputed results
  • Add --word-size option for BLAST word size (default: 4)
  • Add --genome-list option for multiple genome inputs
  • Support gzipped genome files

Security Summary

No security vulnerabilities were detected by CodeQL analysis.

Original prompt

This section details on the original issue you should resolve

<issue_title>Feature: ensemble search</issue_title>
<issue_description>## Feature description

When constructing a HMM with hmmbuild we use an aligned set of blast hits obtained from an initial seed sequence. If the sequence diversity in this alignment is too high then hits between the HMM model and any of the individual component sequences that was used to make the model may become weak or insignificant.

We can solve this issue by clustering initial seed hits at a high identity level (i.e. 90%) and then building multiple separate HMMs from each cluster. Taken together, hits from the collection of component models should detect a greater diversity of matches.

Our challenge then is to merge the overlapping hits from the separate component models into one set of hits that can be paired with tirmite pair

We will create a new entrypoint called tirmite search that will take multiple fasta records and/or HMM models (representing the termini of one or more transposable element families) and run the appropriate search tool (blast or nhmmer) against a user provided genome (or pre-built blast database).

If provided with a cluster mapping file (linking query features into clustered groups), then tirmite search should apply an ensemble search strategy, whereby overlapping hits, on the same strand, to features that belong to the same cluster are merged and given a new cluster name. Hit metadata will be inherited from the highest scoring hit in the cluster.

Output of the tirmite search tool should be a blast formatted table of merged hits that can be passed to tirmite pair.

Consider existing functions from tirmite.cli.hmm_build module that could be reused or modified.

Elements with asymmetrical termini, that is a distinct terminus model on the left and right ends, may share some limited homology between the left and right termini. We want to remove weak or short hits from one terminal model when those hits are nested within a longer higher-scoring hit to the corresponding member of a pair. After clustering overlapping hits from equivalent sub-models and assigning the merged hit a new cluster name, we should perform a check for nested weak hits between left and right pair members and remove those hits. This will require the user to provide a --pairing_map file that links left and right feature names (post clustering names).

Tasks:

  • Take as input one or more fasta files and/or HMM models and run blastn and/or nhmmer.
  • Alternatively, take as input multiple pre-computed blastn and/or nhmmer output files.
  • Load hit records and apply user specified quality filters (evalue, query coverage, identity, etc)
  • Report all query feature names present in the blastn and/or nhmmer files (or query names from input fasta and HMM files). Report number of hits for each query before and after filtering.
  • Optional input: User provided mapping of feature names (corresponding to sequence names in fasta file OR HMM model names) to be merged into a single new feature cluster. Format: First column is the new feature name, all subsequent columns in row are the names of distinct fasta records or HMM models whose hits will be merged into the new feature name.
  • Validate cluster mapping file: component feature names should only belong to one cluster, component feature names should be unique, cluster names should be unique.
  • Warn if any feature names from the input are not assigned to a cluster. If a cluster mapping file is provided any features that are not assigned to a cluster should be ignored.
  • For each cluster feature, composed of component features, merge overlapping hits for component features under new clustered feature names.
  • Merged hits should inherit the properties of the highest scoring component match
  • Warn user if hits from components of one cluster overlap with components belonging to another cluster.</issue_description>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 2 commits January 21, 2026 13:11
Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>
Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>
Copilot AI changed the title [WIP] Add ensemble search feature for HMM construction feat: Add tirmite search command for ensemble search with hit merging Jan 21, 2026
Copilot AI requested a review from Adamtaranto January 21, 2026 13:18
@Adamtaranto
Copy link
Owner

@copilot # TODO: Add info logging message with full commands for external cmd line tools before calling: i.e makeblastdb, blastn, nhmmer, any other external tools.

TODO: Add separate independent '--max-evalue' for use with running nhmmer and blastn, or for filtering pre-calculated results from nhmmer and blastn. i.e Add '--blast-max-evalue' and '--hmm-max-evalue'

TODO: tirmite search fails to generate or import blast results if --fasta and --genome provided, "ERROR Failed to load BLAST file. invalid literal for int() with base 10: '100.000' ensemble_search.py:293"

TODO: Review format used for output hits (after cluster merging), format should mimic blast fmt 6 (currently column count appears different.) Add header line to blast output. Ensure that tirmite search and tirmite pair ignore header lines begining with '#' when loading blast results as input file.

- Add separate --blast-max-evalue and --hmm-max-evalue options
- Fix BLAST output format parsing by using standard format 6
- Add info logging with full command strings for BLAST and nhmmer
- Update output format to BLAST tabular format 6 for tirmite pair compatibility

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 27, 2026

@copilot # TODO: Add info logging message with full commands for external cmd line tools before calling: i.e makeblastdb, blastn, nhmmer, any other external tools.

TODO: Add separate independent '--...

Addressed all items in commit 796f78f:

  1. Command logging: Added info logging with full command strings before calling blastn and nhmmer
  2. Separate e-value options: Added --blast-max-evalue and --hmm-max-evalue options (kept --max-evalue for filtering precomputed results)
  3. BLAST parsing fix: Changed to use standard BLAST format 6 (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) which import_blast expects
  4. Output format: Updated write_hits_table to output BLAST format 6 with header comments starting with # (which are properly ignored by import_blast)

@Adamtaranto
Copy link
Owner

@copilot - [ ] Add '--keep-temp' option so that results in tmp dir are not cleaned up at end of run.

  • Add '--logfile' option to all tirmite entry points to also write logging messages to file as well as terminal. Option used with existing function 'init_logging' from 'tirmite.utils.logs'
  • Support pre-built blast database as alternative input to --genome, this should be used to run blastn and/or to extract sequences from specific contig coords using blastdbcmd. If nhmmer needs to be run to search for '--hmm' queries then '--genome' option is still required.
  • The length of query sequences or query hmm models is required to calculate hit coverage of query. By default these values should be calculated directly from the sequence records imported from '--fasta' or from a hmm model provided via '--hmm'. However, if blast or nhmmer results are provided directly (and query sequences or hmm models are not seen) then require user to provide a "--lengthsFile" option with path to tab-delimited file with model_name and model_length columns.
  • Add option to set blastn option "-word_size" with default value "4"
  • Add genome list option to provide single column list of genome fasta (or gzipped fasta) paths. If multiple genomes provided blastn or nhmmer should be run on each independently if those tools are invoked (i.e. if query sequences are provided via --fasta or --hmm ). Ensure that gzipped --genome fasta files are supported, check when unzipped fasta is required (i.e. makeblastdb, pyfaidx access, nhmmer search etc.)

- Add --keep-temp option to retain temp files after run
- Add --logfile option for writing logs to file
- Add --blast-db option for pre-built BLAST databases
- Add --lengths-file option for query lengths with precomputed results
- Add --word-size option for BLAST word size (default: 4)
- Add --genome-list option for multiple genome inputs
- Support gzipped genome files via prepare_genome_file
- Auto-extract query lengths from FASTA and HMM files

Co-authored-by: Adamtaranto <2160099+Adamtaranto@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 27, 2026

@copilot - [ ] Add '--keep-temp' option so that results in tmp dir are not cleaned up at end of run.

  • Add '--logfile' option to all tirmite entry points to also write logging messages to file as ...

Implemented all requested features in commit 96076d3:

  1. --keep-temp: Retains temp directory after run completion
  2. --logfile: Writes log messages to {prefix}_tirmite_search.log in output directory
  3. --blast-db: Supports pre-built BLAST database as alternative to --genome for BLAST searches
  4. --lengths-file: Tab-delimited file for query lengths when using precomputed results. Lengths are also auto-extracted from --fasta and --hmm files when provided.
  5. --word-size: BLAST word size option with default value "4"
  6. --genome-list: File with one genome path per line. When multiple genomes provided, searches run on each independently. Gzipped genome files are supported via existing prepare_genome_file utility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: ensemble search

2 participants