Skip to content

Conversation

@ddomenico
Copy link

Came across a situation when running maf2maf with a MAF that had the mitochondrial chromosome as "M" but the GrCh38 reference as "MT". The samtools chunking in maf2vcf.pl will process randomized chunks up until it hits an entry with this mismatch, at which point it skips the rest of the chunk. This results in non-deterministic outputs to the *skipped file so was a little difficult to debug why certain variants were being skipped at a glance.

When there are no reference/MAF contig mismatches the script works as expected (most cases) but otherwise it presents with inconsistent variant skipping.

Summary

This PR fixes a bug in maf2vcf.pl when run with mismatching contigs:

  1. Non-deterministic behavior: Same input produces different outputs on each run
  2. Silent data loss: Valid variants are incorrectly skipped when invalid chromosomes are present

1: Non-deterministic variant skipping

Root Cause:

my @regions = keys %uniq_regions;  # Returns keys in random order!

Perl's keys %hash returns keys in random order. This caused:

  • Different region order on each run
  • Different samtools processing order
  • Different variants skipped each time
  • Same input → different output

Impact: Users running the same MAF multiple times would get different results, making debugging difficult.

2: Invalid chromosomes cause cascade failures

Root Cause:
When samtools faidx encounters an invalid chromosome (e.g., "M" when reference uses "MT"), it:

  1. Fails with exit code 1
  2. Stops processing all subsequent regions in the same command
  3. Even post sorting, if invalid chromosome sorts before valid ones (e.g., "M" before "X"), all subsequent valid variants are lost

Example:

# If MAF has "M" (invalid) and "X" (valid), and they're in same samtools call:
samtools faidx ref.fa M:3565-3566 X:102937964-102937968
# → Fails on M, never processes X → X variants incorrectly skipped

This PR ensures:

  • Invalid chromosomes don't cause samtools to fail
  • Valid variants are always processed
  • Invalid chromosomes still appear in .skipped.tsv for user awareness
  • Maintains efficiency (still batches 5000 regions per samtools call)

Backward Compatibility

Fully backward compatible:

  • Same command-line interface
  • Same output format
  • Same .skipped.tsv file format
  • Invalid chromosomes still reported to user (in skipped file)
  • Only difference: deterministic behavior and correct handling of edge cases

Related Issues

I believe this fix addresses the root cause of issue #234.

Checklist

  • Code tested with real MAF files
  • Deterministic behavior verified (3+ runs produce identical results)
  • Invalid chromosomes still reported in skipped file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant