Skip to content

heart-gen/rfmix_reader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RFMix-reader

RFMix-reader is a Python package for efficiently reading and processing output files generated by RFMix, a widely used tool for estimating local ancestry in admixed populations.
It employs a lazy loading approach to minimize memory usage, and leverages GPU acceleration for major speedups when available.


Installation

rfmix-reader requires Python 3.11+. Install from PyPI:

pip install rfmix-reader

Installation Options

  • Basic install (CPU only):

    pip install rfmix-reader
  • With GPU acceleration (cupy, cudf, dask-cudf, torch):

    pip install rfmix-reader[gpu]

    The default GPU extra targets CUDA 12 wheels (cupy-cuda12x, cudf-cu12, dask-cudf-cu12). Use the appropriate CUDA build strings if your environment requires a different version.

  • With documentation tools (sphinx, sphinx-rtd-theme):

    pip install rfmix-reader[docs]
  • With testing tools (pytest):

    pip install rfmix-reader[tests]

GPU Notes

  • PyTorch is not installed in the base package. It is pulled in only when you install the [gpu] extra (pip install rfmix-reader[gpu]) or add it yourself.
  • Install a PyTorch wheel that matches your platform and CUDA version using the official selector. For example:
    • CUDA 12 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu121
    • CUDA 11 (Linux/Windows): pip install torch --index-url https://download.pytorch.org/whl/cu118
    • CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
  • RAPIDS (cudf, cupy) wheels are version- and CUDA-specific. See the RAPIDS install guide.
  • CPU-only installations will still run efficiently, just without GPU acceleration.

Quickstart

from rfmix_reader import read_rfmix

# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)

print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
"""See the phasing section below for how to phase per-chromosome outputs and
write them to Zarr with `phase_rfmix_chromosome_to_zarr`."""

Key Features

  • Lazy Loading: Reads data on-the-fly, reducing memory footprint.
  • Efficient Access: Query specific loci or regions of interest.
  • Seamless Integration: Works smoothly with pandas, dask, and other analysis tools.
  • Loci Imputation: Impute local ancestry loci to dense genotype variant sites.
  • GPU Acceleration: Automatic CUDA acceleration via PyTorch/CuPy when available.

Simulation Data

Test datasets for two- and three-population admixture are available on Synapse: Synapse Project syn61691659.


Usage

Binary Conversion

RFMix does not generate binary files directly. Use create_binaries to generate them (also available as a CLI):

create-binaries two_pops/out/
from rfmix_reader import create_binaries

create_binaries("two_pops/out/", binary_dir="./binary_files")

Preparing Reference Data for Phasing

Use prepare-reference to convert bgzipped, indexed reference VCF/BCF files into per-chromosome VCF-Zarr stores that the phasing pipeline consumes. The command writes one <chrom>.zarr directory per input file.

prepare-reference -h
usage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]
                         [--samples-chunk-size SAMPLES_CHUNK_SIZE]
                         [--worker-processes WORKER_PROCESSES]
                         [--verbose | --no-verbose] [--version]
                         output_dir vcf_paths [vcf_paths ...]

Convert one or more bgzipped reference VCF/BCF files into Zarr stores.

positional arguments:
  output_dir            Directory where the Zarr outputs will be written.
  vcf_paths             Paths to reference VCF/BCF files (bgzipped and
                        indexed).

options:
  -h, --help            show this help message and exit
  --chunk-length CHUNK_LENGTH
                        Genomic chunk size for the output Zarr stores
                        (default: 100000).
  --samples-chunk-size SAMPLES_CHUNK_SIZE
                        Chunk size for samples in the output Zarr stores
                        (default: library chosen).
  --worker-processes WORKER_PROCESSES
                        Number of worker processes to use for conversion
                        (default: 0, use library default).
  --verbose, --no-verbose
                        Print progress messages (default: enabled).
  --version             Show the version of the program and exit.

Example data preparation:

# Sample annotations: two columns (no header): sample_id<TAB>group
cat > sample_annotations.tsv <<'EOF'
NA19700	AFR
NA19701	AFR
NA20847	EUR
EOF

# Convert chromosome VCFs into a reference store directory
prepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \
  --chunk-length 50000 --samples-chunk-size 512

Main Function

Once binaries are available, process RFMix results:

from rfmix_reader import read_rfmix

loci, g_anc, admix = read_rfmix("two_pops/out/")

Three Population Example

Binaries can also be generated on-the-fly within read_rfmix with generate_binary set to True.

loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
                                binary_dir="./binary_files",
                                generate_binary=True)

Phasing data

For optimal memory and computational speed, phasing is done per chromosome.

from rfmix_reader import phase_rfmix_chromosome_to_zarr

# Use the reference store + annotations during phasing
admix = phase_rfmix_chromosome_to_zarr(
    file_prefix="two_pops/out/",
    ref_zarr_root="refs",
    sample_annot_path="sample_annotations.tsv",
    output_path="./phased_chr21.zarr",
    chrom="21",
)

The chunking is suboptimal for phasing, so remember to rechunk before using for optimal processing. This is only needed when loading individual chromosomes. The merged data is already optimized.

local_array = admix["local_ancestry"].chunk({"variant": 20000, "sample": 100})
# Compute into memory, if needed
local_array = local_array.compute()

This also saves the data to Zarr for later merging or data processing.

merge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarr

Loci Imputation

The imputation workflow now lives in rfmix_reader.processing.imputation and is exported as interpolate_array. It interpolates the local ancestry matrix onto a denser variant grid and writes the result to <zarr_outdir>/local-ancestry.zarr as a Zarr array shaped (variants, samples, ancestries).

Inputs

  • variant_loci_df: a pandas DataFrame defining the variant grid. Provide at least chrom/pos and an i column that points to the source RFMix row index; rows with i set to NaN are treated as missing loci to interpolate. Sort the frame by genomic coordinate, and include pos if you plan to interpolate in base-pair space.
  • admix: the local ancestry Dask array returned by read_rfmix (shape (loci, samples, ancestries)).
  • zarr_outdir: an output directory where the new local-ancestry.zarr store will be created.

Key options

  • interpolation: "linear" (default), "nearest", or "stepwise".
  • use_bp_positions: set to True to interpolate along variant_loci_df['pos'] rather than treating loci as equally spaced indices.
  • chunk_size/batch_size: tune how many rows are materialized at a time when filling and interpolating the Zarr array.

Workflow example

import pandas as pd
from pathlib import Path
from rfmix_reader import interpolate_array, read_rfmix

# Load RFMix loci and local ancestry
loci_df, _, admix = read_rfmix("two_pops/out/", binary_dir="./binary_files")

# Build the variant grid by merging genotype sites with the RFMix loci index
variants = pd.read_parquet("genotypes/variants.parquet")  # must include chrom/pos
variants = variants.drop_duplicates(subset=["chrom", "pos"]).sort_values("pos")
variant_loci_df = (
    variants.merge(loci_df.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)
            .loc[:, ["chrom", "pos", "i", "_merge"]]
)

z = interpolate_array(
    variant_loci_df,
    admix,
    zarr_outdir=Path("./imputed_local_ancestry"),
    interpolation="linear",
    use_bp_positions=True,
    chunk_size=50_000,
)
print(z)

The interpolator uses GPU acceleration transparently when cupy and a CUDA PyTorch build are available; otherwise it falls back to NumPy. All methods other than "hmm" operate on diploid-summed trajectories and preserve the original ancestry dimension.

Phasing workflow (rfmix_reader.processing.phase)

rfmix_reader.processing.phase implements gnomix-style tail-flip corrections for local ancestry haplotypes. Phasing now lives outside read_rfmix so you can process each chromosome independently and write outputs directly to Zarr.

Required reference inputs

  • VCF-Zarr reference (ref_zarr_root): either a single *.zarr store or a directory containing per-chromosome stores (e.g., 1.zarr, chr1.zarr).
  • Sample annotations (sample_annot_path): two-column file mapping sample_id to group (ancestry label). One representative sample per group is pulled to build reference haplotypes.

Per-chromosome pipeline

  1. (Optional) generate RFMix binary caches with create_binaries.
  2. Call phase_rfmix_chromosome_to_zarr for each chromosome you want to process.
  3. Optionally concatenate those per-chromosome Zarr stores with merge_phased_zarrs.
from rfmix_reader.processing.phase import (
    PhasingConfig,
    merge_phased_zarrs,
    phase_rfmix_chromosome_to_zarr,
)

# Phase chromosome 21 and write to Zarr
dataset = phase_rfmix_chromosome_to_zarr(
    file_prefix="examples/two_populations/out/",
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    output_path="/tmp/phased_chr21.zarr",
    chrom="21",
)

# Merge multiple per-chromosome Zarr stores
merged = merge_phased_zarrs(
    ["/tmp/phased_chr21.zarr", "/tmp/phased_chr22.zarr"],
    output_path="/tmp/phased_all.zarr",
)

If you need fine-grained control, you can still start from unphased outputs and call phase_admix_dask_with_index directly:

from rfmix_reader import read_rfmix
from rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index

loci_df, g_anc, admix, X_raw = read_rfmix(
    "examples/two_populations/out/",
    return_original=True,
    chrom="21",
)

config = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)
phased = phase_admix_dask_with_index(
    admix=admix,
    X_raw=X_raw,
    positions=loci_df.physical_position.to_numpy(),
    chrom=str(loci_df.chromosome.iloc[0]),
    ref_zarr_root="/refs/1kg_chr_zarr/",
    sample_annot_path="/refs/1kg_annotations.tsv",
    config=config,
)

Reading Haptools simulations

Use read_simu to load BGZF-compressed VCF files created by haptools simgenotype --pop_field:

from rfmix_reader import read_simu

loci_df, g_anc, admix = read_simu("/path/to/simulations/")

Haptools does not include the chromosome length in the ##contig header lines, but read_simu requires that metadata to index each VCF. Copy the contigs.txt file Haptools generates from the FASTA you used for simulation and reheader every file with the appropriate contig entry before calling read_simu. The following snippet shows one approach using bcftools and tabix:

CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"

CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
    echo "ERROR: No contig line found for ${CHR} in $CONTIGS"
    exit 1
fi

bcftools view -h "$IN" \
    | sed "s/^##contig=<ID=${CHR}>.*/${CONTIG_LINE}/" > header.${CHR}.tmp
bcftools reheader -h header.${CHR}.tmp -o "$OUT" "$IN"
tabix -p vcf "$OUT"

Visualization

read_rfmix, read_flare, and read_simu all return the same (loci_df, g_anc, admix) tuple, so the plotting utilities in rfmix_reader._visualization work identically for RFMix, FLARE, and Haptools-simulated inputs. The snippet below shows the typical workflow for each reader:

from rfmix_reader import (
    plot_ancestry_by_chromosome,
    plot_global_ancestry,
    read_flare,
    read_rfmix,
    read_simu,
)

# RFMix run directory
loci_df, g_anc, admix = read_rfmix("two_pops/out/")
plot_global_ancestry(g_anc, save_path="rfmix_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="rfmix_local.png")

# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)
loci_df, g_anc, admix = read_flare("flare_runs/chr1/")
plot_global_ancestry(g_anc, save_path="flare_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="flare_local.png")

# Haptools simulations (after reheadering contigs)
loci_df, g_anc, admix = read_simu("/path/to/simulations/")
plot_global_ancestry(g_anc, save_path="simu_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="simu_local.png")

plot_global_ancestry builds per-individual stacked bars of global ancestry while plot_ancestry_by_chromosome summarizes local ancestry along each chromosome, giving you quick visual QC for every supported input format.


Development Install

For contributors:

git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"

Citation

If you use this software, please cite:

DOI

Benjamin, K. J. M. (2024). RFMix-reader (Version 0.2.0) [Computer software]. https://github.com/heart-gen/rfmix_reader

Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv (2024). DOI: 10.1101/2024.07.13.603370.


Funding

This work was supported by the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.

About

No description or website provided.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages