RFMix-reader is a Python package for efficiently reading and processing output
files generated by RFMix, a widely used tool
for estimating local ancestry in admixed populations.
It employs a lazy loading approach to minimize memory usage, and leverages GPU acceleration
for major speedups when available.
rfmix-reader requires Python 3.11+. Install from PyPI:
pip install rfmix-reader-
Basic install (CPU only):
pip install rfmix-reader
-
With GPU acceleration (
cupy,cudf,dask-cudf,torch):pip install rfmix-reader[gpu]
The default GPU extra targets CUDA 12 wheels (
cupy-cuda12x,cudf-cu12,dask-cudf-cu12). Use the appropriate CUDA build strings if your environment requires a different version. -
With documentation tools (
sphinx,sphinx-rtd-theme):pip install rfmix-reader[docs]
-
With testing tools (
pytest):pip install rfmix-reader[tests]
- PyTorch is not installed in the base package. It is pulled in only when you install the
[gpu]extra (pip install rfmix-reader[gpu]) or add it yourself. - Install a PyTorch wheel that matches your platform and CUDA version using the official selector. For example:
- CUDA 12 (Linux/Windows):
pip install torch --index-url https://download.pytorch.org/whl/cu121 - CUDA 11 (Linux/Windows):
pip install torch --index-url https://download.pytorch.org/whl/cu118 - CPU-only:
pip install torch --index-url https://download.pytorch.org/whl/cpu
- CUDA 12 (Linux/Windows):
- RAPIDS (
cudf,cupy) wheels are version- and CUDA-specific. See the RAPIDS install guide. - CPU-only installations will still run efficiently, just without GPU acceleration.
from rfmix_reader import read_rfmix
# Load RFMix outputs (two-population admixture example)
file_path = "examples/two_populations/out/"
loci_df, g_anc, local_array = read_rfmix(file_path)
print(loci_df.head())
print(g_anc.head())
print(local_array.shape)
"""See the phasing section below for how to phase per-chromosome outputs and
write them to Zarr with `phase_rfmix_chromosome_to_zarr`."""- Lazy Loading: Reads data on-the-fly, reducing memory footprint.
- Efficient Access: Query specific loci or regions of interest.
- Seamless Integration: Works smoothly with
pandas,dask, and other analysis tools. - Loci Imputation: Impute local ancestry loci to dense genotype variant sites.
- GPU Acceleration: Automatic CUDA acceleration via PyTorch/CuPy when available.
Test datasets for two- and three-population admixture are available on Synapse: Synapse Project syn61691659.
RFMix does not generate binary files directly.
Use create_binaries to generate them (also available as a CLI):
create-binaries two_pops/out/from rfmix_reader import create_binaries
create_binaries("two_pops/out/", binary_dir="./binary_files")Use prepare-reference to convert bgzipped, indexed reference VCF/BCF files
into per-chromosome VCF-Zarr stores that the phasing pipeline consumes.
The command writes one <chrom>.zarr directory per input file.
prepare-reference -h
usage: prepare-reference [-h] [--chunk-length CHUNK_LENGTH]
[--samples-chunk-size SAMPLES_CHUNK_SIZE]
[--worker-processes WORKER_PROCESSES]
[--verbose | --no-verbose] [--version]
output_dir vcf_paths [vcf_paths ...]
Convert one or more bgzipped reference VCF/BCF files into Zarr stores.
positional arguments:
output_dir Directory where the Zarr outputs will be written.
vcf_paths Paths to reference VCF/BCF files (bgzipped and
indexed).
options:
-h, --help show this help message and exit
--chunk-length CHUNK_LENGTH
Genomic chunk size for the output Zarr stores
(default: 100000).
--samples-chunk-size SAMPLES_CHUNK_SIZE
Chunk size for samples in the output Zarr stores
(default: library chosen).
--worker-processes WORKER_PROCESSES
Number of worker processes to use for conversion
(default: 0, use library default).
--verbose, --no-verbose
Print progress messages (default: enabled).
--version Show the version of the program and exit.
Example data preparation:
# Sample annotations: two columns (no header): sample_id<TAB>group
cat > sample_annotations.tsv <<'EOF'
NA19700 AFR
NA19701 AFR
NA20847 EUR
EOF
# Convert chromosome VCFs into a reference store directory
prepare-reference refs/ 1kg_chr20.vcf.gz 1kg_chr21.vcf.gz \
--chunk-length 50000 --samples-chunk-size 512Once binaries are available, process RFMix results:
from rfmix_reader import read_rfmix
loci, g_anc, admix = read_rfmix("two_pops/out/")Binaries can also be generated on-the-fly within read_rfmix with
generate_binary set to True.
loci, g_anc, admix = read_rfmix("examples/three_populations/out/",
binary_dir="./binary_files",
generate_binary=True)For optimal memory and computational speed, phasing is done per chromosome.
from rfmix_reader import phase_rfmix_chromosome_to_zarr
# Use the reference store + annotations during phasing
admix = phase_rfmix_chromosome_to_zarr(
file_prefix="two_pops/out/",
ref_zarr_root="refs",
sample_annot_path="sample_annotations.tsv",
output_path="./phased_chr21.zarr",
chrom="21",
)The chunking is suboptimal for phasing, so remember to rechunk before using for optimal processing. This is only needed when loading individual chromosomes. The merged data is already optimized.
local_array = admix["local_ancestry"].chunk({"variant": 20000, "sample": 100})
# Compute into memory, if needed
local_array = local_array.compute()This also saves the data to Zarr for later merging or data processing.
merge-phased-zarrs ./phased_all.zarr ./phased_chr21.zarr ./phased_chr22.zarrThe imputation workflow now lives in rfmix_reader.processing.imputation and
is exported as interpolate_array. It interpolates the local ancestry matrix
onto a denser variant grid and writes the result to <zarr_outdir>/local-ancestry.zarr
as a Zarr array shaped (variants, samples, ancestries).
Inputs
variant_loci_df: a pandas DataFrame defining the variant grid. Provide at leastchrom/posand anicolumn that points to the source RFMix row index; rows withiset toNaNare treated as missing loci to interpolate. Sort the frame by genomic coordinate, and includeposif you plan to interpolate in base-pair space.admix: the local ancestry Dask array returned byread_rfmix(shape(loci, samples, ancestries)).zarr_outdir: an output directory where the newlocal-ancestry.zarrstore will be created.
Key options
interpolation:"linear"(default),"nearest", or"stepwise".use_bp_positions: set toTrueto interpolate alongvariant_loci_df['pos']rather than treating loci as equally spaced indices.chunk_size/batch_size: tune how many rows are materialized at a time when filling and interpolating the Zarr array.
Workflow example
import pandas as pd
from pathlib import Path
from rfmix_reader import interpolate_array, read_rfmix
# Load RFMix loci and local ancestry
loci_df, _, admix = read_rfmix("two_pops/out/", binary_dir="./binary_files")
# Build the variant grid by merging genotype sites with the RFMix loci index
variants = pd.read_parquet("genotypes/variants.parquet") # must include chrom/pos
variants = variants.drop_duplicates(subset=["chrom", "pos"]).sort_values("pos")
variant_loci_df = (
variants.merge(loci_df.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)
.loc[:, ["chrom", "pos", "i", "_merge"]]
)
z = interpolate_array(
variant_loci_df,
admix,
zarr_outdir=Path("./imputed_local_ancestry"),
interpolation="linear",
use_bp_positions=True,
chunk_size=50_000,
)
print(z)The interpolator uses GPU acceleration transparently when cupy and a CUDA
PyTorch build are available; otherwise it falls back to NumPy. All methods other
than "hmm" operate on diploid-summed trajectories and preserve the original
ancestry dimension.
rfmix_reader.processing.phase implements gnomix-style tail-flip corrections
for local ancestry haplotypes. Phasing now lives outside read_rfmix so you can
process each chromosome independently and write outputs directly to Zarr.
- VCF-Zarr reference (
ref_zarr_root): either a single*.zarrstore or a directory containing per-chromosome stores (e.g.,1.zarr,chr1.zarr). - Sample annotations (
sample_annot_path): two-column file mappingsample_idtogroup(ancestry label). One representative sample per group is pulled to build reference haplotypes.
- (Optional) generate RFMix binary caches with
create_binaries. - Call
phase_rfmix_chromosome_to_zarrfor each chromosome you want to process. - Optionally concatenate those per-chromosome Zarr stores with
merge_phased_zarrs.
from rfmix_reader.processing.phase import (
PhasingConfig,
merge_phased_zarrs,
phase_rfmix_chromosome_to_zarr,
)
# Phase chromosome 21 and write to Zarr
dataset = phase_rfmix_chromosome_to_zarr(
file_prefix="examples/two_populations/out/",
ref_zarr_root="/refs/1kg_chr_zarr/",
sample_annot_path="/refs/1kg_annotations.tsv",
output_path="/tmp/phased_chr21.zarr",
chrom="21",
)
# Merge multiple per-chromosome Zarr stores
merged = merge_phased_zarrs(
["/tmp/phased_chr21.zarr", "/tmp/phased_chr22.zarr"],
output_path="/tmp/phased_all.zarr",
)If you need fine-grained control, you can still start from unphased outputs and
call phase_admix_dask_with_index directly:
from rfmix_reader import read_rfmix
from rfmix_reader.processing.phase import PhasingConfig, phase_admix_dask_with_index
loci_df, g_anc, admix, X_raw = read_rfmix(
"examples/two_populations/out/",
return_original=True,
chrom="21",
)
config = PhasingConfig(window_size=100, min_block_len=10, max_mismatch_frac=0.3)
phased = phase_admix_dask_with_index(
admix=admix,
X_raw=X_raw,
positions=loci_df.physical_position.to_numpy(),
chrom=str(loci_df.chromosome.iloc[0]),
ref_zarr_root="/refs/1kg_chr_zarr/",
sample_annot_path="/refs/1kg_annotations.tsv",
config=config,
)Use read_simu to load BGZF-compressed VCF files created by
haptools simgenotype --pop_field:
from rfmix_reader import read_simu
loci_df, g_anc, admix = read_simu("/path/to/simulations/")Haptools does not include the chromosome length in the ##contig
header lines, but read_simu requires that metadata to index each VCF.
Copy the contigs.txt file Haptools generates from the FASTA you used
for simulation and reheader every file with the appropriate contig entry
before calling read_simu. The following snippet shows one approach
using bcftools and tabix:
CONTIGS="../../three_populations/_m/contigs.txt"
VCFDIR="gt-files"
CHR="chr${SLURM_ARRAY_TASK_ID}"
OUT="${VCFDIR}/${CHR}.vcf.gz"
IN="${VCFDIR}/back/${CHR}.vcf.gz"
CONTIG_LINE=$(grep -w "ID=${CHR}" "$CONTIGS")
if [[ -z "$CONTIG_LINE" ]]; then
echo "ERROR: No contig line found for ${CHR} in $CONTIGS"
exit 1
fi
bcftools view -h "$IN" \
| sed "s/^##contig=<ID=${CHR}>.*/${CONTIG_LINE}/" > header.${CHR}.tmp
bcftools reheader -h header.${CHR}.tmp -o "$OUT" "$IN"
tabix -p vcf "$OUT"read_rfmix, read_flare, and read_simu all return the same
(loci_df, g_anc, admix) tuple, so the plotting utilities in
rfmix_reader._visualization work identically for RFMix, FLARE, and
Haptools-simulated inputs. The snippet below shows the typical workflow
for each reader:
from rfmix_reader import (
plot_ancestry_by_chromosome,
plot_global_ancestry,
read_flare,
read_rfmix,
read_simu,
)
# RFMix run directory
loci_df, g_anc, admix = read_rfmix("two_pops/out/")
plot_global_ancestry(g_anc, save_path="rfmix_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="rfmix_local.png")
# FLARE output directory (contains *.anc.vcf.gz + global.anc.gz)
loci_df, g_anc, admix = read_flare("flare_runs/chr1/")
plot_global_ancestry(g_anc, save_path="flare_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="flare_local.png")
# Haptools simulations (after reheadering contigs)
loci_df, g_anc, admix = read_simu("/path/to/simulations/")
plot_global_ancestry(g_anc, save_path="simu_global.png")
plot_ancestry_by_chromosome(loci_df, admix, save_path="simu_local.png")plot_global_ancestry builds per-individual stacked bars of global
ancestry while plot_ancestry_by_chromosome summarizes local ancestry
along each chromosome, giving you quick visual QC for every supported
input format.
For contributors:
git clone https://github.com/heart-gen/rfmix_reader.git
cd rfmix_reader
pip install -e ".[gpu,docs,tests]"If you use this software, please cite:
Benjamin, K. J. M. (2024). RFMix-reader (Version 0.2.0) [Computer software]. https://github.com/heart-gen/rfmix_reader
Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv (2024). DOI: 10.1101/2024.07.13.603370.
This work was supported by the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.