TReconLM

TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
(For more details, see our paper.)


Installation

Tested on Ubuntu 22.04.4 LTS.

Create the conda environment:

conda env create -f treconlm.yml

Install build-essential (required for compiling extensions):

sudo apt update && sudo apt install -y build-essential

Set the Python path:

export PYTHONPATH="${PYTHONPATH}:/path/to/treconlm"

Inference

Pretrained and fine-tuned models, as well as synthetic test datasets, are available on Hugging Face (mli-lab/TReconLM).

Getting Started

Start with the tutorial notebooks in tutorial/:

  • quick_start.ipynb: Download models from HuggingFace and run inference on synthetic datasets
  • custom_data.ipynb: Run inference on your own data or use the Microsoft/Noisy DNA datasets

Command-Line Inference

python src/inference.py exps=<experiment>

Quick test (runs inference on tutorial/example_data with a pretrained model):

# Download a model from HuggingFace
mkdir -p models
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('mli-lab/TReconLM', 'model_seq_len_110.pt', local_dir='models')"

# Run inference
python src/inference.py exps=test/inference_example

Data Format

For custom data, provide two files:

  • ground_truth.txt: one DNA sequence per line (ACGT only)
  • reads.txt: clusters of 2-10 noisy reads, with clusters separated by a line of ===============================

See tutorial/custom_data.ipynb for details.
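
As a minimal sketch, the two files can be loaded as follows. The helper names (load_clusters, load_ground_truth) are illustrative only and not part of the repository; tutorial/custom_data.ipynb contains the authoritative loading code.

# Illustrative parser for the custom data format described above (not repository code).
def load_clusters(reads_path):
    """Split reads.txt into clusters of noisy reads, delimited by lines of '='."""
    clusters, current = [], []
    with open(reads_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if set(line) == {"="}:      # separator line between clusters
                if current:
                    clusters.append(current)
                    current = []
            else:
                current.append(line)    # one noisy read (ACGT string)
    if current:
        clusters.append(current)
    return clusters

def load_ground_truth(path):
    """Read ground_truth.txt: one DNA sequence per line."""
    with open(path) as f:
        return [l.strip() for l in f if l.strip()]

clusters = load_clusters("reads.txt")
targets = load_ground_truth("ground_truth.txt")
assert len(clusters) == len(targets)    # expected: one cluster per ground-truth sequence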


Training

FlashAttention (optional)

To run with FlashAttention for faster training (see PyTorch issue):

pip install nvidia-cuda-nvcc-cu11
export TRITON_PTXAS_PATH=/opt/conda/envs/treconlm/lib/python3.11/site-packages/nvidia/cuda_nvcc/bin/ptxas

Pretraining

python src/pretrain.py exps=<experiment>

Quick test (runs 100 iterations with a small model):

python src/pretrain.py exps=test/pretrain_scratch

To reproduce paper results or train with different settings, choose an experiment from src/hydra/train_config/exps/ (e.g., ids_110nt/ids_110nt, ids_60nt/ids_60nt).

Use torchrun --nproc_per_node=<gpus> for multi-GPU training. Pretraining data is generated on the fly.
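
For example, a 4-GPU pretraining run of the 110 nt experiment could be launched as follows (the experiment name is taken from the list above; adjust --nproc_per_node to your GPU count):

torchrun --nproc_per_node=4 src/pretrain.py exps=ids_110nt/ids_110nt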

Fine-tuning

python src/finetune.py exps=<experiment>

Quick test (runs 100 iterations on tutorial/example_data):

python src/finetune.py exps=test/finetune_scratch

To fine-tune on real datasets, use experiments like microsoft/mic or noisyDNA/noisy from src/hydra/train_config/exps/.

Example cluster scripts can be found in src/slurm_pkg/.


Example Training Times

  • Pretraining:
    Training a ~38M parameter model on ~300M examples (sequence length L = 110, cluster sizes uniformly sampled between 2 and 10, totaling ~440B tokens) on 4 NVIDIA H100 GPUs takes approximately 71.1 hours.

  • Fine-tuning:
    Fine-tuning a ~38M parameter model on ~5.5M examples (sequence length L = 60, cluster sizes between 2 and 10, totaling ~4.39B tokens) takes approximately 20.6 hours.
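
As a back-of-the-envelope check, derived only from the pretraining figures above and not a separate benchmark, the quoted numbers imply roughly the following token throughput:

# Rough throughput implied by the pretraining run above (illustrative arithmetic only).
tokens = 440e9                       # ~440B tokens
hours  = 71.1                        # wall-clock time on 4x H100
gpus   = 4
tok_per_s = tokens / (hours * 3600)  # ~1.7M tokens/s across all GPUs
print(f"{tok_per_s/1e6:.1f}M tokens/s total, {tok_per_s/gpus/1e3:.0f}K tokens/s per GPU")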


Data

Configuration files for our synthetic data generation are in:

src/hydra/data_config

To generate new test datasets, run from src:

python data_pkg/data_generation.py

Baselines

Non–deep learning baselines

To run inference with non–deep learning baselines:

python src/eval_pkg/eval_all_baselines.py --alg <algorithm>

Available algorithms:

# Maps the --alg values accepted by eval_all_baselines.py to the baseline implementations.
ALGS = {
    'bmala': BMALA,
    'itr': Iterative,
    'muscle': MuscleAlgorithm,
    'trellisbma': TrellisBMAAlgorithm,
    'vs': VSAlgorithm,
}
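
For example, to run the TrellisBMA baseline (any key from the ALGS dictionary above can be substituted for trellisbma):

python src/eval_pkg/eval_all_baselines.py --alg trellisbma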

An example cluster script is available in src/slurm_pkg/baselines.

Deep learning baselines

To pretrain, fine-tune, or run inference with our deep learning baselines, see:

  • DeepLearningBaselines/DNAFormer/slurm_pkg
  • DeepLearningBaselines/RobuSeqNet/slurm_pkg

These contain example SLURM execution scripts.


Source Implementations

The original implementations of the baselines were taken from:
