
ZIPA: A family of efficient speech models for multilingual phone recognition - ACL 2025

Paper

This repo is built on the Icefall library from Next-gen Kaldi, and the usage is largely the same.

Environment

Please refer to icefall_container.def for a complete setup of the environment.

A pre-built apptainer container can be found here.

You can build an Apptainer image (which runs without root access on HPC systems) from the provided definition file:

apptainer build icefall.sif icefall_container.def

Generally speaking, the packages below are required for minimal usage (an example install command follows the list):

  1. torch torchaudio torchvision
  2. lhotse (for audio preprocessing)
  3. icefall and k2. They must exactly match your torch and CUDA versions. Instructions are available here.
  4. huggingface_hub (for downloading models and data)
  5. Optional: kaldifeat. This library is required only if you need to train from scratch. See instructions here. It must also strictly match the torch and CUDA versions.
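
As a rough sketch, assuming a CUDA-enabled PyTorch build, the pip-installable pieces can be set up as follows; k2, icefall, and kaldifeat are omitted here because they must be built against your exact torch/CUDA versions per the linked instructions (sentencepiece is needed for the tokenizer described under Data):

# rough sketch; adjust for your Python/CUDA setup
pip install torch torchaudio torchvision
pip install lhotse huggingface_hub sentencepiece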

Inference

Batch inference with detailed error logs

You might need to modify some paths in data_module.py to point to your local data.

python zipformer_crctc/ctc_decode.py --iter 800000 --avg 10 --exp-dir /scratch/lingjzhu_root/lingjzhu1/lingjzhu/zipformer_exp/zipformer_large_crctc_75_pretrained  \
--use-transducer False --use-ctc True  --use-cr-ctc True --max-duration 600 --decoding-method ctc-greedy-search \
--bpe-model ipa_simplified/unigram_127.model --num-workers 1 --num-encoder-layers 4,3,4,5,4,4 \
--feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 \
--encoder-unmasked-dim 192,192,256,320,256,192 --decoder-dim 1024 --joiner-dim 1024 \
--query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6 

Simple inference

Please check out zipa_ctc_inference.py and zipa_transducer_inference.py for example usage.

Here are some simple instructions:

  1. Download models from the Huggingface Hub (see the Final Averaged Checkpoint column below). Use zipa_ctc_inference.py for CR-CTC models and zipa_transducer_inference.py for transducer models.

  2. Perform inference. You can directly pass a list of audio arrays. Batching and padding are supported. Greedy decoding is used for all models.

    • CR-CTC model
    import torch
    from zipa_ctc_inference import initialize_model
    
    # specify the path to model weights and tokenizers
    model_path = "zipformer_weights/zipa_large_crctc_500000_avg10.pth"
    bpe_model_path = "ipa_simplified/unigram_127.model"
    
    # initialize model
    model = initialize_model(model_path, bpe_model_path)
    
    # Generate a dummy audio batch (3 samples of 2 seconds of silence)
    # You can pass a list of audio arrays with any length.
    # Batching, padding, and unpadding will be handled by the code. 
    sample_rate = 16000
    dummy_audio = [torch.zeros(int(sample_rate * 2)),
                   torch.zeros(int(sample_rate * 2)),
                   torch.zeros(int(sample_rate * 2))] 
    
    # Run inference
    output = model.inference(dummy_audio)
    print("Predicted transcript:", output) # A list of predicted phone sequence. 
    • Transducer model
    import torch
    from zipa_transducer_inference import initialize_model
    
    model_path = "zipformer_weights/zipa_large_noncausal_500000_avg10.pth"
    bpe_model_path = "ipa_simplified/unigram_127.model"
    
    model = initialize_model(model_path, bpe_model_path)
    
    # Generate a dummy audio batch (3 samples of 2 seconds of silence)
    sample_rate = 16000
    dummy_audio = [torch.zeros(int(sample_rate * 2)),
                   torch.zeros(int(sample_rate * 2)),
                   torch.zeros(int(sample_rate * 2))]  
    
    # Run inference
    output = model.inference(dummy_audio)
    print("Predicted transcript:", output)

Pretrained models

The Hugging Face page of each model contains its last 10 checkpoints, and the inference code averages these 10 checkpoints before decoding. After downloading the checkpoints to a local folder, you can run the inference code: --exp-dir should point to your local checkpoint folder, --iter should be the last iteration given in the checkpoint names, and --avg 10 means the last 10 checkpoints will be averaged. Please don't change this argument, as only the last 10 checkpoints are provided.

| Model | Params | Training Steps | Raw Checkpoints | Final Averaged Checkpoint |
| --- | --- | --- | --- | --- |
| Zipa-T-small | 65M | 300k | link | anyspeech/zipa-small-noncausal-300k |
| Zipa-T-large | 302M | 300k | link | anyspeech/zipa-small-noncausal-300k |
| Zipa-T-small | 65M | 500k | link | anyspeech/zipa-small-noncausal-500k |
| Zipa-T-large | 302M | 500k | link | anyspeech/zipa-small-noncausal-500k |
| Zipa-Cr-small | 64M | 300k | link | anyspeech/zipa-small-crctc-300k |
| Zipa-Cr-large | 300M | 300k | link | anyspeech/zipa-large-crctc-300k |
| Zipa-Cr-small | 64M | 500k | link | anyspeech/zipa-small-crctc-500k |
| Zipa-Cr-large | 300M | 500k | link | anyspeech/zipa-large-crctc-500k |
| Zipa-Cr-Ns-small | 64M | 700k | link | anyspeech/zipa-small-crctc-ns-700k |
| Zipa-Cr-Ns-large | 300M | 800k | link | anyspeech/zipa-large-crctc-ns-800k |
| No diacritics | | | | |
| Zipa-Cr-Ns-small | 64M | 700k | link | anyspeech/zipa-small-crctc-ns-no-diacritics-700k |
| Zipa-Cr-Ns-large | 300M | 780k | link | anyspeech/zipa-large-crctc-ns-no-diacritics-780k |
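
The checkpoint repositories can be fetched programmatically with huggingface_hub. A minimal sketch, using a repo ID from the Final Averaged Checkpoint column and a placeholder local directory:

from huggingface_hub import snapshot_download

# Download all files of one checkpoint repo from the table above into a local folder.
local_dir = snapshot_download(
    repo_id="anyspeech/zipa-large-crctc-500k",
    local_dir="zipformer_weights/zipa-large-crctc-500k",  # placeholder path
)
print(local_dir)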

Data

The tokenizer can be found here. You'll need the sentencepiece package to load it. This is the list of selected IPA symbols.
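
A minimal sketch of loading the tokenizer with sentencepiece, assuming it has been saved to ipa_simplified/unigram_127.model (the path used by the commands above); the IPA string is just an arbitrary example:

import sentencepiece as spm

# Load the IPA tokenizer shared by all ZIPA models.
sp = spm.SentencePieceProcessor(model_file="ipa_simplified/unigram_127.model")

# Encode an IPA string into token IDs and decode it back.
ids = sp.encode("həloʊ", out_type=int)
print(ids)
print(sp.decode(ids))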

All data are distributed in the scalable shar format, which is similar to the webdataset format but with indexes, and can be easily loaded with the lhotse library. Audio files are downsampled to 16 kHz and stored in the FLAC format to save space.

After downloading all data, place all tar and jsonl files in the same folder:

data-shar
├── cuts.000000.jsonl.gz
├── recording.000000.tar

Then you can construct a data loader with lhotse. Please refer to the lhotse documentation and their shar tutorial for further details.

from lhotse import CutSet

cuts_full = CutSet.from_shar(
    fields={
        "cuts": ["data-shar/cuts.000000.jsonl.gz"],
        "recording": ["data-shar/recording.000000.tar"],
    }
)
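
From this CutSet you can build a PyTorch data loader in the usual lhotse way. A minimal sketch using on-the-fly fbank features and the 120-second max batch duration from the training command below; this is an illustration, not the exact loader in data_module.py:

import torch
from lhotse import Fbank
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.input_strategies import OnTheFlyFeatures

# Batch cuts by total duration (in seconds) and compute fbank features on the fly,
# since the shar archives store raw audio rather than precomputed features.
sampler = DynamicBucketingSampler(cuts_full, max_duration=120, shuffle=True)
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))
loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)

for batch in loader:
    print(batch["inputs"].shape)  # (batch, num_frames, num_features)
    break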

Training

You might need to modify some paths in data_module.py to point to your local data.

Training a Zipformer-Large CR-CTC model:

python zipformer_crctc/train.py --world-size 2 --num-epochs 2 --start-epoch 1 --start-batch 500000  \
--use-fp16 1 --exp-dir /lustre07/scratch/lingjzhu/zipformer_exp/zipformer_large_crctc_0.5_scale --causal 0 \
--full-libri True --max-duration 120 --use-transducer False --use-ctc True  --use-cr-ctc True --base-lr 0.015 \
 --enable-spec-aug False --seed 2333 --wandb False --num-encoder-layers 4,3,4,5,4,4 \
--feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 \
--encoder-unmasked-dim 192,192,256,320,256,192 --decoder-dim 1024 --joiner-dim 1024 \
--query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6 --num-buckets 8 --num-workers 4 \
--unsup-cr-ctc-loss-scale 0.5 --use-unsup-cr-ctc True

Remove diacritics

python zipformer_crctc/train.py --world-size 2 --num-epochs 2 --start-epoch 1 --start-batch 500000 --use-fp16 1 \
 --exp-dir /lustre07/scratch/lingjzhu/zipformer_exp/zipformer_large_crctc_0.5_scale_no_diacritics --causal 0 \
--full-libri True --max-duration 120 --use-transducer False --use-ctc True  --use-cr-ctc True --base-lr 0.015 \
 --enable-spec-aug False --seed 2333 --wandb False --num-encoder-layers 4,3,4,5,4,4 \
--feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 --encoder-unmasked-dim 192,192,256,320,256,192 \
--decoder-dim 1024 --joiner-dim 1024 --query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6 \
--num-buckets 8 --num-workers 4 --unsup-cr-ctc-loss-scale 0.5 --use-unsup-cr-ctc True --remove-diacritics True
