This repo is built on the icefall library from Next-gen Kaldi, and the usage is almost the same.
Please refer to `icefall_container.def` for a complete setup of the environment.
A pre-built apptainer container can be found here.
You can build an apptainer image (which works without root access on HPC) from the given definition file:

```bash
apptainer build icefall.sif icefall_container.def
```
Generally speaking, the packages below are required for minimal usage:

- `torch`, `torchaudio`, `torchvision`
- `lhotse` (for audio preprocessing)
- `icefall` and `k2`. These must exactly match the torch version and CUDA version. Instructions are available here.
- `huggingface_hub` (for downloading models and data)
- Optional: `kaldifeat`. If you need to train from scratch, this library is also required. See instructions here. It must match the torch and CUDA versions strictly.
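Since `k2` and `kaldifeat` ship wheels built against specific torch/CUDA combinations, it helps to confirm what your environment actually provides before installing them; a minimal check:

```python
# Print the versions that k2/kaldifeat wheels must be matched against.
import torch

print(torch.__version__)          # torch version, e.g. 2.1.0
print(torch.version.cuda)         # CUDA version torch was built with
print(torch.cuda.is_available())  # whether a GPU is visible
```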
You might need to modify some paths in `data_module.py` to point to your local data. The command below runs CTC greedy-search decoding with a Zipformer-Large CR-CTC model:
```bash
python zipformer_crctc/ctc_decode.py --iter 800000 --avg 10 --exp-dir /scratch/lingjzhu_root/lingjzhu1/lingjzhu/zipformer_exp/zipformer_large_crctc_75_pretrained \
    --use-transducer False --use-ctc True --use-cr-ctc True --max-duration 600 --decoding-method ctc-greedy-search \
    --bpe-model ipa_simplified/unigram_127.model --num-workers 1 --num-encoder-layers 4,3,4,5,4,4 \
    --feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 \
    --encoder-unmasked-dim 192,192,256,320,256,192 --decoder-dim 1024 --joiner-dim 1024 \
    --query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6
```
Please check out `zipa_ctc_inference.py` and `zipa_transducer_inference.py` for example usage.
Here are some simple instructions:
- Download models from the Huggingface Hub (see the Final Averaged Checkpoint column below). Use `zipa_ctc_inference.py` for CTC models and `zipa_transducer_inference.py` for transducer models.
- Perform inference. You can directly pass a list of audio arrays; batching and padding are supported. Greedy decoding is used for all models.
- CRCTC model

```python
import torch

from zipa_ctc_inference import initialize_model

# Specify the paths to the model weights and the tokenizer
model_path = "zipformer_weights/zipa_large_crctc_500000_avg10.pth"
bpe_model_path = "ipa_simplified/unigram_127.model"

# Initialize the model
model = initialize_model(model_path, bpe_model_path)

# Generate a dummy audio batch (3 samples of 2 seconds of silence).
# You can pass a list of audio arrays of any length; batching, padding,
# and unpadding are handled by the code.
sample_rate = 16000
dummy_audio = [
    torch.zeros(int(sample_rate * 2)),
    torch.zeros(int(sample_rate * 2)),
    torch.zeros(int(sample_rate * 2)),
]

# Run inference
output = model.inference(dummy_audio)
print("Predicted transcripts:", output)  # a list of predicted phone sequences
```
- Transducer model

```python
import torch

from zipa_transducer_inference import initialize_model

model_path = "zipformer_weights/zipa_large_noncausal_500000_avg10.pth"
bpe_model_path = "ipa_simplified/unigram_127.model"

model = initialize_model(model_path, bpe_model_path)

# Generate a dummy audio batch (3 samples of 2 seconds of silence)
sample_rate = 16000
dummy_audio = [
    torch.zeros(int(sample_rate * 2)),
    torch.zeros(int(sample_rate * 2)),
    torch.zeros(int(sample_rate * 2)),
]

# Run inference
output = model.inference(dummy_audio)
print("Predicted transcripts:", output)
```
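To run inference on real recordings instead of silence, load the audio with `torchaudio` and resample to 16 kHz if necessary. A minimal sketch using the model initialized above; `example.wav` is a placeholder path:

```python
import torchaudio

# "example.wav" is a placeholder; use any mono recording.
wav, sr = torchaudio.load("example.wav")
if sr != 16000:  # the models expect 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

output = model.inference([wav.squeeze(0)])  # one 1-D tensor per utterance
print("Predicted transcript:", output)
```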
The Huggingface page of each model contains the last 10 checkpoints, and the inference code averages these 10 checkpoints before decoding.
After you download the checkpoints to a local folder, you can use the inference code: `--exp-dir` should point to your local checkpoint folder, and `--iter` should be the last iteration as specified in the checkpoint names.
`--avg 10` means that the last 10 checkpoints will be averaged; please don't change this argument, as we have only provided the last 10 checkpoints.
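A minimal sketch of fetching a model repo with `huggingface_hub` (the repo id comes from the table below; the local directory name is illustrative):

```python
from huggingface_hub import snapshot_download

# Download all files of a model repo, including the last 10 checkpoints.
local_dir = snapshot_download(
    repo_id="anyspeech/zipa-large-crctc-500k",
    local_dir="zipformer_weights",  # illustrative local folder
)
print("Checkpoints downloaded to:", local_dir)
```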
| Model | Params | Training Steps | Raw Checkpoints | Final Averaged Checkpoint |
|---|---|---|---|---|
| Zipa-T-small | 65M | 300k | link | anyspeech/zipa-small-noncausal-300k |
| Zipa-T-large | 302M | 300k | link | anyspeech/zipa-large-noncausal-300k |
| Zipa-T-small | 65M | 500k | link | anyspeech/zipa-small-noncausal-500k |
| Zipa-T-large | 302M | 500k | link | anyspeech/zipa-large-noncausal-500k |
| Zipa-Cr-small | 64M | 300k | link | anyspeech/zipa-small-crctc-300k |
| Zipa-Cr-large | 300M | 300k | link | anyspeech/zipa-large-crctc-300k |
| Zipa-Cr-small | 64M | 500k | link | anyspeech/zipa-small-crctc-500k |
| Zipa-Cr-large | 300M | 500k | link | anyspeech/zipa-large-crctc-500k |
| Zipa-Cr-Ns-small | 64M | 700k | link | anyspeech/zipa-small-crctc-ns-700k |
| Zipa-Cr-Ns-large | 300M | 800k | link | anyspeech/zipa-large-crctc-ns-800k |
| **No diacritics** | | | | |
| Zipa-Cr-Ns-small | 64M | 700k | link | anyspeech/zipa-small-crctc-ns-no-diacritics-700k |
| Zipa-Cr-Ns-large | 300M | 780k | link | anyspeech/zipa-large-crctc-ns-no-diacritics-780k |
The tokenizer can be found here. You'll need the `sentencepiece` package to load it. This is the list of selected IPA symbols.
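A minimal sketch of loading and inspecting the tokenizer, assuming the `ipa_simplified/unigram_127.model` path used in the examples above:

```python
import sentencepiece as spm

# Load the IPA tokenizer
sp = spm.SentencePieceProcessor()
sp.load("ipa_simplified/unigram_127.model")

print(sp.vocab_size())  # vocabulary size (127 for unigram_127)
print([sp.id_to_piece(i) for i in range(10)])  # inspect the first few pieces
```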
All data are distributed in the scalable shar format, which is similar to the webdataset format but with indexes, and can be easily loaded with the lhotse library. Audio files are downsampled to 16,000 Hz and stored in the FLAC format to save space.
- All processed data (~1.8TB)
- Training data only (~1.5TB)
- Pseudolabeled data (~1TB)
After downloading all the data, place all `.tar` and `.jsonl.gz` files within the same folder:
```
data-shar
├── cuts.000000.jsonl.gz
├── recording.000000.tar
```
Then you can construct a data loader with lhotse. Please refer to the lhotse documentation and their shar tutorial for further details.
```python
from lhotse import CutSet

cuts_full = CutSet.from_shar(
    fields={
        "cuts": ["data-shar/cuts.000000.jsonl.gz"],
        "recording": ["data-shar/recording.000000.tar"],
    }
)
```

You might need to modify some paths in `data_module.py` to point to your local data.
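For training-style loading, the `CutSet` is typically wrapped in a sampler and a PyTorch `DataLoader`. A hedged sketch following lhotse's sampling API; the dataset class and feature settings here are illustrative, not the repo's exact configuration:

```python
from torch.utils.data import DataLoader

from lhotse import Fbank
from lhotse.dataset import (
    DynamicBucketingSampler,
    K2SpeechRecognitionDataset,
    OnTheFlyFeatures,
)

# Batch by total duration rather than a fixed number of utterances
sampler = DynamicBucketingSampler(cuts_full, max_duration=120.0, shuffle=True)

# Compute fbank features on the fly from the shar recordings
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))

# lhotse samplers yield whole batches, so batch_size must be None
dataloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=4)

for batch in dataloader:
    print(batch["inputs"].shape)  # (num_utterances, num_frames, num_features)
    break
```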
Training a Zipformer-Large CRCTC model
```bash
python zipformer_crctc/train.py --world-size 2 --num-epochs 2 --start-epoch 1 --start-batch 500000 \
    --use-fp16 1 --exp-dir /lustre07/scratch/lingjzhu/zipformer_exp/zipformer_large_crctc_0.5_scale --causal 0 \
    --full-libri True --max-duration 120 --use-transducer False --use-ctc True --use-cr-ctc True --base-lr 0.015 \
    --enable-spec-aug False --seed 2333 --wandb False --num-encoder-layers 4,3,4,5,4,4 \
    --feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 \
    --encoder-unmasked-dim 192,192,256,320,256,192 --decoder-dim 1024 --joiner-dim 1024 \
    --query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6 --num-buckets 8 --num-workers 4 \
    --unsup-cr-ctc-loss-scale 0.5 --use-unsup-cr-ctc True
```
Remove diacritics
```bash
python zipformer_crctc/train.py --world-size 2 --num-epochs 2 --start-epoch 1 --start-batch 500000 --use-fp16 1 \
    --exp-dir /lustre07/scratch/lingjzhu/zipformer_exp/zipformer_large_crctc_0.5_scale_no_diacritics --causal 0 \
    --full-libri True --max-duration 120 --use-transducer False --use-ctc True --use-cr-ctc True --base-lr 0.015 \
    --enable-spec-aug False --seed 2333 --wandb False --num-encoder-layers 4,3,4,5,4,4 \
    --feedforward-dim 768,768,1536,2048,1536,768 --encoder-dim 512,512,768,1024,768,512 \
    --encoder-unmasked-dim 192,192,256,320,256,192 --decoder-dim 1024 --joiner-dim 1024 \
    --query-head-dim 64 --value-head-dim 48 --num-heads 6,6,6,8,6,6 \
    --num-buckets 8 --num-workers 4 --unsup-cr-ctc-loss-scale 0.5 --use-unsup-cr-ctc True --remove-diacritics True
```