- This repository contains a set of scripts and data types to prepare pronunciation dictionaries and audio/transcript files for alignment using the Montreal Forced Aligner (MFA).
Ensure all of the following programs are discoverable on your terminal PATH. Installation instructions are linked where necessary:
- sph2pipe (for .sph to .wav conversion)
- praat (for resampling)
- sox (for verifying audio metadata)
- yq (optional, for editing .yaml configuration)
- Montreal Forced Aligner (typically installed via conda)
- Clone this repository:
  git clone https://github.com/michaelhaaf/mfa-workbook.git
- Run pip install -r requirements.txt in your preferred Python environment
- Ensure your preferred Python environment is version 3.9 or greater
- Run python -m unittest to see sample use cases for each script/source file.
This guide assumes you are starting with a corpus in the IARPA format:
corpus/
|-- scripted/
|   |-- reference_materials/
|   |   |-- lexicon.txt
|   |   `-- lexicon.sub-train.txt
|   `-- training/
|       |-- audio/
|       |   |-- recording1.sph
|       |   |-- recording2.sph
|       |   `-- ...
|       `-- transcript_roman/
|           |-- recording1.txt
|           |-- recording2.txt
|           `-- ...
To use MFA to align the transcripts with the recordings, however, you need the following corpus structure:
|-- pronunciation_dictionary.txt
`-- textgrid_corpus/
    |-- recording1.wav
    |-- recording1.TextGrid
    |-- recording2.wav
    |-- recording2.TextGrid
    `-- ...
That is, the following steps need to be taken:
- Convert bulk generic audio files to 16kHz .wav files
- Convert bulk .txt transcripts to .TextGrids
- Convert the IARPA corpus lexicon to an MFA-ready pronunciation dictionary
- Prepare all of the above for alignment with MFA
Instructions for each step, using the code in this repository, are given below. You can follow the steps with the data contained in the sample-data directory to get a sense of the process. Sample results for this dataset are also given in the sample-data directory.
IARPA corpus audio files are stored as .sph files with an 8kHz sample rate. These files need to be converted to .wav and resampled to 16kHz to be usable by MFA.
There are many tools for converting and resampling audio, but few that simultaneously (1) convert .sph to .wav, (2) resample to 16kHz without corrupting the pitch/speed of the audio, and (3) handle gigabytes of audio files in bulk without running into RAM issues.
The bulk_sph_resample script performs all three functions using (1) sph2pipe for conversion, (2) praat for resampling, and (3) bash to manage praat scripts as independent shell processes. This allows praat to resample each file individually, preventing all files in the directory from being loaded into memory at the same time.
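For reference, the core per-file pipeline looks roughly like the following Python sketch. The real implementation is the bash script described below; the sph2pipe and praat invocations shown here are the standard ones but should be checked against your installed versions, and all paths are illustrative:

# Rough sketch of the per-file pipeline in bulk_sph_resample.
# Assumes sph2pipe and praat are on PATH; file/dir names are illustrative.
import subprocess
from pathlib import Path

# Minimal Praat script: read one sound, resample to 16 kHz, save as .wav
PRAAT_SCRIPT = """\
form Resample
    sentence inPath
    sentence outPath
endform
Read from file: inPath$
Resample: 16000, 50
Save as WAV file: outPath$
"""

def convert_and_resample(sph_path: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    tmp_wav = out_dir / (sph_path.stem + ".8k.wav")
    out_wav = out_dir / (sph_path.stem + ".wav")
    praat_script = out_dir / "resample.praat"
    praat_script.write_text(PRAAT_SCRIPT)
    # (1) convert .sph -> .wav at the original 8 kHz rate
    subprocess.run(["sph2pipe", "-f", "wav", str(sph_path), str(tmp_wav)], check=True)
    # (2) resample to 16 kHz; one praat process per file keeps memory use flat
    subprocess.run(["praat", "--run", str(praat_script), str(tmp_wav), str(out_wav)], check=True)
    # (3) clean up the 8 kHz intermediate
    tmp_wav.unlink()

if __name__ == "__main__":
    for sph in sorted(Path("corpus/scripted/training/audio").glob("*.sph")):
        convert_and_resample(sph, Path("textgrid_corpus"))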
The script is interactive. Open a terminal in this repository and run:
$ bash ./scripts/bulk_sph_resample ./corpus/scripted/training/audio/
./corpus/scripted/training/audio/ should be replaced by the path to the directory where your .sph audio files are stored. Once run, the script will begin an interactive session. The process should look something like the following:
Checking if there are .sph files in sample-data/iarpa_corpus/scripted/training/audio/...
Found 10 .sph files (note: >1000, some steps may take several minutes)
Convert .sph files to .wav? (y/n)
y
Enter the path of the directory where the new .wav files will be stored:
textgrid_corpus
mkdir: created directory 'textgrid_corpus'
Copying files...
Converting to .wav...
Cleaning up .sph copies...
Resampling...
Resampling complete, see results in textgrid_corpus
Note that this script could take anywhere from several minutes to a few hours (the resampling portion especially) depending on the number of audio files you are processing. Once it is complete, the generated .wav audio files are ready for MFA.
You can verify that the conversion/resampling was successful by checking the new .wav files using sox --info:
$ find sample-data/textgrid_corpus/ -name "*.wav" -exec sox --info {} \; | grep -e "Input File" -e "Sample Rate"
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_38698_20111025_181550_C6_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_90313_20111019_153045_O3_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_84543_20111124_194834_SC_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_94149_20111027_122829_L1_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_74395_20111117_132438_T2_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_31441_20111026_001007_C5_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_34961_20111027_173059_S5_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_79495_20111017_194334_O2_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_92735_20111024_165237_SB_scripted.wav'
Sample Rate : 16000
Input File : 'sample-data/textgrid_corpus/BABEL_BP_101_31393_20111018_151856_M1_scripted.wav'
Sample Rate : 16000
Your audio files are now ready for MFA.
IARPA transcript files are stored as .txt and need to be converted to .TextGrid.
The Python library praatio has useful utilities for performing this conversion. The txt-to-textgrid.py script makes use of these utilities to convert a specified directory of .txt files to the .TextGrid format. The script also converts IARPA tags (e.g. '<breath>') to their MFA equivalent (e.g. '{LG}') along with other syntax-related substitutions.
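In case it helps to see the praatio calls in isolation, here is a minimal sketch of the conversion idea. The tag map, tier name, and fixed duration below are illustrative assumptions, not the script's actual values; see scripts/txt-to-textgrid.py for the real substitutions and timing, and note the calls match the praatio 5.x API:

# Minimal sketch of a txt -> TextGrid conversion with praatio (5.x API).
# The tag map, tier name, and duration are illustrative assumptions.
from praatio import textgrid

TAG_MAP = {"<breath>": "{LG}"}  # illustrative subset of the IARPA -> MFA substitutions

def txt_to_textgrid(txt_path: str, tg_path: str, duration: float) -> None:
    with open(txt_path, encoding="utf-8") as f:
        text = f.read().strip()
    for iarpa_tag, mfa_tag in TAG_MAP.items():
        text = text.replace(iarpa_tag, mfa_tag)
    # A single interval tier spanning the recording, holding the transcript
    tier = textgrid.IntervalTier("utterances", [(0.0, duration, text)], 0.0, duration)
    tg = textgrid.Textgrid()
    tg.addTier(tier)
    tg.save(tg_path, format="long_textgrid", includeBlankSpaces=True)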
To use txt-to-textgrid.py, open a terminal in this repository and run:
$ python3 scripts/txt-to-textgrid.py --input corpus/scripted/training/transcript_roman/ --dest textgrid_corpus/
Processing BABEL_BP_101_38698_20111025_181550_C6_scripted.txt...
Processing BABEL_BP_101_84543_20111124_194834_SC_scripted.txt...
Processing BABEL_BP_101_94149_20111027_122829_L1_scripted.txt...
Processing BABEL_BP_101_34961_20111027_173059_S5_scripted.txt...
Processing BABEL_BP_101_31441_20111026_001007_C5_scripted.txt...
Processing BABEL_BP_101_74395_20111117_132438_T2_scripted.txt...
Processing BABEL_BP_101_79495_20111017_194334_O2_scripted.txt...
Processing BABEL_BP_101_92735_20111024_165237_SB_scripted.txt...
Processing BABEL_BP_101_31393_20111018_151856_M1_scripted.txt...
Processing BABEL_BP_101_90313_20111019_153045_O3_scripted.txt...
BABEL_BP_101_94149_20111027_122829_L1_scripted.TextGrid
BABEL_BP_101_84543_20111124_194834_SC_scripted.TextGrid
BABEL_BP_101_79495_20111017_194334_O2_scripted.TextGrid
BABEL_BP_101_31393_20111018_151856_M1_scripted.TextGrid
BABEL_BP_101_92735_20111024_165237_SB_scripted.TextGrid
BABEL_BP_101_74395_20111117_132438_T2_scripted.TextGrid
BABEL_BP_101_34961_20111027_173059_S5_scripted.TextGrid
BABEL_BP_101_31441_20111026_001007_C5_scripted.TextGrid
BABEL_BP_101_90313_20111019_153045_O3_scripted.TextGrid
BABEL_BP_101_38698_20111025_181550_C6_scripted.TextGrid
The --input and --dest arguments should be replaced by the path to the directory where your .txt transcript files are stored and your desired output directory, respectively. Once run, the script completes automatically; the process should look something like the output shown above.
Confirm that your TextGrids look correct by opening them in a text editor.
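If you prefer a programmatic spot check, praatio can reopen the generated files. This again assumes the praatio 5.x API (attribute names can differ in other versions) and that textgrid_corpus/ is the destination used above:

# Quick spot check: reopen each generated TextGrid and print its time range.
from pathlib import Path
from praatio import textgrid

for tg_path in sorted(Path("textgrid_corpus").glob("*.TextGrid")):
    tg = textgrid.openTextgrid(str(tg_path), includeEmptyIntervals=True)
    print(f"{tg_path.name}: {tg.minTimestamp:.2f}s to {tg.maxTimestamp:.2f}s")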
Corpus lexicons can be converted to MFA dictionaries using the syllabify module. The MFA dictionaries stored in sample-data/pronunciation-dictionaries were generated using syllabify and the example lexicons provided in sample-data, like so:
# Merge all lexicons into one file
find sample-data/iarpa_canto_corpus/ -type f -name "*lexicon*" -exec cat {} + >> ./merged_canto_lexicon.txt
# Run syllabify on the combined lexicon
python ./syllabify.py -i ./merged_canto_lexicon.txt -o ./output.txt -f iarpa_canto
# Compare MFA dictionary against the existing gold standard:
diff ./output.txt ./sample-data/pronunciation-dictionaries/tones/canto_pd.txt
The default behavior of syllabify is to include tones in the output pronunciation dictionary. This can be changed in configuration: config/mfa.yaml controls the formatting scheme for the MFA dictionary, and its include_tones option can be set to true or false as desired. You can edit the file with any text editor; alternatively, .yaml processing tools like yq make the change very quick:
# update the property include_tones
yq -i '.include_tones = false' config/mfa.yaml
# verify that the include_tones property is false
yq e '.' config/mfa.yaml
# to switch tones back on
yq -i '.include_tones = true' config/mfa.yaml
With this configuration change, the syllabify MFA pronunciation dictionary output will no longer include tones.
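If you'd rather stay in Python, the same edit can be made with PyYAML. This assumes pyyaml is installed; the include_tones key comes from config/mfa.yaml as described above:

# Flip include_tones in config/mfa.yaml with PyYAML instead of yq.
import yaml

with open("config/mfa.yaml") as f:
    cfg = yaml.safe_load(f)
cfg["include_tones"] = False  # or True to switch tones back on
with open("config/mfa.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)  # keep the original key order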
More configuration options and instructions for adding support for new corpora can be found in the config and src directories.
If you followed the steps shown above, you should have a textgrid_corpus directory of prepared audio and transcript files, plus a pronunciation dictionary, ready to be used by MFA. You can validate the results using MFA as follows:
$ conda activate aligner
(aligner) $ mfa validate ./textgrid_corpus/ ./pd.txt
If the previous steps worked, MFA's diagnostics should complete successfully. You can continue with the rest of the training/aligning tutorial here.
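Once validation passes, alignment follows the same pattern. This is the standard MFA CLI usage; the acoustic model path and output directory below are placeholders, so consult the MFA documentation for the details of your language/model:

(aligner) $ mfa align ./textgrid_corpus/ ./pd.txt /path/to/acoustic_model.zip ./aligned_textgrids/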
Note well: the sample-data/ provided in this repository (10 randomly chosen ~10-second speech files) is nowhere near enough data to train a performant model. With this data we are simply verifying that there are no syntax/format issues in your workflow. Once that is verified, you will need to acquire more data to train a performant model. See this blog post for rough guidelines on the magnitude of data required to train a model for alignment/general usage.