This is a PyTorch implementation of the paper "Listen, Attend and Spell" by Chan et al., 2015, from Google Brain and Carnegie Mellon University.
It is an attention-based encoder-decoder model that transcribes speech utterances to text character by character. It uses pyramidal RNN layers to shorten the input utterance and an attention mechanism to decode the information captured by the encoder.
The overall architecture is shown below (figures taken from the paper):
- Pyramidal LSTM Layer -> An utterance is typically long, often exceeding 1000 time steps. This makes it hard for the attention mechanism to focus on the right part of the speech during decoding and slows convergence. To tackle this problem, each pyramidal LSTM layer halves the length of its input by concatenating each pair of adjacent time steps. Essentially, after one pyramidal LSTM layer, a batch of size `(batch_size, seq_len, feat_size)` becomes `(batch_size, seq_len / 2, feat_size * 2)`. If `seq_len` is odd, we simply drop the last time step.
- Locked Dropout -> We implement our own locked dropout layer and insert it between pyramidal LSTM layers. Locked dropout applies the same dropout mask to every time step. This is an efficient way to improve the generalization of the encoder. The encoder's baseline architecture is therefore `[lstm -> plstm -> locked-dropout -> plstm -> locked-dropout -> plstm]`.
- Attention Mechanism -> The model uses an attention mechanism to help the decoder focus on the right part of the speech utterance during decoding. There are many ways of implementing attention. In this implementation, we use linear transformations to produce `attention_key` and `attention_value`, which are coupled with `query` at each decoding time step.
- Teacher Forcing -> Learning is difficult at the early stage: if the decoding at the current time step `t` is wrong, the wrong character's embedding is fed into the decoding at time step `t+1`, making it even harder to get right. To tackle this problem, we use teacher forcing. Essentially, with a high probability (90% initially), the embedding of `y_{t-1}` fed into the decoding of `y_t` is the ground truth, regardless of what the model predicted at the previous time step. As training progresses, we can gradually decrease the teacher forcing rate and let the model rely wholly on itself.
- Beam Search -> To explore the possible decoding paths more fully, we implement beam search. However, since it becomes quite slow as the beam width grows, we only use it during validation and inference; greedy search is applied during the training epochs.
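
The shape change described for the pyramidal LSTM layer can be sketched as follows. This is a minimal illustration on a padded tensor, not the code in `model.py`; the function name `pyramidal_reduce` is hypothetical, and a real layer would typically also handle packed sequences and halve the stored sequence lengths.

```python
import torch


def pyramidal_reduce(x: torch.Tensor) -> torch.Tensor:
    """Halve the time axis by concatenating adjacent time steps along the feature axis."""
    batch_size, seq_len, feat_size = x.shape
    even_len = seq_len - (seq_len % 2)  # drop the last time step if seq_len is odd
    x = x[:, :even_len, :]
    # Row-major reshape places step 2i and step 2i+1 side by side in one feature vector.
    return x.reshape(batch_size, even_len // 2, feat_size * 2)


x = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
y = pyramidal_reduce(x)
print(y.shape)  # torch.Size([2, 2, 6])
```

The output of this reduction would then be fed to the next LSTM layer, so three such layers shorten the sequence by a factor of 8.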
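
A locked dropout layer of the kind described above might look like the sketch below. It assumes batch-first padded input of shape `(batch_size, seq_len, feat_size)`; the class name and the default rate `p=0.5` are illustrative assumptions, not values taken from this repository.

```python
import torch
import torch.nn as nn


class LockedDropout(nn.Module):
    """Dropout that samples one mask per sequence and reuses it at every time step."""

    def __init__(self, p: float = 0.5):  # p is an assumed default
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        # Mask has a singleton time axis, so broadcasting shares it across time steps.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)  # rescale to keep the expected activation unchanged
```

Because the mask is broadcast over the time axis, a feature that is dropped at one time step is dropped for the whole utterance.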
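
One possible reading of the attention description is sketched below. The linear projections for key and value follow the text; the dot-product scoring, the padding mask, and all names (`Attention`, `project`, `lengths`) are assumptions made for illustration and may differ from `attention.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Dot-product attention over linearly projected encoder outputs (illustrative)."""

    def __init__(self, enc_dim: int, kv_dim: int = 128):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, kv_dim)
        self.value_proj = nn.Linear(enc_dim, kv_dim)

    def project(self, enc_out):
        # Computed once per utterance and reused at every decoding time step.
        return self.key_proj(enc_out), self.value_proj(enc_out)

    def forward(self, query, key, value, lengths):
        # query: (B, kv_dim); key, value: (B, T, kv_dim); lengths: (B,) valid time steps
        energy = torch.bmm(key, query.unsqueeze(2)).squeeze(2)        # (B, T)
        positions = torch.arange(key.size(1), device=key.device)
        padding = positions.unsqueeze(0) >= lengths.unsqueeze(1)      # (B, T)
        energy = energy.masked_fill(padding, float("-inf"))           # ignore padded steps
        weights = F.softmax(energy, dim=1)
        context = torch.bmm(weights.unsqueeze(1), value).squeeze(1)   # (B, kv_dim)
        return context, weights
```

The returned `weights` are what an attention plot would visualize: one row per decoding time step, one column per encoder time step.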
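
The difference between greedy and beam search can be seen in a small, framework-free sketch. Here `step_fn` is a hypothetical stand-in for one decoder step that returns next-token log-probabilities; the toy table and the absence of length normalization are assumptions of this example, not properties of `search.py`.

```python
import math


def beam_search(step_fn, sos, eos, beam_width=3, max_len=20):
    """Return (log_prob, sequence) of the best hypothesis found.

    step_fn(prefix) -> dict mapping each candidate next token to its log-probability.
    """
    beams = [(0.0, [sos])]
    finished = []
    for _ in range(max_len):
        candidates = [
            (score + logp, seq + [tok])
            for score, seq in beams
            for tok, logp in step_fn(seq).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:  # keep only the top hypotheses
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len still compete
    return max(finished, key=lambda c: c[0])


# Toy next-token distribution keyed by the previous token only.
TABLE = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}


def toy_step(prefix):
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}


print(beam_search(toy_step, "<sos>", "<eos>", beam_width=1)[1])  # greedy path via "a"
print(beam_search(toy_step, "<sos>", "<eos>", beam_width=2)[1])  # finds the path via "b"
```

With `beam_width=1` the search is greedy and commits to `a`, ending with probability 0.3; with `beam_width=2` it keeps `b` alive and finds the sequence with probability 0.4. The cost grows with the beam width because every kept hypothesis needs its own decoder step.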
src/
attention.py (attention module)
model.py (locked dropout, pyramidal lstm layer, encoder, decoder)
trainer.py (train, valid, inference and attention/gradient plot helper)
dataset.py (dataset, dataloader)
search.py (greedy search, beam search)
utils.py (letter list, index to letter trans dictionary)
main.py (main driver, hyperparameter setting)
data/ (the train, valid, test dataset storage)
checkpoint/ (model checkpoint during training)
output/ (inference result)
pic/
Listen-Attend-Spell.pdf
requirements.txt
README.md
The dataset is too large to host on GitHub, so we packaged it and uploaded it to Google Drive for download (link). Please unzip the archive and put the files in the `data/` folder.
- Check and install dependent packages: `pip install -r requirements.txt`
- `cd` to the `src` folder
- (Optionally) Change the hyperparameters as needed in the `main.py` script
- To train, run `python3 main.py train`
- To run inference, run `python3 main.py infer`
We adopt the baseline architecture setting as described in the paper.
Encoder:
- 1 layer of normal LSTM followed by 3 layers of pyramidal LSTM of `hidden dim=256`. This reduces the input utterance length by a factor of 8.
Attention:
- linear transformation of `key_value dim=128`
Decoder:
- 2 layers of LSTM cells of `hidden dim=512`
- `character embedding dim=128`
Optimization:
- `batchsize=64`
- `Adam` optimizer of `lr=0.001`
- `ReduceLROnPlateau` scheduler of reduce factor `0.75` with `patience=2`, which starts to step only after the first `10` epochs
- train for `40` epochs
- Teacher Forcing rate remains `95%` for the first `10` epochs and then gradually decreases to `70%` by linear interpolation.
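
The teacher forcing schedule above can be written as a small function. This sketch assumes 0-based epochs and that the rate reaches 70% exactly at the final (40th) epoch; the text does not state the endpoint of the interpolation, and the function names are hypothetical.

```python
import random


def teacher_forcing_rate(epoch, hold_epochs=10, total_epochs=40, start=0.95, end=0.70):
    """Rate for a 0-based epoch: constant at `start`, then linear decay to `end`."""
    if epoch < hold_epochs:
        return start
    # Fraction of the decay phase completed; equals 1.0 at the last epoch (assumption).
    frac = min(1.0, (epoch - hold_epochs + 1) / (total_epochs - hold_epochs))
    return start + frac * (end - start)


def pick_decoder_input(ground_truth_prev, predicted_prev, rate, rng=random):
    """With probability `rate`, feed the ground truth of step t-1 into step t."""
    return ground_truth_prev if rng.random() < rate else predicted_prev


for epoch in (0, 9, 10, 25, 39):
    print(epoch, round(teacher_forcing_rate(epoch), 4))
```

Whether the choice is made per time step, as here, or once per batch is a design decision the text leaves open.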
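
The optimizer and scheduler settings listed above could be wired up roughly as follows. The helper names, the 0-based epoch convention, and the choice of validation distance as the monitored metric are assumptions; the `nn.Linear` model in the usage lines is only a placeholder for the actual network.

```python
import torch


def build_optimizer(model, lr=0.001, factor=0.75, patience=2):
    """Create Adam and a ReduceLROnPlateau scheduler with the listed settings."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=factor, patience=patience
    )
    return optimizer, scheduler


def end_of_epoch(optimizer, scheduler, epoch, val_distance, warmup_epochs=10):
    """Step the scheduler only after the warm-up epochs; return the current learning rate."""
    if epoch >= warmup_epochs:
        scheduler.step(val_distance)
    return optimizer.param_groups[0]["lr"]


model = torch.nn.Linear(4, 2)  # placeholder model for illustration only
optimizer, scheduler = build_optimizer(model)
```

With `patience=2`, the learning rate is multiplied by `0.75` once the monitored metric has failed to improve for more than two consecutive scheduler steps.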
The model should reach an average Levenshtein distance below 30 on the validation set after 40 epochs of training.
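
For reference, the Levenshtein distance counts the minimum number of character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch of the metric is shown below, assuming it is computed at the character level between each predicted and reference transcript and averaged over utterances; the repository may compute it differently or rely on a library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


def mean_distance(pairs):
    """Average distance over (prediction, reference) pairs."""
    return sum(levenshtein(p, r) for p, r in pairs) / len(pairs)


print(levenshtein("kitten", "sitting"))  # 3
```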
Feel free to email me at yukunj@cs.cmu.edu with questions or to discuss this implementation.

