Grapheme-to-phoneme (G2P) conversion predicts a word's phonetic pronunciation from its written form, a core component of text-to-speech and speech recognition systems. This project implements an LSTM encoder–decoder model, trained on CMUdict, that converts grapheme sequences to phoneme sequences.
The bidirectional-LSTM encoder–decoder model uses Luong attention, scheduled sampling during training, and greedy decoding at inference. It achieves 77% sequence-level accuracy (23% WER) on the CMUdict test set. Beam search did not improve on greedy decoding when evaluated on the test set.
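As a rough illustration of the attention mechanism named above, here is a minimal NumPy sketch of Luong-style dot-product attention for a single decoder step. This is an assumption about the scoring variant (Luong "dot" score); the project's actual implementation, shapes, and function names may differ.

```python
import numpy as np

def luong_dot_attention(decoder_state, encoder_states):
    """Single-step Luong dot attention (illustrative, not the repo's code).

    decoder_state:  (hidden,)          current decoder hidden state
    encoder_states: (src_len, hidden)  encoder outputs, one per source grapheme
    """
    # Luong "dot" alignment score: score(h_dec, h_enc) = h_dec . h_enc
    scores = encoder_states @ decoder_state          # (src_len,)
    # Softmax over source positions (numerically stabilized)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of encoder states
    context = weights @ encoder_states               # (hidden,)
    return context, weights

# Toy check: the encoder state most aligned with the decoder state
# should receive the most attention mass.
enc = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [5.0, 0.0]])
dec = np.array([5.0, 0.0])
ctx, w = luong_dot_attention(dec, enc)
print(np.argmax(w))  # index of the most-attended source position
```

At inference, greedy decoding simply feeds the argmax phoneme of each step back in as the next input; scheduled sampling does the same during training with some probability, instead of always teacher-forcing the gold phoneme.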
**Loss vs Sequence Accuracy:** Training loss plotted against sequence-level accuracy across training epochs.

**Loss vs Teacher Forcing:** Training-loss evolution under varying teacher-forcing probabilities during scheduled sampling.

