PLAYING FOR YOU: TEXT PROMPT-GUIDED JOINT AUDIO-VISUAL GENERATION FOR NARRATING FACES USING MULTI-ENTANGLED LATENT SPACE
The checkpoints for our model, which can be loaded with `torch.load()` in `train.py`, are available at the following Google Drive link:
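A minimal sketch of the usual `torch.load()` / `load_state_dict()` pattern mentioned above. `TinyModel` and `checkpoint.pt` are placeholders: substitute the files downloaded from the Google Drive link and the model classes defined in `audio_model.py` / `video_model.py`.

```python
import torch
import torch.nn as nn

# Placeholder model; the real definitions live in audio_model.py / video_model.py.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

# Save a dummy checkpoint as a state_dict (the common PyTorch convention) ...
torch.save(TinyModel().state_dict(), "checkpoint.pt")

# ... then restore it the way train.py would: torch.load + load_state_dict.
model = TinyModel()
state = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
print(sorted(state.keys()))
```

If the downloaded checkpoint is instead a dict wrapping the weights (e.g. under a `"state_dict"` key), inspect `state.keys()` first and load the inner dict.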
| Source Image | Audio Profile | Prompt Text | Generated Output | Description |
|---|---|---|---|---|
| *(image)* | Audio | IN THE DISTRICT COURT LITIGATION ULTIMATELY THERE WERE A NUMBER OF UNANSWERED QUESTIONS AS YOU KNOW A NUMBER OF GAPS THAT WE BELIEVE COULD BE FILLED BY THE GRAND JURY MATERIAL | *(video)* | Sample generation of a man |
| *(image)* | Audio | OR PRODUCERS OR PROCESSORS OR ACCOUNTANTS OR AGRONOMISTS THE LIGHT BULB GOES ON I'VE SEEN IT OVER AND OVER AND PEOPLE WILL SAY WELL WE'RE DOING THIS | *(video)* | Sample generation of an older man |
| *(image)* | Audio | GRANTED THAT I'VE BEEN AROUND FOR A WHILE AND DECIDED THEY'LL GET ME IN NOVEMBER SO TO SPEAK BUT I DON'T THINK WE CAN REALLY GO BACK AND RELITIGATE THAT ASPECT OF IT | *(video)* | Sample generation of another older man |
| *(image)* | Audio | EVERYBODY THIS IS SENATOR MARSHA BLACKBURN FROM THE STATE OF TENNESSEE AND I'M JUST SO EXCITED TO BE A PART OF THIS CELEBRATION | *(video)* | Sample generation of a woman |
| *(image)* | Audio | EVERYBODY THIS IS SENATOR MARSHA BLACKBURN FROM THE STATE OF TENNESSEE AND I'M JUST SO EXCITED TO BE A PART OF THIS CELEBRATION | *(video)* | Generation of the same woman from a degraded audio input, produced by reducing the bitrate, downsampling the audio, and adding distortion |
| *(image)* | Audio | SOMETIMES, THE BEST THING YOU CAN DO IS TO LET GO | *(video)* | Random child image from the internet, driven by an adult female voice profile |
| *(image)* | Audio | BE THE CHANGE YOU WANT TO SEE IN THE WORLD, BROTHERS AND SISTERS. | *(video)* | Stable Diffusion-generated image of a child, driven by an adult male voice |
| *(image)* | Audio | I INVISCATE THE PAPER WITH GLUE TO CREATE MY ART PROJECT. | *(video)* | Voice profile of a child with the face of an adult man |
| *(image)* | Audio | PEOPLE DO WHAT THEY HATE FOR MONEY, AND USE THE MONEY TO DO WHAT THEY LOVE. | *(video)* | Generation of an Indian man |
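One row above uses a deliberately degraded audio input (reduced bitrate, downsampling, added distortion). A minimal sketch of such a degradation pipeline, assuming a mono waveform in `[-1, 1]`; the function name and parameters are ours, not the repository's:

```python
import numpy as np

def degrade_audio(wave: np.ndarray, sr: int, factor: int = 4,
                  bits: int = 8, clip_gain: float = 4.0):
    """Sketch of the three degradations: downsample, re-quantize, distort."""
    # Naive downsampling: keep every `factor`-th sample (no anti-aliasing).
    down = wave[::factor]
    # Re-quantize to `bits` bits to mimic a reduced bitrate.
    levels = 2 ** bits
    quant = np.round((down * 0.5 + 0.5) * (levels - 1)) / (levels - 1) * 2 - 1
    # Overdrive and hard-clip to add distortion.
    distorted = np.clip(quant * clip_gain, -1.0, 1.0)
    return distorted, sr // factor

# Example on a one-second 440 Hz sine at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy, new_sr = degrade_audio(clean, sr)
print(new_sr, noisy.shape)  # → 4000 (4000,)
```

A production version would use a proper resampler (e.g. with anti-aliasing filtering) rather than sample dropping; this sketch only illustrates the kind of corruption applied.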
- The model checkpoints can be accessed via the Google Drive link above.
- Data samples can be found at the following link: Data
- Install the required packages into your environment with `pip install -r requirements.txt`
- We will soon be publishing our model on Hugging Face 🤗
- `preprocess.py` processes the multimodal inputs required for our model.
- `helper.py` contains our transformer architectures and other helper functions.
- `get_videos.py` makes the train-test split, clips the videos to the desired length, and saves the outputs to the desired folder.
- `audio_model.py` and `video_model.py` contain the definitions of our models, which are called in `train.py`.
- The `assets` folder has some example outputs, which can be viewed in the `README.md` file of this repository.
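The train-test split performed by `get_videos.py` can be sketched as follows; the directory layout, file extension, split ratio, and function name are assumptions for illustration, not the repository's actual defaults:

```python
import random
import tempfile
from pathlib import Path

def split_videos(video_dir: str, train_frac: float = 0.8, seed: int = 0):
    """Shuffle the video files deterministically and split into train/test lists."""
    files = sorted(Path(video_dir).glob("*.mp4"))
    rng = random.Random(seed)  # fixed seed -> reproducible split
    rng.shuffle(files)
    cut = int(len(files) * train_frac)
    return files[:cut], files[cut:]

# Demo with dummy files in a temporary directory.
tmp = tempfile.mkdtemp()
for i in range(10):
    (Path(tmp) / f"clip{i}.mp4").touch()
train, test = split_videos(tmp)
print(len(train), len(test))  # → 8 2
```

The real script additionally clips each video to the desired length before saving; that step depends on the video tooling used and is omitted here.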
