We want to build a model that translates continuous American Sign Language (ASL) signing into English text. For this use case, we conducted fine-tuning experiments with two large vision-language models:
- LLaVA-NeXT-Video
- Video-LLaVA
We use the How2Sign dataset, which consists of ASL video footage aligned with English sentences. It includes RGB videos (green-screen frontal and side views) and 3D keypoints for the hands, body, and face. To manage computational constraints, we focused on the RGB frontal-view videos for fine-tuning.
The repository is organized as follows:

- `data`: Contains the cleaned CSV file used as the source dataset.
- `data_profiling`: Includes the code for data cleaning.
- `llava-next-video`: Scripts for fine-tuning the LLaVA-NeXT-Video model on the How2Sign dataset, along with quantitative analysis of the trained model.
- `video-llava`: Scripts for fine-tuning the Video-LLaVA model on the How2Sign dataset, as well as inference scripts for the trained model.
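For orientation, the cleaned CSV in `data` can be inspected with pandas. A minimal sketch, assuming the file is named `valid_clips.csv` (the name referenced by the evaluation output below); the actual file name and column layout may differ:

```python
import pandas as pd

# Load the cleaned source CSV (path and schema are assumptions based on the
# repository description; adjust to the actual file).
clips = pd.read_csv("data/valid_clips.csv")

# Each row is expected to pair a video clip identifier with its aligned
# English sentence.
print(clips.columns.tolist())
print(clips.head())
```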
Install the dependencies:

```bash
pip install -r requirements.txt
```

Navigate to the `huggingface_trainer` directory within `llava-next-video` and execute the following commands:

```bash
cd llava-next-video/huggingface_trainer
sbatch train.sh
```

We used Slurm jobs to trigger the training runs.
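Under the hood, the preprocessing for fine-tuning boils down to turning each (clip, sentence) pair into model inputs via the Hugging Face processor. A minimal sketch with dummy frames; the checkpoint name, prompt wording, and frame sampling are assumptions, not the repository's exact code:

```python
import numpy as np
from transformers import LlavaNextVideoProcessor

# Processor for the base checkpoint (assumed to be the HF-converted weights).
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

# Hypothetical instruction; the actual fine-tuning prompt may differ.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Translate the ASL signing in this video into English."},
            {"type": "video"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Placeholder clip: 8 RGB frames. In practice, frames are sampled from the
# How2Sign frontal-view video for the given video_id.
frames = np.zeros((8, 336, 336, 3), dtype=np.uint8)

inputs = processor(text=prompt, videos=frames, return_tensors="pt")
print({k: tuple(v.shape) for k, v in inputs.items()})
```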
Training produces the following artifacts:

- A `logs` folder will be created to store training logs.
- An `output` directory will be generated to store checkpoints from training.
- A `generated_texts.csv` file will be created for validation purposes, with the following columns:
- `id`: Incremental ID for each data item.
- `video_id`: Unique identifier for the video clip, also present in `valid_clips.csv`.
- `generated`: The text generated by the model for the specific clip.
- `true`: The expected text for the specific clip.
- `epoch`: The epoch at which the evaluation occurred.
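For example, the latest epoch's generations can be compared with the reference sentences using pandas; a small sketch (the path assumes the file is written next to the training script):

```python
import pandas as pd

# Path is an assumption; adjust if the file is written elsewhere.
generated = pd.read_csv("llava-next-video/huggingface_trainer/generated_texts.csv")

# Inspect the most recent epoch's outputs next to the reference sentences.
last_epoch = generated["epoch"].max()
latest = generated[generated["epoch"] == last_epoch]
print(latest[["video_id", "generated", "true"]].head())
```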
Run the evaluation script to calculate validation scores:
```bash
cd llava-next-video/huggingface_trainer
python llava_next_video_eval.py
```

A `validation_scores.csv` file will be generated containing the following metrics after every epoch (see the sketch after this list for one way to compute them):

- ROUGE-1
- ROUGE-2
- ROUGE-L
- BLEU
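These scores can also be reproduced outside the provided script. A hedged sketch using the Hugging Face `evaluate` library; the repository's `llava_next_video_eval.py` may use a different implementation and different file paths:

```python
import evaluate
import pandas as pd

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Path is an assumption (relative to huggingface_trainer).
df = pd.read_csv("generated_texts.csv")

rows = []
for epoch, group in df.groupby("epoch"):
    preds = group["generated"].fillna("").tolist()
    refs = group["true"].fillna("").tolist()
    r = rouge.compute(predictions=preds, references=refs)
    b = bleu.compute(predictions=preds, references=[[t] for t in refs])
    rows.append({
        "epoch": epoch,
        "rouge1": r["rouge1"],
        "rouge2": r["rouge2"],
        "rougeL": r["rougeL"],
        "bleu": b["bleu"],
    })

pd.DataFrame(rows).to_csv("validation_scores.csv", index=False)
```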
Navigate to the `video-llava` directory and execute the following commands:

```bash
cd video-llava
sbatch train.sh
```

Inference for the Video-LLaVA model can be performed using the Jupyter notebook located at:

```
video-llava/inference.ipynb
```
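If you prefer a script to the notebook, Video-LLaVA inference follows the standard Hugging Face pattern. A minimal sketch with dummy frames and the base checkpoint; the notebook presumably loads the fine-tuned weights instead, and the prompt wording here is an assumption:

```python
import numpy as np
import torch
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

# Swap in the path of a fine-tuned checkpoint to evaluate the trained model.
model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder clip: 8 RGB frames; replace with frames sampled from a real
# How2Sign clip.
frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)

# Video-LLaVA expects a USER/ASSISTANT prompt with a <video> placeholder.
prompt = "USER: <video>\nTranslate the ASL signing in this video into English. ASSISTANT:"

inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```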