A deep learning-based application that utilizes Computer Vision and Natural Language Processing to generate descriptive captions for images.
This project implements an Encoder-Decoder neural network architecture to solve the problem of image captioning. The model takes an image as input and outputs a natural language sentence describing the visual content. The system integrates a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (LSTM) for text generation, wrapped in a Flask web application for user interaction.
- Deep Learning Architecture: Utilizes a "Merge" architecture combining CNN and RNN.
- Transfer Learning: Leverages the VGG-16 model pre-trained on ImageNet for robust image feature extraction.
- Advanced Decoding: Implements both Greedy Search and Beam Search algorithms to optimize caption quality.
- Web Interface: A user-friendly Flask web app allowing users to upload images and view generated captions in real-time.
- Evaluation: Performance measured using BLEU (Bilingual Evaluation Understudy) scores (1-4) and human evaluation.
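As a rough illustration of the BLEU evaluation, BLEU-1 through BLEU-4 can be computed with NLTK's `corpus_bleu`; the tokenized captions below are placeholders, not data from this project.

```python
# Illustrative sketch: scoring a generated caption with BLEU-1..BLEU-4 via NLTK.
from nltk.translate.bleu_score import corpus_bleu

# One entry per test image: several tokenized reference captions and one generated caption.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "brown", "dog", "running", "along", "the", "shore"]]]
candidates = [["a", "dog", "runs", "on", "the", "sand"]]

print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, candidates, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4:", corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```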
The model follows a standard Encoder-Decoder structure:
- Encoder (Visual Layer):
  - We use VGG-16 with the last classification layer removed to extract a 4096-element feature vector from each image.
  - This vector is processed by a Dense layer (with ReLU) to compress the features.
- Decoder (Language Layer):
  - Embedding Layer: Converts the input text sequence into dense vectors.
  - LSTM Layer: A Long Short-Term Memory network processes the sequence data to handle long-term dependencies in sentences.
- Merge & Prediction:
  - The outputs of the Visual and Language layers are merged (using addition) and passed to a final Dense layer.
  - A Softmax function predicts the probability of the next word in the vocabulary.
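A minimal Keras sketch of this merge architecture, assuming precomputed 4096-element VGG-16 features; the `vocab_size` and `max_length` values are placeholders, not the project's actual configuration:

```python
# Minimal sketch of the merge architecture in Keras; vocab_size and max_length are placeholders.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length in tokens

# Visual layer: compress the precomputed 4096-element VGG-16 feature vector.
image_input = Input(shape=(4096,))
image_features = Dense(256, activation="relu")(image_input)

# Language layer: embed the partial caption and process it with an LSTM.
caption_input = Input(shape=(max_length,))
embedded = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = LSTM(256)(embedded)

# Merge & prediction: add the two representations, then predict the next word.
merged = add([image_features, caption_features])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```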
The model was trained and evaluated using the Flickr30k Dataset.
- Images: ~30,000 images.
- Captions: 5 different captions per image (describing key objects and scenes).
- Preprocessing: Text was tokenized, normalized (lowercase, punctuation removal), and padded to a maximum sequence length.
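A minimal sketch of this text-preprocessing step with Keras, assuming the captions are already loaded as a list of strings; the `startseq`/`endseq` markers are a common convention shown here for illustration, not necessarily the exact tokens used in this project.

```python
# Minimal sketch: cleaning, tokenizing, and padding captions (illustrative data).
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["A dog runs on the beach.", "Two children play soccer in a park."]  # placeholder data

# Normalize: lowercase, strip punctuation, and add start/end markers.
table = str.maketrans("", "", string.punctuation)
cleaned = ["startseq " + c.lower().translate(table) + " endseq" for c in captions]

# Tokenize and pad every sequence to the longest caption in the corpus.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
```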
We compared two caption generation methods:
- Greedy Search: Selects the most probable word at each step.
- Beam Search: Explores the k most probable sequences (beams) at each step to search for a higher-scoring caption.
Observation: Beam Search consistently produced more descriptive and contextually accurate captions compared to the Greedy approach.
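A minimal sketch of greedy decoding; the `model`, `tokenizer`, `max_length`, and `startseq`/`endseq` markers follow the assumptions made in the earlier sketches, and `photo_features` is a (1, 4096) array of VGG-16 features.

```python
# Minimal sketch: greedy decoding of a caption from precomputed image features.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo_features, max_length):
    """Repeatedly pick the single most probable next word until 'endseq' or max_length."""
    index_to_word = {index: word for word, index in tokenizer.word_index.items()}
    caption = "startseq"
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([caption])[0]
        sequence = pad_sequences([sequence], maxlen=max_length, padding="post")
        probabilities = model.predict([photo_features, sequence], verbose=0)[0]
        word = index_to_word.get(int(np.argmax(probabilities)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```

Beam Search follows the same loop but, instead of taking a single argmax, keeps the k highest-scoring partial captions at each step and expands each of them.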
- Language: Python
- Deep Learning: TensorFlow, Keras
- Web Framework: Flask
- Image Processing: Pillow (PIL)
- Data Handling: NumPy, Pandas, TQDM
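For the web interface, a minimal Flask route of the kind described above might look like the following; the upload path, form markup, and `generate_caption` helper are illustrative placeholders, not the repository's actual code.

```python
# Illustrative sketch: a minimal Flask route for uploading an image and showing a caption.
import os
from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = "static/uploads"  # placeholder upload location

FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image">
  <input type="submit" value="Generate Caption">
</form>
"""

def generate_caption(image_path):
    # Placeholder: the real app would run VGG-16 feature extraction and the LSTM decoder here.
    return "a placeholder caption for " + os.path.basename(image_path)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        image_file = request.files["image"]
        os.makedirs(UPLOAD_DIR, exist_ok=True)
        path = os.path.join(UPLOAD_DIR, image_file.filename)
        image_file.save(path)
        return FORM + "<p>Caption: " + generate_caption(path) + "</p>"
    return FORM

if __name__ == "__main__":
    app.run(debug=True)
```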
KISHORE K | ANIRUDH | DANIEL | MANJUNATH