A deep learning-based application that utilizes Computer Vision and Natural Language Processing to generate descriptive captions for images.
This project implements an Encoder-Decoder neural network architecture to solve the problem of image captioning. The model takes an image as input and outputs a natural language sentence describing the visual content. The system integrates a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (LSTM) for text generation, wrapped in a Flask web application for user interaction.
- Deep Learning Architecture: Utilizes a "Merge" architecture combining CNN and RNN.
- Transfer Learning: Leverages the VGG-16 model pre-trained on ImageNet for robust image feature extraction.
- Advanced Decoding: Implements both Greedy Search and Beam Search algorithms to optimize caption quality.
- Web Interface: A user-friendly Flask web app allowing users to upload images and view generated captions in real-time.
- Evaluation: Performance measured using BLEU (Bilingual Evaluation Understudy) scores (1-4) and human evaluation.
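As a rough illustration of the BLEU evaluation, BLEU-1 through BLEU-4 can be computed with NLTK's `corpus_bleu`; the tokenized captions below are placeholders, not data from this project.

```python
# Illustrative sketch: scoring a generated caption with BLEU-1..BLEU-4 via NLTK.
from nltk.translate.bleu_score import corpus_bleu

# One entry per test image: several tokenized reference captions and one generated caption.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "brown", "dog", "running", "along", "the", "shore"]]]
candidates = [["a", "dog", "runs", "on", "the", "sand"]]

print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, candidates, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4:", corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```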
The model follows a standard Encoder-Decoder structure:
- Encoder (Visual Layer):
  - We use VGG-16 with the last classification layer removed to extract a 4096-element feature vector from each image.
  - This vector is processed by a Dense layer (with ReLU) to compress the features.
- Decoder (Language Layer):
  - Embedding Layer: Converts the input text sequence into dense vectors.
  - LSTM Layer: A Long Short-Term Memory network processes the sequence data to handle long-term dependencies in sentences.
- Merge & Prediction:
  - The outputs of the Visual and Language layers are merged (using addition) and passed to a final Dense layer.
  - A Softmax function predicts the probability of the next word in the vocabulary.
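A minimal Keras sketch of this merge architecture, assuming precomputed 4096-element VGG-16 features; the `vocab_size` and `max_length` values are placeholders, not the project's actual configuration:

```python
# Minimal sketch of the merge architecture in Keras; vocab_size and max_length are placeholders.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 34     # assumed maximum caption length in tokens

# Visual layer: compress the precomputed 4096-element VGG-16 feature vector.
image_input = Input(shape=(4096,))
image_features = Dense(256, activation="relu")(image_input)

# Language layer: embed the partial caption and process it with an LSTM.
caption_input = Input(shape=(max_length,))
embedded = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
caption_features = LSTM(256)(embedded)

# Merge & prediction: add the two representations, then predict the next word.
merged = add([image_features, caption_features])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```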
The model was trained and evaluated using the Flickr30k Dataset.
- Images: ~30,000 images.
- Captions: 5 different captions per image (describing key objects and scenes).
- Preprocessing: Text was tokenized, normalized (lowercase, punctuation removal), and padded to a maximum sequence length.
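A minimal sketch of this text-preprocessing step with Keras, assuming the captions are already loaded as a list of strings; the `startseq`/`endseq` markers are a common convention shown here for illustration, not necessarily the exact tokens used in this project.

```python
# Minimal sketch: cleaning, tokenizing, and padding captions (illustrative data).
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["A dog runs on the beach.", "Two children play soccer in a park."]  # placeholder data

# Normalize: lowercase, strip punctuation, and add start/end markers.
table = str.maketrans("", "", string.punctuation)
cleaned = ["startseq " + c.lower().translate(table) + " endseq" for c in captions]

# Tokenize and pad every sequence to the longest caption in the corpus.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
```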
We compared two caption generation methods:
- Greedy Search: Selects the most probable word at each step.
- Beam Search: Explores the k most probable sequences (beams) at each step to search for a higher-scoring caption.
Observation: Beam Search consistently produced more descriptive and contextually accurate captions compared to the Greedy approach.
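A minimal sketch of greedy decoding; the `model`, `tokenizer`, `max_length`, and `startseq`/`endseq` markers follow the assumptions made in the earlier sketches, and `photo_features` is a (1, 4096) array of VGG-16 features.

```python
# Minimal sketch: greedy decoding of a caption from precomputed image features.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo_features, max_length):
    """Repeatedly pick the single most probable next word until 'endseq' or max_length."""
    index_to_word = {index: word for word, index in tokenizer.word_index.items()}
    caption = "startseq"
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([caption])[0]
        sequence = pad_sequences([sequence], maxlen=max_length, padding="post")
        probabilities = model.predict([photo_features, sequence], verbose=0)[0]
        word = index_to_word.get(int(np.argmax(probabilities)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```

Beam Search follows the same loop but, instead of taking a single argmax, keeps the k highest-scoring partial captions at each step and expands each of them.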
- Language: Python
- Deep Learning: TensorFlow, Keras
- Web Framework: Flask
- Image Processing: Pillow (PIL)
- Data Handling: NumPy, Pandas, TQDM
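For the web interface, a minimal Flask route of the kind described above might look like the following; the upload path, form markup, and `generate_caption` helper are illustrative placeholders, not the repository's actual code.

```python
# Illustrative sketch: a minimal Flask route for uploading an image and showing a caption.
import os
from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = "static/uploads"  # placeholder upload location

FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image">
  <input type="submit" value="Generate Caption">
</form>
"""

def generate_caption(image_path):
    # Placeholder: the real app would run VGG-16 feature extraction and the LSTM decoder here.
    return "a placeholder caption for " + os.path.basename(image_path)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        image_file = request.files["image"]
        os.makedirs(UPLOAD_DIR, exist_ok=True)
        path = os.path.join(UPLOAD_DIR, image_file.filename)
        image_file.save(path)
        return FORM + "<p>Caption: " + generate_caption(path) + "</p>"
    return FORM

if __name__ == "__main__":
    app.run(debug=True)
```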
KISHORE K | ANIRUDH | DANIEL | MANJUNATH