
Image Caption Generator

A deep learning-based application that utilizes Computer Vision and Natural Language Processing to generate descriptive captions for images.

Abstract

This project implements an Encoder-Decoder neural network architecture to solve the problem of image captioning. The model takes an image as input and outputs a natural language sentence describing the visual content. The system integrates a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (LSTM) for text generation, wrapped in a Flask web application for user interaction.

Key Features

  • Deep Learning Architecture: Utilizes a "Merge" architecture combining CNN and RNN.
  • Transfer Learning: Leverages the VGG-16 model pre-trained on ImageNet for robust image feature extraction.
  • Advanced Decoding: Implements both Greedy Search and Beam Search algorithms to optimize caption quality.
  • Web Interface: A user-friendly Flask web app that lets users upload images and view generated captions in real time (see the route sketch after this list).
  • Evaluation: Performance measured using BLEU (Bilingual Evaluation Understudy) scores, BLEU-1 through BLEU-4, and human evaluation.
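
A minimal sketch of how the Flask upload route might look is shown below. The form field name, template name, and generate_caption() helper are illustrative placeholders, not the repository's actual code.

```python
# Minimal Flask sketch (illustrative only): the form field, template, and
# generate_caption() helper are placeholders, not the repository's actual code.
import os
from flask import Flask, request, render_template

app = Flask(__name__)
UPLOAD_DIR = "static/uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

def generate_caption(image_path):
    # Placeholder: the real app would run the CNN encoder + LSTM decoder here.
    return "a caption describing " + os.path.basename(image_path)

@app.route("/", methods=["GET", "POST"])
def index():
    caption = None
    if request.method == "POST":
        image = request.files["image"]              # uploaded file from the form
        path = os.path.join(UPLOAD_DIR, image.filename)
        image.save(path)                            # save so the page can display it
        caption = generate_caption(path)            # run the captioning model
    return render_template("index.html", caption=caption)

if __name__ == "__main__":
    app.run(debug=True)
```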

System Architecture

The model follows a standard Encoder-Decoder structure:

  1. Encoder (Visual Layer):

    • VGG-16, with its final classification layer removed, extracts a 4096-element feature vector from the input image.
    • This vector is passed through a Dense layer (with ReLU activation) to compress the features.
  2. Decoder (Language Layer):

    • Embedding Layer: Converts the input text sequence into dense vectors.
    • LSTM Layer: A Long Short-Term Memory network processes the sequence data to handle long-term dependencies in sentences.
  3. Merge & Prediction:

    • The outputs of the Visual and Language branches are merged by element-wise addition and passed through a final Dense layer.
    • A Softmax layer predicts the probability of the next word over the vocabulary (see the Keras sketches below).
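
The encoder step can be sketched with the standard Keras VGG-16 application, as below; the exact preprocessing pipeline and layer names used in the repository may differ.

```python
# Sketch of VGG-16 feature extraction (assumes the Keras applications API;
# the repository's exact preprocessing may differ).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load VGG-16 pre-trained on ImageNet and drop the final classification layer,
# keeping the 4096-dimensional fc2 output as the image feature vector.
base = VGG16(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))   # VGG-16 input size
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))       # ImageNet normalization
    return encoder.predict(x, verbose=0)[0]               # shape: (4096,)
```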
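
The full merge architecture can be expressed in the Keras functional API roughly as follows; the layer sizes, vocab_size, and max_length values are assumptions for illustration, not the repository's exact hyperparameters.

```python
# Sketch of the merge architecture (layer sizes, vocab_size, and max_length
# are illustrative assumptions).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size
max_length = 35     # assumed maximum caption length

# Visual branch: compress the 4096-d VGG-16 feature vector.
image_input = Input(shape=(4096,))
fe = Dropout(0.5)(image_input)
fe = Dense(256, activation="relu")(fe)

# Language branch: embed the partial caption and run it through an LSTM.
text_input = Input(shape=(max_length,))
se = Embedding(vocab_size, 256, mask_zero=True)(text_input)
se = Dropout(0.5)(se)
se = LSTM(256)(se)

# Merge by addition, then predict the next word over the vocabulary.
decoder = add([fe, se])
decoder = Dense(256, activation="relu")(decoder)
output = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[image_input, text_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```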

Dataset

The model was trained and evaluated using the Flickr30k Dataset.

  • Images: ~30,000 images.
  • Captions: 5 different captions per image (describing key objects and scenes).
  • Preprocessing: Captions were tokenized, normalized (lowercased, punctuation removed), and padded to a maximum sequence length (see the sketch below).
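
The caption preprocessing can be sketched with the Keras tokenizer as below; the startseq/endseq marker names and cleaning rules are assumptions that follow the description above, not necessarily the repository's exact choices.

```python
# Caption preprocessing sketch (marker names and cleaning rules are illustrative).
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_caption(caption):
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha()]         # drop numbers / stray tokens
    return "startseq " + " ".join(words) + " endseq"  # wrap with start/end markers

captions = [
    "A dog runs across the grassy field.",
    "Two children are playing football in a park.",
]
cleaned = [clean_caption(c) for c in captions]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)                       # build the vocabulary
sequences = tokenizer.texts_to_sequences(cleaned)     # words -> integer ids
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length)  # pad to a fixed length
print(padded.shape)                                   # (num_captions, max_length)
```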

Results

We compared two caption generation methods:

  • Greedy Search: Selects the single most probable word at each step.
  • Beam Search: Keeps the k most probable partial sequences (beams) at each step and returns the highest-scoring complete caption.

Observation: Beam Search consistently produced more descriptive and contextually accurate captions compared to the Greedy approach.
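
The two decoding strategies can be sketched as below. The code assumes a trained model, a fitted tokenizer, a max_length value, and a 4096-d photo feature vector from the encoder sketch above; it illustrates the general technique rather than the repository's exact implementation.

```python
# Decoding sketch: greedy search vs. a simple beam search (illustrative only).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_for_id(idx, tokenizer):
    return tokenizer.index_word.get(idx)

def greedy_search(model, tokenizer, photo, max_length):
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo.reshape(1, -1), seq], verbose=0)[0]
        word = word_for_id(int(np.argmax(probs)), tokenizer)   # best word only
        if word is None or word == "endseq":
            break
        text += " " + word
    return text

def beam_search(model, tokenizer, photo, max_length, k=3):
    # Each beam is (caption_so_far, cumulative_log_probability).
    beams = [("startseq", 0.0)]
    for _ in range(max_length):
        candidates = []
        for text, score in beams:
            if text.endswith("endseq"):
                candidates.append((text, score))      # finished beams carry over
                continue
            seq = tokenizer.texts_to_sequences([text])[0]
            seq = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([photo.reshape(1, -1), seq], verbose=0)[0]
            for idx in np.argsort(probs)[-k:]:        # top-k candidate next words
                word = word_for_id(int(idx), tokenizer)
                if word is None:
                    continue
                candidates.append((text + " " + word, score + np.log(probs[idx])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]
```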

Technologies Used

  • Language: Python
  • Deep Learning: TensorFlow, Keras
  • Web Framework: Flask
  • Image Processing: Pillow (PIL)
  • Data Handling: NumPy, Pandas, TQDM

Team Members


  • KISHORE K
  • ANIRUDH
  • DANIEL
  • MANJUNATH
