ImageInfoExtractor

An end-to-end pipeline to filter scanned documents from arbitrary images with subsequent classification of the extracted documents.

This project is a proof-of-concept for an intelligent pipeline designed to process and classify images. The primary objective is to automatically determine if an image is a scanned document and, if so, to identify its subject matter.

The core idea is to create a system that can:

  1. Identify Documents: Read an image and apply Optical Character Recognition (OCR) to extract text. If text is present, the image is treated as a scanned document.
  2. Enhance Text Quality: Apply a spelling correction model to the extracted text to improve its accuracy.
  3. Classify Content: Use Natural Language Processing (NLP) techniques to classify the document's topic. For this proof-of-concept, the classification is focused on distinguishing between "Basic Sciences" and "Computer Science".

The pipeline consists of three main components: an OCR Engine, a Spell Checker, and a Document Classifier.

Status: Core components are complete; integration pipeline is pending.

Features

  • OCR Text Extraction: Extracts text from various image formats using the Tesseract OCR engine.
  • Text Correction: Improves the accuracy of the extracted text with an integrated spell checker.
  • Document Classification: A deep neural network classifies documents into "Computer Science" or "Science" categories with high accuracy (98-99%).
  • Advanced NLP Pipeline: Utilizes custom-trained Word2Vec and pre-trained GloVe embeddings for semantic understanding of the text.

How It Works

The project operates in a three-stage pipeline:

  1. OCR Engine: An image is first processed by the OCR engine, which performs several preprocessing steps (resizing, grayscaling, adaptive thresholding) to enhance image quality before using pytesseract to extract the raw text (see the first sketch after this list).

  2. Spell Checker: The extracted text is then passed to a spell-checking module. This script uses symspellpy to correct common OCR errors, significantly improving the quality of the text for the next stage.

  3. Document Classifier: The corrected text is fed into an NLP classification model. This model, detailed in the documentclassification.ipynb notebook, tokenizes the text, removes stopwords, and performs lemmatization. It then uses a neural network with embedding layers (trained on both custom Word2Vec and pre-trained GloVe vectors) to classify the document (see the second sketch after this list).
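
The actual implementations live in OCR/python files/OCRengine.py and spellcheckpy.py; the following is only a minimal sketch of stages 1 and 2. The input filename, scale factor, and thresholding parameters are illustrative assumptions, not the repository's exact values.

    import cv2
    import pkg_resources
    import pytesseract
    from symspellpy import SymSpell

    # Stage 1: preprocess the image, then run Tesseract OCR.
    image = cv2.imread("sample.jpg")  # hypothetical input image
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 2)
    raw_text = pytesseract.image_to_string(binary)

    # Stage 2: correct OCR errors with SymSpell's compound lookup,
    # using the English frequency dictionary bundled with symspellpy.
    sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
    suggestions = sym_spell.lookup_compound(raw_text, max_edit_distance=2)
    corrected_text = suggestions[0].term if suggestions else raw_text
    print(corrected_text)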
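
Stage 3 is developed in documentclassification.ipynb; the schematic below only illustrates the kind of pipeline described there (NLTK cleanup plus a small Keras network with an embedding layer). The toy texts, labels, vocabulary handling, and randomly initialized embedding weights are assumptions for illustration; the notebook trains on a real corpus and loads custom Word2Vec and pre-trained GloVe vectors into the embedding layer.

    import nltk
    import numpy as np
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
    from tensorflow.keras.models import Sequential

    nltk.download("stopwords")
    nltk.download("wordnet")
    nltk.download("omw-1.4")

    texts = ["neural networks learn feature representations from data",
             "mitochondria produce energy inside living cells"]  # toy documents
    labels = np.array([1, 0])                                    # 1 = Computer Science, 0 = Science

    # Tokenize (whitespace split stands in for the notebook's tokenizer),
    # drop stopwords, and lemmatize.
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(w) for w in t.lower().split()
             if w.isalpha() and w not in stop_words] for t in texts]

    # Map words to integer ids (0 is reserved for padding) and pad to a fixed length.
    vocab = {w: i + 1 for i, w in enumerate(sorted({w for d in docs for w in d}))}
    maxlen = 20
    padded = np.array([[vocab[w] for w in d][:maxlen] + [0] * (maxlen - len(d))
                       for d in docs])

    # A small binary classifier built around an embedding layer; the notebook
    # instead initializes this layer from the Word2Vec / GloVe vectors.
    model = Sequential([
        Embedding(input_dim=len(vocab) + 1, output_dim=100),
        GlobalAveragePooling1D(),
        Dense(16, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(padded, labels, epochs=5, verbose=0)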

Codebase Structure

  • documentclassification.ipynb: Contains the complete pipeline for training and evaluating the document classification model.
  • OCR/python files/OCRengine.py: The core script for the OCR functionality.
  • OCR/python files/spellcheckpy.py: The script for spell-checking the extracted text.
  • Extracted_txts/: The output directory where extracted and corrected text files are stored.
  • OCR/python files/test.py and example.py: Example scripts showing how to use the OCR and spell-checking modules.

Setup and Installation

  1. Install Tesseract OCR Engine: This project requires the Tesseract OCR engine to be installed on your system. You can find installation instructions for your OS at the official Tesseract repository.

  2. Clone the repository:

    git clone https://github.com/ankitdipto/ImageInfoExtractor.git
    cd ImageInfoExtractor
  3. Install Python dependencies: It is recommended to use a virtual environment.

    pip install -r requirements.txt

    requirements.txt:

    tensorflow
    pandas
    numpy
    scikit-learn
    nltk
    gensim
    opencv-python
    symspellpy
    Pillow
    pytesseract
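
As a quick sanity check (not part of the repository) that both the Tesseract binary and the pytesseract bindings are installed:

    import pytesseract

    # Raises pytesseract.TesseractNotFoundError if the Tesseract binary is not on PATH.
    print(pytesseract.get_tesseract_version())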
    

How to Use

Since there is no main entrypoint script, the components need to be run individually.

  1. Extract Text from an Image:

    • Place your image in the OCR/ directory.
    • Modify OCR/python files/test.py to point to your image file.
    • Run the script:
      python "OCR/python files/test.py"
    • The extracted text will be saved as a .txt file in the same directory.
  2. Correct Spelling in a Text File:

    • Place your text file in the OCR/python files/ directory.
    • Modify OCR/python files/example.py to point to your text file.
    • Run the script:
      python "OCR/python files/example.py"
    • A new file with the _corrected.txt suffix will be created.
  3. Train the Document Classifier:

    • Open and run the documentclassification.ipynb notebook in a Jupyter environment.
    • The notebook loads the preprocessed data, trains the models, and saves the final trained models (model_C1 and model_C2) to disk (see the sketch after this list for reloading them).
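
After training, the saved models can be reloaded for inference. This is a minimal sketch that assumes the notebook writes the models in a format tf.keras.models.load_model can read and that model_C1 is the on-disk path; check the notebook's final cells for the actual save call.

    from tensorflow.keras.models import load_model

    model = load_model("model_C1")  # path and save format are assumptions based on the notebook's description
    model.summary()

Classifying new text also requires reproducing the notebook's preprocessing (tokenization, stopword removal, lemmatization, and the same word-to-id mapping and sequence length), because the model expects the same integer sequences it was trained on.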

Future Work

  • Image Detector: Implement the "Image Detector-cum-Processor" to filter scanned documents from regular pictures automatically.
  • Pipeline Integration: Create a main entrypoint script (main.py) to connect all the components into a seamless, end-to-end pipeline.
  • API and User Interface: Develop an API and a simple user interface to make the tool more accessible.

Contributors

  • Ankit Sinha (ankitdipto)
  • Chetanya Kumar Bansal (chetanyaba)
  • Anand Prakash (UnstoppableBRO)
  • Anwoy Chatterjee (C-anwoy)
