This project is a proof-of-concept for an intelligent pipeline designed to process and classify images. The primary objective is to automatically determine if an image is a scanned document and, if so, to identify its subject matter.
The core idea is to create a system that can:
- Identify Documents: Read an image and apply Optical Character Recognition (OCR) to extract text. If text is present, the image is treated as a scanned document.
- Enhance Text Quality: Apply a spelling correction model to the extracted text to improve its accuracy.
- Classify Content: Use Natural Language Processing (NLP) techniques to classify the document's topic. For this proof-of-concept, the classification is focused on distinguishing between "Basic Sciences" and "Computer Science".
The pipeline consists of three main components: an OCR Engine, a Spell Checker, and a Document Classifier.
Status: Core components are complete; integration pipeline is pending.
- OCR Text Extraction: Extracts text from various image formats using the Tesseract OCR engine.
- Text Correction: Improves the accuracy of the extracted text with an integrated spell checker.
- Document Classification: A deep neural network classifies documents into "Basic Sciences" or "Computer Science" categories with high reported accuracy (98-99%).
- Advanced NLP Pipeline: Utilizes custom-trained Word2Vec and pre-trained GloVe embeddings for semantic understanding of the text.
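To picture how the Word2Vec/GloVe embeddings feed the classifier, here is a minimal NumPy sketch of an embedding lookup. The vocabulary and matrix below are hypothetical placeholders; in the project, the matrix rows would be initialized from the custom-trained Word2Vec or pre-trained GloVe vectors.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is built from the training corpus.
vocab = {"computer": 0, "science": 1, "physics": 2, "chemistry": 3}
emb_dim = 4

# Random matrix standing in for Word2Vec/GloVe vectors (one row per word).
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.standard_normal((len(vocab), emb_dim))

def embed(tokens):
    """Map each known token to its vector, as a Keras Embedding layer would."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return embedding_matrix[ids]
```

The network then learns to classify from these dense vectors rather than from raw token IDs.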
The project operates in a three-stage pipeline:
1. OCR Engine: An image is first processed by the OCR engine, which performs several preprocessing steps (resizing, grayscaling, adaptive thresholding) to enhance image quality before using `pytesseract` to extract the raw text.
2. Spell Checker: The extracted text is then passed to a spell-checking module. This script uses `symspellpy` to correct common OCR errors, significantly improving the quality of the text for the next stage.
3. Document Classifier: The corrected text is fed into a sophisticated Natural Language Processing (NLP) model. This model, detailed in the `documentclassification.ipynb` notebook, tokenizes the text, removes stopwords, and performs lemmatization. It then uses a neural network with embedding layers (trained on both custom Word2Vec and pre-trained GloVe vectors) to classify the document.
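As a rough illustration of the preprocessing the OCR engine performs, here is a NumPy stand-in for the grayscale and adaptive-thresholding steps. The actual `OCRengine.py` uses OpenCV, so this is a sketch of the idea, not the project's implementation.

```python
import numpy as np

def grayscale(img):
    """Convert an H x W x 3 RGB array to grayscale using standard luma weights."""
    return img @ np.array([0.299, 0.587, 0.114])

def adaptive_threshold(gray, block=11, c=2.0):
    """Binarize each pixel against the mean of its local neighborhood.

    Mimics mean adaptive thresholding: pixels brighter than their local
    mean (minus a constant c) become white (255), the rest black (0).
    """
    h, w = gray.shape
    r = block // 2
    out = np.zeros((h, w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            window = gray[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = 255 if gray[i, j] > window.mean() - c else 0
    return out
```

Local (per-neighborhood) thresholding handles uneven lighting across a scanned page far better than a single global threshold, which is why it helps OCR accuracy.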
- `documentclassification.ipynb`: Contains the complete pipeline for training and evaluating the document classification model.
- `OCR/python files/OCRengine.py`: The core script for the OCR functionality.
- `OCR/python files/spellcheckpy.py`: The script for spell-checking the extracted text.
- `Extracted_txts/`: The output directory where extracted and corrected text files are stored.
- `OCR/python files/test.py` and `example.py`: Example scripts showing how to use the OCR and spell-checking modules.
1. Install Tesseract OCR Engine: This project requires the Tesseract OCR engine to be installed on your system. You can find installation instructions for your OS at the official Tesseract repository.

2. Clone the repository:

   ```
   git clone https://github.com/your-username/ImageInfoExtractor.git
   cd ImageInfoExtractor
   ```

3. Install Python dependencies: It is recommended to use a virtual environment.

   ```
   pip install -r requirements.txt
   ```

   `requirements.txt`:

   ```
   tensorflow
   pandas
   numpy
   scikit-learn
   nltk
   gensim
   opencv-python
   symspellpy
   Pillow
   pytesseract
   ```
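One common way to create and activate the recommended virtual environment (assuming a Unix-like shell) is:

```shell
python -m venv .venv                # create an isolated environment
source .venv/bin/activate           # activate it (Windows: .venv\Scripts\activate)
```

With the environment active, `pip install -r requirements.txt` installs the dependencies into `.venv` rather than the system Python.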
Since there is no main entrypoint script, the components need to be run individually.
1. Extract Text from an Image:
   - Place your image in the `OCR/` directory.
   - Modify `OCR/python files/test.py` to point to your image file.
   - Run the script (the path contains a space, so quote it):

     ```
     python "OCR/python files/test.py"
     ```

   - The extracted text will be saved as a `.txt` file in the same directory.
2. Correct Spelling in a Text File:
   - Place your text file in the `OCR/python files/` directory.
   - Modify `OCR/python files/example.py` to point to your text file.
   - Run the script:

     ```
     python "OCR/python files/example.py"
     ```

   - A new file with the `_corrected.txt` suffix will be created.
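The correction step relies on `symspellpy`. As a rough, pure-Python sketch of the delete-based candidate lookup that SymSpell is built on (the real library additionally verifies true edit distance and ranks candidates by corpus frequency, which this sketch omits):

```python
def deletes(word, max_dist=1):
    """All strings reachable by deleting up to max_dist characters from word."""
    results, frontier = {word}, {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary, max_dist=1):
    """Map every delete-variant of every dictionary word back to that word."""
    index = {}
    for word in dictionary:
        for key in deletes(word, max_dist):
            index.setdefault(key, set()).add(word)
    return index

def candidates(word, index, max_dist=1):
    """Dictionary words whose delete-variants overlap the input's delete-variants."""
    found = set()
    for key in deletes(word, max_dist):
        found |= index.get(key, set())
    return found
```

Precomputing only deletions keeps the index small, which is the trick that makes SymSpell fast enough to correct every token emitted by OCR.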
3. Train the Document Classifier:
   - Open and run the `documentclassification.ipynb` notebook in a Jupyter environment.
   - The notebook loads the preprocessed data, trains the models, and saves the final trained models (`model_C1` and `model_C2`) to disk.
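The notebook's text cleanup can be pictured with a small stdlib stand-in. The notebook itself uses NLTK for tokenization, stopword removal, and lemmatization; the stopword set below is just an illustrative fragment, and lemmatization is omitted here.

```python
import re

# Illustrative fragment of a stopword list; NLTK ships a much larger one.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase the text, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

The surviving tokens are what the embedding layer sees, so stripping stopwords removes high-frequency words that carry little signal about whether a document is "Basic Sciences" or "Computer Science".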
- Image Detector: Implement the "Image Detector-cum-Processor" to filter scanned documents from regular pictures automatically.
- Pipeline Integration: Create a main entrypoint script (`main.py`) to connect all the components into a seamless, end-to-end pipeline.
- API and User Interface: Develop an API and a simple user interface to make the tool more accessible.
- Ankit Sinha (ankitdipto)
- Chetanya Kumar Bansal (chetanyaba)
- Anand Prakash (UnstoppableBRO)
- Anwoy Chatterjee (C-anwoy)