Skip to content

Kakarotprince/FileClassification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🧠 Document Classification using Machine Learning

This project is an intelligent document classification system that uses a range of machine learning models, traditional text vectorization techniques (TF-IDF), and modern embeddings (Word2Vec & Doc2Vec) to accurately classify text documents into predefined categories. Ensemble techniques like hard voting and soft voting are used to improve performance by combining multiple models.


πŸ“ Datasets Used

We combined two existing datasets to build a richer and more diverse text classification corpus:

  1. News Article Category Dataset
  2. Text Document Classification Dataset

These datasets were mapped into unified categories such as:

  • News & Current Affairs
  • Business & Finance
  • Science & Technology
  • Arts & Entertainment
  • Education & Academia
  • Sports

🧩 Word Embedding

We used the pre-trained Google News Word2Vec model (GoogleNews-vectors-negative300.bin) for document vectorization:

πŸ“₯ Download it here: Google News Word2Vec Embeddings


πŸš€ Project Pipeline

  1. Data Loading & Preprocessing
  2. Category Mapping & Merging Datasets
  3. Text Cleaning, Tokenization, and Lemmatization
  4. Vectorization:
    • TF-IDF
    • Word2Vec
    • Doc2Vec
  5. Oversampling for Class Imbalance (SMOTE, ADASYN, Random Oversampling)
  6. Model Training:
    • Naive Bayes
    • SVM
    • Random Forest
    • AdaBoost
    • XGBoost
    • Word2Vec + Logistic Regression
    • Doc2Vec + Logistic Regression
  7. Model Evaluation
  8. Ensemble Voting (Hard & Soft)
  9. Deployment via Streamlit Interface

πŸ“¦ First-Time Setup Instructions

# Clone the repository
git clone https://github.com/Kakarotprince/FileClassification.git
cd FileClassification

# Install dependencies
pip install -r Requirements.txt

⚠️ Important Note:

  • On first run, the Google News vectors need to be loaded and vector cache saved.
  • Subsequent runs will reuse the saved vectors to save time and memory.

πŸ“Š Models and Techniques

  • Vectorizers: TF-IDF, Word2Vec (Google News), Doc2Vec
  • Classifiers: SVM, RandomForest, AdaBoost, XGBoost, Naive Bayes, Logistic Regression
  • Imbalanced Data Handling: SMOTE, ADASYN, Random Oversampling
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
  • Ensembling: Hard Voting, Soft Voting (based on individual model accuracies)

πŸ–ΌοΈ Streamlit Web Interface

The project includes a user-friendly Streamlit-based UI where users can upload text or files and receive classification results in real-time.

To launch the app:

streamlit run app.py

πŸ“š References


Releases

No releases published

Packages

No packages published