This project is an intelligent document classification system that uses a range of machine learning models, traditional text vectorization techniques (TF-IDF), and modern embeddings (Word2Vec & Doc2Vec) to accurately classify text documents into predefined categories. Ensemble techniques like hard voting and soft voting are used to improve performance by combining multiple models.
We combined two existing Kaggle datasets, News Article Categories and Text Document Dataset, to build a richer and more diverse text classification corpus.
These datasets were mapped into unified categories such as:
- News & Current Affairs
- Business & Finance
- Science & Technology
- Arts & Entertainment
- Education & Academia
- Sports
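The mapping into unified categories can be illustrated with a small lookup table. This is only a sketch: the source labels below (`POLITICS`, `TECH`, and so on) and the helper names `unify_category` and `merge_datasets` are hypothetical, since the actual label names in the two datasets are not listed here.

```python
# Hypothetical source labels; the real label names in the two datasets may differ.
CATEGORY_MAP = {
    "POLITICS": "News & Current Affairs",
    "WORLD NEWS": "News & Current Affairs",
    "BUSINESS": "Business & Finance",
    "TECH": "Science & Technology",
    "SCIENCE": "Science & Technology",
    "ENTERTAINMENT": "Arts & Entertainment",
    "EDUCATION": "Education & Academia",
    "SPORTS": "Sports",
}


def unify_category(raw_label):
    """Return the unified category for a raw dataset label, or None if unmapped."""
    key = raw_label.strip().upper()
    return CATEGORY_MAP.get(key)


def merge_datasets(*datasets):
    """Merge (text, raw_label) pairs, dropping rows whose label has no mapping."""
    merged = []
    for rows in datasets:
        for text, raw_label in rows:
            category = unify_category(raw_label)
            if category is not None:
                merged.append((text, category))
    return merged
```

Dropping unmapped rows is one possible design choice; an alternative would be to collect them in a separate bucket for manual review.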
We used the pre-trained Google News Word2Vec model (GoogleNews-vectors-negative300.bin) for document vectorization:
Download it here: Google News Word2Vec Embeddings
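A common way to turn word embeddings into a document vector is to average the vectors of the words in the document. The sketch below assumes that approach (this README does not state the exact pooling used) and works with any word-to-vector mapping; the function name `document_vector` is hypothetical.

```python
def document_vector(tokens, vectors, dim=300):
    """Average the vectors of the tokens found in `vectors`.

    `vectors` is any mapping from word to a sequence of `dim` floats.
    Tokens missing from the vocabulary are skipped; a document with no
    known tokens maps to the zero vector.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        if token in vectors:
            for i, value in enumerate(vectors[token]):
                total[i] += float(value)
            count += 1
    if count == 0:
        return total
    return [value / count for value in total]


# Toy 2-dimensional vocabulary, used only to show the call.
toy_vectors = {"good": [1.0, 3.0], "game": [3.0, 5.0]}
print(document_vector(["good", "game", "unseen"], toy_vectors, dim=2))
```

With Gensim, the Google News file is typically loaded via `KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)`; the resulting object supports `in` and indexing, so it could be passed as `vectors` with `dim=300`.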
- Data Loading & Preprocessing
- Category Mapping & Merging Datasets
- Text Cleaning, Tokenization, and Lemmatization
- Vectorization:
  - TF-IDF
  - Word2Vec
  - Doc2Vec
- Oversampling for Class Imbalance (SMOTE, ADASYN, Random Oversampling)
- Model Training:
  - Naive Bayes
  - SVM
  - Random Forest
  - AdaBoost
  - XGBoost
  - Word2Vec + Logistic Regression
  - Doc2Vec + Logistic Regression
- Model Evaluation
- Ensemble Voting (Hard & Soft)
- Deployment via Streamlit Interface
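The text cleaning and tokenization step listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing: the stop-word list is a tiny placeholder, the function name `clean_and_tokenize` is hypothetical, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on", "for"}


def clean_and_tokenize(text):
    """Lowercase, strip URLs and non-letters, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS and len(tok) > 1]


print(clean_and_tokenize("The Markets rallied 3% today! See https://example.com"))
```

The resulting token lists can feed any of the vectorizers in the pipeline.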
# Clone the repository
git clone https://github.com/Kakarotprince/FileClassification.git
cd FileClassification
# Install dependencies
pip install -r Requirements.txt

- On the first run, the Google News vectors are loaded and a vector cache is saved.
- Subsequent runs will reuse the saved vectors to save time and memory.
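The load-once, reuse-later behaviour described above follows a simple caching pattern. The sketch below shows one way to do it with `pickle`; the helper name `load_or_build_cache` and the cache path are hypothetical, and the project may store its cache in a different format.

```python
import os
import pickle


def load_or_build_cache(cache_path, build_fn):
    """Return cached vectors if `cache_path` exists, else build and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    vectors = build_fn()  # expensive step, e.g. vectorizing the whole corpus
    with open(cache_path, "wb") as fh:
        pickle.dump(vectors, fh)
    return vectors
```

Here `build_fn` stands in for whatever produces the document vectors; it is only called when no cache file exists yet.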
- Vectorizers: TF-IDF, Word2Vec (Google News), Doc2Vec
- Classifiers: SVM, Random Forest, AdaBoost, XGBoost, Naive Bayes, Logistic Regression
- Imbalanced Data Handling: SMOTE, ADASYN, Random Oversampling
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
- Ensembling: Hard Voting, Soft Voting (based on individual model accuracies)
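The two voting schemes can be sketched in a few lines of plain Python. This assumes that "based on individual model accuracies" means each model's probabilities are weighted by its accuracy; that reading, and the function names `hard_vote` and `soft_vote`, are assumptions rather than the project's confirmed implementation.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over one predicted label per model."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities, weights=None):
    """Weighted average of per-model class probabilities.

    `probabilities` is a list of dicts (label -> probability), one per model.
    `weights` is one number per model, e.g. its validation accuracy.
    Returns the winning label and the averaged probabilities.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    scores = {}
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            scores[label] = scores.get(label, 0.0) + weight * p
    total = sum(weights)
    averaged = {label: s / total for label, s in scores.items()}
    return max(averaged, key=averaged.get), averaged


# Two hypothetical models that disagree; the more accurate one dominates.
model_probs = [{"Sports": 0.6, "Business & Finance": 0.4},
               {"Sports": 0.2, "Business & Finance": 0.8}]
label, averaged = soft_vote(model_probs, weights=[0.9, 0.1])
print(label, averaged)
```

In scikit-learn, `sklearn.ensemble.VotingClassifier` offers the same idea through its `voting` and `weights` parameters.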
The project includes a user-friendly Streamlit-based UI where users can enter text or upload files and receive classification results in real time.
To launch the app:
streamlit run app.py

- scikit-learn
- Gensim
- Streamlit
- Kaggle: GoogleNews Word2Vec
- Kaggle: News Article Categories
- Kaggle: Text Document Dataset