🧠 Document Classification using Machine Learning

This project is a document classification system that combines several machine learning models with traditional text vectorization (TF-IDF) and embedding-based representations (Word2Vec and Doc2Vec) to classify text documents into predefined categories. Ensemble techniques, namely hard voting and soft voting, combine the predictions of multiple models to improve performance.

📁 Datasets Used

We combined two existing datasets to build a richer and more diverse text classification corpus.

These datasets were mapped into unified categories such as:

News & Current Affairs
Business & Finance
Science & Technology
Arts & Entertainment
Education & Academia
Sports

🧩 Word Embedding

We used the pre-trained Google News Word2Vec model (GoogleNews-vectors-negative300.bin) for document vectorization:

📥 Download it here: Google News Word2Vec Embeddings
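The README does not say how individual word vectors are combined into a single document vector. A common approach, assumed here, is to average the vectors of all in-vocabulary tokens. The sketch below uses a toy 3-dimensional dictionary (`toy_vectors`, a hypothetical stand-in) in place of the 300-dimensional Google News model, which with gensim would typically be loaded via `KeyedVectors.load_word2vec_format(path, binary=True)`.

```python
from typing import Dict, List, Sequence


def document_vector(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of in-vocabulary tokens.

    Returns a zero vector when no token is in the vocabulary.
    Averaging is an assumption; the project may pool differently.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `col` holds one
    # component from every known word vector.
    return [sum(col) / len(known) for col in zip(*known)]


# Hypothetical toy embeddings standing in for the Google News vectors.
toy_vectors = {
    "stock": [1.0, 0.0, 0.0],
    "market": [0.0, 1.0, 0.0],
}

# "unknownword" is out of vocabulary and is skipped.
doc_vec = document_vector(["stock", "market", "unknownword"], toy_vectors, dim=3)
```

Out-of-vocabulary tokens are ignored rather than mapped to zeros, so they do not pull the average toward the origin.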

🚀 Project Pipeline

Data Loading & Preprocessing
Category Mapping & Merging Datasets
Text Cleaning, Tokenization, and Lemmatization
Vectorization:
- TF-IDF
- Word2Vec
- Doc2Vec
Oversampling for Class Imbalance (SMOTE, ADASYN, Random Oversampling)
Model Training:
- Naive Bayes
- SVM
- Random Forest
- AdaBoost
- XGBoost
- Word2Vec + Logistic Regression
- Doc2Vec + Logistic Regression
Model Evaluation
Ensemble Voting (Hard & Soft)
Deployment via Streamlit Interface
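To illustrate the oversampling step, here is a minimal sketch of random oversampling in plain Python. It duplicates randomly chosen minority-class samples until every class matches the largest one. The function name `random_oversample` and the sample data are hypothetical; the README does not name the library the project uses (imbalanced-learn's `RandomOverSampler` is one common choice), and SMOTE and ADASYN work differently because they synthesize new points instead of duplicating existing ones.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class samples until all classes are equally sized."""
    rng = random.Random(seed)

    # Group samples by their class label.
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical imbalanced sample: three Sports documents, one Business document.
texts = ["goal scored", "match won", "team lost", "stocks fell"]
labels = ["Sports", "Sports", "Sports", "Business & Finance"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
counts = Counter(balanced_labels)
```

Oversampling is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.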

📦 First-Time Setup Instructions

# Clone the repository
git clone https://github.com/Kakarotprince/FileClassification.git
cd FileClassification

# Install dependencies
pip install -r Requirements.txt

⚠️ Important Note:

On the first run, the Google News vectors must be loaded, and a vector cache is saved.
Subsequent runs reuse the saved vectors to save time and memory.
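The load-once-then-reuse behavior can be sketched as a small caching helper. This is an illustration under stated assumptions, not the project's actual code: the helper name `load_or_build`, the file name `vector_cache.pkl`, and the use of `pickle` are all hypothetical.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached value if the file exists, else build and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    value = build()
    with open(cache_path, "wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def slow_build():
    """Stand-in for loading the large Google News vectors (hypothetical)."""
    calls.append(1)
    return {"stock": [1.0, 0.0, 0.0]}


# A temporary directory keeps the example from touching the working directory.
cache_file = os.path.join(tempfile.mkdtemp(), "vector_cache.pkl")

first = load_or_build(cache_file, slow_build)   # builds and writes the cache
second = load_or_build(cache_file, slow_build)  # reads the cache, no rebuild
```

Only the first call invokes `slow_build`; the second reads the pickled result from disk.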

📊 Models and Techniques

Vectorizers: TF-IDF, Word2Vec (Google News), Doc2Vec
Classifiers: SVM, RandomForest, AdaBoost, XGBoost, Naive Bayes, Logistic Regression
Imbalanced Data Handling: SMOTE, ADASYN, Random Oversampling
Evaluation Metrics: Accuracy, Precision, Recall, F1-Score
Ensembling: Hard Voting, Soft Voting (based on individual model accuracies)
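The difference between the two voting schemes can be sketched as follows. Hard voting takes the majority of predicted labels, while soft voting sums class probabilities across models. Weighting each model's probabilities by its accuracy is one reading of "based on individual model accuracies" and is an assumption here; the function names, probabilities, and accuracy values are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities, weights):
    """Return the label with the highest weighted sum of probabilities.

    probabilities: one {label: probability} dict per model.
    weights: one weight per model (assumed here to be its accuracy).
    """
    totals = {}
    for probs, weight in zip(probabilities, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    return max(totals, key=totals.get)


# Hypothetical outputs from three models for one document.
model_probs = [
    {"Sports": 0.55, "Business & Finance": 0.45},
    {"Sports": 0.60, "Business & Finance": 0.40},
    {"Sports": 0.05, "Business & Finance": 0.95},
]
model_accuracies = [0.7, 0.7, 0.9]  # hypothetical validation accuracies

# Each model's hard prediction is its most probable label.
model_labels = [max(p, key=p.get) for p in model_probs]

hard_result = hard_vote(model_labels)
soft_result = soft_vote(model_probs, model_accuracies)
```

In this example the two schemes disagree: two weakly confident models outvote the third under hard voting, but the third model's high confidence and higher weight decide the soft vote.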

🖼️ Streamlit Web Interface

The project includes a Streamlit-based UI where users can submit text or upload files and receive classification results in real time.

To launch the app:

streamlit run app.py
