This is a classification problem: given a piece of text, predict one of 12 categories.
- EDA
- Cleaning Data (Tokenizing, Stopword removal, Stemming).
- Additional data cleaning based on data.
- Building Model.
- Baseline model using Naive Bayes.
- Comparing with Random Forest.
- Hyperparameter tuning.
- GridSearchCV
- Visualising tuning results.
- Predicting on test dataset.
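The modelling steps above (baseline Naive Bayes, hyperparameter tuning with GridSearchCV) can be sketched with scikit-learn. The toy texts, labels, and `alpha` grid below are illustrative assumptions, not the project's actual data or configuration.

```python
# Sketch of the baseline pipeline: TF-IDF features + Multinomial Naive Bayes,
# tuned with GridSearchCV. Data and grid values are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # stopword removal happens here
    ("clf", MultinomialNB()),
])

# Hypothetical grid; the real search could also swap in a RandomForestClassifier.
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)

# Toy training corpus standing in for the real 12-category data.
texts = ["refund payment issue", "card declined error"] * 5 \
      + ["loan approved", "interest credited"] * 5
labels = ["complaint"] * 10 + ["update"] * 10
search.fit(texts, labels)
print(search.best_params_)
```

Comparing against Random Forest is then just a second grid search with `RandomForestClassifier()` in the `clf` slot.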
- There are tokens in the training corpus in which two words are joined together.
- These had to be handled so that important feature vectors are not lost.
- The approach is to identify the two constituent words within the token and split it at the appropriate position.
- This has been achieved using 2 steps:
    - A package called `wordninja`, which processes the full length of the string at high speed.
    - To tailor the solution to our data, a custom-built lookup table that is specific to our corpus rather than generic. This lookup table lists words in decreasing order of their probability, so the splitting decision favours the most probable words.
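The custom lookup-table step can be sketched as a greedy splitter that tries higher-probability words first. The vocabulary below is a hypothetical stand-in for the project's actual table; the generic case is handled by `wordninja.split`.

```python
# Sketch of splitting a joined token using a lookup table ordered by
# decreasing word probability. VOCAB here is illustrative, not the real table.
VOCAB = ["machine", "learning", "data", "science", "model"]  # most probable first
RANKS = {word: rank for rank, word in enumerate(VOCAB)}

def split_joined(token):
    """Greedily split a token at the boundary of the most probable known word."""
    for word in VOCAB:  # higher-probability words are tried first
        if token.startswith(word) and len(token) > len(word):
            rest = token[len(word):]
            if rest in RANKS:  # only split when the remainder is also a known word
                return [word, rest]
    return [token]  # unknown token: leave it unchanged

print(split_joined("machinelearning"))  # -> ['machine', 'learning']
```

Because the table is probability-ordered, an ambiguous token is always split in favour of the more frequent word, which is the behaviour described above.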
- A `predict.py` file has been built to predict the class of a given sentence.
- Usage: `from predict import predict`, then `predict('Input Sentence')`
- Text cleaning based on domain expertise by expanding short forms.
- Improving the approach to splitting joined words and fixing incorrect spaces.
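Expanding short forms can be done with a simple substitution table applied before vectorization. The mapping below is hypothetical, not the project's actual domain table.

```python
# Sketch of short-form expansion; SHORTFORMS is an illustrative mapping only.
import re

SHORTFORMS = {"acct": "account", "txn": "transaction", "amt": "amount"}

def expand_shortforms(text):
    """Replace known short forms with their full words, case-insensitively."""
    pattern = re.compile(r"\b(" + "|".join(SHORTFORMS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: SHORTFORMS[m.group(0).lower()], text)

print(expand_shortforms("txn amt debited from acct"))
```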