This is a classification problem: given a piece of text, predict one of 12 categories.
- EDA
- Cleaning Data (Tokenizing, Stopword removal, Stemming).
- Additional data cleaning based on data.
- Building Model.
- Baseline model using Naive Bayes.
- Comparing with Random Forest.
- Hyperparameter tuning.
- GridSearchCV
- Visualising tuning results.
- Predicting on test dataset.
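The modelling steps above (baseline Naive Bayes, hyperparameter tuning with GridSearchCV) can be sketched with scikit-learn. The toy texts, labels, and `alpha` grid below are illustrative assumptions, not the project's actual data or configuration.

```python
# Sketch of the baseline pipeline: TF-IDF features + Multinomial Naive Bayes,
# tuned with GridSearchCV. Data and grid values are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # stopword removal happens here
    ("clf", MultinomialNB()),
])

# Hypothetical grid; the real search could also swap in a RandomForestClassifier.
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)

# Toy training corpus standing in for the real 12-category data.
texts = ["refund payment issue", "card declined error"] * 5 \
      + ["loan approved", "interest credited"] * 5
labels = ["complaint"] * 10 + ["update"] * 10
search.fit(texts, labels)
print(search.best_params_)
```

Comparing against Random Forest is then just a second grid search with `RandomForestClassifier()` in the `clf` slot.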
- There are tokens in the training corpus in which two words are joined together.
- These had to be handled so that important feature vectors are not lost.
- The approach is to identify the two constituent words within the token and split it at the appropriate position.
- This has been achieved using 2 steps:
    - A package called `wordninja`, which processes the full length of the string at high speed.
    - To tailor the solution to our data, a custom-built lookup table that is specific to our corpus rather than generic. This lookup table lists words in decreasing order of their probability, so the splitting decision favours the most probable words.
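The custom lookup-table step can be sketched as a greedy splitter that tries higher-probability words first. The vocabulary below is a hypothetical stand-in for the project's actual table; the generic case is handled by `wordninja.split`.

```python
# Sketch of splitting a joined token using a lookup table ordered by
# decreasing word probability. VOCAB here is illustrative, not the real table.
VOCAB = ["machine", "learning", "data", "science", "model"]  # most probable first
RANKS = {word: rank for rank, word in enumerate(VOCAB)}

def split_joined(token):
    """Greedily split a token at the boundary of the most probable known word."""
    for word in VOCAB:  # higher-probability words are tried first
        if token.startswith(word) and len(token) > len(word):
            rest = token[len(word):]
            if rest in RANKS:  # only split when the remainder is also a known word
                return [word, rest]
    return [token]  # unknown token: leave it unchanged

print(split_joined("machinelearning"))  # -> ['machine', 'learning']
```

Because the table is probability-ordered, an ambiguous token is always split in favour of the more frequent word, which is the behaviour described above.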
- A `predict.py` file has been built to predict the class of a given sentence.
- Usage: `from predict import predict`, then `predict('Input Sentence')`
- Text cleaning based on domain expertise by expanding short forms.
- Improving the approach to splitting joined words and fixing incorrect spaces.
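Expanding short forms can be done with a simple substitution table applied before vectorization. The mapping below is hypothetical, not the project's actual domain table.

```python
# Sketch of short-form expansion; SHORTFORMS is an illustrative mapping only.
import re

SHORTFORMS = {"acct": "account", "txn": "transaction", "amt": "amount"}

def expand_shortforms(text):
    """Replace known short forms with their full words, case-insensitively."""
    pattern = re.compile(r"\b(" + "|".join(SHORTFORMS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: SHORTFORMS[m.group(0).lower()], text)

print(expand_shortforms("txn amt debited from acct"))
```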