This project focuses on automating the categorization of resumes using Natural Language Processing (NLP) and machine learning techniques. The goal is to analyze resume data and classify resumes into their respective job categories efficiently.
- `resume_categorization_notebook.ipynb`: A Jupyter Notebook that contains the implementation for data preprocessing, feature extraction, model training, and evaluation.
- `script.py`: A Python script to automate the process of categorizing resumes based on the trained model and vectorizer.
- `test_data/`: Directory containing the test resumes in PDF format.
- `requirements.txt`: List of all Python packages required for this project.
This project employs various machine learning models and a deep learning approach to identify job categories from resumes. Resumes from the dataset are preprocessed and then classified using models such as:
- Random Forest Classifier
- Logistic Regression
- K-Nearest Neighbor
- Support Vector Machine (SVM)
- Deep Learning: Artificial Neural Networks (ANN) and Long Short-Term Memory (LSTM)
The model training is performed on preprocessed data, and the results are used to create a functional script that categorizes resumes automatically.
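For context, here is a minimal sketch of how such a model comparison might look with scikit-learn. The file name `resume_data.csv` and the column names `cleaned_text` and `Category` are assumptions for illustration, not the project's actual schema:

```python
# Hedged sketch: train and compare the candidate classifiers on TF-IDF features.
# "resume_data.csv", "cleaned_text", and "Category" are illustrative names only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("resume_data.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned_text"], df["Category"], test_size=0.2, random_state=42
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_vec))
    print(f"{name}: {acc:.3f}")
```

The ANN and LSTM models listed above would follow the same data split but require a deep learning framework, so they are omitted from this sketch.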
Ensure the following are installed before proceeding:
- Python 3.10.12 or higher
- Jupyter Notebook
- Necessary Python libraries (specified in `requirements.txt`)
- Clone the repository to your local machine:
  `git clone https://github.com/raselmeya94/Resume_Categorization.git`
- Navigate to the project directory:
  `cd Resume_Categorization`
- Install the required dependencies:
  `pip install -r requirements.txt`
- Data Preprocessing:
  - Split the dataset into `resume_data` (training set) and `resume_test_data` (testing set).
  - Clean the text data by removing unnecessary symbols, spaces, and irrelevant content.
- Feature Extraction:
  - Use NLP techniques such as TF-IDF vectorization to extract meaningful features from the resumes.
- Model Training:
  - Train multiple models to identify the best-performing one for classifying resumes into categories.
  - Save the trained classifier (`best_clf.pkl`) and the vectorizer (`tfidf.pkl`) as pickle files (see the sketch after this list).
- Automated Categorization:
  - Use `script.py` to load the test data (PDF resumes in `test_data/`), vectorize the content, and predict categories.
  - Organize resumes into corresponding folders based on their predicted category.
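As referenced in the Model Training step, here is a minimal sketch of the cleaning and persistence stages. The regular expressions, the toy corpus, and the choice of `SVC` as the stand-in "best" model are assumptions, not the notebook's exact implementation:

```python
# Hedged sketch: clean raw resume text, fit a TF-IDF vectorizer and a classifier,
# and persist both as the pickle files script.py expects. The toy corpus and the
# SVC stand-in are illustrative only.
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def clean_resume(text: str) -> str:
    """Remove URLs, non-letter symbols, and extra whitespace, then lowercase."""
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

# Toy corpus so the snippet runs end to end; the notebook uses the real dataset.
texts = [
    "Built REST APIs in Python and deployed ML models to production.",
    "Prepared quarterly financial statements and managed audit schedules.",
]
labels = ["ENGINEERING", "FINANCE"]

cleaned = [clean_resume(t) for t in texts]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)

best_clf = SVC()  # stand-in for whichever model performed best in the comparison
best_clf.fit(X, labels)

with open("best_clf.pkl", "wb") as f:
    pickle.dump(best_clf, f)
with open("tfidf.pkl", "wb") as f:
    pickle.dump(tfidf, f)
```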
- Train the models and generate pickle files:
  Open the Jupyter Notebook:
  `jupyter notebook resume_categorization_notebook.ipynb`
  Follow the instructions in the notebook to train the models and generate the necessary `best_clf.pkl` and `tfidf.pkl` files.
- Place the test resumes in the `test_data/` folder.
- Run the categorization script (see the sketch after this list):
  `python script.py`
  Provide the path to the `test_data/` folder when prompted.
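For orientation, here is a rough sketch of what the categorization step does. The use of `pypdf` for text extraction and the exact folder handling are assumptions; the real `script.py` may use a different PDF library and prompt flow:

```python
# Hedged sketch of the categorization step: extract text from each PDF in
# test_data/, vectorize it with the saved TF-IDF vectorizer, predict a category
# with the saved classifier, and copy the file into a matching folder.
# pypdf is an assumption; the real script.py may use a different PDF library.
import pickle
import shutil
from pathlib import Path

from pypdf import PdfReader

with open("best_clf.pkl", "rb") as f:
    clf = pickle.load(f)
with open("tfidf.pkl", "rb") as f:
    tfidf = pickle.load(f)

test_dir = Path("test_data")
out_dir = Path("categorized_resumes")

for pdf_path in test_dir.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    category = clf.predict(tfidf.transform([text]))[0]
    dest = out_dir / str(category)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy(pdf_path, dest / pdf_path.name)
    print(f"{pdf_path.name} -> {category}")
```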
- Trained Model: `best_clf.pkl`
- Vectorizer: `tfidf.pkl`
- Categorized resumes in the following folder structure:
  categorized_resumes/
  ├── ENGINEERING
  ├── FINANCE
  ├── HEALTHCARE
  ├── TEACHER
  ├── ...
We welcome contributions to enhance the functionality of this project. To contribute:
- Fork the repository.
- Create a feature branch.
- Submit a pull request describing your changes.
This project is licensed under the MIT License. See the LICENSE file for more information.
For more details and hands-on usage, refer to the main notebook.