This repo:
- Collects data from the Codeforces API to train a rating-predictor model
- Prepares the collected data for consumption by a decision-tree model, generating (features, label) pairs
- Trains a CatBoost model on the generated data, achieving a mean absolute error of 65.15 when predicting a user's rating 6 months in the future
Use the notebook `retrieve_data_from_codeforces_api.ipynb` to collect data from active Codeforces users. It defaults to the 1.5k highest-rated users plus 50k random users (the rating distribution has a long tail, so all very highly rated users are included to improve performance for them). The notebook collects each user's submission, rating-change, and blog history. This is the slowest step, taking an entire day to complete; ask the author for the compressed data if needed.
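The collection loop boils down to calling a few Codeforces API methods per user. A minimal sketch is below; the helper names (`api_url`, `fetch`) are illustrative, not the notebook's actual code, and in practice you would also need to throttle requests to respect the API's rate limits:

```python
import json
import urllib.request

API = "https://codeforces.com/api"

def api_url(method, **params):
    # Build a Codeforces API URL, e.g. .../user.rating?handle=tourist
    query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{API}/{method}?{query}" if query else f"{API}/{method}"

def fetch(method, **params):
    # Codeforces wraps every response as {"status": "OK", "result": ...}
    with urllib.request.urlopen(api_url(method, **params), timeout=30) as resp:
        payload = json.load(resp)
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "request failed"))
    return payload["result"]

# Per-user history used later for feature extraction:
#   fetch("user.status", handle=h)       -> submissions
#   fetch("user.rating", handle=h)       -> rating changes (deltas)
#   fetch("user.blogEntries", handle=h)  -> blog posts
```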
After collecting the data in the previous step, run `python generateLabeledData.py` to prepare the labels.
For each user, on the 1st day of each month from Jan 2021 to Nov 2024, we extract features from that user's collected history as well as their rating 6 months in the future (6 × 30 × 24 × 60 × 60 seconds, to be precise). If the user had fewer than 5 rated contests at that point, we DO NOT generate features and simply skip the sample. Otherwise we generate a (features, label) pair and save it to a CSV file for later use. Note that each user may generate anywhere from 0 (if they never reach 5 rated contests) up to 47 (features, label) pairs.
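The sampling scheme above can be sketched as follows. This is a simplified stand-in for `generateLabeledData.py`: `extract_features` is a hypothetical callback, and the rating lookup assumes the collected rating changes are sorted by timestamp:

```python
import bisect
import datetime as dt

HORIZON = 6 * 30 * 24 * 60 * 60  # ~6 months, in seconds
MIN_CONTESTS = 5

def month_starts(first=(2021, 1), last=(2024, 11)):
    # Unix timestamps for the 1st day of each month in the range (UTC).
    y, m = first
    while (y, m) <= last:
        yield int(dt.datetime(y, m, 1, tzinfo=dt.timezone.utc).timestamp())
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def rating_at(rating_changes, t):
    # rating_changes: [(timestamp, new_rating)] sorted by timestamp.
    # Returns the user's rating as of time t, or None before their first contest.
    i = bisect.bisect_right([ts for ts, _ in rating_changes], t)
    return rating_changes[i - 1][1] if i else None

def samples_for_user(rating_changes, extract_features):
    # Yield (features, label) pairs; skip months with < MIN_CONTESTS rated contests.
    for t in month_starts():
        past = [rc for rc in rating_changes if rc[0] <= t]
        if len(past) < MIN_CONTESTS:
            continue
        yield extract_features(past, t), rating_at(rating_changes, t + HORIZON)
```

Jan 2021 through Nov 2024 spans 47 month starts, which is where the maximum of 47 pairs per user comes from.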
The notebook `training_notebook.ipynb` performs the following steps:
- User Separation: Reads users from `data/users.json` and randomly splits them into training (80%) and validation (20%) sets
- Data Loading: Reads labeled data from `data/labeled_data.csv`
- Temporal Separation:
  - Training set: 2021-01-01 ≤ date ≤ 2023-11-01
  - Validation set: 2023-01-01 ≤ date ≤ 2024-11-01
- Model Training: Trains a CatBoost regression model with Mean Absolute Error (MAE) loss
- Evaluation: Reports MAE on the validation set
- Model Persistence: Saves the trained model for later use
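The split-and-train pipeline can be sketched as below. This is not the notebook's actual code: the function names, seed, and output filename are illustrative, and CatBoost is imported lazily inside `train` so the split helper runs on its own:

```python
import random

def split_users(handles, train_frac=0.8, seed=42):
    # User-level 80/20 split, so no user contributes samples to both sets.
    handles = sorted(handles)
    rng = random.Random(seed)
    rng.shuffle(handles)
    cut = int(len(handles) * train_frac)
    return set(handles[:cut]), set(handles[cut:])

def train(train_df, valid_df, feature_cols, label_col="label"):
    # Train a CatBoost regressor with MAE loss and persist it to disk.
    from catboost import CatBoostRegressor
    model = CatBoostRegressor(loss_function="MAE", eval_metric="MAE")
    model.fit(train_df[feature_cols], train_df[label_col],
              eval_set=(valid_df[feature_cols], valid_df[label_col]))
    model.save_model("rating_predictor.cbm")  # hypothetical filename
    return model
```

Splitting by user (on top of the temporal split) prevents leakage of a user's idiosyncratic rating trajectory from training into validation.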