An MLOps pipeline built with ZenML to continuously train and update a predictive model on COVID patient data. The goal is to automatically detect and adapt to concept drift in the stratification data provided by the SMS, keeping deployed models up to date and performing well while notifying the team of any issues.
Concept drift refers to changes in the relationship between input variables (X) and the target variable (y) over time, causing a previously trained model to lose accuracy.
Our pipeline detects drift implicitly by evaluating performance monthly on new data (monthly chunks) and comparing it to the performance of previous models. If performance drops (e.g., balanced accuracy decreases), the pipeline avoids updating the model, ensuring retraining only occurs when the new model actually improves or maintains performance.
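In sketch form, this supervised drift check amounts to a promotion gate: evaluate the candidate and current models on the newest chunk and replace the deployed model only if performance holds. The `should_promote` helper and the 0.7 threshold below are illustrative assumptions, not the project's actual values.

```python
from sklearn.metrics import balanced_accuracy_score

def should_promote(y_true, y_pred_candidate, y_pred_current,
                   min_threshold=0.7):
    """Decide whether to replace the deployed model.

    Hypothetical helper: evaluates both models on the latest monthly
    chunk and promotes only when the candidate clears a minimum floor
    and does not degrade performance (the implicit drift test).
    """
    cand = balanced_accuracy_score(y_true, y_pred_candidate)
    curr = balanced_accuracy_score(y_true, y_pred_current)
    return bool(cand >= min_threshold and cand >= curr)
```

If performance drops on the new chunk, the gate simply keeps the previous model, which is what makes the monthly comparison behave like a drift detector.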
- **Performance monitoring:** Evaluates performance on the most recent chunks against previous models, using metrics such as balanced accuracy.
- **Adaptive retraining:** If the data distribution changes, the pipeline trains updated ensembles on the latest data, adapting the model to new patterns.
- **Automation with CRON:** Automated execution with `crontab` ensures a continuous learning and updating system.
- Loads the data and checks the file’s last modification date.
- If no changes are detected since the last run (using a control file), the pipeline exits to optimize resources.
- If changes are detected, the updated dataframe is returned.
- Splits data into monthly chunks, treating them as independent mini-datasets.
- Creates multiple balanced bootstraps using the IPIP (Iterative Proportional Importance Pruning) strategy.
- Each bootstrap trains a Random Forest ensemble, retaining only models that improve the ensemble’s performance on a validation set.
- Generates predictions on the following chunk, simulating future data predictions.
- Saves results and metric comparisons with previous models.
- Calculates balanced accuracy both globally and per chunk.
- Generates time-series plots of the evolution of balanced accuracy.
- Saves metrics and charts as monitoring evidence.
- Compares the performance of the new model with the previous one.
- Only saves the new model if:
  - Its performance exceeds the expected minimum threshold.
  - It improves on or matches the previous model's performance.
- Saves prediction examples for auditing and traceability.
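The monthly chunking step listed above could be sketched with pandas as follows (the `date` column name is an assumption about the dataset's schema):

```python
import pandas as pd

def monthly_chunks(df: pd.DataFrame, date_col: str = "date"):
    """Split a dataframe into independent monthly mini-datasets,
    ordered chronologically."""
    df = df.assign(_period=pd.to_datetime(df[date_col]).dt.to_period("M"))
    return [g.drop(columns="_period")
            for _, g in df.groupby("_period", sort=True)]
```

Each returned chunk can then be treated as its own mini-dataset for training and for next-chunk evaluation.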
The approach relies on segmenting data into monthly chunks to quickly detect and respond to behavioral changes. For each new chunk, the performance of the current model is compared with previous models, effectively functioning as a supervised drift test.
The pipeline avoids overwriting the model if the new one does not show significant improvement, preventing degradation due to temporary drift or noise. The IPIP methodology further improves handling of imbalanced datasets.
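The chunk-ahead evaluation described above (train on chunk *i*, predict chunk *i+1* to simulate future data) can be sketched as follows. This assumes chunks are `(X, y)` pairs and substitutes a plain Random Forest for the full IPIP ensemble:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

def prequential_scores(chunks):
    """For each consecutive pair of chunks, train on the earlier one
    and score on the later one, simulating future-data prediction."""
    scores = []
    for (X_tr, y_tr), (X_te, y_te) in zip(chunks, chunks[1:]):
        model = RandomForestClassifier(n_estimators=50,
                                       random_state=0).fit(X_tr, y_tr)
        scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))
    return scores
```

A sustained drop in this score series is exactly the signal the pipeline uses to decide the current model should not be overwritten.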
The IPIP technique combines bagging, ensembles, and iterative balancing to improve predictions on imbalanced, time-evolving problems. It allows:
- Balancing classes in each bootstrap.
- Selecting models that truly improve the ensemble.
- Dynamically adapting the model to changes in data distribution (concept drift).
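A simplified sketch of the balancing-and-selection idea, not the project's exact IPIP implementation: draw class-balanced bootstraps, train one candidate per bootstrap, and keep a candidate only if it raises the ensemble's balanced accuracy on a validation set. Binary 0/1 labels are assumed for the majority vote.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

def balanced_bootstrap(X, y, rng):
    # Sample the same number of rows from each class, with replacement.
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], n, replace=True)
                          for c in classes])
    return X[idx], y[idx]

def grow_ensemble(X_tr, y_tr, X_val, y_val, n_candidates=10, seed=0):
    # Greedily keep only candidates that improve the ensemble's
    # balanced accuracy on the validation set.
    rng = np.random.default_rng(seed)
    ensemble, best = [], 0.0
    for _ in range(n_candidates):
        Xb, yb = balanced_bootstrap(X_tr, y_tr, rng)
        m = RandomForestClassifier(
            n_estimators=25,
            random_state=int(rng.integers(10**6))).fit(Xb, yb)
        trial = ensemble + [m]
        # Majority vote over the trial ensemble (binary labels assumed).
        votes = np.mean([e.predict(X_val) for e in trial], axis=0) >= 0.5
        score = balanced_accuracy_score(y_val, votes.astype(int))
        if score > best or not ensemble:
            ensemble, best = trial, score
    return ensemble, best
```

Because each bootstrap is balanced before training, minority-class patterns are represented in every candidate, and the greedy acceptance step prunes members that do not help.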
Integrates scheduled tasks with `crontab` to run the pipeline periodically, ensuring a continuous learning system that requires no manual intervention.
- ZenML
- Python 3.10
- Scikit-learn
- Pandas
- Matplotlib / Seaborn
- CRON for automation