This project develops a predictive modeling system to identify high-risk financial transactions within a large-scale dataset containing over 6 million records.
The solution includes data preprocessing, feature engineering, exploratory analysis, model development, and performance evaluation using industry-standard metrics.
- 6,362,620 transaction records
- 10 primary variables
- Highly imbalanced classification problem
- Stratified sampling (5%) for efficient computation
- Memory optimization
- Missing value handling
- Removal of ID-based leakage variables
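The sampling, memory, and leakage steps listed above could look roughly like the sketch below. It is illustrative only: the column names (`is_fraud`, `orig_id`) and helper names are hypothetical, and missing-value handling is left out because the strategy used is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(df, label_col, frac=0.05, seed=42):
    """Draw a `frac` sample that keeps the class ratio of `label_col`."""
    sample, _ = train_test_split(
        df, train_size=frac, stratify=df[label_col], random_state=seed
    )
    return sample


def downcast_numeric(df):
    """Shrink numeric columns to the smallest dtype pandas considers safe."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def drop_id_columns(df, id_cols):
    """Remove identifier columns (hypothetical names) that could leak labels."""
    return df.drop(columns=[c for c in id_cols if c in df.columns])
```

Stratifying on the label matters here because a plain 5% random sample of a highly imbalanced dataset can easily under-represent the rare class.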
- Balance difference variables
- Log transformation of transaction amount
- Ratio-based behavioral indicators
- Merchant & transaction type flags
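A minimal sketch of how the engineered features above might be derived. Only `orig_balance_diff` is named in this README; its exact definition, the input column names (`amount`, `old_balance_orig`, `new_balance_orig`, `dest_id`, `type`), and the rule that merchant IDs start with `"M"` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_features(df):
    """Return a copy of `df` with illustrative engineered features added."""
    out = df.copy()
    # Assumed definition: gap between the expected and recorded new balance.
    out["orig_balance_diff"] = (
        out["old_balance_orig"] - out["amount"] - out["new_balance_orig"]
    )
    # log1p compresses the heavy right tail and is defined at amount == 0.
    out["log_amount"] = np.log1p(out["amount"])
    # Ratio indicator; the +1 avoids division by zero on empty accounts.
    out["amount_to_balance_ratio"] = out["amount"] / (out["old_balance_orig"] + 1.0)
    # Assumed convention: merchant account IDs begin with "M".
    out["is_merchant_dest"] = (
        out["dest_id"].str.startswith("M", na=False).astype("int8")
    )
    out["is_transfer"] = (out["type"] == "TRANSFER").astype("int8")
    return out
```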
- Risk distribution visualization
- Transaction type vulnerability analysis
- Temporal transaction pattern analysis
- Random Forest (150 estimators)
- Balanced class weights for imbalanced data
- Unified preprocessing + modeling pipeline
- Time-based 80:20 split (train/validation)
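The modeling setup above can be sketched as follows. The 150 estimators, balanced class weights, and 80:20 time-ordered split come from the list; the column arguments, the one-hot encoding of categorical columns, and the helper names are assumptions rather than the project's actual code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def time_based_split(df, time_col, train_frac=0.8):
    """Use the earliest `train_frac` of rows for training, the rest for validation."""
    ordered = df.sort_values(time_col, kind="stable")
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


def build_pipeline(numeric_cols, categorical_cols, seed=42):
    """Bundle preprocessing and the classifier so both are fitted on training data only."""
    preprocess = ColumnTransformer([
        ("num", "passthrough", numeric_cols),
        # Assumed encoding; unseen categories at validation time are ignored.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    model = RandomForestClassifier(
        n_estimators=150, class_weight="balanced", random_state=seed
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])
```

Splitting by time rather than at random keeps later transactions out of training, which gives a more honest estimate of how the model would perform on future data.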
| Metric | Score |
|---|---|
| Precision | 1.00 |
| Recall | 0.64 |
| ROC AUC | 0.95 |
| PR AUC | 0.85 |
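For reference, the four metrics in the table can be computed with scikit-learn as in this sketch. The 0.5 decision threshold and the use of average precision as the PR AUC estimate are assumptions; this README does not state how either was chosen.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, scores, threshold=0.5):
    """Compute threshold-based and ranking-based metrics from predicted scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    y_pred = (scores >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, scores),
        # Average precision is one common summary of the precision-recall curve.
        "pr_auc": average_precision_score(y_true, scores),
    }
```

Precision and recall depend on the chosen threshold, while ROC AUC and PR AUC summarize ranking quality across all thresholds.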
- Balance inconsistencies (`orig_balance_diff`)
- Transaction amount
- Certain transaction types (TRANSFER, PAYMENT)
- Account balance shifts
- Large balance mismatches combined with high transfer amounts significantly increase transaction risk.
- Precision is prioritized to reduce unnecessary investigation costs.
- Recall can be tuned further depending on business risk tolerance.
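One way to tune recall against a precision requirement is to pick the decision threshold from the precision-recall curve. The sketch below is an illustration under assumptions: the function name and the `min_precision` floor are hypothetical, and the right floor depends on the business risk tolerance mentioned above.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_min_precision(y_true, scores, min_precision=0.95):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least `min_precision`, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final (precision=1, recall=0) point has no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    meets_floor = precision >= min_precision
    if not meets_floor.any():
        return None
    best = int(np.argmax(np.where(meets_floor, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])
```

Scores at or above the returned threshold would be flagged for investigation. Lowering `min_precision` trades more false alarms for more detected cases.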
- Python
- Pandas, NumPy
- Scikit-learn
- Matplotlib, Seaborn
- Joblib
Akshad Goyanka