This project aims to build a robust machine learning classifier to distinguish between safe and unsafe/malicious AI prompts. Unsafe prompts include adversarial "prompt injections" and jailbreak attacks designed to bypass AI safety measures. By leveraging diverse datasets, the goal is to enhance the security and reliability of conversational AI systems.
- Safe Prompts: Collected from AI-generated prompt datasets, prompt engineering examples, and community-curated ChatGPT prompts. Labeled as
0. - Unsafe Prompts: Includes forbidden question sets, jailbreak prompts, and malicious prompt collections from real-world adversarial sources. Labeled as
1. - Combined dataset contains over 80,000 unique prompts after cleaning with a class imbalance favoring safe prompts.
- Loaded and unified multiple datasets with consistent labeling and column naming.
- Removed duplicate and null entries.
- Analyzed prompt length distribution, finding a majority of short prompts.
- Converted text prompts into numerical vectors using:
- TF-IDF Vectorizer (max 5000 features, removing English stop words)
- Bag of Words Vectorizer (max 5000 features)
- Used Logistic Regression and Random Forest classifiers.
- Trained with an 80/20 train-test split.
- Achieved excellent performance:
- Accuracy ~99.5–99.8%
- Precision and recall near 1.0 for both classes
- AUC ROC of 1.00 for all models
- Confusion matrices confirm very low misclassification rates.
- ROC curves demonstrate near-perfect ability to separate safe from unsafe prompts.
- Example predictions show consistent classification of benign and malicious inputs.
- Preprocess your prompts using the vectorizers.
- Load a trained model.
- Use the
predict_class(model, vectorizer, text)function to classify new prompts.
- Integrate transformer-based or deep learning classifiers.
- Enrich the adversarial dataset with more complex prompt injections.
- Implement ensemble methods and threshold optimization for enhanced detection.
Ananya P S
20221CSD0106
For detailed code, dataset processing, and evaluation, please refer to the accompanying project notebook.