- Project Overview
- Architecture & Workflow
- Project Structure
- Prerequisites & Setup
- Installation & Configuration
- Project Workflow
- API Endpoints
- Docker Setup
- CI/CD Pipeline
- Troubleshooting
Network Security is an end-to-end machine learning application designed to detect phishing and network security threats using classification models. The project implements a complete MLOps pipeline with data ingestion, validation, transformation, model training, and deployment capabilities.
- Data Pipeline: MongoDB integration for data ingestion
- Data Validation: Schema validation and drift detection
- Data Transformation: KNN imputation for missing values
- Model Training: Multiple classification algorithms with hyperparameter tuning
- MLflow Integration: Experiment tracking and model monitoring
- REST API: FastAPI for model serving and predictions
- Cloud Integration: AWS S3 for artifact storage
- Docker Support: Containerized deployment
- CI/CD: GitHub Actions for automated workflows
┌─────────────────────────────────────────────────────────────┐
│ DATA SOURCE (MongoDB) │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. DATA INGESTION (CSV Export & Split) │
│ - Connects to MongoDB │
│ - Exports collection as DataFrame │
│ - Splits into train/test sets (80/20) │
│ - Saves to: Artifacts/<timestamp>/data_ingestion/ │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. DATA VALIDATION (Schema & Drift Detection) │
│ - Validates number of columns against schema.yaml │
│ - Detects data drift using KS-2 sample test │
│ - Generates drift report │
│ - Saves valid data to: data_validation/validated/ │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. DATA TRANSFORMATION (Feature Engineering & Imputation)│
│ - Handles missing values with KNNImputer (k=3) │
│ - Creates preprocessing pipeline │
│ - Transforms features using fitted preprocessor │
│ - Saves transformed data as .npy files │
│ - Saves preprocessor object for inference │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. MODEL TRAINING (Hyperparameter Tuning & Evaluation) │
│ - Trains multiple algorithms: │
│ • Random Forest │
│ • Decision Tree │
│ • Gradient Boosting │
│ • Logistic Regression │
│ • AdaBoost │
│ - GridSearchCV for hyperparameter optimization │
│ - Evaluates on train/test sets │
│ - Calculates metrics: F1, Precision, Recall │
│ - Tracks experiments with MLflow │
│ - Selects best model based on test R² score │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. MODEL DEPLOYMENT (S3 & Final Model Creation) │
│ - Saves artifacts to S3 bucket │
│ - Creates NetworkModel wrapper (preprocessor + model) │
│ - Saves final_model/ directory with: │
│ • model.pkl (trained classifier) │
│ • preprocessor.pkl (transformation pipeline) │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. INFERENCE (REST API Endpoint) │
│ - Load preprocessor and model from final_model/ │
│ - Create NetworkModel instance │
│ - Transform input features with preprocessor │
│ - Generate predictions │
│ - Return predictions to user │
└─────────────────────────────────────────────────────────────┘
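The stages above are chained by passing each stage's artifact into the next stage's component. Below is a minimal orchestration sketch in the spirit of `main.py` / `pipeline/training_pipeline.py`; the config-class names and some `initiate_*` method names are assumptions based on the layout above, not the exact signatures in the repository.

```python
# Illustrative orchestration of the six stages (class/method names are assumptions).
from Network_security.components.data_ingestion import DataIngestion
from Network_security.components.data_validation import DataValidation
from Network_security.components.data_transformation import DataTransformation
from Network_security.components.model_trainer import ModelTrainer
from Network_security.entity.config_entity import (
    TrainingPipelineConfig, DataIngestionConfig, DataValidationConfig,
    DataTransformationConfig, ModelTrainerConfig,
)

def run_pipeline():
    cfg = TrainingPipelineConfig()  # creates the timestamped Artifacts/<timestamp>/ root

    ingestion_artifact = DataIngestion(DataIngestionConfig(cfg)).initiate_data_ingestion()
    validation_artifact = DataValidation(
        DataValidationConfig(cfg), ingestion_artifact).initiate_data_validation()
    transformation_artifact = DataTransformation(
        DataTransformationConfig(cfg), validation_artifact).initiate_data_transformation()
    trainer_artifact = ModelTrainer(
        ModelTrainerConfig(cfg), transformation_artifact).initiate_model_trainer()
    return trainer_artifact

if __name__ == "__main__":
    print(run_pipeline())
```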
Network_Security/
├── Network_security/
│ ├── components/
│ │ ├── data_ingestion.py # MongoDB → CSV conversion
│ │ ├── data_validation.py # Schema & drift validation
│ │ ├── data_transformation.py # Feature preprocessing
│ │ └── model_trainer.py # Model training & selection
│ ├── entity/
│ │ ├── config_entity.py # Configuration classes
│ │ └── artifacts_entity.py # Artifact data classes
│ ├── exception/
│ │ └── exception.py # Custom exception handling
│ ├── logging/
│ │ └── logger.py # Logging configuration
│ ├── utils/
│ │ ├── main_utils/
│ │ │ └── utils.py # Helper functions
│ │ └── ml_utils/
│ │ ├── metric/
│ │ │ └── classification_metric.py # Metrics calculation
│ │ └── model/
│ │ └── estimator.py # NetworkModel wrapper
│ ├── constants/
│ │ └── training_pipeline/
│ │ └── __init__.py # Pipeline constants
│ ├── cloud/
│ │ └── s3_syncer.py # AWS S3 integration
│ └── pipeline/
│ └── training_pipeline.py # Main orchestration
├── data_schema/
│ └── schema.yaml # Data validation schema
├── templates/
│ └── table.html # HTML template for predictions
├── logs/ # Training logs
├── Artifacts/ # Generated artifacts
├── final_model/ # Deployed model artifacts
├── prediction_output/ # Prediction results
│
├── app.py # FastAPI application
├── main.py # Direct execution entry point
├── push_data.py # MongoDB data loader
│
├── requirements.txt # Python dependencies
├── setup.py # Package configuration
├── DOCKERFILE # Container configuration
├── .github/
│ └── workflows/
│ └── main.yml # GitHub Actions CI/CD
├── .env # Environment variables
├── .env.example # Environment template
├── .gitignore # Git ignore rules
├── LICENSE # GPL v3 License
└── README.md # This file
git clone https://github.com/ashmijha/Network_Security.git
cd Network_Security
# Create virtual environment
python -m venv env
# Activate virtual environment
# On Linux/macOS:
source env/bin/activate
# On Windows:
env\Scripts\activate
pip install -r requirements.txt
# Create .env file from template
cp .env.example .env
# Edit .env with your credentials
nano .env
Required .env variables:
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>?retryWrites=true&w=majority
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-east-1
# Run the data push script
python push_data.py
# This will:
# 1. Read Network_Data/phisingData.csv
# 2. Convert CSV to JSON records
# 3. Insert into MongoDB ASHMI_DB.NetworkData collection
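A minimal sketch of what this loader does with pandas and pymongo; the CSV path, database, and collection names follow the description above, but the actual `push_data.py` may differ in details.

```python
# Sketch of the CSV -> MongoDB load described above (not the exact push_data.py).
import json
import os

import pandas as pd
from pymongo import MongoClient

def csv_to_mongo(csv_path: str = "Network_Data/phisingData.csv") -> int:
    # Read the phishing dataset and convert rows to JSON-style records
    df = pd.read_csv(csv_path)
    records = json.loads(df.to_json(orient="records"))

    # Insert the records into the ASHMI_DB.NetworkData collection
    client = MongoClient(os.environ["MONGODB_URI"])
    collection = client["ASHMI_DB"]["NetworkData"]
    result = collection.insert_many(records)
    return len(result.inserted_ids)

if __name__ == "__main__":
    print(f"Inserted {csv_to_mongo()} records")
```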
# Run the main training pipeline
python main.py
# Or use the modular approach in main.py:
# - Data Ingestion
# - Data Validation
# - Data Transformation
# - Model Training
# Start the API server
python app.py
# The server will run at: http://localhost:8000
# Access interactive docs:
# - Swagger UI: http://localhost:8000/docs
# - ReDoc: http://localhost:8000/redoc
# Build the Docker image
docker build -t network-security:latest .
# Run the container
docker run -p 8000:8000 \
-e MONGODB_URI="your_mongodb_uri" \
-e AWS_ACCESS_KEY_ID="your_aws_key" \
-e AWS_SECRET_ACCESS_KEY="your_aws_secret" \
network-security:latest
# Loads data from MongoDB
# File: Network_security/components/data_ingestion.py
DataIngestion:
├── export_collection_as_dataframe()
│ └── Connects to MongoDB and fetches NetworkData
├── export_data_into_feature_store()
│ └── Saves full dataset as CSV
└── split_data_as_train_test()
├── 80% training data
└── 20% testing data
Output Artifacts:
- Artifacts/<timestamp>/data_ingestion/feature_store/phisingData.csv
- Artifacts/<timestamp>/data_ingestion/ingested/train.csv
- Artifacts/<timestamp>/data_ingestion/ingested/test.csv
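A condensed sketch of this step (MongoDB collection → DataFrame → 80/20 split). The function names mirror the outline above; the random seed and other details are illustrative.

```python
# Illustrative version of the DataIngestion steps outlined above.
import os

import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split

def export_collection_as_dataframe(database: str, collection: str) -> pd.DataFrame:
    # Fetch every document and drop MongoDB's internal _id column
    client = MongoClient(os.environ["MONGODB_URI"])
    df = pd.DataFrame(list(client[database][collection].find()))
    return df.drop(columns=["_id"], errors="ignore")

def split_data_as_train_test(df: pd.DataFrame, out_dir: str) -> None:
    # 80/20 split, saved as the ingested train/test artifacts
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
    os.makedirs(out_dir, exist_ok=True)
    train_df.to_csv(os.path.join(out_dir, "train.csv"), index=False)
    test_df.to_csv(os.path.join(out_dir, "test.csv"), index=False)
```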
# Validates data quality and detects drift
# File: Network_security/components/data_validation.py
DataValidation:
├── validate_number_of_columns()
│ └── Checks against data_schema/schema.yaml
└── detect_dataset_drift()
└── Uses Kolmogorov-Smirnov test (p-value threshold: 0.05)
Output Artifacts:
- Artifacts/<timestamp>/data_validation/validated/train.csv
- Artifacts/<timestamp>/data_validation/validated/test.csv
- Artifacts/<timestamp>/data_validation/drift_report/report.yaml
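The drift check compares each column's train and test distributions with the two-sample Kolmogorov-Smirnov test; a p-value below 0.05 flags drift for that feature. A minimal sketch (illustrative, not the exact `detect_dataset_drift` implementation):

```python
# Column-wise KS two-sample drift check, as described above (illustrative).
import pandas as pd
from scipy.stats import ks_2samp

def detect_dataset_drift(base_df: pd.DataFrame, current_df: pd.DataFrame,
                         threshold: float = 0.05) -> dict:
    report = {}
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        # A low p-value means the two samples likely come from different distributions
        report[column] = {
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < threshold),
        }
    return report
```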
# Transforms features using KNN imputation
# File: Network_security/components/data_transformation.py
DataTransformation:
├── get_data_transformer_object()
│ └── Creates Pipeline with KNNImputer(n_neighbors=3)
└── initiate_data_transformation()
├── Fits preprocessor on training data
├── Transforms train features
├── Transforms test features
└── Appends target column
Output Artifacts:
- Artifacts/<timestamp>/data_transformation/transformed/train.npy
- Artifacts/<timestamp>/data_transformation/transformed/test.npy
- Artifacts/<timestamp>/data_transformation/transformed_object/preprocessing.pkl
- final_model/preprocessor.pkl (for inference)
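A minimal sketch of the preprocessing object: a scikit-learn Pipeline wrapping KNNImputer(n_neighbors=3), fit on the training features only and reused to transform both splits. This is illustrative; the real `get_data_transformer_object` may pass additional parameters.

```python
# Preprocessing pipeline sketch matching the KNNImputer(k=3) step described above.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

def get_data_transformer_object() -> Pipeline:
    # Missing values are filled from the 3 nearest neighbours in feature space
    return Pipeline(steps=[("imputer", KNNImputer(n_neighbors=3))])

# Usage: fit on the training features only, then transform both splits
# preprocessor = get_data_transformer_object()
# X_train_t = preprocessor.fit_transform(X_train)
# X_test_t = preprocessor.transform(X_test)
# np.save("train.npy", np.c_[X_train_t, y_train])  # append the target column
```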
# Trains and selects best model
# File: Network_security/components/model_trainer.py
ModelTrainer:
├── train_model()
│ ├── Initialize 5 classifiers
│ ├── GridSearchCV for hyperparameters
│ ├── Train and evaluate each model
│ └── Select best by test score
├── track_mlflow()
│ └── Log metrics: F1, Precision, Recall
└── initiate_model_trainer()
└── Create ModelTrainerArtifact
Models Trained:
| Model | Hyperparameters |
|---|---|
| Random Forest | n_estimators: [8, 16, 32, 128, 256] |
| Decision Tree | criterion: ['gini', 'entropy', 'log_loss'] |
| Gradient Boosting | learning_rate: [0.1, 0.01, 0.05, 0.001] |
| Logistic Regression | default |
| AdaBoost | n_estimators, learning_rate |
Output Artifacts:
- Artifacts/<timestamp>/model_trainer/trained_model/model.pkl
- final_model/model.pkl (for inference)
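A condensed sketch of the train-and-select loop: each candidate is tuned with GridSearchCV, scored on the held-out test set, and its metrics are logged to MLflow. The grids are abbreviated from the table above (the AdaBoost values are assumptions), and this sketch selects by test F1 rather than whatever score the project's evaluation helper uses.

```python
# Illustrative train/tune/select loop; hyperparameter grids abbreviated from the table above.
import mlflow
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    "RandomForest": (RandomForestClassifier(), {"n_estimators": [8, 16, 32, 128, 256]}),
    "DecisionTree": (DecisionTreeClassifier(), {"criterion": ["gini", "entropy", "log_loss"]}),
    "GradientBoosting": (GradientBoostingClassifier(), {"learning_rate": [0.1, 0.01, 0.05, 0.001]}),
    "LogisticRegression": (LogisticRegression(max_iter=1000), {}),
    "AdaBoost": (AdaBoostClassifier(), {"n_estimators": [8, 16, 32], "learning_rate": [0.1, 0.01]}),
}

def train_and_select(X_train, y_train, X_test, y_test):
    best_name, best_model, best_score = None, None, -1.0
    for name, (estimator, grid) in MODELS.items():
        search = GridSearchCV(estimator, grid, cv=3)
        search.fit(X_train, y_train)
        preds = search.best_estimator_.predict(X_test)
        score = f1_score(y_test, preds)
        # Track each candidate's test metrics with MLflow
        with mlflow.start_run(run_name=name):
            mlflow.log_metric("f1_score", score)
            mlflow.log_metric("precision", precision_score(y_test, preds))
            mlflow.log_metric("recall", recall_score(y_test, preds))
        if score > best_score:
            best_name, best_model, best_score = name, search.best_estimator_, score
    return best_name, best_model
```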
GET /
Redirects to Swagger documentation at /docs
GET /train
Description: Triggers the complete training pipeline
Response:
{
"message": "Training is successful"
}
Example:
curl -X GET "http://localhost:8000/train"
POST /predict
Description: Uploads CSV file and returns predictions
Parameters:
- `file` (multipart/form-data): CSV file with features
Response: HTML table with predictions
Example:
curl -X POST "http://localhost:8000/predict" \
-F "file=@input_data.csv"Input CSV Format:
feature1,feature2,feature3,...,featureN
0.5,0.3,0.8,...,0.2
0.6,0.4,0.7,...,0.3
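Equivalent to the curl call above, a small Python client can post a CSV and save the returned HTML table. The `input_data.csv` name is just a placeholder for your own feature file.

```python
# Post a CSV of features to the /predict endpoint and save the returned HTML table.
import os
import requests

with open("input_data.csv", "rb") as f:
    response = requests.post(
        "http://localhost:8000/predict",
        files={"file": ("input_data.csv", f, "text/csv")},
    )
response.raise_for_status()

os.makedirs("prediction_output", exist_ok=True)
with open("prediction_output/predictions.html", "w", encoding="utf-8") as out:
    out.write(response.text)
```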
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y awscli git
# Copy requirements
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy project
COPY . .
# Install package
RUN pip install -e .
# Expose port
EXPOSE 8000
# Run application
CMD ["python", "app.py"]# Build image
docker build -t network-security:latest .
# Run container with environment variables
docker run -d \
--name network-security \
-p 8000:8000 \
-e MONGODB_URI="$MONGODB_URI" \
-e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
-e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
network-security:latest
# View logs
docker logs -f network-security
# Stop container
docker stop network-security
# File: .github/workflows/main.yml
Workflow Steps:
- Trigger: On push to `main` branch
- Checkout: Clone repository
- Setup Python: Install Python 3.9
- Install Dependencies: `pip install -r requirements.txt`
- Linting (optional): Code quality checks
- Run Tests (optional): Unit tests
- Build Docker Image: Create container
- Push to Registry (optional): Docker Hub or ECR
- Deploy (optional): Deploy to cloud
# The workflow runs automatically on:
git push origin main
# Or manually trigger from GitHub UI:
# Actions → Select workflow → Run workflow
In GitHub repository settings, add these secrets:
MONGODB_URI → Your MongoDB connection string
AWS_ACCESS_KEY_ID → Your AWS access key
AWS_SECRET_ACCESS_KEY → Your AWS secret key
DOCKER_USERNAME → Docker Hub username (optional)
DOCKER_PASSWORD → Docker Hub token (optional)
logs/
├── MM_DD_YYYY_HH_MM_SS.log
└── (New log created for each run)
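A minimal sketch of how a per-run log file with this timestamp pattern can be configured; the format string and directory handling are illustrative, and the actual `logger.py` may differ.

```python
# Create one timestamped log file per run, matching the MM_DD_YYYY_HH_MM_SS pattern above.
import logging
import os
from datetime import datetime

LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"
LOG_DIR = os.path.join(os.getcwd(), "logs")
os.makedirs(LOG_DIR, exist_ok=True)

logging.basicConfig(
    filename=os.path.join(LOG_DIR, LOG_FILE),
    format="[ %(asctime)s ] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
```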
# Follow logs in real-time
tail -f logs/01_01_2025_10_30_45.log
# Search for errors
grep "ERROR" logs/*.log
# View specific component logs
grep "DataTransformation" logs/*.log# Start MLflow UI
mlflow ui --host 0.0.0.0 --port 5000
# Access at: http://localhost:5000
- F1 Score: 0.80-0.95 (depending on dataset)
- Precision: 0.80-0.90
- Recall: 0.75-0.90
- Training Time: 2-5 minutes (depends on hardware)
- Total Records: ~11,000 (phisingData.csv)
- Features: 30
- Target Classes: Binary (0, 1)
- Missing Values: Handled by KNN Imputation
- FastAPI Documentation: https://fastapi.tiangolo.com/
- MLflow Documentation: https://mlflow.org/docs/
- MongoDB Documentation: https://docs.mongodb.com/
- AWS S3 Documentation: https://docs.aws.amazon.com/s3/
- scikit-learn Documentation: https://scikit-learn.org/
This project is licensed under the GNU General Public License v3.0 - see LICENSE file for details.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request