A deep learning system for automatic image tagging, built on a ResNet-18 backbone.
VectorTag automatically tags images with semantic labels (e.g., "building", "food", "person", "nature"). The system:
- Multi-label classification: Each image can have multiple tags simultaneously (see the inference sketch below).
- Interpretable predictions: Uses Grad-CAM to visualize which image regions influenced each tag.
- Interactive UI: Streamlit-based web interface for inference and exploration.
- Production-ready: Docker containerization for easy deployment.
- ResNet-18 backbone pretrained on ImageNet
- BCEWithLogitsLoss with class weighting to handle imbalanced datasets
- Grad-CAM visualization for model interpretability
- Data augmentation (rotation, flips, color jitter)
- LR Scheduler with early stopping to prevent overfitting
- Streamlit UI for interactive inference
- Docker deployment ready
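Because tagging is multi-label, inference applies an independent sigmoid to each tag's logit and keeps every tag whose score clears a threshold, rather than picking a single softmax class. A minimal sketch (the helper name and the 0.5 threshold are assumptions, not the project's API):

```python
import torch

def predict_tags(model, image_tensor, class_names, threshold=0.5):
    """Multi-label inference sketch: one independent sigmoid per tag, thresholded.
    `image_tensor` is a preprocessed (1, 3, H, W) batch; `threshold` is an assumption."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(image_tensor))[0]  # (num_classes,) tag probabilities
    return [(name, prob.item()) for name, prob in zip(class_names, probs) if prob >= threshold]
```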
VectorTag/
├── src/
│   ├── core/
│   │   └── config.py              # Central configuration (paths, hyperparams)
│   ├── data/
│   │   ├── tagged_dataset.py      # PyTorch Dataset class
│   │   ├── loaders.py             # DataLoader creation with transforms
│   │   └── taxonomy.py            # Tag synonyms and hierarchies
│   ├── models/
│   │   └── baseline.py            # ResNet-18 model definition
│   ├── scripts/
│   │   └── train.py               # Training loop with auto-plotting
│   ├── ui/
│   │   ├── app.py                 # Main Streamlit app
│   │   ├── modes/
│   │   │   ├── base.py            # Abstract inference mode
│   │   │   └── standard.py        # Standard inference mode
│   │   └── components/            # Reusable UI components
│   └── utils/
│       ├── gradcam.py             # Grad-CAM heatmap generation
│       └── plotting.py            # Training visualization
├── models/
│   └── standard/
│       ├── weights/               # Saved model weights (.pth)
│       └── classes/               # Class definitions (.json)
├── assets/
│   ├── exp_00X_*.png              # Training curves from experiments
│   └── comparison_*.png           # Grad-CAM comparisons
├── data/
│   └── raw/
│       └── various_tagged_images/ # Dataset images + metadata.csv
├── experiments.md                 # Detailed experiment logs
├── requirements.txt               # Python dependencies
├── Dockerfile                     # Container definition
└── README.md                      # This file
# Clone repository (choose one method below)
# HTTPS
git clone https://github.com/ZenbiteXYZ/VectorTag.git
# Or SSH
git clone git@github.com:ZenbiteXYZ/VectorTag.git
# Navigate to directory
cd VectorTag
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Download the Kaggle Various Tagged Images dataset and extract it to:
data/raw/various_tagged_images/
├── metadata.csv
├── image_1.jpg
├── image_2.jpg
└── ...
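For orientation, the sketch below shows how such a layout can be read into a multi-label dataset. The column names (`image`, `tags`) and the pipe-separated tag format are assumptions for illustration only; the real schema is handled by src/data/tagged_dataset.py.

```python
import torch
import pandas as pd
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class TaggedImagesSketch(Dataset):
    """Illustrative reader for the layout above. The 'image' and 'tags' columns and the
    pipe-separated tag format are hypothetical, not the dataset's actual schema."""
    def __init__(self, root="data/raw/various_tagged_images", transform=None):
        self.root = Path(root)
        self.df = pd.read_csv(self.root / "metadata.csv")
        self.classes = sorted({t for tags in self.df["tags"] for t in tags.split("|")})
        self.index = {c: i for i, c in enumerate(self.classes)}
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        image = Image.open(self.root / row["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        target = torch.zeros(len(self.classes))   # multi-hot label vector
        for tag in row["tags"].split("|"):
            target[self.index[tag]] = 1.0
        return image, target
```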
streamlit run src/ui/app.py

Then navigate to http://localhost:8501 in your browser.
Note: Pre-trained model weights are already included in models/standard/weights/. To retrain with custom settings, see the Training Your Own Model section below.
Configuration:
- Loss: BCEWithLogitsLoss with pos_weight (balanced classes)
- Data: 200K samples, Top-150 tags, stratified split
- Training: 12 epochs, LR Scheduler, Weight Decay = 1e-5
Results:
| Epoch | Train Loss | Val Loss | Notes |
|---|---|---|---|
| 1 | 0.2569 | 0.2028 | High loss due to pos_weight |
| 2 | 0.2107 | 0.1925 | Quick descent |
| 7 | 0.1726 | 0.1835 | Best validation point |
| 12 | 0.1372 | 0.1932 | Training continues |
Key Insights:
- ✅ Sharp Grad-CAM: Model focuses on relevant image regions without noise.
- ✅ High Confidence: Predictions reach 70%+ for confident tags.
- ⚠️ Overfitting starts: After epoch 7, validation loss increases (expected with imbalanced data).
- ✅ Best generalization: Compared to other loss functions (Focal Loss performed worse).
Building Tag (Grad-CAM Comparison):
- Exp 002 (BCE): Clear boundary, but low confidence (37%).
- Exp 004 (Focal): Blurry boundary, more noise (43% confidence).
- Exp 005 (Weighted BCE): ⭐ Best: Sharp boundary, high confidence (70%), captures the building outline without sky.
Food Tag (Grad-CAM Comparison):
Edit src/core/config.py to customize:
# Model
BATCH_SIZE = 16 # Reduce for high-res images
LEARNING_RATE = 1e-4 # Baseline learning rate
EPOCHS = 12 # Total epochs
WEIGHT_DECAY = 1e-5 # L2 regularization
# Data
TOP_K = 150 # Use top-150 most frequent tags
MAX_SAMPLES = 200_000 # Limit dataset size (for speed)

# Build image
docker build -t vectortag-ui .
# Run container
docker run --rm -p 8501:8501 \
-v $(pwd)/models:/app/models \
  vectortag-ui

Access the UI at http://localhost:8501
To train or retrain the model with custom settings:
python src/scripts/train.py

What happens:
- Loads data with augmentation (crop, flip, rotation, color jitter)
- Computes class weights for imbalanced tags
- Trains ResNet-18 for N epochs
- Saves best model to models/standard/weights/
- Auto-generates learning curve plot
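In code, that procedure roughly corresponds to the sketch below. The optimizer choice (Adam), scheduler patience, and checkpoint path are assumptions; the authoritative version is src/scripts/train.py.

```python
import torch
from torch import nn

def train(model, train_loader, val_loader, pos_weight, epochs=12, lr=1e-4,
          weight_decay=1e-5, save_path="models/standard/weights/best.pth", device="cuda"):
    """Training-loop sketch: weighted BCE, ReduceLROnPlateau, and 'early stopping'
    in the sense of keeping only the best-validation checkpoint."""
    model.to(device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for images, targets in val_loader:
                images, targets = images.to(device), targets.to(device).float()
                val_loss += criterion(model(images), targets).item() * images.size(0)
                n += images.size(0)
        val_loss /= n
        scheduler.step(val_loss)        # reduce LR when validation loss plateaus
        if val_loss < best_val:         # keep only the best checkpoint
            best_val = val_loss
            torch.save(model.state_dict(), save_path)
```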
All settings are configurable in src/core/config.py:
- TOP_K: Number of tags (default: 150)
- BATCH_SIZE: Batch size (default: 32)
- EPOCHS: Training epochs (default: 12)
- LEARNING_RATE: Base LR (default: 1e-4)
- WEIGHT_DECAY: Regularization (default: 1e-5)
- TaggedImagesDataset: Multi-label dataset with tag synonyms and hierarchies.
- Stratified split: Ensures rare classes are equally distributed in train/val.
- Smart subsampling: Weights samples by class rarity for balanced mini-batches.
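One way to realize rarity-weighted subsampling is PyTorch's WeightedRandomSampler, as sketched below; this is an illustration, not necessarily the exact logic in src/data/loaders.py.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def rarity_sampler(labels: torch.Tensor) -> WeightedRandomSampler:
    """labels: (N, C) multi-hot tensor. Samples carrying rare tags get larger weights,
    so rare classes show up more often in mini-batches (illustrative sketch)."""
    class_freq = labels.sum(dim=0).clamp(min=1)    # positives per class
    rarity = 1.0 / class_freq                      # rare classes weigh more
    sample_weights = (labels * rarity).sum(dim=1).clamp(min=rarity.min().item())
    return WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
```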
ResNet-18 (ImageNet pretrained)
        ↓
Feature Extractor → 512D
        ↓
Linear(512 → 256)
        ↓
ReLU()
        ↓
Dropout(0.4)
        ↓
Linear(256 → num_classes)
        ↓
BCEWithLogitsLoss (per-class)
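A minimal sketch of that head in PyTorch, assuming the current torchvision weights API; the project's actual definition lives in src/models/baseline.py and may differ in detail.

```python
import torch.nn as nn
from torchvision import models

def build_baseline(num_classes: int) -> nn.Module:
    """Sketch of the architecture above: ResNet-18 backbone plus a multi-label head."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Sequential(        # replace the 1000-class ImageNet head
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(256, num_classes),    # raw logits; BCEWithLogitsLoss applies the sigmoid
    )
    return backbone
```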
- Grad-CAM: Computes class activation maps using gradients (see the sketch after this list).
- Class weight computation: pos_weight = N_neg / N_pos, clamped to [1.0, 20.0]
- LR Scheduler: ReduceLROnPlateau reduces the learning rate on a validation plateau
- Early stopping: Saves only the best model, based on validation loss
- Auto-plotting: Generates learning curve after training
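For reference, a minimal hook-based Grad-CAM along these lines (the project's implementation is src/utils/gradcam.py; for ResNet-18, `model.layer4` is the usual target layer):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx, target_layer):
    """Minimal Grad-CAM sketch: weight the target layer's activations by the spatial mean
    of their gradients w.r.t. one class logit, then ReLU and normalize to a heatmap.
    `image` is a (1, 3, H, W) tensor; `target_layer` is e.g. model.layer4 for ResNet-18."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        model.eval()
        logits = model(image)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove()
        h2.remove()
    acts, grads = activations[0], gradients[0]       # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]                                 # (H, W) heatmap in [0, 1]
```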
- Dynamic tag addition: Add new tags without full model retraining
- Vision Transformer (ViT): Replace ResNet-18 with ViT for better accuracy
Detailed experiments are documented in experiments.md:
- Exp 001: Baseline (Overfitting issue discovered)
- Exp 002: Synonyms + Dropout + Augmentation (Overfitting solved)
- Exp 003: LR Scheduler + Weight Decay (Better convergence)
- Exp 004: FocalLoss (Poor Grad-CAM quality)
- Exp 005: Weighted BCE ⭐ (Current best)
- torch, torchvision: Deep learning
- pillow, pandas: Image & data processing
- scikit-learn: Stratified split
- streamlit: Web UI
- pydantic: Configuration management
Contributions welcome! Areas of interest:
- Better loss functions for imbalanced multi-label classification
- Improved data augmentation strategies
- Alternative backbone architectures
- Performance optimizations
See LICENSE file.


