YOLOv8 Custom Object Detection for Desk Objects

Project Overview

This project implements a custom YOLOv8 object detection model trained to identify 9 specific desk objects. The motivation for creating a custom model arose from poor performance of pre-trained models on domain-specific objects.

Problem Statement & Motivation

Initial Challenge

When testing with pre-trained YOLOv8s and YOLOv8l models on desk objects, the results were highly inaccurate:

Monitor was misclassified as a refrigerator
Mechanical pencil was detected as a toothbrush
Other objects showed similar misclassification issues

Solution Approach

Rather than relying on general-purpose pre-trained models, I decided to create a custom dataset and train a domain-specific model for better accuracy on the target objects.

Dataset Creation

Video Recording Strategy

5 training videos of 30 seconds each
Multiple viewpoints: birds-eye, front, horizontal, left, and right views
Frame extraction: Every 2 seconds (frame_interval=15 at 30fps)
Total dataset: ~351 labeled images

Target Objects (9 Classes)

Water bottle
Mouse
AirPod case
Mechanical pencil
Keys
Guitar pick
Keyboard
Monitor
Laptop

Data Split

Training: 80% (~281 images)
Validation: 20% (~70 images)

Usage Instructions

Quick Start

The main pipeline (src/main_pipeline.py) provides an interactive interface that handles the complete workflow:

cd YOLOv8Model
python src/main_pipeline.py

Pipeline Workflow

When you run the main script, it will:

Configure Paths: Automatically sets up the directory structure
- Images: dataset/images/train_upright/
- Labels: dataset/labels/all/
- Test Video: assets/testVideo/deskVideo.mp4
Dataset Setup: Converts JSON annotations to YOLO format and creates train/val splits
- 80% training, 20% validation split
- Generates dataset.yaml configuration file

Training Options: Interactive menu with two choices:

a) Train custom model (takes time but better for your specific objects)
b) Skip training and use pre-trained model (faster)

Object Detection: Runs detection on your test video and outputs:
- Annotated video file: results/detection/video_results.mp4
- CSV file with detection data: detection_results.csv

Option A: Custom Training (Recommended)

Trains a YOLOv8n model specifically on your desk objects
Takes ~4 hours on CPU (faster with GPU)
Produces better results for domain-specific objects
Saves best model as runs/train/desk_objects/weights/best.pt

Option B: Pre-trained Model (Faster)

Uses general YOLOv8 model without custom training
Faster execution (~5-10 minutes)
May have lower accuracy on specific desk objects
Good for quick testing and demonstration

Prerequisites

Install Dependencies:

pip install ultralytics opencv-python pandas pyyaml

Directory Structure: Ensure your project follows this structure:

YOLOv8Model/
├── src/main_pipeline.py
├── dataset/
│   ├── images/train_upright/    # Your training images
│   └── labels/all/              # JSON annotation files
├── assets/testVideo/
│   └── deskVideo.mp4           # Your test video
└── results/                    # Output directory (created automatically)

Test Video: Place your 1-minute desk/room video as assets/testVideo/deskVideo.mp4

Output Files

After running the pipeline, you'll find:

Annotated Video: results/detection/video_results.mp4
Detection CSV: detection_results.csv with columns:
- frame_number: Frame index in video
- object_class: Detected object name
- confidence_score: Detection confidence (0-1)
- bounding_box: Coordinates (x, y, width, height)

Alternative Scripts (For Advanced Users)

If you need to run individual components:

# Extract frames from training videos
python src/extract_frames.py

# Rotate images if orientation is incorrect
python src/rotate_images.py

# Run detection on trained model
python src/run_best_pt.py

Pre-trained Model Foundation

yolov8n.pt is the pre-trained YOLOv8 nano model weights downloaded from Ultralytics. This serves as the foundation for transfer learning - instead of training from scratch, the model starts with these general object detection capabilities and fine-tunes them for your specific desk objects. This approach significantly reduces training time and improves performance on small datasets.

Dependencies

ultralytics
opencv-python
pandas
pyyaml
pathlib

Technical Review

File Structure

YOLOv8Model/
├── README.md
├── dataset/
│   ├── images/
│   │   ├── train/
│   │   ├── train_upright/     # Rotated training images
│   │   └── val/
│   └── labels/
│       ├── train/
│       ├── val/
│       └── all/               # Original JSON annotations
├── dataset.yaml               # YOLO dataset configuration
├── assets/
│   ├── classes.txt            # Class definitions
│   ├── trainingVideos/        # 5 training videos
│   │   ├── birdsView.mp4
│   │   ├── frontView.mp4
│   │   ├── horizontalView.mp4
│   │   ├── leftView.mp4
│   │   └── rightView.mp4
│   └── testVideo/
│       └── deskVideo.mp4      # Test video for evaluation
├── src/                       # Python scripts
│   ├── extract_frames.py      # Frame extraction from videos
│   ├── main.py                # Complete training pipeline
│   ├── rotateImages.py        # Image rotation utility
│   └── run_best_pt.py         # Run inference with best model
├── runs/                      # Training results & metrics
│   └── train/
│       └── desk_objects/      # Training run results
│           ├── weights/       # Model checkpoints
│           │   ├── best.pt    # Best performing model
│           │   ├── last.pt    # Final epoch model
│           │   └── epoch*.pt  # Periodic checkpoints
│           ├── confusion_matrix.png
│           ├── confusion_matrix_normalized.png
│           ├── F1_curve.png
│           ├── PR_curve.png
│           ├── P_curve.png
│           ├── R_curve.png
│           ├── results.csv    # Training metrics per epoch
│           ├── results.png    # Training curves
│           └── val_batch*     # Validation visualizations
├── results/                   # Detection outputs
│   ├── detection/
│   │   └── video_results.mp4  # Annotated output video
│   ├── detection_results.csv  # Detection data
│   └── debug_*.jpg            # Debug frames
└── yolov8n.pt                # Pre-trained YOLOv8 nano weights

Key Scripts

1. Frame Extraction (`extract_frames.py`)

def extract_frames(video_path, output_dir, prefix, frame_interval=15):
    """Extract every Nth frame from video for dataset creation"""

Processes all 5 training videos
Extracts frames every 15 frames (2-second intervals)
Generates systematic naming convention

2. Image Rotation (`rotate_images.py`)

# Rotate images 90° clockwise to correct orientation
img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)

Challenge Solved: Videos were recorded in portrait mode but appeared rotated during training. Rather than modifying metadata, implemented runtime rotation.

3. Complete Pipeline (`main_pipeline.py`)

JSON to YOLO format conversion
Dataset organization and splitting
Model training with optimized parameters
Video detection with rotation handling

Training Configuration & Results

Model Settings

Base Model: YOLOv8n (nano) for speed and efficiency
Epochs: 84 (with early stopping)
Batch Size: 16
Image Size: 640x640
Patience: 20 epochs for early stopping
Device: CPU (adaptable to GPU)
Training Time: 14,751.6 seconds (~4.1 hours)

Training Results

Final Performance (Epoch 83/84)

Metric	Value
Box Loss (Training)	0.8000
Classification Loss (Training)	0.3908
DFL Loss (Training)	0.9377
DFL Loss (Validation)	0.9377
Learning Rate	0.000137
Fitness Score	0.7944

The model achieved good convergence with consistent training and validation loss values, indicating proper learning without overfitting. The fitness score of 0.7944 demonstrates solid overall performance across all metrics.

Detection Performance Analysis

Initial Testing (Confidence = 0.3)

✅ Successfully detected: 8/9 objects
❌ Issue: Mechanical pencil not detected
Hypothesis: Training data showed pencil vertically, test video showed it horizontally

Optimized Testing (Confidence = 0.1)

✅ Success: All 9 objects detected including mechanical pencil
✅ Mechanical pencil confidence: ~40%+
⚠️ Trade-off: Slight confidence decrease for keys and guitar pick
Decision: Acceptable trade-off for complete object detection

Model Artifacts & Visualizations

The training process generates comprehensive evaluation materials:

Training Metrics & Curves

runs/train/desk_objects/results.csv: Complete training metrics per epoch
runs/train/desk_objects/results.png: Training/validation curves visualization
Performance Curves:
- F1_curve.png: F1-score across confidence thresholds
- PR_curve.png: Precision-Recall curve
- P_curve.png: Precision curve
- R_curve.png: Recall curve

Model Analysis

confusion_matrix.png: Raw confusion matrix showing classification accuracy
confusion_matrix_normalized.png: Normalized confusion matrix for balanced view
Validation Batches: val_batch*_pred.jpg and val_batch*_labels.jpg for visual validation

Technical Challenges & Solutions

best.pt: Best performing model (use for inference)
last.pt: Final epoch model
epoch*.pt: Checkpoint models saved every 10 epochs

Note: All visualization files and model weights are suitable for GitHub sharing and provide comprehensive insights into model performance.

1. Image Orientation Issues

Problem: Videos and images rotated during training/testing Solution:

def rotate_video(input_path, output_path):
    rotated_frame = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)

Implemented runtime rotation instead of metadata modification
Applied to both training images and test videos

2. Annotation Format Conversion

Problem: LabelMe JSON format → YOLO format conversion Solution: Custom conversion function handling both rectangles and polygons

def convert_json_to_yolo(json_path, img_width, img_height, class_mapping):
    # Converts LabelMe annotations to YOLO format
    # Handles normalization and coordinate transformation

3. Confidence Threshold Optimization

Problem: Standard confidence threshold missed objects Solution: Systematic threshold testing (0.3 → 0.1)

Monitored precision/recall trade-offs
Validated against overfitting indicators

Model Performance Monitoring

Overfitting Prevention

Early Stopping: 20 consecutive epochs without improvement
Validation Monitoring: Continuous tracking of validation metrics (DFL loss: 0.9377)
Checkpoint Saving: Every 10 epochs for model recovery

Training Stability Analysis

Box Loss Convergence: Training box loss stabilized at 0.8000, indicating good localization learning
Classification Performance: Low classification loss (0.3908) shows strong object recognition
Loss Consistency: Training and validation DFL losses matched (0.9377), confirming no overfitting
Learning Rate Schedule: Optimized at 0.000137 for stable convergence

Future Improvements

Dataset Enhancement

Add more diverse angles for mechanical pencil (horizontal orientations)
Increase dataset size with additional lighting conditions
Include occlusion scenarios for robust detection
Add background variations to improve generalization

Model Optimization

Experiment with YOLOv8s/m for potentially better accuracy
Data augmentation techniques (rotation, brightness, contrast)
Transfer learning from domain-specific models
Ensemble methods combining multiple model outputs

Deployment Considerations

Real-time optimization for live video streams
Mobile deployment with model quantization
Edge device compatibility testing
API integration for production systems

Key Learnings

Domain-specific training significantly outperforms general models for specialized objects
Data quality (orientation, angles) is crucial for model performance
Confidence threshold tuning can dramatically impact detection results
Systematic approach to dataset creation and validation prevents common pitfalls
Early stopping and monitoring prevent overfitting in small datasets
Training efficiency achieved good results in 84 epochs with 4.1 hours of training time

Conclusion

This project demonstrates the effectiveness of custom YOLOv8 training for domain-specific object detection, perfectly suited for the take-home assignment requirements. By creating a targeted dataset and systematically addressing technical challenges, the model achieved solid performance with a fitness score of 0.7944 and successfully detected all target objects in real-world scenarios after confidence threshold optimization.

Assignment Deliverables Met

This implementation provides all required outputs for the take-home task:

✅ Object Detection in Video: YOLOv8 processes 1-minute room/desk videos
✅ Annotated Video Output: results/detection/video_results.mp4 with bounding boxes and labels
✅ CSV Detection Results: Complete detection data with frame numbers, object classes, confidence scores, and bounding box coordinates
✅ Complete Code: Full pipeline with main_pipeline.py providing interactive workflow
✅ Setup Instructions: Comprehensive README with step-by-step usage guide

Key Technical Achievements

Domain Specialization: Custom training significantly outperformed general-purpose models on desk objects
Robust Pipeline: Interactive main function handles complete workflow from data setup to detection output
Efficient Training: Achieved good results in 84 epochs with 4.1 hours of training time
Production Ready: Consistent training/validation losses indicate a well-balanced model suitable for deployment

The approach validates the principle that specialized models often outperform general-purpose solutions when dealing with specific object domains, even with relatively small datasets (~351 images). The systematic implementation demonstrates practical machine learning engineering skills applicable to real-world computer vision challenges.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
dataset		dataset
results		results
runs/train/desk_objects		runs/train/desk_objects
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
dataset.yaml		dataset.yaml

RamonsArchive/YOLOv8Model

Folders and files

Latest commit

History

Repository files navigation