About This Project
Accurate 3D object detection is essential in applications such as augmented reality, virtual reality, robotics, and human-computer interaction. Traditional object detection methods depend heavily on large-scale datasets with precise bounding box annotations, a process that is both costly and labor-intensive. This project explores a weakly supervised learning strategy that eliminates the need for manual bounding box annotations by using only image-level labels to train a high-performance object detector.
The pipeline combines EfficientNet-B0 for multi-label classification, Grad-CAM for spatial localization, and YOLOv8 for object detection. The detector is trained entirely on pseudo-labels generated without any bounding box supervision.
This project uses the PASCAL VOC 2012 Dataset, a standard benchmark for object detection and segmentation containing 20 object categories with image-level labels, bounding boxes, and segmentation masks. A total of 11,540 images were annotated with valid multi-label vectors for this work.
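As an illustration of what a multi-label vector looks like, the sketch below encodes the classes present in one image as a 20-element multi-hot list. The class ordering and the `encode_labels` helper are assumptions made for this example and are not taken from the project code.

```python
# The 20 PASCAL VOC object categories, listed alphabetically.
# The ordering is an assumption; any fixed ordering works if used consistently.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def encode_labels(present_classes):
    """Return a 20-element multi-hot vector (1.0 where a class is present)."""
    present = set(present_classes)
    unknown = present - set(VOC_CLASSES)
    if unknown:
        raise ValueError(f"Unknown VOC classes: {sorted(unknown)}")
    return [1.0 if name in present else 0.0 for name in VOC_CLASSES]


# Hypothetical image that contains a person and a dog.
vector = encode_labels(["person", "dog"])
print(vector)
```

A vector of this shape is the usual training target for a multi-label classifier with one sigmoid output per class.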
Key Features
- Classifies 20 object categories from images using only image-level labels (no bounding boxes)
- Generates Grad-CAM heatmaps to localize discriminative object regions automatically
- Converts heatmaps into pseudo bounding box annotations in YOLO format
- Trains YOLOv8 on these pseudo-labels to perform accurate object detection
- Reaches an mAP@0.5 of 0.727 on the 20 VOC categories without any manual spatial annotations
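The Grad-CAM step in the list above can be sketched in plain Python. This is a minimal illustration of the standard Grad-CAM weighting (channel weights from spatially averaged gradients, then a ReLU over the weighted sum), not the project's implementation. In practice the activations and gradients would be captured from a convolutional layer of the classifier; here they are small hand-written lists, and the `grad_cam` name is hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap scaled to [0, 1].

    activations, gradients: lists of K feature maps, each an H x W nested list.
    gradients[k] holds the gradient of the class score w.r.t. activations[k].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradient map.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum over channels, then ReLU to keep only positive evidence.
    cam = [
        [
            max(0.0, sum(w * act[y][x] for w, act in zip(weights, activations)))
            for x in range(width)
        ]
        for y in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam  # No positive evidence anywhere; return the all-zero map.
    return [[value / peak for value in row] for row in cam]


# Toy input: channel 0 supports the class, channel 1 counts against it.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the top-left cell stays hot in this toy case, because the second channel has a negative weight and its contribution is removed by the ReLU.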
Pipeline Overview
| Stage | Method | Purpose |
|---|---|---|
| 1. Classification | EfficientNet-B0 | Multi-label image classification (20 VOC classes) |
| 2. Localization | Grad-CAM | Generate class-specific heatmaps |
| 3. Pseudo-labeling | Contour detection + thresholding | Convert heatmaps to YOLO bounding boxes |
| 4. Detection | YOLOv8 (small variant) | Train object detector on pseudo-labels |
| 5. Evaluation | Precision, Recall, mAP@0.5, mAP@0.5:0.95 | Assess detection performance |
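Stage 3 of the table can be sketched as follows. This is a simplified stand-in, not the project's code: it uses a 4-connected flood fill instead of contour detection to find hot regions, and the `0.5` threshold, the `heatmap_to_yolo` name, and the class index are all assumptions chosen for illustration. Each region becomes one YOLO-format row of `class x_center y_center width height`, with coordinates normalized to the image size.

```python
from collections import deque


def heatmap_to_yolo(heatmap, class_id, threshold=0.5):
    """Return one (class_id, x_center, y_center, width, height) tuple per
    connected region of heatmap cells >= threshold, normalized to [0, 1]."""
    height, width = len(heatmap), len(heatmap[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or heatmap[y0][x0] < threshold:
                continue
            # Breadth-first flood fill over 4-connected hot cells.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Cell extents are inclusive, so add 1 before normalizing.
            box_w = (x_max - x_min + 1) / width
            box_h = (y_max - y_min + 1) / height
            x_center = (x_min + x_max + 1) / 2 / width
            y_center = (y_min + y_max + 1) / 2 / height
            boxes.append((class_id, x_center, y_center, box_w, box_h))
    return boxes


# Toy 4 x 4 heatmap with one hot 2 x 2 region in the middle.
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
rows = heatmap_to_yolo(heatmap, class_id=11)  # 11 is a hypothetical class index
for class_id, *coords in rows:
    print(class_id, " ".join(f"{value:.6f}" for value in coords))
```

The threshold controls how tight the boxes are: a higher value keeps only the most discriminative part of the object, while a lower value produces larger and noisier boxes.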
Final Results
| Metric | Value | Description |
|---|---|---|
| Precision (all classes) | 0.99 at confidence 1.0 | Peak precision over all 20 VOC categories, reached at the highest confidence threshold |
| Recall (all classes) | 0.89 at confidence 0.0 | Peak recall over all 20 VOC categories, reached with no confidence filtering |
| F1 Score | 0.71 at confidence 0.439 | Optimal precision-recall balance |
| mAP@0.5 | 0.727 | Mean Average Precision at IoU ≥ 0.5 |
| mAP@0.5:0.95 | ~0.45–0.50 | Mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 |
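For readers unfamiliar with the metrics in the table, the sketch below shows the two building blocks behind them: intersection over union (IoU), which decides whether a predicted box counts as a match at a given threshold, and the F1 score, the harmonic mean of precision and recall. The function names and the example boxes are illustrative assumptions, and the numbers are not project results.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# A prediction covering half of the ground-truth box sits exactly on the
# IoU >= 0.5 boundary used by mAP@0.5.
ground_truth = (0, 0, 10, 10)
prediction = (0, 0, 10, 5)
overlap = iou(ground_truth, prediction)
print(overlap, overlap >= 0.5)

# Illustrative precision and recall values measured at one confidence threshold.
print(round(f1_score(0.8, 0.6), 3))
```

Precision and recall must be taken at the same confidence threshold before they are combined into F1, which is why the table reports each metric with its own confidence value.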
Publication
| Field | Details |
|---|---|
| Authors | E. M. P. J. De Saram, R. G. N. Meegama |
| Conference | International Conference on Advanced Research in Computing (ICARC) 2026 |
