Associated with University of Sri Jayewardenepura | Published

Enhanced Weakly Supervised Learning for 3D Hand Pose Estimation

MAY 2024 - MARCH 2025

A weakly supervised learning framework that uses EfficientNet-B0 with a regression head to predict 3D hand joints from RGB images while relying on limited annotated data.

About This Project

Accurate 3D hand pose estimation plays a vital role in applications such as augmented reality, virtual reality, robotics, and human-computer interaction. This project explores weakly supervised learning as an alternative to traditional fully supervised approaches, which require large amounts of expensive and time-consuming 3D annotations.

The result is a hybrid deep learning pipeline that estimates 3D hand joint positions from standard RGB images using minimal labeled data. It combines three components: EfficientNet-B0 for spatial feature extraction, pseudo-labeling for weak supervision, and an LSTM network for temporal motion refinement across video frames.
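The project page does not state how pseudo-labels are chosen, so the following is only a minimal sketch of one common option: keep an unlabeled sample when two predictions for it agree closely, and use their average as the pseudo-label. The names `select_pseudo_labels`, `mean_joint_distance`, and the `max_disagreement` threshold are hypothetical and are not taken from the project.

```python
import math
from typing import List, Sequence, Tuple

Pose = List[Tuple[float, float, float]]  # 21 (x, y, z) joints


def mean_joint_distance(a: Pose, b: Pose) -> float:
    """Average Euclidean distance between corresponding joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def select_pseudo_labels(
    preds_a: Sequence[Pose],
    preds_b: Sequence[Pose],
    max_disagreement: float,
) -> List[Tuple[int, Pose]]:
    """Assumed selection rule: accept a sample when its two predictions
    (e.g. from two augmented views) differ by at most max_disagreement.
    The pseudo-label is the per-joint average of the two predictions."""
    selected = []
    for idx, (a, b) in enumerate(zip(preds_a, preds_b)):
        if mean_joint_distance(a, b) <= max_disagreement:
            avg = [tuple((x + y) / 2 for x, y in zip(p, q)) for p, q in zip(a, b)]
            selected.append((idx, avg))
    return selected


# Toy data: sample 0 has consistent predictions, sample 1 does not.
pose = [(0.0, 0.0, 0.5)] * 21
close = [(0.0, 0.002, 0.5)] * 21
far = [(0.1, 0.0, 0.5)] * 21
picked = select_pseudo_labels([pose, pose], [close, far], max_disagreement=0.01)
```

Only the accepted samples would then be added to the training set alongside the annotated data; the threshold trades pseudo-label quantity against noise.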

The project uses the FreiHAND dataset, a benchmark for 3D hand pose estimation from single color images. Its evaluation set contains 3,960 samples, each with an RGB image, hand scale, and camera intrinsics.
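Camera intrinsics relate 3D joint positions in camera space to pixel coordinates through the standard pinhole model. The sketch below illustrates that projection; the function name `project_points` and the matrix values are illustrative assumptions, not values read from the dataset.

```python
from typing import List, Sequence, Tuple


def project_points(
    xyz: Sequence[Tuple[float, float, float]],
    K: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Project camera-space 3D points to pixels with a 3x3 intrinsic matrix K."""
    uv = []
    for x, y, z in xyz:
        if z <= 0:
            raise ValueError("points must lie in front of the camera (z > 0)")
        # Homogeneous projection: [u, v, 1]^T ~ K @ [x, y, z]^T, then divide by z.
        u = (K[0][0] * x + K[0][1] * y + K[0][2] * z) / z
        v = (K[1][0] * x + K[1][1] * y + K[1][2] * z) / z
        uv.append((u, v))
    return uv


# Illustrative intrinsics (assumed): focal length 500 px, principal point (112, 112).
K = [
    [500.0, 0.0, 112.0],
    [0.0, 500.0, 112.0],
    [0.0, 0.0, 1.0],
]
joints = [(0.0, 0.0, 0.5), (0.05, -0.02, 0.5)]
uv = project_points(joints, K)
```

A point on the optical axis lands on the principal point, which is a quick sanity check when loading intrinsics from any dataset.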

Key Features

  • Predicts the 3D coordinates of 21 hand keypoints from a single RGB image.
  • Combines an EfficientNet-B0 backbone with a regression head for joint prediction.
  • Adds LSTM-based temporal modeling to capture sequential hand motion dynamics.
  • Improves performance when annotations are limited through efficient feature learning strategies.
  • Provides a scalable pipeline intended for real-world applications with minimal supervision.
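A temporal model such as the LSTM mentioned above consumes sequences rather than single frames, so per-frame predictions have to be grouped into fixed-length windows first. The sketch below shows one way to do that, assuming each frame's 21 keypoints are flattened into 63 values. The helper names, window length, and stride are hypothetical choices, not settings reported by the project.

```python
from typing import List, Sequence, Tuple


def flatten_pose(pose: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Flatten 21 (x, y, z) joints into a single 63-value feature vector."""
    return [coord for joint in pose for coord in joint]


def make_windows(
    frames: Sequence[List[float]], window: int, stride: int = 1
) -> List[List[List[float]]]:
    """Slice a frame sequence into overlapping windows of `window` frames."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]


# Toy clip: 10 frames in which every joint moves along x over time.
frames = [flatten_pose([(float(t), 0.0, 0.0)] * 21) for t in range(10)]
windows = make_windows(frames, window=4, stride=2)
```

Each window has shape (window, 63), which matches the (sequence length, feature size) layout that recurrent layers typically expect per sample.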

Models Used

Model                     Role
ResNet18                  Baseline regression model
EfficientNet-B0           Enhanced backbone with pseudo-labeling
DeepLabV3 (ResNet-101)    Segmentation mask generation
LSTM (2-layer)            Temporal motion modeling across video frames

Final Results

Metric             Value             Description
MPJPE              0.1126            Mean joint position error
PCK@0.05           14.39%            Keypoints within 5% of hand size from ground truth
Inference Speed    27.19 ms/frame    Real-time capable
LSTM Loss          0.0006879         Final temporal model evaluation loss
LSTM MAE           0.0189            Mean absolute error on keypoint predictions
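For readers unfamiliar with the two pose metrics, the sketch below shows how MPJPE and PCK are commonly computed. It is a generic illustration, not the project's evaluation code: how hand size is measured is not stated on this page, so `hand_sizes` is passed in as an assumed per-sample value, and the toy numbers are invented for the example.

```python
import math
from typing import List, Sequence, Tuple

Pose = List[Tuple[float, float, float]]


def mpjpe(pred: Sequence[Pose], gt: Sequence[Pose]) -> float:
    """Mean Euclidean distance between predicted and true joints."""
    dists = [
        math.dist(p, g)
        for p_pose, g_pose in zip(pred, gt)
        for p, g in zip(p_pose, g_pose)
    ]
    return sum(dists) / len(dists)


def pck(
    pred: Sequence[Pose],
    gt: Sequence[Pose],
    hand_sizes: Sequence[float],
    alpha: float = 0.05,
) -> float:
    """Fraction of joints whose error is within alpha * hand size."""
    hits = 0
    total = 0
    for p_pose, g_pose, size in zip(pred, gt, hand_sizes):
        threshold = alpha * size
        for p, g in zip(p_pose, g_pose):
            hits += math.dist(p, g) <= threshold
            total += 1
    return hits / total


# Toy sample: 20 joints are off by 0.001 and one joint is off by 0.1.
gt = [[(0.0, 0.0, 0.0)] * 21]
pred = [[(0.001, 0.0, 0.0)] * 20 + [(0.1, 0.0, 0.0)]]
error = mpjpe(pred, gt)
score = pck(pred, gt, hand_sizes=[0.1], alpha=0.05)
```

`pck` returns a fraction, so it would be multiplied by 100 to match the percentage form used in the table. The toy sample also shows how a single badly placed joint raises MPJPE noticeably while lowering PCK by only one joint's share.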

Publication

Authors: E. M. P. J. De Saram, R. G. N. Meegama
Conference: 6th International Conference on Advanced Research in Computing (ICARC) 2026

Technologies Used

Python, Deep Learning, EfficientNet-B0, LSTM, OpenCV, NumPy, Pandas