The primary goal of this thesis is to enhance human-robot collaboration by addressing the challenge of real-time dynamic gesture recognition. The work is part of the ROMATRIS project, funded by the THW and carried out at the DFKI, and focuses on the identification and real-time interpretation of specific predefined dynamic gestures. These gestures serve as commands issued by humans to fully autonomous robots. The main application area for these robots is disaster relief, where they assist rescue operations by autonomously transporting materials, thereby mitigating the risk to emergency workers. High-pressure scenarios such as disaster relief require the robot to recognize and respond to dynamic gestures swiftly and accurately; the robot's design therefore integrates the recognition system into the robot architecture on dedicated hardware capable of performing such complex tasks.
In addition to the core computer vision task of dynamic gesture recognition, the thesis incorporates multimodal fusion, data augmentation and ensemble learning techniques. Several training pipelines were created using state-of-the-art methods: neural-network-based pipelines combining CNNs, LSTMs and MediaPipe landmark extraction, as well as ensemble-based models such as Random Forest, XGBoost and stacking. Each training pipeline is elaborated, and their performance on the dynamic gesture recognition task is compared. A specific focus is given to the CNN+LSTM, MediaPipe+LSTM and XGBoost models, which serve as the building blocks of the hybrid heterogeneous ensemble architectures.
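To make the MediaPipe+LSTM pipeline concrete, the following is a minimal sketch of such a setup. It is illustrative only: the number of gesture classes, the sequence length and the layer sizes are assumptions, not the configuration used in the thesis. It extracts 21 hand landmarks per frame with MediaPipe Hands and feeds the resulting coordinate sequences to a small Keras LSTM classifier.

```python
# Illustrative MediaPipe+LSTM sketch; all hyperparameters are assumed, not taken from the thesis.
import numpy as np
import cv2
import mediapipe as mp
import tensorflow as tf

NUM_CLASSES = 6        # assumed number of predefined gestures
SEQ_LEN = 30           # assumed number of frames per gesture clip
NUM_FEATURES = 21 * 3  # 21 hand landmarks, (x, y, z) coordinates each

def extract_landmark_sequence(video_path: str) -> np.ndarray:
    """Run MediaPipe Hands over a clip and return a (SEQ_LEN, NUM_FEATURES) array."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
    frames = []
    cap = cv2.VideoCapture(video_path)
    while len(frames) < SEQ_LEN:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            landmarks = result.multi_hand_landmarks[0].landmark
            frames.append([v for p in landmarks for v in (p.x, p.y, p.z)])
        else:
            frames.append([0.0] * NUM_FEATURES)  # no hand detected in this frame
    cap.release()
    hands.close()
    while len(frames) < SEQ_LEN:               # zero-pad short clips to a fixed length
        frames.append([0.0] * NUM_FEATURES)
    return np.asarray(frames, dtype=np.float32)

# LSTM classifier over landmark sequences; Masking skips the zero-padded frames.
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(SEQ_LEN, NUM_FEATURES)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```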
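In the same spirit, a stacking ensemble can be sketched with scikit-learn and XGBoost. Again, this is an assumed illustration rather than the thesis setup: the choice of base learners, the logistic-regression meta-learner and all parameters are placeholders, and the input is assumed to be flattened landmark sequences.

```python
# Illustrative stacking sketch; estimators and parameters are assumptions, not the thesis setup.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Base learners produce out-of-fold predictions that a meta-learner then combines.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="mlogloss")),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions via 5-fold cross-validation
)
# X: flattened landmark sequences of shape (n_samples, SEQ_LEN * NUM_FEATURES); y: gesture labels
# stack.fit(X_train, y_train); stack.score(X_test, y_test)
```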