Reliable recognition of people's gestures and facial expressions is crucial for natural and intuitive interaction between robots (machines) and humans. Voice Activity Detection (VAD) may play a major role in this context. VAD algorithms detect whether the person of interest is currently speaking. When this is done using acoustic signals alone, i.e., by analyzing the received audio data, it is called Audio VAD (A-VAD).
In very noisy environments, or whenever acoustic signals cannot be analyzed or are not available at all, Visual VAD (V-VAD) comes into play, using visual information from a camera for classification.
However, to perform well, such algorithms rely on high-quality datasets. With VVAD-LRS3, the team around Adrian Lubitz created and described the largest and presumably most robust and accurate V-VAD dataset so far. Yet it has not been shown that models trained on the VVAD-LRS3 dataset perform significantly better than those trained on other state-of-the-art datasets such as WildVVAD. The DNN trained on VVAD-LRS3 achieves 92% accuracy on its test set. Although the algorithms trained on WildVVAD show a similar accuracy on their own test set (91.01%), a cross-comparison, evaluating each model on the other dataset's test set, can reveal a more accurate picture of the algorithms' performance. Furthermore, the VVAD-LRS3 dataset can be made even more robust and less prone to false positive and false negative classifications (wrong predictions) by cleaning and extending the data and by refactoring the code to improve maintainability.
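The cross-comparison described above can be sketched as a train-dataset by test-dataset accuracy matrix. The code below is a minimal illustration, not the actual VVAD-LRS3 or WildVVAD tooling: the "model" is a trivial majority-class stand-in, and the dataset names and label lists are hypothetical placeholders where real experiments would train and evaluate the DNNs.

```python
# Hedged sketch of a cross-dataset evaluation: train on each dataset,
# evaluate on every test set, and collect the resulting accuracies.
from collections import Counter

def train_majority(labels):
    """Trivial stand-in model: predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(prediction, test_labels):
    """Fraction of test labels matching the constant prediction."""
    return sum(1 for y in test_labels if y == prediction) / len(test_labels)

def cross_evaluate(datasets):
    """datasets maps name -> (train_labels, test_labels).
    Returns {(train_name, test_name): accuracy} for all pairs,
    so off-diagonal entries show cross-dataset generalization."""
    results = {}
    for train_name, (train_labels, _) in datasets.items():
        model = train_majority(train_labels)
        for test_name, (_, test_labels) in datasets.items():
            results[(train_name, test_name)] = accuracy(model, test_labels)
    return results

# Toy binary speaking (1) / not-speaking (0) labels, purely illustrative:
toy = {
    "VVAD-LRS3": ([1, 1, 0, 1], [1, 0, 1, 1]),
    "WildVVAD":  ([0, 0, 1, 0], [0, 1, 0, 0]),
}
matrix = cross_evaluate(toy)
```

The diagonal entries of `matrix` correspond to the within-dataset accuracies reported above, while the off-diagonal entries are what the proposed cross-comparison would add.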
In my presentation, I want to give an overview of the current state of the project, including its advantages and disadvantages, the steps I plan to take to showcase the capabilities of VVAD-LRS3, and how to handle the cross-validation setup.