Visual Voice Activity Detection with state-of-the-art video classifiers

Voice Activity Detection (VAD) is a key component of human-robot interaction, as it can be used to initiate interactions.
Both audio-based and video-based VAD systems exist, but each has its shortcomings.
The recently presented VVAD-LRS3 dataset is large enough to enable the training of larger deep neural networks.
In this thesis, four models are trained on that dataset.
The models improve performance considerably. In particular, the R(2+1)D Light model improves accuracy by 7% over the previous state-of-the-art model, VVAD LRS3 LSTM, while having only a fraction of its parameters.
Furthermore, we investigated the effects of varying temporal length and temporal density; the results showed that temporal density is more important than temporal length.
By modifying the temporal dimension, we were able to reduce resource requirements while keeping performance equal, or at least similar, to that of the original models.
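
To make the distinction between temporal length and temporal density concrete, the following is a minimal sketch, not the method used in the thesis. The abstract does not state how the clips were modified, so the function name `subsample_clip`, its parameters, and the 32-frame clip are all hypothetical. The sketch assumes that length is reduced by truncating a clip and density is reduced by keeping every n-th frame.

```python
def subsample_clip(frames, length=None, stride=1):
    """Reduce the temporal length and/or density of a clip (hypothetical helper).

    frames: sequence of frames in temporal order.
    length: keep only the first `length` frames (None keeps all frames).
    stride: keep every `stride`-th frame; a larger stride means lower density.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if length is not None and length < 0:
        raise ValueError("length must be non-negative")
    clip = frames if length is None else frames[:length]
    return list(clip[::stride])


# Hypothetical clip: frame indices stand in for the actual images.
clip = list(range(32))

# Shorter clip at full density: the first 16 consecutive frames.
shorter = subsample_clip(clip, length=16)

# Full-length clip at half density: every second frame, also 16 frames.
sparser = subsample_clip(clip, stride=2)
```

Both variants hand the same number of frames to a classifier, but the first covers half the time span at full density, while the second covers the whole time span at half density.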

As a rule, the talks are part of lecture series at the University of Bremen and are not open to the public. If you are interested, please contact the secretariat at sek-ric(at)dfki.de.

Last modified on 31.03.2023