Visual Voice Activity Detection with state-of-the-art video classifiers

Visual Voice Activity Detection (V-VAD) is the task of determining whether a person is speaking, based on images or video footage. It is central to Human-Computer Interaction, as it is one of the first signals indicating that a human is interacting with a computer.

In recent years, two new large datasets have been proposed. They enable image-based deep neural networks to be applied to the V-VAD task, and these image-based approaches achieve higher accuracy than previous non-image-based ones.

In this thesis, three video-based learning algorithms will be applied to the V-VAD task: two 3D-CNN-based approaches and one transformer. Firstly, (2+1)D CNNs, which split each 3D convolution into a spatial and a temporal convolution. Secondly, MoViNets, which use neural architecture search to increase accuracy and a stream buffer to reduce memory usage. Thirdly, a Video Vision Transformer (ViViT), which applies the attention mechanism to video. All three algorithms will be evaluated with respect to both classification performance and resource usage.
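To illustrate the first approach, the core idea of a (2+1)D convolution can be sketched as follows: a full 3D convolution over (time, height, width) is factorized into a 2D spatial convolution followed by a 1D temporal convolution. This is a minimal PyTorch sketch; the layer sizes and the intermediate channel count are illustrative assumptions, not values from the thesis.

```python
import torch
import torch.nn as nn


class R2Plus1dBlock(nn.Module):
    """Minimal (2+1)D convolution block: spatial 2D conv, then temporal 1D conv.

    Channel counts and kernel sizes here are illustrative, not taken
    from any specific V-VAD model.
    """

    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 16):
        super().__init__()
        # Spatial convolution: kernel (1, 3, 3) acts only on H and W.
        self.spatial = nn.Conv3d(in_ch, mid_ch,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution: kernel (3, 1, 1) acts only on the frame axis.
        self.temporal = nn.Conv3d(mid_ch, out_ch,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Extra non-linearity between the two convolutions is one of the
        # benefits of the factorization over a single 3D convolution.
        return self.relu(self.temporal(self.relu(self.spatial(x))))


# A dummy clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 8, 32, 32)
out = R2Plus1dBlock(3, 32)(clip)
print(out.shape)  # torch.Size([1, 32, 8, 32, 32])
```

Because padding matches the kernel sizes, the block preserves the temporal and spatial resolution, so such blocks can be stacked like ordinary 3D convolutions while reducing parameters and adding a non-linearity per factorized pair.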

As a rule, the talks are part of lecture series at the University of Bremen and are not publicly accessible. If you are interested, please contact the secretariat at sek-ric(at)dfki.de.

last modified on 31.03.2023