Visual Voice Activity Detection (V-VAD) is the task of determining whether a person is speaking based on images or video footage. It plays a central role in Human-Computer Interaction, as it is one of the first signals indicating that a human is interacting with a computer.
In recent years, two large datasets have been proposed. They enable the use of image-based Deep Neural Networks for the V-VAD task. These image-based solutions reach higher accuracy than previous approaches that do not rely on images.
In this thesis, three video-based learning algorithms are applied to the V-VAD task, namely two 3D-CNN-based approaches and one transformer. Firstly, the (2+1)D CNN, which factorizes each 3D convolution into a spatial and a temporal convolution (a sketch of this factorization is given below). Secondly, MoViNets, which use a Neural Architecture Search to increase accuracy and a stream buffer to reduce memory usage. Thirdly, the Video Vision Transformer, which applies the attention mechanism to videos. All three video-based learning algorithms are evaluated with respect to their performance and resource usage.
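To illustrate the (2+1)D factorization mentioned above, the following PyTorch sketch shows how a full 3D convolution can be replaced by a spatial convolution over each frame followed by a 1D convolution over the temporal axis. The layer names, channel counts, and kernel sizes are illustrative assumptions, not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorized (2+1)D convolution: a spatial convolution within each
    frame followed by a temporal convolution across frames."""

    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        # Intermediate channel count; often chosen so the parameter count
        # roughly matches a full 3D convolution (assumption here).
        mid_channels = mid_channels or out_channels
        # Spatial convolution: kernel (1, 3, 3) acts only within a frame.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.bn1 = nn.BatchNorm3d(mid_channels)
        # Temporal convolution: kernel (3, 1, 1) acts only across frames.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x has shape (batch, channels, time, height, width)
        x = self.relu(self.bn1(self.spatial(x)))
        x = self.relu(self.bn2(self.temporal(x)))
        return x

# Example: a clip of 16 RGB frames at 112x112 resolution (hypothetical sizes).
clip = torch.randn(1, 3, 16, 112, 112)
block = Conv2Plus1D(in_channels=3, out_channels=64)
print(block(clip).shape)  # torch.Size([1, 64, 16, 112, 112])
```

Splitting the convolution in this way keeps the receptive field of a 3D kernel while inserting an extra non-linearity between the spatial and temporal steps.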