Cross-lingual Voice Activity Detection for Human-Robot Interaction
Nils Höfling, Su-Kyoung Kim, Elsa Andrea Kirchner
In The Third Neuroadaptive Technology Conference, (NAT-2022), 09.10.-12.10.2022, Lübbenau, n.n., pages 100-103, Nov/2022.
The recognition of language is a two-step process: speech must first be recognized as such (1, 2), and then its semantics must be understood. For human-robot interaction, voice activity detection (VAD) is of
great importance (3). Once it is known that a human is talking, speech recognition can be triggered
and additional modules in the robot can produce responses to the human or trigger other robotic
behaviors. For online interaction with precise timing, especially when using multimodal data (4), it
might also be necessary to integrate VAD into a microcontroller or similar embedded system in the
robot. Advanced methods exist to enable online and embedded VAD (3). However, some of these
methods are trained on biased data, i.e., data in one language, usually English, which can cause
problems when used in applications where the interacting human speaks a different language. This is
well investigated for speech recognition (5) but poorly for VAD. Language-related issues need to be
considered in some applications, such as supporting patients in non-English speaking environments,
and may be as important as approaches that handle strong background noise (6, 7).
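To illustrate the pipeline described above, the sketch below shows a minimal energy-based VAD that flags speech frames, after which a speech-recognition module could be triggered. This is an illustrative assumption, not the method of the paper: the frame length and energy threshold are arbitrary, and the advanced embedded VAD methods cited (3) are far more robust.

```python
# Minimal energy-based VAD sketch (illustrative only, not the
# paper's method). A frame is flagged as voice activity when its
# short-time energy exceeds a fixed threshold; frame length and
# threshold are assumptions chosen for this example.

def frame_energy(samples):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in samples) / len(samples)

def vad(signal, frame_len=160, threshold=0.01):
    """Return one boolean per full frame: True = voice activity.

    In a robot, a True flag would trigger the speech-recognition
    module and any downstream response behaviors.
    """
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        flags.append(frame_energy(frame) > threshold)
    return flags

# Synthetic example: one near-silent frame, then one loud frame.
silence = [0.001] * 160
speech = [0.5 if i % 2 == 0 else -0.5 for i in range(160)]
print(vad(silence + speech))  # → [False, True]
```

Such a threshold detector is what the language bias discussed above avoids by construction (it is language-agnostic), but it fails under strong background noise (6, 7), which is why learned, possibly language-biased VAD models are used in practice.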
Keywords: voice activity detection, speech recognition, embedded, robot