Vision-Inspired Keyword Spotting Framework for Streaming Audio
Main Ideas:
Researchers propose an architecture with input-dependent dynamic depth for processing streaming audio.
The architecture extends a Conformer encoder with trainable binary gates that can skip network modules based on input audio.
The approach improves detection and localization accuracy on continuous speech, evaluated on the 1,000 most frequent words in LibriSpeech.
The architecture keeps a small memory footprint and reduces the average amount of computation without degrading overall performance.
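To make the gating idea concrete, here is a minimal inference-only sketch of input-dependent depth. It is not the authors' implementation: `make_gate`, `make_module`, and `gated_forward` are hypothetical names, the gate is assumed to be a thresholded linear score, and a simple `tanh` map stands in for a Conformer sub-module. How the gates are trained is not covered by this summary, so the sketch leaves training out.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Gate = Callable[[Vector], int]
Module = Callable[[Vector], Vector]


def make_gate(weights: Vector, bias: float) -> Gate:
    """Hypothetical gate: a linear score on the frame, thresholded to 0 or 1."""
    def gate(x: Vector) -> int:
        score = sum(w * v for w, v in zip(weights, x)) + bias
        return 1 if score > 0.0 else 0
    return gate


def make_module(scale: float) -> Module:
    """Stand-in for a network module (assumption: any frame-wise transform)."""
    def module(x: Vector) -> Vector:
        return [math.tanh(scale * v) for v in x]
    return module


def gated_forward(x: Vector, layers: List[Tuple[Gate, Module]]) -> Tuple[Vector, int]:
    """Run a stack of gated residual modules; return the output and the number executed."""
    executed = 0
    for gate, module in layers:
        if gate(x):
            # Gate open: compute the module and add it to the residual path.
            update = module(x)
            x = [a + b for a, b in zip(x, update)]
            executed += 1
        # Gate closed: the module is skipped and x passes through unchanged.
    return x, executed


# Two gated modules that open only when the frame carries enough energy.
layers = [
    (make_gate([1.0, 1.0], -0.5), make_module(0.5)),
    (make_gate([1.0, 1.0], -0.5), make_module(0.5)),
]

quiet_out, quiet_depth = gated_forward([0.0, 0.0], layers)
loud_out, loud_depth = gated_forward([1.0, 1.0], layers)
print("quiet frame: modules executed =", quiet_depth)
print("loud frame:  modules executed =", loud_depth)
```

In this toy setup the near-silent frame bypasses every module while the energetic frame uses the full stack, which is the mechanism by which average computation can drop while the parameter count, and so the memory footprint, stays fixed.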
Author's take:
This research presents an innovative approach to keyword spotting in streaming audio by leveraging a vision-inspired framework. The use of input-dependent dynamic depth and trainable binary gates allows...