Speech technologies have met with enormous success and have become a central element of our society. With the ubiquity of mobile devices, technology-mediated speech communication at any time and from any place has become a commodity.
Advanced applications, such as untethered teleconferencing systems, smart rooms, telepresence, or audio-visual systems for ambient assisted living or for surveillance and monitoring, are gaining importance. Yet their performance still often falls short of user expectations: they require a significant amount of user cooperation and attention, are tailored to specific use cases, depend on close-talking microphones, and cannot tolerate highly variable acoustic environments. Furthermore, systems that incorporate a classification and recognition component rely on an a priori definition by experts of the events to be recognized, and on the availability of labeled, application-specific training data. These requirements render such systems expensive and inflexible, and their performance remains limited.
At the same time, the low cost and proliferation of acoustic sensors and the ubiquity of wireless communications allow a cost-effective realization of the infrastructure for the aforementioned applications, based on wireless acoustic sensor networks (ASNs). Traditional microphone arrays sample a sound field only locally, often at a relatively large distance from the target source(s), resulting in poor signal quality. ASNs use many more microphones to cover an area of interest, making it likely that at least one microphone is close to every relevant sound source, capturing a signal with improved quality and saliency. But to leverage the enormous potential of ASNs for spatio-temporal signal processing and classification, significant challenges have to be overcome. These challenges relate to the loose coupling of the sensors, the spatial extent of the network, the natural acoustic environment it is deployed in, and the increased signal enhancement and analysis demands of novel applications.
This Research Unit (RU) represents a coordinated effort to push beyond the boundaries of existing speech and audio technology to enable flexible, hardware-, environment-, and usage-adaptive, high-quality speech communication and acoustic scene classification over acoustic sensor networks. We categorize the corresponding applications as small-space applications (SSA) and large-space applications (LSA) according to the geometric extent of the space covered by the sensor network. The RU will address the specific challenges of these application categories as identified in the following.
Small-space applications (SSA) will typically rely on a few acoustic sensor nodes distributed across a single room or multiple rooms. Key applications include
- Ambient assisted living (AAL) and smart homes: Current technical solutions for AAL hardly ever consider acoustic sensors, owing to insufficient signal quality, other technical limitations, and privacy concerns. If these challenges are solved, an advanced AAL system using audio will bring about many benefits: it will enhance the user's voice over arbitrary nonstationary distortions, glean context information by observing the acoustic scene, and detect abnormal and possibly hazardous events by their acoustic signature. It will hence become an important technology for an aging society, supporting a prolonged self-determined life at home.
- Personal communication scenarios with a natural look and feel: This includes teleconferencing without headsets and the use of multiple sound capturing devices. While today's solutions rely on pre-installed audio equipment, a possible future solution would be to let the participants' smartphones spontaneously form a multichannel sound capturing and speech enhancement system.
Large-space applications (LSA) employ acoustic sensor networks that cover an extended geographical area. Example applications include
- Surveillance of public or private spaces: While traditional audio-visual systems conduct event detection based on energy thresholding and classify signals into one of a few pre-trained classes, an advanced system will build models of the typical sounds of the environment using unsupervised and semi-supervised learning techniques, making it more adaptive to the task at hand and reducing the need for costly labeled training data.
- Environmental and habitat monitoring: An acoustic sensor network can serve a multiplicity of purposes, ranging from monitoring adherence to noise control regulations to tracking endangered animal species in a wildlife refuge. By using appropriate signal representations and local processing at the sensor nodes, the amount of data the network must transmit and store can be greatly reduced.
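The data reduction achievable through local processing can be illustrated with a minimal sketch. The feature extractor below is a hypothetical example (not a method proposed by the RU): instead of streaming raw audio, a sensor node could transmit only coarse log band energies per frame, computed locally from the power spectrum. All function names, frame sizes, and band counts are illustrative assumptions.

```python
import numpy as np

def frame_band_energies(signal, frame_len=1024, hop=512, n_bands=8):
    """Summarize an audio signal as per-frame log band energies.

    Hypothetical node-side feature extractor: each frame's power
    spectrum is pooled into a few coarse bands, so only these
    features (not the raw samples) need to be transmitted.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Pool the power spectrum into n_bands coarse bands.
        bands = np.array_split(spectrum, n_bands)
        frames.append([np.log(np.sum(b) + 1e-12) for b in bands])
    return np.asarray(frames, dtype=np.float32)

# One second of 16 kHz audio (synthetic noise stands in for a recording).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000).astype(np.float32)

features = frame_band_energies(signal)
print(f"raw: {signal.nbytes} B, features: {features.nbytes} B, "
      f"reduction: {signal.nbytes / features.nbytes:.0f}x")
```

Under these assumptions, one second of 16-bit-equivalent audio shrinks from tens of kilobytes to under a kilobyte of features; the trade-off, of course, is that the chosen representation must retain enough information for the downstream monitoring task.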
This RU is dedicated to addressing the key scientific challenges common to these applications.