The sound of the digital world of tomorrow
At around 60 centimetres tall, Nao is not much larger than a baby. However, he can already walk, sit and crouch down without falling over. He has even learnt to talk. Nao is not a person but a robot – one that could provide valuable services for humans in the future, as a concierge in a hotel or as a helper in the home, for example. Experts call these types of robots ‘humanoid’.
Nao is being trained to communicate with humans verbally and with gestures at FAU’s Chair of Multimedia Communications and Signal Processing. This is quite a challenge – after all, humans are used to speaking to other humans. Squeaky or croaky machines with jerky movements are not considered to be particularly sociable. ‘In practice the success of humanoid robots will largely depend on how well their communication can mimic that of humans,’ explains Prof. Dr. Walter Kellermann. This is what his research group is working on and they want to use Nao to make a breakthrough.
Speech is the most important medium that humans use to communicate. In contrast to sight, hearing still works when the person we are speaking to is not in our direct line of vision. Nature has equipped us for this purpose not just with two ears but also with an extremely powerful brain that can filter and interpret the signals that it receives in a very flexible manner.
Good hearing is important for socialising, interacting with others and exchanging information. If we want to, we are able to understand another person even when they are standing some distance away in the middle of a noisy crowd. Researchers refer to this as the cocktail party problem – and machines are still unable to cope with such difficult acoustic situations. So far no one has been able to unravel the method that the human brain uses to filter out surrounding noise.
The solution is unlikely to be a simple one, as the mathematical representation of the problem is too complex. Standard telephones today have echo suppressors which use highly complex algorithms to remove distracting feedback. These tiny computers are specially designed to optimise up to 500 different parameters within a fraction of a second to ensure that the two parties can speak to each other without any irritating background noise. The parameters change with each conversation as they are influenced by factors such as the architecture and furnishings of the rooms that the people having the conversation are in.
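To give a flavour of how such echo suppressors work, the sketch below shows a normalised least-mean-squares (NLMS) adaptive filter, the textbook building block behind many of them. It is a minimal illustration rather than the algorithm of any particular phone; the function name and the choice of 512 coefficients (the ‘parameters’ that are continuously re-optimised) are assumptions made for the example.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=512, step=0.5, eps=1e-8):
    """Minimal NLMS sketch: estimate the echo path from the far-end
    (loudspeaker) signal and subtract the estimated echo from the
    microphone signal, leaving the near-end speech plus residual noise."""
    w = np.zeros(num_taps)                  # adaptive filter coefficients
    output = np.zeros_like(mic, dtype=float)
    for n in range(num_taps - 1, len(mic)):
        # most recent far-end samples, newest first
        x = far_end[n - num_taps + 1:n + 1][::-1]
        echo_estimate = w @ x
        error = mic[n] - echo_estimate      # what remains after removing the echo
        output[n] = error
        # re-estimate all coefficients at every sample (normalised update)
        w += (step / (x @ x + eps)) * error * x
    return output
```

Real systems add further stages on top of this basic loop, such as double-talk detection and frequency-domain processing, but the principle of continuously re-estimating a large set of parameters is the same.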
When this technology is applied to situations other than telephone conversations, the task of filtering the actual message out of an acoustic signal becomes much more difficult. As soon as the two parties are no longer communicating over a defined connection (such as a copper cable), the signal can no longer be measured immediately and the number of parameters that need to be optimised increases dramatically. This is already the case for conversations that are transmitted physically through the air. The situation becomes even more complex when the two parties move around during their conversation, like Nao the robot can.
Microphones in the arms
Researchers at FAU’s Chair of Multimedia Communications and Signal Processing have been investigating ways of improving communication between humans and machines for many years. They have developed considerable expertise in statistical signal processing and high-performance real-time algorithms that use digital tools to remove distracting background noise from the actual signal. One area to which their findings are applied today is the improvement of hearing aids.
This technology is also being used for Nao, who receives signals via a dozen microphones rather than just two. As a general rule in signal processing, the greater the number of microphones, the more accurately the direction from which a person is speaking can be determined.
However, this is still not enough. Several microphones are attached to Nao’s arms, allowing the distance between them to be altered when the robot opens his arms. ‘The greater the distance between the microphones, the higher the potential acoustic resolution of the microphone system,’ Walter Kellermann explains. ‘And the higher the resolution, the more accurately Nao can focus on acoustic sources that are some distance away.’
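A simple way to see why more microphones and a wider spacing help is delay-and-sum beamforming: each microphone signal is shifted in time so that sound from a chosen direction adds up in phase, while sound from other directions partially cancels – and the wider the array, the narrower that ‘acoustic beam’ becomes. The following is only an illustrative sketch for a straight line of microphones under a far-field assumption, not Nao’s actual processing chain; the function name and parameters are invented for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def delay_and_sum(signals, mic_positions, look_angle_deg, fs):
    """Steer a line of microphones towards look_angle_deg (0 deg = along the
    array axis). signals has shape (num_mics, num_samples); mic_positions
    gives each microphone's position along the axis in metres."""
    angle = np.deg2rad(look_angle_deg)
    # relative arrival delay of a plane wave coming from the look direction
    delays = mic_positions * np.cos(angle) / SPEED_OF_SOUND
    num_mics, num_samples = signals.shape
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    output = np.zeros(num_samples)
    for m in range(num_mics):
        spectrum = np.fft.rfft(signals[m])
        # compensate the arrival delay so the look direction adds up in phase
        spectrum *= np.exp(2j * np.pi * freqs * delays[m])
        output += np.fft.irfft(spectrum, n=num_samples)
    return output / num_mics
```

In this picture, opening Nao’s arms increases the spread of the microphone positions, which narrows the beam and sharpens the focus on distant speakers.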
This phenomenon is well known in the field of optics. Distant stars can best be observed individually using large lenses or huge telescope dishes, because the maximum possible image resolution increases with the diameter of the lens – what experts call the aperture.
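The same rule of thumb can be written down for sound: the smallest angle $\Delta\theta$ that an aperture of size $D$ can resolve at wavelength $\lambda$ is roughly $\Delta\theta \approx \lambda / D$. For speech energy around 1 kHz the wavelength in air is about 34 centimetres, so doubling the distance between Nao’s outermost microphones roughly halves the smallest angle that can still be separated.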
Will Nao’s flexible acoustic aperture make him the first robot that is able to hold a perfect conversation at cocktail parties? What about all the other requirements that he needs to fulfil to make humans like him? Will he succeed because he can not only hear and speak like a human, but mimic their gestures too?
These are the key questions being examined in the research project ‘Embodied Audition for RobotS’ (EARS), which has received 3.52 million euros of funding from the European Union over a period of three years. The project is led by the FAU researchers in Erlangen, who are working in collaboration with colleagues from Be’er Sheva, Paris, London, Berlin and Grenoble.
Suppressing disruptive noises effectively
Erlangen is an important location when it comes to audio coding and signal processing technology in general. Why? ‘Because, with our Chair of Multimedia Communications and Signal Processing, AudioLabs and the Fraunhofer Institute for Integrated Circuits IIS, we are world leaders in the field,’ Walter Kellermann says. This success is based on the innovative developments that have been made in the Nuremberg Metropolitan Region. One such example is the mp3 audio encoding format developed by Fraunhofer IIS and FAU that went on to achieve international fame. The productive collaboration between Fraunhofer IIS and FAU is now centred around International Audio Laboratories Erlangen (AudioLabs). Around 50 researchers, postdoctoral researchers, doctoral candidates and students work at this joint institute with the aim of improving the audio quality of the digital world.
The technologies developed here are of great interest to the communications electronics and electronic entertainment industries, for example. Hands-free technology is an excellent way to increase efficiency and ease in both human-to-human and human-to-machine communication, such as in telephone conferences or when operating technical devices using voice commands. However, in addition to the desired signal, the signal that is received also contains undesired background noise and disruptive acoustic sources, such as the ringing of a telephone. Suitable signal processing technology helps to improve the quality and intelligibility of speech and minimises disruptive background noise.
Smart TVs that can be used for Internet telephony and can be controlled by voice command are another example of where this technology is used. Although these devices are currently able to suppress slowly changing background noise very well, they are unable to handle rapid changes in the acoustic field, such as a telephone ringing or the noise of a vacuum cleaner, and, in particular, changes in the positions of these disruptive sources. This can sometimes bring human-to-machine communication to a standstill. In these cases, the quality and intelligibility of the speech suffer greatly, as does voice recognition.
Maja Taseska, a doctoral candidate at AudioLabs, is investigating this problem and looking for alternative solutions as part of the Spotforming project. She is developing a recording technique based on several microphone arrays that are spread out in space. The spatial diversity of these arrays allows rapid changes in the spectrum and position of disruptive sources to be detected, so that the algorithms that enhance the signal can adapt quickly to the specific acoustic conditions, even when there are several moving speakers or other undesired acoustic sources in the background.
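As a purely conceptual illustration of what such spatially separated arrays make possible, the sketch below combines two signals that have already been focused on different points in space – say, the desired speaker and a nearby source of disruption – into a time-frequency mask that keeps only the portions dominated by the target. The function name and parameters are invented for the example; this is a generic Wiener-style masking sketch, not the Spotforming project’s actual algorithm.

```python
import numpy as np
from scipy.signal import stft, istft

def spot_mask(target_focused, interferer_focused, fs, floor=0.1):
    """Conceptual sketch: given one signal focused on the target spot and one
    focused on an interfering spot (e.g. from two distributed arrays), keep
    only the time-frequency bins dominated by the target."""
    _, _, target_spec = stft(target_focused, fs=fs)
    _, _, interferer_spec = stft(interferer_focused, fs=fs)
    target_power = np.abs(target_spec) ** 2
    interferer_power = np.abs(interferer_spec) ** 2
    # Wiener-like mask: fraction of each bin's energy attributed to the target
    mask = target_power / (target_power + interferer_power + 1e-12)
    mask = np.maximum(mask, floor)   # spectral floor to limit audible artefacts
    _, enhanced = istft(mask * target_spec, fs=fs)
    return enhanced
```

Because the mask is recomputed for every short time frame, it can in principle follow interferers that appear, move or change their spectrum from one moment to the next.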
Attention to detail
Another task being attempted in a second AudioLabs project is improving the coordination of images and audio. While it is already possible to zoom in on any part of an image in a digital video in order to see details more clearly, an equivalent acoustic procedure does not yet exist. How great would it be if it were possible to focus on the middle person in an image of three people, while the other two people’s voices became quieter in the background? And what if, when we zoomed in, the voice of the person on the left actually came from the corresponding direction?
Oliver Thiergart, a research associate at AudioLabs, is working towards making this a reality in the ‘Akustisches Zoomen’ (acoustic zooming) project. The approach involves breaking the recordings from the microphones down into two signal types: direct sound and diffuse sound. Direct sound is the term used for signals that transport the actual message intended for the recipient and reach them from a specific direction. Humans are able to determine this direction with great accuracy on the basis of the differences in the phases and levels of the sound waves that reach their two ears. This does not work with diffuse sound – a good example of which is street noise – as it comes from all directions.
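The phase and timing differences described here can be turned into a direction estimate with a standard textbook technique: cross-correlating two microphone signals to find the time difference of arrival, then converting that delay into an angle. The sketch below is a minimal two-microphone, far-field illustration, not AudioLabs’ actual direct/diffuse decomposition; the function name is assumed for the example.

```python
import numpy as np
from scipy.signal import correlate

SPEED_OF_SOUND = 343.0  # metres per second

def direction_from_two_mics(left, right, mic_distance, fs):
    """Estimate where a direct sound comes from using the time difference of
    arrival (TDOA) between two microphones spaced mic_distance metres apart.
    Far-field geometry: tdoa = mic_distance * sin(angle) / c."""
    cc = correlate(left, right, mode="full")         # cross-correlation
    lags = np.arange(-(len(right) - 1), len(left))   # lag (in samples) per entry
    # only delays up to mic_distance / c are physically possible
    max_lag = int(round(mic_distance / SPEED_OF_SOUND * fs))
    valid = np.abs(lags) <= max_lag
    best_lag = lags[valid][np.argmax(cc[valid])]
    tdoa = best_lag / fs
    sine = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sine))               # angle relative to broadside
```

Diffuse sound such as street noise produces no clear correlation peak, which is exactly why the two signal types have to be separated first.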
Thiergart is constructing his zooming system on the basis of a flexible algorithm. The user does not have to choose which details they want to focus on while recording, but can wait until they play it back to decide. ‘Technically a few microphones and a small computer like the ones that are already available in portable digital devices are sufficient for this,’ Thiergart says. ‘That’s why I hope that it will soon be possible to apply this technology to apps for smartphones or digital cameras.’
Hearing the acoustic surroundings
Prof. Dr. Rudolf Rabenstein from the Chair of Multimedia Communications and Signal Processing, by contrast, is researching how acoustic scenes can be reproduced as authentically as possible. ‘More and more people today are using smart glasses to enter virtual spaces in which their movements are purely optical and not physical,’ he explains. ‘What is missing here is accompanying acoustic surroundings that change when they change their position in the space.’ This might be when they leave a quiet museum in cyberspace, coming out onto a noisy street with cars passing by, or when they move from the stalls to a box in a concert hall.
The field of research that could provide solutions in this area is called sound field synthesis. Unlike speech, the sound in this context is not directed at humans, who have specific sensory perception abilities. For this reason, Rudolf Rabenstein is searching for technical solutions that would allow a real or imagined space to be synthesised so that each aspect of the virtual version sounds exactly like the original. This challenge also corresponds to an optical phenomenon. The sound field that Rabenstein aims to create can be compared to a flat screen on which it is possible to control each part of the image temporally and spatially.
Standard audio techniques cannot produce this kind of sound field. Neither the stereo technique, which uses two audio channels, nor the surround sound technique, which has five channels, is able to reproduce the acoustic complexity of a room that changes over time. The architecture and furnishings of a room with columns that reflect sound or curtains that absorb it, for example, make the surroundings too complex from an acoustic perspective.
128 speakers that can be arranged flexibly
Rudolf Rabenstein and his team are therefore experimenting with sound systems with up to 128 speakers. ‘One thing that is very important is to ensure that the speakers can be arranged with a high degree of flexibility,’ he says. ‘After all, these speakers will need to be placed in different positions in each different acoustic environment.’
The researchers have already demonstrated the practicability of their concept. Hearing aid manufacturers have used it to test their products in different acoustic environments, for example, and the team has worked with aeroplane manufacturers who want to use the acoustics of the speakers to make the cabins of their jets appear larger. What is currently stopping the system from being used in the second case is the weight of the speakers – in the aviation industry every additional gram counts.
However, the technology is suitable for a wide variety of other applications. For example, Rudolf Rabenstein is using it to model direct current networks in buildings in a project that he is currently working on. ‘On a theoretical level, the problems are very similar,’ he says. ‘In acoustics we deal with sound waves that travel through the air. In electronics we investigate electrical impulses that travel along wires and through circuits.’
But regardless of whether the projects are developing hearing robots, intelligent remotes or acoustic zooming, one thing is certain – the acoustics experts in Erlangen will continue to play their part in innovation in this field in the future.
This article – and many others on interesting topics related to the senses and sensory perception – was originally published in German in the current issue of FAU’s research magazine ‘friedrich’.
Caption 1: It might still be a while yet before robot Nao is able to communicate with a person in the middle of a noisy crowd, as there is no easy solution to what researchers call the cocktail party problem. (Image: FAU/David Hartfiel)
Caption 2: Behind the scenes at AudioLabs’ acoustic laboratory, where researchers are designing the audio of the digital world of tomorrow. (Image: FAU/Kurt Fuchs)
Caption 3: Stereo and Dolby Surround might soon be history: the new sound system being developed at FAU uses up to 128 speakers to reproduce acoustics as authentically as possible. (Image: FAU/David Hartfiel)