--- 2 June 2016 ---

The Theory of Spatial Rendering

Sound field

Positioning one or more sound sources in an environment results in a sound field. This sound field depends on the positions of the sound sources, their characteristics, and the reflections of sound waves. The sound waves of individual sources can affect each other depending on their characteristics, positions, and timing, i.e., they can cancel each other out or amplify each other.

Video: Sound field for two non-continuous sound sources.
Data and animation courtesy of Hagen Wierstorf.

A sound field inherently contains spatial information about each sound source. This spatial information can be derived to a certain degree by sampling the sound field with multiple microphones. For stationary source signals, a single microphone can also be used by changing its position and, for unidirectional microphones, its direction over time. Evaluating the differences between the recorded signals allows estimating the positions of individual sound sources.
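As an illustration, the time difference of arrival between two microphones can be estimated via cross-correlation of their recordings. The following is a minimal sketch in Python; the microphone signals, the sample rate, and the microphone spacing are assumed inputs for the example, not part of the original text:

    import numpy as np

    def estimate_tdoa(sig_a, sig_b, sample_rate):
        """Estimate the time difference of arrival (TDOA) in seconds
        between two microphone recordings via cross-correlation."""
        corr = np.correlate(sig_a, sig_b, mode="full")
        # Lag (in samples) at which the two recordings align best.
        lag = np.argmax(corr) - (len(sig_b) - 1)
        return lag / sample_rate

    # Hypothetical usage with a known microphone spacing d and the speed
    # of sound c (~343 m/s); the direction of arrival then follows from
    # sin(theta) = estimate_tdoa(mic_a, mic_b, 44100) * c / d.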

Spatial Hearing

Humans are able to extract spatial information from a sound field using their ears. A human ear is, in fact, a biological microphone covered by a pinna. Using both ears, humans can derive spatial information about sound sources in their environment. Depending on the position of a sound source, the sound waves reaching each ear differ (except if the source is directly in front of or behind the head), as the sound waves travel different paths to each ear. These differences occur in timing (Interaural Time Difference, ITD), level in terms of volume (Interaural Level Difference, ILD), and frequency. Differences in timing occur if the distances between the sound source and the two ears are not equal. A longer distance requires more time for the sound waves to reach the ear and also reduces their energy (volume). Differences in frequency occur, for example, due to objects that are not acoustically transparent (such as the head itself). If such an object lies between the sound source and the ear, the sound waves cannot pass freely, which may attenuate some frequency ranges partly or even completely. These characteristics allow the direction of a sound source relative to the orientation of the head to be derived to a certain degree. The distance of a sound source can be determined from the level of the signals reaching the ears (if a ground truth about the sound source is available) and from the timing differences of reflections of the sound waves.
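To give a rough sense of the magnitudes involved, the ITD of a distant source can be approximated with Woodworth's spherical-head model, ITD ≈ (a/c)(sin θ + θ), where a is the head radius, c the speed of sound, and θ the azimuth of the source. A sketch, assuming a head radius of about 8.75 cm (the default value is an assumption for illustration):

    import math

    def itd_woodworth(azimuth_deg, head_radius=0.0875, speed_of_sound=343.0):
        """Approximate the interaural time difference (in seconds) for a
        far-field source using Woodworth's spherical head model."""
        theta = math.radians(azimuth_deg)
        return (head_radius / speed_of_sound) * (math.sin(theta) + theta)

    # A source directly to one side (90 degrees) yields roughly 0.66 ms,
    # while a source straight ahead (0 degrees) yields 0 ms.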

However, human hearing is not perfect. For example, the angular resolution is not uniformly precise. The highest horizontal resolution (1-2 degrees) is available in front of the head, while towards the left and right the resolution decreases to as much as 30 degrees. In the vertical plane, the angular resolution is even lower.

Spatial Rendering using a Pair of Headphones

The impression of spatiality of one or more sound sources can be recreated using a pair of headphones. One approach is to create binaural recordings of a real sound environment. Here, a dummy head, representing a human head with two artificial ears, is used to record the sound field at the desired position (potentially with movements, if desired). Presenting the recorded signal of each ear binaurally via a pair of headphones enables a person to experience the sound field as if they had been at the position of the dummy head, including its motion and head rotation.

Video: Impulse responses for both ears depending on head rotation for a sound source at 0°, i.e., in front of the listener.
Data, images, and animation courtesy of Gnuplotting.org and Hagen Wierstorf.

A spatial representation can also be rendered. If this is done in real time (here meaning fast enough), head rotation and position changes of the listener can be taken into account. The sound field is simulated by rendering the signals arriving at the left and right ear for all sound sources, depending on their position, direction, and reflections. Essentially, this adds the spatial cues (e.g., ITD and ILD) to the signals, which can then be processed by the human hearing system.

Spatial rendering can be conducted using convolution. Here, the signal of a sound source is convolved for each ear with a so-called Head Related Transfer Function (HRTF). The HRTF describes the modifications applied to the signal of a sound source depending on its position relative to the ear. The HRTF is the Fourier transform of the Head Related Impulse Response (HRIR). For convolution, the Fourier transform of the source signal is multiplied with the HRTF, and then the inverse Fourier transform is applied. For a binaural representation, this is conducted for each ear individually. Presenting these two signals to the correct ears at the same time should result in a spatial representation for the listener. The distance of a sound source to the listener is simulated by adjusting the loudness (i.e., with increasing distance, a sound source becomes quieter). Multiple simultaneous sound sources can be presented by mixing (i.e., adding) the convolved signals for each ear.
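A minimal sketch of this pipeline in Python, assuming HRIR pairs for the desired source directions are already available (e.g., from a measured database); the function and variable names are illustrative. The convolution is performed directly with the HRIRs in the time domain, which is equivalent to the multiplication with the HRTFs in the frequency domain:

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(sources):
        """Render sources to a stereo signal. Each source is a tuple
        (mono_signal, hrir_left, hrir_right, distance_m); the two HRIRs
        of a pair are assumed to have equal length."""
        length = max(len(s) + len(hl) - 1 for s, hl, hr, d in sources)
        out = np.zeros((length, 2))
        for signal, hrir_l, hrir_r, distance in sources:
            gain = 1.0 / max(distance, 1.0)  # simple 1/r distance attenuation
            left = fftconvolve(signal, hrir_l) * gain
            right = fftconvolve(signal, hrir_r) * gain
            out[:len(left), 0] += left    # mixing = adding the convolved signals
            out[:len(right), 1] += right
        return out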

Practical Limitations

Providing a spatial representation using a pair of headphones has some practical issues. These include the delay caused by spatial rendering (problematic if head movements need to be taken into account), differences in audio equipment (e.g., the frequency response of headphones is not identical), and differences between humans and their individually learned spatial audio processing. While the delay can be reduced (e.g., with the overlap-add method) and differences in the frequency response of headphones can be compensated to a certain degree, the signal modifications due to individual pinna shapes, ear distance, hairstyle, and reflections off the human body can only be accounted for partially. Although signal modifications due to differing pinna shapes can be addressed by creating individualized HRTFs, HRTFs derived using a standard pinna are often sufficient to provide a satisfying spatial representation.
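For instance, block-wise convolution with the overlap-add method bounds the rendering latency to one block instead of the full impulse-response length. A sketch under the assumption of fixed-size input blocks; the block size is chosen purely for illustration:

    import numpy as np
    from scipy.signal import fftconvolve

    def overlap_add_stream(blocks, hrir, block_size=256):
        """Convolve a stream of fixed-size blocks with an HRIR using
        overlap-add, emitting one output block per input block."""
        tail = np.zeros(len(hrir) - 1)
        for block in blocks:
            y = fftconvolve(block, hrir)  # length: block_size + len(hrir) - 1
            y[:len(tail)] += tail         # add the overlap from the previous block
            tail = y[block_size:].copy()  # keep the new tail for the next block
            yield y[:block_size]          # emit with one-block latency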

Another practical issue arises from the fact that humans move their head to explore their sound environment. This helps to avoid front-back confusion, i.e., the difficulty of determining whether a sound source is in front of or behind the listener, as the signal modifications in both cases are very similar (same ILD and ITD). In addition, the improved angular resolution in front of the listener can be exploited by rotating the head towards the sound source. Moreover, a listener might shift their position in a sound field to explore it. The inability to move or rotate the head might therefore limit the usefulness of a spatial representation.
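A head-tracked renderer typically compensates for this by rendering each source at its azimuth relative to the current head orientation, selecting (or interpolating) the corresponding HRIR pair on every tracker update. A hypothetical sketch of the angle bookkeeping; the function name and convention are assumptions for illustration:

    def relative_azimuth(source_azimuth_deg, head_yaw_deg):
        """Azimuth of a source relative to the listener's current head
        orientation, wrapped to (-180, 180] degrees."""
        rel = (source_azimuth_deg - head_yaw_deg) % 360.0
        return rel - 360.0 if rel > 180.0 else rel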

Further Reading