Audio spatialization + remote conferencing
I’ve been thinking for a while about audio spatialization in the context of remote collaboration. Parts of this post are inspired by some research papers I read (ask me if you want to see them), a brief conversation with Phillip Wang, one of the founders of Gather/Online Town (a very cool related project), and another conversation with my dad. Also useful are the MDN docs page for the HTML5 web audio API, which includes some nice spatialization capabilities, and the OpenAL-soft project, which implements a cross-platform API for 3D sound, including spatialization among many other goodies.
Audio spatialization is when you process audio in such a way that it sounds like it’s coming from a location in space, around you, rather than from your device loudspeaker or kind of ‘inside your head’ when you’re using head-/ear-phones. The effect is easier to achieve with -phones because then you have independent control over what each ear hears and no interference between the two; but you can also do it with an array of well-placed speakers, like a home theater surround sound system, and it sounds more natural because natural audio usually doesn’t materialize directly inside your ear (‘bzzzz’ – ‘aagh get out!’ – <a vague disgust lingers>).
The processing is a little complicated. Briefly, there are two ways to do it. The first is to use a dummy model of a person (a detailed one—the shape and material of the ears matters a lot!), play predetermined sound samples (pure sine waves) from various angles around the dummy, record just inside the ears of the dummy, and when you want to play actual audio, modulate (convolve) the source audio with the recorded samples. You can read more about these head-related transfer functions. The second is to build a mathematical model that simulates the effect by playing with the millisecond delay between when sound reaches one ear or the other, the frequency-filtering effects of the outer ear and head (and sometimes torso), varying amplitude in relation to distance and how much sound is blocked by the head/body, etc. This is simpler to program and understand, but it tends to be less realistic and can be computationally more expensive; still, it is certainly possible to get quite realistic results, and modern hardware makes the computation manageable. (Although virtual reality research has explored this topic since at least 1993. Really, there was so much VR research in the 90s—sadly, neither then nor now has the rest of the technological world made it commonplace.)
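In fact, the Web Audio API mentioned above already packages the first approach for you: a PannerNode with its panning model set to “HRTF” convolves whatever you route through it with a built-in set of measured head-related impulse responses. Here’s a minimal sketch of spatializing a microphone stream that way; PannerNode and getUserMedia are standard Web Audio / browser APIs, but the helper name and the position values are just illustrative.

```ts
// Sketch: spatialize a microphone stream with the Web Audio API's HRTF panner.
// The API calls are standard Web Audio; the position values are arbitrary.
const ctx = new AudioContext();

async function spatializeMic(x: number, y: number, z: number) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);

  const panner = new PannerNode(ctx, {
    panningModel: "HRTF",     // convolve with measured head-related impulse responses
    distanceModel: "inverse", // amplitude falls off with distance
    positionX: x,             // where the voice should appear to come from,
    positionY: y,             // in the listener's coordinate frame (roughly meters)
    positionZ: z,
  });

  source.connect(panner).connect(ctx.destination);
}

// e.g. a voice one meter ahead of you and slightly to your left:
spatializeMic(-0.5, 0, -1);
```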
Tying this into remote collaboration: it often happens over group video or voice calls, and the majority of widely-used platforms don’t do spatialization at all. They do other advanced stuff like automatic echo cancellation and background noise reduction and turning down people who aren’t currently speaking. They also have to deal with the headaches of having so many different media formats and web media protocols and differing browser and system requirements and connection quality and encryption and aaagh. What would spatialization look like in this setting? My proposal is that each person in a call would have a seat at a virtual table, and would hear the others talking as if from their positions around the table. (This can be optionally augmented by tracking the direction of the listener’s head so you can ‘look at’ someone who’s speaking and the audio changes accordingly. Could use the webcam stream or mobile phone accelerometer data or something else.) Depending on the nature of the work/collaboration, there could be multiple tables within earshot of each other, or each person could have their own table where others can join if they want to talk (and this will be faintly but not distractingly audible to others).
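To make the virtual table concrete, here’s a purely hypothetical sketch. It assumes the conferencing layer already hands you one decoded MediaStream per remote participant (how you get those is its own headache, as noted above); the sketch just seats them evenly around a circle and, optionally, rotates the listener to follow head direction.

```ts
// Hypothetical sketch: seat each remote participant around a virtual table,
// 1.5 m from the listener, with one HRTF panner per person.
const ctx = new AudioContext();

function seatParticipants(streams: MediaStream[], radius = 1.5) {
  streams.forEach((stream, i) => {
    const angle = (2 * Math.PI * i) / streams.length; // spread seats evenly
    const panner = new PannerNode(ctx, {
      panningModel: "HRTF",
      positionX: radius * Math.sin(angle),  // to the listener's right (+x)
      positionZ: -radius * Math.cos(angle), // in front of the listener (-z)
    });
    ctx.createMediaStreamSource(stream).connect(panner).connect(ctx.destination);
  });
}

// Optional head tracking: instead of moving every source, rotate the listener.
// (Newer browsers expose the listener orientation as AudioParams.)
function setHeadYaw(yawRadians: number) {
  ctx.listener.forwardX.value = Math.sin(yawRadians);
  ctx.listener.forwardZ.value = -Math.cos(yawRadians);
}
```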
For now let’s stick with the single table. Why might a collaboration platform want to support this? I conjecture:
- The ‘cocktail party’ problem: real-life discussions can organically generate side-discussions and cross-talk. These are hard to manage in a superimposed mono stream of everyone’s audio, but our brains have a lot of experience in focusing on some (spatial) sources of sound and tuning out others.
- Keeping track of the conversation without video: if you can’t or aren’t using video, positioning information can make it easier to tell who’s talking, in addition to recognizing their voice. It’s easier for someone to jump in when their half-word interruption isn’t mistaken for noise; it’s easier for others to tell that they haven’t spoken for a while.
- Aesthetics: it just seems more natural. Like actually being at a table together. If not practical value, it has a sort of artistic value. You could even add reverb to make it not only sound like people are positioned in space, but also like people are in a room, i.e. the sound effect of being in a small padded office vs. a large glass-window conference room. This leads me into another augmentation/variation: back when I took 21M.080 (intro to music technology), my final project was to simulate the reverberation profiles of various spaces on the MIT campus (which I did with the jankiest recording equipment possible). You could play an audio file or speak into a microphone and have it sound like it was in Lobby 7. But what if you could have a conversation in Lobby 7? What if you could in addition play the background noise of people talking and passing through Lobby 7? What if you could have a conversation while ‘walking through’ campus, and the ‘sound environment’ changed accordingly as you entered a hallway or a classroom, stepped outside, etc.? I’ve been ruminating on this sort of thing since then.
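Incidentally, the room-simulation part of that daydream is the easy bit to prototype today: the Web Audio API’s ConvolverNode, fed with an impulse response recorded in a space (a clap, a balloon pop, a sine sweep), makes anything routed through it sound like it’s in that space. A rough sketch, where the helper and the impulse-response URL are made-up placeholders but the node types are real:

```ts
// Sketch: make a voice sound like it's in a particular room by convolving it
// with an impulse response recorded there. The URL is a made-up placeholder.
const ctx = new AudioContext();

async function addRoom(source: AudioNode, impulseResponseUrl: string) {
  const response = await fetch(impulseResponseUrl);
  const impulse = await ctx.decodeAudioData(await response.arrayBuffer());
  const reverb = new ConvolverNode(ctx, { buffer: impulse });
  source.connect(reverb).connect(ctx.destination);
  return reverb;
}

// e.g. route a spatialized voice through a (hypothetical) Lobby 7 recording:
// addRoom(somePannerNode, "/impulse-responses/lobby7.wav");
```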
Of course there are good reasons that major platforms have for not implementing this, probably a big one being computation overhead, either on their servers or on users’ machines; the latter may not always be capable, and the former may not be worth the (perceived) benefit. But then I discovered that someone has already started to work on just this: High Fidelity. They’re in a beta phase right now, so I signed up for a trial, and if you want to join me in trying it out (and if you’re still reading at this point in this long post), I’ll be hanging out in my space on Monday starting 13:00 Eastern and going for maybe 30 minutes or more.
Updates in this space:
- High Fidelity has pivoted from being a Gather competitor to offering a spatial audio API: “Add immersive, high quality voice chat to any application”. This was exactly what my dad and I had talked about in June 2020.
- Dolby (!) is now also offering a spatialization + other virtual audio tools API, and Agora.io offers a broader audio API that includes spatialization (but they evidently don’t emphasize this). Exciting progress towards making technology more human—as they say, good technology is the part you don’t notice, and voice/video calls are certainly not there yet.
- Gather seems to have decided not to go the route of audio spatialization but to include other audio goodies. They now have a fleshed-out rooming system (private rooms where everyone is connected, conference rooms spotlighting the speaker, bubbles and quiet modes for individual interaction, and other tools) and they added ambient noise effects. They also seem to be doing pretty darn well as a company; how many of you know someone (other than me) who is excited about Gather?
Spatial audio has since become very mainstream!
This sentence from the official AirPods launch article sounds like it could have come straight from my original essay: “With spatial audio and dynamic head tracking, voices in a Group FaceTime call sound like they’re coming from the direction in which the person is positioned on the screen, making it seem as if everyone is in the same room.”
Apple announced “Spatial Audio with Dolby Atmos” over a year ago in May 2021, along with its spatial-audio–compatible AirPods Max over-ear headphones and the latest AirPods earphones, but it was only a couple days ago that I was finally inspired to go check it out for myself at the neighborhood Apple Store. (That advertisement really worked, I guess, as well as Apple’s iconic and slightly dystopian open showroom concept…) The same announcement also included Lossless Audio; however, the two are mutually exclusive for now, as the format Apple uses to stream Atmos music has lossy compression. What’s even funnier is that AirPods, or any wireless speakers for that matter, are incapable of playing lossless formats, because no Bluetooth codec currently has sufficient bandwidth. At least the AirPods Max still has a wired port…
Dolby Atmos is a pretty cool technology. (Although I think its significance comes not from any conceptual breakthrough, but from Dolby’s immense market power that has allowed it to roll out Atmos to thousands of movie theaters, consumer electronics including home theater, audio production tools and plug-ins, etc., finally coming to Apple Music as mentioned last year. It’s a great engineering accomplishment but not a leap in imagination like, say, the iPhone was.) Typically, when an audio engineer produces a track for a standard non-Atmos format like 5.1 (5 regular speakers, front left-center-right and surround left-right, plus 1 subwoofer), individual tracks are assigned to one or more fixed output channels, and what is exported to the consumer is the precomputed total content of each channel. For instance, in a movie, suppose a car crosses the screen from left to right while a character is talking. If you listened to just the front left channel of the final soundtrack, you might hear the car sound start louder and fade out as the car passes (over to the right side), while the dialog stays constant. There is no way to separate those sounds; the front left channel of the soundtrack is just one single composite stream of audio that doesn’t know anything about what component sounds are in it. The audio engineer made some decisions about the components and burned them into discrete channels.
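To put that another way: a channel-based mix is just a per-channel sum, and once the car and the dialog have been added into the front-left channel there is no recovering them separately. A toy illustration of the “burning in” (made-up types and gains, nothing Dolby- or 5.1-spec-specific):

```ts
// Toy illustration of channel-based mixing: each source gets fixed per-channel
// gains chosen by the engineer, and only the summed channels are delivered.
type Channel = "L" | "C" | "R" | "Ls" | "Rs" | "LFE";

interface Source {
  samples: Float32Array;                   // one mono component, e.g. the car
  gains: Partial<Record<Channel, number>>; // how loud it is in each channel
}

function mixdown(sources: Source[], length: number): Record<Channel, Float32Array> {
  const channels: Channel[] = ["L", "C", "R", "Ls", "Rs", "LFE"];
  const out = {} as Record<Channel, Float32Array>;
  for (const ch of channels) out[ch] = new Float32Array(length);

  for (const { samples, gains } of sources) {
    for (const ch of channels) {
      const g = gains[ch] ?? 0;
      for (let i = 0; i < length; i++) out[ch][i] += g * samples[i];
    }
  }
  return out; // composite streams: the car and the dialog are now inseparable
}
```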
But the problem is that everyone’s speaker setup is different. Many will of course be watching the movie with just their TV’s two inbuilt speakers. Some will have elaborate home theater setups with even more speakers than 5.1. Commercial movie theaters use extensive arrays of overhead and wall-mounted speakers. So either the soundtrack has to be distributed in multiple formats, or each consumer has to map the soundtrack onto their specific setup, which will necessarily be an approximation of what the audio engineer intended.
Dolby Atmos allows the engineer to specify sounds in 3D space directly. Audio tracks can be assigned to “dynamic objects” that follow a trajectory through 3D space, like the car in the movie going left to right. That description—one level of abstraction higher than mixing for a format like 5.1—is what comprises the soundtrack. On the consumer’s end, an Atmos-equipped sound system will calculate how that soundtrack should map onto the speakers present in the space, whether that’s a pair of headphones or a movie theater. The former is what happens when I listen to an Atmos track on Apple Music with my regular headphones. (Technical aside: I don’t know whether Atmos can account for the precise placement and frequency response of individual speakers within, say, a 5.1 setup. But some articles online seemed to imply that it can produce binaural output, i.e. stereo output as if subject to the acoustic characteristics of the human head and ears, so I think the answer is theoretically yes, because you could apply similar corrections to any number of output channels; it would just be impractical to expect the average user to do so. Or the binaural output may actually stem from Apple’s spatial audio features in conjunction.)
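For contrast with the mixdown above, here is a deliberately crude sketch of the object-based idea: the deliverable is each source plus its position over time, and the gains for whatever speakers actually exist are computed on the playback end. This is emphatically not Dolby’s renderer (real Atmos rendering uses far more sophisticated panning); it only illustrates where the computation moves.

```ts
// Crude sketch of object-based rendering: speaker gains are computed at
// playback time from each object's position and the listener's actual
// speaker layout. (Not Dolby's algorithm.)
type Vec3 = { x: number; y: number; z: number };

interface AudioObject {
  samples: Float32Array;
  positionAt(t: number): Vec3; // e.g. the car sweeping from left to right
}

interface Speaker {
  position: Vec3; // wherever this particular consumer's speaker happens to be
}

// Crude panning law: weight each speaker by its closeness to the object,
// then normalize so the overall level is roughly preserved.
function speakerGains(objectPos: Vec3, speakers: Speaker[]): number[] {
  const dist = (a: Vec3, b: Vec3) => Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
  const weights = speakers.map((s) => 1 / (1 + dist(s.position, objectPos)));
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
}
```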
Apple’s “Spatial Audio” combines Dolby Atmos with another feature: head-tracking in the latest AirPods and Beats ear-/head-phones, i.e. your device knows the orientation of your head when you’re wearing them. So Atmos’ 3D space of sounds can remain fixed as you turn your head inside it! If you turn to your right, an object that used to be front-center should now be coming from your left, but an object directly overhead shouldn’t sound any different. Your device can compute the new assignment of sounds to (two) channels on the fly. (Note that this would also work for regular stereo audio, which could be adjusted to sound like it is always coming from your device, or from two fixed speakers suspended in space, but both of those would sound weird.)
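The compensation itself is just a rotation: if your head yaws right by some angle, rotate every object’s position by the same angle the other way (equivalently, rotate the scene by the inverse of the head rotation) before rendering. A tiny sketch of that, assuming the Web-Audio-style convention of y up, -z forward, +x to the right:

```ts
// Sketch: undo a head turn by rotating the scene about the vertical (y) axis.
// Convention (as in Web Audio): y is up, -z is forward, +x is to the right.
type Position = { x: number; y: number; z: number };

function compensateForHeadYaw(p: Position, yawRight: number): Position {
  // Rotating by +yawRight undoes a rightward head turn: a source that was
  // front-center (0, 0, -1) ends up on your left, while a source directly
  // overhead (x = z = 0) is unchanged, as you'd expect.
  const c = Math.cos(yawRight);
  const s = Math.sin(yawRight);
  return { x: c * p.x + s * p.z, y: p.y, z: -s * p.x + c * p.z };
}
```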
How does spatial audio sound for Atmos music? First off, I didn’t expect this to be a fair trial at all. You cannot make 3D sounds with two small speakers stuck to the sides of your head. In fact, you cannot make true 3D sounds even with some expensive home theater setups advertised as “Atmos-enabled” that have upwards-firing speakers, which reflect sound off of your ceiling to simulate sounds coming from above. For the illusion to be believable would require the degree of calibration and room correction typically found in professional mixing studios. A fair trial of Atmos music needs speakers all around and above you, physically.
That being said, I tried a few different tracks from Apple’s suggested “Hits for Spatial Audio”, and results were varied. For some tracks I did experience a novel way of being really surrounded by the music. I think what really contributes to the feeling is compensating not for big head turns but for the small, natural movements of your head, instantaneously and seamlessly without it feeling like there is anything ‘going on’. (Apple is very good at this throughout their interface design.) It was definitely different from stereo music, which always feels to me like it is coming from my headphones rather than from sources in space. But I say results were varied because not all the music I listened to actually seemed to make good, artistic use of placing sources in space—some tracks felt like they were still ‘thinking in stereo’ and, suddenly given a whole other dimension to work in, used it awkwardly, non-natively. It’s like the difference between a building that can be understood pretty well from floorplans and 2D cross-sections, and a building that really needs you to think in three dimensions to understand what’s going on.
I guess this is expected, as Atmos and Apple’s Spatial Audio are still relatively new in the world of streaming average-consumer music. (I doubt the average consumer will purchase $550 headphones, anyway.) But I think spatial music will get better with time, as both producers and consumers learn to think in 3D and use space as an additional creative dimension in music.
I’m just troubled by some of the testimonials on Apple’s Spatial Audio article:
“There are no words to describe the immersive, overpowering experience of being a conductor, leading a performance of Mahler’s towering ‘Symphony of a Thousand.’ But now, technology is advancing to bring that experience closer to our ears, our minds, and our souls.”
“From the feeling of hearing your favorite artist in the same room as you, to the experience of sitting directly in the middle of a symphony orchestra, the listening experience is transformative and the possibilities for the creator are endless.”
What I listened to in my brief trial was a far cry from hearing an artist in the same room as me, let alone being inside a symphony orchestra. I don’t think technology will ever get there—nor should it try to. Countless details separate live music from a recording: inherent losses of digitization, lossy compression, shortcomings of all microphones and speakers, true spatialization (not the simulacrum of Spatial Audio), and most of all, the intangible experience of being in a performance… (My electronic music professor, Peter Whincop of 21M.361, loved the word “simulacrum”. I would like to think he would agree vehemently with this whole paragraph, because really it is an idea/philosophy that I learned from him. When I asked him how I could train my ear to better hear what is lost with digitization and lossy compression, he said simply, “go listen to a live orchestra”.) Three-dimensional music has been an art form in its own right for several decades (see quadraphonic and octaphonic music dating back to the 1950s, including Pink Floyd’s quadraphonic mix of The Dark Side of the Moon (1973) or the experimental Zaireeka (1997) by The Flaming Lips), not trying to mimic the experience of live music, but producing something genuinely novel and unreal to listen to. To Gustavo Dudamel I would say: the experience that technology is actually advancing to bring us is the one envisioned by these artists half a century ago.