



Imagine the booming chords of a pipe organ echoing through the cavernous sanctuary of a massive stone cathedral.

The sound that a visitor to the cathedral hears is affected by many factors, including the location of the organ, where the listener is standing, whether any columns, pillars, or other obstacles stand between them, what the walls are made of, and the locations of windows or doors. Hearing a sound can help someone visualize their surroundings.

Researchers at MIT and the MIT-IBM Watson AI Lab are exploring the use of spatial acoustic information to help machines better envision their environments, too. They developed a machine learning model that can capture how any sound in a room will propagate through the space, enabling the model to simulate what a listener would hear at different locations.

By accurately modeling the acoustics of a scene, the system can learn the underlying 3D geometry of a room from sound recordings. The researchers can use the acoustic information their system captures to build accurate visual renderings of a room, similar to how humans use sound when estimating the properties of their physical environment.

In addition to its potential applications in virtual and augmented reality, this technique could help artificial intelligence agents develop a better understanding of the world around them. For example, by modeling the acoustic properties of sound in its environment, an underwater exploration robot could sense things that are farther away than it could with vision alone, says Yilun Du, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model.

“Most researchers have only focused on modeling vision so far. But as humans, we have multimodal perception. Not only is vision important, but sound is also important. I think this work opens up an exciting line of research on how to better use sound to model the world,” Du says.
Joining Du on the paper are lead author Andrew Luo, a graduate student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Science at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT's Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Sound and vision

In computer vision research, a type of machine learning model called an implicit neural representation model has been used to generate smooth, continuous reconstructions of 3D scenes from images. These models use neural networks, which contain layers of interconnected nodes, or neurons, that process data to complete a task.

The MIT researchers used the same type of model to capture how sound travels continuously through a scene.

But they found that vision models benefit from a property known as photometric consistency that does not apply to sound. If one looks at the same object from two different locations, the object looks roughly the same. With sound, however, a change of location can make what one hears completely different because of obstacles, distance, and other factors. This makes predicting audio very difficult.

The researchers overcame this problem by incorporating two properties of acoustics into their model: the reciprocal nature of sound and the influence of local geometric features.

Sound is reciprocal, which means that if the source of a sound and a listener swap positions, what the listener hears is unchanged.
Furthermore, what one hears in a particular area is heavily influenced by local features, such as an obstacle between the listener and the source of the sound.

To incorporate these two factors into their model, called a neural acoustic field (NAF), they augment the neural network with a grid that captures objects and architectural features in the scene, such as doorways or walls. The model randomly samples points on that grid to learn the features at specific locations.

“If you imagine standing near a doorway, what most strongly affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found that this information enables better generalization than a simple fully connected network,” Luo says.
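The two ingredients described above, a grid of local features and reciprocity between emitter and listener, can be illustrated with a small sketch. This is not the researchers' implementation: the grid size, the room dimensions, the random untrained weights, the names `local_features`, `toy_network`, and `predict_energy`, and the choice to enforce reciprocity by averaging the prediction over both orderings of the two positions are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid of local geometric features over a 10 m x 10 m room.
GRID_SIZE = 16
FEATURE_DIM = 8
ROOM_SIZE = 10.0
feature_grid = rng.normal(size=(GRID_SIZE, GRID_SIZE, FEATURE_DIM))

# Random, untrained weights for a tiny two-layer network (illustration only).
INPUT_DIM = 2 * (2 + FEATURE_DIM)
W1 = rng.normal(size=(INPUT_DIM, 32))
W2 = rng.normal(size=32)


def local_features(position):
    """Bilinearly interpolate grid features at an (x, y) position in meters."""
    # Map room coordinates to continuous grid coordinates.
    gx, gy = np.clip(np.asarray(position) / ROOM_SIZE, 0.0, 1.0) * (GRID_SIZE - 1)
    x0, y0 = int(np.floor(gx)), int(np.floor(gy))
    x1, y1 = min(x0 + 1, GRID_SIZE - 1), min(y0 + 1, GRID_SIZE - 1)
    tx, ty = gx - x0, gy - y0
    # Blend the four surrounding grid cells.
    low = (1 - tx) * feature_grid[x0, y0] + tx * feature_grid[x1, y0]
    high = (1 - tx) * feature_grid[x0, y1] + tx * feature_grid[x1, y1]
    return (1 - ty) * low + ty * high


def toy_network(first, second):
    """Map two (position + local feature) vectors to a single scalar."""
    x = np.concatenate([first, second])
    return float(np.tanh(x @ W1) @ W2)


def predict_energy(emitter, listener):
    """Predict a scalar acoustic quantity that is symmetric in its arguments."""
    e = np.concatenate([emitter, local_features(emitter)])
    l = np.concatenate([listener, local_features(listener)])
    # Assumed way to enforce reciprocity: average over both orderings.
    return 0.5 * (toy_network(e, l) + toy_network(l, e))


speaker = np.array([1.0, 2.0])
visitor = np.array([7.5, 3.0])
forward = predict_energy(speaker, visitor)
swapped = predict_energy(visitor, speaker)
```

In this sketch `forward` and `swapped` agree by construction, which mirrors the reciprocity property, while the interpolated grid features let the prediction depend on what is near each position rather than on the room as a whole.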

From predicting sounds to visualizing scenes

Researchers can feed the NAF visual information about a scene and a few spectrograms that show what a piece of audio would sound like when the emitter and listener are located at target locations around the room. The model then predicts what that audio would sound like if the listener moved to any point in the scene.

The NAF outputs an impulse response, which captures how a sound should change as it propagates through the scene. The researchers then apply this impulse response to different sounds to hear how those sounds should change as a person walks through a room.

For example, if a song is playing from a speaker in the center of a room, their model would show how that sound gets louder as a person approaches the speaker and then becomes muffled as they walk out into an adjacent hallway.

When the researchers compared their technique with other methods that model acoustic information, it generated more accurate sound models in every case. And because it learned local geometric information, their model was able to generalize to new locations in a scene much better than other methods.

Additionally, they found that applying the acoustic information their model learns to a computer vision model can lead to a better visual reconstruction of the scene.

“When you only have a sparse set of views, using these acoustic features enables you to capture boundaries more sharply, for example. And maybe that’s because to accurately render the acoustics of a scene, you have to capture the underlying 3D geometry of that scene,” Du says.

The researchers plan to continue improving the model so that it can generalize to brand-new scenes. They also want to apply this technique to more complex impulse responses and larger scenes, such as entire buildings or even a town or city.

“This new technique might open up new opportunities to create a multimodal immersive experience in metaverse applications,” Gan adds.
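The step of applying an impulse response to a sound, described earlier in this section, is a convolution, and a minimal sketch may make it concrete. Nothing here comes from the researchers' code: the function `toy_impulse_response` is a hypothetical stand-in for what a model would predict (a delayed, distance-attenuated direct path plus one weaker echo), and the sample rate, echo delay, and test tone are arbitrary choices.

```python
import numpy as np

SAMPLE_RATE = 16_000       # samples per second (assumed)
SPEED_OF_SOUND = 343.0     # meters per second in air at room temperature


def toy_impulse_response(distance_m, length=2048):
    """Hypothetical impulse response for a listener distance_m from the source."""
    ir = np.zeros(length)
    gain = 1.0 / max(distance_m, 1.0)
    # Direct path: arrives after the travel time and is weaker when farther away.
    delay = int(round(distance_m / SPEED_OF_SOUND * SAMPLE_RATE))
    ir[delay] = gain
    # One made-up reflection arriving 400 samples later at 30% strength.
    ir[min(delay + 400, length - 1)] += 0.3 * gain
    return ir


def render(dry_audio, impulse_response):
    """Apply an impulse response to a dry signal by linear convolution."""
    return np.convolve(dry_audio, impulse_response)


# A quarter second of a 440 Hz tone as the dry source signal.
t = np.arange(SAMPLE_RATE // 4) / SAMPLE_RATE
dry = np.sin(2 * np.pi * 440.0 * t)

near = render(dry, toy_impulse_response(1.0))   # listener 1 m from the source
far = render(dry, toy_impulse_response(8.0))    # listener 8 m from the source
```

Because the same dry signal is reused, only the impulse response changes between `near` and `far`, so the same song or speech clip can be re-rendered for any listener position for which an impulse response is available.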
“My group has done a lot of work on using machine learning methods to accelerate acoustic simulation or model the acoustics of real-world scenes. This paper by Chuang Gan and his co-authors is clearly a major step forward in this direction,” says Dinesh Manocha, the Paul Chrisman Iribe Professor of Computer Science and Electrical and Computer Engineering at the University of Maryland, who was not involved in the work. “In particular, this paper introduces a nice implicit representation that can capture how sound can propagate in real-world scenes by modeling it using a linear time-invariant system. This work can have many applications in AR/VR as well as real-world scene understanding.”

This work is supported, in part, by the MIT-IBM Watson AI Lab and the Tianqiao and Chrissy Chen Institute.

