Imagine the pounding chords of a pipe organ echoing through the cavernous sanctuary of a massive stone cathedral.
The sound that a visitor to the cathedral hears is affected by many factors, including the position of the organ, where the listener stands, whether there are pillars, pews, or other obstacles in between, what the walls are made of, the locations of windows or doors, and so on. Hearing a sound can help someone visualize their surroundings.
Researchers at MIT and the MIT-IBM Watson AI Lab are investigating the use of spatial acoustic information to help machines also better visualize their surroundings. They developed a machine learning model that can capture how each sound in a room propagates through space, allowing the model to simulate what a listener would hear in different locations.
By accurately modeling the acoustics of a scene, the system can learn the underlying 3D geometry of a room from audio recordings. The researchers can use the acoustic information captured by their system to create accurate visual representations of a space, much like how humans use sound to estimate the properties of their physical environment.
In addition to its potential applications in virtual and augmented reality, this technique could help artificial-intelligence agents develop a better understanding of the world around them. For example, by modeling the acoustic properties of sound in its environment, an underwater exploration robot could sense objects that are farther away than vision alone would allow, says Yilun Du, a doctoral student in MIT's Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model.
“Most researchers have so far only focused on modeling vision. But as humans, we have multimodal cognition. Not only sight is important, sound is also important. I think this work opens an exciting direction of research into better using sound to model the world,” says Du.
Joining Du on the paper are lead author Andrew Luo, a graduate student at Carnegie Mellon University (CMU); Michael J. Tarr, the Kavčić-Moura Professor of Cognitive and Brain Science at CMU; and senior authors Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in MIT's Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research results will be presented at the Conference on Neural Information Processing Systems.
Sound and image
In computer vision research, a type of machine learning model called the implicit neural representation model has been used to produce smooth, continuous reconstructions of 3D scenes from images. These models use neural networks, which contain layers of interconnected nodes, or neurons, that process data to complete a task.
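To make the idea concrete, here is a minimal sketch of an implicit neural representation: a small multilayer perceptron that maps a continuous 3D coordinate to a scene value (such as density or color). The weights below are random placeholders, and all names and layer sizes are illustrative; in a real system the network would be trained on observations of the scene.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # One (weights, bias) pair per layer, e.g. sizes = [3, 64, 64, 1].
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    # ReLU hidden layers, linear output.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# Map 3D query points to one scene value each; the representation is
# continuous, so any point in space can be queried, not just a fixed grid.
params = init_mlp([3, 64, 64, 1])
points = rng.uniform(-1.0, 1.0, size=(5, 3))
values = mlp(params, points)
print(values.shape)  # (5, 1): one value per query point
```

The key property is that the scene lives in the network's weights rather than in a voxel grid, so the reconstruction is smooth and resolution-free.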
The MIT researchers used the same type of model to capture how sound propagates continuously through a scene.
But they found that vision models benefit from a property known as photometric consistency that doesn’t apply to sound. If you look at the same object from two different places, the object looks roughly the same. But for sound, you change location and the sound you hear may be totally different due to obstacles, distance, etc. This makes predicting audio very difficult.
The researchers overcame this problem by incorporating two properties of acoustics into their model: the reciprocal nature of sound and the influence of local geometric features.
Sound is reciprocal, meaning that when the source of a sound and a listener swap positions, what the person hears remains unchanged. In addition, what one hears in a given area is heavily influenced by local geometry, such as an obstacle between the listener and the sound source.
To incorporate these two factors into their model, called the neural acoustic field (NAF), they augment the neural network with a grid that captures objects and architectural features in the scene, such as doors or walls. The model randomly samples points on this grid to learn the features at specific locations.
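The two design ideas above can be sketched in a few lines. This is a hypothetical toy version, not the authors' code: `local_feature` stands in for the learned grid of local scene features, and reciprocity is enforced by combining the emitter's and listener's inputs symmetrically, so swapping them cannot change the output.

```python
import numpy as np

rng = np.random.default_rng(1)

GRID, FDIM = 8, 16
# A 2D grid of feature vectors covering the scene (learnable in practice).
grid = rng.standard_normal((GRID, GRID, FDIM))

def local_feature(pos):
    # Nearest-cell lookup for a position in [0, 1)^2; a real model would
    # interpolate between cells and train the grid end to end.
    i, j = (np.clip(pos, 0.0, 0.999) * GRID).astype(int)
    return grid[i, j]

W = rng.standard_normal((FDIM + 2, 1)) * 0.1  # stand-in for the network

def naf(emitter, listener):
    # Symmetric combination: summing features and positions makes the
    # function invariant to swapping emitter and listener (reciprocity).
    feats = local_feature(emitter) + local_feature(listener)
    sym_pos = emitter + listener
    x = np.concatenate([feats, sym_pos])
    return float(x @ W)

a, b = np.array([0.2, 0.3]), np.array([0.7, 0.9])
print(np.isclose(naf(a, b), naf(b, a)))  # True: reciprocal by design
```

Randomly sampling grid cells during training, as the article describes, then teaches the model which local features (a door, a wall) shape the sound at each location.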
“If you imagine yourself standing near a door, what you hear most strongly influences the presence of that door, not necessarily geometric features far away from you on the other side of the room. We found that this information allows for better generalization than a simple, fully connected network,” says Luo.
From noise prediction to scene visualization
Researchers can feed the NAF visual information about a scene, along with a few spectrograms that show what a piece of audio sounds like when the emitter and listener are located at particular points in the room. The model then predicts what that audio would sound like if the listener moved to any other point in the scene.
The NAF outputs an impulse response that captures how a sound should change as it propagates through the scene. The researchers then apply this impulse response to different sounds to hear how those sounds should change as a person walks across a room.
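This rendering step is, at its core, a convolution: given an impulse response for a listener position, the "wet" audio heard there is the dry source signal convolved with that response. The impulse response below is synthetic (a direct path plus two decaying echoes), not a NAF prediction; it simply illustrates the mechanics.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
dry = np.sin(2 * np.pi * 440.0 * t)  # 1 second of a 440 Hz tone

# A toy impulse response: direct sound followed by two quieter reflections.
ir = np.zeros(4000)
ir[0] = 1.0      # direct path
ir[1200] = 0.5   # first reflection (75 ms later)
ir[3000] = 0.25  # later, quieter reflection

# Convolving the dry signal with the impulse response yields the audio
# a listener at that position would hear.
wet = np.convolve(dry, ir)
print(len(wet) == len(dry) + len(ir) - 1)  # True: full convolution length
```

Because one impulse response can be applied to any source signal, the model only needs to predict the response per location, after which speech, music, or footsteps can all be rendered there.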
For example, if a song is played from a speaker in the middle of a room, the model would show how that sound gets louder as a person approaches the speaker, then becomes muffled as they walk out into an adjacent hallway.
When the researchers compared their technique to other methods that model acoustic information, theirs consistently produced more accurate sound models. And because it learned local geometric information, their model was able to generalize to new locations in a scene much better than other methods.
In addition, they found that applying the acoustic information their model learns to a computer visualization model can lead to a better visual reconstruction of the scene.
“For example, if you only have a few views, these acoustic features can help you capture boundaries more sharply. And maybe that’s because in order to accurately render a scene’s acoustics, you have to capture the underlying 3D geometry of that scene,” says Du.
The researchers plan to further improve the model so that it can be generalized to brand new scenes. They also want to apply this technique to more complex impulse responses and larger scenes, such as entire buildings or even a city.
“This new technique could open up new possibilities to create a multimodal, immersive experience in metaverse applications,” adds Gan.
“My group has done a lot of work using machine learning methods to speed up acoustic simulation or to model the acoustics of real scenes. This paper by Chuang Gan and his co-authors is clearly a big step in that direction,” says Dinesh Manocha, the Paul Chrisman Iribe Professor of Computer Science and Electrical and Computer Engineering at the University of Maryland, who was not involved in the work. “Specifically, this paper presents a nice implicit representation that can capture how sound can propagate in real scenes by modeling it with a linear time-invariant system. This work can have many applications in AR/VR as well as in understanding real scenes.”
This work is supported in part by the MIT-IBM Watson AI Lab and the Tianqiao and Chrissy Chen Institute.