Date: 2024-05-23



This complexity becomes a problem when AI models must run in real time on headphones with limited computing power and battery life. To meet these constraints, the neural networks had to be small and energy efficient, so the team used an AI compression technique called knowledge distillation. This meant taking a huge AI model (the "teacher") that had been trained on millions of voices and having it train a much smaller model (the "student") to imitate its behavior and performance to the same standard.
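To make the teacher-and-student idea concrete, here is a minimal sketch of a distillation loss in plain Python. The article does not say which loss the team used; this is the classic textbook formulation, in which the student is penalized for diverging from the teacher's "softened" output distribution. The function names, the `temperature` value, and the example scores are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened outputs to the student's."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical output scores for a single input (not real data).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

Training would adjust the small model's weights to push this loss toward zero, so a student that tracks the teacher closely (`loss_close`) scores lower than one that does not (`loss_far`).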

Next, the student model was trained to extract the vocal patterns of a specific voice from the surrounding noise captured by microphones attached to a pair of commercially available noise-cancelling headphones.

To activate the Target Speech Hearing system, the wearer presses and holds a button on the headphones for a few seconds while facing the person they want to focus on. During this enrollment process, the system captures an audio sample from both headphones and uses this recording to extract the speaker's vocal characteristics, even when other speakers or noise are nearby.
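One plausible reason that facing the speaker helps, sketched below as an assumption rather than as the team's actual method: sound from straight ahead reaches both ears at the same moment, while sound from the side reaches one ear slightly later. Averaging the two channels therefore keeps the frontal voice intact and weakens off-axis sound. The sample rate, tone frequencies, and five-sample delay are made-up illustration values, and the pure tones stand in for real voices.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz
N = 1600             # 0.1 seconds of audio

def tone(freq, delay=0):
    """A sine wave standing in for a voice, optionally delayed by whole samples."""
    return [math.sin(2 * math.pi * freq * (i - delay) / SAMPLE_RATE) for i in range(N)]

def energy(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def average(left, right):
    """Average the two ear channels sample by sample."""
    return [(l + r) / 2 for l, r in zip(left, right)]

# Target straight ahead: identical at both ears.
target_left = tone(440)
target_right = tone(440)

# Interferer off to one side: reaches the right ear 5 samples later.
interferer_left = tone(1000)
interferer_right = tone(1000, delay=5)

target_kept = energy(average(target_left, target_right))
interferer_kept = energy(average(interferer_left, interferer_right))
```

In this toy setup the target's energy is unchanged after averaging, while the interferer loses well over half of its energy. A real system would need far more than this, since voices are broadband and the delay between the ears depends on direction.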

These characteristics are fed into a second neural network running on a microcontroller computer connected to the headphones via a USB cable. This network runs continuously, separating the chosen voice from everyone else's and playing it back to the listener. Once the system has locked onto a speaker, it keeps prioritizing that person's voice even if the wearer turns away. The more training data the system gathers by focusing on a speaker's voice, the better it becomes at isolating that voice.

For now, the system can only enroll a target speaker whose voice is the loudest one present, but the team aims to make it work even when the loudest voice in a particular direction is not the target speaker.

Distinguishing a single voice in a noisy environment is extremely difficult, says Sefik Emre Eskimez, a senior researcher at Microsoft who works on speech and AI but was not involved in the study. “We know businesses want this, and if we can make it happen, it opens up a lot of uses, especially in meeting scenarios,” he says.

While research into speech separation is often more theoretical than practical, this work has clear real-world applications, says Samuel Cornell, a researcher at Carnegie Mellon University's Language Technologies Institute who was not involved in the study. “I think this is a step in the right direction. It's a breath of fresh air,” Cornell says.

