Getting in Touch with Our ‘Gerbil Brain’ Could Help Machines Listen Better

It could mean more adaptable and efficient hearing devices ranging from hearing aids to smartphones.

Macquarie University

May 7, 2024

Macquarie University researchers have debunked a 75-year-old theory about how humans determine where sounds are coming from. The finding could help create a new generation of more adaptable and efficient hearing devices, ranging from hearing aids to smartphones.

In the 1940s, an engineering model was developed to explain how humans can locate a sound source from differences of just a few tens of millionths of a second in the time the sound takes to reach each ear.

This model assumed that we must have a set of specialized detectors whose only function is to determine where a sound is coming from, with each location in space represented by a dedicated neuron.
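The 1940s model is generally identified with Lloyd Jeffress's 1948 "place theory" of sound localization. The sketch below is a minimal, textbook-style abstraction of that idea, assuming digitized left- and right-ear signals: each candidate time lag plays the role of one dedicated detector, and the detector with the strongest left/right coincidence marks the sound's location. The function name, the 48 kHz sample rate and the 800-microsecond search range are illustrative assumptions. This is not the researchers' code, and the new study argues that the brain does not actually work this way.

```python
import random


def detector_bank_itd(left, right, fs, max_itd_s=800e-6):
    """Estimate the interaural time difference (ITD), in seconds.

    Hypothetical helper. Each candidate lag acts as one dedicated "detector":
    it delays the left-ear signal by `lag` samples and measures how well it
    coincides with the right-ear signal. The strongest detector wins.
    Positive ITD means the sound reached the left ear first.
    Assumes `left` and `right` have the same length.
    """
    n = len(left)
    max_lag = int(round(max_itd_s * fs))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Only use sample pairs where both left[i - lag] and right[i] exist.
        start, stop = max(0, lag), min(n, n + lag)
        overlap = stop - start
        if overlap <= 0:
            continue
        # Mean product of the two ears = this detector's coincidence strength.
        score = sum(left[i - lag] * right[i] for i in range(start, stop)) / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs


if __name__ == "__main__":
    fs = 48_000      # assumed sample rate, Hz
    delay = 12       # right ear lags by 12 samples, i.e. 250 microseconds
    n = 4_800        # 0.1 s of noise
    rng = random.Random(0)
    source = [rng.gauss(0.0, 1.0) for _ in range(n + delay)]
    left = source[delay:]   # left[i] = source[i + delay]
    right = source[:n]      # so right[i] = left[i - delay]
    itd = detector_bank_itd(left, right, fs)
    print(f"estimated ITD: {itd * 1e6:.0f} microseconds")
```

Note how many detectors this needs: one per resolvable lag (77 here), each doing nothing but localization. That is the "dedicated neuron per location" assumption the new research challenges.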

Its assumptions have been guiding and influencing research – and the design of audio technologies – ever since.

But a new research paper published in Current Biology by Macquarie University Hearing researchers has finally revealed that the idea of a neural network dedicated to spatial hearing does not hold.

Lead author David McAlpine, Distinguished Professor of Hearing at Macquarie University, has spent the past 25 years showing that one animal after another actually uses a much sparser neural network, in which neurons on both sides of the brain perform this function in addition to others.
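For contrast, here is a toy illustration of a sparse, two-sided code, assuming just two broadly tuned channels, one per hemisphere, instead of one detector per location. Each channel is assumed to fire more for sounds leading at the opposite ear, and location is read from the difference between the two firing rates. The logistic tuning curves, the `SLOPE_PER_US` value and the function names are made-up illustrative choices, not the model or parameters from the Current Biology paper.

```python
import math

# Hypothetical slope of each hemisphere's tuning curve (per microsecond of ITD).
SLOPE_PER_US = 0.02


def hemisphere_rate(itd_s, side, slope_per_us=SLOPE_PER_US):
    """Toy normalized firing rate (0..1) of one hemisphere's broadly tuned channel.

    Positive ITD means the sound reached the left ear first. Each channel is
    assumed to prefer sounds leading at the opposite ear, so the right
    hemisphere's rate rises with positive ITD and the left's falls.
    """
    sign = 1.0 if side == "right" else -1.0
    x = sign * slope_per_us * itd_s * 1e6
    return 1.0 / (1.0 + math.exp(-x))


def decode_itd(rate_left, rate_right, slope_per_us=SLOPE_PER_US):
    """Read the ITD (seconds) from the difference between the two channels.

    With logistic tuning, rate_right - rate_left = tanh(k * itd_us / 2),
    where k is the slope, so the ITD follows from the inverse tanh.
    """
    diff = rate_right - rate_left
    itd_us = 2.0 * math.atanh(diff) / slope_per_us
    return itd_us * 1e-6


if __name__ == "__main__":
    for itd in (-300e-6, 0.0, 250e-6):
        r_left = hemisphere_rate(itd, "left")
        r_right = hemisphere_rate(itd, "right")
        print(f"ITD {itd * 1e6:+.0f} us -> rates L={r_left:.3f} R={r_right:.3f}, "
              f"decoded {decode_itd(r_left, r_right) * 1e6:+.0f} us")
```

In this noiseless toy the decoding is exact. The design point it illustrates is that two channels, rather than a bank of location-specific detectors, can be enough to encode where a sound is.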

Distinguished Professor of Hearing David McAlpine in the Macquarie University anechoic chamber. (Macquarie University)

Showing this in action in humans was more difficult.

Now, by combining a specialized hearing test, advanced brain imaging, and comparisons with the brains of other mammals, including rhesus monkeys, he and his team have shown for the first time that humans also use these simpler networks.

“We like to think that our brains must be far more advanced than other animals in every way, but that is just hubris,” Professor McAlpine says.

“We’ve been able to show that gerbils are like guinea pigs, guinea pigs are like rhesus monkeys, and rhesus monkeys are like humans in this regard.

“A sparse, energy efficient form of neural circuitry performs this function – our gerbil brain, if you like.”

The research team also proved that the same neural network separates speech from background sounds – a finding that is significant for the design of both hearing devices and the electronic assistants in our phones.

All types of machine hearing struggle with the challenge of hearing in noise, known as the ‘cocktail party problem’. It makes it difficult for people with hearing devices to pick out one voice in a crowded space, and for our smart devices to understand us when we talk to them.

Professor McAlpine says his team’s latest findings suggest that rather than focusing on the large language models (LLMs) that are currently used, we should be taking a far simpler approach.

“LLMs are brilliant at predicting the next word in a sentence, but they’re trying to do too much,” he says.

“Being able to locate the source of a sound is the important thing here, and to do that, we don’t need a ‘deep mind’ language brain. Other animals can do it, and they don’t have language.

“When we are listening, our brains don’t keep tracking sound the whole time, which the large language processors are trying to do.

“Instead, we, and other animals, use our ‘shallow brain’ to pick out very small snippets of sound, including speech, and use these snippets to tag the location and maybe even the identity of the source.

“We don’t have to reconstruct a high-fidelity signal to do this, but instead understand how our brain represents that signal neurally, well before it reaches a language centre in the cortex.

“This shows us that a machine doesn’t have to be trained for language like a human brain to be able to listen effectively.

“We only need that gerbil brain.”

The next step for the team is to identify the minimum amount of information a sound needs to convey while still supporting the maximum amount of spatial listening.