This is known as Blind Source Separation [1], and it's been a field of study for decades. The specific problem here seems to be the "cocktail party problem", where you want to isolate a single speaker (or in this case 5?) in a room full of conversations.
When I was in grad school, I knew an EE research group in the building next to mine working on this problem using ICA (independent components analysis) -- this was ca 2004, before the resurgence of deep learning. Even with ICA useful results could be obtained.
The results of the FB work [2] with RNNs are pretty impressive (audio samples).
I feel that they are underplaying just how big a deal this would be in hearing aids. It’s not just a case of “slightly better noise filtering”; for some it is the difference between being able to go to social events or not. For a large group of people using hearing aids, the cocktail party effect means they can’t hear anything at all in social settings, so they avoid them completely because of the negative effects that come from everyone assuming you’re able to follow group conversations when you’re in fact sitting in your own little bubble, only able to pick up what’s going on when someone semi-yells directly at you.
In any case, the box you’d be selling this product in wouldn’t say “better sound”; it would say: “Get back your ability to attend and enjoy parties, enjoy group conversations, and socialize unencumbered.” That’s a huge quality of life improvement.
You still have the issue of how to figure out which voices to boost and which to reduce, but I’d expect that to be a simpler issue of using multiple receivers and directional detection.
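For the "which voices to boost" part, here is a rough sketch of what two-receiver directional detection could look like: a simple delay-and-sum beamformer. The function name, microphone spacing, and sample rate are illustrative assumptions on my part, not anything from the article.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steer_and_sum(left, right, target_angle_rad, mic_spacing=0.15, sample_rate=16000):
    """Two-microphone delay-and-sum beamformer (illustrative sketch).

    Delays one channel so a wavefront arriving from `target_angle_rad`
    (0 = straight ahead) lines up in both channels, then averages them.
    The target direction adds coherently; off-axis sources stay
    misaligned and are partially attenuated.
    """
    delay_s = mic_spacing * np.sin(target_angle_rad) / SPEED_OF_SOUND
    shift = int(round(delay_s * sample_rate))
    # np.roll wraps samples around the ends -- crude, but fine for a sketch.
    aligned_right = np.roll(right, shift)
    return 0.5 * (left + aligned_right)
```

Real hearing aids presumably do something far more sophisticated (adaptive filtering, frequency-domain beamforming), but even this toy version shows why multiple receivers make the "who to boost" question tractable.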
Facebook's work on separating multiple sources in an audio stream is fundamentally different from prior ICA-based methods of Blind Source Separation [0] in ways that are both interesting and seem to be part of a broader trend at FB Research.
ICA-based BSS requires at least n microphones to separate n sources of sound. This work does the separation with one microphone.
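For contrast, this is roughly what the classic multi-microphone ICA setup looks like using scikit-learn's FastICA. The signals and the 2x2 mixing matrix are synthetic, just to illustrate the n-microphones-for-n-sources constraint that the single-microphone work removes.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 4000)

# Two synthetic "voices": a sine and a square-ish wave, plus a little noise.
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
sources += 0.05 * rng.standard_normal(sources.shape)

# Classic ICA-based BSS needs at least as many microphones as sources:
# each row of the mixing matrix is one microphone hearing a blend of both.
mixing = np.array([[1.0, 0.5],
                   [0.6, 1.0]])
observations = sources @ mixing.T  # shape (samples, n_microphones)

# FastICA recovers the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
estimated_sources = ica.fit_transform(observations)
```

With only one observation column there is nothing for ICA to unmix, which is what makes the single-microphone result notable.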
What makes this more broadly interesting is FB Research has separately developed the capability to reconstruct full 3D models from single image photos [1].
Both of these reconstruct-from-single-sensor problems are MUCH harder than their associated reconstruct-from-multiple-sensors variants (ICA in the case of audio, stereo separation or photogrammetry in the case of video) so they aren't efforts one undertakes casually.
The obvious motivation for this single-sensor approach is augmenting existing video and audio clips, most of which are single camera, single microphone (or very closely spaced stereo microphones with minimal separation), and which people have already uploaded to Facebook in massive numbers.
The more interesting motivation could be that FB (Oculus) is widely believed to be developing next generation AR or VR glasses. Most of the discussion around AR/VR headsets focuses on the displays, but if you wanted to keep both your physical size and hardware parts cost to an absolute minimum, one of the things you'd want to minimize is your sensor count.
FB Research seems to have a strong interest in things that reduce the number of sensors required to provide high grade AR/VR experiences and that make it possible to explore pre-existing conventional media in spatialized 3D contexts.
This is a really interesting problem to work on. A couple obvious points:
1. This is a task that humans must do all the time. It's very important in all kinds of different circumstances.
2. This is also a task that humans find very difficult. It's not like recognizing someone by their face, where humans do it effortlessly but struggle to describe how. We frequently fail at this.
Combining (1) and (2), and the assumption that this task has been just as important historically as it still is now, we might conclude that this is a really hard problem and AI is unlikely to reach the level of performance we might hope for.
And if AI quickly jumps to superhuman levels of performance, that too would have many interesting implications.
I have some skepticism. Humans can’t process multiple audio streams simultaneously. Recognizing faces is a poor analogue to compare with; it would be more similar to photostitching several faces on top of each other and asking a human to identify each component face, which humans are also terrible at.
But we also rarely need realtime solutions for this. Casually tracking gross features from a mixed audio source, like a big crowd, is easy, but we almost never need to be actively listening and disambiguating between several deliberate audio signals all at once. Human communication just isn’t set up that way, though I’m sure there are niche exceptions.
Note here that video conferencing is emphatically not an example of an exception. You still need to have synchronous order of speaking, not because technology lacks the ability to separate the streams for the listener, but because the listener cannot pay attention to more than one stream at a time.
To me this technology seems almost exclusively useful for surveillance and offline audio analysis or audio synthesis / mixing.
Still possibly valuable, but definitely not in any fundamental way that would significantly change or augment realtime verbal communication.
> Humans can’t process multiple audio streams simultaneously.
I'm not sure this is true. I think our comprehension drops when we do, but we regularly process background noise and speech at the same time, same for people interjecting over others with simple things like 'excuse me' or similar.
I may be misunderstanding or stretching your point but it's interesting to think about.
I also immediately jumped to surveillance tech when I read it, though I can think of other areas where it would be useful. Home assistants like they mention in the article would be a great use case, not for multiple inputs at the same time, but to isolate the speaker addressing the home assistant.
We are much better at faces than we are at voices:
• Imagine I show you 10 photos of faces. Then I come back 15 minutes later and ask you to pick them out from a set of 20 faces.
• Imagine I play you 10 audio snippets of someone talking. Then I come back 15 minutes later and ask you to pick them out from a set of 20 audio snippets of people talking.
That said, I agree with you that primarily this technology would be useful for surveillance.
If you can zoom in on any given conversation happening at New York Penn Station, you're basically in the Bourne Identity, with all the privacy-destroying overreach that entails.
I think you completely missed the point. The overall fact that humans are better at face identification than audio identification isn't what's at issue here; that high-level fact is not related to this specific problem or the comparison being made.
Disambiguating speakers, relative to our skill at other audio processing tasks, is more like disambiguating superimposed faces relative to our skill at other vision tasks.
It is not similar to quickly recognizing several different faces that are present in a video stream.
Tasks requiring focused attention on multiple input streams at once don't map well onto human visual or audio processing. These types of tasks are not similar to attending to things in peripheral vision or peripheral hearing.
I have audio playing in my head constantly. It's how I learn about the world. It also seems to be inverted from when I was younger; visuals were much more vivid then.
Same, I can often identify actors on TV shows or movies by their voice, even when the material is 20 to 30 years older or newer than when I originally heard that voice.
> Humans can’t process multiple audio streams simultaneously.
Yes they can, they can process two, and no more than two. Being able to listen to a conversation where multiple people are talking at the same time isn't processing different audio streams. It's assigning a set of sounds heard over time by timbre, subject (meaning) or relative volume to subsets representing different theoretical sources. People have very little problem doing this with a small number of potential sources (probably another seven plus or minus two thing.)
Even in large crowds, we can often pick out a few individual voices of interest (if they're distinctive enough in timbre, position and movement (judged by volume and pitch changes), or accent/wording/subject) and follow those voices while disregarding the rest of them.
edit: we might suck at multitasking, but that's not a problem a computer would have. I can't follow multiple speakers at once if they're not interacting with each other, and they're speaking quickly and overlapping, but that's a problem with context switching. With a computer I can just route the separated voices to systems that will handle them. Everybody gets their own waiter.
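To make the "everybody gets their own waiter" point concrete, a toy sketch of what fanning separated streams out to per-speaker workers could look like (the `handle_speaker` function is a hypothetical stand-in for whatever per-speaker processing you'd actually run, e.g. transcription):

```python
from concurrent.futures import ThreadPoolExecutor

def handle_speaker(speaker_id, audio_samples):
    # Hypothetical per-speaker handler: in practice this could run ASR,
    # captioning, logging, etc. on one separated voice.
    return f"speaker {speaker_id}: processed {len(audio_samples)} samples"

def fan_out(separated_streams):
    """Give each separated voice its own worker -- no human context switching."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(handle_speaker, i, stream)
                   for i, stream in enumerate(separated_streams)]
        return [f.result() for f in futures]
```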
I think you're right -- we mostly only talk to 1 person at a time.
Where it might be useful is automated live captioning (already a feature in Microsoft Teams), especially in cultures that favor a "high involvement" conversation style where people interrupt and talk over each other (vs a "high consideration" style where everyone waits their turn). This tech might help improve transcript quality.
In the article, under the subheading Why It Matters, the authors say this:
> The ability to separate a single voice from conversations across many people can improve and enhance communication across a wide range of applications that we use in our daily lives, like voice messaging, assistants, and video tools, as well as AR/VR innovations. It can also improve audio quality for people with hearing aids, so it’s easier to hear others clearly in crowded and noisy environments such as parties, restaurants, or large video calls.
>
> Beyond its separating different voices, our novel system can also be applied to separate other types of speech signals from a mixture of sounds such as background noise. Our work can also be applied to music recordings, improving our previous work on separating different musical instruments from a single audio file. As a next step, we’ll work on improving the generative properties of the model until it achieves high performance in real-world conditions.
Modern hearing aids do various things in an attempt to separate voices in order to present them more cleanly in the 3D spatial field. It's usually referred to as voice tracking, but different companies have different buzzwords. This does very much make it easier for people with hearing aids in both ears to locate a specific voice and focus on it to the exclusion of others, pretty much as your ears already do. They just make the stereo division more clear. So I think this newer technology is taking what already exists to the next level.
Some settings can also lower the volume of voices you aren't looking straight at. It's really quite useful if you are in a conference and everyone keeps shuffling around and trying to talk over the person who is supposed to be speaking, for example.
The weird side effect is something I jokingly call "spy mode" which allows me to intensify voices behind me instead of ahead of me. Apparently it's more to benefit drivers, but it's cool hearing clearly everything going on behind you, too. Interestingly I can't use the on-board controls or even the remote to enable the "spy mode," but I can when I am connected (via bluetooth) to my cell phone and control them with the app.
Not an expert in this domain but I'm not sure this can be done (well) without a physical component.
Recent studies have shown that we can consciously and subconsciously physically manipulate the position and directionality of our outer ear and some of the mechanics in the inner ear to "zero in" on noises and affect the frequency response of the ear. Our ears move imperceptibly when we look from side to side to synchronize what we hear with what we see. Try listening to what one person in a busy room is saying, then try doing the same while looking somewhere else.
There is hardware actively filtering out interfering sounds based on location and frequency, then there's the wetware that further processes the incoming signals and attempts to strip unwanted noise. I don't believe the second can be effectively done without a feedback loop to the first.
The "Why it matters" section is interesting. Cynically I'm trying to think of commercial uses of this for FB. I'm thinking if you built a device that you could put into public places, restaurants, or stores:
* People could order food from their table without summoning a server. I guess some restaurants have tablets or other devices at their table but it seems to break immersion if you're enjoying your company.
* In a big box store, someone could come help you where you are, without having workers roam the store while you hope you run into one.
* Fingerprint people in public or private for targeted advertising.
The assumption that this is possible comes from our ability to isolate voices in a crowd by paying attention to one or more of them. However, our ability to do so rests on two important factors that don't exist in these datasets: 1. We have two ears to allow for sound localization, and 2. The sounds we distinguish are colocated in space, allowing us to use ambient information for disambiguation.
This means that the problem being solved here is harder than the natural problem we have evolved and learned to solve.
This is both impressive and possibly problematic. Some feature of training in a goal directed fashion in naturalistic environments could be essential for higher quality speaker isolation, or it might not matter at all. The multiplicity of models phenomenon tells us there are likely many solutions to this problem.
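To illustrate how much work the two-ear cue does, here is a minimal sketch of interaural-time-difference localization; the ear spacing and far-field geometry are simplifying assumptions, and a single-microphone model gets none of this information.

```python
import numpy as np

def interaural_time_difference(left, right, sample_rate):
    """Estimate the left/right arrival-time difference via cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lag_samples = np.argmax(corr) - (len(right) - 1)
    return lag_samples / sample_rate

def azimuth_from_itd(itd_s, ear_spacing=0.2, speed_of_sound=343.0):
    """Convert the time difference into a rough azimuth (radians), assuming a far-field source."""
    ratio = np.clip(itd_s * speed_of_sound / ear_spacing, -1.0, 1.0)
    return float(np.arcsin(ratio))
```

Note this only resolves left vs. right; the front/back ambiguity remains, which is exactly what the next comment gets at.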
It’s potentially even harder than this. Not only do we have two ears, our ears are shaped to help determine where sounds are coming from, and we make constant movements to further help process audio. I remember years ago I played an FPS capture-the-flag type game with headphones, and (not everyone knows this) the math works out such that the same sound can come from in front or behind, so I developed a strong tendency to make small mouse movements, rotating left and right, to help me identify where footsteps were coming from. I didn’t even know I was doing it until I talked to an audio engineer who mentioned this problem, and the next time I played, I realized how much I relied on it to avoid getting blindsided.
[1] https://en.wikipedia.org/wiki/Signal_separation
[2] https://enk100.github.io/speaker_separation/