Google Works Out a Fascinating, Slightly Scary Way For AI To Isolate Voices In a Crowd (arstechnica.com)
An anonymous reader quotes a report from Ars Technica: Google researchers have developed a deep-learning system designed to help computers better identify and isolate individual voices within a noisy environment. As noted in a post on the company's Google Research Blog this week, a team within the tech giant attempted to replicate the cocktail party effect, or the human brain's ability to focus on one source of audio while filtering out others -- just as you would while talking to a friend at a party. Google's method uses an audio-visual model, so it is primarily focused on isolating voices in videos. The company posted a number of YouTube videos showing the tech in action.
The company says this tech works on videos with a single audio track and can isolate voices in a video algorithmically, depending on who's talking, or by having a user manually select the face of the person whose voice they want to hear. Google says the visual component here is key, as the tech watches for when a person's mouth is moving to better identify which voices to focus on at a given point and to create more accurate individual speech tracks for the length of a video. According to the blog post, the researchers developed this model by gathering 100,000 videos of "lectures and talks" on YouTube, extracting nearly 2,000 hours worth of segments from those videos featuring unobstructed speech, then mixing that audio to create a "synthetic cocktail party" with artificial background noise added. Google then trained the tech to split that mixed audio by reading the "face thumbnails" of people speaking in each video frame and a spectrogram of that video's soundtrack. The system is able to sort out which audio source belongs to which face at a given time and create separate speech tracks for each speaker. Whew.
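For readers who want a feel for the "synthetic cocktail party" step described above, here is a minimal sketch in Python with NumPy. It is not Google's code. The pure tones standing in for speech, the helper names `stft_mag` and `make_mixture`, the noise level, and the ratio mask are all assumptions made for illustration. The summary does not say how the model actually separates the mix, and the face-thumbnail input is left out entirely.

```python
import numpy as np

def stft_mag(signal, frame=256, hop=128):
    """Magnitude spectrogram from Hann-windowed frames (frames x frequency bins)."""
    window = np.hanning(frame)
    starts = range(0, len(signal) - frame + 1, hop)
    frames = np.stack([signal[s:s + frame] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

def make_mixture(clean_a, clean_b, noise_level=0.05, seed=0):
    """Sum two clean tracks and add artificial background noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    noise = noise_level * rng.standard_normal(len(clean_a))
    return clean_a + clean_b + noise

# Stand-ins for two clean speech segments: pure tones, one second at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 1000 * t)

# The "synthetic cocktail party": a single mixed track with known ground truth.
mixture = make_mixture(speaker_a, speaker_b)

spec_a = stft_mag(speaker_a)
spec_b = stft_mag(speaker_b)
spec_mix = stft_mag(mixture)

# Assumed training target: a ratio mask giving speaker A's share of each
# time-frequency cell. A trained model would have to predict this from the
# mixture (and, in Google's case, the video); here it is computed from the
# known clean tracks.
mask_a = spec_a / (spec_a + spec_b + 1e-8)
estimate_a = mask_a * spec_mix

# The masked mixture should sit closer to speaker A than the raw mixture does.
err_mix = np.mean((spec_mix - spec_a) ** 2)
err_est = np.mean((estimate_a - spec_a) ** 2)
print(err_est < err_mix)
```

The point of building the mix from clean tracks is that the right answer is known for every training example, so a model's output can be scored directly against it.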
Might be useful for sorting out what political pundits are saying when they try to overspeak each other.
If it weren't for deadlines, nothing would be late.
1) Employ a robot.
2) Instruct the robot to kill the people in the room, one by one, until the target voice is no longer heard.
#DeleteChrome
Not that I need one yet (actually I do), but a hearing aid (not that I need one) that can pull voices out of the background noise of life (not that I can't do that myself) would be really handy. For the people that really need hearing aids. Not me of course.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
So far it's only able to isolate Fran Drescher's voice in a crowd of Amish people. But they're improving it every day.
SJW: Someone who has run out of real oppression, and has to fake it.
The reason we still use court reporters at hearings, depositions and trials is that they can distinguish between voices when folks talk over the top of each other. That's why tape/digital recorders haven't replaced live stenographers. Not yet. That might be the real market for this Google product: automated production of transcripts from depositions.