It's also obviously not a pure human voice recording as the pitch transitions sound heavily auto-tuned or electrified (think Cher's "Believe").
I anticipate people becoming experts in detecting AI-generated vocalists in much the same way that we can currently detect AI-generated images due to abnormalities especially in details like ears or fingers.
And I also expect that, very soon, we won't be able to tell them apart anymore (like those wine experts that fail to detect the good wines if blindfolded).
Non-singing TTS voices are barely discernible now. I watch a lot of narration-heavy edu-tainment on YouTube, and often the only way I can detect TTS is the consistent monotone and uniform syllable cadence. There can be 15 minutes before a single mispronounced word is spoken. That could be a preview of what's to come with AI video.
I've heard this, and I would have been inclined to believe it. But then I watched the documentary Somm about the journey of a couple of friends reaching for the highest rankings of sommeliers. They could identify grapes, regions and year with striking accuracy. I just don't see how you could do that and then not be able to tell white and red wine apart.
I barely know what I’m doing with wine but am 100% sure I could at least tell you which are whites and which reds if you lined up a typical Chardonnay, a typical Pinot Grigio, a typical cab sauv, and a typical Pinot noir.
I am certain there exist weird wines that could fool me (I’ve had a few really weird wines) but typical shit from the grocery store, I’m gonna be able to tell at least that much. I might even ID them more precisely than red or white. It’s not exactly subtle…
Then again I don’t have a clue how someone could fail to tell which is coke and which Pepsi in the “Pepsi challenge”. They’re wildly different flavors. I can tell by smell alone.
I vaguely remember looking into this before, and it turned out that the tasters were being told (incorrectly) that it was a red wine, and asked to describe the flavour profile. They then used tasting terms more frequently associated with reds than with whites, and didn't question what they were told.
So it's less a case of "they cannot distinguish red from white" and more a case of "they went along with a suggested classification". I feel like this is a weaker result, although it's still a little surprising.
My feeling is there is the high level classification which is quite difficult to fuck up. After that it’s all adjectives and analogues, which is the fluffed up phoniness that inherently presents itself in the process of converting our subjective experiences of physical reality into abstract symbols.
I've given this map to half a dozen smart/well educated Canadians, who happily engaged in pointing out the states they recognized for several minutes, and not one of them noticed until it was pointed out.
I'm from the UK and would probably have fallen for this for several minutes as well. I hope that I'd eventually realise from the number of states down the West coast.
For anyone else who is geographically challenged, the map apparently has 64 states instead of 50. Here is an article with the extra states highlighted.
Yeah, but that still shows people's perception of wine is barely above noise level, if it can be so easily misled.
For comparison, imagine someone showing a piece of Picasso to art critics and saying "Could you please describe the artistic significance of this painting by da Vinci?" The critics won't start using terms commonly reserved for Renaissance era; they'll say "What the fuck are you talking about, this isn't da Vinci."
A lot of the biggest perceived differences come from temperature, since red wines are usually served at room temperature. If you ever decide to do a blind test, make sure to control for temperature. I did, and I had a very hard time picking out which varietals were red and which were white.
I rarely drink wine (less than once every few years) and I can tell the difference between a red wine and a white wine, and between subcategories of red wines (and I do specifically mean the difference, i.e. only when compared to another wine).
The hard part is identifying the type of wine, but many of my wine-drinking friends can do so with ease. We've tried the "test," having me or someone else randomly purchase wines from the closest store and then serving random samples to them while they're blindfolded. They're able to identify the specific variety more than 4/5 of the time.
Yeah, I'm sure a lot of these tasters are overly pretentious. But some people are willing to go the opposite extreme and think people can't taste anything. Can anyone tell the difference between Coke and Sprite? Between Coke and Pepsi? Coke and Diet Coke? Of course we can. The difference between a typical pinot noir, syrah, or cabernet sauvignon is not something it takes magic powers to differentiate. Now specific years, wineries, etc, now that raises questions.
This one shows poor "expertise" more than anything, as it's a standard exercise while training to become a wine expert in France (they also give students white wines that have been red-colored or otherwise tampered with), so I wouldn't expect any legitimate expert to be fooled this way. Though it's true that with some wines it can initially be tough for enlightened amateurs.
Source: my wife's godfather did the studies for that[1] two years ago.
This myth is based on a fundamental misunderstanding of the experiment conducted. The conclusion of the experiment was that the vocabulary used to describe wine is subjective, and that the chosen descriptors are most heavily influenced by the color of wine, the perceived cost of the wine, and the taster's opinion of whether it was a good wine or not.
I've participated in a blind wine tasting, and it was trivially easy for even complete amateurs to guess the right color of wine 100% of the time.
As far as I can tell, AI image generation still struggles with some things after many years of research and is often detectable. Perhaps vocals are easier, though.
It's like CGI: you only notice the bad examples, while the good ones go right past you. I've got plenty of AI generations that fool professional photo retouchers - it just takes more time and some custom tooling.
Generative text-to-image models based on neural networks have been developing since around 2015. Dall-E was the first to gain widespread attention in 2021. Then later models like Stable Diffusion and Midjourney.
"Quadrupled" is a very specific and quantitative word. What measure are you basing that on?
Ah right. But that's only tangentially related to being able to distinguish AI-generated images. There are tells that are completely separate to resolution, such as getting the correct spacing of black keys on a piano.
Many human vocalists have similar aberrations. Remember Jimi Hendrix "Excuse me while I kiss this guy", or the notorious autotune on a number of contemporary pop artists (you gave an example yourself)?
IMHO many of the successes of "artificial intelligence" come from "natural stupidity". Humans have many glitches in our perceptual mechanisms. The AIs that end up going viral and become commercially viable tend to exploit those perceptual glitches, simply because that's what makes them appeal to people.
The difference between this and Hendrix's "kiss this guy" is that you can listen to it and plausibly believe that Hendrix is actually saying "the sky". In the linked track you know the actual words but it still doesn't sound like them.
The average modern pop song has a worse voice that sounds more autotuned than this AI song here. The biggest problem to me is how shaky the voice sounds.
I'm going to go out on a limb and say you don't listen to much "modern pop" - the production quality of the biggest mainstream pop is _extremely_ high at this point, and while "worse voice" is obviously subjective, this really wouldn't be anything stand-out in that regard even if it didn't sound like a robot.
I mean, if you dislike the sound of an autotuned voice (which a lot of people do), then there are a lot of modern(ish?) pop artists that use it heavily who you're gonna hate: Travis Scott, Lizzo, Neo, Kesha, T-Pain, Doja Cat ...
> I anticipate people becoming experts in detecting AI-generated vocalists in much the same way that we can currently detect AI-generated images due to abnormalities especially in details like ears or fingers.
People fail to identify even the most basic and obvious fakes, but somehow there’s a group of people who think that as fakes become harder to distinguish from reality, we’ll all magically become experts at it. We won’t. People’s ability to detect fakes will get worse, not better, as a consequence of more prevalent and better fakes.
I'm always surprised by this idea too. I can easily see an outcome of it becoming increasingly hard to convince people that real things are actually real.
At that point, is it AI generated?? That feels like an entirely different category to me
(like it's sort of no different than paying someone to voice something and share it)
I think the stuff that is completely generated with no human in the loop is a different category for me because it can be used for things at scale, like bots on social media, or ads in a podcast generated just for you, etc. As long as there is still a human in the loop making the editing decisions, it feels not categorically different from the world we have today.
That's a fair point, but "ai does the work and humans clean up the mistakes" is generally a lot faster than humans doing all the work. Singing well takes skill (even when you have autotune), splicing together multiple "takes" into one good recording, less so.
I'm never going to have the voice to sing this, but I can easily imagine learning how to edit it.
AI/human combos can still be valuable. More broadly, I'd argue that's how almost all tech works. E.g. there are still textile workers, just many fewer of them producing much more clothing.
I would say it's still categorically different, just because we're automating one piece of labor that, until about ~12 years ago, was thought to be un-automatable.
Like, there have been computer-singing voices for a while, but they always sounded pretty robotic and goofy (e.g. Microsoft Sam), and I think for a long time people just assumed that to get mostly-realistic voices, you need an actual singer. Yes, it still requires a bit of human tweaking to make it perfect, but I suspect that if put to the test it would reduce the cost of making a song substantially.
* sublicence - "sublissence"
* fitness - "fisted"
* infringement - "infring-ment"
* liable - "liar-ful"