The video fakes are always blurry around the mouth and gestures don't match the words.
The audio fakes are too "clean" (too defined breaks between words and no background noise) and sound like caricatures.
Text is virtually impossible because politicians say "fake" stuff all the time (e.g. speeches someone else wrote) so you can't rely on uncharacteristic language.
The only mistakes I made were calling real ones fake, but I was always able to detect the fakes. Some of the audio sounded too clean but turned out to be real.
It would be interesting to test how well any of us could spot these fakes if we weren't already on a website priming us to look for them. How much of the skepticism comes from being on "alert" versus one's baseline level of skepticism?
To me they always look like uncanny valley so my brain flags them as fishy. Everything deepfake feels like watching Gemini Man to me.
But then again, I am pretty paranoid as I do think the media and web are not out there to provide me with quality; they want to achieve money and/or power and will try to manipulate me in order to do that so I would assume fake in case of doubt.
A much easier criterion: the real speakers were speaking continuous sentences with logical spots for breathing pauses; the fakes spoke in discrete words. And I guess that's also why I had a harder time with the silent clips, because then you don't have that signal.
> The audio fakes are too "clean" (too defined breaks between words and no background noise) and sound like caricatures.
The lack of background noise alone was a pretty instant key for me. And adding realistic background noise, plus echo/reverb that would match (even roughly: one-on-one conversation vs. podium at a press briefing) the room they are speaking in, aren't even things you need a deepfake for; they would be simple audio post-processing.
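Roughly what I mean, as a minimal sketch (numpy/scipy assumed; the file names are hypothetical and the impulse response is just made up rather than measured from a real room):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    rate, speech = wavfile.read("fake_speech.wav")   # hypothetical input clip
    if speech.ndim > 1:                               # mix down to mono for simplicity
        speech = speech.mean(axis=1)
    speech = speech.astype(np.float64)

    # Low-level "room tone" so the track isn't unnaturally clean
    noise = np.random.normal(0.0, 0.01 * np.abs(speech).max(), speech.shape)

    # Crude reverb: exponentially decaying, slightly randomized impulse response (~300 ms tail)
    t = np.arange(int(0.3 * rate), dtype=np.float64)
    impulse = np.zeros_like(t)
    impulse[0] = 1.0
    impulse += 0.3 * np.exp(-t / (0.05 * rate)) * (np.random.rand(t.size) - 0.5)

    wet = fftconvolve(speech + noise, impulse)[: speech.size]
    wet = (wet / np.abs(wet).max() * 32767).astype(np.int16)
    wavfile.write("fake_speech_roomy.wav", rate, wet)

A real production would use a measured impulse response of the actual room, but even something this naive would blunt the "too clean" tell.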
The impersonators were so bad that I assumed they were intentionally added as a control. Anyone who called those real should likely be discarded from the final results.
I saw one of Donald and was like ha ha, no way, those ears are obviously AI-generated. It was real.
There’s a great Just William story where he sneaks into a waxworks exhibition and pretends to be one, and a young man comes along and points out all the defects in his utterly unrealistic anatomy. It was a bit like that.
To label the practice of speechwriting as 'fake' is deeply misguided. Is the contract you signed 'fake' because you didn't write it? Do you believe politicians should be judged primarily on their writing skills?
A politician chooses their speechwriters, works with them, and takes ultimate responsibility. The process just happens to be exactly the same as it is for all other tasks they do. I would argue a demonstrated ability to choose good people and judge their work is so essential for high government office that the process involving speechwriters is a far better measure of their suitability for office.
Speechwriters, at least good ones, also don't write 'generic' texts. They will study their client's speech patterns and adjust word choices, rhythm, humor, and other patterns to fit the person's style. For offices as high as the US president, the writers tend not to change, sometimes over decades.
The silent videos are a pointless test because it could easily be poor quality video, and what does fake mean if it’s a video of the politician speaking behind the podium?
The videos with spoken words were very easy to spot, and the ones that dealt with substantive content you could tell by the intonations.
You can still see some bizarre artifacts in the silent fakes. Biden’s teeth moving in a different direction than the rest of his jaw is an obvious tell.
Exactly my feeling about this. Biden and Trump also both ramble incoherent sentences all the time, so they do talk like GPT-3, making it much harder to discern text fakes here.
The only thing that made me score higher than random on the text segments was assuming that the stereotypical extreme things for Trump and Biden to say are usually fake.
I.e. they used a bit of truth but stretched it too far.
This is just nihilistic cynicism. Both Trump and Biden speak in easily understood sentences. I would tend to disagree with Trump more often than not, but that doesn't mean he isn't a clear communicator. Biden is different in that he chooses his words more carefully. There are many questions where it is better for him and the US not to answer, so he won't.
But nothing about either person's sentences is 'incoherent'. And 'stretching the truth too far' is something entirely different. Incoherent sentences, by definition, cannot be 'stretching the truth', because they convey no meaning.
I'm not saying everything that Trump and Biden say is like this, but the existence of so many bumbling speech examples from both makes it really hard to discern what is real and what is fake.
I found the text easy because my understanding of what they would say is based on their positions in reality rather than the fictional caricatures of them in the opposition party blogs.
If you weren't familiar then I could see it being hard.
Yeah, the fake audio clips had way too many telltale signs of a fake to be plausible. One of the Trump ones sounded like an SNL skit attempting his accent.
I only got one of the text ones wrong. I think content was my key indicator. The fakes seemed more like what the opposite party would try to make them say... like a caricature of whatever message they actually said.
I only did 12 of them before exiting but got 100% of them. I watched them on my phone without turning it landscape to simulate how I'd normally watch a brief news clip like this.
They're still not particularly convincing, especially the audio ones sound very fake.
But I think these sorts of tests are flawed. The fact that it's a test primes you to look for these details in a way you might not if these clips were shared on Twitter/Reddit/etc. The deepfakes are still pretty obviously fake, so I don't think I'd fall for one "in the wild". But I also see this style of test used for things like "can you hear the difference between these 2 audio compression formats" or "does this virtual guitar amp sound just like the real thing", and that's just not how you consume real media. For example I'd probably never know Meshuggah switched to entirely digital guitar amps years ago if it weren't for them saying it in interviews.
When deepfakes one day do get as close to the real thing as digital guitar amps have gotten to real amps, I doubt I'd notice a random r/worldnews post is faked, even if I could still reliably pass an A/B test.
This is the real test. While we (HackerNews readers) might be able to realize they are fake, the other 90% of the world won't.
The reverse side of it is if an overwhelming amount of deep fakes get called out, people will stop trusting the real ones (we have already started stepping into that reality)
Perhaps you detected them because you knew there would be 50% fakes? If you'd see one of these fakes in a CNN article, would you still reliably detect it? I think the role of the prior expectation should not be underestimated.
I think it's one of those studies where they say they are testing for X but they are actually testing for Y. Not sure what Y could be. But I don't even live in the US or consume its media, and could still figure out all of the examples I saw/listened to.
Maybe it was just the samples that I got, but I found the audio fakes extremely unconvincing - I think they sounded not much different than some comedian trying to do an impression of them.
I found the silent videos much harder, though more in the direction that I marked a lot of them as fake when they were in fact real.
I didn't get the reasoning behind the text-only examples at all. It's impossible to verify a quote just by looking at it. We know this for several centuries - that's why academia makes such a big deal of tracking sources. So what were they trying to do here?
In general, I think the performance results here might be a bit too optimistic: It's one thing to spot a fake if you are told to specifically look for one. However, if one of those were just casually embedded in some news report while the attention of the audience is somewhere completely different, I doubt we could spot them as easily. Even more so if the fake isn't of the two most well-known faces of the western world.
> I didn't get the reasoning behind the text-only examples at all.
So the researchers can quantify how much of people's ability to detect deep fakes comes from the video/audio rather than from the textual content and knowledge of the politicians' actual political positions.
I wonder if this is a good (alternative) use for a blockchain. If I know address X is from the White House, and they post on that address and it gets sourced from that address for distribution, wouldn't that be a certification of its origin?
It could also be signed using SSL or GPG or really just any other kind of PKI. In the end I need to establish a root of trust, so either way I need to know that 2c414...2d7 is indeed the hash of the chain_address/ssl_cert/root_ca/gpg_key/... - so in that regard there is no advantage of using a blockchain over a "traditional" PKI.
Amend: The real question is how to verify cropped and recompressed snippets. In academia or HN we add a "see foo et al. [1]", so probably that needs to be encoded into the data stream, e.g. "authoritative src is https://media.whitehouse.gov/permanent-url/1234[...] with hash $hash signed by $cert_hash on $data, portions are $some_encoding_here" and better be signed as well.
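As a very rough sketch of the publisher/consumer halves of that (Ed25519 via the Python cryptography package; the key distribution, the crop/snippet encoding, and the file name are all hand-waved):

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Publisher side (e.g. whoever operates media.whitehouse.gov)
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()             # published as the root of trust

    video_bytes = open("briefing.mp4", "rb").read()   # hypothetical clip
    digest = hashlib.sha256(video_bytes).digest()
    signature = private_key.sign(digest)              # distributed alongside the clip

    # Consumer side: check the downloaded clip against the published key
    try:
        public_key.verify(signature, hashlib.sha256(video_bytes).digest())
        print("signature matches the publisher's key")
    except InvalidSignature:
        print("no valid signature: provenance unknown (not necessarily fake)")

The primitive is the same whether the public key is anchored in a blockchain, an SSL certificate, or a GPG web of trust; the hard part is everyone agreeing on which key to trust and on what an unsigned clip actually means.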
Did not mean to be dismissive of your point. It is just that the overhang of blockchain as a solution to everything is so prevalent that it sometimes needs intervention.
It is true that the cryptographic primitives that blockchain is built on are relevant to many other things that we can do and have done. Blockchain in fact utilizes those properties to build a distributed shared state on which the parties have consensus. That, however, is a very special use case of those primitives. If you need message authentication, digital signatures have done that for decades and you don't need an expensive associated database to accomplish that.
I don’t think so. It is trying to solve the wrong problem, and the problem it manages to solve already has a simpler solution.
You are solving the problem where an entity (the White House) wants you to know that a message came from them. Yes, they can use a blockchain to sign it, but they can also just put it on their website, invite the press pool for an announcement, and have their PR folks reiterate the message whenever anyone asks.
This is not the problem with deepfakes. Imagine that harmful footage of the president saying something terrible emerges and someone has to decide if it is real or fake. This someone might be a simple citizen, a journalist, an editor, etc. They check the blockchain and see that it is not signed by the White House. Does that mean that it is fake? Maybe, maybe not. Maybe the president just wants to suppress it. Maybe it was an off-the-cuff remark while they thought nobody was recording them, or maybe they realised they made a mistake and decided not to sign the footage after all. Which means that the person judging the authenticity of the footage hasn't gained anything from checking the blockchain.
Just as an example: during the Trump presidential campaign the now infamous “grab them by the p*sy” footage emerged. Obviously it wasn’t an officially endorsed message from his campaign. Would you say it was fake just because it wasn’t signed by them?
Good points and you raise some issues I hadn’t considered. I honestly don’t know :)
My idea of the blockchain is that it could serve as a chain of trust, since it seems hard for anyone other than whoever owns an endpoint to tamper with it. But like you say, there are other means for that.
I also didn't find the videos hard, as lip movement looked quite fake. Maybe for some people they would fit better, but both Trump and Biden are too old to move their lips so vigorously.
As for the text: you can easily have more than 50% success rate:
- with political knowledge you know some things are unlikely to be said by a given politician
- ironically, if the text is grammatically correct, it's unlikely that it was actually spoken by a politician - real transcripts have stuttering. Of course an AI could be trained to include stuttering as well, or a bad AI could spew out complete garbage. I think in the case of both Biden and Trump that makes it especially hard to differentiate between a bad AI and them :D.
Even a genuine quote can be presented out of context.
For an example that is a bit on the nose, I recently read a quote from Sebastian Haffner about German history, that went like this:
> Germany was a much happier place at the turn of the 20th century than it was in the 80s.
Which sounds odd, coming from a left-wing, liberal journalist - unless you know from context that he was talking about the 1880s, not the 1980s. He was not wishing the Kaiserreich back, he was just comparing different historical phases of it. Nevertheless, without the context, the quote would look highly strange, even if it is technically not a fake.
The point of this survey is to improve the deepfakes. They aren't very difficult to spot when you've been primed and are paying attention, but the purpose here isn't to show off how good the fakes are or even challenge you to spot the difference -- it's to study what it is that gives a deepfake away. This'll ultimately be used to make the fakes more believable.
I know you really want to point out how the main purpose of this is polling (training), but you had to make the claim that we are just accustomed to seeing "deepfakes". So I'm gonna tell you why that's not the case.
This is maybe the 3rd time in my life I looked at any of the recent AI generated stuff since the breakthrough in the 2010s and they were all obvious. I got 6/6 correct on my first try. Like maybe if I never saw any normal video of any sort (NTSC, HD, LCD, CRT, mpeg) in my life then I wouldn't be able to tell, because I would have no conception of what the other 500 glitches caused by common video techniques are and if this is one of them as opposed to an edit. The first nimrod's face is wobbling and glasses are 2D and move in a different direction from his face. This is what I noticed in the first fraction of a second after opening the website, before even figuring out if this is what the test is or not.
This is literally like the 90s when people were like "WOAHHH these 3D effects are so real", and doom, etc. I really hope this isn't the state of the art people keep telling me about. It wouldn't surprise me though, as all modern politics are based on make believe concepts.
There's also the fact that this stuff is ultra low resolution, and at higher resolutions, mistakes will be even more obvious.
Aww, I was hoping I'd get a score at the end to see how well calibrated my predictions were, so I didn't record my accuracy on my own. But I got nothing. Now I'm a bit disappointed. I can't well do it again because I suspect I would have vague memories about which ones were fake and not – especially the ones that surprised me, which would be the most informative of my level of calibration.
I did certainly feel like I learned as I went, though. Seems like the deep fakes are a bit more monotonous, regardless of the format. The real humans have more embellishment and imperfections in their tone of voice, phrasing, and facial expressions.
This was an unsatisfying experience. The visual and audio fakes are easy to spot, while the texts are short enough and correct enough that you don't have enough to go with in making a technical decision. I did however get them right by trying to figure out if the statement made sense in the context of the respective politician's general stance on the issue discussed.
Update:
Come to think of it, IRL text deep fakes don't even make sense. You can make up whatever quotes you like, you don't need GPT3 to do it for you.
> Come to think of it, IRL text deep fakes don't even make sense. You can make up whatever quotes you like, you don't need GPT3 to do it for you.
Exactly. John Oliver pointed out a few weeks ago that the Russian media was using stock footage of other nations' military actions (Finland?) for propaganda. No need for deep fakes. The lowest effort lie can still work.
Deep fake tech is the new photoshop. It's mostly used for good (art, memes, jokes with friends, Disney films).
You can imagine a future where deep fakes are utterly indistinguishable from real life, but I think that people will learn to recalibrate themselves against the new technologies. It may even cause us to think more critically: people frequently chime in, "how is this not photoshopped?" on social media. They'll call out videos of explosions and physics as "looking fake". They'll have the same response to purported deep faked videos.
All the same, I'm glad and appreciate that academic and policy folks are taking a close look at the technology. We need frameworks, ethics, and institutional familiarity.
> Come to think of it, IRL text deep fakes don't even make sense. You can make up whatever quotes you like, you don't need GPT3 to do it for you.
They make sense for advertising and misinformation purposes.
You don't have to pay a firm to drum up interest in a brand/product; theoretically, you'd be able to generate convincing conversations about the brand between bots on Reddit or Twitter, and do it at scale. Instead of paid posts on social media, you cut out the middleman and give the perception of community support for a brand/product, despite it being artificial.
The last point goes for misinformation, as well. Want to start grassroots support for an idea/policy/person/group/place? Instead of paying teams of people to post on social media about it, you can automate the campaign and scale it with deep faked text.
I don't think deep fakes are there yet for either purpose, though, and it isn't a given that they'll be good enough for them, either.
I didn't take the test, but one thing I worry about is that the existence of mediocre fakes makes people more confident in their ability to spot fakes when in reality they might just not be catching the ones that are well-done. There aren't many feedback loops in most people's lives whereby if someone believed something that was fake they'd be corrected and learn to be more skeptical.
>one thing I worry about is that the existence of mediocre fakes makes people more confident in their ability to spot fakes when in reality they might just not be catching the ones that are well-done.
Yeah, most of the "fake" audio clips were blatantly obvious to literally everyone with a functioning brain (and hearing ability I guess) and if you thought that represented anywhere near the state of the art in faking voices you'd be far more confident than you deserve to be.
If I hadn't already heard very good audio deepfakes I would have left the survey with far less concern than I presently have around the issue. I came in expecting that the quality of the fakes would get far more convincing over time and it really didn't, but the capability definitely exists (especially around public figures with much audio available for training)
Plot twist: 100 % of the material here was faked, only 50 % well done and 50 % not so well done! Now that's a test of how well people actually recognise this stuff.
I think starting with a guy who looks almost exactly like Tom Cruise in the first place really gives the computer the advantage it needs to look believable. You can see him here
https://youtu.be/p7-B8S734T4?t=77
DeepTomCruise on TikTok is probably the best I’ve seen at creating consistently good, believable deepfakes. But it’s largely because the actor is good at imitating Tom Cruise. The video fakery varies from mediocre to excellent, but it’s the voice and mannerisms that really sell it.
I was completely expecting to be impressed with yet another demonstration of this technology taking a step forward.
I got 31/32, and the only one I got wrong was a text statement that I don't really understand the point of including. None of them were remotely difficult except for one.
Every clip's authenticity was instantly, blatantly obvious. This survey could've been done in 60 seconds, sans the `x` seconds of required viewing before submitting an answer.
I'm likewise baffled by the inclusion of text samples. It has nothing to do with deep fakes. Perhaps it's a control? I just skipped them, defaulting to the 50% skepticism slider option.
Others in this thread report similar results. I wonder if Hacker News users have a preternatural aptitude for spotting inauthentic content owing to skewed technological competence, or if that's hubris -- simple primal readings of facial expressions/inflections expose the fakes. I'd bet on the latter option. Our lizard brains are keen to detect that uncanny valley. If this recording collection is any indication, duplicity is a few years down the road...
I really wish they took actual things the individual said in the past, but used a deepfake for them actually saying it. Most of the deepfakes were easy to spot, but a couple weren't clear at first that they were fake, only to be given away by the content of their speech.
But the person saying something outrageous would undermine the point of the study--to determine if deep fakes were indistinguishable from actual footage. If it's given away by something unrelated to the deepfake method, then you're adding in an external bias unrelated to the realism of the deepfake.
Unless I misunderstood the goals of the study of course.
It doesn't necessarily have to be. For example what if you modified a clip of a political candidate asking for funding and only modified the name of the organization/website their supporters should go to in order to donate.
Getting the voice accurate would be very hard, even if you only used existing words they've said to make up the false domain. I've tried editing audio that way and it still sounds like when a video game ai stitches dialog together.
Getting the voice accurate is actually really easy, and is done all the time. https://en.wikipedia.org/wiki/Overdubbing is used in music and movie production, and you can change out a few words with "ease" [1]. You use an impressions actor, get them to say the whole line, matching the cadence, and you've got a decent amount of skew in mouth movement because people aren't always lipreading carefully.
None of these techniques are out of the budget of a college film or dedicated hobbyist.
Overdubbing is far easier with an instrument than with a voice, the article doesn't even list examples of an impressionist over dubbing someone, just overdubbed duets. Also what you're describing is not overdubbing, it's ADR. Overdubbing is just that, dubbing over the existing audio. Not only would you need to mimic the speech but you'd have to recreate the ambient sound as well. Even with a person speaking into a microphone in a silent room the change in ambience is noticeable unless the whole scene is using a false ambience track separate from the voice track.
And I disagree with you about the lipreading. There's a reason good ADR in films is always done with a shot of the back of the actors head.
Even when it's done with literally the same original speaker it's pretty obvious. Or have you never fooled a stranger in the alps?
Mildly surprised at the quality of this project coming out of MIT. Not only was the UI quite hard to use, I could not get beyond 10 samples. They were annoyingly obvious. Why would they not choose to make this as tricky as possible?
It’s clear that state of the art deep fake technology can do better than what’s presented here.
What if the whole point of this exercise is to instill (false) confidence in the most people that they have the ability to distinguish a deep fake from the real thing?
For the record, I don’t believe this is the case and I fully expect that subsequent iterations of this exercise will be far more difficult.
My guess is that they don't want to distort results. Any kind of scoring would present an incentive to start cheating (e.g. by not checking the "I've seen this before"-box), be it consciously or subconsciously.
From the presented samples I'd wager that one goal of the study might be to quantify the influence of audio-only vs video-only vs text, hence text-only, video-only, audio-only, and video with subtitles.
While text limits the decision making process to preconceived notions about the person in question and prior background knowledge, video and audio may present clear cues from human perception alone.
This is too easy. I struggled a bit with the video-only clips (no sound), because the lighting can be a bit weird, but for the voices you can hear within a split second that they're fake.
Agreed. The video and text fakes are getting better, but the audio (at least here) isn't close. It's not even an uncanny valley, it's worse than an SNL parody.
In cases without audio, watching the lips for sync was also a reliable giveaway.
I only did about half of the samples, but things became harder further in. I'm not sure if they were better done or I was just becoming bored and losing focus though.
For context: I'm mostly unaware of actions of US presidents (beyond the broad sweeps based on left vs right) and I've at most listened to Trump and Biden maybe a couple of times for a total of ~1min in the last few years - just doing this exercise has probably at least doubled my exposure.
I found the text and voice mostly impossible unless the content clued me in. Video was a little easier because I know to look for teeth, video with sound was fairly easy.
Is there a generic trick for recognising faked voices without really knowing the original (similar to looking for teeth on videos)?
Same here! I'm not a native English speaker; I live in the US but don't follow politics. I got a lot of the Biden ones wrong because I haven't watched as many videos of him as of Trump. I also marked some real videos as fakes because I thought there was no way Biden or Trump would talk about this specific subject in this way.
Maybe we are missing some tricks to pick out the fakes.
For the majority of folks here who picked out the fakes easily: do you feel you could do the same if the videos were of Putin speaking in Russian (with subtitles), or speaking in English with a very thick accent?
There are real world effects in the real videos that are completely absent in the fake ones that give them away.
The fakes have completely clean audio with similar artificial noise (real ones have crowd or other background noise), and the fakes have no microphone or environmental artifacts (echo/reverb, plosives, others).
What made them even easier to recognize is that the accents were all wrong. They were, exactly as another person said, caricatures of the original voices. I actually found that they were so obvious that most could be identified before the second or two they make you wait to click the submit button.
If it were a foreign language, I think the success rate would be lower, but the unnatural lack of noise gives them away.
For the speech clips I thought the disjointedness of the words made it obvious. That being said, Trump has a very recognizable way of talking so his clips were really easy to sort.
I mostly avoided paying attention to the political content, because it seemed like a few of them were supposed to be gotchas based on that. Obviously the text-only ones forced my hand.
Wish they told me at the end how many I got. Probably missed 4 or 5.
Before deepfakes, we had controversies such as "Technological Dub Erases a Bush Flub for a Republican Ad"[0] from a clip of the 2003 SOTU address when Bush misspoke "while" which was changed to "vial" for clarity in the ad.
I got a couple of the video-only ones wrong, because it's hard to tell a low-quality video of someone not moving very much from a fake version of same. The fake audio ones were easy to discern, almost from the first word. They sounded like an SNL impression of those politicians. I don't know whether that means video is, in general, easier to deepfake than audio, or if the examples used in this experiment were bad representations of the state of the art.
I had the same results. Super easy on audio samples, tough on video-only ones. I suspect we're all so used to compressed video that AI artifacts can hide in that noise. Just the other day a video of Putin was tagged as "green screen!!" because one part of the video appeared to show his hand move through a microphone stand.
But it wasn't. It was just shitty compression artifacts making it appear so. (Or, maybe it still was, but I'm not convinced by that evidence)
But speech cadences and patterns are super tough to counterfeit.
I was pretty good at detecting the fakes (video and sound), but made some mistakes with the real ones. But I've also relied on the 50% fakes prior which I couldn't use in a real-world setting. Idea for a follow-up experiment: Manipulate the prior and see how it affects people's performance. If you tell people it's just 10% fakes (proxy for the reputation of the source), performance might be much worse.
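To put rough numbers on why the prior matters so much (made-up figures: suppose people flag 90% of fakes and mistakenly flag 10% of real clips):

    # P(fake | flagged) = sens * prior / (sens * prior + fpr * (1 - prior))
    def posterior_fake(prior, sens=0.9, fpr=0.1):
        flagged_fake = sens * prior
        flagged_real = fpr * (1 - prior)
        return flagged_fake / (flagged_fake + flagged_real)

    print(posterior_fake(0.5))   # 0.90 -> with 50% fakes, as in this study, a flag is usually right
    print(posterior_fake(0.1))   # 0.50 -> with 10% fakes, the same detector is a coin flip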
The videos look worse than "professional fakes" from long before AI even existed. Text is the hardest for people who don't follow those specific people (though it is still easy to spot the fake), and the audio is very easy as well.
Not boasting, just reality. I've seen some pretty astounding deepfakes; these are not among them.
100% accuracy rate, gave up in the end because there was no indication of how many examples I had to go through.
So the AI is manipulating the mouth, and its problem is that the top lip isn't consistent, so sometimes the top lip appears like a horizontal 'S'.
Voice impersonations are too unrealistic.
I'm not used to either Biden or Trump, as I do not follow US politics much. I am also uninterested in spending much time analysing the video, and watched it portrait mode on my phone. I found many of the fakes convincing at a glance.
Honestly, everyone going on about how obvious the fakes are seems a little silly. If the technology is here now in this form. It might already be far more impressive elsewhere, and doubtlessly will be far more impressive in the future. We should have a discussion about what that future might look like.
Been wondering about this in context of UKR and the huge role of open source intel via mobile phone videos in it. For understandable reasons it's not part of the discussion right now but there might be interesting retrospectives about this available down the line.
This is science. They're disproving a very narrow, specific hypothesis about a particular algorithm. They're not trying to disprove deepfakes in general. It's just one experiment collecting data on one setup.
I once saw a Putin deepfake.
Combine that with anonymous hackers breaking into a TV station,
or with a website that generates video from text to dissuade occupiers.
What do you get? These technologies can have real-life implications.
Anyway, THANK YOU site developers for properly handling resumption of progress and not abusing the history API. Greatly appreciated and unfortunately very rare for these sorts of sites.
All of the videos except the first one had no audio for me. The volume control was greyed out. I have the "no autoplay" option enabled in my browser, not sure if that's why.
Fakes have been an issue since forever, and the solution has always been the same: establishing a trust chain.
Just cutting things out of context is enough to completely twist reality; deep fakes are just convenient shortcuts, but hardly needed.
Media going back to "anonymous sources" and "people close to the matter say" isn't helping either.
As a society we were going in the right direction with PGP and decentralized authority chains, but apparently all that scene is practically dead nowadays.
Did anyone else notice they have been trained to automatically close pop-ups without reading them? It turns out this one has useful instructions in the pop-up.
Video & sound: labeled all correctly as fake/real.
Text: did not even try to read.
My comments:
For video, the biggest tell for me is the mouth. Just ~5 seconds of footage was enough usually. The lips shift in size and appearance or move way too little, for example.
For audio, also a few seconds was usually enough to make a decision. The intonation of both Biden and Trump is way off in the fakes if you've spent some time listening to them during debates, interviews, etc. It has hints of the real voice but to me the fakes sounded quite a bit off. The curious fact is that I haven't listened to anything said by Trump since he left office but his characteristic voice is still very clearly embedded in my head. I guess humans are quite good at memorizing and recognizing voices. Biden to me sounds drunk/high in the fakes LOL.
Overall, all of the fakes felt pretty easy to spot.
Anybody with knowledge in the field: is this the current "state of the art" in terms of deepfakes, or can better results be achieved?
I missed one video-only one (I was correctly leaning "fake", since it felt "off", but second-guessed myself with the reflection off Biden's forehead: "Surely the AI would fuck that up, right?") and somehow only missed one of the text ones despite randomly guessing (how am I supposed to know if it's real or fake?). Had I stuck to my uncanny-valley-recognizing gut, I probably would've aced the audio and/or video ones.
Got most of them. I wish it gave me a score at the end, though! The easy ones were where the person was saying something really uncharacteristic, like Biden saying they were going to legalize marijuana on the state level. The neutral ones where the person wasn't saying anything but just moving around making random gestures with really low video quality were the hardest.
The test seems flawed. I got the first one (after the attention check) wrong - I said it was fake, but it was real. I thought it was fake because Trump's hair was clearly very smeared out, not at all realistic. But it seems that this must have been an artifact of heavy video compression. I don't see how this is a meaningful test of anything.
Yeah, but how is the test taker supposed to know whether to take the smearing as indicating it's fake or not?
Better would be to let them see 10 videos all at once, half of which are fake, and ask them to divide into a fake set of 5 and a real set of 5, after looking at all of them as many times as they like. Asking "fake or real" when there is no basis to tell whether flaws should be taken as indicating "fake" or just attributed to compression seems meaningless.
Or tell people what aspects of fakeness they're trying to assess - eg, forget about video artifacts, just pay attention to the audio.
Using clips of Trump and Biden is also a bad idea. They ask you to say if you've seen one before, but aren't many people going to have seen one, but not clearly remember that, and then be influenced to think it's real by sub-conscious recognition?
Why not present pairs of videos of the same non-famous person, one fake and one real, both presented with the same amount of compression, and ask one to choose which is the real one? Using many different people, of course - why would you introduce doubt about the generality of your results by using only two people?
Of course, in practice people may be less able to recognize fakes when video quality is poor, which would be useful to know, but I think one would need to investigate that issue separately, not in combination with other reasons that fakes might or might not be recognizable.
I'm pretty certain that in at least one of these Trump is standing in front of a green screen, I put fabricated, but I guess that was wrong to do because that wasn't fabricated with AI? Either way this said it was real, which yeah it was, but the setting was faked. Idk.
Honestly, I think this level of "fakery" is a little... facile. Firstly, practically no one judges individual clips; they're judging based on who is bringing it to them and who they trust. Secondly, Donald Trump could say the most despicable things and still get elected. People seem to forget that Trump got caught on a hot mic admitting to sexually assaulting women, and that he got elected after that.
Also, the fake ones with audio or text from Trump are way too easy for anyone who has seen Trump over the years to guess (as opposed to those who know Trump from "comedy news shows" from the last few years)... because Trump wouldn't say some of those things (anti-gay marriage stuff, for example).
Biden is a bit more difficult to guess like that because he could have said anything which he thought would be popular at a given point throughout his career... but then the technology is too bad to actually be convincing enough.
I didn't do the test. My first thought was that it's the wrong question. A "trusted sources" model for believing stuff can always be gamed. There should be more focus on independent confirmation, and general sanity checking, rather than caring about whether a video is fake or not, especially if one is going to take some action based on the content.