The side-by-side display makes it pretty easy to tell the one from the other: just compare them and pick the one that makes the least sense as the nonsense one. That way I score 10/11. But when looking at just the left-hand one, the problem is suddenly much harder, and I'm happy to get better than even. Bits that don't help: I'm not a native English writer, and I've seen too many real-life papers with crappy writing, so quite a few of these look plausible, especially when they are about fields I know very little of.
Presumably when you're a native English speaker and have a broader interest the difficulty goes down a bit.
I like this project very much and would like to see some overall scores, and it might not hurt to allow for a verified result link to detect bragging rather than actual results (not that anybody on HN would ever brag about their score ;) ).
Overall: I'm not worried that generated papers will swamp the journals any time soon, but for spam/click farms this must be a godsend, and it will surely cause trouble for search engines trying to separate real content from generated content.
> it will surely cause trouble for search engines trying to separate real content from generated content.
Fortunately, SEO spam is currently nowhere near as coherent as this, and often features some phrases that are a dead giveaway ("Are you looking for X? You've come to the right place!" or a strangely-thesaurised version thereof), but I am also worried about this new generation of manufactured deception.
I tried making fake tweets using GPT-2 two years ago. When I actually interviewed people to verify my model, the fakes held up well with people who didn't actively engage with Twitter, but not with people who regularly used it (note that this was N ~ 10 people and it was limited to GPT-2 774M).
I also found that some people would refuse the test outright and would believe whatever the model output, due to my choice of subject.
Others who did a similar exercise and tried to verify their results on Reddit found that a great many people could spot the fakes quite easily.
The biggest issue would be someone using a system to deliberately fool a targeted set of people which is easy given how ad networks are run.
Having read your comment first (ooh, horribile dictu on HN) I decided to try playing by only looking at the left paper and deciding if it was fake. Luckily the model seems to have picked up that "last names can be units" too strongly and the 2nd fake paper was discussing a frequency of "10 Jones".
Quite easy when you know one is fake. Flagging fake articles in a review queue, by abstract only, when anywhere from none to all of them may be fake ...
Now that's a challenge.
Also, if you train GPT on the whole corpus of Nature / Science / whatever articles up to, say, 2005, could you feed it leading text about discoveries after 2005 and see if it hypothesizes the justification for those discoveries in the same way that the authors did?
I find the whole thing ominous because there is no "there" there: there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text. This greatly increases the amount of plausible nonsense that can be used to drown out actual research. You could certainly replace a lot of pop-sci and start several political movements with GPT-2... all of which has no actual nutritional content.
I find it intriguing for exactly that reason. I agree there's no fundamental "there", but I suggest that you may find that ominous because it implies there's no fundamental understanding anywhere. Only stories that survive scrutiny.
GPT can write "about" something from a prompt. This is not much different than me interpreting data that I'm analyzing. I'm constantly generating stories and checking them, until one story survives it all. How do I generate stories!? Seriously. I'm sure I have a GPT module in my left frontal cortex. I use it all the time when I think about actions I take, and it's what I try to ignore when I meditate. Its ongoing narrative is what feeds back into how I feel about things, which affects how I interact with things and what things I interact with ... not necessarily as a goal-driven decision process, more as a feedback-driven randomized selection. Isn't this kind of the basis of Cognitive Behavioral Therapy, meditation, etc.? See [1, 2]. If you stick GPT and sentiment analysis into a room, will they produce a rumination feedback loop like a depressed person?
Anyway, if you can tell a coherent story to justify a result (once presented with a result), one that is convincing enough for people to believe and internalize the result in their future studies, how is that different from understanding that result and teaching it to others? The act of teaching is itself story generation. Mental models are just story-driven hacks that allow people to generalize results in an insanely complex system.
1. Jonathan Haidt, The Happiness Hypothesis
2. Buddhism and Modern Psychology, Coursera
You could probably ask an undergraduate or pop science fan to write a paper title and first sentence of an abstract that would turn heads and get good results.
Faking an entire 10 page paper with figures and citations is much harder. I'm sure it'll happen next week, but until then I can still say that's where real understanding is demonstrated.
> there is no understanding in the GPT-2 system, but it's able to generate increasingly plausible text
Under some definition of "understanding". GPT understands how to link words and concepts in a broadly correct manner. As long as the training data is valid, it's very plausible that it could connect some concepts that it genuinely and correctly understands are compatible, but doing so in a way which humans had not considered.
It can't do research or verify truth, but I've seen several examples of it coming up with an idea that as far as I can tell had never been explored, but made perfect sense. It understood that those concepts fit together because it saw a chain connecting them through god-knows how many input texts, yet a human wouldn't have ever thought of it. That's still valuable.
As to how far that understanding can be developed... I'm not sure. It's hard to believe that algorithmically generated text would ever be able to somehow ensure that it produces true text, but then again ten years ago I would have scoffed at the idea of it getting as far as it already has.
Makes you wonder if humans also have less "there" than we give ourselves credit for. How much of sentence construction is just repeating familiar tropes in barely-novel ways?
To what extent have our brains already decided what to say while we still perceive ourselves as 'thinking about the wording'?
No, the text is generally very implausible if you know anything about the science. For example, describing DNA twists with protein-folding descriptors, mixing up quantum computing with astronomy, inorganic chemistry with biochemistry, virology with bacteriology...
I was really impressed with gpt-2 but seeing this really gave me a feel for how much of a lack of understanding it has.
A striking argument, at this point in time.
Quite likely this weakness will be overcome soon, when deep learning becomes integrated with [KR²]-spectrum methods...
This is already happening. Maybe not in Nature, but all over the "small" papers of this world (and so also with any Nature derivatives). There are people milling the same article with nonsense "results" through 3 journals each in Elsevier, Springer, Wiley and RCS. And they stay up. Forever. And anyone looking into this topic will waste an hour because these 10x-cited articles are just self-citations of the same crap.
This challenge would be more interesting if there were "Neither is fake" and "Both are fake" buttons (and, obviously, if the test randomly mixed in pairs where both or neither are fake).
Absolutely. Especially once you factor in the fact that sometimes things are simply written poorly or don’t make sense, even if it was a real human being at the keyboard.
A lot of the communication I have with folks is subtly flawed in logic or grammar, but that doesn’t make me think I’m working with a bunch of androids.
It’s natural and often even necessary to try to figure out an author’s intent when their writing doesn’t fully make sense.
The chicken genome (the genome of a chicken that is the subject of much chicken-related activity) is now compared to its chicken chicken-to-pecking age: from a genome sequence of chicken egg, only approximately 70% of the chicken genome sequences match the chicken egg genome, which suggests that the chicken may have beenancreatic.
Even hard mode isn't that hard because GPT-2 tends to ramble on while saying nothing substantive. If I can't figure out what a paper is supposed to be talking about, it's fake.
Did way better on Hard mode than Easy. I think people get bored of doing this before we can see real indicative results.
Scores under 5 on what amounts to a coin flip don't strike me as so remarkable, especially when coupled with an incentivised reporting bias as we see here. ("I got a high score! Proud to share!" vs. "I got a low score, or an even score, and look at all the people reporting high scores; think I might keep it to myself.")
As things stand, at this juncture, I think the AI may still have a chance of holding its own here.
Also, were the AI to do well consistently, I'd think it might say more about the external unfamiliarity with, and the internal prevalence of, field-specific scientific jargon, than any AI's or human's innate intelligence.
The engineering/materials science/physics ones were fairly easy to identify for me. Usually it would be one or two sentences that were grammatically cohesive but would make a statement that didn't make any sense if you had even a basic understanding of the topic. One that stood out to me was an astrophysics paper that said a planet was orbiting solar wind. I don't have to be a PhD to know that's BS.
Yes, this mirrors my experience. Fields that have my interest are pretty easy in isolation (just looking at one subject), but fields that are remote can be a challenge.
In this instance, the generated medical and biotech stuff is much harder to identify because the algorithm doesn't give itself away with grammar issues. For instance, here is a random paper abstract that I changed; can you spot the change? Hint: it is one of the Greek symbols or a number.
Three highly pathogenic β-coronaviruses have crossed the animal-to-human species barrier in the past two decades: SARS-CoV, MERS-CoV and SARS-CoV-2. To evaluate the possibility of identifying antibodies with broad neutralizing activity, we isolated a monoclonal antibody, termed B4, that cross-reacts with eight β-coronavirus spike glycoproteins, including all five human-infecting β-coronaviruses.
Same for me, hard mode is quite easy. Our brain is a pattern-matching engine and constantly tries to make sense of what it reads, hence the text may look legitimate; it wouldn't if the text were 2D.
The hard version usually requires me to understand why a number, measurement or chemical or other substance doesn't make sense in the context of what each paragraph is describing. This means I can't just skim it in order to spot the fake, I need to figure out that what it's saying is wrong.
That's close enough for this to be a success if the purpose was to persuade or fool laymen.
I also got one that was a fake that was talking about measuring two actual quantum properties simultaneously by using a third state to probe it indirectly.
Which is absolutely a real thing except that the exact quantum properties in fact didn't commute while they claimed they did commute and said for some reason simultaneous measurement required a third state anyway.
I don't know how I would have been able to distinguish that from completely reasonable methods for quantum error correction without knowing ahead of time which quantum states commute and which don't... pretty cool.
If I were skimming or half asleep I definitely wouldn't have caught a lot of these on hard; abstracts are always so poorly written, usually trying too hard to sound complicated by using big words when small ones would do just fine!
I've always wanted a similar thing but for philosophy texts. Notably Hegel: I'd love to see a philosopher trying to figure out which pile of gibberish is generated and which is the work of the father of modern dialectics.
Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.
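Even something crude would do as a first pass: a TF-IDF nearest-neighbour check of each generated abstract against the training corpus. A minimal sketch, assuming you have both sets as plain string lists (the lists and the 0.8 threshold below are placeholders, not anything from the demo):

    # Flag generated abstracts that sit suspiciously close to a training abstract.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    real_abstracts = ["..."]       # training corpus (placeholder)
    generated_abstracts = ["..."]  # model output (placeholder)

    vec = TfidfVectorizer(stop_words="english")
    real_matrix = vec.fit_transform(real_abstracts)
    gen_matrix = vec.transform(generated_abstracts)

    sims = cosine_similarity(gen_matrix, real_matrix)
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] > 0.8:  # arbitrary cut-off
            print(f"generated #{i} looks {row[j]:.2f} similar to real #{j} -- possible copy")

It wouldn't catch paraphrases, but it would catch near-verbatim regurgitation.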
> Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.
Yeah, I ran into at least one example which basically regurgitated a real paper. The "fake" article was:
Efficient organic light-emitting diodes from delayed fluorescence
A class of metal-free organic electroluminescent molecules is designed in which both singlet and triplet excitons contribute to light emission, leading to an intrinsic fluorescence efficiency greater than 90 per cent and an external electroluminescence efficiency comparable to that achieved in high-efficiency phosphorescence-based organic light-emitting diodes.
And the real Nature abstract it mirrors:

Highly efficient organic light-emitting diodes from delayed fluorescence
Here we report a class of metal-free organic electroluminescent molecules in which the energy gap between the singlet and triplet excited states is minimized by design, thereby promoting highly efficient spin up-conversion from non-radiative triplet states to radiative singlet states while maintaining high radiative decay rates, of more than 10^6 decays per second. In other words, these molecules harness both singlet and triplet excitons for light emission through fluorescence decay channels, leading to an intrinsic fluorescence efficiency in excess of 90 per cent and a very high external electroluminescence efficiency, of more than 19 per cent, which is comparable to that achieved in high-efficiency phosphorescence-based OLEDs.
That's something I've been suspecting for a while. More powerful models are prone to overfitting, so it is likely that GPT-2 is simply memorizing whole texts and then regurgitating them wholesale.
Knowing how they're generated, a sequence of sentences that makes sense is likely copied almost verbatim from an article written by a human. Without understanding the concepts, the algorithm may simply repeat words that go well together - and what goes together better than sentences that were written together in the first place?
What the GPT model is really good at is identifying when a sentence makes sense in the current context. Given that it has half of the internet as its learning corpus, it may easily be returning a piece of text that we simply do not know about. The real achievement, then, would be finding ideas that are actually appropriate in relation to the input text.
Yes, I got a really short astronomical one about the discovery of a metallic core planet circling a G-Type star, and I only knew it was fake, because I would have heard about it!
Sometimes the articles are really short which makes it even harder to figure out which is fake. GPT's big weakness is that it tends to forget what it was talking about and wanders off after a couple of paragraphs. With just one sentence to examine it can be very hard to spot.
With these GPT models, I don't get the appeal of creating fake text that at best can pass as real to someone who doesn't understand the topic and context. What's the use case? Generating more believable spam for social media? Anything else? Because there's no real knowledge representation or information extraction going on here.
I use GPT-2/3 in creative writing to generate rough text that I then go back and edit/improve because it gives a good starting point (and often has a lot of high-quality factors).
I don't know if this translates to technical writing, but it's possible someone might complete a prompt on some specific topic(s) and then use that as a point to start from, especially if they're knowledgeable enough on the topic to correct the output. It's nice to be able to skip a lot of boilerplate words (how many words in this comment are actually the meat of this idea, and how many words are just there to tie all those morsels together?)
I built my own editor on top of the API, which I ran at home, yeah. You can see the code and/or run it yourself at https://github.com/indentlabs/gpt-3-writer, but it's very rough and not really ready for public use. Just make sure your OpenAI secret key is set in ENV['OPENAI_SK'] and you should be able to run it yourself.
I also built a more polished version to add to the Notebook.ai document editor (so writers can get some continuation prompts whenever they get a bit of writer's block), but the pricing made it unfeasible to actually release. Notebook.ai is also open source though, and you can see the GPT-3 functionality in the unmerged PR here: https://github.com/indentlabs/notebook/pull/739
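For context, the core completion call underneath an editor like this is tiny. Here is a rough Python sketch of the kind of request involved (this is not the editor's actual code, and the engine name, prompt and parameters are purely illustrative), reading the same OPENAI_SK environment variable mentioned above:

    # Sketch of a single writer's-block continuation request against the
    # OpenAI completion API as it existed during the GPT-3 beta.
    import os
    import openai

    openai.api_key = os.environ["OPENAI_SK"]

    response = openai.Completion.create(
        engine="davinci",
        prompt="The old lighthouse keeper opened the door and",
        max_tokens=60,
        temperature=0.9,
    )
    print(response.choices[0].text)

Everything else in a tool like this is UI and prompt management around calls like that.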
Mainly at present I see it as a demonstration of progress in the capabilities of the models. A machine is writing coherent syntax and grammar in close to real-time, maintaining a semblance of context while it does so, without requiring excruciatingly complex and detailed semantic algorithms.
One use-case is as a creative writing assistant. Typical ways to beat writer's block are to do something else (read, walk, talk, dream). At some point, either consciously or subconsciously, the hope is that these activities will elicit inspiration in the form of an experience or idea that will connect with the central vision of the author's work, allowing the writing to continue. So too it could be with prompts generated from these models, just another way of prompting and filtering ideas from the sensory soup of reality.
(While the "dark" version of reality has bedroom "writers" pooping out entire forests of auto-generated pulp upon ever-jaded readers, being able to instantly mashup the entire literary works of mankind into small contextual prompts would be another tool in the belt for more measured and experienced authors.)
There are other more practical use-cases for these models. There's work being done on auto-generating working or near-working software components from human language descriptions. Personally I'd love to just write functional tests and let the "AI" keep at it until all the tests pass. So seeing the models improve over time is a sign that this may not be an impossible feat.
From a less utilitarian perspective, I'd love computer assistants to have a bit of "personality". "Hey Jeeves, tell me the story about the druid who tapdanced on the moon." and just let the word salad play out in the background. Yeah it's a toy, but it would jazz up the place a bit, add a bit of sass even.
I do think we're a ways off the first computer generated science paper being successfully peer reviewed and contributing something new to human understanding of nature, I'd have scoffed at the idea ten years ago but now I'm sure it's only a matter of time.
Here's an example I just ran into today. This is post #17 in a thread on a car forum where people are discussing whether or not to run the A/C periodically in the winter:
5/5 on hard mode, but it’s tough sometimes, I don’t actually know much about biology. But if you’ve played around with GPT before, you get better at spotting the subtle logical errors it tends to make. I wonder whether the ability to identify machine-generated text will become a useful skill at some point.
In essence, detecting which one is fake is exactly how you train the generator: you tweak the generating process to "fix" any detectable flaw, and you train it until (as far as your detector is concerned) the generated texts are indistinguishable from the real ones. A better system might still distinguish them, but that better system can relatively trivially be adapted to generate better texts which it, in turn, won't be able to distinguish from real ones.
GANs work by feeding back the mistakes and forcing the generator model to improve its cheating. In this case, filtering out titles that are ambiguous would act as an independent filter.
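As a toy illustration of the "detector" half of that loop, here is a sketch of a bag-of-words classifier over real vs. generated abstracts (my own illustration, not how this demo was built; the corpus lists are placeholders). In a GAN-style setup, its mistakes are what would be fed back to the generator:

    # Toy real-vs-fake abstract detector.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    real_abstracts = ["..."]   # placeholder: real abstracts
    fake_abstracts = ["..."]   # placeholder: generated abstracts

    texts = real_abstracts + fake_abstracts
    labels = [1] * len(real_abstracts) + [0] * len(fake_abstracts)

    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)

    vec = TfidfVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)

    print("held-out accuracy:", clf.score(vec.transform(X_test), y_test))

Any detector like this that scores well becomes, in effect, a training signal for a better generator.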
I'm sure GPT-2 abstracts would fly through many conferences' screening processes. I've seen talks and posters that were utter nonsense, but everybody was too polite to say anything to the person or their advisors.
I've reviewed articles that were completely made up and the other reviewer didn't even detect that. Nor did the editor.
I've contacted editors about utterly wrong papers, criticized the article on PubPeer, and the article is still published... because it would harm their reputation. That's one of the maddening aspects of academic publishing.
There is a long tail of weak journals in just about any field. When you think about it, this is inevitable in any society that has freedom of the press and where there exist incentives (evaluated by non-experts) for publishing. You have to evaluate journals the same way that you would evaluate products purchased in a flea market.
Abstracts are relatively short, so for the length of an abstract GPT-2 might just be fusing together the abstracts of two or three related papers, so the result might look legit. It tends to wander around when the length is increased, though, and if asked to go on for long enough it will lose the plot.
How does one train GPT-2 on their own content and produce nice results at arbitrary lengths? I found a few libraries but I could not use them well; I get lost very quickly. I just want to train it on our internal Confluence and have fun with it.
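For what it's worth, the lowest-friction route I've seen mentioned is the gpt-2-simple wrapper. A minimal sketch, where the model size, step count and corpus file name are just placeholders (your Confluence pages exported as one plain-text file):

    # Minimal GPT-2 fine-tuning sketch with gpt-2-simple (pip install gpt-2-simple).
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="124M")      # the small model; fits a modest GPU

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset="corpus.txt",        # placeholder: your exported Confluence text
                  model_name="124M",
                  steps=1000)                  # raise for a bigger corpus

    gpt2.generate(sess, length=300, temperature=0.8)

Arbitrary lengths are still limited by the model's context window, so very long outputs tend to wander regardless of how you train.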
If you don't understand the terminology, any paper is gibberish :) but I agree, I can detect fake ones fairly reliably in biology, but not in e.g. astronomy.
Even with GPT-3, where this "game" would be much harder, this would be kind of a weak demo, because the human reader doesn't understand what the text means in either case (most of the time), which takes away much if not all of the interesting part: whether the generated text makes any sense or not.
We have known for a while that language models can generate superficially good-looking text. The real question is whether they can get to actually understand what is being said. As the human readers don't understand either, the exercise is sadly moot.
STEM people love to bring up the Sokal affair. The same STEM people also don't realize that many journals and conferences in STEM have been tricked by things like this (more specifically by precursors using HMMs and the like).
The reason for the downvotes is probably the generalisation regarding what "STEM people love to bring up" and also "don't realize". It feels like an unprovoked strawman attack against an ambiguously defined group of people.
10/10 on easy and 10/10 on hard. Hard selections seem mostly hard because they are short enough that you don't see GPT-2 go off the rails with something completely nonsensical.
Only one was convincing enough to be truly challenging. I got it right because the proposed mechanism was fishy, 1) I had domain expertise, and 2) the date of the paper made no sense relative to when that sort of discovery would have been made (2009 is too early).
6 and 2 on hard mode. The failure of the model to connect ideas in long paragraphs (or to make a succinct claim) is what gives it away. It introduces far too many terms with far too little repetition and far too much specificity in such a short span.
Suggested tweak - train it against papers written by people with an Erdos number < 3 (or Feynman contributors, etc.), so that the topics and fake topics are more closely related in style and content. Maybe even feed it some of their professional letters as well. That would produce some very hard to decipher fakes.
Another great corpus for complex writing is public law books. Have it compare real laws from the training set with fake laws. I bet it would be very difficult to figure out the fake laws.
Training one of these on the entire corpus of one author (Roger Ebert, Justice Ginsburg, Joyce, anyone with a large enough body of work), and having people spot the fake paragraphs from the real ones, would be very, very difficult. An entire text, however, would likely be discernible.
It is getting really, really close to being able to fool any layman, though. Impressive work!
The generated abstracts may be gibberish but I wonder how often they contain little bits of brilliance, or make novel connections between ideas expressed in the training set. If we got a panel of domain experts to evaluate the snippets on this basis, their labels could be used to fine-tune the model in the direction of novel discovery. (This is almost certainly not a novel idea!)
Now here is an application where the GPT stuff can really shine, which is trying to convince people that aren't domain experts that something is speaking from authority, even if the reader doesn't intend to get anything meaningful from the material either way.
18/20 on hard mode. Shorter ones are more difficult; longer ones tend to have dangling clauses or circular claims. I suspect GPT-3 could produce convincing complete abstracts. But this was good enough that I don't feel bad about the two I missed.
Seems like the model likes to repeat words in the title, particularly when hyphens are involved (I guess it considers them as different words?), e.g. "new dinosaur-like dinosaur" and "male-pattern traits in male rats" are a couple I saw.
If we feed AI all the knowledge about the physics of the world, will it ever be capable of giving answers without actually performing scientific research, inferring them just from the laws that define the world?
I very much doubt it, at least with regards to GPT-n style models. In this particular example, it is not actually being fed any knowledge about the physics of the world. Rather, it is being fed the texts of a very specific subculture (that of scientific research and publishing) which is based not only on the prior sensory experiences of human beings, but also following the arbitrary agreements and expectations of the members of the scientific community that have developed over time. Even the most intelligent human minds would be unable to learn anything meaningful from scientific papers if they were expected to read them from scratch having been brought up completely isolated and without any prior knowledge.
On the other hand, an interesting possibility with well-designed text-mining and AI models would be for them to generate valid hypotheses that hadn't been contemplated earlier, based on the massive corpus of scientific publications. The model may be able to find possible correlations or interesting ideas by combining sources from different fields that would normally be ignored by the over-specialised research community. However, in that case the model wouldn't be valuable for providing answers; rather, its value would be in providing questions.
We've done recent work on using a transformer to generate fake cyber threat intelligence (CTI) and found that a set of cybersecurity experts could not reliably distinguish the fake CTI examples from real ones.
Priyanka Ranade, Aritran Piplai, Sudip Mittal, Anupam Joshi, and Tim Finin, Generating Fake Cyber Threat Intelligence Using Transformer-Based Models, Int. Joint Conf. on Neural Networks, IEEE, 2021. https://ebiq.org/p/969
The sad thing is that often there's an equal mental effort to read GPT articles and the real ones. It's as if people are trying to make their papers as incomprehensible as possible.
Nah--If you're publishing in Nature, you're already well beyond that game.
The incomprehensibility comes from the fact that abstracts (and particularly NPG abstracts) are trying to do many things at once--and all in 200 words. In theory, the abstract should describe why your work is of broad general interest (so Nature's editors will publish it), while explaining the specific scientific question and answer(!) to a specialist audience of often-picky, sometimes-hostile peer reviewers, and conforming to a fairly specific style that doesn't reference the rest of the paper.
It's tough to do well, and even more so for non-native English speakers.
Pretty easy, even in hard mode, and not due to any knowledge of the subject matter. I'm 15 - 0 so far.
I kept seeing certain types of grammatical error, such as constructs like "... and foo, despite foo, so..." or "with foo, but not foo..." where foo is the exact same word or phrase appearing twice in a sentence.
I also kept seeing sentences with two clauses that should have agreed in number or tense but did not.
"This study presents the phylogenetic characterization of the beak and beak of beak whales; it is suggested that the beak and beak-toed beaks share common cranial bones, providing support for the idea that beaks are a new species of eutriconodont mammal."
Is it repeating itself because the corpus is too small? 10,000 papers seems like rather a small corpus. How large a training corpus would normally be used in GPT-2 work?
[I know nothing - I'm pretty ignorant about practical ML]
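A dumb heuristic catches a lot of exactly that tell, by the way: flag any sentence that repeats a longish word, like the "beak and beak" above. A rough sketch (the sentence splitting and the length threshold are arbitrary choices of mine, not anything principled):

    # Crude detector for the "same word twice in one sentence" pattern.
    import re

    def repeated_word_sentences(text, min_len=4):
        flagged = []
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = re.findall(r"[a-zA-Z-]+", sentence.lower())
            seen = set()
            for w in words:
                if len(w) >= min_len and w in seen:
                    flagged.append((w, sentence.strip()))
                    break
                seen.add(w)
        return flagged

    sample = ("This study presents the phylogenetic characterization of the beak and "
              "beak of beak whales; it is suggested that the beak and beak-toed beaks "
              "share common cranial bones.")
    print(repeated_word_sentences(sample))

It would obviously flag plenty of legitimate sentences too, so it's a triage signal at best.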
I found it relatively easy to spot the fakes, but the titles on some of them were pretty good and made me wish they were real. Like reading science journals from a whimsical fantasy universe.
Some of my favorites: "A new onset of primeval black magic in magic-ring crystals"
"The genetic network for moderate religiosity in one thousand bespectacled twins"
"Thermal vestige of the '70s and '00s disco ball trend"
It seems some of the fake ones could easily have been real (e.g. the one about the 3D structure of the bound ACh receptor). I guess the brevity of the text helps to make it make sense and to make it indistinguishable.
Not hard at all. The fake ones don't make any sense from an English perspective. Looks like someone just picked the next word on a SwiftKey keyboard or something: "this word fits here... right?"
I wish there was a version of this for computer science. We don't have a broad flagship journal like Nature, so maybe it would need to be trained on a collection of IEEE and ACM venues.
The major tell of these systems is their inability to write something coherent over a span larger than a few paragraphs, so this is less impressive than it would have been 5 years ago. Still, well done.