It never ceases to amaze me what you can do with these transformer models. They generated millions of potential solutions for each problem, used the provided example tests to filter out 99% of the incorrect solutions, and then applied some more heuristics and the 10 available submissions to try to find a working solution.
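To make that filtering step concrete, here's a rough sketch of what it amounts to (my own toy Python, not DeepMind's pipeline): run every candidate program on the problem's example input/output pairs and discard anything that doesn't reproduce the expected output. The clustering and heuristics for picking the 10 submissions would come after a step like this.

    # Toy sketch of example-based filtering (illustrative only, not AlphaCode's code).
    import subprocess

    def passes_examples(source_path, examples, timeout_s=2.0):
        """examples: list of (input_text, expected_output_text) pairs."""
        for stdin_text, expected in examples:
            try:
                result = subprocess.run(
                    ["python3", source_path], input=stdin_text,
                    capture_output=True, text=True, timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True

    def filter_candidates(candidate_paths, examples):
        # Keep only the candidates that reproduce the provided example outputs.
        return [p for p in candidate_paths if passes_examples(p, examples)]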
All these approaches just seem like brute-force approaches: Let's just throw our transformer on this problem and see if we can get anything useful out of this.
Whatever it is, you can't deny that these unsupervised models learn some semantic representations, but we have no clue at all what that actually is or how these models learn it. But I'm also very sceptical that you can actually get anywhere close to human (expert) capability in any sufficiently complex domain by using this approach.
And next year they can filter out 99.99%. And the year after that, 99.9999%. So literally, an exponentially greater number of monkey/typewriting units. (An AI-produced Shakespeare play coming soon.)
>> we have no clue at all what that actually is or how these models learn
This is why I'm super cool-to-cold about the AI/deep learning classes being sold to young people who would otherwise be learning fundamental programming skills. It appears to me like trying to teach someone to ride a horse before they understand what skin, bones, muscles, animals, and horses are.
>>get anywhere close to human (expert) capability in any sufficiently complex domain
You can get close enough to scalp a lot of billionaires, but at the end of the day it's always going to be human coders banging our heads against management, where they ask for shit they can't visualize and it's our job to visualize how their employees/customers will use it. Yes, it involves domain-specific knowledge, but it also requires, er, having eyeballs and fingers, and understanding how a biological organism uses a silicon-based device. That's kind of the ultimate DS knowledge, after all. Now, lots of coders just copy-pasta a front end, but after all the hoopla here I'd be extremely surprised if in ten years an AI has caught up to your basic web mill in Indonesia when it comes to building a decent website.
Surely if your discriminator gets orders of magnitude better like you're describing, we could train the transformer GAN-style, and reduce the dependence on generating so many examples just to throw them away.
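A full GAN over discrete program text is awkward to train, but the learned-filter half of that idea is easy to sketch. Toy code below; the hashed bag-of-tokens featurizer is a crude stand-in of my own invention, not a real API: train a discriminator to predict whether a candidate passes, then keep only the top-scoring few instead of sampling millions.

    # Toy discriminator/reranker sketch (illustration only).
    import hashlib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def featurize(problem_text, candidate_code, dim=256):
        # Crude hashed bag-of-tokens features over problem + candidate text.
        vec = np.zeros(dim)
        for tok in (problem_text + " " + candidate_code).split():
            vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
        return vec

    def train_discriminator(problems, candidates, passed_labels):
        # passed_labels: 0/1 outcomes of running each candidate on example tests.
        X = [featurize(p, c) for p, c in zip(problems, candidates)]
        return LogisticRegression(max_iter=1000).fit(X, passed_labels)

    def rerank(clf, problem, new_candidates, k=10):
        scores = clf.predict_proba([featurize(problem, c) for c in new_candidates])[:, 1]
        ranked = sorted(zip(scores, new_candidates), reverse=True)
        return [c for _, c in ranked[:k]]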
Another way to frame it is that these models still perform very poorly at the task they're designed to do. Imagine if a real programmer needed to write a solution a hundred times before they were able to achieve (average) performance. You'd probably wonder if it was just blind luck that got them to the solution. You'd also fire them. What these models are very good at doing is plagiarizing content, so part of me wonders if they aren't just copying previous solutions with slight adjustments.
> Imagine if a real programmer needed to write a solution a hundred times
To be fair, a lot of creative work requires plenty of trial and error. And since no problem is solved from scratch, all things considered, you and the most immediate contributors to your result might have iterated through tens or dozens of possibilities.
My advantage as a human is I can often tell you why I am eliminating this branch of the search space. The catch is my reasoning can be flawed. But we do ok.
> just copying previous solutions with slight adjustments.
It's not just doing that: Copilot can do a workable job providing suggestions for an invented DSL. A better analogy than autocomplete is inpainting missing or corrupted details based on the surrounding context. Except instead of a painting, we are probabilistically filling in patterns common in solutions to leetcode-style problems. Novelty beyond slight adjustments comes in when the constraints are insufficient to pin a problem down to a known combination of concepts. The intelligence of the model is then in how appropriate its best guesses are.
The limitations of GPT-3 Codex and AlphaCode seem to be that they're relatively weak at selection, and that they require problem spaces with enough data to distill a sketch of the domain and learn how to inpaint well within it. Leetcode-style puzzles are constructed to be soluble in a reasonable number of lines, are not open-ended, and have a trick to them. One can complain that while we're closer to real-world utility, we're still restricted to the closed worlds of verbose APIs, games and puzzles.
While lots of commenters seem concerned about jobs, I look forward to having the dataset oliphaunt and ship computer from A Fire Upon the Deep someday soon.
>> While lots of commenters seem concerned about jobs, I look forward to having the dataset oliphaunt and ship computer from A Fire Upon the Deep someday soon.
I think this is more worthy of debate than anything about DSL models or current limits to problem spaces.
I'm not concerned about my job, but I am concerned about a world where corporate money starts shifting toward managing AIs as beasts rather than coding clever solutions. I'm concerned about it because (1) it has always been possible in theory to invent an infinite number of solutions and narrow them down, if you have the processing power, to those that "work", but this leaves us in a position where we don't understand the code we're running (as a society) or how to fix it (as individuals). And (2) because learning to manage an elephant, as a beast, is utterly different from learning to build an elephant, and it will lead to a dumbing-down of people entering the trade. In turn, they'll become more reliant on things just working the way they're expected to work. This is a very negative cycle for humanity as a whole.
Given the thing you're looking forward to, it's only about 30 years before no one can write code at all; worse, no one will know how to fix a broken machine. I don't think that's the thing we should advocate for.
"Understanding the code" might not be that big of a deal as you might think -- we have this problem today already. A talented coder might leave the company and the employer may not be able to hire a replacement who's as good. Now they have to deal with some magic in the codebase. I don't hear people giving advice not to hire smart people.
At least with AI, you can (presumably) replicate the results if you re-run everything from the same state.
There's also a very interesting paragraph in the paper (I'm in no position to judge whether it's valid or not) that touches on this subject, but with a positive twist:
Interpretability. One major advantage of code generation models is that code itself is relatively interpretable. Understanding the behavior of neural networks is challenging, but the code that code generation models output is human readable and can be analysed by traditional methods (and is therefore easier to trust). Proving a sorting algorithm is correct is usually easier than proving a network will sort numbers correctly in all cases. Interpretability makes code generation safer for real-world environments and for fairer machine learning. We can examine code written by a human-readable code generation system for bias, and understand the decisions it makes.
> Now they have to deal with some magic in the codebase. I don't hear people giving advice not to hire smart people.
People do advise against hiring people who write incomprehensible code.
Yeah, every now and then you run across some genius with a sloppy code style, and you have to confine them to a module you'll mark "you're not expected to understand this" when they leave, because they really are that much of a genius. But usually the smart people are smart enough to write readable code.
>The limitations of GPT-3 Codex and AlphaCode seem to be that they're relatively weak at selection
This really does seem like the key here--the knowledge apparently is all in the language model, we just haven't found the best ways to extract that knowledge in a consistent and coherent manner. Right now it's just: generate a bunch of examples and cherry pick the good ones.
How do you know the inner workings of the mind don't operate in a similar manner? How many different solutions to the problem are constructed within your mind before the correct one 'just arrives'?
I suspect there is some similarity between language models and the structure of language in the mind, but there's a whole lot more going on behind the scenes in the brain than simple runtime statistical model output. Intentionality, planning, narrativity, memory formation, object permanence... Language models are exciting and interesting because apparently they can do abstract symbolic manipulation and produce coherent text, but I wouldn't call AGI solved quite yet.
I was really impressed with a lot of the GPT-3 stuff I had seen people showing, so I gave it a spin myself. I was surprised by how repetitive it seemed to be: it would write new sentences, but it would repeat the same concepts among similar prompts. I wish I had saved the examples; it was like when a chat bot gets in a loop, but GPT-3 varied the sentence structure. I think that if you look closely at transformer models' outputs you can expect the same sort of thing. It's like in high school when people would copy homework but use different wording.
I also think that in ML and DL generally the overarching progress gets hyped, but in the background there are murmurs about the limitations in the research community. That's how we end up with people in 2012 saying FSD is a couple of years away, while in 2022 we know we aren't even close yet. We tend to oversell how capable these systems are.
I'd be shocked if people pitching startups and research grants etc. all started saying "yeah, this stuff isn't going to work for a couple of decades in any kind of sustainable manner", even if these types of unknowable unknowns were known.
What do you think then is the difference between going from 50th to 99.9th percentile in their other domains? Is there something materially different between Go, protein folding, or coding? (I don't know the answer, just curious if anyone else does)
>> What do you think then is the difference between going from 50th to 99.9th percentile in their other domains? Is there something materially different between Go, protein folding, or coding?
Yes, it's the size of the search space for each problem. The search space for arbitrary programs in a language with Universal Turing Machine expressivity is infinite. Even worse, for any programming problem there are an infinite number of candidate programs that may or may not solve it and that differ in only minute ways from each other.
For Go and protein structure prediction from sequences the search space is finite, although obviously not small. So there is a huge difference in the complexity of the problems right there.
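Just to put rough numbers on that (mine, back-of-the-envelope, not from the paper): Go positions are bounded by the 19x19 board, while the number of token sequences in even a made-up toy language grows exponentially with program length, and the length itself is unbounded.

    # Back-of-the-envelope only: a hypothetical language with 20 possible tokens.
    TOKENS = 20
    for n in (5, 10, 20, 40):
        print(f"length {n:>2}: about {float(TOKENS**n):.2e} candidate token sequences")
    # length  5: about 3.20e+06 candidate token sequences
    # length 10: about 1.02e+13 candidate token sequences
    # length 20: about 1.05e+26 candidate token sequences
    # length 40: about 1.10e+52 candidate token sequences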
Btw, I note yet again that AlphaCode performs abysmally on the formal benchmark included in the arxiv preprint (see Section 5.4 and Table 10). That makes sense, because AlphaCode is a very dumb generate-and-test, brute-force search approach that doesn't even try to be smart and tries to make up for the lack of intelligence with an awesome amount of computational resources. Most work in program synthesis is also basically a search through the space of programs, but people in the field have come up with sophisticated techniques to avoid having to search an infinite number of programs, and to avoid having to generate millions of program candidates, like DeepMind actually brags about:
"At evaluation time, we create a massive amount of C++ and Python programs for each problem, orders of magnitude larger than previous work."
They say that as if generating "orders of magnitude more" programs than previous work is a good thing, but it's not. It means their system is extremely bad at generating correct programs. It is orders of magnitude worse than earlier systems, in fact.
(The arxiv paper linked from the article quantifies this "massive" amount as "millions"; see Section 4.4).
Well, with respect to Go, the fundamental difference afaict is that you can apply self-supervised learning (self-play), which is an incredibly powerful approach (but note, e.g., that even this approach wasn't successful in "solving" StarCraft). Unfortunately it's extremely difficult to frame real-world problems in that setting. I don't know anything about protein folding and don't know what DeepMind uses to try to solve that problem, so I cannot comment on that.
> this approach wasn't successful in "solving" StarCraft
Why do you say that? As I understand it, AlphaStar beat pros consistently, including a not widely reported showmatch against Serral when he was BlizzCon champ.
1. First, though I am not sure of this (i.e. this should be verified), I heard that the team working on AlphaStar initially tried to create a StarCraft AI entirely through "self-play," but this was not successful. (Intuitively, in a real-time game, there are so many bad options early on that even with a LOT of time to learn, if your approach is too "random" you will quickly enter an unwinnable position and not learn anything useful.) As a result, they replaced this approach with one that incorporated learning from human games.
2. "including a not widely reported showmatch against Serral when he was BlizzCon champ." is a mischaracterization. It was not a "showmatch," rather there was a setup at Blizzcon where anyone could sit down and play against AlphaStar, and Serral at some point sat down to play AlphaStar there. He went 0-4 vs AlphaStar's protoss and zerg, and 1-0 vs its Terran. However, not only was he not using his own keyboard and mouse, but he could not use any custom hotkeys. If you do not play Starcraft it may not be obvious just how large of a difference this could make. BTW, when Serral played (perhaps an earlier iteration of) AlphaStar's terran on the SC2 ladder, he demolished it.
I remember when seeing the final report, I was a bit disappointed. It seemed like they cut the project off at a strange point, before AlphaStar was clearly better than humans. I feel that if they had continued they could have gotten to that point, but now we will never know.
"It seemed like they cut the project off at a strange point, before AlphaStar was clearly better than humans. I feel that if they had continued they could have gotten to that point"
What if that's why they cut it off...
I think the GP means that the AlphaStar team stopped working on the project because they felt it was reaching a dead end and unlikely to produce further results, or at least other ventures might have been more promising.
I think that's most likely the case too, otherwise why would they give up?
I guess I feel that there is a big discontinuous jump between "not clearly better than humans" and "clearly better than humans," where the latter is much, much more significant than the former. It seems like going on a hike and stopping before the summit.
I looked into this again and the hotkey situation seems more unclear than I suggested. You could not log into your Battle.net account, so it would have been somewhat time consuming to change all of your settings manually. If I had to guess, I might wager that Serral changed some of the more important ones manually but not the others, but this is just conjecture and maybe he changed all of them. I don't know if anyone but Serral would know this, however.
In any case, Serral said this, which you can take as you will:
"It was okay, I doubt i would lose too many games with a proper setup. I think the 6.3-6.4 mmr is pretty accurate, so not bad at all but nothing special at the same time."
On the one hand, surely it doesn't seem surprising that the player who lost, the human, would say the above, and so one may be skeptical of how unbiased Serral's assessment is. On the other hand, I would say that Serral is among the more frank and level-headed players I've seen in the various videogames I've followed, so I wouldn't be too hasty to write off his assessment for this reason.
Not once humans adapted to it afaik. AlphaStar got to top grandmaster level and then that was it, as people found ways to beat it. Now, it may be that the team considered the project complete and stopped training it. But technically - as it stands - Starcraft is still the one game where humans beat AI.
No, the version which played on ladder was much weaker than the later version which played against pros and was at BlizzCon -- the later version was at professional level of play.
There were numerous issues. The first (somewhat mitigated later on) was the extremely large number of actions per minute and (most importantly) the extremely fast reaction speed.
Another big issue is that the bot communicated with the game via a custom API, not via images and clicks. Details of this API are unknown (like how invisible units were handled), but it was much higher-level than what a human has to work with (pixels).
If you look at the games, the bot wasn't clever (which was the hope), just fast and precise. And some people far from the top were able to beat it convincingly.
And now the project is gone, even before people had a chance to really play against the bot and find more weaknesses.
The difference is that DreamCoder has a hand-crafted PCFG [1] that is used to generate programs, rather than a large language model. So the difference is in how programs are generated.
________
[1] The structure of the PCFG is hand-crafted, but the weights are trained during learning in a cycle alternating with neural net training. It's pretty cool actually, though a bit over-engineered if you ask me.
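For anyone who hasn't seen a PCFG used this way, the generation side is easy to sketch; the grammar and weights below are made up for illustration, not DreamCoder's DSL, and in DreamCoder only the weights would get re-estimated in the alternating training cycle the footnote mentions.

    # Toy weighted-grammar sampler (made-up rules and weights, illustration only).
    import random

    GRAMMAR = {
        # nonterminal: [(production, weight), ...]
        "EXPR": [(["NUM"], 0.4),
                 (["(", "EXPR", "OP", "EXPR", ")"], 0.6)],
        "NUM":  [(["x"], 0.5), (["1"], 0.25), (["2"], 0.25)],
        "OP":   [(["+"], 0.5), (["*"], 0.5)],
    }

    def sample(symbol="EXPR", max_depth=6):
        if symbol not in GRAMMAR:        # terminal symbol: emit it as-is
            return symbol
        productions, weights = zip(*GRAMMAR[symbol])
        if max_depth <= 0:               # force termination at the depth limit
            production = productions[0]
        else:
            production = random.choices(productions, weights=weights)[0]
        return "".join(sample(s, max_depth - 1) for s in production)

    print(sample())                      # e.g. "(x+(2*1))"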
Right, I think it's a bit crazy not to use a grammar as part of the generation process when you have one. My guess is that constraining LLM generation with a grammar would make it way more efficient. But that's more complicated than just throwing GPT-3 at all of GitHub.
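As a sketch of what "constraining generation with a grammar" could look like mechanically (the lm_logits and allowed_next_tokens functions below are hypothetical stand-ins, not any real model or parser API): at each decoding step, mask out every token the grammar forbids, then sample from what's left.

    # Hedged sketch of grammar-constrained decoding; not a real library's API.
    import numpy as np

    def constrained_decode(lm_logits, allowed_next_tokens, vocab, max_len=128):
        tokens = []
        for _ in range(max_len):
            allowed = list(allowed_next_tokens(tokens))    # legal next token ids
            if not allowed:                                # grammar says: stop
                break
            logits = lm_logits(tokens)                     # shape: (len(vocab),)
            mask = np.full(len(vocab), -np.inf)
            mask[allowed] = 0.0                            # everything else forbidden
            masked = logits + mask - logits[allowed].max() # stabilized softmax
            probs = np.exp(masked)
            probs /= probs.sum()
            tokens.append(int(np.random.choice(len(vocab), p=probs)))
        return [vocab[t] for t in tokens]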
Also, my understanding is that DreamCoder does some fancy PL theory stuff to factorize blocks of code with identical behavior into functions. Honestly I think that's the key advance in the paper, more than the wake-sleep algorithm they focus on.
Anyway, the point was more that self-supervised learning is quite applicable to learning to program. I think the downside is that the model learns its own weird, non-idiomatic conventions, rather than copying GitHub.
I guess you're right. The sleep-wake cycle is like a kind of roundabout and overcomplicated EM process. I've read the paper carefully but theirs is a complicated approach and I'm not sure what its contributions are exactly. I guess I should read it again.
Yes, it's possible to apply self-supervised learning to program synthesis, because it's possible to generate programs. It's possible to generate _infinite_ sets of programs. The problem is that if you make a generator with Universal Turing Machine expressivity, you're left with an intractable search over an infinite search space. And if you don't generate an infinite set of programs, then you're left with an incomplete search over a space that may not include your target program. In the latter case you need to make sure that your generator can generate the programs you're looking for, which is possible, but it limits the approach to only generating certain kinds of programs. In the end, it's easiest to create a generator for programs that you already know how to write, and no others. How useful that is remains an open question. So far no artificial system has ever made an algorithmic contribution, to my knowledge, in the sense of coming up with a new algorithm for a problem for which we don't have good algorithms, or coming up with an algorithm for a problem we can't solve at all.
My perception is influenced by my studies, of course, but for me, a more promising approach than the generate-and-test approach exemplified by DreamCoder and AlphaCode etc. is Inductive Programming, which is to say, program synthesis from input-output examples only, without examples of _programs_ (the AlphaCode paper says that is an easier setting, but I strongly disagree). Instead of generating a set of candidate programs and trying to find a program that agrees with the I/O examples, you have an inference procedure that generates _only_ the programs that agree with the I/O examples. In that case you don't need to hand-craft or learn a generator. But you do need to impose an inductive bias on the inference procedure that restricts the hypothesis language, i.e. the form of the programs that can be learned. And then you're back to worrying about infinite vs. incomplete search spaces. But there may be ways around that, ways not available to purely search-based systems.
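To make the "inference rather than generate-and-test" point concrete with a deliberately tiny example (my own toy, not taken from any real IP/ILP system): if the inductive bias says the target is an affine function f(x) = a*x + b, you can read the hypothesis directly off two I/O examples and merely verify it against the rest, instead of enumerating candidate programs and testing each one.

    # Toy illustration of bias-driven inference from I/O examples (not an ILP system).
    def infer_affine(examples):
        """examples: list of integer (x, y) pairs believed to satisfy y = a*x + b."""
        (x1, y1), (x2, y2) = examples[0], examples[1]
        a = (y2 - y1) // (x2 - x1)
        b = y1 - a * x1
        hypothesis = lambda x: a * x + b
        # Verify against all examples; reject if the bias doesn't fit the data.
        if all(hypothesis(x) == y for x, y in examples):
            return f"f(x) = {a}*x + {b}"
        return None

    print(infer_affine([(1, 5), (2, 8), (3, 11)]))   # f(x) = 3*x + 2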
Anyway, program synthesis is a tough nut to crack and I don't think that language models can do the job, just like that. The work described in the article above, despite all the fanfare about "reasoning" and "critical thinking", is only preliminary and its results are not all that impressive. At least not yet. We shall see. After all, DeepMind has deep resources and they may yet surprise me.
4. This is an overview of Meta-Interpretive Learning (MIL), a new approach to ILP that overcomes many difficulties of earlier approaches (Full disclosure: my own work is on MIL, though not the article linked):
Meta-Interpretive Learning: achievements and challenges (Stephen Muggleton, 2017)
That should be enough to get you started. I recommend reading in the order I linked to the various articles. I tried to give links to documents that I know can be read for free.
Unfortunately most of the material on ILP is either in scholarly articles, or, where there are textbooks, they tend to be older. That sounds bad, but there has been much new work recently with several new approaches.
Let me know if you're looking for more specific information. See my signature for contact details; I'm happy to answer emails about ILP :)
It should be emphasised that inductive programming is not tied to logic programming, and works for every other programming paradigm as well, e.g. functional programming [1, 2]. We could also do IP for imperative programming, although, as far as I am aware, nobody has done this.
That's absolutely true! But the OP asked about ILP in particular.
To be fair, logic and functional programming languages do have some advantages as target languages for Inductive Programming compared to imperative languages, in that they have very simple syntax. For example, Prolog doesn't even have variable declarations. That's very convenient because the learning system only needs to learn the logic of the program, not the syntax of the language as well. It's also much simpler to define language bias, program schemata, and other constraints on the form of hypotheses in such languages, or even to order programs by generality. For instance, Prolog has unification built in, and unification is used in ILP to order programs by generality (by testing for subsumption). All this machinery would have to be implemented from scratch in an imperative language.
Although logic and functional programming languages are probably given more weight in IP for historical reasons: Lisp and Prolog were, for a long time, "the languages of AI".
I'm trying to remember... I think there's been some IP work on imperative languages, maybe even Python. I'll need to check my notes.
Not naive at all! One common categorisation of ILP approaches is by whether they search for programs from the most to the least general (least general is more specific), or from the least to the most general. Some approaches do a little bit of both. Approaches that search from general to specific are known as "top-down" and approaches that search from specific to general are known as "bottom-up".
The "top" and "bottom" terms refer to a lattice of generality between programs, where generality is typically measured by subsumption or entailment etc. Subsumption in particular is a syntactic relation (that implies a semantic one, entailment) so "searching" a space of logic programs ordered by subsumption means in practice that the space of programs is constructed by generalising or specialising some starting program by means of syntactic transformation according to subsumption (e.g. a first order clause can be specialised by adding literals to it: P(x):- Q(x) subsumes P(x):- Q(x), R(x). The simplest intuition is to remember that by adding more conditions to a rule we make it harder to satisfy).
A more general program entails more logical atoms and ILP algorithms are typically trained on both positive and negative example atoms of a target program, so top-down approaches begin with an over-general program that entails all the positive examples and some or all of the negative examples and specialise that program until it entails only the positive examples. Bottom-up approaches start with an over-specialised program that entails none of the positive examples and generalise it until it entails all the positive examples.
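Here is a deliberately simplified, propositional sketch of the top-down direction (mine, not taken from any ILP system, with made-up examples): start from the most general rule, an empty body that covers everything, and keep adding body conditions until no negative example is covered.

    # Toy top-down specialisation over made-up examples (illustration only).
    POS = [{"has_wings", "has_feathers", "light"},   # sparrow
           {"has_wings", "light"}]                   # bat
    NEG = [{"has_wings", "has_feathers"},            # penguin
           {"light"}]                                # mouse
    CANDIDATE_LITERALS = {"has_wings", "has_feathers", "light"}

    def covers(body, example):
        return body <= example       # every body condition holds in the example

    body = set()                     # most general clause: covers every example
    while any(covers(body, e) for e in NEG) and CANDIDATE_LITERALS - body:
        # Greedily add the literal that keeps positives and prunes negatives.
        body.add(max(
            CANDIDATE_LITERALS - body,
            key=lambda lit: (sum(covers(body | {lit}, e) for e in POS)
                             - sum(covers(body | {lit}, e) for e in NEG)),
        ))

    print("learned: can_fly(X) :- " + ", ".join(f"{lit}(X)" for lit in sorted(body)))
    # learned: can_fly(X) :- has_wings(X), light(X)

A real ILP system works over first-order clauses and uses subsumption or entailment rather than a subset test, but the general-to-specific search direction is the same.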
The mathematics of generalisation are at the core of ILP theory and practice. It's what sets ILP apart from statistical machine learning which is based on the mathematics of optimisation.
That’s a big question but I’m tempted to answer it with a yes. A protein sequence contains a complete description of the structure of a protein but a coding question contains unknowns and the answers contain subjective variability.
We have a clue as to what it is (these are just functions at the end of the day) but don't know how the model's learned parameters relate to the problem domain. I saw a talk (maybe of Jeff Dean?) a while back that discussed creating models that could explain why certain features weighed more than others. Maybe with more approaches targeted towards understanding, these algorithms could start to seem less and less like a semantically opaque computational exercise, and more in line with how we humans think about things.
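I don't know which technique that talk described, but for a flavour of the genre, here's a minimal permutation-importance sketch on toy data (a generic method, purely illustrative of "which features mattered"):

    # Generic permutation-importance sketch on a toy dataset (illustrative only).
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)
    baseline = model.score(X, y)

    rng = np.random.default_rng(0)
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # destroy feature j's signal
        print(f"feature {j}: accuracy drop {baseline - model.score(X_perm, y):.3f}")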
GitHub Copilot scares me every time I write code on my personal PC and get those auto-suggestions. I am happy we don't have it at work yet.
It is clear that writing code will soon be a thing of the past; maybe it is a bad idea to train our children to code. Let's make sure we milk every penny before the party is over!
Maybe… maybe… tools like Copilot will allow us to work at a higher level of abstraction (like optimizing compilers have allowed us to do).
I say maybe because so far the code that Copilot has generated for me has been impressive for what it is, but riddled with obvious and subtle bugs. It’s like outsourcing my function implementations to a C-student undergraduate intern. I definitely wouldn’t use any of its code without close scrutiny.
AI will make some software engineering tasks more efficient and more accessible but human programmers are not going anywhere any time this side of the Singularity.