2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too-long contracts.
3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...
I was so confused by the saltiness until I saw the username. I'm sure you've earned it.
I got into deep learning because of your char-rnn posts a while ago -- they inspired me to do an undergrad thesis on the topic. I read arxiv papers after that and implemented things from the ground up until a startup liked my work and hired me as a neural network engineer.
Fast forward a few years, and I was enamoured with minGPT and it stuck with me. I wanted a CIFAR10 experimentation toolbench, so I took my best swing at applying the minGPT treatment to the current best single-GPU Dawnbench entry, added a few tweaks, and got https://github.com/tysam-code/hlb-CIFAR10. It currently (AFAIK) holds the world record for training to the 94% mark, by a fair bit.
It's about 600 lines in a monolithic file, requiring only torch and torchvision, but it's my first project like this and I'd like to learn how to better minify codebases of this kind. The hardest part seems to be knowing how to structure inheritance and abstraction, and I don't know if you have any good outside references/resources that you used or would recommend. If you have any feedback or help, I'm open to it, as I am very much a newbie at this particular art/science. It is quite a fun one, though (especially as it's a useful tool for my day-to-day work).
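To make the question concrete, the layout I mean looks roughly like the sketch below: hyperparameters in one dict at the top, flat functions instead of class hierarchies, one entry point. This is a toy illustration, not hlb-CIFAR10's actual code -- the network is a deliberately tiny stand-in:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision
    import torchvision.transforms as T

    # all knobs in one place, at the top of the file
    hyp = {"batch_size": 512, "lr": 1e-3, "epochs": 10,
           "device": "cuda" if torch.cuda.is_available() else "cpu"}

    def get_loader():
        # CIFAR10 train split, ToTensor() only -- no augmentation, for brevity
        data = torchvision.datasets.CIFAR10(".", train=True, download=True,
                                            transform=T.ToTensor())
        return torch.utils.data.DataLoader(data, batch_size=hyp["batch_size"],
                                           shuffle=True)

    def make_net():
        # flat constructor function instead of a subclassed nn.Module
        return nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

    def train():
        net, loader = make_net().to(hyp["device"]), get_loader()
        opt = torch.optim.AdamW(net.parameters(), lr=hyp["lr"])
        for _ in range(hyp["epochs"]):
            for x, y in loader:
                x, y = x.to(hyp["device"]), y.to(hyp["device"])
                loss = F.cross_entropy(net(x), y)
                loss.backward()
                opt.step(); opt.zero_grad()

    if __name__ == "__main__":
        train()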
I'm also hoping to apply the same treatment to a small language model at some point, taking the Dawnbench approach -- picking a good target validation loss or some other reasonable metric, then optimizing around it obsessively to build a good tiny reference model. I don't know if you know anyone interested in that kind of thing, but it feels like a fun next step for me.
Extremely interested in your take on where language/reasoning competency ends and knowledge retrieval begins.
OpenAI stuff has succeeded in part because it can synthesize good bullshit* on a huge variety of topics. For many purposes this makes it as good as asking someone in the same room to look something up for you on Wikipedia.
But while vast general and somewhat specialized knowledge is very impressive, comprehension and reasoning ability can exist without it. We know from our own human experience that general knowledge is useful to have, but not the same thing as intelligence or wisdom. It seems rational to think that the size of model needed to reach ChatGPT's level of coherence and rationality is much smaller than that required to also encode enough general knowledge to be informative on just about any topic, most of which is not language-specific.
* in the Frankfurtian sense of 'information provided without regard to its correctness'
This is why the 'chat bot' -> AI pathway has always felt to me quite analogous to the 'chess bot' -> AI pathway. We're constantly trying to replicate things that look like demonstrations of intelligence, but never really bothering with what intelligence is. What I mean is that a man lifting 400kg is a demonstration of exceptional athleticism. A 400kg man sitting on a balance and sending 400kg up on the other side is not, even though if we only observe the output (400kg goes up), the two are absolutely identical.
This isn't just a 'only humans can be intelligent' type argument, but emphasizing that what we want and what we're pursuing seem to be quite different. Newton deriving the inverse square law of gravitational attraction by observing things fall on Earth and watching the celestial bodies in the sky - that is an application of the sort of intelligence that we want. Asking a student to memorize and later recite that the gravitational force is proportional to m1*m2/r^2 is the sort of intelligence that we're building. And it's not like the latter leads to the former, of course it's the exact opposite!
We don't know how to define the essence of intelligence, so all we can do is knock down the strawmen we traditionally attribute to intelligence, until we're either left with the essence of true intelligence, or we've knocked everything down and find that intelligence was just a bunch of tricks after all.
I don't think the essence is especially elusive. It's the ability to make novel, meaningful, and useful discoveries from precepts that don't immediately "obviously" lead to those discoveries. Observing the sky and nature leading to a mathematical formulation of gravity is an absolutely amazing leap. In part because of what was done, but perhaps even more so for even beginning to imagine it was something that could be done.
Imagine stargazing in a time prior to Newton, observing objects falling on Earth, and somehow managing to derive an accurate mathematical formulation of something you had no reason to imagine even existed. Man hadn't been to space, let alone the moon, and so for all anyone knew, an apple dropped anywhere would just as well fall. Perhaps Newton may have begun his discovery by asking himself where it would fall, but that's a tangent.
---
Humanity's entire existence has been but a blink of time on any sort of timescale, besides our own lives. And in that time we went from the bleeding edge of technology, perhaps pun intended, being 'poke them with the pointy end' to having men travel into space, voyage to other 'planets', land on them, and imminently live on them.
A truly intelligent machine, given its capacity for practically infinite storage, infinitely more accurate recall, and arbitrarily small 'generations' (in terms of recursive self-improvement), ought to be able not only to match this, from a similar starting base of knowledge, but to move beyond it at an extremely swift rate.
So I don't think it would be particularly ambiguous, or debatable. An intelligent machine would quickly advance nearly all fields of human endeavour by an unimaginable amount. Given the ability of a machine to also scale its own processing capacity to arbitrarily higher degrees (while we're more or less stuck with fixed 'hardware'), this should be able to continue for an exceptionally long time as well.
You would end up completely revolutionizing humanity to a degree we can't even really imagine today - repeatedly, and it would all happen within a matter of years. I'm even being somewhat generous here by allowing for a greater intelligence to remain mindless and servile, which seems quite antithetical to intelligence. But if we can "just" get to here, as described, I think that'd be pretty compelling, servility notwithstanding.
> practically infinite storage, infinitely more accurate recall
You can already hold all of humanity's written cultural knowledge on a thumbnail-sized drive. Recall is pretty accurate, too.
> revolutionizing humanity
Tell that to anyone who lived 100 or 1000 years ago. Worldwide instant communication. A cultural species of interconnected thought, a network the size of a planet. Augmented with tools of perfect memory recall and error-free precision calculation, way beyond what an unaugmented human brain can do. Welcome to the present. We even have remote-controlled robots working for us on other planets.
I'll keep my ad-blocker enabled, though. But allow me to think of it as a brain augmentation. We do what we always did: cultural evolution. We are creating tools to augment ourselves, not machines that are independent from us. As we integrate those tools into our daily routine, the tools are shaping our practice and needs. So we again create different tools, or ways of living. We diversify, then copy the successful. What's the point of creating a new god of intelligence? We have plenty of gods already. Let's maybe study and discuss non-human intelligence instead. (Edit: rewrote last paragraph.)
> I don't think the essence is especially elusive. It's the ability to make novel, meaningful, and useful discoveries from precepts that don't immediately "obviously" lead to those discoveries.
But that's trivial: simply enumerate all Turing machines that reproduce the observed outputs. The point of intelligence is that it somehow "filters" the space of possible theories in some specific, computable way.
The ideal model of this is Solomonoff induction, which orders Turing machines by Kolmogorov complexity, but that ordering is not computable. So intelligence is some computable approximation of this, but discovering the specifics of how that works is non-trivial.
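As a toy illustration of what a computable approximation could look like: enumerate programs in a tiny invented language in order of length (a crude stand-in for Kolmogorov complexity), run each under a hard step budget, and keep the shortest one that reproduces the observations. The op set and bounds here are made up purely for illustration -- real proposals like Levin search are far more careful:

    from itertools import product

    # a tiny invented instruction set; programs are sequences of these ops
    OPS = {"+1": lambda x: x + 1,
           "*2": lambda x: x * 2,
           "^2": lambda x: x * x}

    def run(program, x0, n_outputs, max_steps=100):
        # apply the program repeatedly to x0, emitting one value per pass;
        # give up if the step budget or a size bound is exceeded
        seq, x, steps = [], x0, 0
        while len(seq) < n_outputs:
            for op in program:
                x = OPS[op](x)
                steps += 1
                if steps > max_steps or abs(x) > 10**9:
                    return None
            seq.append(x)
        return seq

    def shortest_explanation(observed, x0, max_len=4):
        # length-ordered enumeration: shorter programs are tried first,
        # so the first match is also the "simplest" under this ordering
        for length in range(1, max_len + 1):
            for program in product(OPS, repeat=length):
                if run(program, x0, len(observed)) == observed:
                    return program
        return None

    print(shortest_explanation([2, 4, 8], x0=1))  # ('*2',): doubling explains the data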
I would define intelligence not in terms of search, but creation. And the two are indeed different. As one simple example - early man had no concept of math, or even numbers. Incidentally, the same is even true of some isolated tribes to this day [1]. Somehow we created numbers, seemingly from nothing. And it was only this creation that enabled us to move onto even more creation where the search space continues to grow ever wider, yet we continue to pull something from nothing. It's not like we had any real basis for the formulation of numbers, or even reason to imagine they existed.
The further back you go in our development, the greater the distinction between creation and search becomes. The article itself even gets into a bit of a paradox on this note. It suggests that language defines thought, and that since they have no numbers in their language, they cannot think about numbers. But then how do we have numbers? Somebody was certainly able to think about them, and not because they started with numbers in their language. And for that matter, how do we even have language? Another thing that was developed from absolutely nothing. Go far enough back in our evolutionary timeline and we wouldn't have even had the ability to express e.g. 'angry noise'. Yet somehow we created such things -- again, seemingly from nothing. And I think that is the purest essence of intelligence.
But everything expressible by humans is expressible by a Turing machine, so there is no fundamental difference between search and creation since Turing machines are recursively enumerable.
We didn't create numbers seemingly from nothing; they were necessary to track our food and our children or family. Even crows have the ability to count.
Again, read the paper. Numbers were thought to be an intuitive concept - they are not, not even amongst humans. One no more needs numbers to keep track of one's family or food than one needs calculus to say, "Wow, that thing's speeding up." They have "one", "two", and "many", and that works for all their purposes.
Basically, look to any example where what is discovered is not a recombination of preexisting knowledge but the emergence of new knowledge, and you'll find search is pointless. As an example, consider hand washing. Nowadays we all intuit that hand washing is a good way to prevent the spread of disease. But of course that intuition exists because we all know of and accept a germ-based theory of disease. A couple of hundred years ago this was not true. Go a little further back and the concept of germs did not even exist. And so surgeons did not regularly wash their hands, even before doing things like surgery.
Now I challenge you, even in wildly hand-wavey fashion, to describe the creation of a Turing machine that could, from the knowledge base of an individual of such times, "discover" the secret of hand washing. There were no records kept on hand washing vs. illness rates, or anything of the sort, because nobody even stopped to consider the impact it might be having.
The difficulty you're going to face here is that there is no preexisting knowledge to draw upon. You are not "searching" for an answer, but having to create it, seemingly from nothing.
> Again, read the paper. Numbers were thought to be an intuitive concept - they are not, not even amongst humans.
Your link doesn't prove anything; it contains testimony from experts arguing both sides. Odd that you think one side is automatically correct based on one study that was inconclusive and, at best, a clear example of an exception to the rule.
> Basically, look to any example where what is discovered is not a recombination of preexisting knowledge but the emergence of new knowledge, and you'll find search is pointless
I think you'll find it much harder to argue this point than you think. Most such discoveries result from simple observations of the world, so the information was already out there, people just didn't notice it before.
> Now I challenge you, even in wildly hand-wavey fashion, to describe the creation of a Turing machine that could, from the knowledge base of an individual of such times, "discover" the secret of hand washing
Exactly the way it happened: someone noticed that fewer people died in hospitals where the doctors washed their hands after performing autopsies. The scientific process is reliable because it's mechanistic, repeatable. All scientific knowledge derives from simple, repeatable observations like this.
The closest thing you'll find to true invention is maybe math and various logics. But even then, this is often simply a process of permuting existing axioms, and adding a new randomly generated axiom to see if anything interesting happens. This is a search process, most of whose results will be internally inconsistent and so get discarded quickly by the human mind with its effective pattern matching.
Hand washing records were not kept for the reason I've mentioned multiple times: nobody ever thought it relevant, so nothing was recorded. You cannot simply search the records for something which does not exist! Even when one doctor finally did discover the value of hand washing, through a remarkable degree of serendipity, his hypothesis, based on anecdotal evidence, was rejected because it did not line up with the scientific thought of the time. He ended up in an insane asylum and died. Like a Greek tragedy, his cause of death was an infected wound on his hand, very possibly caused by excessive washing! [1]
So now we return to the same question. How do you expect a machine to simply discover the value of hand washing? Let alone carry out tests? You have no data on hand washing whatsoever as it's not seen as relevant. It's a rather random hypothesis that, given the knowledge of the time, would have less than zero basis for support. There is no logical reason for its discovery, nor ought it ever be prioritized highly in any way whatsoever.
And this is, in many ways, the rule more than the exception for discovery.
On the numbers issue: I was not referring to "views" but facts. The tribesmen had terms only for 1, 2, and many. And while the article doesn't mention it, it's safe to assume they have no system of mathematics. Assuming they are not uniquely limited, we were all in a similarly limited state of understanding at some point. Searching for where we are, from the state of where they are, will yield no results. Yet somehow, we achieved it.
Anyone at Google Cloud out there? It seems I can't get my GPU quota raised to 40 x V100 as an independent researcher. I was told that setting up a website would help, but I would rather not. I can pay the bills...
If you're actually an independent researcher, sometimes you can find professors at universities or national labs that are willing to help out in exchange for credits on the paper. I've had success at [redacted] labs in the New Mexico region as well as folks from my previous university. The trick is asking people who do research that's sort of adjacent to your field.
@ingenieroariel, sorry for this trouble.
I am product manager for Cloud TPU, I would be happy to connect you with my GPU colleagues and also explore if Cloud TPU can help with your research as well.
What's the best way to connect with you?
Oh, I quit recently.
Very surprisingly, I learned it was harder to get access to GPUs at a big tech company, outside the dedicated research teams, than it is for a scrappy hacker on side projects with lots of savings. So I quit to work on those projects. I miss having a dedicated infra team, but I don't miss having to beg for resources. I wanted to use Google Cloud, since I use some of their other services for this, but could not figure out how to get them to increase quotas and take my money. I was willing to pay nearly double the hourly rate I'm currently paying at CoreWeave in order to use only one cloud provider, but they just wouldn't sell it to me.
Out of curiosity, just how many resources are you able to access with Google Colab, and can you use it for AI training? I've been getting more interested in fintech and trying my hand at predictive/pattern-recognition AI, but the costs are incredibly prohibitive to getting started, both hardware and data.
It is sometimes far more convenient to avoid the paper trail and bureaucracy of provisioning things internally, much as renting GPUs online for short periods avoids the cost and time of maintaining them onsite.
GPT-J, I think, hasn't gone beyond 20B parameters, and while it's not the most obvious reading, I think the original question is asking about the full 180B+ parameter kind of model. :) :thumbsup:
The Pile seemed quite clean and manageable to me (I was able to preprocess it in ~8 hours for a simple task on consumer-grade hardware). Is the Pile clean and rich enough for LLM training too?
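For reference, a streaming pass over one shard can look something like this (the Pile ships as .jsonl.zst, one JSON object per line with a "text" field); the filename, length threshold, and crude exact-hash dedup here are placeholder choices, not anyone's actual pipeline:

    import io, json
    import zstandard as zstd  # pip install zstandard

    seen = set()
    with open("00.jsonl.zst", "rb") as fh:  # one Pile shard
        reader = io.TextIOWrapper(
            zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            text = json.loads(line)["text"]
            key = hash(text)  # exact dedup only; use MinHash for near-dups
            if key in seen or len(text) < 128:
                continue      # skip duplicates and very short documents
            seen.add(key)
            # ... tokenize / filter / write out here ...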
> 2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too-long contracts.
It seems like the standard InstructGPT model itself is based on a ~1 billion param GPT model. Wouldn't that fit on a 24GB RTX 3090? It might take longer, and maybe there's not enough opportunity for hyper-parameter search, but it's still possible, right? Or is hyper-parameter search on a thousand machines in parallel the real magic sauce here?
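Back-of-envelope, under the usual mixed-precision Adam accounting (and assuming the ~1.3B-param InstructGPT variant -- my assumption, not something from the blog post):

    # rough memory estimate for training a ~1.3B-param GPT with Adam
    params = 1.3e9
    bytes_per_param = (2      # fp16 weights
                       + 2    # fp16 gradients
                       + 4    # fp32 master copy of weights
                       + 8)   # fp32 Adam first and second moments
    print(f"model+optimizer state: ~{params * bytes_per_param / 1e9:.0f} GB")
    # -> ~21 GB, before activations (which grow with batch size and
    # sequence length). So a 24GB 3090 is borderline without tricks like
    # gradient checkpointing, ZeRO offload, or 8-bit optimizers.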
> 3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...
Sounds like a good opportunity to learn. No pain, no gain :-)
Maybe somebody will open-source the equivalent datasets for this soon? Otherwise the data collection seems prohibitively expensive for somebody trying to do this for fun: contract expert annotators, train them, annotate/re-annotate for months?
I would love to know what your thoughts are on how software engineering (and jobs in general) will change over the next 10 years, and what we lowly developers can do to keep up and maybe even be involved in that change.
Andrej wrote an interesting post some years back titled Software 2.0 about the direction he saw software engineering going. It's more about changes in software than the changes in the job market, but I suspect you'd still find it interesting. https://karpathy.medium.com/software-2-0-a64152b37c35
Today to me is the equivalent of the phone phreaking days when people are just doing as much as they can and getting away with as much as they can for as long as they can, until the regulations come.
It will be an interesting time over the next few years, as I think StabilityAI's guerrilla marketing tactics have inadvertently placed the ML dataset debate right in the laps of the larger consumer market.
1. collect a very large dataset, see: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla... . scrape, de-duplicate, clean, wrangle. this is a lot of work regardless of $.
2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too-long contracts.
3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...
4. follow the 3-step recipe of https://openai.com/blog/chatgpt/ to finetune the model to be an actual assistant instead of just a "document completor", which otherwise happily e.g. responds to questions with more questions. Also see e.g. OPT-IML https://arxiv.org/abs/2212.12017 or BLOOMZ https://arxiv.org/abs/2211.01786 to get a sense of the work involved here.
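To make step 4 slightly more concrete, here's a minimal sketch of just the first of those three steps (supervised finetuning on demonstration data) in plain PyTorch + Hugging Face transformers. The file name and record format are hypothetical, gpt2 stands in for your pretrained checkpoint, and the reward-model and PPO stages are omitted entirely:

    import json, torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

    # assumed demo-data format: {"prompt": "...", "response": "..."} per line
    records = [json.loads(l) for l in open("sft_data.jsonl")]
    texts = [r["prompt"] + "\n" + r["response"] + tok.eos_token for r in records]

    def collate(batch):
        enc = tok(batch, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
        enc["labels"] = enc["input_ids"].clone()
        enc["labels"][enc["attention_mask"] == 0] = -100  # ignore pad positions
        return enc

    loader = DataLoader(texts, batch_size=4, shuffle=True, collate_fn=collate)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    model.train()
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # causal LM loss over prompt+response
        loss.backward()
        opt.step(); opt.zero_grad()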