This genuinely seems like magic to me, and I don't quite know how to place it in my mental model of how computation works. A couple of questions/thoughts:
1. I learned that NNs are universal function approximators - and the way I understand this is that, at a very high level, they model a set of functions that map inputs to outputs for a particular domain. I certainly get how this works, conceptually, for, say, MNIST. But for the stuff described here... I'm kind of baffled.
So is GPT's generic training really causing it to implement/embody a value mapping from pixel intensities to HTML+Tailwind text tokens, such that a browser's subsequent interpretation and rendering of those tokens approximates the input image? Is that (at a high level) what's going on? If it is, GPT is modelling not just the pixels->HTML/CSS transform but also how HTML/CSS is rendered by the browser black box. I can kind of accept that such a mapping must necessarily exist, but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind. Is the way I'm thinking about this useful? Or even valid?
2. Rather more practically, can this type of tool be thought of as a diagram compiler? Can we see this eventually being part of a build pipeline that ingests Sketch/Figma/etc. artefacts and spits out HTML/CSS/JS?
An LLM is really a latent space plus the means to navigate it. Now a latent space is an n-dimensional space in which ideas and concepts are ordered so that those that are similar to each other (for example, "house" and "mansion") are placed near each other. This placing, by the way, happens during training and is derived from the training data, so the process of training is the process of creating the latent space.
To visualize this in an intuitive way, consider various concepts arranged on a 2D grid. You would have "house" and "mansion" next to each other, and something like "growling" in a totally different corner. A latent space -- say, GPT-4's -- is just like this, only it has hundreds or thousands of dimensions (in GPT-4's case, 1536), and that difference in scale is what makes it a useful ordering of so much knowledge.
To go back to reading images: the training data included images of webpages with corresponding code, and that code told the training process where to put the code-image pair. In general, accompanying labels and captions let the training process put images in latent space just as they do text. So, when you give GPT-4 a new image of a website and ask it for the corresponding HTML, it can place that image in latent space and get the corresponding HTML, which is lying nearby.
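To make the "lying nearby" idea concrete, here's a toy sketch of similarity in an embedding space. The vectors are made up (real embeddings come from a trained model and have hundreds or thousands of dimensions), but the geometry is the same: related concepts score high on cosine similarity, unrelated ones don't.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity: close to 1.0 means the vectors point the same way.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    rng = np.random.default_rng(0)

    # Made-up 8-dimensional "embeddings"; real latent spaces are far larger.
    house = rng.normal(size=8)
    mansion = house + 0.1 * rng.normal(size=8)   # a nearby concept: small perturbation
    growling = rng.normal(size=8)                # an unrelated concept: independent vector

    print(cosine(house, mansion))    # close to 1.0 -> neighbours in the space
    print(cosine(house, growling))   # much lower   -> a different corner of the space

This only illustrates the geometry; it isn't how GPT-4 actually produces HTML from an image.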
Being a universal function approximator means that a multi-layer NN can approximate any continuous function (on a bounded domain) to an arbitrary degree of accuracy. But it says nothing about learnability, and the network required may be unrealistically large.
The learning algorithm used -- backpropagation with stochastic gradient descent -- is not a universal learner: it's not guaranteed to find the global minimum.
Specifically, the "universal function approximate" thing means no more and no less than the relatively trivial fact that if you draw a bunch of straight line segments you can approximate any (1D, suitably well-behaved) function as closely as you want by making the lines really short. Translating that to N dimensions and casting it into exactly the form that applies to neural networks and then making the proof solid isn't even that tough, it's mostly trivial once you write down the right definitions.
Unlikely given the dimensionality and complexity of the search space. Besides, we probably don’t even care about the global minimum: the loss we’re optimising is a proxy for what we really care about (performance on unseen data). Counter-example: a model that perfectly memorises the training data can be globally optimal (ignoring regularization), but is not very useful.
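As a toy picture of the no-global-minimum-guarantee point (function, starting point, and step size all invented for illustration): plain gradient descent on a non-convex 1D loss settles into whichever basin it starts in.

    # A non-convex "loss": global minimum near x = -1.30, local minimum near x = +1.13.
    loss = lambda x: x**4 - 3 * x**2 + x
    grad = lambda x: 4 * x**3 - 6 * x + 1

    x, lr = 2.0, 0.01                 # start on the right-hand slope
    for _ in range(2000):
        x -= lr * grad(x)             # plain gradient descent: no noise, no momentum

    print(x, loss(x))                 # ~ (1.13, -1.07): stuck in the local minimum
    print(-1.30, loss(-1.30))         # the global minimum is much lower (~ -3.51)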
1. Is the way I'm thinking about this useful? Or even valid?
The process is simpler: GPT reads the image and produces a complete description of it, then the user takes that description and writes a prompt asking for a Tailwind implementation of it (a rough sketch of this two-step flow follows after point 2).
2. I see this skipping the Sketch/Figma phase and going directly to a live prototype.
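We don't know the exact pipeline, but a minimal sketch of that two-step flow, assuming the OpenAI Python client (the model name, prompts, and image URL are all illustrative), might look like this:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Step 1: ask a vision-capable model for a detailed description of the screenshot.
    description = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the choice is illustrative
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this webpage's layout, colours, and components in detail."},
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: turn that description into markup with a plain text prompt.
    html = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Implement this page as a single HTML file using Tailwind CSS:\n\n{description}",
        }],
    ).choices[0].message.content

    print(html)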
Your curiosity is a breath of fresh air after months of seeing people argue over pointless semantics, so I'm going to attempt to explain my mental model of how this works.
1- This is a correct but not really useful view, IMO. Saying a network can fit any arbitrary function doesn't tell you whether it will do so given finite resources. Part of your excitement comes from this, I think: we've had these universal approximators for far longer, but we've never had an abstract concept approximated so well. The answer is the scale of the data.

I'd like to pay extra attention to GPT's generic training before moving on to multi-modality. There is a view that compression is intelligence (see the Hutter Prize and Kolmogorov complexity) and that these models are really just good compressors. Given that the model's weights are fixed in size, much smaller than the data we are trying to fit, and the objective is to recover the original text (next-token prediction), there is no way to achieve this other than to compress the data really well. As it turns out, the more intelligent you are, the more you are able to predict/compress, and if you are forced to compress something, you are essentially being forced to gain intelligence.

It's like taking an exam tomorrow on a subject you currently know nothing about: (1) you could memorize potential answers, but (2) if the test is a few thousand questions long and there is no way to memorize them all in the time available, your best bet is to actually learn the subject and hope to derive the answers during the test.

This compression/intelligence duality is somewhat controversial, especially among the HN crowd who deny the generalization abilities of LLMs, but this is my current mental model and I haven't been able to falsify it so far.
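The prediction/compression link can be made concrete: under arithmetic coding, a model that assigns probability p to the next token can encode that token in about -log2(p) bits, so a better predictor is literally a better compressor. A toy calculation with invented probabilities:

    import math

    # Probabilities two hypothetical models assign to the successive tokens of the same text.
    weak_model   = [0.20, 0.10, 0.05, 0.30, 0.15]
    strong_model = [0.80, 0.60, 0.40, 0.90, 0.70]

    bits = lambda probs: sum(-math.log2(p) for p in probs)  # ideal code length

    print(f"weak predictor:   {bits(weak_model):.1f} bits")    # ~14.4 bits
    print(f"strong predictor: {bits(strong_model):.1f} bits")  # ~3.0 bits for the same tokens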
If you accept this view, the multi-modality capability is just engineering. We don't know exactly how GPT-4V works, but we can infer the details from open-source multi-modal research. Given a dataset of image-text pairs where the text explains what's going on in the image (e.g. an image of a cat and a long description of it), we tokenize/embed the image like we do text. This can be done with a Vision Transformer (ViT), where the network generates a visual feature for each patch of the image and puts them in a long sequence. Now, if you give these embeddings to a pretrained LLM and force it to predict the paired description, there is no way to achieve that other than to look at the image embeddings and gain general image understanding. Once the network can understand the information in a given image and express it in natural language, the rest is instruction tuning to make use of that understanding.

Generative image models like Stable Diffusion work similarly, except that you have a contrastive model (CLIP) trained by forcing embeddings of the same concept to be close to each other (e.g. the embedding of a picture of a cat and the embedding of the text "picture of a cat" are pushed together during training). You then use this shared embedding space to steer the generative part of the model.

What's surprising to me in all of this is that we happened to get these capabilities at this scale (lucky), and we can get more capabilities with just more compute. If the current GPT-4 reached a final loss of, say, 1 on the data it has now, it would probably be much more capable if we could somehow get that loss down to 0.1. It's exciting!
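We don't know GPT-4V's internals, but the general recipe from the open-source work described above can be sketched in a few lines of PyTorch (all dimensions and module names are invented, and the ViT's own transformer layers are omitted): split the image into patches, embed each patch, project into the LLM's token-embedding space, and prepend the resulting "visual tokens" to the text sequence.

    import torch
    import torch.nn as nn

    patch_size, d_vision, d_llm = 16, 256, 512     # toy sizes; real models are far larger

    # 1. Patchify + embed: a strided conv turns each 16x16 patch into one feature vector
    #    (this is the patch-embedding step of a Vision Transformer).
    patch_embed = nn.Conv2d(3, d_vision, kernel_size=patch_size, stride=patch_size)

    # 2. A learned projection maps visual features into the LLM's embedding space.
    to_llm_space = nn.Linear(d_vision, d_llm)

    image = torch.randn(1, 3, 224, 224)                  # one RGB image
    feats = patch_embed(image)                           # (1, 256, 14, 14)
    visual_tokens = feats.flatten(2).transpose(1, 2)     # (1, 196, 256): a sequence of patches
    visual_tokens = to_llm_space(visual_tokens)          # (1, 196, 512)

    # 3. Prepend the visual tokens to the embedded caption tokens; training the LLM to
    #    predict the caption forces it to "read" the patches.
    text_tokens = torch.randn(1, 12, d_llm)              # stand-in for embedded caption tokens
    llm_input = torch.cat([visual_tokens, text_tokens], dim=1)   # (1, 208, 512)
    print(llm_input.shape)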
This is my general understanding; I'd like to be corrected on any of it, but I hope you find it useful.
2- It seems to be that way. Probably possible even today.
In the AGI sense of intelligence defined by AIXI, (lossless) compression is only model creation (Solomonoff Induction/Algorithmic Information Theory). Agency requires decision, which amounts to conditional decompression given the model: that is, inferentially predicting the expected value of the consequences of various decisions (Sequential Decision Theory).
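In toy form, "conditional decompression given the model" as decision-making looks like this (the actions, outcome distributions, and utilities are all made up): the same predictive model is queried conditionally on each candidate action, and the agent picks the action with the highest expected value.

    # A predictive model gives P(outcome | action); the agent chooses the action with the
    # highest expected utility under that model -- prediction in the service of decision.
    model = {   # hypothetical conditional distributions over outcomes
        "ship_now":   {"happy_users": 0.4, "bug_reports": 0.6},
        "ship_later": {"happy_users": 0.7, "bug_reports": 0.3},
    }
    utility = {"happy_users": 10.0, "bug_reports": -5.0}

    expected = {
        action: sum(p * utility[outcome] for outcome, p in dist.items())
        for action, dist in model.items()
    }
    print(expected, "->", max(expected, key=expected.get))
    # {'ship_now': 1.0, 'ship_later': 5.5} -> ship_later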
Approaching the Kolmogorov Complexity limit of Wikipedia via Solomonoff Induction would result in a model that approaches true comprehension of the process that generated Wikipedia, including not just the underlying canonical world model but also the latent identities and biases of those providing the text content. Evidence from LLMs trained solely on text indicates that, even without approaching the Solomonoff Induction limit of the corpora, multimodal (e.g. geometric) models are induced.
The biggest stumbling block in machine learning is, therefore, data efficiency more than data availability.