This genuinely seems like magic to me, and I don't quite know how to place it in my mental model of how computation works. A couple of questions/thoughts:
1. I learned that NNs are universal function approximators - and the way I understand this is that, at a very high level, they model a set of functions that map inputs to outputs for a particular domain. I certainly get how this works, conceptually, for, say, MNIST. But for the stuff described here... I'm kind of baffled.
So is GPT's generic training really causing it to implement/embody a value mapping from pixel intensities to HTML+Tailwind text tokens, such that a browser's subsequent interpretation and rendering of those tokens approximates the input image? Is that (at a high level) what's going on? If it is, GPT is modelling not just the pixels->HTML/CSS transform but also how HTML/CSS is rendered by the browser black box. I can kind of accept that such a mapping must necessarily exist, but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind. Is the way I'm thinking about this useful? Or even valid?
2. Rather more practically, can this type of tool be thought of as a diagram compiler? Can we see this eventually being part of a build pipeline that ingests Sketch/Figma/etc. artefacts and spits out HTML/CSS/JS?
An LLM is really a latent space plus the means to navigate it. Now a latent space is an n-dimensional space in which ideas and concepts are ordered so that those that are similar to each other (for example, "house" and "mansion") are placed near each other. This placing, by the way, happens during training and is derived from the training data, so the process of training is the process of creating the latent space.
To visualize this in an intuitive way, consider various concepts arranged on a 2D grid. You would have "house" and "mansion" next to each other, and something like "growling" in a totally different corner. A latent space -- say, GPT-4's -- is just like this, only it has hundreds or thousands of dimensions (in GPT-4's case, 1536), and that difference in scale is what makes it a useful ordering of so much knowledge.
To go back to reading images: the training data included images of webpages with corresponding code, and that code told the training process where to put the code-image pair. In general, accompanying labels and captions let the training process put images in latent space just as they do text. So, when you give GPT-4 a new image of a website and ask it for the corresponding HTML, it can place that image in latent space and get the corresponding HTML, which is lying nearby.
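To make the "lying nearby" idea concrete, here's a toy sketch of similarity in an embedding space. The vectors are made up (real embeddings come from a trained model and have hundreds or thousands of dimensions), but the geometry is the same: related concepts score high on cosine similarity, unrelated ones don't.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity: close to 1.0 means the vectors point the same way.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    rng = np.random.default_rng(0)

    # Made-up 8-dimensional "embeddings"; real latent spaces are far larger.
    house = rng.normal(size=8)
    mansion = house + 0.1 * rng.normal(size=8)   # a nearby concept: small perturbation
    growling = rng.normal(size=8)                # an unrelated concept: independent vector

    print(cosine(house, mansion))    # close to 1.0 -> neighbours in the space
    print(cosine(house, growling))   # much lower   -> a different corner of the space

This only illustrates the geometry; it isn't how GPT-4 actually produces HTML from an image.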
Being a universal function approximator means that a multi-layer NN can approximate any continuous function (on a bounded domain) to an arbitrary degree of accuracy. But it says nothing about learnability, and the network required may be unrealistically large.
The learning algorithm used -- backpropagation with stochastic gradient descent -- is not a universal learner: it's not guaranteed to find the global minimum.
Specifically, the "universal function approximate" thing means no more and no less than the relatively trivial fact that if you draw a bunch of straight line segments you can approximate any (1D, suitably well-behaved) function as closely as you want by making the lines really short. Translating that to N dimensions and casting it into exactly the form that applies to neural networks and then making the proof solid isn't even that tough, it's mostly trivial once you write down the right definitions.
Unlikely given the dimensionality and complexity of the search space. Besides, we probably don’t even care about the global minimum: the loss we’re optimising is a proxy for what we really care about (performance on unseen data). Counter-example: a model that perfectly memorises the training data can be globally optimal (ignoring regularization), but is not very useful.
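As a toy picture of the no-global-minimum-guarantee point (function, starting point, and step size all invented for illustration): plain gradient descent on a non-convex 1D loss settles into whichever basin it starts in.

    # A non-convex "loss": global minimum near x = -1.30, local minimum near x = +1.13.
    loss = lambda x: x**4 - 3 * x**2 + x
    grad = lambda x: 4 * x**3 - 6 * x + 1

    x, lr = 2.0, 0.01                 # start on the right-hand slope
    for _ in range(2000):
        x -= lr * grad(x)             # plain gradient descent: no noise, no momentum

    print(x, loss(x))                 # ~ (1.13, -1.07): stuck in the local minimum
    print(-1.30, loss(-1.30))         # the global minimum is much lower (~ -3.51)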
1. Is the way I'm thinking about this useful? Or even valid?
The process is simpler: GPT reads the image and produces a complete description of it, then the user takes that description and writes a prompt asking for a Tailwind implementation of it (a rough sketch of this two-step flow follows after point 2).
2. I see this skipping the Sketch/Figma phase and going directly to a live prototype.
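We don't know the exact pipeline, but a minimal sketch of that two-step flow, assuming the OpenAI Python client (the model name, prompts, and image URL are all illustrative), might look like this:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Step 1: ask a vision-capable model for a detailed description of the screenshot.
    description = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the choice is illustrative
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this webpage's layout, colours, and components in detail."},
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: turn that description into markup with a plain text prompt.
    html = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Implement this page as a single HTML file using Tailwind CSS:\n\n{description}",
        }],
    ).choices[0].message.content

    print(html)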
Your curiosity is a breath of fresh air after months of seeing people argue over pointless semantics, so I'm going to attempt to explain my mental model of how this works.
1- This is a correct but not really useful view, IMO. Saying a network can fit any arbitrary function doesn't tell you whether it will do so given finite resources. Part of your excitement comes from this, I think: we've had these universal approximators for far longer, but we've never had an abstract concept approximated so well. The answer is the scale of the data.

I'd like to pay extra attention to GPT's generic training before moving on to multi-modality. There is a view that compression is intelligence (see the Hutter Prize and Kolmogorov complexity) and that these models are really just good compressors. Given that the model's weights are fixed in size, much smaller than the data we are trying to fit, and the objective is to recover the original text (next-token prediction), there is no way to achieve this other than to compress the data really well. As it turns out, the more intelligent you are, the more you are able to predict/compress, and if you are forced to compress something, you are essentially being forced to gain intelligence.

It's like taking an exam tomorrow on a subject you currently know nothing about: (1) you could memorize potential answers, but (2) if the test is a few thousand questions long and there is no way to memorize them all in the time available, your best bet is to actually learn the subject and hope to derive the answers during the test.

This compression/intelligence duality is somewhat controversial, especially among the HN crowd who deny the generalization abilities of LLMs, but this is my current mental model and I haven't been able to falsify it so far.
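The prediction/compression link can be made concrete: under arithmetic coding, a model that assigns probability p to the next token can encode that token in about -log2(p) bits, so a better predictor is literally a better compressor. A toy calculation with invented probabilities:

    import math

    # Probabilities two hypothetical models assign to the successive tokens of the same text.
    weak_model   = [0.20, 0.10, 0.05, 0.30, 0.15]
    strong_model = [0.80, 0.60, 0.40, 0.90, 0.70]

    bits = lambda probs: sum(-math.log2(p) for p in probs)  # ideal code length

    print(f"weak predictor:   {bits(weak_model):.1f} bits")    # ~14.4 bits
    print(f"strong predictor: {bits(strong_model):.1f} bits")  # ~3.0 bits for the same tokens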
If you accept this view, the multi-modality capability is just engineering. We don't know exactly how GPT-4V works, but we can infer the details from open-source multi-modal research. Given a dataset of image-text pairs where the text explains what's going on in the image (e.g. an image of a cat and a long description of it), we tokenize/embed the image like we do text. This can be done with a Vision Transformer (ViT), where the network generates a visual feature for each patch of the image and puts them in a long sequence. Now, if you give these embeddings to a pretrained LLM and force it to predict the paired description, there is no way to achieve that other than to look at the image embeddings and gain general image understanding. Once the network can understand the information in a given image and express it in natural language, the rest is instruction tuning to make use of that understanding.

Generative image models like Stable Diffusion work similarly, except that you have a contrastive model (CLIP) trained by forcing embeddings of the same concept to be close to each other (e.g. the embedding of a picture of a cat and the embedding of the text "picture of a cat" are pushed together during training). You then use this shared embedding space to steer the generative part of the model.

What's surprising to me in all of this is that we happened to get these capabilities at this scale (lucky), and we can get more capabilities with just more compute. If the current GPT-4 reached a final loss of, say, 1 on the data it has now, it would probably be much more capable if we could somehow get that loss down to 0.1. It's exciting!
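We don't know GPT-4V's internals, but the general recipe from the open-source work described above can be sketched in a few lines of PyTorch (all dimensions and module names are invented, and the ViT's own transformer layers are omitted): split the image into patches, embed each patch, project into the LLM's token-embedding space, and prepend the resulting "visual tokens" to the text sequence.

    import torch
    import torch.nn as nn

    patch_size, d_vision, d_llm = 16, 256, 512     # toy sizes; real models are far larger

    # 1. Patchify + embed: a strided conv turns each 16x16 patch into one feature vector
    #    (this is the patch-embedding step of a Vision Transformer).
    patch_embed = nn.Conv2d(3, d_vision, kernel_size=patch_size, stride=patch_size)

    # 2. A learned projection maps visual features into the LLM's embedding space.
    to_llm_space = nn.Linear(d_vision, d_llm)

    image = torch.randn(1, 3, 224, 224)                  # one RGB image
    feats = patch_embed(image)                           # (1, 256, 14, 14)
    visual_tokens = feats.flatten(2).transpose(1, 2)     # (1, 196, 256): a sequence of patches
    visual_tokens = to_llm_space(visual_tokens)          # (1, 196, 512)

    # 3. Prepend the visual tokens to the embedded caption tokens; training the LLM to
    #    predict the caption forces it to "read" the patches.
    text_tokens = torch.randn(1, 12, d_llm)              # stand-in for embedded caption tokens
    llm_input = torch.cat([visual_tokens, text_tokens], dim=1)   # (1, 208, 512)
    print(llm_input.shape)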
This is my general understanding; I'd like to be corrected on any of it, but I hope you find it useful.
2- It seems to be that way. Probably possible even today.
In the AGI sense of intelligence defined by AIXI, (lossless) compression is only model creation (Solomonoff Induction/Algorithmic Information Theory). Agency requires decision, which amounts to conditional decompression given the model: that is, inferentially predicting the expected value of the consequences of various decisions (Sequential Decision Theory).
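In toy form, "conditional decompression given the model" as decision-making looks like this (the actions, outcome distributions, and utilities are all made up): the same predictive model is queried conditionally on each candidate action, and the agent picks the action with the highest expected value.

    # A predictive model gives P(outcome | action); the agent chooses the action with the
    # highest expected utility under that model -- prediction in the service of decision.
    model = {   # hypothetical conditional distributions over outcomes
        "ship_now":   {"happy_users": 0.4, "bug_reports": 0.6},
        "ship_later": {"happy_users": 0.7, "bug_reports": 0.3},
    }
    utility = {"happy_users": 10.0, "bug_reports": -5.0}

    expected = {
        action: sum(p * utility[outcome] for outcome, p in dist.items())
        for action, dist in model.items()
    }
    print(expected, "->", max(expected, key=expected.get))
    # {'ship_now': 1.0, 'ship_later': 5.5} -> ship_later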
Approaching the Kolmogorov Complexity limit of Wikipedia via Solomonoff Induction would result in a model that approaches true comprehension of the process that generated Wikipedia, including not just the underlying canonical world model but also the latent identities and biases of those providing the text content. Evidence from LLMs trained solely on text indicates that, even without approaching the Solomonoff Induction limit of the corpora, multimodal (e.g. geometric) models are induced.
The biggest stumbling block in machine learning is, therefore, data efficiency more than data availability.