I think GP was probably referring to "Scaling Data-Constrained Language Models" (2305.16264) from NeurIPS 2023, which looked at how to optimally scale LLMs when training data is limited. There is a short section on mixing code (Python) into the training data and the effect this has on performance on e.g. natural language tasks. One of their findings was that training data can be up to 50% code without actually degrading performance, and in some cases (benchmarks like bAbI and WebNLG) performance even improves (probably because these tasks emphasize what they call "long-range state tracking capabilities").
For reference: In the Llama 3 technical report (2407.21783), they mention that they ended up using 17% code tokens in their training data.
Also, GPT-3.5 was another extreme, if I remember correctly: they first trained only on code and then trained on other text. I can't seem to find the source, though.
There was an interview with Zuckerberg about how they initially trained the Llama chat models purely on normal text and CodeLlama purely on code, but later realized that if they combined the training sets they got a model that was better at both tasks than either specialized one was.
Grounding everything in symbolic representations. [1] This can greatly empower things we could already simulate but that were too complicated to build a game around; now you can have agents respond to complex simulations with appropriate dialogue (a rough sketch of this follows below). But it's limited by what we can build a simulation to do.
Or,
Leaning into making the LLM the core of the experience, but relying on the player to play along to a greater or lesser extent. This sidesteps the jailbreaking problem but requires rethinking what playing a video game is about - is it about breaking free of the limits of the system, or about co-creativity?
There are some attempts to find other paths, but they are very much pioneering new ways to play games and look very different from past gameplay. [2]
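To make the first option concrete, here is a minimal, entirely hypothetical sketch of the shape of it: the game's own simulation owns the symbolic state, and the LLM is only asked to narrate it. `call_llm` is a placeholder for whatever completion API you would actually use, not a real library call.

    import json

    def call_llm(prompt: str) -> str:
        # Placeholder: swap in a real LLM client here.
        return "We're rationing bread since the granary raid, but we'll manage."

    # Symbolic state owned by the game's simulation, not invented by the LLM.
    sim_state = {
        "npc": "blacksmith",
        "town_food_supply": "critically low",
        "recent_event": "bandits raided the granary",
        "player_reputation": "trusted",
    }

    prompt = (
        "You are an NPC in a village simulation. Stay in character and only "
        "reference facts present in this state:\n"
        + json.dumps(sim_state, indent=2)
        + "\n\nPlayer: How is the town holding up?\nNPC:"
    )

    # The simulation stays authoritative; the LLM only turns state into dialogue.
    print(call_llm(prompt))

The point being that the model never gets to decide whether the granary was raided; it only dresses up facts the simulation has already committed to.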
We already do this for songs; anyone can pay the mechanical rate and record their own cover of a song.
It is an imperfect comparison, since a cover is its own recording, and ongoing royalties are involved, but the point is that there are some precedents for setting a price.
It's not quite that straightforward; since close kin also share genetic heritage, anything you do to benefit your near kin also propagates some percentage of your own DNA. Kin selection has been part of evolutionary theory since Darwin:
> "This difficulty, though appearing insuperable, is lessened, or, as I believe, disappears, when it is remembered that selection may be applied to the family, as well as to the individual, and may thus gain the desired end."
Eusocial insects are doing quite well, for example, despite the tiny subset of the species that reproduce.
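For what it's worth, the standard way to quantify this is Hamilton's rule: a trait for helping kin can spread when r x b > c, where r is the coefficient of relatedness (1/2 for full siblings, 1/4 for half-siblings, and about 3/4 between sisters in haplodiploid insects like ants and bees), b is the reproductive benefit to the recipient, and c is the cost to the actor. That high relatedness among workers is the usual explanation for why eusociality works despite most individuals never reproducing.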
From the abstract: "Across settings, we find a consistent results that code is a critical building block for generalization far beyond coding tasks and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in up to relative increase of 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, 6.6% improvement in generative win-rates, and a 12x boost in code performance respectively. Our work suggests investments in code quality and preserving code during pre-training have positive impacts."
That first article entirely revolves around some random finance bro’s idle speculation in a YouTube comment. It blows my mind that people are so trusting of obvious guesswork, given that it’s a privately held company that doesn’t disclose its financials.
> That first article entirely revolves around some random finance bro’s idle speculation in a YouTube comment.
I'm not sure that's a wholly accurate description? The article appears to point to sources beyond that singular comment - in particular, ostensible internal financial information:
> Ferguson based his assessment on internal second-quarter figures recently obtained by the New York Times. According to this report, X booked $114 million worth of revenue in the U.S., its largest market by far. This represented a 25% drop over the preceding three months and a 53% drop over the year-ago period.
> That already sounds bad. But it gets worse. The last publicly available figures prior to Musk’s acquisition, from Q2 of 2022, had revenue at $661 million. After you account for inflation, revenue has actually collapsed by 84%, in today’s dollars.
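For what it's worth, the quoted 84% roughly checks out against the numbers in the article. Here's a quick sanity check; the cumulative inflation factor is my own rough assumption, not a figure from the article:

    # Quick sanity check of the quoted revenue decline.
    q2_2022_revenue = 661e6   # last publicly reported quarterly figure, $
    recent_revenue  = 114e6   # quarterly figure per the leaked numbers, $
    inflation_factor = 1.09   # assumed cumulative inflation since Q2 2022

    nominal_drop = 1 - recent_revenue / q2_2022_revenue
    real_drop    = 1 - (recent_revenue / inflation_factor) / q2_2022_revenue

    print(f"nominal decline: {nominal_drop:.0%}")  # ~83%
    print(f"real decline:    {real_drop:.0%}")     # ~84%, in line with the article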
Profit is the whole point of owning a company, unless it's run by an oligarch, as is the case with Twitter. Not too dissimilar from Abramovich, or Sheriff in Moldova, owning a sports club.
He can afford it as long as the stock market remains convinced Tesla deserves to be the most valuable car company in the world while being outside the top ten in sales ...
(I count "can make Starship on the side" as a QED of "making bank").
Regardless of how Starship concludes, when it does so it saves them ongoing costs, and turns "small profit" into "huge profit": either Falcon becomes redundant, or the R&D team does.
Where are you getting 2 billion from? The original CLIP paper says:
> We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. [1]
OpenCLIP was trained on more images, but datasets like LAION-2B are kind of low-quality in terms of labeling; I find it plausible that a better dataset could outperform it. I'm pretty sure that the stock images Adobe is drawing from have better labeling already.
I agree that this is likely to backfire on artists, but part of that is that I expect the outcome to be that large corporations will license private datasets and open research will starve.
The 400m images in the paper yield the ~40% zero shot ImageNet accuracy in the chart they publish.
That level of performance is generally not good enough for text conditioning of DDIMs.
The published CLIP checkpoints are a different story: later in the paper, they report performance that is almost twice as good, 76.2%. That data point, notably, does not appear in the chart. So the published checkpoints, the ones whose performance they discuss later in the paper, were clearly trained on way more data.
How much data? Let's take a guess. I took the data points from the chart they published and fit y = a*log_b(c + d*x) + K to them:
a≈12.31
b≈0.18
c≈24.16
d≈0.81
K≈−10.47
Solving that for 76% gives roughly 7.55b images. The fit has R^2 = 0.993; I don't have any good intuition for why it's so high, but the estimate could very well be real, and there's no reason to anchor on "7.55b is a lot higher than LAION-4b": they could just concatenate a social media image dataset of 3b images with LAION-4b and, boom, there's 7b. (A rough sketch of the fitting procedure follows below.)
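For the curious, here's roughly how you'd reproduce that kind of extrapolation. The (dataset size, accuracy) pairs below are placeholders just to make the script run, not the actual values read off the chart, and I've collapsed the five-parameter form into the equivalent A*ln(c + x) + K to keep the fit well-behaved:

    import numpy as np
    from scipy.optimize import curve_fit, brentq

    # Placeholder points only -- substitute the (millions of images, zero-shot
    # ImageNet %) values actually read off the CLIP paper's scaling chart.
    x_obs = np.array([15.0, 30.0, 100.0, 200.0, 400.0])
    y_obs = np.array([12.0, 18.0, 28.0, 34.0, 40.0])

    def model(x, A, c, K):
        # a*log_b(c + d*x) + K reduces to this form up to reparameterization.
        return A * np.log(c + x) + K

    params, _ = curve_fit(model, x_obs, y_obs, p0=[10.0, 10.0, -10.0], maxfev=20000)

    ss_res = np.sum((y_obs - model(x_obs, *params)) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    print("fitted params:", params, " R^2 =", round(1 - ss_res / ss_tot, 4))

    # Invert the fitted curve: how many images would it take to reach 76%?
    x_star = brentq(lambda x: model(x, *params) - 76.0, 400.0, 1e7)
    print(f"extrapolated dataset size: {x_star / 1000:.2f}B images")

With the real chart values plugged in, this is the sort of procedure that spits out a number like 7.55b; with my placeholder points it will print some other, meaningless figure.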
OpenCLIP did reproduce this work, after all, with 2b images and got 79.5%. But e.g. Flux and SD3 do not use OpenCLIP's checkpoints, so that one performance figure isn't representative of how bad OpenCLIP's checkpoints are versus how good OpenAI's checkpoints are. The exact amount isn't straightforward to pin down with a fit, but it's way more than 400m.
Another observation: there are plenty of Hugging Face spaces with crappy ResNet conditioning and crappy small-dataset, trained-from-scratch CLIP conditioning to try. Sometimes the output actually looks as crappy as Adobe's does, so there's a little bit of a chance that Adobe tried, and failed, to create its own CLIP checkpoint on the crappy amount of data it had.
> Sure, you can check if it's mathematically coherent, but that tells you nothing about whether it describes the physical world correctly.
This is a very good point that I think a lot of people miss (including some who should know better). Pontificating about speculative physics was all right for Aristotle, but you need actual experiments to ground your results.