If I've been reading it correctly, the power of ChatGPT is in the training and data, not necessarily the algorithm.
And I'm not sure if it's technically possible for one AI to train another AI with the same algorithm and have better performance. Although I could be wrong about any and everything. :-)
An LLM by itself could generate data and code and iterate on its training process, so it could create another LLM from scratch. There is a path to improving LLMs without organic text: connect them to real systems and let them learn from feedback on their actions. The system could be as simple as a Python execution environment, a game, a simulator, or other chatbots, or something more complex like real-world tests.
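A minimal sketch of that kind of loop, assuming nothing more than a generic llm(prompt) -> str callable (hypothetical, not any particular vendor's API): the model writes a program, a sandbox runs it, and the execution output is fed back as the signal.

    import subprocess
    import tempfile

    def run_snippet(code: str) -> str:
        """Run a candidate program in a subprocess and capture stdout/stderr."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=10)
        return result.stdout + result.stderr

    def feedback_loop(llm, task: str, rounds: int = 3) -> str:
        """llm is a hypothetical prompt -> completion callable. The execution
        result is the non-human feedback; it can be used in-context (as here)
        or logged as training data for a later fine-tuning run."""
        code = llm(f"Write a Python program that {task}.")
        for _ in range(rounds):
            output = run_snippet(code)
            code = llm(
                f"Task: {task}\nYour program:\n{code}\n"
                f"It produced:\n{output}\nImprove the program."
            )
        return code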
I know that NVidia is using AI that is running on NVidia chips to create new chips that they then run AI on.
All that's left to do is apply AI to the process of training AI, kind of like how building a lathe by hand makes a so-so lathe, but that so-so lathe can then be used to build a better and more accurate lathe.
I actually love this analogy. People tend to not appreciate just how precise modern manufacturing equipment is.
All of that modern machinery was essentially bootstrapped off a couple of relatively flat rocks. It's going to be interesting to see where this LLM stuff goes when the feedback loop is this quick and so much brainpower is focused on it.
One of my sneaking suspicions is that Facebook/Google/Amazon/Microsoft/etc. would have been better off keeping employees on the books, if for no other reason than to keep thousands of skilled developers occupied, rather than cutting loose thousands of people who now have an axe to grind during a time of rapid technological progress.
It is a nice analogy because you can really extend it to the whole history of technological progress. Tools help make tools - all the way back to obsidian daggers and sticks.
NeRFs are a form of inverse renderer; this paper uses Score Jacobian Chaining[0] instead. Model reconstruction from NeRFs is also an active area of research. Check out the "Model Reconstruction" section of Awesome NeRF[1].
From the SJC paper:
> We introduce a method that converts a pretrained 2D diffusion generative model on images into a 3D generative model of radiance fields, without requiring access to any 3D data. The key insight is to interpret diffusion models as function f with parameters θ, i.e., x = f (θ). Applying the chain rule through the Jacobian ∂x/∂θ converts a gradient on image x into a gradient on the parameter θ.
> Our method uses differentiable rendering to aggregate 2D image gradients over multiple viewpoints into a 3D asset gradient, and lifts a generative model from 2D to 3D. We parameterize a 3D asset θ as a radiance field stored on voxels and choose f to be the volume rendering function.
Interpretation: they parameterize the 3D asset as a voxel radiance field, render it from multiple viewpoints with a differentiable renderer (the volume rendering function), and use the pretrained 2D diffusion model to supply gradients on those rendered images; the chain rule then pushes the image gradients back onto the voxel parameters, so no 3D data or input photos are needed.
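A toy PyTorch sketch of that chain-rule update, with stand-ins for both the renderer and the diffusion score (so this shows the shape of the idea, not the paper's actual implementation):

    import torch

    def render(theta: torch.Tensor) -> torch.Tensor:
        """Stand-in differentiable 'volume renderer' x = f(theta): just blends the
        voxel colors along one axis. Real SJC ray-marches the voxels from a
        randomly sampled camera each step."""
        weights = torch.softmax(theta[..., 3], dim=-1).unsqueeze(-1)  # toy opacity
        return (theta[..., :3] * weights).sum(dim=2)                  # (H, W, 3) image

    def image_score(x: torch.Tensor) -> torch.Tensor:
        """Stand-in for the gradient the pretrained 2D diffusion model assigns to
        the rendered image; SJC derives this from the denoiser's score."""
        return x - 0.5

    theta = torch.zeros(64, 64, 64, 4, requires_grad=True)   # voxel radiance field
    opt = torch.optim.Adam([theta], lr=1e-2)

    for step in range(200):
        x = render(theta)                 # x = f(theta), differentiable
        grad_x = image_score(x)           # gradient on the 2D image
        opt.zero_grad()
        x.backward(gradient=grad_x)       # chain rule: dx/dtheta pushes it onto theta
        opt.step()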
I won't lie... ZBrush is brutally hard. I got a subscription for work and only used it for one paid job, ever. But it's super satisfying if you just want to spend Sunday night making a clay elephant or rhinoceros, and drop $20 to have the file printed out and shipped to you by Thursday.
I've fed lots of my sculpture renderings to DALL-E and gotten some pretty cool 2D results... but nothing nearly as cool as the little asymmetrical epoxy sculptures I can line up on the bookshelf...
People are definitely building at a high pace, but for what it's worth, this isn't the first work to tackle this problem, as you can see from the references. The results are impressive though!
Image classification is still a difficult task, especially if there are only a few examples. Training a high-resolution, 1,000-class ImageNet classifier on 1M+ images from scratch is a drag involving hundreds or thousands of GPU hours. You can train low-resolution classifiers more easily, but they're less accurate.
There are tricks to do it faster, but they all involve using other vision models that themselves took just as long to train.
But can't something like GPT help here? For example you show it a picture of a cat, then you say "this is a cat; cats are furry creatures with claws, etc." and then you show it another image and ask if it is also a cat.
You are humanizing token prediction. The multimodal text-vision models were all established using a scaffold of architectures that unify text-token and vision-token similarity, e.g. BLIP-2 [1]. It's possible that a model using unified representations could establish that the set of visual tokens you are searching for corresponds to some set of text tokens, but only if the pretrained weights of the vision encoder are able to extract the features corresponding to the object you are describing to it.
And the pretrained vision encoder will at some point have been trained with a contrastive objective that aligns text and visual embeddings (maximizing cosine similarity for matched pairs, minimizing it for mismatched ones) on some training set, so it really depends on what exactly that training set had in it.
This paper https://cv.cs.columbia.edu/sachit/classviadescr/ (funnily enough, from the same lab as the main post) does something along those lines with GPT. It shows that for things that are easy to describe, like Wordle ("tiled letters, some are yellow and green"), you can recognize them with zero training. For things that are harder to describe we'll probably need new approaches, but it's an interesting direction.
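A rough sketch of the descriptor idea, assuming OpenAI's clip package, a placeholder image path, and hand-written descriptors standing in for GPT's output (not the paper's exact pipeline): score the image against each class's descriptors and take the best average.

    import torch
    import clip
    from PIL import Image

    # Hand-written stand-ins for the per-class descriptors GPT would generate.
    descriptors = {
        "wordle": ["a grid of tiled letters", "yellow letter tiles", "green letter tiles"],
        "cat": ["a furry animal", "pointed ears", "whiskers", "claws"],
    }

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path

    with torch.no_grad():
        img = model.encode_image(image)
        img = img / img.norm(dim=-1, keepdim=True)
        scores = {}
        for label, descs in descriptors.items():
            txt = model.encode_text(clip.tokenize(descs).to(device))
            txt = txt / txt.norm(dim=-1, keepdim=True)
            scores[label] = (img @ txt.T).mean().item()  # avg similarity to descriptors

    print(max(scores, key=scores.get))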
If you have a few examples you can use an already trained encoder (like the CLIP image encoder) and train an SVM on the embeddings; no need to train a neural network.
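For example (a sketch with OpenAI's clip package and scikit-learn; the file names are placeholders): embed a handful of labeled images with the frozen CLIP encoder and fit a linear SVM on top.

    import torch
    import clip
    from PIL import Image
    from sklearn.svm import LinearSVC

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed(paths):
        """Frozen CLIP image embeddings, L2-normalized."""
        imgs = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
        with torch.no_grad():
            feats = model.encode_image(imgs)
        return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

    # A few labeled examples per class is often enough on frozen embeddings.
    train_paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
    train_labels = ["cat", "cat", "dog", "dog"]

    classifier = LinearSVC().fit(embed(train_paths), train_labels)
    print(classifier.predict(embed(["mystery.jpg"])))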
In the last week, a lot of the ideas I've read about in HN comments have then shown up as full-blown projects on the front page.
As if people are building at an insane speed from idea to launch/release.