Maybe I'm missing what "abstraction" means here, but it seems like the tasks were centered around grids and other spatial problems, which are a very limited subset of abstraction/reasoning.
In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general. Positions, rotations, etc. are concepts that GPT4 finds very hard to apply, which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text. DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.
It's also worth remembering that blind humans who can recognize squares by feel do not have the ability to recognize squares by sight upon gaining vision.
I suspect the model is bad at these kinds of "reasoning" tasks in the same way that a newly-sighted person is bad at recognizing squares by sight.
> In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general.
The problem with a statement like that is that it leaves the door open to accepting any kind of canned generality as "abstraction in general". Abstract reasoning is indeed a fuzzy/slippery concept, and spatial reasoning may not capture it well, but I'm pretty sure it captures it better than a general impression of ChatGPT does.
> ...since it has no body, no world, no space; it "lives" in the realm of text.
There's a bizarre anthropomorphism in this thread: both the reflexive comparison of this software system to a blind human and the implicit call to be considerate of this thing's supposed disability.
Why is it bizarre to consider the limitations inherent in the input data on which the model is trained? Fundamentally, it still "sees" the world through text, and the extent to which it can "understand" spatial relationships is defined by that. It seems utterly unsurprising that this leads to a very poor grasp of the actual concepts behind what things like "above" or "left" are - the text that humans produce when talking about such things kinda relies on the reader having their own experience (if not vision, then at least body awareness) that can be mapped to those concepts. You can explain "left" and "right" to a human by telling them which of their hands is which, but I can't help wondering how small the actual information payload of that explanation is once you account for the bodily spatial awareness it brings into context by association.
> Why is it bizarre to consider the limitations inherent in the input data on which the model is trained?
Sure, the thing is limited; the study is a demonstration of this (and general-purpose abilities have been claimed for LLMs at various points).
I was pushing back against the "it's like a blind person" anthropomorphizing argument [edit: especially the assumption that these things learn through experience and reflection, which the parent also makes]. Maybe if the thing "had eyes" it could learn spatial information, and maybe it couldn't (though it would take a lot of work to make that metaphor meaningful). The thing certainly doesn't learn text the way a human learns speech, since humans don't digest the entire Internet before they can speak.
Apparently it doesn't improve abstract reasoning capability, because according to the article the multimodal GPT4 did just as dismally as the text-only GPT4. This was surprising to me, as I would have expected an improvement from a model whose input does include spatial relationships.
Technically true, but when those tokens are 1:1 mapped to text, I think we can simplify this down without losing anything important.
Of course, once you start using tokens for other things - as multimodal LMs already do - that changes. But this current crop of models still has visual modality in its infancy IMO, and gauging the overall performance of the model as a whole based on that is very questionable.
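To make the "1:1 mapped to text" point concrete, here's a minimal sketch using OpenAI's open-source tiktoken package with the cl100k_base encoding (the one published for GPT-4's text models); the example string is arbitrary. Round-tripping ordinary text through the tokenizer loses nothing:

    # Text-only tokens round-trip losslessly: encode to integer IDs, decode back.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Rotate the square 90 degrees clockwise."
    tokens = enc.encode(text)      # a list of integer token IDs
    restored = enc.decode(tokens)  # back to the original string

    assert restored == text        # nothing lost: these tokens are just text
    print(len(tokens), tokens)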
> Technically true, but when those tokens are 1:1 mapped to text
I don't know what GPT-4V does particularly, but my understanding is that multimodal models very often have an expanded token space with special tokens related to image handling, so, literally, there is not a 1:1 relationship of tokens to text.
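For what it's worth, the pattern in open multimodal models (LLaVA-style architectures, for example) looks roughly like the sketch below. This only illustrates the "expanded token space" idea, not GPT-4V's actual internals, and the "<image>" placeholder name is made up:

    # Sketch of the expanded-token-space pattern, using the GPT-2 tokenizer as a
    # stand-in. The "<image>" token never results from encoding ordinary text;
    # in a multimodal model its slot is filled with projected image-patch
    # embeddings instead of a text embedding.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(len(tok))  # base text vocabulary size (50257)

    tok.add_special_tokens({"additional_special_tokens": ["<image>"]})
    print(len(tok))  # vocabulary grew by one reserved, non-textual token

    ids = tok.encode("What is above the <image> ?")
    print(ids)       # includes the new special token's ID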
A string of tokens is text. Tokens are just another alphabet, like Japanese writing having many representations for the same sounds, where a single character can sometimes be an entire word.
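To put the "another alphabet" analogy in concrete terms, here's a small tiktoken sketch (again assuming the cl100k_base encoding): the same word gets a different token "spelling" depending on capitalization or a leading space, and common words often compress to a single token, yet every token sequence decodes straight back to plain text:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # Different surface forms of the same word get different token IDs...
    print(enc.encode("hello"))
    print(enc.encode(" hello"))
    print(enc.encode("Hello"))

    # ...but each sequence decodes straight back to the text it came from.
    print(repr(enc.decode(enc.encode(" hello"))))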
By the very fact that there's a paper here, whatever its merit, the authors of the paper have codified their concept of generality, and this doesn't validate the point I was replying to, which was essentially "my impression/feeling is that it is better".
Point is that it's good at abstract reasoning that isn't spatially grounded like in that paper. So it's not really leaving any door open. It's not a cop out. That's just how it is.
> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.
This has nothing to do with having "no body, no world" and everything to do with the fact that training pictures where things are upside down are simply vastly rarer than pictures where they aren't.
What would directions be for an intelligent creature that lives in zero gravity? I just like thinking about this for the same reasons humans like writing speculative science fiction. Trying to guess what alien perspectives look like might also give us insights when we're the ones making the alien.
However, North, South, East, and West are relative to the poles of the Earth. Something living in zero gravity would have to use some object as an anchor to determine the direction.
You’re also oriented based on objects. We don’t have an abstract compass pointing north 24/7 the way we can use our bodies to determine left and right or gravity to point down.
The solar system has a north pole and a south pole based on the rotation of the Sun. Basically the only places in which there isn't something to orient against are in the depths of inter-galactic-cluster voids with nothing around. And if a being is stuck in one of those voids, orientation is way down the list of problems they have.
FWIW there is some interesting variability among human cultures on that, as well. There are a few that actually use cardinal directions predominantly or exclusively instead of body-relative ones like "left" and "right".
No, but they would have front and back, and people from the bridge would share which way was “up” and “down” and “left” and “right” based on the controls.