
Maybe I'm missing what "abstraction" means here, but it seems like the tasks were centered around grids and other spatial problems, which are a very limited subset of abstraction/reasoning.

In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general. Positions, rotations, etc. are concepts that GPT4 finds very hard to apply, which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text. DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.



It's also worth remembering that blind humans who can recognize squares by feel do not have the ability to recognize squares by sight upon gaining vision.

I suspect the model is bad at these kinds of "reasoning" tasks in the same way that a newly-sighted person is bad at recognizing squares by sight.


When did blind humans gain vision, out of curiosity?


The first one I heard about was 10-15 years ago, via projecting an image onto the tongue. Ahh, here it is: https://www.scientificamerican.com/article/device-lets-blind...


https://www.projectprakash.org/_files/ugd/2af8ef_5a0c6250cc3...

They studied people with treatable congenital blindness (dense congenital bilateral cataracts).


> In my experience GPT4/V is pretty bad at those specifically, not necessarily around abstraction in general.

The problem with a statement like this is that it leaves the door open to accepting any kind of canned generality as "abstraction in general". Abstract reasoning is indeed a fuzzy/slippery concept, and spatial reasoning may not capture it well, but I'm pretty sure it captures it better than a general impression of ChatGPT.

> ...since it has no body, no world, no space; it "lives" in the realm of text.

There's a bizarre anthropomorphism on this thread, both in the reflexive comparison of this software system to a blind human and in the implicit call to be considerate of this thing's supposed disability.


Why is it bizarre to consider the limitations inherent in the input data on which the model is trained? Fundamentally, it still "sees" the world through text, and the extent to which it can "understand" spatial relationships is defined by that. It seems utterly unsurprising that this leads to a very poor grasp of the actual concepts behind things like "above" or "left" - the text that humans produce when talking about such things kinda relies on the reader having their own experience (if not vision, then at least body awareness) that can be mapped to those concepts. You can explain "left" and "right" to a human by telling them which of their hands is which, and I can't help but wonder what the actual information payload of that is, once you consider the bodily spatial awareness it brings into context by association.


> Why is it bizarre to consider the limitations inherent in the input data on which the model is trained?

Sure, the thing is limited; the study is a demonstration of this (and general-purpose abilities have been claimed for LLMs at various points).

I was pushing back against the "it's like a blind person" anthropomorphizing argument [edit: especially the assumption that these things learn through experience and reflection, which the parent also makes]. Maybe if the thing "had eyes" it could learn spatial information, and maybe it couldn't (though it would take a lot of work to make that metaphor meaningful). The thing certainly doesn't learn text the way a human learns speech, since humans don't digest the entire Internet before they can speak.


I'd recommend looking up model grounding via multi-modal training. Seemingly, models improve as you add more modalities.


The study did include a multimodal model.

Apparently it doesn't improve abstract reasoning capability, because according to the article the multimodal GPT-4 did just as dismally as the text-only GPT-4. This was surprising to me, as I would have expected an improvement with a model that did include spatial relationships.


> Fundamentally, it still "sees" the world through text

Fundamentally, it "sees the world" [0] through tokens, which are not text.

[0] Also a bad metaphor, but...


Technically true, but when those tokens are 1:1 mapped to text, I think we can simplify this down without losing anything important.

Of course, once you start using tokens for other things - as multimodal LMs already do - that changes. But this current crop of models still has its visual modality in its infancy IMO, and gauging the overall performance of the model as a whole based on that is very questionable.
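
As a rough illustration of that 1:1 mapping (a minimal sketch assuming the tiktoken library and its cl100k_base encoding, not any particular model's internals):

    # text -> tokens -> text round trip; the encoding is lossless for plain text
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("the square is above the circle")
    print(tokens)              # a short list of integer token ids
    print(enc.decode(tokens))  # recovers the original string exactly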


> Technically true, but when those tokens are 1:1 mapped to text

I don't know what GPT-4V does particularly, but my understanding is that multimodal models very often have an expanded token space with special tokens related to image handling, so, literally, there is not a 1:1 relationship of tokens to text.
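
As a sketch of that pattern (assuming a LLaVA-style setup and the Hugging Face transformers library; the "<image>" placeholder is illustrative, not GPT-4V's actual mechanism, which isn't public):

    # expand a text tokenizer's vocabulary with an image placeholder token
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    print(len(tokenizer))  # 50257: the original text-only vocabulary

    tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
    print(len(tokenizer))  # 50258: the new id maps to no text at all

    ids = tokenizer("Describe <image> for me")["input_ids"]
    # the <image> id in `ids` is a placeholder the model would replace with
    # projected vision-encoder features, so tokens != text in such models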


A string of tokens is text. Tokens are just another alphabet, like Japanese, where there are many representations for the same sounds and a single letter can sometimes be an entire word.


> The problem with a statement like this is that it leaves the door open to accepting any kind of canned generality as "abstraction in general".

Not really

https://arxiv.org/abs/2212.09196


Nah,

By the very fact that there's a paper here, whatever its merit, the authors have codified their concept of generality, and this doesn't validate the point I was replying to, which was essentially "my impression/feeling is that it is better".


The point is that it's good at abstract reasoning that isn't spatially grounded, like in that paper. So it's not really leaving any door open. It's not a cop-out. That's just how it is.


> which is kinda unsurprising since it has no body, no world, no space; it "lives" in the realm of text

or rather the training set was lacking in this regard


> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.

This has nothing to do with having "no body, no world" and everything to do with the fact that training pictures where things are upside down are simply vastly rarer than pictures where they aren't.


My point is: both are two sides of the same coin.


What would directions be for an intelligent creature that lives in zero gravity? I just like thinking about this for the same reasons humans like writing speculative science fiction. Trying to guess what alien perspectives look like might also give us insights when we're the ones making the alien.


Basically the same; gravity doesn't define left/right or North, South, East, and West for us, just up and down.


However, North, South, East, and West are relative to the poles of the Earth. Something living in zero gravity would have to use some object as an anchor to determine the direction.


You’re also oriented based on objects. We don’t have an abstract compass pointing north 24/7 the way we can use our bodies to determine left and right or gravity to point down.


Right, that's why we use compasses, which use the poles of the Earth to determine the direction.

Something living in zero gravity doesn't have a planet, so they'd have to find something else to base the directions on.

That's what I was trying to say before.


The solar system has a north pole and a south pole based on the rotation of the Sun. Basically the only places in which there isn't something to orient against are in the depths of inter-galactic-cluster voids with nothing around. And if a being is stuck in one of those voids, orientation is way down the list of problems they have.


That's a good point. The sun of a solar system could possibly be what an alien society living in zero gravity bases their directions on.


FWIW there is some interesting variability among human cultures on that, as well. There are a few that actually use cardinal directions predominantly or exclusively instead of body-relative ones like "left" and "right".


No, but they would have front and back, and people on the bridge would share which way was "up" and "down" and "left" and "right" based on the controls.


> DALLE3 suffers a similar problem where it has trouble with concepts like "upside down" and consistently fails to apply them to generated images.

There probably aren't many (if any) upside-down images or objects in the training data.



