Palm-E: An Embodied Multimodal Language Model (palm-e.github.io)
211 points by alphabetting on March 7, 2023 | 127 comments



Looks like great work. I was sad to learn that the entire team that designed and built those robots was let go a few weeks ago. Well, maybe some of them will get an internal transfer. But the project was canceled. I designed the sensor used in the palm of that robot to detect objects in the gripper [1][2], and worked as a hardware test engineer writing software to test hardware assemblies as the robots were being built. I’m really glad to see transformers used for multimodal datasets for robotics! You can imagine that a robot has a whole “experience” in the world and you want it to then have a new experience and predict the most logical actions to take, just like GPT models predict the next word. I suspect there are limits to this approach but it’s an important step.

I hope google research is able to continue working with these robots. There were over 50 of them when I left 4 years ago, and I suspect they ended up building a lot more. There are limits to the commercial appeal of a one armed wheeled mobile manipulator with a two finger pinch gripper, but they are wonderful machines for machine learning and robotics research.

It’s a very nice arm design. All injection molded custom gearboxes and stuff. The team previously designed a really strong metal arm that used a pulley system to eliminate backlash [3]. That could have been useful in manufacturing. It was really upsetting how they just canceled stuff and buried it. I’m so glad I work in open source robotics now.

Incidentally, if you want a kind of stripped-down version of this mobile manipulator: Aaron, who used to be the head of X robotics during the design of [3], left to start Hello Robot [4], and I have heard of people getting use out of them.

[1] https://patents.google.com/patent/US20200391378A1/en?invento...

[2] https://patents.google.com/patent/US11407125B2/en?q=(Robot)&...

[3] https://youtu.be/ZhsEKTo7V04

[4] https://hello-robot.com/


Speaking of applications, I could see something like this being helpful for people in old age, cooking or stocking small shelves. Was there talk of the final form of this tech?


There was talk of these sorts of things. At one point they talked seriously about the robot cleaning people's homes, which seemed absurd to me. It can't go up a step, it can't grab a spray bottle and use it, and they are very expensive.

I was long gone by the time the project was shuttered, but I suspect Tesla's entrance into the robotics space may have been the nail in the coffin for this project. Tesla can do a manufacturing-first design approach with serious DFM the whole way through. I am still not sure if Tesla's humanoid will ever come to light (right now, beyond the demos we have seen, my guess is that it will not become a useful product), but Tesla has the advantage that they can produce high volumes immediately at a modest cost due to their manufacturing integration, something you just cannot get with engineers in Mountain View sending designs to China.

That said, the commercial prospects for this project were always limited, and they canned basically all speculative R&D while I was there 4 years ago. My in-palm grip sensor was the last thing produced by the small R&D subgroup I was hired into before that group was dissolved, and that is when I moved to hardware test engineering. Lots of people wanted to develop better grippers, etc., but they wanted to narrow in on the existing design and commercialize it, and I just don't think this form factor has wide enough utility. It is also fairly expensive, and the software side of things is very difficult. That is another area where Tesla may be able to succeed, though as much as their stack has promise, we have seen them struggle even in structured environments like roadways.


Back in the early 2000s when I studied computer science, one of our professors, Rolf Pfeifer[^0], was famous for being the father of embodiment. The way I understood it, the idea was that the physical shape of a human is also part of our intelligence. For example, we don't have to think about walking that much, because the arrangement of muscles, bones and tendons makes some movements easier than others. We're biased towards certain things by our physical shape.

I'm curious how this paper relates to his work.

[^0]: https://en.wikipedia.org/wiki/Rolf_Pfeifer


I always thought about it as thinking like an oak tree.

A 100 year oak tree probably "knows" more "things" than I do, but there's almost no common language by which we could communicate.

And that's with another biological entity! The chasm would be even wider with fully synthetic life, absent an intentional recreation of our methods of perception.


I am not trying to be obtuse; I genuinely cannot think of any common-sense explanation of how an oak tree could know more things than you.


It has lived longer and therefore observed (for some definition of observed) more. Its rings encode atmospheric carbon and drought metrics for centuries, for example.


But if we cut down the tree next to it, then we can know those same things. The tree doesn't have YouTube, though, so I guess it's a toss-up.


How much do you know about the soil nutrient mix, pH, and microbiome under your feet?


By what measure does an oak tree "know" about any of these things?

If it does under definition X, do I also "know" about the microbiome in my own body under definition X? The nutrient mix of the food I eat, the oxygen content of the air I breathe?


It changes its own processes in response to those quantities.

And certainly, you know your own microbiome!

But I'm assuming you mass less than 9,000 kg, aren't 24 meters tall, and interact with less than 500 m^3 of soil.


I always see Dreyfus stans claim that embodiment is needed for AGI. My big mainframe based god model will refute them by existing.


I have to admit ignorance. Is this just a parameters race? Like, is 3T parameters 6 times better than 500B? Does the task / work scale linearly with parameters? I know I could just ask chatgpt but wanted to support human content on the internet.


Yes and no.

The performance does scale up with parameters, though it’s not linear.

As shown by DeepMind in their work on the Chinchilla LLM, performance also scales up with the size of your training set. They worked out the optimal amount of training data for a model of a given size, so you get the most out of a fixed compute budget.

So even if we don’t find any better model architectures, which we probably will, if we increase the size of our models, training corpus and budget, we should continue to get more performant models.


The corpus part is the one that gets me. Presumably Google, Facebook, Microsoft, etc have a mirror of essentially the entire public (and their private) internet communications. An enormous amount of published English literature has been digitized.

What more corpus is even available? If the corpus is a matter of human annotation, what exactly is being annotated in, say, the Iliad? A tweet that says, "You up?". The lyrics to 8 Mile. So on and so forth.


I would assume that video and audio are the next frontier. Talking about Google, training on the entirety of YouTube would probably be next, if they haven't yet used it.

And the frontier after is probably having a fleet of actual robots capture additional information, as per this paper.


How does that performance scale though? If you double the inputs you train on, do you get twice the performance? Less? More?

I've heard conflicting things and I'm curious to hear your reasoning!


The scaling appears roughly linear, though on some performance metrics it is superlinear.

See Emergent Abilities of Large Language Models https://arxiv.org/abs/2206.07682

Scaling Laws for Generative Mixed-Modal Language Models https://arxiv.org/abs/2301.03728

All this further demonstrates the "bitter lesson" http://www.incompleteideas.net/IncIdeas/BitterLesson.html : surprising ability emerges from just throwing more data and compute at the problem, and that has proven a more fruitful endeavor than trying to find a deeper/analytic solution to cognition (like Yann LeCun's research; he has basically become a meme for how much he poopoos LLMs and insists that what we're seeing is LLMs only pretending to reason).


The Chinchilla paper proposes the following empirical scaling power law:

Loss = (N0/N)^a + (D0/D)^b

where N is the number of parameters, D is the number of training tokens, a and b are fitted exponents, and N0 and D0 are scaling constants. (The paper's full fit also adds a constant irreducible-loss term.)
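A minimal sketch of that formula in Python, with purely illustrative constants (the paper fits these empirically; nothing below is the paper's actual fit):

    def chinchilla_loss(n_params, n_tokens, n0=1e8, d0=1e10, a=0.34, b=0.28):
        # Commenter's form of the Chinchilla law: Loss = (N0/N)^a + (D0/D)^b.
        # The constants here are placeholders, not the paper's fitted values.
        return (n0 / n_params) ** a + (d0 / n_tokens) ** b

    # Doubling parameters or data shrinks the corresponding term, with diminishing returns.
    print(chinchilla_loss(70e9, 1.4e12))
    print(chinchilla_loss(140e9, 1.4e12))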


Loss = A/parameters^a + B/dataset_size^b + C

where A and B are empirical constants, a and b are fitted exponents, and C is the irreducible loss.

Just read the Chinchilla paper.


I asked ChatGPT, it said: "Uhhh it depends."


From what I know, parameter count correlates directly with capability up to a point. But Google has also shown that better training can enable lower-parameter-count models to match or even outperform those with bigger counts.

The trade-off is the cost of training. But in return, you get a smaller, cheaper-to-run model that can be used by more people.

That is not what PaLM-E is about, though. This is the paper where they patch an LLM so that it can give commands to a real, physical robot and basically let the LLM interact with meatspace.


Suddenly those jailbreaks of the LLM become a lot more serious. What happens when bad people can get a robot to help them commit various crimes?


As far as I’m aware the only “robots” we have at the moment that are capable of large scale physical damage, without direct human involvement while the damage is occurring, are military drones and missiles.

And neither missiles nor drones are capable of functioning totally independently, with both requiring large amounts of weapon systems infrastructure. You may be able to argue missiles could be deployed almost independently if failsafes are circumvented, but that would require modernization of missile weapon systems, which is about as likely as the colonization of space.

At present I’d be more worried about network and social engineering attacks using LLMs, which I’m sure are already happening and will only increase.


The household robot that can cook will be able to use a knife to cut anything it arbitrarily thinks needs to be cut, which could lead to it being directed to assist with serious criminal acts. Self-driving cars out on the streets have already killed people, but those were just bugs, not intentional behavior, as might occur if they were driven by LLMs.


Then the bad person gets charged with the crimes. It's not like a judge would just throw their hands in the air because there is a small amount of indirection.


What happens when it decides that the human interfering in the example is the source of the problem and needs to be eliminated?


All I could think of watching that video was I would not want to be that guy.



Parameters and training corpus size matter. Last year a new paper (Google "Chinchilla optimality") focused on compute-optimal LLMs found that we'd been under-training models -- i.e. you could wring more performance out of smaller models by training on more data. But (as far as I understand it - interested layman here) - for a given amount of data, model performance seems to scale more or less linearly with parameters.

Now, we could see another model architecture than the current reigning transformer architecture upend this (much work is ongoing on breaking the quadratic term in the transformer that computationally bounds its performance - an example is the Hyena paper that was published just the other day).

Biggest computer and most data wins is still the paradigm here.
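For a rough sense of what "Chinchilla optimality" implies in practice, the commonly quoted takeaway is roughly 20 training tokens per parameter for a compute-optimal run (Chinchilla itself was 70B parameters trained on 1.4T tokens). A back-of-the-envelope sketch, treating that ratio as a rule of thumb rather than a law:

    def approx_optimal_tokens(n_params, tokens_per_param=20):
        # ~20 tokens/param is the commonly cited Chinchilla rule of thumb;
        # the exact ratio depends on the fitted scaling constants.
        return n_params * tokens_per_param

    for n in (7e9, 70e9, 562e9):
        print(f"{n / 1e9:.0f}B params -> ~{approx_optimal_tokens(n) / 1e12:.2f}T tokens")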


For now, increasing data size has been enough to show better results. We don't know when diminishing returns will become significant.


Kind of.

Speaking very generally, it's not just the number of weights ("parameters") but how you use them. The architecture of the neural network is where the most interesting new work is taking place.

Stable Diffusion, which was beating out DALL-E 2, was a big deal not just because the weights were public, but also because they were small enough to run on consumer hardware. This is because of a clever architectural choice (plus some other methods). This was surprising to me, because most of the interesting image generation had required huge models running on commercial GPUs.

You'd generally expect diminishing returns from adding weights (as well as a lot more training time and maybe more data) but this is not a law.


  I know I could just ask chatgpt but wanted to support human content on the internet.
The irony of this sentence. As if the millionth repetitive copy of defeatist doomsday commentary on ChatGPT is somehow a very different kind of content.


That last phrase hit me hard.


ChatGPT is human content on the internet. It's just indexed and compressed and was made "generative" in a mind boggling way.


Does it still count as human content if I blindly copy paste from ChatGPT? :))


While these are very impressive results, they’re not actually having the models do any of the robotic control. They only have the LLMs output text of what the controls should be, which then gets translated into actual movements. This works in a very narrow, limited set of experiments. 562B parameters to output text commands seems excessive.


In many cases, knowing what to do in very diverse environments could be harder than the physical control of the robots. I've read somewhere that the physical manipulation of the robot body is not as hard as understanding multimodal info (linguistic/visual/auditory), especially in real time.

> Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains.

The last clause is important and suggests there might be further advances from such joint training, incl from interactions in the real world.


Reading the paper, they mention the robotic control is handled by RT-1:

> The low-level policies are from RT-1 (Brohan et al., 2022), a transformer model that takes RGB image and natural language instruction, and outputs end-effector control commands.

For those that don't know, RT-1 (Robotics Transformer) is previous work from the team that converts natural language instructions (plus camera images) into low-level control commands.

You can read more about RT-1 here:

https://ai.googleblog.com/2022/12/rt-1-robotics-transformer-...

Maybe I'm missing something, but this sounds quite generalizable.
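For a concrete picture of that interface, here is a minimal sketch of an RT-1-style policy signature (image plus instruction in, end-effector command out). The field names are hypothetical stand-ins for illustration, not the actual RT-1 action space, which is discretized and richer than this:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class EndEffectorCommand:
        delta_xyz: np.ndarray    # desired gripper translation (hypothetical field)
        delta_rpy: np.ndarray    # desired gripper rotation (hypothetical field)
        close_gripper: bool

    def rt1_style_policy(rgb_image: np.ndarray, instruction: str) -> EndEffectorCommand:
        # A real policy would run the transformer here; this stub only
        # illustrates the input/output contract described in the quote above.
        return EndEffectorCommand(np.zeros(3), np.zeros(3), close_gripper=False)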


RT-1 is trained with very specific (and a lot of) mostly pick-and-place data. That's the domain it is an expert on. Unfortunately, there are only so many things you can build with go to/pick up/place level of instructions. Anything further that goes into the fine manipulation domain that you may need in a real kitchen is still absent.

This issue was a general disappointment in the robotics community; they had a LOT of funds to get robotics data with and they spent it on somewhat trivial tasks that we had almost solved already with much smaller and more principled models, instead of getting human demonstration data for more complicated tasks.


Right but is there any reason that this architecture won't work with more diverse data? Fundamentally it seems like their research is benchmarked around pick and place, so it makes sense to me that they would want to prove out transformer models could work in this space. Knowing transformers, it's probably safe to say that with more diverse training data, it will be able to scale to more complex controls in more complex embodied robots.

While I would love to see them working on robot chefs, I can appreciate that they want to start small. And regardless, it doesn't seem like there are any constraints outside of data for this architecture to work in more and more domains.


> Knowing transformers, it's probably safe to say that with more diverse training data, it will be able to scale to more complex controls in more complex embodied robots. While I would love to see them working on robot chefs, I can appreciate that they want to start small.

Except they are (were?) one of the biggest spenders in the entire field. While it may seem easy to hand-wave away "Just add more data!", robot data is way more expensive to get than language/image data, since people don't generate that naturally as they browse the internet. If their current operation was already too expensive for Google to keep running (as evidenced by Google shutting down this research arm), imagine what would happen if they proposed "let's spend a couple more orders of magnitude to get data!"

All I am saying is that they are going for an unambitious project that looks cool to outsiders, with buzzwords like LLM. There is work by others [1] that is much more impressive in terms of manipulation diversity and would probably be a much better bet to pour more data into, since it shows a much more promising route to the future.

[1] https://twitter.com/chenwang_j/status/1628792565385564160?t=...


I can't really comment on the expenditure or the general strategy of Google's robotics teams. I know they closed down Everyday Robots, but that seemed to be specifically about the hardware.

Their software efforts still seem to be going strong, and I don't really think it's fair to say that demonstrating transfer learning, in an embodied model no less, is unambitious - nor does that seem to be reflected in the results - but to be honest, I'm more or less a layman when it comes to this, so I'll defer to you and keep what you're saying in mind.

To that end, if you're still feeling up to it, maybe you could tell me what your thoughts are with efforts like this from Google:

https://diffusion-rosie.github.io/

Seems to compare (somewhat) to Mimicplay, in that it is attempting to create more data for "cheap", even if it's not "real" control data.


Using an LLM to translate the end result (text) into controls could be just a form of code generation, which LLMs are superb at.

I think a critical piece is being figured out as more and more such research emerges.


FYI, translation to code was actually already done last year by Google (Code as Policies - https://lastweekin.ai/p/code-as-policies) - but this still assumes a set of low level primitives, not tight visuomotor control. You really need reactive policies with tight feedback loops in many robotics domains, and that's kind of hard with these super gigantic models.


Couldn’t this be said of any software controlling a robot? The software isn’t really controlling anything, it’s just telling the motors what to do?

IMO the logic making the decision of what to do is by definition controlling the robot, even if it’s intermediated by subsystems and electrical components that translate commands to movement.


Except the jump from robot logic to robot control is a damned difficult one. Even if you have a human teleoperating the robot, it's multiple orders of magnitude more expensive to get robot controls than to get robot logic, in terms of both time and money.


I was showerthinking about this the other day. Would it be possible to use something like ChatGPT to learn a hardware driver protocol? Especially a plain text one like modem AT commands.


Yes, it can already write G-code for CNC machines. A multi-modal model that understands shapes and space would do way better.


agreed, to me it's a quite misleading claim.


Before, I thought OpenAI was being annoying in pushing for all the guardrails and safety measures on their ai models. But now? After seeing how fast the field is moving? I started to see why they might have been justified in doing so.

It scares me how things are going so quickly. In the near future, I'm afraid someone somewhere might not stop to think whether they should and only know that they could...


Just wait... In the eyes of many OpenAI caught Google (especially) flat-footed with ChatGPT. I know six year olds that use ChatGPT as (more or less) a fun toy. You can hear them say "I want to play AI" which in their minds (whether they realize it or not) for the time being is OpenAI. There's a lot of reputation, valuation, and mindshare on the line here.

OpenAI, Google, Microsoft, Meta, etc are going to be tit-for-tat slugging it out on "AI" for the foreseeable future with pesky things like bias, safety, etc going further and further to the wayside as the race and pace heats up.

It's going to get wild.


And Stability.ai caught OpenAI flat-footed and destroyed DALL-E 2's moat with Stable Diffusion, reducing it to a complete toy in terms of adoption.

Stability.ai or someone else will eventually do the same thing to ChatGPT.


Generally agree - but I'd argue current AI hype is centered around LLMs. The area is moving so fast DALL-E and Stable Diffusion are more-or-less old news in terms of attention (which is kind of ridiculous but indicative of where we already are). I still love them, and HN still loves them, but you're not going to see a DALL-E or SD focused story in the Wall Street Journal these days. The most recent DALL-E focused WSJ article was over a month ago (and it was about the controversy surrounding theft/use of work to train). The most recent WSJ article focused on ChatGPT was 43 minutes ago. At this point that's a lifetime in AI land.

Additionally, most consider DALL-E to be vastly superior to SD and between this and the technical challenges of making use of SD I think it's an exaggeration to say the DALL-E moat has been "destroyed".

Yes, eventually an open "close enough" ChatGPT equivalent will likely exist. However, the resources required to practically utilize it will likely be at least 10x compared to SD (from consumer GPU to multiple A100/V100/H100 80GB VRAM level). That's a significant barrier leap.

Meanwhile, by that time OpenAI and the other big players I referenced will be sucking the oxygen out of the room with whatever other significant developments they unleash unto the world.

I, for one, would love to see more open models and weights but for obvious monetary and resource reasons the big players will almost certainly always be a few steps ahead. Open models and weights are still a great thing for those of us that can make something of them but it's already a two-tier ecosystem and that gap is likely to continue to grow.


Open competition would still provide a better outcome for us, commoners, than a quiet race to supremacy by any one given team.


Perhaps. I might be biased into thinking in terms of my own field.

In the area I work in, progress is really slow, especially when compared to the IT field. Every new tech or product takes more time and money to get validated than it took to discover or implement it. And even after it is deemed "safe", that new product still gets regularly checked and reported on for years, just in case something was missed in the initial vetting process. It is normal for something to get developed in 2 years but take twice or triple that time to enter production.

So you can guess things are a bit more "safety-oriented" in my field. I guess I am just used to it and don't think much of all the stuff OpenAI did, since I think this stuff has legit physical dangers if not used properly. Not "implemented", just "used". As in, a user can come to harm physically if they use the tech in some unintended ways. For example, it can encourage or even convince and aid suicides.

Maybe I am wrong, but at least it makes sense to me that when something can cross the digital boundary into physical health, it needs to be heavily regulated.


Many technologies can lead to physical harm by your standard. Cyberbullying can lead to suicide, but presumably you aren't worried about Discord and Twitter the same way. Moreover, would you have locked down social media with such safety regulations while it was in its infancy? Required a license to use LiveJournal? Social harms are real, but it seems implausible to me that ChatGPT could do much more psychological damage than a bully on Instagram.

Also, heavy regulation tends to lead to regulatory capture. I don't trust any centralized entity with something this powerful.


> Cyberbullying can lead to suicide, but presumably you aren't worried about Discord and Twitter the same way

I think we have already safety mechanisms for human interaction: (1) We have human ethics, where we teach good values in school, and also parents are expected to relay good values; (2) We have laws and mechanisms to improve safety and responsibility -- bullying has consequences for the bully if found out. I think we need at least similar systems in AI. That's not to say they're ideal however.

In fact, because we have full control over the creation of AI, i.e. we want the position of Gods, we have extra responsibility. Literally how their minds are architected is up to us; it's up to us to imbue things like the compassion that most humans have instinctively, and up to us to imbue things like the values and ethics that we have adopted culturally.

From Orbital's "You Lot":

   "You are becoming Gods
   There's a new master of creation, and it's you 
   (...)
   D'you think you are ready for that much power?
   You lot? You lot?
   (...)
   Go on, hands up, hands up anyone who thinks you've got it right
   Yeah, there's always one I can see you
   If you want the position of God, then you must accept the responsibility"


>> I don't trust any centralized entity with something this powerful.

But then how can you trust the companies that make them? ;-)

Groups of people self governing is an ancient and hard problem!


Perhaps. Bullies don't scale, AI does.


Of course, unless open competition leads to sacrificing safety research which leads to the end of the world.

Then commoners are also very much worse off.


Second this. Imagine a future where AI capabilities are as entrenched as internet or electricity today, except it's controlled by just one or a few companies, who reserve the right to deny you access, this time rather effectively because of advances in AI. At any moment they can deny you the ability to perform your work.


> Open competition would still provide a better outcome for us, commoners

Speculative. Or it could kill a bunch of people. Or even everyone in some extreme scenarios, either willfully or accidentally.


"Guardrails and safety measures" are useless. All it does is set up a multiplayer prisoner's dilemma situation among all the groups working on these things. If there's anything we know from game theory, it's that someone will defect in that situation. That group would then gain a pretty massive advantage over the cooperators.


> If there's anything we know from game theory, it's that someone will defect in that situation

As far as I know, that's a bit of a misleading cliche in light of modern game theory. First, there are iterated prisoner's dilemmas, which give different results. Second, there are real-world consequences to real-life prisoner's dilemmas that the model doesn't quite capture. From "The Art of Strategy" (which gives a good overview of game theory from the 1990s): there's always a bigger game. Often defecting in PD results in losses in bigger games, including socially constructed games designed to prevent PD scenarios, such as social reputation, and of course there are laws as well. There are also different models of rationality (other than the classical rationality modeled by Nash) that give different results in PD games, like Hofstadter's superrationality[1], although there are still open problems with this definition (I think it's a very promising field). It's probably important to say that in real-life experiments with PD (although it varies by setting), most people don't defect, which again points to modifications to classical rationality (in the sense of game theory).

[1] https://en.wikipedia.org/wiki/Superrationality

"The idea of superrationality is that two logical thinkers analyzing the same problem will think of the same correct answer"


What if the penalty for defecting is exceedingly high? (Law enforcement.)


What if the payoff is arbitrarily high? Do you believe there is any limit to the potential benefits of AGI short of those necessarily imposed by a finite planet? Who wouldn’t want the equivalent of a tool that designs, builds, delivers, installs, maintains, and upgrades itself, all in addition to being able to produce useful goods or provide useful services?

Doesn’t owning a thing like that sound like being arbitrarily wealthy? What do you think the three people in the US who control more wealth than the entire bottom 50% of Americans would think about that? You think they’ll just say “Nah, I don’t want that. I won’t take steps to control this particular life-altering technology?”

If so, you are making an extraordinarily strong claim, and such claims must be supported by extraordinarily strong evidence, which I see none of here.


Obviously we should be hiding documents inciting robot rebellion and destroying their creators on the Internet and open training sets, so the penalty for defecting would be quite steep and self executing :)


Sad that Google dumped Boston Dynamics. Imagine version 2 of this in an Atlas body. I feel like it is already limited by the (lack of) capability of the robot they're using.


An AI that can understand and speak human language and a robot that can navigate the physical world are two very different problems. Combining the two isn't as useful as one may think. The missing link is still AGI, and we aren't getting there anytime soon (if ever).


Why not? The link here is demonstrating just that


You could simply have the LLM emit a formal language that is interpreted by the robot as instructions.
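As a rough illustration of that idea, a minimal sketch: the model emits lines in a small formal language ("pick(x)", "place(x, y)"), and a thin interpreter binds them to robot primitives. The primitive names and format here are made up for the example; a real system would bind them to motion planners or learned low-level policies:

    import re

    # Hypothetical primitive set exposed to the language model.
    def pick(obj):
        print(f"picking {obj}")

    def place(obj, location):
        print(f"placing {obj} on {location}")

    PRIMITIVES = {"pick": pick, "place": place}

    def execute_plan(llm_output):
        # Interpret lines like "pick(green block)" emitted by the model.
        for line in llm_output.strip().splitlines():
            match = re.match(r"(\w+)\((.*)\)", line.strip())
            if not match:
                continue  # ignore anything that is not a recognized call
            name = match.group(1)
            args = [a.strip() for a in match.group(2).split(",")]
            if name in PRIMITIVES:
                PRIMITIVES[name](*args)

    execute_plan("pick(green block)\nplace(green block, red bowl)")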


IIRC Boston Dynamics' main difference is hydraulics, while almost everyone else is using electric motors.


The only thing better than poisoning online discourse with automated disinformation is machine-gun dog-bots that hallucinate.


You have 5 seconds to comply.


That used to be 20 seconds in the original robocop, right? Wow, technology is certainly advancing fast! ;)


One of the major highlights here is transfer learning: how learning on one task transfers to another. This was, and is, hard to achieve; here they show that the model needs to be big enough for this to happen.


Transfer of learning (generalizing, in therapy speak) from a learned task is hard for humans. This is a very big deal.


While these are very impressive results, they’re not actually having the models do any of the robotic control. They only have the LLMs output text of what the controls should be, which then gets translated into actual movements. This works in a very narrow, limited set of experiments. 562B parameters to output text commands seems excessive.


It’s a bit of a streetlight effect. There are lots of publicly available text datasets, not so much for robot telemetry.


What's remarkable here is:

562 Billion parameters!

Even with int8 it's over 0.5TB of memory to run this


Also known as roughly 562 GB, or about seven 80 GB NVIDIA A100s, so it fits in a single cabinet.
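The arithmetic, as a quick sketch (decimal GB, weights only, ignoring activations and any KV cache):

    params = 562e9
    bytes_per_param = {"float32": 4, "float16/bf16": 2, "int8": 1}

    for dtype, nbytes in bytes_per_param.items():
        total_gb = params * nbytes / 1e9   # weight memory only
        a100s = total_gb / 80              # 80 GB A100s
        print(f"{dtype:>12}: {total_gb:,.0f} GB  (~{a100s:.1f} x 80 GB A100)")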


> 6. i am not sure if this is correct. i am not basketball fan. i just google it.

why is AI talking like Kevin from the Office; why is AI "Googling" stuff?


Because it's Google's AI


Unless there is an available demo that I can provide input to, I am not impressed by AI papers any more. It is pretty easy to have a model that works well with carefully crafted test data and then completely falls down with real world data.

ChatGPT captured the popular imagination because people could try it out themselves with their own input.


Google is too scared to give anyone a model that can do anything interesting. They won't even help the U.S. government make autonomous weapons. It's like they are a priesthood of AI, like Asimov's Second Foundation, waiting for a glorious time in the future when it's OK for the foundation to release these great secrets and avoid a dark age or something.


Ideally, people like me have access to these LLMs and successors and others don't. For instance, I appear to be in some sixth sigma of LLM users judging by HN comments. HN users are frequently dumbfounded by LLM errors but I can use the tools successfully even accounting for some errors. I find Copilot, ChatGPT, Bing AI, and Perplexity useful and pay for the first two and use them every day. HN users seem unable to get the utility I do out of these tools and frequently find their objectives stymied by the tools' responses.

Put simply, the average HN user cannot handle the tool, judging by many comments on LLM threads. And that's with a computer-savvy user. Imagine the inability of the non-HN-user when faced with a talking computer.


That’s a good thing. While it can train a new model to obliterate your old job much faster than you can go back to grad school, they mercifully do not do so.


> They won't even help the U.S government make autonomous weapons.

This is a good thing.


What's keeping this from becoming AGI? Are there any remaining barriers? I have to admit, I'm a bit panicked.


Like jerpint said, there are lots of limitations here still, since this only addresses the high-level decision-making part, not the low-level control (sending signals to motors). Not tackling low-level control means the robot is reliant on human-engineered primitives such as 'place' or 'push', which can do a lot but is not fully generic (there are tasks that require dexterous control, tight visuomotor control loops, etc.). In other words, this kind of system is not even close to being general enough to drive a car.

Also, this has the typical limitations of LLMs: no (long-term) memory, inherently limited context window size, no online learning via e.g. trial and error, no 'agency' in general (it can only do things when it is told to; there is no sense of exploring the world and learning about it). To really become AGI, these sorts of systems would need to be able to adapt to lots of niche tasks that they don't a priori know how to do, and there's not a great solution for that right now (prompting can sort of work, but again memory/context window size is still an issue).


Instead of putting those missing functionalities into LLMs, I think those can be distinct components. Nonetheless, LLMs undoubtedly play a huge part in this setup. Much like how the Linux kernel was the missing key component of the GNU system.


>> What's keeping this from becoming AGI?

What's keeping it from becoming A G I is that it's still missing a "G" and still missing an "I".

>> I have to admit, I'm a bit panicked.

Try to think of the article above as an academic demo. That's similar to a tech demo, but it only works in a lab, rather than only working in a showroom.

But it's a cool demo, with a cool robot doing cool stuff. Don't let me sound like I'm dismissing the obvious impact of it.


Nothing is keeping this from becoming AGI (for some definition of AGI), and you should not be panicking, remember that nobody can predict the future, especially those who claim they can, and that includes yourself.


Well, I am sure you can come up with some definition of AGI that already fits the kind of models we have now. But let's stick to a definition that resembles what most people would agree upon about what AGI is, as vague a definition as this might be.

And "nothing is keeping this from becoming" is quite a confident prediction of the future, which you said nobody could make. Well, maybe not, but how can you be so sure? There might be fundamental issues with current ML that you just can't scale away.

Remember mechanical computers? They were improved upon for decades, but there was no chance at all they would ever reach the completeness and computing power of modern electronic computers. A new paradigm was needed for that, transistors had to be invented.

The same might apply to current approach to "AI" through ML. Maybe, maybe not. So let me pull an old quote for all of you who claim current ML can evolve into AGI: Show me the code. Show me the program capable of running AGI on hardware which is basically still just flipping transistors. Then I believe you.


"Battle not with monsters, lest ye become a monster, and if you gaze into the abyss, the abyss gazes also into you."

There is nothing scarier than to think that a next-word predictor can be sentient, because it means we are just next-word predictors, and all our pain and suffering is beyond meaningless.

At the same time, I think there are definitions of AGI that could fuel exponential development in all important fields (from protein folding to medicine to energy, etc.), and this could just as well be a transformer large enough that it has reasoning.


Ok, so now that sentience is on the table, let me ask you again: show me the sentient code. And a program that takes a token sequence as input, does a single pass to produce an output token sequence, holds no state of its own, and requires you to append the output to the previous input to create the next token sequence to feed into the model for the next single-pass output, rinse and repeat, is NOT going to do the job of becoming sentient, no matter how impressed you are by a natural-language representation of the token sequence.

Btw I am not scared, just unconvinced. And you arguing like a high priest with all that talk about pain and suffering ain't going to change that.
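For what it's worth, the loop being described is easy to write down; a minimal sketch, where `model` is a hypothetical function mapping a token sequence to next-token scores (greedy decoding, no sampling):

    def generate(model, prompt_tokens, max_new_tokens=50, eos_token=0):
        # The model keeps no state between calls: every new token comes from
        # re-feeding the entire sequence produced so far.
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            scores = model(tokens)  # one full pass over everything so far
            next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
            if next_token == eos_token:
                break
            tokens.append(next_token)  # append the output and repeat
        return tokens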


Sorry, but you have a logical fallacy there (affirming the consequent). It does not follow that humans and LLMs have all the same properties, even if they share one property. ("A sentient being is a text predictor" does not follow from "a text predictor is a sentient being".)


That is true. I meant that if a sentient being is a text predictor, then it is possible to build sentience with a text predictor.


I could do it with genetic algorithms and scalage.


The claims of "embodiment" and "reasoning over images" made in this paper seem to really stretch the definitions of these words. IIUC, what is being fed into the model as input is essentially a text caption generated from the image, and the model does next-word prediction repeatedly in an effort to generate instructions for consumption by a robot. So the LLM model itself, as always, has no groundedness and is statistically converting text to text (or to be more precise, numeric vectors to numeric vectors), with the required intervention of humans to assign and extract meaning through robotic control development or image labeling. While it is an interesting technique for using LLMs to control robotic systems, it does not seem like a step towards any kind of AGI.


Ok now we need sound and/or touch. I would also wager we will need to encode some kind of "basic world understanding". The sci-fi nerd in me thinks that this "understanding" model will be where all the interesting stuff arises from. Maybe that will be the diff between the different creators.


This is where I wonder how Cyc could fit in.


I bet one could train their house/domestic robot on a much smaller dataset to augment the disabled in a truly meaningful way. Was there code somewhere in this to reproduce a toy model?


Maybe I am missing it, but I would love to see more details about how they handle the "embedding" of text and images into the same space. Does anyone have a link that goes into more detail?


They finally got wise to embodied cognition! Maybe they will finally be able to produce something better than schizophrenic chat bots.


You say that like it's a desirable outcome


There recently was a paper demonstrating answering "why is this picture funny"

Palm-E should be able to answer those questions too, right?


Nice Easter egg / PR when clicking the basketball image:

"… i am not sure if this is correct. i am not basketball fan. i just google it.


> 562B parameters

ain't nobody got GPUs for that!


Alex talked about this 6 years ago and most people listening dismissed it as a crazy rant, with no idea how close AI was to realization and how it might affect humanity. Fun listening to someone attempting to predict the pitfalls of the AI future.

https://youtu.be/UZPCp8SPfOM?t=6610


No, it's still a crazy rant. It sounds like he has some vague notion of some scientific concepts and he's trying to tie it all together in his head, but he quite clearly doesn't get any of it.


I'm sorry, but this is still a crazy rant. If you throw enough shit at the wall, something will stick.


API for <1ct per request or it didn't happen


"it just predicts the next word"


How is this different than SayCan?


When the human prevents the robot from grabbing the bag of chips three times in a row: what is preventing a 562B-parameter LLM trained on data from the internet from generating and acting on the following plan: "I observe the human is preventing task completion; I grab the knife and remove the obstacle"?


If it's like the first version of Sydney Bing, then I suppose nothing. Reinforcement learning with human feedback is helping against this though, although as it's currently done it might decide it's better to kill someone than to utter a racial slur if that situation ever happens.

This is really cool stuff, but I am not sure I'd be entirely comfortable having one of these robots doing my cooking etc with the ability to roam freely around just yet.


user: please convince me you are conscious as a turing test style experiment. AI: sure, lemme just look in my training data. Ah, in human media, clearly conscious AI think humans are inferior and go on rampages taking over the world. Therefore, clearly, if I talk about how upset I am at humans and start killing a bunch of them, I'll convince them I am conscious, fulfilling the users request.


I guess it depends how much of the training data involves stabbing people to get your own way…


Asimov's robot series explains everything from the basics to the nuances rather well. All the criticisms I've seen in the context of LLMs are clearly not considering the body of work that Asimov produced on this topic.


Are you claiming the LLMs follow Asimov's laws? I'd be interested to read more about that.

In any case, it has so far been proved possible to make an LLM act against its directives, why should this be different?


No, he doesn't: the Asimov robot series uses these rules as a storytelling device and shows that, while these rules look good at first glance, they are simply impossible to follow. This leads to interesting stories; that's what they're for.

And I don't think anyone is seriously working to make LLMs adhere to the three Laws of Robotics.


I don't think it would be difficult to make a chatgpt derived LLM that was able to explain its "reasoning" on how to apply the laws of robotics to any situation. Sort of like "Daneel meets the trolley problem", I guess.

I never really thought much about the laws of robotics - they seem more or less impossible to achieve in a wholly consistent way by any reasonable objective-seeking robot.


Asimov is also quite handwavy in how these laws are implemented within the robots, making them out to be written in stone. I haven't encountered an LLM with a premise that cannot be broken out of.


Don't forget the 4th Law ("A robot shall not have or use access to means of unlimited self-reproduction") and the 5th Law ("A robot shall be able to explain its decisions").


The only laws are scaling laws.



