World model on million-length video and language with RingAttention (largeworldmodel.github.io)
196 points by GalaxyNova on Feb 14, 2024 | 60 comments



Amazing that you can just shove a ton of multimodal data into a big transformer and get a really good multimodal model. I wonder where things will top out. For many years a lot of people (including me) were saying "you can't just take existing architectures, scale them up, feed them a lot of data, and expect something impressive", but here we are.


Ditto. My views changed after thinking “wait, isn’t that what Mother Nature did?” Her solutions just took three billion years of pretraining, decades of individual fine tuning, and ungodly amounts of streaming data. Our solutions are faster due to a vastly more efficient learning algorithm and relevant digital compute.

Now I’ve reached the opposite SWAG hypothesis: given a sufficiently general optimization problem, the more one scales training/inference compute, the more impossible it is to prevent intelligence.

That seems disappointingly prosaic. Shouldn’t there be more to it? Human ego creates anthropocentric bias toward thinking we’re special. We aren’t. At best we’re lucky. And nature doesn’t care about our biases. She repeatedly disabuses us of our “special” places — at the center of the universe, the solar system, the tree of life, and now the spectrum of general intelligence.

This changed my perspective on the LLM “statistical parrot”/“no True Scotsman” critics. Their conviction without evidence (i.e. faith) that SotA models don’t “really” reason comes from insecurity. It’s a loud reaction to egos popping. That’s a trauma that I can sympathize with.


Why not? This was the conclusion of “The unreasonable effectiveness of data”: https://static.googleusercontent.com/media/research.google.c...


You forgot the corollary. What transformers fundamentally reason about is a tensor of shape N x (input positions) x (per-head embedding size), where N is the number of attention heads. That's the "latent space" between two layers of a transformer; it's what attention produces (and it's the same shape across all layers in almost all transformers, except the first and the last). Now if you look at this, you might notice ... that's pretty huge for a latent space. Convolutional AI had latent spaces that gradually decreased to maybe 100 numbers, often even smaller. The big transformers have an enormous latent space: for GPT-3 it will be 96 x 4096 x 128. That is a hell of a lot of numbers between two layers. And it just keeps the entire input (up to that point) in memory as it slowly fills up the "context". What then reasons about this data is a "layer" of the transformer, which is more or less a resnetted deep neural network.
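
To put that in numbers (back-of-envelope, treating the figures above as the rough estimate they are rather than exact GPT-3 specs):

  # The activation tensor flowing between two transformer layers, using the
  # (rough) shape quoted above: heads x context positions x per-head dim.
  heads, positions, head_dim = 96, 4096, 128
  print(f"{heads * positions * head_dim:,}")   # 50,331,648 numbers
  # versus a typical convnet bottleneck of ~100-1000 values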

But convnets were fundamentally limited in the dimensionality they could "think" in; the biggest latent spaces I've seen were around 1000 dimensions, because we couldn't keep their thinking stable with more dimensions than that. But ... we do know how to do that now.

You could look at this to figure out what transformers do if you radically simplify. Nobody can imagine a 100,000-dimensional space. Just doesn't work, does it? But let's say we have a hypothetical transformer with a context size of 2. Let's call token 1 "x" and token 2 "y". You probably see where I'm going with this. This transformer will learn to navigate a plane in a way similar to what it's seen in the training data. "If near (5,5), go north by 32" might be what one neuron in one layer does. This is no different in 100,000 dimensions, except now everybody's lost.

But ... what happens in a convnet with a latent space of 50,000? 100,000? 1,000,000? What happens, for that matter, in a simple deep neural network (i.e. just fully connected layers + softmax) of that size? This was never really tried, for two reasons: the hardware couldn't do it at the time, AND the math wouldn't support it (we didn't know how to deal with some of the problems; likely you'd need to "resnet" both convnets and deep neural networks, for example).

Would the "old architectures" just work with such an incredibly massive latent space?

And there's the other side as well: improve transformers ... what about including MUCH more in the context? A long list of previous conversations, for example. The entire text of reference books: a multiplication table, a list of ways triangles can be proven congruent, the periodic table, physical constants, the expansion rules for differential calculus, "Physics for Scientists and Engineers", the whole thing. Yes, that will absolutely blow out the latent space, but clearly we've decided that a billion or two of extra investment will still allow us to calculate the output.


We've been testing it in the local LLM Discords; turns out it's just a llama 7B finetune that can run on any old GPU (which is cool).

https://huggingface.co/brucethemoose/LargeWorldModel_LWM-Tex...

https://huggingface.co/dranger003/LWM-Text-Chat-128K-iMat.GG...

And its long context recall is quite good! We've already kind of discovered this with Yi, but there are some things one can do with a mega context that you just can't get with RAG.


> but there are some things one can do with a mega context that you just can't get with RAG.

Can you elaborate? In my mind, RAG and "mega context" are orthogonal - RAG is something done by adding documents to the context for the LLM to reference, and "mega context" is just having a big context. No?


I think he means not needing to have a great search system to identify RAG chunks. Just throw everything in.


Or more specifically, be able to look at everything all at once instead of in chunks.


> And its long context recall is quite good! We've already kind of discovered this with Yi, but there are some things one can do with a mega context that you just can't get with RAG.

I've got to imagine that a mega-context like this can help RAG work in ways that just aren't possible otherwise, i.e. bringing in many more search results, or more surrounding context around the results, so that the processing step can do much more.


Because it might not be clear:

  … d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.
https://huggingface.co/LargeWorldModel

In terms of content, I am blown away yet again by the SoTA speeding on by as I try to catch up. Can someone with a more cynical eye point me to competitors or problems with this approach? Because as it stands… that jump to a context length of a million tokens is pretty impressive to an outsider.


Look at how RingAttention is implemented: it's blockwise attention distributed among many GPUs, in other words brute-force parallelization. For inference they use a TPU v4-128 pod, so you're not running this at home any time soon.
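
If it helps, here's a rough single-process sketch of the blockwise idea (not their actual JAX/TPU implementation): each device keeps one block of queries, the key/value blocks get rotated around a ring of devices, and each device folds them into its attention output with an online softmax. The real thing overlaps the ring transfer with compute; this toy just indexes the blocks.

  # Toy, single-process illustration of blockwise "ring" attention (not the
  # authors' code). Each "device" owns one block of queries; key/value blocks
  # are visited in ring order and folded in with an online softmax.
  import numpy as np

  def ring_attention(q, k, v, num_devices):
      seq_len, d = q.shape
      block = seq_len // num_devices
      qb = [q[i*block:(i+1)*block] for i in range(num_devices)]
      kb = [k[i*block:(i+1)*block] for i in range(num_devices)]
      vb = [v[i*block:(i+1)*block] for i in range(num_devices)]
      out = []
      for i in range(num_devices):              # one query block per "device"
          acc = np.zeros((block, d))
          m = np.full((block, 1), -np.inf)      # running row max
          s = np.zeros((block, 1))              # running softmax denominator
          for step in range(num_devices):       # KV block arriving from the ring
              j = (i + step) % num_devices
              scores = qb[i] @ kb[j].T / np.sqrt(d)
              new_m = np.maximum(m, scores.max(-1, keepdims=True))
              corr = np.exp(m - new_m)          # rescale earlier accumulators
              p = np.exp(scores - new_m)
              acc = acc * corr + p @ vb[j]
              s = s * corr + p.sum(-1, keepdims=True)
              m = new_m
          out.append(acc / s)
      return np.concatenate(out)

  # Sanity check against plain (non-causal) attention.
  rng = np.random.default_rng(0)
  q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
  ref = np.exp(q @ k.T / np.sqrt(8))
  ref = (ref / ref.sum(-1, keepdims=True)) @ v
  assert np.allclose(ring_attention(q, k, v, num_devices=4), ref)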


Is there any information on a suggested inference setup? I guess they had something different in mind than TPU v4-128 when they put it on HuggingFace?


It's llama 7B, so anything that runs that.

You can quantize the cache and fit quite a bit on GPUs. At least 75k tokens of context on my mere 24GB 3090, maybe 200K with a fancy quantization repo.
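
Back-of-envelope on why cache quantization matters so much here, assuming llama-7B shapes (32 layers, hidden size 4096, full-width K/V since there's no GQA) and ignoring weights and activation overhead:

  # Rough KV-cache size for a llama-7B-shaped model. Back-of-envelope only;
  # ignores model weights and activation overhead.
  layers, hidden = 32, 4096
  bytes_per_elem = {"fp16": 2, "int8": 1, "int4": 0.5}

  def kv_cache_gb(tokens, dtype):
      return tokens * 2 * layers * hidden * bytes_per_elem[dtype] / 1e9

  for dtype in bytes_per_elem:
      print(dtype, round(kv_cache_gb(75_000, dtype), 1), "GB at 75k tokens")
  # fp16 ~39 GB, int8 ~20 GB, int4 ~10 GB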


looking at https://github.com/LargeWorldModel/LWM - they do indeed seem to suggest using a TPU VM


I suppose you could try with a Google Colab notebook attached to a free TPU instance? Probably would be quite limited if it worked at all.


What is this month's best choice to run at home?


If you pull the llama.cpp repo and use their convert/quantize tools on the pytorch version of the models uploaded to huggingface, they will load just fine into ollama:

https://old.reddit.com/r/LocalLLaMA/comments/18av9aw/quick_s...

https://github.com/ggerganov/llama.cpp/discussions/2948

You can run ollama (and a web UI) pretty trivially via docker:

  docker run -d --gpus=all -v /some/dir/for/ollama/data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:latest

  docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway --name ollama-webui ghcr.io/ollama-webui/ollama-webui:main

That particular webui will let you upload models (with configuration). Otherwise, you can use the API directly (you'll need to POST a `blob` first):

https://github.com/ollama/ollama/blob/main/docs/api.md#creat...


Depends on what you're looking for. Check reddit.com/r/LocalLLaMA, but I'm guessing it's still Mixtral for most things, with Yi-34B fine-tunes being useful in some cases since Mixtral fine-tuning is still not there yet.


ollama


It's more of a tech demo, since it's llama 7B (a model that is, TBH, obsoleted by Mistral 7B), and its dataset is not that great.

We've had Yi 6B 200K context for some time, which is also quite good.

The problem, of course, is hardware requirements and VRAM. This one is particularly hairy since it's not a GQA model.


Gotcha, thanks, great stuff to look into. I’m still in fairy-tale symbolic AI land these days, where hardware requirements are a distraction to be abstracted away…


It's not that cost-prohibitive! You can run GPT-3.5-replacing AIs (aka 34Bs) on a $1500 PC... if you can find a used 3090 now.


I wonder why the example videos are in this specific clip-compilation format.

It feels to me that to navigate that, you essentially have to index 500 ten-second videos, and that looks a lot easier than retrieving information from an actual one-hour-long video, because the latter will have a lot more easy-to-mix-up moments. So maybe it hides an inability to answer questions about actual long videos (in the paper, the other example videos cap out at 3 minutes in length, for what I can see).

On the other hand, maybe it's just for results-presentation purposes, because it is much more readily "verifiable" for everyone than saying "trust us, somewhere in this very long video there's the correct answer, unarguably".

So if someone happens to know more about that, I'd be very interested.


It's pretty wild watching technology develop where, in February, I genuinely don't have a confident idea of just how far it will have progressed by December of the same year.

Open models have just been on fire lately, and the prospect of pulling synthetic data from the next generation of SotA models to train the next generation of open models, each taking nuanced and clever approaches to infrastructure improvements, has me pretty much considering all bets to be off.

At this point, the bottleneck is increasingly the human ability to adapt to improving tools rather than limitations in the tools themselves.


Physics probably felt similar a hundred years ago.


It gives one an appreciation for 1940s and 1950s film, where scientists were somewhat idolized as being on the forefront of a massive new future. The world these pop-culture expressions came from was alive with non-stop growth in the understanding of math, medicine, physics, and on down the technology list. Those had to have been heady days to be alive and even slightly interested in technology and the tools it uses.


Some pretty fascinating collaborators:

- Matei Zaharia, CTO of Databricks

- Pieter Abbeel, Director of the Berkeley Robot Learning Lab and Co-Director of the Berkeley Artificial Intelligence Research (BAIR) lab

- Two talented PhD students: Hao Liu and Wilson Yan


This looks really promising!

Other than this sentence:

> We curated a large dataset of videos and languages from public book and video datasets, consisting of videos of diverse activities and long-form books.

I didn’t see any other mention of the datasets used; is this intentional?


They have at least some of the dataset uploaded:

https://huggingface.co/LargeWorldModel

And the model page specifically mentions Books3.


According to [0]:

- Books3 dataset

- 700B text-image pairs from Laion-2B-en, filtered to only keep images with at least 256 resolution

- 400M text-image pairs from COYO-700M, filtered to only keep images with at least 256 resolution

- 10M text-video pairs from WebVid10M

- 3M text-video pairs from a subset of InternVid10M

- 73K text-video chat pairs from Valley-Instruct-73K

- 100K text-video chat pairs from Video-ChatGPT

0: https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax#train...


While I’m not sure about this one, many AIs do hide their training data because it’s illegally obtained (i.e. file sharing of copyrighted works). That’s half of why I dropped AI. The “Proving Wrongdoing” part of my article has specific examples of it:

http://gethisword.com/tech/exploringai/


Same is true for "training data" of most/all humans.


No, it’s not. Pre-Web, humans were mostly trained by our parents, our schools/colleges, the places we went, and the things they had access to (e.g. cable TV). Whether free or paid, they had legal access to that data. It would only be illegal if they started distributing extra copies or doing their own performances of the band’s songs.

These companies do the very thing that file sharing cases already ruled was illegal. They also scrape all kinds of material whose licenses often say they can’t use it without citations, commercially, etc. The authors asked for some benefit in return for free goods they shared. After not giving them that, the AI suppliers have the nerve to both sell the results and put legal restrictions on them, including terms for sharing. So, they ignore their training suppliers’ legal rights while asserting the same kinds of legal rights for themselves for profit.

How humans are trained has nothing to do with AIs unless you were raised by theft, cons, and hypocrisy. There are certainly people like that. It says more about the sinful nature of humanity than about training AIs, though.


Humans produce new works based on their experiences, which is a nice way of saying: "based on others' works they have seen".

This is considered original work unless it's too blatantly copied, despite those humans never having a license to create derivative works. In other words, it's legally treated as if no other works contributed to it (again, unless it's too blatantly copied).

Note: this is the law working like this. Not a license, not a contract. Authors do not have any power under copyright to prevent this, nor do they have power to demand something in return. Not even in cases where it damages them, like parodies or reviews destroying a work's appeal/reputation/sales.

In practice "blatant" has to be pretty damn blatant. Almost always only exact copies are found to be violating, and even then there are exceptions (e.g. Google summaries do not violate copyright despite copying portions of the source material).

Hence human works are the same as AI works. Assuming not too blatantly copied, why shouldn't they be treated as original works?


Yes, humans produce new works based on their experiences. Their legally-permitted experiences. If they committed crimes and their works reveal it, they can be punished for those crimes. AIs should not be treated any better than human beings in humans’ legal system.

As far as infringement goes, I’m not sure if you’re talking about copyright law in your comment or how you would prefer legal systems to be designed. You didn’t mention any of the basic rules of copyright that apply to training data. They include the rights to distribute and show the copyrighted works.

Under copyright law, people taking others’ property to distribute it without their permission is routinely treated as theft. Taking something from someone that was shared under specific conditions, but not making good on your end, is also treated as a problem. Many voters who aren’t lawyers consider those immoral acts. They also think artists should get some rewards, maybe have rights, and people should honor agreements.

The datasets the AI suppliers have built and shared break many of these laws. That’s where I’m coming from. God commands us to obey the law to be blameless with a more stable society. We can’t just each break the ones we don’t like expecting no consequences.

It makes sense to reform it, though. If you read my link, there should’ve been a proposal you might like that allows the things you want. Assuming a powerful copyright lobby, I drafted the proposal to protect their works (i.e. money/fame) while allowing anything people can legally access to be used in training AIs. Their outputs’ copyrights would be treated however people’s are (same interpretations). That should cover the vast majority of use cases for model training while blocking infringements, rip-offs, etc.


I don't think anyone is accusing AI models of distributing copyrighted works verbatim, so any argument will have to focus on AI derivative works, not original ones.

But if I understand you correctly, you're complaining that the data OpenAI (for example) downloaded from the internet and presented to GPT-4 does not count as legally acquired? Why not? It was downloaded from the internet, so I think that implies it did not violate any license on OpenAI's part. Saving it for a long time might be in the grey zone, but generally that is accepted, when it comes to humans, either as fair use or as a technical necessity (such as caching).


"I don't think anyone is accusing AI models of distributing copyrighted works verbatim"

They do that, too. They've been caught, reported on, and lawsuits are in progress. I have piles of verbatim quotes from them about certain material. I was actually using ChatGPT partly for that research since I thought the (free) source was legally clear. Later, I found out it was against their highly-readable license. OpenAI had taken their work without permission against their license terms. I deleted all my GPT artifacts. That's all I can say about that one.

"But if I understand you correctly, you're complaining that the data OpenAI (for example) downloaded from the internet and presented to GPT4 does not count as legally acquired?"

The "why" was in the article I shared. This section has specific claims about their data:

https://gethisword.com/tech/exploringai/provingwrongdoing.ht...

The books in GPT, BooksCorpus2 in The Pile, the papers that forbid commercial use (e.g. some on arXiv), corporate media's articles, and online resources used outside their permissions are easy examples. Basic copyright law says you have to obey certain principles when using published works. They were ignoring all of them.

Most file-sharing cases also say you can't distribute copyrighted works without the authors' permission. Even free ones, since they're often free on sites that support the authors, like with ads or publicity. They're (a) passing collections of such material around, which is already illegal, and (b) doing so in ways that only benefit them, not the authors.

When testing for copyright infringement, one thing they look at is who gets value out of the situation. Did they take away the value, especially financial, that the author would get from their own use of the work? Are they competition? That ChatGPT's answers replaced a lot of their users' use of the source material says that might be a yes. And does the new work exist to make a profit or for non-commercial use? Most of them sell it, with OpenAI and Anthropic making billions off others' copyrighted works. Definitely yes. Do they ignore others' copyright and contract rights while asserting their own? Yes, hypocrites indeed.

Even a junior lawyer would warn you about most of these risks. They're commonly used in copyright cases. The only way they could fail almost across the board is if they were doing it on purpose for money, power, and fame. If so, they deserve to experience the consequences of those actions.

Also, let's not pretend the folks getting billions of dollars for AI development couldn't have paid some millions here and there for legal data. Their own research says high-quality data would've made their AIs perform better, too. Greed was working against everyone's interests here, if their interests were what they say (public-benefit AI).


So... what if my "C++ Primer" copy is from z-library... does that disqualify all my C++ programs?


You have to read the terms like I did for the various sites they scraped. I did a quick skim of theirs:

https://z-lib.io/pages/terms-of-use

Many of these sites have terms that give the site owner a license to do the activities that would happen when training AIs. In theory, some terms look like they could even let them bundle the data for AI training or run that phase themselves. It’s actually a good market/public-benefit opportunity I hope they act on, especially HN and arXiv.

I didn’t read enough to see if any random person using the site had a license to do anything they want with any copyrighted content on the site, versus just reading it. It did have a lot to say about copyright, though. Two quotes were interesting:

“z-lib.io the site doesn't provide any piracy copyright infigrment contents and you shouldn't use site for illegal copyright piracy.”

“Company respects the intellectual property of others and asks that users of our Site do the same. In connection with our Site, we have adopted and implemented a policy respecting copyright law that provides for the removal of any infringing materials and for the termination of users of our online Site who are repeated infringers of intellectual property rights, including copyrights.”

(Note: One of those quotes is also on the About Us page.)

They respect copyright, claim to not violate it, ban violating it, and will take down violators and their uploads. It would seem z library has a stronger stance on copyright protection than the AI community.


the sentence itself should be alarming.

the claim is that a dataset was created. of words and of videos, and that it was created from public datasets of books and videos, those datasets containing books, and videos.

it takes too many words to say almost nothing.

nothing to see here.

if that isn’t the intent, then the authors need to do better.


The information is in the model card though:

- Books3 dataset

- 700B text-image pairs from Laion-2B-en, filtered to only keep images with at least 256 resolution

- 400M text-image pairs from COYO-700M, filtered to only keep images with at least 256 resolution

- 10M text-video pairs from WebVid10M

- 3M text-video pairs from a subset of InternVid10M

- 73K text-video chat pairs from Valley-Instruct-73K

- 100K text-video chat pairs from Video-ChatGPT


wouldn’t that have been a cleaner explanation than the sentence provided? books and videos, see model card. the redundant language is a smell whether the emitter wishes to acknowledge it or not. the point still stands: the hot mess of a sentence didn’t need to be that way.

> …so petty and pedantic…

if nothing else, think of the language models that need to digest this. sure you can send in gobbledygook and get out something plausibly sensible, but why?

llms will push pedantry to the forefront. or suffer from it. who knows. have fun.


You've never decided to rewrite a sentence and forgot to check the entire sentence again after an incomplete refactoring? I'd say you're in the minority. This is a v1 draft on Arxiv. I don't expect the final paper to have that sentence.


hence the feedback.


What does "Million-Length" mean?


It means a million tokens. A token in this context is either a text token (as in tokenization https://en.wikipedia.org/wiki/Large_language_model#Probabili...) or a video token; the paper describes the latter as "each frame in the video is tokenized with VQGAN into 256 tokens." (p. 6)
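
At 256 tokens per frame, a 1M-token context works out to roughly 4,000 video frames (1,000,000 / 256 ≈ 3,900), minus whatever text shares the window.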


Berkeley



It blows my mind how quickly we are moving with these advances in LLM, and these are just the ones we see in PUBLIC. I'm sure there are more advanced proprietary solutions that we aren't privy to.


This implementation is similar to something Ilya Sutskever said a few months ago but I think I am misunderstanding both: I think they are saying robots could learn how to move and what facial expressions to use by watching millions of hours of videos involving humans, a sort of LLM of human behavior. I am not a scientist so I may have this wrong.


Not that controversial. You just need to map it to the controls correctly. The experience from others can show what a human would do; there needs to be a layer that figures out how to achieve that outcome with whatever tools are on hand.


nit: UC Berkeley. Not Berkley.


It feels like Matei is everywhere, impressive!


Figures 2 and 3 are incredible, and I hope they hold up in real-life scenarios.


This is a BOMB! Love it!


This is incredible.


Thanks for sharing! I've added to smmry.tech


wow. remember that show with the Lost guy as an eccentric billionaire? this is what he built as a surveillance system.


Person of Interest, and it is actually a very well-thought-out and down-to-earth show. Quite interesting to see many of the elements in the show coming to life with recent AI advancements.


exactly. back when i watched tv that was my favorite show.



