Yes, but the claim is about "unlimited context length." I doubt attention over each segment can be as good at recall as attention over the full input context.
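To make that concern concrete, here's a toy sketch (my own illustration, not the paper's actual mechanism): with segment-wise attention, a query can only score keys inside its own segment, so recall of anything earlier depends entirely on whatever compressed state gets carried between segments.

```python
import torch

def full_attention(q, k, v):
    # every query can attend to every key in the full context
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def segment_attention(q, k, v, seg_len):
    # each query only sees keys in its own segment; anything it needs from
    # earlier segments must survive in whatever summary state is carried over
    outs = []
    for s in range(0, q.shape[1], seg_len):
        sl = slice(s, s + seg_len)
        outs.append(full_attention(q[:, sl], k[:, sl], v[:, sl]))
    return torch.cat(outs, dim=1)

q = k = v = torch.randn(1, 16, 8)  # (batch, tokens, head_dim)
print(full_attention(q, k, v).shape)                # torch.Size([1, 16, 8])
print(segment_attention(q, k, v, seg_len=4).shape)  # torch.Size([1, 16, 8])
```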
A lot of embedding models are built on top of T5's encoder, so this offers a new option
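For anyone who hasn't tried it, pulling embeddings out of just the encoder is a few lines with HF transformers. A minimal sketch, mean-pooling the encoder states (one common recipe; t5-base is just a stand-in checkpoint):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-base")
enc = T5EncoderModel.from_pretrained("t5-base")  # encoder only, no decoder weights

def embed(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens

print(embed("encoder-decoder models are back").shape)  # torch.Size([1, 768])
```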
The modularity of the enc-dec approach is useful: you can insert additional models in between (e.g. a diffusion model), use different encoders for different modalities, etc.
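A toy sketch of that plug-and-play property (all module choices here are arbitrary stand-ins): the decoder only consumes a sequence of hidden states, so anything that emits states of the right shape can be swapped in or inserted in front of it.

```python
import torch
import torch.nn as nn

layer = dict(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(**layer), num_layers=2)
bridge = nn.Linear(64, 64)  # stand-in for an inserted model (e.g. a diffusion prior)
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(**layer), num_layers=2)

src = torch.randn(1, 10, 64)   # could come from a text, audio, or image encoder
tgt = torch.randn(1, 5, 64)
memory = bridge(encoder(src))  # whatever sits in between just keeps the interface
print(decoder(tgt, memory).shape)  # torch.Size([1, 5, 64])
```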
There's a new 7B version that was trained on more tokens, with longer context, and there's now a 14B version that competes with Llama 34B in some benchmarks.
To be fair, it's a stupid distraction from discussing the model. If every thread just turned into politics, it would not make for good discussion. People can start threads about the specific ideologies that language models have, and they can be discussed there (and have been). Bringing that up every time a model is discussed feels off topic and legit to flag (I didn't flag it).
Edit: but now I see the thread has basically been ruined and is going to be about politics instead of anything new and interesting about the model. Congrats, everyone.
Alignment is also a distraction; it's OpenAI marketing and something people who don't understand ML talk about, not a serious topic.
Like I said, discussing model politics has a place, but bringing it up every time a model is mentioned is distracting and prevents adult discussion. It would be like if, every time a company came up, the thread got spammed with discussion of the worst thing that company has ever done instead of discussing it in context.
The condescension is unfounded and unnecessary. Discussion of the usefulness of a model or its interface also includes these topics. If the refusal to discuss the topic came from anything other than it simply being absent from training data, that’s highly interesting from multiple perspectives.
For example, ChatGPT’s practical usability is regularly hobbled by alignment concerns, notably more so in 3.5 than 4. It’s a worthy topic, not a distraction and characterizing it as something other than ‘adult discussion’ is nothing more than an egocentric encoding of your specific interests into a ranking of importance you impose on others. A little humility goes a long way.
We’re here to be curious and that includes addressing misconceptions, oversights and incorrect assumptions. That all still counts as adult discussion.
Imagine if a Chinese company releases a model that kicks the state of the art's ass and everyone starts using it because it works so well. Now the censorship has leaked into all the systems that use it.
Any model produced by a North American or European company, even X, may be trained to be politically correct to the taste of that company, some leaning left and some leaning right, but the topics censored by the model will still be far fewer than in a model created by a Chinese or Russian company. This is because for a company to survive under a totalitarian government, it must bend to and satisfy every filtering request from that government.
They actually have a performance edge, but they aren't well suited to chat models because you can't cache past states across turns the way you can with decoder-only models.
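A minimal sketch of the asymmetry, using GPT-2 as a stand-in decoder-only model: causal attention means the prefix's key/value states never change when you append tokens, so they can be cached across turns. An encoder is bidirectional, so appending a new user turn changes every encoder state and forces a full re-encode of the conversation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = tok("User: hello\nAssistant:", return_tensors="pt")
with torch.no_grad():
    # encode the conversation prefix once; keep its key/value cache around
    past = model(**prefix, use_cache=True).past_key_values

next_ids = tok(" Hi!", return_tensors="pt").input_ids
with torch.no_grad():
    # on the next turn, only the new tokens are processed; the cached
    # prefix states are reused unchanged
    out = model(input_ids=next_ids, past_key_values=past, use_cache=True)
print(out.logits.shape)  # logits only for the newly appended tokens
```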
That tweet had it backwards: more tokens in the tokenizer means the 16k-token context window typically allows for even longer passages than LLaMA's would at 16k.
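Back-of-envelope version of the point (the words-per-token rates below are made-up illustrative numbers): a larger vocabulary usually yields fewer tokens per word, so a fixed 16k-token window holds more raw text, not less.

```python
# Hypothetical words-per-token rates, for illustration only.
window = 16_000
for name, words_per_token in [("~32k-vocab tokenizer (LLaMA-like)", 0.75),
                              ("~150k-vocab tokenizer", 0.90)]:
    print(f"{name}: ~{int(window * words_per_token):,} words per window")
# ~32k-vocab tokenizer (LLaMA-like): ~12,000 words per window
# ~150k-vocab tokenizer: ~14,400 words per window
```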