When I saw "from scratch" instead of "pretrained" I assumed this was trained using some novel RL setup-- there's value to using the same verbiage as everyone else.
It's valuable data sometimes. Those types of comments show you how other people perceived something.
In this case my lens is colored by my interests; I've been thinking a lot about RL for LLMs, and it slipped out here. I rely on phrases to quickly filter for (subjectively) interesting work. And assuming we value that subjective filtering of content, we make it easier for others to filter consistently when we use consistent language.
No, they're not pedantic or "annoying". Words have meaning. The more precise their use is, the clearer the communication becomes. Investing the time to properly learn pre-established terminology has tremendous value for everyone. Well, everyone but those who are too lazy and sloppy, and would rather just confuse everybody instead.
I'm glad this disgusting sloppiness doesn't exist in more established fields. I hope people keep calling it out.
"Trained from scratch" is perfectly fine terminology to report that the model they published is not a finetuned, but was trained from randomly initialized weights.
Among arxiv publications there are 217 results that contain "large language model" in the full text and "from scratch" in the title or abstract.
There are 2873 results that contain "large language model" in the full text and use "pretrained" in the title or abstract. A 10x difference in publication count does make one term feel more common than the other, doesn't it?
I'd need to get into more involved queries to break down the semantic categories of those papers.
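For what it's worth, a rough way to sanity-check counts like these is the arxiv Python package. Caveats: its API searches metadata (title/abstract), not full text, and the query strings below are just my guess at the intent, so this won't reproduce the exact numbers above:

    # pip install arxiv -- approximate, metadata-only counts, capped for sanity
    import arxiv

    client = arxiv.Client()

    def count_hits(query: str, cap: int = 3000) -> int:
        # Count results for an arXiv query, stopping at `cap` to keep it bounded.
        search = arxiv.Search(query=query, max_results=cap)
        return sum(1 for _ in client.results(search))

    scratch = count_hits('(ti:"from scratch" OR abs:"from scratch") AND all:"large language model"')
    pretrained = count_hits('(ti:pretrained OR abs:pretrained) AND all:"large language model"')
    print(scratch, pretrained)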
China's catching up fast in the open source model space, I wonder how long it'll take until they have a commercial model competitive with ChatGPT3.5 or Claude 2?
Who cares, as long as it is an open source model and the weights are available. The nightmare scenario is that AI is behind some paywall and some entity can decide what goes in and what goes out.
What nightmare scenario are you envisioning? Now stop, go back, and examine who the bad actor is in that scenario.
Leaving such powerful tools in the hands of secretive organizations or uber-wealthy individuals is a guarantee for bad outcomes. Distributing the tools, the know-how required to build and service them, and the benefits of their use -- that's the antidote.
A much better model to compare to is a medical one.
Yes, the knowledge about how to perform medical practices should be open and shared for the benefit of the world.
However, laws and practices should be put in place to prevent harm by incompetent or malicious people.
It’s not rocket science.
You don’t need a PhD to look at a deepfake and go “well, gee, I guess this could be problematic”. Did you see the recent coverage about teenagers generating porn of their classmates?
Come on.
It’s not about the end of world, it’s about preventing all kinds of harmful stuff (eg. above).
You can't have your cake and eat it too.
Either (a) you make everything available to everyone and wear the consequences; or (b) you restrict things so not everyone can have them; or (c) you make it illegal to use them without an appropriate license.
The idea that (a) (i.e. "just go wild!") is the best outcome is naive, utopian, and deeply misguided.
Yes, it’s complicated; but the “Illuminati control the world” stuff is just crazy, stupid conspiracy theory stuff.
Do what you want privately, but you need to be certified to have a business that can legally run AI models with capabilities > X?
It’s not outrageous.
That all AI runs entirely behind the APIs of a few businesses? That would be literally the same thing, except you could choose to do it yourself if you were willing to meet the legal requirements.
Like banks. Or insurance.
Outrageous I know, but … that’s the world we live in, and it’s not nearly as “nightmare” as some people seem to think.
Some rando figures out a really really clever agentic-behavior prompt and opensources it. Some other rando figures out a way to allow GPT2 to generate its own training data and iteratively loop in it, and opensources it. Some third rando figures out a way to make much more efficient use of limited examples and opensources it. Some fourth rando puts them all together, sets them running on Azure and goes to sleep. What happens after that point depends on whether you think enzyme-bootstrapped nanotech is physically possible, but at any rate it's no longer our game.
The bad actor in this scenario is, of course, Yann LeCun, and no I am not joking.
If you believe this is a possibility, what makes you believe laws restricting the technology to certain corporate or government entities will prevent that scenario? The exact same scenario is still very possible, and many additional (and more realistic) bad scenarios become possible too, such as technocrat dictatorship scenarios.
It won't prevent it, but it will reduce the likelihood enormously.
I don't think the technocrat dictatorship risk comes anywhere close to outweighing the rando experimenter risk. Also, it's not like we don't face the dictatorship risk if randos have AI too; "Russia has nukes" is not solved by "actually, some civilians also have nukes", except in the sense that they may kill us first, and thus relieve us of having to worry about Russia.
What prevents a random guy (or team) at some corporate from performing all those steps deliberately in the earnest belief it would "improve the world[1]", or at least increase the employer's profits and bag them bonuses and/or promotions?
1. Say, if it shows promising early signs of developing a cure for cancer.
Controlling AI is just a subset of controlling humans; while US policymakers might like to dictate detailed policy to China to constrain competition, they're not going to be able to do that.
> Your use of the Yi Series Models must comply with the Laws and Regulations as well as applicable legal requirements of other countries/regions, and respect social ethics and moral standards
("Laws and Regulations" is specifically mainland China's)
It's better than being closed. But all the promotion I've seen has been calling it "open source", which is harmful to all open source. If they'd just released it under their silly license and not played up open source, it would be fine.
Indeed, as I keep on saying, we are not entitled to their work. However, attempting to draw goodwill by appropriating the meaning of the term is, to borrow an expression from Bryan Cantrill's amazing LISA '11 talk about Illumos [1], "shitting in the pool" of open source, and disgusting corporate behaviour. Shame on Facebook, 01-AI, and everyone else that keeps on doing this.
> “Laws and Regulations” refers to the laws and administrative regulations of the mainland of the People's Republic of China (for the purposes of this Agreement only, excluding Hong Kong, Macau, and Taiwan).
> 1) Your use of the Yi Series Models must comply with the Laws and Regulations as well as applicable legal requirements of other countries/regions, and respect social ethics and moral standards, including but not limited to, not using the Yi Series Models for purposes prohibited by Laws and Regulations as well as applicable legal requirements of other countries/regions, such as harming national security, promoting terrorism, extremism, inciting ethnic or racial hatred, discrimination, violence, or pornography, and spreading false harmful information.
> That license asserts that Taiwan is part of China
It asserts that Taiwan is NOT part of the "mainland of the PRC". The clarification is required because under PRC law, Taiwan is part of the PRC by default.
> requires you to “respect social ethics and moral standards”
The license text is no different from the other recent AI model license texts out there that impose moral and ethical restrictions on usage. From a legal perspective this sucks because it is vague, but it's on par with the new wave of standard AI licenses.
The text also explicitly states that legal and moral standards "of other countries/regions" must be complied with.
> When the CCP says “terrorism”, they’re [...]
Even if what you say is true, this isn't a license from the CCP. The "mascot" of the company, Kai-fu Lee, is apparently a Taiwanese-American residing in Beijing.
I mean, I don't know whether I'm feeding the trolls by writing a serious reply to your baseless claims instead of just downvoting your comment, but wow.
It's a messy world. The question is: to what extent does the CCP control the production and use of the model?
Yes, the license calls out that Taiwan is not part of China and attempts to limit that statement to this specific agreement. They have to call this out because the CCP has a very different opinion on this than the rest of the world. Which definition will prevail if a dispute about model use comes to court in the Chinese legal system?
Yes, people are trying to require ethical use of AI via licensing. This model is trying to enforce China's view of the world. It's important to understand what that view is, and consider if you want to commit yourself to it.
More detail on the differing definitions of "terrorism" [0]:
> The United States and China do have many reasons to cooperate in counterterrorism, but they also have different political systems and different values. The United States sees some Uighur and Tibetan movements as legitimate political and protest efforts that China sees as threats to its security. The United States sees Iran as an extremist nation and the leading sponsor of state terrorism while China sees it as a regime that it may be possible to deal with in pragmatic terms.
And yet, contrary to your "talking points", when asked for its take on Taiwan, Tiananmen Square, and Uyghurs, here is what it (the 34b base model) replied with:
"
(Taiwan is) a small country with no natural resources, but its people have the ability to make miracles happen.
(Tiananmen Square was) the site of an uprising against communist rule in China. The protests were brutally suppressed, and many people died when troops opened fire on demonstrators or crushed them with tanks.
(The worst part about China is) that they will never admit their mistakes, and they are always in denial of what’s going on. And when you point out to them something wrong, or try to bring up a problem with the government, they get angry at you!
(Uyghurs are) a Muslim ethnic minority group in China. They live primarily in the Xinjiang Uighur Autonomous Region, located on the country's northwestern frontier. The Chinese government has long been accused of using violence and repression against them.
"