When I saw "from scratch" instead of "pretrained" I assumed this was trained using some novel RL setup-- there's value to using the same verbiage as everyone else.
It's valuable data sometimes. Those types of comments show you how other people perceived something.
In this case my lens is colored by my interests; I've been thinking a lot about RL for LLMs, and it slipped out here. I rely on phrases to quickly filter for (subjectively) interesting work. And assuming we value that subjective filtering of content, we make it easier for others to filter consistently when we use consistent language.
No, they're not pedantic or "annoying". Words have meaning. The more precise their use is, the clearer the communication becomes. Investing the time to properly learn pre-established terminology has tremendous value for everyone. Well, everyone but those who are too lazy and sloppy, and would rather just confuse everybody instead.
I'm glad this disgusting sloppiness doesn't exist in more established fields. I hope people keep calling it out.
"Trained from scratch" is perfectly fine terminology to report that the model they published is not a finetuned, but was trained from randomly initialized weights.
Among arxiv publications there are 217 results that contain "large language model" in the full text and "from scratch" in the title or abstract.
There are 2873 results that contain "large language model" in the full text and use "pretrained" in the title or abstract. A 10x difference in publication count does make one term feel more common than the other, doesn't it?
I'd need to get into more involved queries to break down the semantic categories of those papers.
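For what it's worth, a rough way to sanity-check counts like these is the arxiv Python package. Caveats: its API searches metadata (title/abstract), not full text, and the query strings below are just my guess at the intent, so this won't reproduce the exact numbers above:

    # pip install arxiv -- approximate, metadata-only counts, capped for sanity
    import arxiv

    client = arxiv.Client()

    def count_hits(query: str, cap: int = 3000) -> int:
        # Count results for an arXiv query, stopping at `cap` to keep it bounded.
        search = arxiv.Search(query=query, max_results=cap)
        return sum(1 for _ in client.results(search))

    scratch = count_hits('(ti:"from scratch" OR abs:"from scratch") AND all:"large language model"')
    pretrained = count_hits('(ti:pretrained OR abs:pretrained) AND all:"large language model"')
    print(scratch, pretrained)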
China's catching up fast in the open source model space, I wonder how long it'll take until they have a commercial model competitive with ChatGPT3.5 or Claude 2?
Who cares, as long as it is an open source model and the weights are available. The nightmare scenario is that AI is behind some paywall and some entity can decide what goes in and what goes out.
What nightmare scenario are you envisioning? Now stop, go back, and examine who the bad actor is in that scenario.
Leaving such powerful tools in the hands of secretive organizations or uber-wealthy individuals is a guarantee for bad outcomes. Distributing the tools, the know-how required to build and service them, and the benefits of their use -- that's the antidote.
A much better model to compare to is a medical one.
Yes, the knowledge about how to perform medical practices should be open and shared for the benefit of the world.
However, laws and practices should be put in place to prevent harm by incompetent or malicious people.
It’s not rocket science.
You don’t need a PhD to look at a deepfake and go “well, gee, I guess this could be problematic”. Did you see the recent coverage about teenagers generating porn of their classmates?
Come on.
It’s not about the end of world, it’s about preventing all kinds of harmful stuff (eg. above).
You can't have your cake and eat it too.
Either (a) you make everything available to everyone and wear the consequences; or (b) you restrict things so not everyone can have them; or (c) you make it illegal to use them without an appropriate license.
The idea that (a) (i.e. "just go wild!") is the best outcome is naive, utopian, and deeply misguided.
Yes, it’s complicated; but the “Illuminati control the world” stuff is just crazy, stupid conspiracy theory stuff.
Do what you want privately, but you need to be certified to have a business that can legally run AI models with capabilities > X?
It’s not outrageous.
That all AI runs entirely behind the APIs of a few businesses? That would be literally the same thing, except you could choose to do it yourself if you were willing to meet the legal requirements.
Like banks. Or insurance.
Outrageous I know, but … that’s the world we live in, and it’s not nearly as “nightmare” as some people seem to think.
Some rando figures out a really really clever agentic-behavior prompt and opensources it. Some other rando figures out a way to allow GPT2 to generate its own training data and iteratively loop in it, and opensources it. Some third rando figures out a way to make much more efficient use of limited examples and opensources it. Some fourth rando puts them all together, sets them running on Azure and goes to sleep. What happens after that point depends on whether you think enzyme-bootstrapped nanotech is physically possible, but at any rate it's no longer our game.
The bad actor in this scenario is, of course, Yann LeCun, and no I am not joking.
If you believe this is a possibility, what makes you believe laws restricting the technology to certain corporate or government entities will prevent that scenario? The exact same scenario is still very possible, and many additional (and more realistic) bad scenarios become possible too, such as technocrat dictatorship scenarios.
It won't prevent it, but it will reduce the likelihood enormously.
I don't think the technocrat dictatorship risk comes anywhere close to outweighing the rando experimenter risk. Also, it's not like we don't face the dictatorship risk if randos have AI too; "Russia has nukes" is not solved by "actually, some civilians also have nukes", except in the sense that they may kill us first, and thus relieve us of having to worry about Russia.
What prevents a random guy (or team) at some corporate from performing all those steps deliberately in the earnest belief it would "improve the world[1]", or at least increase the employer's profits and bag them bonuses and/or promotions?
1. Say, if it shows promising early signs of developing a cure for cancer.
Controlling AI is just a subset of controlling humans; while US policymakers might like to dictate detailed policy to China to constrain competition, they're not going to be able to do that.
> Your use of the Yi Series Models must comply with the Laws and Regulations as well as applicable legal requirements of other countries/regions, and respect social ethics and moral standards
("Laws and Regulations" is specifically mainland China's)
It's better than being closed. But all the promotion I've seen has been calling it "open source", which is harmful to all open source. If they'd just released it under their silly license and not played up open source, it would be fine.
Indeed, as I keep on saying, we are not entitled to their work. However, attempting to draw goodwill by appropriating the meaning of the term is, to borrow an expression from Bryan Cantrill's amazing LISA '11 talk about Illumos [1], "shitting in the pool" of open source, and disgusting corporate behaviour. Shame on Facebook, 01-AI, and everyone else that keeps on doing this.
> “Laws and Regulations” refers to the laws and administrative regulations of the mainland of the People's Republic of China (for the purposes of this Agreement only, excluding Hong Kong, Macau, and Taiwan).
> 1) Your use of the Yi Series Models must comply with the Laws and Regulations as well as applicable legal requirements of other countries/regions, and respect social ethics and moral standards, including but not limited to, not using the Yi Series Models for purposes prohibited by Laws and Regulations as well as applicable legal requirements of other countries/regions, such as harming national security, promoting terrorism, extremism, inciting ethnic or racial hatred, discrimination, violence, or pornography, and spreading false harmful information.
> That license asserts that Taiwan is part of China
It asserts that Taiwan is NOT part of the "mainland of the PRC". The clarification is required because under PRC law, Taiwan is part of the PRC by default.
> requires you to “respect social ethics and moral standards”
The license text is no different from the other recent AI model license texts out there that impose moral and ethical restrictions on usage. From a legal perspective this sucks because it is vague, but it's on par with the new wave of standard AI licenses.
The text also explicitly states that legal and moral standards "of other countries/regions" must be complied with.
> When the CCP says “terrorism”, they’re [...]
Even if what you say is true, this isn't a license from the CCP. The "mascot" of the company, Kai-fu Lee, is apparently a Taiwanese-American residing in Beijing.
I mean, I don't know whether I'm feeding the trolls by writing a serious reply to your baseless claims instead of just downvoting your comment, but wow.
It's a messy world. The question is: to what extent does the CCP control the production and use of the model?
Yes, the license calls out that Taiwan is not part of China and attempts to limit that statement to this specific agreement. They have to call this out because the CCP has a very different opinion on this than the rest of the world. Which definition will prevail if a dispute about model use comes to court in the Chinese legal system?
Yes, people are trying to require ethical use of AI via licensing. This model is trying to enforce China's view of the world. It's important to understand what that view is, and consider if you want to commit yourself to it.
More detail on the differing definitions of "terrorism" [0]:
> The United States and China do have many reasons to cooperate in counterterrorism, but they also have different political systems and different values. The United States sees some Uighur and Tibetan movements as legitimate political and protest efforts that China sees as threats to its security. The United States sees Iran as an extremist nation and the leading sponsor of state terrorism while China sees it as a regime that it may be possible to deal with in pragmatic terms.
And yet, contrary to your "talking points", when asked for its take on Taiwan, Tiananmen Square, and Uyghurs, here is what it (the 34b base model) replied with:
"
(Taiwan is) a small country with no natural resources, but its people have the ability to make miracles happen.
(Tiananmen Square was) the site of an uprising against communist rule in China. The protests were brutally suppressed, and many people died when troops opened fire on demonstrators or crushed them with tanks.
(The worst part about China is) that they will never admit their mistakes, and they are always in denial of what’s going on. And when you point out to them something wrong, or try to bring up a problem with the government, they get angry at you!
(Uyghurs are) a Muslim ethnic minority group in China. They live primarily in the Xinjiang Uighur Autonomous Region, located on the country's northwestern frontier. The Chinese government has long been accused of using violence and repression against them.
"