Gemma: New Open Models

tosh · 2024-02-21T13:53:06 1708523586

Benchmarks for Gemma 7B seem to be in the ballpark of Mistral 7B

  +-------------+----------+-------------+-------------+
  | Benchmark   | Gemma 7B | Mistral 7B  | Llama-2 7B  |
  +-------------+----------+-------------+-------------+
  | MMLU        |   64.3   |     60.1    |     45.3    |
  | HellaSwag   |   81.2   |     81.3    |     77.2    |
  | HumanEval   |   32.3   |     30.5    |     12.8    |
  +-------------+----------+-------------+-------------+

via https://mistral.ai/news/announcing-mistral-7b/

sa-code · 2024-02-21T14:59:45 1708527585

Thank you. I thought it was weird for them to release a 7B model and not mention Mistral in their release.

mochomocha · 2024-02-21T15:19:18 1708528758

The technical report (linked in the 2nd paragraph of the blog post) mentions it, and compares against it: https://storage.googleapis.com/deepmind-media/gemma/gemma-re...

nl · 2024-02-22T03:49:04 1708573744

The release page has comparisons to Mistral everywhere: https://ai.google.dev/gemma

sa-code · 2024-02-22T22:20:38 1708640438

Good to know, although cmd+f for "mistral" returns 0 hits on the original link

mirekrusin · 2024-02-21T15:14:22 1708528462

They forgot.

Also phi-2.

brucethemoose2 · 2024-02-21T14:31:12 1708525872

Only 8K context as well, like Mistral.

Also, as always, take these benchmarks with a huge grain of salt. Even base model releases are frequently (seemingly) contaminated these days.

DreamGen · 2024-02-21T16:06:59 1708531619

Mistral Instruct v0.2 is 32K.

tarruda · 2024-02-22T12:25:13 1708604713

Mixtral (8x7b) is 32k.

Mistral 7b instruct 0.2 is just a fine tune of Mistral 7b.

netdur · 2024-02-22T14:12:49 1708611169

original Mistral or GGUF one?

tosh · 2024-02-21T14:48:50 1708526930

Agree: will be interesting how Gemma does on ChatBot Arena

Kydlaw · 2024-02-22T13:35:45 1708608945

They state in their report that they filter evaluation data off their training data, see p.3 - Filtering:

"Further, we filter all evaluation sets from our pre-training data mixture, run targeted contamination analyses to check against evaluation set leakage, and reduce the risk of recitation by minimizing proliferation of sensitive outputs."

YetAnotherNick · 2024-02-21T17:29:08 1708536548

According to their paper, average of standard task of Mistral is 54.0 and for Gemma it's 56.4, so 4.4% relative better. Not as big as you would expect for the company which invented transformers and probably has 2-3 order more compute for training it vs few month old French startup.

Also for note on their human evaluations, Gemma 7B IT has a 51.7% win rate against Mistral v0.2 7B Instruct.

jcuenod · 2024-02-21T14:48:06 1708526886

Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...

rfw300 · 2024-02-21T16:02:54 1708531374

A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.

phh · 2024-02-21T17:34:27 1708536867

Fun that's not my experience of Phi-2. I use it for non-creative context, but function calling, and I find as reliable as much bigger models (no fine-tuning just constraining JSON + CoT). Phi-2 unquantized vs Mixtral Q8, Mixtral is not definitely better but much slower and RAM-hungry.

kgeist · 2024-02-21T20:22:23 1708546943

What prompts/settings do you use for Phi-2? I found it completely unusable for my cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it's been mostly like a random garbage generator for me. With Llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations which satisfy the JSON grammar.

I'm building a system which has many different passes (~15 so far). Almost every pass is a LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe which pass does what, and then ask Phi-2 to list the passes which are relevant for the user query (I called it "pass masking"). That would save a lot of time and collapse 15 steps to 2-3 steps on average. In fact, my Solar 10.7B model does it pretty well, but it takes 7 seconds for the masking pass to work on my GPU. Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: it fails to reason (what's relevant and what's not), unlike Solar, and it also refuses to follow the output format (so that I could parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar, and fails spectacularly with Phi-2.

phh · 2024-02-21T21:01:35 1708549295

My non-domain-specific prompt is:

> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.

and then I constrain its answers to Thoughts: [^\n]* and Assistant: <JSON schema>, and I have two shots included in the prompt.

I haven't been able to get anything useful out of Phi-2 in llama.cpp (but I only tried quantized models). I use python/huggingface's transformers lib instead.

nl · 2024-02-22T03:50:35 1708573835

Interesting. I've had no success at all using any of the Phi2 models.

kgeist · 2024-02-23T10:34:31 1708684471

An update on my endeavour: so, model switching is very costly under llama.cpp (I have to switch between Llama and Phi2 because my GPU has low amounts of VRAM). And this switch (reloading the weights into VRAM) defeats the whole purpose of the optimization. Having only Llama on GPU without reloading takes less time than if I'd use Llama+Phi2. And Phi2 alone is pretty bad as a general purpose LLM. So I'm quite disappointed.

lobocinza · 2024-02-25T02:19:05 1708827545

I recently upgraded to AM5 and as I have an AMD GPU I'm using llama.cpp on CPU only and I was positively surprised by how fast it generate stuff. I don't have the case of massive workloads so YMMV.

refulgentis · 2024-02-21T17:33:51 1708536831

Hear hear! I don't understand why it has persistent mindshare, it's not even trained for chat. Meanwhile StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel ..

djsavvy · 2024-02-21T18:45:45 1708541145

How have you been using RAG in your browser/on your phones?

refulgentis · 2024-02-21T20:45:09 1708548309

To be released, someday [sobs in engineer]

Idea is usage-based charging for non-local and a $5/month sub for syncing.

keep an eye on @jpohhhh on Twitter if you're interested

now that I got it on web, I'm hoping to at least get a PoC up soon. I've open-sourced the consitutent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings, FLLAMA has llama.

https://github.com/Telosnex/fonnx

https://github.com/Telosnex/fllama

myaccountonhn · 2024-02-21T18:22:00 1708539720

I tested it for an offline autocompletion tool and it was hilariously bad.

daemonologist · 2024-02-21T15:11:50 1708528310

Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.

FergusArgyll · 2024-02-21T14:54:12 1708527252

the real gold will be when this gets finetuned. (maybe by mistral...)

brucethemoose2 · 2024-02-21T14:59:48 1708527588

TBH the community has largely outrun Mistral's own finetuning. The 7B model in particular is such a popular target because its so practical to train.

whimsicalism · 2024-02-21T15:19:37 1708528777

Strong disagree - a Mistral fine tune of llama 70b was the top performing llama fine tune. They have lots of data the community simply does not.

brucethemoose2 · 2024-02-21T15:24:39 1708529079

Miqu was (allegedly) an internal continued pretrain Mistral did as a test, that was leaked as a GGUF.

Maybe its just semantics, it is technically a finetune... But to me theres a big difference between expensive "continuation training" (like Solar 10.7B or Mistral 70B) and a much less intense finetuning. The former is almost like releasing a whole new base model.

It would be awesome if Mistral did that with their data, but thats very different than releasing a Gemma Instruct finetune.

whimsicalism · 2024-02-21T15:31:56 1708529516

There’s typically a difference in LR between a ‘continued pretrain’ and ‘fine tune.’ I don’t have the details around miqu, but was merely trying to say that Mistral could produce a better version of these models than the OSS community might. If the size of the corpora they use means we are no longer in fine tuning territory, then okay.

speedgoose · 2024-02-21T18:58:42 1708541922

Arthur Mensch, the Mistral CEO, confirmed the leak. https://twitter.com/arthurmensch/status/1752737462663684344

saintradon · 2024-02-21T23:00:01 1708556401

Also, it led to one of the funniest pr I've seen in a while

https://huggingface.co/miqudev/miqu-1-70b/discussions/10

sanjiwatsuki · 2024-02-21T18:32:56 1708540376

No shot. Mistral Medium's outputs from API were virtually identical. Miqu really was Mistral Medium which happened to be a continued pretrain

itomatik · 2024-02-21T17:32:52 1708536772

how does one finetune llama (or any other LLM) using mistral?

is the flow like this?

- take small dataset

- generate bigger dataset using mistral (how this is this done?)

- run LoRA to fine tune gemma extended dataset.

itomatik · 2024-02-21T23:59:23 1708559963

I should have said "run LoRA or your favorite fine-tuning technique to produce your fine-tuned llama."

heckl239u · 2024-02-23T05:20:48 1708665648

https://www.youtube.com/watch?v=1Mn0U6HGLeg some test vids came out on the 7b model. Shock it doesn't perform well at all.

attentive · 2024-02-22T02:29:06 1708568946

In my subjective tests it's not even close to Mistral. While my local gemma is quantized, so is mistral.

But I also tried gemma on huggingface.co/chat which I assume isn't quantized.

lawxls · 2024-02-21T15:05:09 1708527909

Honestly, this is more of a PR stunt to advertise the Google Dev ecosystem than a contribution to open-source. I'm not complaining, just calling it what it is.

Barely an improvement over the 5-month-old Mistral model, with the same context length of 8k. And this is a release after their announcement of Gemini Pro 1.5, which had an exponential increase in context length.

scarmig · 2024-02-21T15:12:40 1708528360

Who cares if it's a PR stunt to improve developer good will? It's still a good thing, and it's now the most open model out there.

moffkalast · 2024-02-21T15:49:45 1708530585

How is it more open than Mistral with Apache 2.0? Google wants people to sign a waiver to even download it.

scarmig · 2024-02-21T16:08:23 1708531703

Fair enough; that was more directed at LLaMA and derivatives, which have commercial restrictions.

observationist · 2024-02-21T16:49:56 1708534196

How exactly is it the "most open model" ?

It's more like a masterclass in corporate doublespeak. Google’s "transparency" is as clear as mud, with pretraining details thinner than their privacy protections. Diving into Google’s tech means auctioning off your privacy (and your users' privacy) to the highest bidder.

Their "open source" embrace is more of a chokehold, with their tech biases and monopolistic strategies baked into every line of code. Think of it as Google's way of marking territory - every developer is a fire hydrant.

These megacorps aren’t benevolent patrons of open source; they're self-serving giants cloaking power grabs under the guise of "progress".

Use these products at your own risk. If these companies wanted to engage in good faith, they'd use Apache or MIT licensing and grant people the agency and responsibility for their own use and development of software. Their licenses are designed to mitigate liability, handcuff potential competitors, and eke every last drop of value from users, with informed consent frequently being an optional afterthought.

That doesn't even get into the Goodharting of metrics and actual performance of the models; I highly doubt they're anywhere near as good as Mistral.

The UAE is a notoriously illiberal authoritarian state, yet even they have released AI models far more free and open than Google or Meta. https://huggingface.co/tiiuae/falcon-40b/blob/main/README.md

If it’s not Apache or MIT, (or even some flavor of GPL,) it’s not open source; it’s a trojan horse. These "free" models come at the cost of your privacy and freedoms.

These models aren't Open or Open Access or Free unless you perform the requisite mental gymnastics cooked up by their marketing and legal teams. Oceania has always been at war with Eastasia. Gemma is doubleplusgood.

stale2002 · 2024-02-21T17:21:04 1708536064

You said a lot of nothing without actually saying specifically what the problem is with the recent license.

Maybe the license is fine for almost all usecases and the limitations are small?

For example, you complained about metas license, but basically everyone uses those models and is completely ignoring it. The weights are out there, and nobody cares what the fine print says.

Maybe if you are a FAANG, company, meta might sue. But everyone else is getting away with it completely.

observationist · 2024-02-21T18:10:41 1708539041

I specifically called out the claims of openness and doublespeak being used.

Google is making claims that are untrue. Meta makes similar false claims. The fact that unspecified "other" people are ignoring the licenses isn't relevant. Good for them. Good luck making anything real or investing any important level of time or money under those misconceptions.

"They haven't sued yet" isn't some sort of validation. Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion, their IP taken, and repurposed or buried. These tech companies have never behaved otherwise, and to think that they will is willfully oblivious.

They don't deserve the benefit of the doubt, and should be called out for using deceitful language, making comparisons between their performative "openness" and actual, real, open source software. Mistral and other players have released actually open models and software. They're good faith actors, and if you're going to build a product requiring a custom model, the smart money is on Mistral.

FAANG are utilizing gotcha licenses and muddying the waters to their own benefit, not as a contribution to the public good. Building anything on the assumption that Meta or Google won't sue is beyond foolish. They're just as open as "Open"AI, which is to say not open at all.

stale2002 · 2024-02-21T19:43:40 1708544620

> Anyone building an actual product that makes actual money that comes to the attention of Meta or Google will be sued into oblivion

No they won't and they haven't.

Almost the entire startup scene is completely ignoring all these licenses right now.

This is basically the entire industry. We are all getting away with it.

Here's an example, take llama.

Llama originally disallowed commercial activity. But then the license got changed much later.

So, if you were a stupid person, then you followed the license and fell behind. And if you were smart, you ignored it and got ahead of everyone else.

Which, in retrospect was correct.

Because now the license allows commerical activity, so everyone who ignores it in the first place got away with it and is now ahead of everyone else.

> won't sue is beyond foolish

But we already got away with it with llama! That's already over! It's commerical now, and nobody got sued! For that example, the people who ignored the license won.

esafak · 2024-02-21T22:41:59 1708555319

The nice thing about this is that the calculus is in favor of startups, who can roll the dice.

crossroadsguy · 2024-02-21T15:20:52 1708528852

That’s about the point of having a developer ecosystem, isn’t it?

kiraaa · 2024-02-21T15:21:48 1708528908

mistral 7b v0.2 supports 32k

brucethemoose2 · 2024-02-21T15:29:10 1708529350

This is a good point actually, and an underappreciated fact.

I think so many people (including me) effectively ignored Mistral 0.1's sliding window that few realized 0.2 instruct is native 32K.

tarruda · 2024-02-22T10:58:49 1708599529

Mixtral 8x7B has 32k context.

Mistral 7b instruct 0.2 is just an instruct fine tune of Mistral 7b and stays with a 8k context.

simonw · 2024-02-21T16:32:04 1708533124

The terms of use: https://ai.google.dev/gemma/terms and https://ai.google.dev/gemma/prohibited_use_policy

Something that caught my eye in the terms:

> Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.

a2128 · 2024-02-21T16:50:46 1708534246

This is actually not that unusual. Stable Diffusion's license, CreativeML Open RAIL-M, has the exact same clause: "You shall undertake reasonable efforts to use the latest version of the Model."

Obviously updating the model is not very practical when you're using finetuned versions, and people still use old versions of Stable Diffusion. But it does make me fear the possibility that if they ever want to "revoke" everybody's license to use the model, all they have to do is just post a model update that's functionally useless for anything and go after anyone still using the old versions that actually do anything.

slowmovintarget · 2024-02-21T20:06:50 1708546010

So if they wish to apply censorship they forgot, or suddenly discovered a reason for, they want you to be obligated to take it.

Good faith possibilities: Copyright liability requires retraining, or altering the underlying training set.

Gray area: "Safety" concerns where the model recommends criminal behavior (see uncensored GPT 4 evaluations).

Bad faith: Censorship or extra weighting added based on political agenda or for-pay skewing of results.

philsnow · 2024-02-21T21:30:45 1708551045

Sounds like it would be interesting to keep track of the model's responses to the same queries over time.

> Gemma-2024-Feb, what do you think of the situation in the South China Sea?

> > The situation in the South China Sea is complex and multi-faceted, involving a wide range of issues including political conflicts, economic challenges, social changes, and historical tensions.

> Gemma-2024-Oct, what do you think of the situation in the South China Sea?

> > Oceania has always been at war with EastAsia.

threecheese · 2024-02-21T22:25:23 1708554323

This is a great idea; I wonder if anyone is working on AI censorship monitoring at scale or at all. A secondary model could compare “censorship candidate” prompt results over time to classify how those results changed, and if those changes represent censorship or misinformation.

generalizations · 2024-02-21T22:55:12 1708556112

There's also (I think?) been some research in the direction of figuring out more abstract notions of how models perceive various 'concepts'. I'd be interested in the LLM version of diffs to see where changes have been implemented overall, too.

But really, the trouble is that it's tough to predict ahead of time what kinds of things are likely to be censored in the future; if I were motivated to track this, I'd just make sure to keep a copy of each version of the model in my personal archive for future testing with whatever prompts seem reasonable in the future.

mistermann · 2024-02-21T20:26:32 1708547192

We are already culturally incapable of skillfully discussing censorship, "fake news", etc, this adds even more fuel to that fire.

It is an interesting time to be alive!

iandanforth · 2024-02-21T18:26:45 1708540005

These are all very new licenses that deviate from OSI principles, I think it's fair to call them "unusual".

simcop2387 · 2024-02-21T19:07:01 1708542421

I think they meant not unusual in this space, not unusual in the sense of open source licensing.

alwayslikethis · 2024-02-21T22:46:34 1708555594

For this sentence to parse, you need to either add or remove a "not".

simonw · 2024-02-21T19:36:37 1708544197

That's useful context, thanks - I hadn't realized this clause was already out there for other models.

wongarsu · 2024-02-21T21:05:10 1708549510

I don't think a broken model would trigger that clause in a meaningful way, because then you simply can't update with reasonable effort. You would be obliged to try the new model in a test environment, and as soon as you notice it doesn't perform and making it perform would require unreasonable effort you can simply stay on the old version.

However you might be required to update if they do more subtle changes, like a new version that only speaks positively about Google and only negatively about Microsoft. Provided this doesn't have an obvious adverse impact on your use of the model.

ummonk · 2024-02-21T19:49:59 1708544999

Switching to a model that is functionally useless doesn't seem to fall under "reasonable efforts" to me, but IANAL.

Silphendio · 2024-02-21T22:59:15 1708556355

It's worth noting that Stable Diffusion XL uses the OpenRAIL++-M License, which removed the update obligation.

jacooper · 2024-02-21T20:51:07 1708548667

Why the hell do they use such a crappy license in the first place?

tgtweak · 2024-02-21T16:33:27 1708533207

I don't think there's a way they can enforce that reasonably. There's no connection to the mothership to report back what version is being used or license keys at runtime...

Seems more like a "if we discover something unsafe you should update your model and we aren't liable if you don't" than something that would make your model stop working.

summerlight · 2024-02-21T18:08:18 1708538898

This kind of defensive statements in ToS are usually due to obscure regulation or leading cases and model developers need a way to limit liability. There's no practical way to enforce this, but they can claim that when bad things happen it's purely on model users rather than model developers.

pram · 2024-02-21T16:58:39 1708534719

They have to make sure you’re receiving the most cutting edge chiding lectures when you make naughty and problematic requests.

astrange · 2024-02-21T19:00:50 1708542050

You can't make a local model do that. eg force the answer to begin with "Yes" or use control vectors so it agrees with it.

xyzzyz · 2024-02-21T19:36:19 1708544179

This is strangely reminiscent of the Soviet Union, where after they got rid of Lavrentiy Beria, they mailed the update to subscribers of the Great Soviet Encyclopedia, where they asked to remove the three pages with Beria’s biography and replace them with the three provided pages.

legohead · 2024-02-21T16:36:34 1708533394

Sounds like it's "reasonable" for you not to update then.

wahnfrieden · 2024-02-21T19:10:07 1708542607

It says you must make efforts (to a reasonable extent), not that you must give a reason for not making efforts

reissbaker · 2024-02-21T20:09:45 1708546185

This is a TOS, meaning their enforcement option is a lawsuit. In court, if you convincingly argue why it would take an unreasonable amount of effort to update, you win. They can't compel you to unreasonable effort as per their own TOS.

generalizations · 2024-02-21T22:58:16 1708556296

This assumes they even know that the model hasn't been updated. Who is this actually intended for? I'd bet it's for companies hosting the model. In those cases, the definition of reasonable effort is a little closer to "it'll break our stuff if we touch it" rather than "oh silly me, I forgot how to spell r-s-y-n-c".

reissbaker · 2024-02-22T07:36:54 1708587414

Hosting companies can probably just claim they're covered under Section 230, and Google has to go bother the individual users, not them.

staticman2 · 2024-02-22T17:43:07 1708623787

I don't believe that would apply if the host is curating the models they host.

alwayslikethis · 2024-02-21T20:25:14 1708547114

Oh I tried to update, it's just that my router drops the connection after a few hundred MBs...

wongarsu · 2024-02-21T21:10:29 1708549829

If you evaluate what it takes to update, and judge the effort unreasonable, that should be enough. Maybe make a powerpoint presenting that result, if you want something for the lawyers. If you don't see a way forward that leads to a result with reasonable effort you don't have to continue working on it until you hit some arbitrary threshold for unreasonable effort.

maronato · 2024-02-21T18:07:48 1708538868

This sounds like a clause to cover themselves in case older versions have any serious issues

catchnear4321 · 2024-02-21T19:03:53 1708542233

reasonable effort - meaning if their changes meaningfully impact my usage, negatively, it would be unreasonable to ask me to upgrade.

sounds good.

this is not financial advice and ianal.

res0nat0r · 2024-02-21T19:29:38 1708543778

Isn't this just lawyer speak for "we update our model a lot, and we've never signed off on saying we're going to support every previous release we've ever published, and may turn them off at any time, don't complain about it when we do."

CodesInChaos · 2024-02-21T20:12:34 1708546354

We're talking about downloadable weights here, so they can't turn them off, or force you (through technical means) to use a newer version.

reissbaker · 2024-02-21T20:12:53 1708546373

It's a local model, they can't turn it off. It's files on your computer without network access.

catchnear4321 · 2024-02-21T21:28:49 1708550929

but what if they send a lawyer to ask firmly? (kindly, but firmly.)

reissbaker · 2024-02-22T06:37:16 1708583836

They'd need to send a lot of lawyers, considering that they have no idea how many people are using the model, and very little way of finding out. And they'd need a TOS violation. It would be generally expensive for them to do at scale; this isn't about "turning it off" arbitrarily, it's a CYA in case someone specific does something really bad that makes Google look bad: Google can patch the model to make it not comply with the bad request, and then demand the person running the model update or else lose their license to use the product. It's a scalpel, not an off switch.

4bpp · 2024-02-21T17:17:13 1708535833

Ugh, I would fully expect this kind of clause to start popping up in other software ToSes soon if it hasn't already. Contractually mandatory automatic updates.

bsimpson · 2024-02-21T23:58:19 1708559899

I appreciated this post clarifying the distinction between "open model" and "open source":

https://opensource.googleblog.com/2024/02/building-open-mode...

I'm not sure how to feel about the restrictions. "No porn" feels prudish, particularly for this millennium. I tend to err on the side of freedom in intellectual/political matters; however, the others seem fairly reasonable as far as restrictions go.

phillipcarter · 2024-02-21T17:15:37 1708535737

Huh. I wonder why is that a part of the terms. I feel like that's more of a support concern.

silverliver · 2024-02-22T10:34:38 1708598078

You don't have to agree to this policy to use the model.

samstave · 2024-02-21T20:53:04 1708548784

model watermarking? does this exist?

redder23 · 2024-02-21T19:14:59 1708542899

[flagged]

pests · 2024-02-21T20:19:29 1708546769

They just want no liability for old models.

yencabulator · 2024-02-22T02:18:51 1708568331

You think they have any liability for the latest model?

https://ai.google.dev/gemma/terms#4.4-limitation

alekandreev · 2024-02-21T13:17:49 1708521469

Hello on behalf of the Gemma team! We are really excited to answer any questions you may have about our models.

Opinions are our own and not of Google DeepMind.

voxgen · 2024-02-21T18:33:08 1708540388

Thank you very much for releasing these models! It's great to see Google enter the battle with a strong hand.

I'm wondering if you're able to provide any insight into the below hyperparameter decisions in Gemma's architecture, as they differ significantly from what we've seen with other recent models?

* On the 7B model, the `d_model` (3072) is smaller than `num_heads * d_head` (16*256=4096). I don't know of any other model where these numbers don't match.

* The FFN expansion factor of 16x is MUCH higher than the Llama-2-7B's 5.4x, which itself was chosen to be equi-FLOPS with PaLM's 4x.

* The vocab is much larger - 256k, where most small models use 32k-64k.

* GQA is only used on the 2B model, where we've seen other models prefer to save it for larger models.

These observations are in no way meant to be criticism - I understand that Llama's hyperparameters are also somewhat arbitrarily inherited from its predecessors like PaLM and GPT-2, and that it's non-trivial to run hyperopt on such large models. I'm just really curious about what findings motivated these choices.

owl_brawl · 2024-02-21T19:42:50 1708544570

I would love answers to these questions too, particularly on the vocab size

lordswork · 2024-02-21T15:12:15 1708528335

Is there any truth behind this claim that folks who worked on Gemma have left Google?

https://x.com/yar_vol/status/1760314018575634842

lordswork · 2024-02-21T20:37:31 1708547851

I confirmed all the folks listed on page 12 are still at Google (listed below). I am guessing the linked tweet is a BS claim.

   # Product Management
   Tris Warkentin
   Ludovic Peran

   # Program Management
   Minh Giang

   # Executive Sponsors
   Clement Farabet
   Oriol Vinyals
   Jeff Dean
   Koray Kavukcuoglu
   Demis Hassabis
   Zoubin Ghahramani
   Douglas Eck
   Joelle Barral
   Fernando Pereira
   Eli Collins

   # Leads
   Armand Joulin
   Noah Fiedel
   Evan Senter

   # Tech Leads
   Alek Andreev†
   Kathleen Kenealy†

trisfromgoogle · 2024-02-23T19:26:38 1708716398

Always funny to see your own name as you scroll through HN comments =). You're right, though!

elcomet · 2024-02-21T18:19:30 1708539570

It seems very easy to check no? Look at the names in the paper and check where they are working now

lordswork · 2024-02-21T20:35:36 1708547736

Good idea. I've confirmed all the leadership / tech leads listed on page 12 are still at Google.

Can someone with a Twitter account call out the tweet linked above and ask them specifically who they are referring to? Seems there is no evidence of their claim.

elcomet · 2024-02-22T18:04:54 1708625094

It's also possible Google removed names of people who left. It's not really a research paper, more a marketing piece, so it might be possible (I don't think they would do that with a conf paper)

lordswork · 2024-02-22T19:32:12 1708630332

We'll see if the person making this claim responds with specific Gemma developers that have left. Otherwise, I think it's safe to assume they are just lying.

CaffeinatedDev · 2024-02-21T17:26:28 1708536388

Them: here to answer questions

Question

Them: :O

lordswork · 2024-02-21T17:55:28 1708538128

To be fair, I think they are in London, so I assume they have winded down for the day. Will probably have to wait ~12-18 hours for a response.

bluefinity · 2024-02-21T22:20:50 1708554050

To be fair, the tweet says that they don't work on the models at Google anymore, not that they have left Google.

Might be true, might not be. It's unsourced speculation.

LorenDB · 2024-02-21T18:38:48 1708540728

EDIT: it seems this is likely an Ollama bug, please keep that in mind for the rest of this comment :)

I ran Gemma in Ollama and noticed two things. First, it is slow. Gemma got less than 40 tok/s while Llama 2 7B got over 80 tok/s. Second, it is very bad at output generation. I said "hi", and it responded this:

``` Hi, . What is up? melizing with you today!

What would you like to talk about or hear from me on this fine day?? ```

With longer and more complex prompts it goes completely off the rails. Here's a snippet from its response to "Explain how to use Qt to get the current IP from https://icanhazip.com":

``` python print( "Error consonming IP arrangration at [local machine's hostname]. Please try fufing this function later!") ## guanomment messages are typically displayed using QtWidgets.MessageBox ```

Do you see similar results on your end or is this just a bug in Ollama? I have a terrible suspicion that this might be a completely flawed model, but I'm holding out hope that Ollama just has a bug somewhere.

mark_l_watson · 2024-02-21T19:54:53 1708545293

I was going to try these models with Ollama. Did you use a small number of bits/quantization?

LorenDB · 2024-02-21T20:39:31 1708547971

The problem exists with the default 7B model. I don't know if different quantizations would fix the problem. The 2B model is fine, though.

fosterfriends · 2024-02-21T17:15:38 1708535738

Not a question, but thank you for your hard work! Also, brave of you to join the HN comments, I appreciate your openness. Hope y'all get to celebrate the launch :)

lnyan · 2024-02-21T14:11:06 1708524666

Will there be Gemma-vision models or multimodal Gemma models?

alekandreev · 2024-02-21T23:43:28 1708559008

We have many exciting things planned that we can't reveal just yet :)

Jayakumark · 2024-02-21T17:03:47 1708535027

Have the same question.

h1t35h · 2024-02-21T13:25:01 1708521901

It seems you have exposed the internal debugging tool link in the blog post. You may want to do something about it.

trisfromgoogle · 2024-02-21T13:29:28 1708522168

Ah, I see -- the link is wrong, thank you for flagging! Fixing now.

h1t35h · 2024-02-21T13:30:38 1708522238

The blog post shares the link for debugging tool as https://*.*.corp.google.com/codelabs/responsible-ai/lit-gemm...

.corp and the login redirect makes me believe it was supposed to be an internal link

barrkel · 2024-02-21T14:19:57 1708525197

https://codelabs.developers.google.com/codelabs/responsible-...

littlestymaar · 2024-02-21T13:31:25 1708522285

Same for the “safety classifier”

neximo64 · 2024-02-21T13:30:32 1708522232

The link to the debugging tool is an internal one, no one outside Google can access it

wrexx0r · 2024-02-21T13:36:18 1708522578

The link in the Debugging section redirects to a Google SSO login page

pama · 2024-02-21T13:30:09 1708522209

Will these soon be available on lmsys for human comparison against other models? Can they run with llama.cpp?

ErneX · 2024-02-21T13:37:27 1708522647

Yes to llama.cpp

https://twitter.com/ggerganov/status/1760293079313973408

sbarre · 2024-02-21T13:48:12 1708523292

I came here wondering if these models are "open" in the sense that they'll show up on sites like Ollama where you can download and run them locally.

Am I correct to conclude that this means they eventually will?

It's unclear to me from Google's docs exactly what "open" means for Gemma

benpacker · 2024-02-21T13:51:30 1708523490

Yes - they are open weights and open inference code, which means they can be integrated into Ollama.

They are not “open training” (either in the training code or training data sense), so they are not reproducible, which some have suggested ought to be a component of the definition of open models.

OJFord · 2024-02-21T14:04:06 1708524246

It really should shouldn't it? I'm quite ML-naïve, but surely providing the model without 'training code or training data' is just like providing a self-hostable binary without the source code? Nobody calls that open source, it's not even source available.

michaelt · 2024-02-21T15:50:58 1708530658

It is widely believed (and in some cases acknowledged) that a lot of models are trained on copyrighted data scraped from the web. In some cases, even scrapes of ebook piracy websites - google 'books3' to learn more.

Some companies (such as those working on AI) believe this is legal, others (such as the copyright holders to those books) believe it isn't.

In any case, IMHO it's unlikely any cutting edge models will be offering us their training data any time soon.

sanroot99 · 2024-02-22T06:43:49 1708584229

Can training data be generated from llm,with right prompt?

sunnybeetroot · 2024-02-21T14:16:14 1708524974

That’s why they’re called open as in free to use how you wish, not open source where the source of the training is also provided.

OJFord · 2024-02-21T14:18:06 1708525086

But my point is there's no analogy for that that we call open? It's like self-hostable, or free (as in beer).

sunnybeetroot · 2024-02-21T14:20:20 1708525220

That’s a fair comment, maybe free-to-use is more appropriate.

etiam · 2024-02-21T23:02:03 1708556523

Yes, and there has been some discussion of that

Meta’s LLaMa 2 license is not Open Source https://news.ycombinator.com/item?id=36820122

idiotsecant · 2024-02-21T14:20:32 1708525232

Man, people will find anything to complain about.

OJFord · 2024-02-21T14:24:42 1708525482

I'm not complaining, I'm unlikely ever to use it (regardless of how open or not it is) so it doesn't really matter to me, just surprised to learn what people mean by 'open' in this context.

SushiHippie · 2024-02-21T13:53:51 1708523631

https://huggingface.co/google/gemma-7b-it/tree/main

yes, similar to the llama models, you'll also need to accept the license to download them officially. But the llama models have been unofficially downloadable without accepting the license for quite a while, so it's probably just a matter of time.

sbarre · 2024-02-21T13:50:37 1708523437

Can the Gemma models be downloaded to run locally, like open-source models Llama2, Mistral, etc ?

Or is your definition of "open" different?

austinvhuang · 2024-02-21T14:38:39 1708526319

Yes models can be downloaded locally. In addition to the python NN frameworks and ggml as options, we also implemented a standalone C++ implementation that you can run locally at https://github.com/google/gemma.cpp

kathleenfromgdm · 2024-02-21T13:53:25 1708523605

Yes, you can get started downloading the model and running inference on Kaggle: https://www.kaggle.com/models/google/gemma ; for a full list of ways to interact with the model, you can check out https://ai.google.dev/gemma.

dartharva · 2024-02-21T13:56:26 1708523786

Can we have llamafile releases as well?

https://github.com/Mozilla-Ocho/llamafile

syntaxing · 2024-02-21T14:06:03 1708524363

A small typo in your model link that breaks it. There’s an extra ; on the end.

kathleenfromgdm · 2024-02-21T14:08:59 1708524539

Corrected - thanks :)

Kostic · 2024-02-21T13:53:27 1708523607

It should be possible to run it via llama.cpp[0] now.

[0] https://github.com/ggerganov/llama.cpp/pull/5631

nerdix · 2024-02-21T15:55:14 1708530914

Amazing how quickly this happened.

mrob · 2024-02-21T16:49:46 1708534186

Mistral weights are released under an Apache 2.0 license, but Llama 2 weights are released under a proprietary license that prohibits use by large organizations and imposes usage restrictions, violating terms 5 and 6 the Open Source Definition[0]. Even if you accept that a model with a proprietary training dataset and proprietary training code can be considered "open source", there's no way Llama 2 qualifies.

For consistency with existing definitions[1], Llama 2 should be labeled a "weights available" model.

[0] https://en.wikipedia.org/wiki/The_Open_Source_Definition

[1] https://en.wikipedia.org/wiki/Source-available_software

tomp · 2024-02-21T14:18:35 1708525115

Their definition of "open" is "not open", i.e. you're only allowed to use Gemma in "non-harmful" way.

We all know that Google thinks that saying that 1800s English kings were white is "harmful".

hackerlight · 2024-02-21T18:40:47 1708540847

> We all know that Google thinks that saying that 1800s English kings were white is "harmful".

If you know how to make "1800s english kings" show up as white 100% of the time without also making "kings" show up as white 100% of the time, maybe you should apply to Google? Clearly you must have advanced knowledge on how to perfectly remove bias from training distributions if you casually throw stones like this.

trackflak · 2024-02-21T20:20:06 1708546806

Tell me you take this seriously: https://twitter.com/napoleon21st/status/1760116228746805272

It has no problem with other cultures and ethnicities, yet somehow white or Japanese just throws everything off?

I suppose 'bias' is the new word for "basic historic accuracy". I can get curious about other peoples without forcibly promoting them at the expense of my own Western and British people and culture. This 'anti bias' keyword injection is a laughably bad, in your face solution to a non-issue.

I lament the day 'anti-bias' AI this terrible is used to make real world decisions. At least we now know we can't trust such a model because it has already been so evidently crippled by its makers.

wantsanagent · 2024-02-21T15:55:55 1708530955

Not sure why you're getting downvoted. I would have thought HN of all places would recognize the power and value of OSI licensing and the danger of the proliferation of these source available but definitely not Open Source licenses.

neximo64 · 2024-02-21T13:22:48 1708521768

How are these performing so well compared to Llama 2, are there any documents on the architecture and differences, is it MoE?

Also note some of the links on the blog post don't work, e.g debugging tool.

kathleenfromgdm · 2024-02-21T13:52:19 1708523539

We've documented the architecture (including key differences) in our technical report here (https://goo.gle/GemmaReport), and you can see the architecture implementation in our Git Repo (https://github.com/google-deepmind/gemma).

declaredapple · 2024-02-21T13:21:14 1708521674

Congrats on the launch and thanks for the contribution! This looks like it's on-par or better compared to mistral 7B 0.1 or is that 0.2?

Are there plans for MoE or 70B models?

kathleenfromgdm · 2024-02-21T14:14:26 1708524866

Great question - we compare to the Mistral 7B 0.1 pretrained models (since there were no pretrained checkpoint updates in 0.2) and the Mistral 7B 0.2 instruction-tuned models in the technical report here: https://goo.gle/GemmaReport

audessuscest · 2024-02-21T14:25:12 1708525512

Does this model also thinks german were black 200 years ago ? Or is afraid to answer basic stuff ? because if this is the case no one will care about that model.

graphe · 2024-02-21T16:59:40 1708534780

I disagree, coding and RAG performance is all that matters to me. I'm not using an LLM to learn basic facts I already know.

audessuscest · 2024-02-21T17:56:12 1708538172

we're at basic knowledge level, if your RAG imply some of it, you can get bad result too. Anyway, would you use a model who makes this nonsense response or one that doesn't? I know which one I will prefer for sure...

graphe · 2024-02-21T18:33:42 1708540422

If this was better at specific RAG or coding performance I would absolutely, certainly without a doubt use it over a general instruct model in those instances.

audessuscest · 2024-02-22T09:05:47 1708592747

People getting so used to being manipulated and lied to that they don't even bother anymore is a huge part of the problem. But sure, do what suits you the best.

TheHypnotist · 2024-02-21T17:32:07 1708536727

How do you ragebait for premium pearl clutching?

freedomben · 2024-02-21T16:54:05 1708534445

I don't know anything about these twitter accounts so I don't know how credible they are, but here are some examples for your downvoters that I'm guessing just think you're just trolling or grossly exaggerating:

https://twitter.com/aginnt/status/1760159436323123632

https://twitter.com/Black_Pilled/status/1760198299443966382

robswc · 2024-02-21T17:25:12 1708536312

Yea. Just ask it anything about historical people/cultures and it will seemingly lobotomize itself.

I asked it about early Japan and it talked about how European women used Katanas and how Native Americans rode across the grassy plains carrying traditional Japanese weapons. Pure made up nonsense that not even primitive models would get wrong. Not sure what they did to it. I asked it why it assumed Native Americans were in Japan in the 1100s and it said:

> I assumed [...] various ethnicities, including Indigenous American, due to the diversity present in Japan throughout history. However, this overlooked [...] I focused on providing diverse representations without adequately considering the specific historical context.

How am I supposed to take this seriously? Especially on topics I'm unfamiliar with?

trackflak · 2024-02-21T20:12:32 1708546352

From one of the Twitter threads linked above:

> they insert random keyword in the prompts randomly to counter bias, that got revealed with something else I think. Had T shirts written with "diverse" on it as artifact

This was exposed as being the case with OpenAI's DALL-E as well - someone had typed a prompt of "Homer Simpson wearing a namebadge" and it generated an image of Homer with brown skin wearing a namebadge that said 'ethnically ambiguous'.

This is ludicrous - if they are fiddling with your prompt in this way, it will only stoke more frustration and resentment - achieving the opposite of why this has been implemented. Surely if we want diversity we will ask for it, but sometimes you don't, and that should be at the user's discretion.\

Another thread for context: https://twitter.com/napoleon21st/status/1760116228746805272

zitterbewegung · 2024-02-21T13:21:16 1708521676

Do you have a plan of releasing higher parameter models?

alekandreev · 2024-02-21T14:04:54 1708524294

We have many great things in research and development phases, so stay tuned. I’m hopeful we can share more in the coming weeks and month!

brucethemoose2 · 2024-02-21T16:31:39 1708533099

That is awesome!

I hope y'all consider longer context models as well.

Also, are ya'll looking alternative architectures like Mamba? Being "first" with a large Mamba model would cement your architectural choices/framework support like llama did for Meta.

efilife · 2024-02-22T03:56:04 1708574164

This doesn't answer the question at all

memossy · 2024-02-21T15:16:55 1708528615

Training on 4096 v5es how did you handle crazy batch size :o

tosh · 2024-02-21T13:49:27 1708523367

Are there any plans for releasing the datasets used?

alekandreev · 2024-02-21T14:17:53 1708525073

This would be really interesting in my opinion, but we are not releasing datasets at this time. See the C4 dataset for an earlier open dataset from Google.

CuriouslyC · 2024-02-21T14:19:15 1708525155

It's cool that you guys are able to release open stuff, that must be a nice change from the modus operandi at goog. I'll have to double check but it looks like phi-2 beats your performance in some cases while being smaller, I'm guessing the value proposition of these models is being small and good while also having more knowledge baked in?

alekandreev · 2024-02-21T23:38:43 1708558723

We deeply respect the Phi team and all other teams in the open model space. You’ll find that different models have different strengths and not all can be quantified with existing public evals. Take them for a spin and see what works for you.

owl_brawl · 2024-02-21T19:42:07 1708544527

Hi alekandreev,

Any reason you decided to go with a token vocabulary size of 256k? Smaller vocab/vector sizes like most models in this size seem to be using (~16-32k) are much easier to work with. Would love to understand the technical reasoning here that isn't detailed in the report unfortunately :(.

moffkalast · 2024-02-21T15:56:48 1708531008

I'm not sure if this was mentioned in the paper somewhere, but how much does the super large 265k tokenizer vocabulary influence inference speed and how much higher is the average text compression compared to llama's usual 30k? In short, is it really worth going beyond GPT 4's 100k?

nuclearjam · 2024-02-23T09:56:48 1708682208

May I ask what is the ram requirement for running the 2B model on CPU on an average consumer windows laptop? I have 16 gb RAM but I am seeing CPU/memory traceback. I’m using the transformer implementation.

dmnsl · 2024-02-21T16:07:24 1708531644

Hi, what is the cutoff date ?

alekandreev · 2024-02-21T23:39:03 1708558743

September 2023.

legohead · 2024-02-21T17:20:23 1708536023

All it will tell me is mid-2018.

jmorgan · 2024-02-21T19:01:46 1708542106

Hi! This is such an exciting release. Congratulations!

I work on Ollama and used the provided GGUF files to quantize the model. As mentioned by a few people here, the 4-bit integer quantized models (which Ollama defaults to) seem to have strange output with non-existent words and funny use of whitespace.

Do you have a link /reference as to how the models were converted to GGUF format? And is it expected that quantizing the models might cause this issue?

Thanks so much!

espadrine · 2024-02-21T20:09:49 1708546189

As a data point, using the Huggingface Transformers 4-bit quantization yields reasonable results: https://twitter.com/espadrine/status/1760355758309298421

kleiba · 2024-02-21T19:35:12 1708544112

> We are really excited to answer any questions you may have about our models.

I cannot count how many times I've seen similar posts on HN, followed by tens of questions from other users, three of which actually get answered by the OP. This one seems to be no exception so far.

alekandreev · 2024-02-21T23:45:45 1708559145

Sorry, doing our best here :)

kleiba · 2024-02-24T13:56:05 1708782965

Thank you!

spankalee · 2024-02-21T19:49:49 1708544989

What are you talking about? The team is in this thread answering questions.

AlexeyBelov · 2024-02-23T08:36:13 1708677373

Only simple and convenient ones.

vorticalbox · 2024-02-21T13:51:03 1708523463

are there plans to release an official GGUF version to use with llama.ccp?

espadrine · 2024-02-21T13:57:41 1708523861

It is already part of the release on Huggingface: https://huggingface.co/google/gemma-7b/blob/main/gemma-7b.gg...

It is a pretty clean release! I had some 500 issues with Kaggle validating my license approval, so you might too, but after a few attempts I could access the model.

vorticalbox · 2024-02-21T17:51:04 1708537864

I didn't see this when searching thanks

quickgist · 2024-02-21T15:37:52 1708529872

Will this be available as a Vertex AI foundational model like Gemini 1.0, without deploying a custom endpoint? Any info on pricing? (Also, when will Gemini 1.5 be available on Vertex?)

turnsout · 2024-02-21T14:19:57 1708525197

What is the license? I couldn’t find it on the 1P site or Kaggle.

trisfromgoogle · 2024-02-21T14:22:24 1708525344

You can find the terms on our website, ai.google.dev/gemma:

https://ai.google.dev/gemma/terms

spiantino · 2024-02-21T17:49:14 1708537754

out of curiosity, why is this a "terms" and not a license? I'm used to reading and understanding the software as coming with a license to use it. Do the terms give us license to use this explicitly?

turnsout · 2024-02-21T18:20:12 1708539612

They do, but unlike a known license, these terms are custom and non-standard. Which means I would guide my commercial clients away from this particular model.

sqreept · 2024-02-21T14:05:13 1708524313

What are the supported languages of these models?

alekandreev · 2024-02-21T14:12:50 1708524770

This v1 model is focused on English support, but you may find some multilingual capabilities.

cypress66 · 2024-02-21T16:52:17 1708534337

Can you share the training loss curve?

brucethemoose2 · 2024-02-21T14:33:13 1708525993

Will there be "extended context" releases like 01.ai did for Yi?

Also, is the model GQA?

hustwindmaple1 · 2024-02-21T15:27:40 1708529260

It's MQA, documented in the tech report

artninja1988 · 2024-02-21T13:34:56 1708522496

I find the snyde remarks around open source in the paper and announcement rather off putting.

As the ecosystem evolves, we urge the corporate AI community to move beyond demanding to be taken seriously as a player in open source for models that are not actually open, and avoid preaching with a PR statement that can be interpreted as uniformed at best or malicious at worst.

trisfromgoogle · 2024-02-21T14:01:29 1708524089

It would be great to understand what you mean by this -- we have a deep love for open source and the open developer ecosystem. Our open source team also released a blog today describing the rationale and approach for open models and continuing AI releases in the open ecosystem:

https://opensource.googleblog.com/2024/02/building-open-mode...

Thoughts and feedback welcome, as always.

mrob · 2024-02-21T16:56:08 1708534568

If you truly love Open Source, you should update the the language you use to describe your models so it doesn't mislead people into thinking it has something to do with Open Source.

Despite being called "Open", the Gemma weights are released under a license that is incompatible with the Open Source Definition. It has more in common with Source-Available Software, and as such it should be called a "Weights-Available Model".

surajrmal · 2024-02-22T15:57:46 1708617466

Open source is not defined as strictly as what you are suggesting it is. If you wish to have a stricter definition, a new term should probably be used. I believe I've heard it referred to as libre software in the past