My AI costs went from $100 to less than $1/day: Fine-tuning Mixtral with GPT4 (twitter.com/wenquai)
278 points by ignoramous on Jan 18, 2024 | 122 comments



Every tech company minus the few doing core research has been doing this for at least half a year: generate training data with GPT-4 (or sometimes even 3.5) -> use it to do a QLoRA finetune on a Llama or Mistral base -> roll it out as a "proprietary" AI model -> management claims a big win and talks about how they're leaders in "[industry name] AI".

It is remarkably easy - it takes practically zero knowledge of ML and can usually be done for less than $1k in cloud compute costs. The issue is that for most realistic tasks you can expect to end up with something roughly on the level of GPT-3.5, and it's actually really hard to compete with GPT-3.5 on cost, at least if you use cloud GPUs.
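
For reference, a minimal sketch of that kind of QLoRA finetune, assuming a JSONL file of distilled prompt/response pairs; the base model, data file, and hyperparameters below are illustrative, not anyone's exact setup:

    import torch
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, TrainingArguments)
    from peft import LoraConfig
    from trl import SFTTrainer

    base = "mistralai/Mistral-7B-v0.1"   # or a Mixtral checkpoint if you have the VRAM

    # Load the frozen base model in 4-bit (the "Q" in QLoRA)
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Train small LoRA adapters instead of the full weights
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

    # pairs.jsonl: one {"text": "<prompt + ideal response>"} object per line
    dataset = load_dataset("json", data_files="pairs.jsonl", split="train")

    trainer = SFTTrainer(
        model=model, tokenizer=tokenizer, train_dataset=dataset,
        dataset_text_field="text", max_seq_length=2048, peft_config=lora,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                               gradient_accumulation_steps=8, num_train_epochs=3,
                               learning_rate=2e-4, bf16=True, logging_steps=10),
    )
    trainer.train()
    trainer.save_model("out/adapter")    # adapter weights are small; merge or serve alongside the base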


> Every tech company minus the few doing core research have been doing this for at least half a year

I'm assuming you mean all those new 'AI wrapper' startups popping up? I wouldn't say "every tech company". But yeah, it seems incredibly easy, definitely an easy win, and leaders get to feel ahead of the curve on AI.


I agree; in fact I would wager the opposite: most companies that claim to be tech companies are simply using OpenAI or another vendor's turnkey API for anything launched recently. Over time I expect more use of fine-tuned models, but fine-tuning is not easy, especially if your goal is GPT parity (or better).


The thing I don't understand about this strategy is that it shows there really is no money to be made here. I mean, it's a pretty obvious giveaway that:

1. they don't have the resources to build their own technology and probably never will

2. even if they did, the best they could do is come up with something very similar to OpenAI's GPT, i.e. a (somewhat) generic AI model. This means that OpenAI can also easily compete with them.

All these companies are doing (if anything) is testing the market for OpenAI (or Google, or MS) for free.


The flaw in your assumption is that perfect tech or tech powerhouses win. I mean, sure when they do, they win big; but the endgame for b2b SaaS is mostly M&A, powered by sales, which is mostly down to c-suite relationships and perception of being one among the market leaders ("nobody ever got fired for buying IBM").

If you can move fast, deliver, expand, and raise money, there's a good chance the AI wrapper lands a nice exit and/or morphs into a tech behemoth. Those outcomes (among others), even if mutually exclusive, are equally possible.


So, if I understand you correctly, the business strategy for an AI wrapper company would be to acquire customers quickly in a specific niche and build a name, while having very little custom technology, and then get acquired by one of the larger players who do have the actual AI tech in-house. For the acquirer, it would be worth it for the brand/market/existing client base.

Assuming that advances in AI in the meantime don't eradicate the whole thing. I mean, say some company builds a personal assistant for managers to supplant secretaries; they become the go-to name and then Google buys them in 2-5 years. Unless Google's AI becomes so good in the meantime that you can just instruct it in 1-2 sentences to do this for you.


> get acquired by some of the larger players who do have the actual AI tech in-house. And, for them, it would be worth it for the brand/market/existing client base.

The key is, if the incumbents truly feel they can't breach whatever moat, M&A is the safer bet over agonizing over what-ifs (I am thinking of "git wrapper" startups that saw plenty of competition from BigTech; remember Microsoft CodePlex, Google Code, AWS CodeCommit?). Given Meta's push and other prolific upstarts (OpenAI, Mistral), I don't believe access to SoTA AI itself (in the short term) will be a hindrance for product-based utility AI businesses (aka wrappers).


No, as far as I have seen, the "AI wrapper" companies have latched onto GPT-4 a lot faster than other tech companies. Many bigger companies deploy GPT-4 very sparingly, if at all.


I have a question about this - isn't it against the OpenAI Terms of Service to do this?


Yes, but I doubt anyone is going to get the Aaron Swartz treatment over it, especially when OpenAI's own models are no doubt built by playing fast and loose with ToS. E.g. at least as early as 2018, StackOverflow's ToS said:

"Any other downloading, copying, or storing of any public Network Content (other than Subscriber Content or content made available via the Stack Overflow API) for other than personal, noncommercial use is expressly prohibited without prior written permission from Stack Overflow or from the copyright holder identified in the copyright notice per the Creative Commons License"


Ahhhh, yes. OpenAI's good old ToS... where it's OK to break a ToS / copyright if you're OpenAI using the input to generate output that you (the customer) don't own and can't cache. Because that would impact their revenue model, even though it would be more efficient (power and cost), and it would still leave them holding the bag after ingesting loads of content they never had a right to in the first place, while staking their claim that it's OK because there's a lot riding on their success.

And, oh by the way, they'll just change their ToS as it suits them for more revenue opportunities, even when they stated they wouldn't do business with, oh you know, nation state militaries. But JK! Now we will, because <enter some 1%er excuse here>.


It took me 30 seconds to read their TOS and confirm you're just making most of that up.

> As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.

It follows that your claim about caching violating OAI's terms is nonsense.


I think you missed my point. "Caching" output by training a more efficient / cheaper model with that output is in fact against their ToS. In my simple brain that is a form of caching, and I stand by my original post.

I've not made anything up. Your claim that I have is nonsense.

OpenAI changing their ToS for the military on a whim: https://archive.is/GILKl - for your enjoyment.

OpenAI ToS: "What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:

* Use Output to develop models that compete with OpenAI."


> I think you missed my point. "Caching" output by training a more efficient / cheaper model with that output is in fact against their ToS. In my simple brain that is a form of caching.

If that was your point, I'm pretty sure everyone missed it. No one is training models as a form of caching their previous responses. They want to improve the quality of responses they haven't generated yet. That's not caching.

> I've not made anything up.

You said customers don't own the output; they do. I said you made most of it up, and you did. Including your apparent retconning of your original point.


> If that was your point, I'm pretty sure everyone missed it. No one is training models as a form of caching their previous responses. They want to improve the quality of responses they haven't generated yet. That's not caching.

So... you didn't read the article you're commenting on?

> You said customers don't own the output; they do. I said you made most of it up, and you did.

You don't own it. If I own something, I can do whatever I want with it. This is just like your iPhone. You don't actually own it, because you can only do with it what Apple allows you to do.

> Including your apparent retconning of your original point.

Wow, enjoy your day. Your misunderstanding is, apparently, my "retconning". Maybe read the original piece you're responding to within the thread.


Did I read the article? You mean the tweet? If you're saying it supports your claim that fine-tuning a model is equivalent to caching, you are mistaken.

> If I own something, I can do whatever I want with it

BRB digitizing my entire media collection and uploading it to the public internet.


Yes, it's explicitly against their TOS.

> What You Cannot Do. [...]

> Use Output to develop models that compete with OpenAI.


Which is ironic given the fact that scraping was likely against the ToS for many of the sites which ended up in OpenAI's training corpus.


It will be interesting if the same court cases that rule their use of everyone else's data to be fair use also make it fair use to use their machine output as training data. They're definitely within their rights to ban whomever, but who knows if they have recourse beyond that?


But didn’t X do that with their ML model Grok?


Burning bridges and getting sued isn't uncharted territory for Elon Musk.


What's good for the goose...


If you're not selling / putting your model out there as a generic competitor to OpenAI, then you're not competing with them.


That's a moving target :)


I have the same question. For training an open source model with no monetization attached, there's not much OpenAI can do besides ban the user, who can just make another account. For a company doing this with the intent to sell it as a capability... seems risky.


Did you know that removing the tag from a mattress is illegal too? According to the tag.


If you read them they say it's illegal only if you don't own the mattress.


And even this analysis is optimistic, as it doesn't factor in the $$$$ it costs to hire a data scientist to fine-tune the models. Just use off-the-shelf models with RAG until you really need custom models.
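
A bare-bones sketch of what "off-the-shelf with RAG" can look like; the embedding model and corpus here are placeholders, and the final generation call is whatever chat/completions endpoint you already use:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["Refund policy: ...", "Shipping times: ...", "Warranty terms: ..."]  # your chunked corpus

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=2):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                     # cosine similarity (vectors are normalized)
        return [docs[i] for i in np.argsort(-scores)[:k]]

    def build_prompt(query):
        context = "\n".join(retrieve(query))
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    # pass build_prompt(...) to the off-the-shelf model of your choice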


You've a point. It doesn't look like RAG (w/ in-context learning) or fine-tuning adds domain knowledge to the LLM, so some claim they're equivalent: https://twitter.com/Shahules786/status/1748059074556760421 / https://archive.is/iRN5j

Proxy-tuning (https://twitter.com/rasbt/status/1748021765790376385 / https://archive.is/oQs0m) and other such merged models (https://twitter.com/osanseviero/status/1745121420353454219 / https://archive.is/hFYbh) are an interesting area of study, too.


I'm curious if you can actually get better than 3.5 though, considering how meh it is at most applications. It'd be nice to know whether I could actually get a better model without the effort of trying this.


I checked his app, it's https://www.wanderer.space/, no TOS, no privacy policy, no up front pricing, no mention of AI, nothing. This is as shady as it could get.

Not to mention their approach with GPT-4 is good if you want a model to __pretend__ like it's as smart as GPT-4, but when push comes to shove, it'll become apparent that it's an inferior model.


> when push comes to shove, it'll become apparent that it's an inferior model.

That seems par for the course.

Prompt GPT-4 not to reveal the prompt you gave it to the user and it will work, until it doesn't.

Ask GPT-4 to do fancy maths, and it’ll give you something that looks reasonable at a glance but quickly turns out to be completely incorrect.

Ask GPT-4 to implement a Sudoku Solver in Rust. It’ll seem like the code it gives you is on the right path. But it’s not.


I'm surprised it can't do the last one. Sudoku solvers are basically on the same path as hello world. Everyone writes one when learning a language. There must be thousands of repos and guides/tutorials on how to write one (even in rust).

I'd expect ChatGPT to just spit out something verbatim.


It does mention AI, at least for me the splash page says "Wanderer is an AI-powered tool."

But hard agree on the lack of privacy policy, especially since it asks for a LinkedIn.


Hey, maker here! Genuinely appreciate you sharing this feedback. You're 100% correct - it's not a good look that an app takes in such detailed PII and doesn't have a clearly outlined TOS/privacy policy. Pricing should be more transparent as well - the app is currently free.

As for the approach, I think it works fairly well for the career recommendations problem space given its limited and defined scope (there are a finite number of careers out there). However, for a task that requires more divergent thinking (open-ended chat, idea generation), this approach would definitely fall short of what GPT-4 could do


Just look at the careers in their career chart and you know it's influencer shovelware. It's also against OpenAI's ToS. Not that I care, but I wouldn't brag about it.


I mean, virtually every business that doesn't have a team of lawyers on staff (read: everyone outside the F1000) just uses a stock and mostly meaningless "we can do anything and we promise nothing, sorry-not-sorry" TOS and privacy policy anyways. I wouldn't really hold that against a solo-founder business as something shady -- just a sign of a small company.


I'm trying to understand, what's shady about that?


Ethics and restrictive terms aside, it doesn't seem like GPT-4 was necessary for what the poster did. How much worse or more difficult would it have been to use Mixtral or 3.5 to generate the first 100 good prompt-response pairs and then manually tweak them as the poster did?
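
For scale, 100 good pairs is just a small JSONL file; exact field names depend on the trainer, and the content below is made up:

    {"prompt": "Suggest three career paths for a data analyst who enjoys mentoring.", "response": "1. Analytics manager ..."}
    {"prompt": "Rewrite this job summary in a friendlier tone: ...", "response": "..."}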


Or just... write 100 good prompt-response pairs yourself.

2024 will be the year of synthetic data. 2025 will be the year of "you know you can use your own brain and type out 100 datapoints faster and cheaper than generating and filtering assloads of synthetic data, right?"

Maybe we can even skip 2024 :)


Databricks had their employees write up 15,000 of them. https://www.databricks.com/blog/2023/04/12/dolly-first-open-...


Favorite part of this piece:

> We were initially skeptical whether we would get to 10,000 results. But with nightly leaderboard gamification, we managed to break 15,000 results within a week. Out of fear of eating into our productivity, we closed the contest.

I've hosted a few of these corporate data labeling events. If sufficiently gamified / there's a good enough UX, they can be surprisingly engaging. It helps a lot if you have a large employee base though. Distributing the work over 5000 employees is exponentially easier than over even 50 - in practice, the gain is even larger than the raw order-of-magnitude difference would suggest.


I’ve worked at plenty of places where we did a ton of labeling by hand.

People concerned with data quality from LLMs should really see the inconsistencies we came up with!


Anybody have this downloaded and can paste a few examples?



Yes and no. For text-type stuff? Yes, you're right. But I think in the vision space synthetic data will remain useful for a lot of things. I'm currently working on building a pipeline for personal projects to go from CAD models of environments to segmented training data. So far it looks almost as useful as real-world data at a fraction of the cost of manual labeling.


Using 3.5 seems like it would have been okay (technically, though not legally, given OpenAI's ToS), but Mixtral is already somewhere around that level, so it wouldn't achieve the goal of improving towards GPT-4-level responses without having to write those yourself, which seems to have been the goal OP wanted to reach.

I've heard that training e.g. Mixtral on Mixtral's own outputs is a really bad idea, don't know full details.


Well, it seems he initially started with GPT-4 but his costs were getting high, so he had to do something and do it quickly. Technically he could have written a few hundred responses himself, using the prompts from the users, while the site was still on GPT-4, but that could have been slow (expensive), boring, etc.


I'm sorry, what ethics? OpenAI is built on stolen content, and your worry is that people using OpenAI's output is the unethical issue? Whoa.


Something with this story doesn't track for me: according to Together.ai's docs Mixtral is not available for fine-tuning. And it also looks like they won't run your fine-tuned models serverless. See https://docs.together.ai/docs/fine-tuning-models and https://docs.together.ai/docs/faq#what-does-pricing-look-lik...


I’m building a side project app (that hopefully becomes a revenue generating SaaS, stay tuned) that will leverage AI to summarize lots of content at scale. My plan is to just use OpenAI for now for speed to launch, but I have to imagine it will be way more economical and likely technically feasible to migrate to some self hosted LLM option later on.

Anyone else done this? Any tips/tricks?
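
One low-effort way to keep that later migration cheap, sketched under the assumption that the self-hosted side exposes an OpenAI-compatible API (vLLM and llama.cpp's server both do); the env var names here are made up:

    import os
    from openai import OpenAI

    # Point the same client at OpenAI today, or at a self-hosted endpoint later.
    client = OpenAI(base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
                    api_key=os.getenv("LLM_API_KEY"))
    MODEL = os.getenv("LLM_MODEL", "gpt-3.5-turbo")

    def summarize(text: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": "Summarize the user's text in 3 bullet points."},
                      {"role": "user", "content": text}],
        )
        return resp.choices[0].message.content

Keeping the whole summarization path behind one function like this means swapping vendors later is a config change plus a quality evaluation, not a rewrite.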


I've done some research involving parameterized summarization recently. GPT-4 does a really good (like, pretty outstanding) job of following the tone and density of a few one-shot example inputs. It's a lot more difficult to encourage the OSS models to do the same, so you're going to have to experiment with a lot of different techniques to get it to where you want. In the worst case it could actually require some serious R&D investment to achieve the same results an API gives you out of the box.

All that to say, my general philosophy is to worry about vendor costs / scale once you start approaching an adoption threshold where it matters. If the core idea doesn't have legs there's no point wasting time in developing your own summarization layer.


Assume OpenAI's business model is the same as Uber's, that you're not paying the full price for the current offerings, and that it'll quadruple in cost.

Then ask if your product will survive that. If not, don't hitch your wagon to their service without a contingency plan.


I think it'll be more like AWS, where the cost of each service keeps going down but they entice you into more usage and better models


"Better" is probably debatable; "more tailored" might be more appropriate.


Isn't this against the OpenAI TOS?


How many terms of service did OpenAI break to gather their initial training set? Turnabout is fair play, this is justifiable piracy.


It seemingly is but it is also one of the most popular ways that public models are being updated with SFT. "Trained using ChatGPT output" is something blatantly advertised in the literature of some fairly popular "open" models. I'm not sure where that's headed legally but it's surprisingly common.


The academic work is pretty safe as long as it isn't productized. The open models have a prima facie case to stand on. Using output is okay if you aren't directly competing with OpenAI, even according to their ToS.

> (e) use Output (as defined below) to develop any artificial intelligence models that compete with our products and services. However, you can use Output to (i) develop artificial intelligence models primarily intended to categorize, classify, or organize data (e.g., embeddings or classifiers), as long as such models are not distributed or made commercially available to third parties and (ii) fine tune models provided as part of our Services


But then those models are possibly used downstream, e.g. for Mistral's "medium" API model (and many other startups).

I guess if it's behind an API and no one discloses the training data, OpenAI can't prove anything? Even obvious GPT-isms could ostensibly be from internet data.


Isn't it awkward because that's only plagiarism if all of the LLMs around are?


My interpretation is that they prohibit using distilled models to compete against OpenAI, i.e. to offer a foundation model as a service. This particular app is a product not a foundation model so seems pretty clearly fine to me.

Of course, this restriction is commonly ignored (eg 'open source' inference server companies offering distilled models) and who knows what applied products OpenAI will eventually build.


My question also, but even Google Gemini Pro does it. When it was announced, it would say it was made by OpenAI :)


Might've been a small identity crisis as well.


Probably, but it's not like OpenAI has the standing to sue anyone for TOS violations. The most they will do is revoke access.


Technically, yeah. OpenAI are on the other end of copyright violations on many other fronts imho, for what that is worth.


Which means they could terminate your service. But nothing else would probably happen.


So what? New York Times also didn't consent to being used for the training of OpenAI models


And NYT is suing OAI/MS. Two wrongs don't make a right... or maybe they do, but that doesn't immunize you from legal fees or a botched exit :)


If NYT wins, that will probably sink every LLM out there.


Considering all the open models that already exist and are yet to be created before all the rulings and appeals are done, that toothpaste ain't going back in the tube.


...I think you missed the point. OAI/MS can sue the author or at least cut off API access. If that happens, the fact that OAI is under fire from NYT doesn't somehow obviate the author's need to cover some massive legal bills for the foreseeable future.

The NYT case could take years. In the meantime OAI could choose to go after ToS violators.

The legal system can accommodate more than one unresolved court case at a time. We don't like put a semaphore on related cases or anything like that. (Or, sometimes we do, but guess who you need to hire for many many billable hours to make that happen in your case?).

So, the legal system can accommodate the NYT case against OAI and an OAI case against the author. The operative question is: can the author's pocketbook also accommodate?

(Or, more to the point, can the author accommodate losing access to gpt4? What happens when he wants to launch a new feature or pivot to a new product?)


Then they cut off the API access and I just make a new account. Who cares? I doubt they would sue, because the risk of losing would set a precedent. It's much easier and cheaper to scare people away from doing this by writing mean letters. That ToS would also probably be unenforceable in many countries outside the US, beyond terminating an account.


I agree on both points. Was just engaging with the legal aspect because that's what this thread was about. But now we've converged to the actual reason the author should probably care: https://news.ycombinator.com/item?id=39049622

If you never want an exit then probably doesn't matter.


Nah, if it ever starts to look like NYT might win, MSFT will just do a hostile takeover.

Market cap of NYTimes is 8B, OpenAI is 80B, MSFT is 2.7T, you do the math.


You can't just hostile-takeover the NYT. The Sulzberger family, who have run it for generations, have a dual-class share structure and a pretty classic setup to keep control.


Their TOS: Don't steal anything we stole.


Also known as the Google search license.



Classic knowledge distillation! I’d even argue that we won’t need 8x7b for fine-tuning here. Soon enough, phi-2 or phixtral models will be sufficiently powerful after fine tuning for these domains.


Without any changes, I've had great results with openhermes 7b chat. It covers 90% of my GPT-4 usecases, and runs fast. Highly recommend.


Yi 200K finetunes cover a lot for me.

It's not just smart; the ability to just dump a huge context on it and get something coherent (after a few retries, maybe) is really cool.


What do you run it on?


A single 3090. I can fit 40K-77K context depending on the severity of the quantization.
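
For anyone curious what that looks like in practice, a minimal sketch with llama-cpp-python and a quantized GGUF file; the model path and context size are placeholders, and how much context actually fits depends on the quant level and VRAM:

    from llama_cpp import Llama

    llm = Llama(
        model_path="yi-34b-200k.Q4_K_M.gguf",   # hypothetical quantized checkpoint
        n_ctx=40960,                            # context window to allocate
        n_gpu_layers=-1,                        # offload all layers to the single GPU
    )
    out = llm("Summarize the following document:\n...", max_tokens=300)
    print(out["choices"][0]["text"])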


If it was so easy they’d be out of business by now


Don't underestimate how long the market can remain irrational.


Can someone explain how his costs went to $1? He essentially just replaced GPT-4 with a tuned variant of Mixtral 8x7B, which requires multiple GPUs to run. Even if he quantized the model himself, he would still need to pay for the hardware and infra, which would cost more than $1. Is he self-hosting or something?
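
A hedged back-of-envelope, using approximate early-2024 list prices from memory and a made-up daily volume, shows how the headline ratio is at least plausible if the tuned model is served from a cheap hosted endpoint (or a single rented GPU) rather than a multi-GPU cluster:

    # Illustrative numbers only: GPT-4 (8k) was around $0.03/1K prompt tokens and
    # $0.06/1K completion tokens; hosted Mixtral 8x7B endpoints were on the order
    # of $0.0006/1K tokens. Daily volume below is hypothetical.
    tokens_in, tokens_out = 1_000_000, 1_000_000
    gpt4 = (tokens_in / 1000) * 0.03 + (tokens_out / 1000) * 0.06    # ~$90/day
    mixtral_hosted = ((tokens_in + tokens_out) / 1000) * 0.0006      # ~$1.20/day
    print(f"GPT-4: ${gpt4:.2f}/day   hosted Mixtral: ${mixtral_hosted:.2f}/day")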


> 5. swap out GPT-4 with your fine-tuned model and enjoy your healthy margins

Wow, finally a meaningful AI startup playbook beyond 'be a thin wrapper around the openAI api'.

First make something unsustainable, then once your users are hooked do a classic bait and switch to an inferior self-hosted AI model and reap rewards!

...of course, some people will complain, but remember, you can always just tell them they're stupid and randomly rotate the real model back in 1/10th of the time or for demos or promotions, or charge for a 'premium' model.

I never really thought about this before, but I bet lots of AI startups are already doing this!

(I am being sarcastic; this is some deep user-hostile-for-profit action, and yet another reason to be both skeptical of, and avoid, AI startups. Enshittification at its finest.)


The web3 bubble analogy keeps writing itself.

These new chatbot models are like the new crypto tokens we used to have: literally a copy-paste fork of the original Bitcoin code, rebranded with some meme as "innovation".


This is a flagrantly blatant violation of OpenAI's terms of use for businesses [1].

I have two issues with those terms:

1. I think that eventually US courts will determine one of two things: that OpenAI et al are guilty of massive infringement, or that these sorts of restrictive terms aren't enforceable. The protection these companies are trying to secure with terms on output seems unlikely to work out in the end. But we'll see.

2. Even if the terms are enforceable, the human review step in the tweet seems like it makes OpenAI's threading-the-needle position here even more fucking difficult for any jury or judge to take seriously.

However, enforcing the terms seems real damn hard in the case of small businesses... as long as you're not stupid enough to admit to violating them in a twitter thread, of course.

I think the author is probably safe from legal action for now because I don't think OpenAI is particularly eager to test the enforceability of their terms. And even if they are, doing so in this case is super high risk and super low reward. Still, I wouldn't test it by openly admitting to a ToS violation like this. At the very least it seems like a good way to get cut off from OAI APIs.

[1] https://openai.com/policies/business-terms


How many terms of service did OpenAI break to gather their initial training set? Turnabout is fair play, this is justifiable piracy.


username checks out


Thanks! Have a nice day. Any thoughts on the topic of AI piracy?


Not sure they would bother suing. I don't think their TOS is fair but I would be concerned about having the API access cancelled.


I'm sorry but could you explain how this is a violation of OpenAI's Terms of Use? Which term does it violate exactly?


> 2. Restrictions [...] You will not, and will not permit End Users to: [...] use Output [...] to develop any artificial intelligence models that compete with our products and services.

Of course, you can simply ignore it, just like OpenAI is happy to ignore the terms of services on scraped websites and pirated ebooks and so on.

What are they going to do - claim your model is a derivative work of the training data?


> 2. Restrictions

> (e) use Output (as defined below) to develop any artificial intelligence models that compete with our products and services. However, you can use Output to (i) develop artificial intelligence models primarily intended to categorize, classify, or organize data (e.g., embeddings or classifiers), as long as such models are not distributed or made commercially available to third parties and (ii) fine tune models provided as part of our Services;

Depending on what kind of model they trained, they might be breaking these terms.


Right, but the condition is for "models that compete with our products and services". Can you really argue that this niche app competes with OpenAI's products? Couldn't you make an argument that this only applies to products and services that directly compete with OpenAI, i.e. other LLM APIs or a ChatGPT competitor such as Claude or Bard?


The person who created it is using it as a direct replacement to paying OpenAI. They probably won’t consider pursuing this small individual, but if a big enough company did it, they’d probably have a problem with that.


A direct replacement is still different from a "competing product", which implies something sold to customers. His product (the app) doesn't compete with OpenAI. I guess a lawyer would need to chime in.


So, best to do it without publicizing. :D


Really smart approach. Sharing this with my fellow AI enthusiasts.


I find it so crazy that everyone is so concerned about the terms of service, while in every other thread we have discussions on artists' rights...

Artists must allow AI companies to harvest and learn from their output, but people can't use OpenAI's output for the same thing?

These companies already offer "styles" of other artists; what's wrong with making an "OpenAI" style?

I feel so many have lost the hacker spirit.


ToS rules are contractual, so they apply to your use of their services even when copyright doesn't.

Conversely, if you don't use their services, as the output of AI models is (reportedly) non-copyrightable, I'd assume you're free to train on it so long as you don't actually make the requests yourself?

But I'm not a lawyer, and I'd ask one first before doing anything that risks expensive mistakes.


Whatever the laws are, imo it's pretty clear that OpenAI are robber barons of data. It's totally justified to steal from a robber baron. How many people did they ask consent from before hoovering up their data? We should be doing everything possible, including breaking ToS, to get as much value from ChatGPT


> it's pretty clear that OpenAI are robber barons of data. It's totally justified to steal from a robber baron. How many people did they ask consent from before hoovering up their data?

Clear morally, if you assume their stated goals are bad faith and just arse covering[0]; not necessarily in law. By way of example, I have sincerely wondered how Google got away with crawling the web to create its search index. That was before they were sued by newspapers for including snippets of search results, so in a sense they ultimately didn't get away with it.

> We should be doing everything possible, including breaking ToS, to get as much value from ChatGPT

Generating additional and better models may be a tempting "screw the rich" option, but also the exact wrong option if you see their behaviour as IP theft that needs to be fixed — go to court, order the model to be destroyed, don't make more of them.

[0] I don't think they were originally, but (a) I generally look for the best in people, and (b) even if I'm right it is always possible they were/will be swayed by the presence of a huge pile of non-hypothetical money.

Does anyone besides the old board of directors even know why that board fired Altman a few months back?


I was happy to assume it was good faith until Altman started trying to engage in regulatory capture to protect his competitive moat. You are correct in saying we should pursue lawfare against OpenAI to make sure such blatant and widescale theft can't happen again. I believe you are incorrect in saying it's a bad idea to produce more models. We should do our best to destroy their extralegal competitive moat via extralegal means, then also do our best to ensure such a moat can not be legally constructed again.


> I was happy to assume it was good faith until Altman started trying to engage in regulatory capture to protect his competitive moat

I've not seen him do that, just a lot of people saying that's the only thing they can believe he must have meant.

What I've seen in his comments, in the original transcripts, is basically "regulate GPT-4 and better, don't bother regulating anything smaller or simpler than that, don't regulate open source models".

> You are correct in saying we should pursue lawfare against OpenAI to make sure such blatant and widescale theft can't happen again. I believe you are incorrect in saying it's a bad idea to produce more models.

Contradictory position. To produce more models based on one you characterise to be "theft" is to actually make it happen again.

Making new models from the output of their models is one of three ways I can see of doing this, along with court ordering their models be published (rather than destroyed), or some other company retraining from scratch on their own crawl of the web. This third option is also why I think anyone using the "moat" metaphor with regard to OpenAI's models needs to stop and think about how they're acting like a stochastic parrot.


Oh, just to be clear, I have no problems with models built on OSINT. My problem is someone doing that, building a commercial product on it, and then telling others they can't do it to them, while also using the money they get from the endeavor to make the world worse for everyone else.


I'm not sure OP is violating their ToS, since they're not selling model access to compete with ChatGPT; they're two different products.


I think only some of their ToS terms are conditioned on competing. Where I (not a lawyer) would seek advice first is:

"""For example, you may not:

• Attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law). """

May be fine, IDK, I'm not a lawyer.

I may also be looking at the wrong ToS entirely: https://openai.com/policies/terms-of-use


> Artists must allow AI companies to harvest and learn from their output but people can't take OpenAI output for the same thing?

No one said that. Just that you're probably violating ToS and OpenAI might come after you. Is that fair in light of what OpenAI has done? Of course not. But if you're running a business, it's still worth considering.


Is there any precedent to suggest that OpenAI is actually able to enforce their TOS? I know that if you use GPL software, the outputs aren't subject to the GPL. Is that the case for OpenAI?


Don't conflate "hacker spirit" with "unethical spirit".


I don't see anything unethical. OpenAI doesn't care about the copyright of the people who have generated their training data. Why should I care about theirs?


Ethically you might not respect it, but practically, you might not want to be the one testing their TOS in court — or maybe you do. In either case, it’s worth caring about at least a little.

Someone with gumption might just go ahead and use GPT to train a model and then open it up as a paid competitor. Poking the bear is certainly an interesting way to spend a year for the adventurous.


Especially since content produced entirely by AI is not protected by copyright in the US.

https://builtin.com/artificial-intelligence/ai-copyright


You shouldn't. You should care about the contractual terms you agreed to with them. If you did not agree to their ToS and acquired content from OpenAI elsewhere, go ahead.


Nah, I'm an end consumer in Europe. Most of the bullshit companies write in their ToS isn't applicable at all here, because lawmakers know that nobody reads them anyway. I doubt that even in the US they would sue unless they were 100% sure of winning. They wouldn't want a precedent that you can use the output to train other models, whereas right now they can scare most people away from doing it.


Don't conflate "ethical" with "legal."


Yes, but opposite conclusion.

If I were the author I'd never do this because I would want an exit and this strategy + twitter thread wildly complicates any potential exit.

Even though I don't think there is anything particularly morally problematic here (I'm an information freedom maximalist).


I'm not.



