I love Kagi; I used my free searches and customized the crap out of it. It would absolutely be worth the cost if I weren't penny-pinching because of cost-of-living increases and inflation.
Lately I've been using the FOSS metager.org as my main search engine, and it has decent results, often a lot better than Google's, especially with quoted words and 'site:' specifiers, which it never ignores.
It also has a domain blacklist similar to Kagi's.
This looks nice, and the payment model is more convenient (on demand). I hadn't heard of this before. From first use, it looks like a search aggregator, as it takes results from many engines. Each source engine has a different cost.
> The only Rails projects that I worked on that never had performance problems are the ones that never reached any scale. All Rails projects that gained traction that I worked on, needed serious refactorings, partial rewrites, tuning and tweaking to keep 'em running.
You'll be hard-pressed to find any stack that doesn't require this.
A big problem with Rails, though, is how easy it makes it to "do the bad thing" (and in rare cases, how hard it makes it to do the "good" thing).
A has_many/belongs_to that crosses bounded domains (adding tight coupling) is a mere one-liner: only discipline and experience prevent that. A quick call to the database from within a view is something that not even linters catch; it takes vigilance from reviewers. Reliance on some "external" instance var in a controller (i.e. one set by a module, concern, lib, hook or other method) can be caught by a tightly configured linter, but that too is tough.
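To make that concrete, a minimal sketch (model and domain names are hypothetical):

```ruby
# One line quietly couples the accounts domain to the billing domain;
# nothing in Rails pushes back on it.
class Account < ApplicationRecord
  has_many :invoices   # reaches straight across the bounded context
end
```

The view-layer version is just as innocuous: an `<%= Account.where(overdue: true).count %>` buried in a template fires a query on every render, and no standard linter flags it.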
Code that introduces poor joins, filters or sorts on unindexed columns, N+1 queries and more often looks like a simple, clean setup.
`Organization.top(10).map(&:spending).sum` looks lean and neat, but hides all sorts of gnarly details in ~three~ four different layers of abstraction. The Ruby language, because "spending" might be an attribute or a method; you won't know. Rails, because it overloads things like "sort", "sum" and whatnot to sometimes operate on data (after first actually loading ALL that data) and sometimes on the query, in the database. The database itself, because "spending" might even be a database column, but you won't know without looking at the database model. And finally the app, for how a scope like top(10) is really implemented. For all we know, it might even make 10 HTTP calls.
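For example, assuming `spending` is a plain database column and `top(10)` an ordinary scope (both of which you'd have to go verify), these two lines look interchangeable but behave very differently:

```ruby
Organization.top(10).map(&:spending).sum  # loads 10 full ActiveRecord objects, sums in Ruby
Organization.top(10).sum(:spending)       # pushes a single SUM(...) into the database
```

And whether the second line even produces the SQL you expect depends on the Rails version and on what `top(10)` really does, which is exactly the point.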
Rails (and Ruby) lack quite a few common tools and safety nets that other frameworks do have. And yes, that's a trade-off, because many of these safety nets (like strong, static typing) come at a cost to certain use cases, people or situations.
Edit: I realize there are four layers of abstraction if the all-important and dictating database is taken into account, which in Rails it always is.
It's a bot prevention measure (allegedly). It also conveniently blocks out people who are more likely not to want to share all of their personal information with them. No skin off my back; the less toxic social media garbage I'm allowed to use, the better for my sanity. Good riddance.
Meta/Google/etc... at the end of the day are businesses, and the currency they deal with when transacting with us is our personal information and they convert that to dollars with their partners.
As I also value freedom of association, I am perfectly alright with them refusing to "do business" with me because I do not find the denomination and terms of the transaction acceptable.
I do not really have any inclination to want to compel them to serve me by force of gun, either.
I think both they and myself are better off in this scenario where we just choose to not do business with each other and I find some other business that has more acceptable terms - or I just go without, these services aren't as essential as some people think.
> I fail to see what separates Alma from CentOS Stream at this point
Alma is getting their source packages from CentOS Stream, but they are using the specific versions that are in the Red Hat releases they are targeting. They aren't just rebuilding Stream sources from HEAD.
People seem to be forgetting that "AA" game dev companies are an established "category" in video gaming. I wouldn't have considered the Divinity series, from the same devs, to be indie back when it came out.
Indie to me means a SMALL team (fewer than 20 total) selling something that is primarily a passion project, and not controlled by a larger entity. E.g., Minecraft stopped being indie when it was bought by Microsoft, even though the team didn't get very large.
Censorship doesn't only mean removing content after the fact.
Could also be 'don't do that because 8% of our target market wouldn't like it' during early development or in non public versions. Or silly image stuff like car manufacturers not allowing their licensed cars to show damage in racing games.
The last one is blatantly obvious; the first one you'd never know about, because it was internal.
Without any evidence to suggest otherwise, it's just baseless conspiracy theory to think some "shadow figure" was secretly pulling the strings to make their preferred evil-capitalist-ideal version of BG3 (lol).
BG3 is indie because they financed the game with money they raised themselves and self-published. It's that simple.
"The science" has become religious dogma to many people, the complete anthesis of science. When people say they (follow) "the science" I don't even bother listening to what they have to say anymore.
Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from “stealing” their content and making their own superior LLM that isn't weakened by regulation like the ones in the United States?
And I say this as someone who is extremely bothered by how easily mass amounts of open content can just be vacuumed up into a training set with reckless abandon, and there isn't much you can do other than put everything you create behind some kind of authentication wall. But even then, it's only a matter of time until it leaks anyway.
Pandora's box is really open; we need to figure out how to live in a world with these systems, because it's an unwinnable arms race where only bad actors will benefit from everyone else being neutered by regulation. Especially with the massive pace of open-source innovation in this space.
We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
> Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from “stealing” their content and making their own superior LLM that isn't weakened by regulation like the ones in the United States?
Foreign companies can be barred from selling infringing products in the United States.
Russian and Chinese consumers are less interested in English-language articles.
I can’t really get behind the argument that we need to let LLM companies use any material they want because other countries (with other languages, no less) might not have the same restrictions.
If you want some examples of LLMs held back by regulations, look into some of the examinations of how Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive.
>Chinese LLMs are clearly trained to avoid answering certain topics that their government deems sensitive
But they're not; you can download open-source Chinese base models like Yi and DeepSeek and ask them about Tiananmen Square yourself and see: they don't have any special filtering.
I don't think they're looking to prevent the inevitable; rather, they see a target with a fat wallet from which a lot of money can be extracted. I'm not saying this in a negative way, but much of the "this is outrageous!" reaction to AI hasn't been about the building of models, but rather the realization that a few players are arguably getting very rich on those models, so other people want their piece of the action.
If LLMs actually create added value and don't just burn VC money then they should be able to pay a fair price for the work of people they're relying upon.
If your business is profitable only when you get your raw materials for free it's not a very good business.
By that logic you should have to pay the copyright holder of every library book you ever read, because you could later produce some content you memorised verbatim.
The rules we have now were made in the context of human brains doing the learning from copyrighted material, not machine learning models. The limitations on what most humans can memorize and reproduce verbatim are extraordinarily different from an LLM. I think it only makes sense to re-explore these topics from a legal point of view given we’ve introduced something totally new.
Human brains are still the main legal agents in play. LLMs are just computer programs used by humans.
Suppose I'm researching a book that I'm writing - it doesn't matter whether I type it on a Mac, PC, or typewriter. It doesn't matter if I use the internet or the library. It doesn't matter if I use an AI-powered voice-to-text keyboard or an AI assistant.
If I release a book that has a chapter which was blatantly copied from another book, I might be sued under copyright law. That doesn't mean that we should lock me out of the library, or prevent my tools from working there.
I see two separate issues. The one you describe is maybe slightly more clear-cut: if a person uses an AI trained on copyrighted works as a tool to create and publish their own works, they are responsible if those resulting works infringe.
The other question, which I think is more topical to this lawsuit, is whether the company that trains and publishes the model itself is infringing, given they're making available something that is able to reproduce near-verbatim copyrighted works, even if they themselves have not directly asked the model to reproduce them.
I certainly don't have the answers, but I also don't think that simplistic arguments that the cat is already out of the bag or that AIs are analogous to humans learning from books are especially helpful, so I think it's valid and useful for these kinds of questions to be given careful legal consideration.
You make it seem as if the copyright holder is making more money on a library book than on one sold in retail, which does not appear to be the case in the US.
The library pays for the books and the copyright holder gets paid. This is no different from buying a book retail, which you can read and share with family and friends after reading, or sell it, where it can be read again and sold again. The book is the product, not a license for one person to access the book.
What do you actually believe, with that statement? Do you believe libraries are operating illegally? That they aren't paying rightsholders?
Also: GPT is not a legal entity in the United States. Humans have different rights than computer software. You are legally allowed to borrow books from the library. You are legally allowed to recite the content you read. You're not allowed to sell verbatim recitations of what you read. This is obvious, I think? But it's exactly what LLMs are doing right now.
The difference here is scale. For someone to reproduce a book verbatim from memory it would take years of studying that book. For an LLM this would take seconds.
The LLM could reproduce the whole library quicker than a person could reproduce a single book.
What if even though it's a small portion of the training data, their content has an outsized influence on the output being generated? A random NYT article about Donald Trump and a random Wikipedia article about some obscure nematode might be around the same share of training data but if 10,000x more users are asking about DJT than the nematode, what is fair? Obviously they'll need to pay royalties on the usage! /s
Yup, and I think that'll quickly uncover the reality that LLMs do not generate enough value relative to their true cost. GPT+ already costs $20/month. M365 Copilot costs $30/user/month. They're already the most expensive B2B-ish software subscriptions out there, there's very little market room to add in more cost to cover payments to rightsholders.
Imagine if tomorrow it was decided that every programmer had to pay out money for every single thing they went on the internet to learn about beyond official documentation: every Stack Overflow question they looked at, every question they took to a search engine. The amount of money would be decided by a non-tech official in charge of figuring out how much of what they earned was owed to the places they learned from. And people would respond, "Well, if you can't pay up for your raw materials, then this just isn't a good business for you."
So I suppose it would be like saying that if you used Stack Overflow to find answers, all of the work you created using information from it would have to be explicitly under the Creative Commons license. You wouldn't even be able to work for companies that aren't using that license if some of your knowledge comes from what you learned on Stack Overflow. Used Stack Overflow to learn anything about programming? You're going to have to turn down that FAANG offer.
And if you learned anything from videos/books/newsletters with commercial licenses, you would have to pay some sort of fee for using that information.
If your code contains verbatim copy-paste of entire blocks of non-trivial code lifted from those videos/books/newsletters with commercial licenses, then yes you would be liable for some licensing fees, at minimum.
If they are determined to have broken the law, then they should absolutely be made to pay damages to aggrieved parties (now, determining whether they did and who those parties are is an entirely unknown can of worms).
The data will have to become more curated. Exclusivity deals will probably become a thing too. Good data will be worth the money and hassle; garbage (or meh) data won't.
I can certainly imagine email correspondence. Even audio interviews. You're right that it seems at least presently AI is less likely to earn confidences. But I don't know how far off the movie "Her" actually is.
The NYT's strongest argument for infringement is that OpenAI is reproducing their content verbatim (and to make matters worse, without attribution). IANAL but it seems super likely to me that this will be found to be infringing sooner or later.
Do I really want to use a Chinese word processor that spits unattributed passages from the NYT into the articles I write? Once I publish that to my blog now I'm infringing and I can get sued too. Point is I don't see how output which complies with copyright law makes an LLM inferior.
The argument applies equally to code: if your use of ChatGPT, OpenAI, etc. today is extensive enough, who knows what copyrighted material you may have incorporated illegally into your codebase? Ignorance is not a legal defense for infringement.
If anything it's a competitive advantage if someone develops a model which I can use without fear of infringement.
Edit: To me this all parallels Uber and AirBnB in a big way. OpenAI is just another big tech company that knew they were going to break the law on a massive scale, and said look this is disruptive and we want to be first to market, so we'll just do it and litigate the consequences. I don't think the situation is that exotic. Being giant lawbreakers has not put Uber or AirBnB out of business yet.
>IANAL but it seems super likely to me that this will be found to be infringing sooner or later.
It better. Copyright has essentially fucking ceased to exist in the eyes of AI people. Just because you have a shiny new toy doesn't mean the law suddenly stops applying to you. The internet does its best to route around laws and government but the more technologically up to date bureaucracy becomes, the faster it will catch up.
Yeah I mean I'm not even really a fan of how copyright law works, but I don't see how you can just insert an "AI exemption." So OpenAI can infringe because they host an AI tool, but we humans can't? That would be ridiculous. Or is "I used AI when I created this" a defense against infringement? Also seems ridiculous. Why would we legally privilege machine creation of creative works over human creation in the first place? So I don't see what the credible AI-related copyright law reform is going to be yet.
Which means that either OpenAI is allowed to be the only lawbreaker in the country (because rich and lawyers), or nobody is. I say prosecute 'em and tell them to make tools that follow the law.
They probably didn’t start with a lawsuit. They started asking for royalties. They probably didn’t get an offer they thought was fair and reasonable so they sued.
These media businesses have shareholders and employees to protect. They need to try and survive this technological shift. The internet destroyed their profitability but AI threatens to remove their value proposition.
Sorry, how exactly do LLMs threaten NYT? Are people supposed to generate news themselves? Or, like, wait a year or so before NYT articles are consumed by LLMs?
NYT doesn't just publish "news" as in what happened yesterday; they also publish analysis, reviews of books and films, history, biography and so on. That's why people cite NYT articles from decades ago.
On the one hand, they should realize they are one of today's horse-carriage manufacturers. They'll only survive in very narrow realms (someone still has to build the Central Park horse carriages), but they will be minuscule in size and importance.
On the other hand, LLMs should observe copyright and not be immune to copyright.
SciHub was an early warning, IMHO, that there's a strong risk of the first world fumbling the ball so badly with IP that tech ecosystems start growing in the third world instead. The dominant platform for distributing scientific journal papers is no longer Western. Maybe SciHub is economically inconsequential, but LLMs certainly are not!
Imagine if California had banned Google from spidering websites without consent in the late 90's, on some backwards-looking, moralizing "intellectual property" theory like the current one targeting LLMs. Two-thirds of modern Silicon Valley wouldn't exist today, and equivalent ecosystems would have instead grown up in, who knows where. Not-California.
We're all stupidly rich and we have forgotten why we're rich in the first place.
What's the actionable advice here? US regulation should be the lowest common denominator of all countries one considers in competition? Certainly Chinese and Russian LLMs could vacuum up all the information. China already cares little about copyright and trademark, should they stop being enforced in the US?
My opinion is that the US should do things that are consistent with their laws. I don't think a Chinese or Russian LLM is much of a concern in terms of this specific aspect, because if they want to operate in the US they still need to operate legally in the US.
This suggests to me that copyright laws are becoming out of date.
The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.
> The original intent was to provide an incentive for human authors to publish work, but has become more out of touch since the internet allowed virtually free publishing and copying. I think with the dawn of LLMs, copyright law is now mainly incentivising lawyers.
And yet the content industry still creates massive profits every year from people buying content.
I think internet-native people can forget that internet piracy doesn’t immediately make copyright obsolete simply because someone can copy an article or a movie if sufficiently motivated. These businesses still exist because copyright allows them to monetize their work.
Eliminating copyright and letting anyone resell or copy anything would end production of the content many people enjoy. You can’t remove content protections and also maintain the existence of the same content we have now.
Maybe a specific example will help here. An author spends a year writing a technical book, researching subtle technical issues, creating original code and finding novel ways of explaining difficult abstractions.
A few weeks after the release, they find books on Amazon that plagiarized theirs, copies of the book available for free on Russian sites, and ChatGPT spitting out verbatim parts of the book's source code.
Which parts of copyright law would you say are out of date for the example above?
> Which parts of copyright law would you say are out of date for the example above?
The expectation that the author will get life+70 years of protection and income, when technical publications are very rarely still relevant after 5 years. Also, the modern ease of copying/distribution makes it almost impossible for the author to even locate which people to try to prosecute.
The expectation to make money from artificially restricting an abundant resource. While copyright is a way to create funding, it also massively harms society by restricting future creators from being able to freely reuse previous works. Modern ways to deal with this are patronage, government funding, foundations (e.g. NLNet) and crowdfunding.
Also, plagiarism has nothing to do with copyright. It has to do with attribution. This is easily proven: you can plagiarise Beethoven's music even though it's public domain.
What incentive do people have to publish work if their work is going to primarily be consumed by a LLM and spat out without attribution at people who are using the LLM?
I notice this in myself, even though I've never particularly made money from published prose on the internet.
But (under different accounts) I used to be very active on both HN and reddit. I just don't want to be anymore now for LLM reasons. I still comment on HN, but more like every couple of weeks than every day. And I have made exactly one (1) comment on reddit in all of 2023.
I'm not the only one, and a lot of smaller reddit communities I used to be active on have basically been destroyed by either LLMs, or by API pricing meant to reflect the value of LLM training data.
The fact that copyright protection is far too long is entirely separate from the need for some kind of copyright protection to exist at all. All evidence suggests that it's completely impossible to live off your work unless you copyright it for some reasonable period, with the possible exception of performance art (music, theater, ballet).
A writer or journalist just can't make money if any huge company can package their writing and market it without paying them a cent. This is not comparable to piracy, by the way, since huge companies don't move into piracy. But you try to compete with both Disney and Fox for selling your new script/movie, as an individual.
This experiment has also been tried to some extent in software: no company has been able to live off selling open source software. RedHat is the one that came closest, and they actually live by selling support for the free software they sell. Others like MySQL or Mongo lived by selling the non-GPL version of their software. And the GPL itself depends critically on copyright existing. Not to mention, software is still a best case scenario, since just having a binary version is often not enough, you need the original sources which are easy to guard even without copyright - no one cares so much for the "sources" of a movie or book.
China's accession to the Universal Copyright Convention, and an alleged desire to comply with international IP law, led to an influx of OECD IP and foreign direct investment (FDI).
In hindsight, China wasn’t diligent in the enforcement of IP violations. However, it’s clear foreign presences and investment grew substantially in China during the early 90s upon the belief IP would be protected, or at the very least there would be recourse for violations.
I used to work as a computer programmer until I retired. Nearly always, my work was part of a collaborative effort, and latterly didn't include any copyright claim. My income was never impacted by unauthorized copying. Until the 80s, there was no copyright on software, and yet even then people made a living programming.
Craftsmen don't claim copyright on their artifacts. Furniture designs were widely copied; but Chippendale did alright for himself. Gardeners at stately homes didn't rely on copyright. Vergil, Plato and Aristotle managed OK without copyright. People made a living composing music, songs and poetry before the idea of copyright was invented. Truck-drivers make a living; driving a truck is hardly a performance art. Labourers and factory workers get by successfully. Accountants and legal advocates get rich without copyright.
None of these trades amounts to "performance arts".
I very much doubt the company or foundation you were working for was selling the non-copyrighted software. If it was, it probably only worked on very specific hardware that you also produced and were selling, and thus copying it was largely useless. If you were working for a university, then the university obviously doesn't make money from selling software, and thus doesn't care as much about copyright.
Also, craftsmen rely on the fact that the part of their work that can't be easily copied, the physical artifact they produce, is most of the value (plus they rely on trademark laws and design patents quite often). Similarly for gardeners. The ancient Greek writers were again paid for performance, typically as teachers; literature was once quite a performative act. And again, at that time, physical copies of writings were greatly valuable artifacts, not much different in value from the writing itself, since copying large texts was so hard.
Similarly, the work of drivers, labourers, factory workers, accountants is valuable in itself and very hard or impossible to copy (again, the physical world is the ultimate copyright protection). The output of lawyers is in fact sometimes copyrighted, but even when it's not, it's not applicable to others' cases, so copies of it are not valuable: no one is making a business that replaces lawyers by re-distributing affidavits.
> I very much doubt the company [...] was selling the non-copyrighted software
Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.
At the very beginning, yes, it was "very specific" hardware; it was Burroughs hardware, which used Burroughs processors. But that was before microprocessors, and all hardware was "very specific".
> (plus they rely on trademark laws and design patents quite often)
Craftsmen and labourers were earning a living long before anyone had the idea of a "trademark", still less a "design patent".
> The output of lawyers is in fact sometimes copyrighted
You're right. That's why I didn't say "lawyers", I said "legal advocates". Those are people who speak on your behalf in courts of law, not scribes writing contracts. Anyway, the ancient Greeks and Romans had written laws, contracts and so on; they managed without trademarks and copyrights.
> Well you'd be mistaken. Lately, it was custom software, for a particular client, and of no interest to others. Earlier, it was before software copyright was a thing, and computer manufacturers gave software away to sell the hardware.
Then I am not mistaken: the company was initially selling hardware, with the software being just a value add as you say (no copyright: no interest in trying to sell, exactly my point). Then, you were being paid for building software that (a) was probably not being made public anyway, and (b) would not have been of interest to others even if it were.
Even so, if someone came to your client and offered to take on the software maintenance for a much lower price, you might have lost your client entirely. This has very much happened to contractors in the past.
And my point is you couldn't have a Microsoft or Adobe or possibly even RedHat if you didn't have copyright protecting their business. So, you'd probably not have virtually any kind of consumer software.
> offered to take on the software maintenance for a much lower price
We didn't charge maintenance for this software. We would write it to close the sale of a computer. It was treated as "cost of sale". I'm sure it was cheaper (to us) than the various discounts and kickbacks that happened in big mainframe deals.
As far as Microsoft and Adobe is concerned, I wouldn't regard it as a misfortune if they had never existed. I'm not convinced that RedHat's existence is contingent on copyright.
Yes, but it's still a subset of the arts - it doesn't apply to movies, literature, nor even to the scripting for any of these.
And I should mention YouTubers wouldn't be making that much money if YouTube weren't enforcing copyright, as you could just upload their videos and get the ad money. Without copyright, you could also cut off their in-video promotions and add your own, including your own Patreon - so you would get 100% of the money off their work if you can out-promote them.
It's only live performances which are protected by the physical world's strict no-copying laws (the ones that don't allow the same macro object to be in two places at the same time).
So basically, no medium that allows copying of the works in whole or nearly in whole has been successfully run with public-domain works.
I believe you equate incentive with monetary rewards. And while that is probably true for the majority of news outlets, money isn't always necessarily what motivates journalists.
So consider the hypothetical situation where journalists (or, more generally, people who might publish stuff) were somehow compensated. But in this hypothetical, they would not be attributed (or only to a very limited extent), because LLMs are just bad at attribution.
Shouldn't, in that case, the fact that information distribution by the LLM is "better" be enough to satisfy the deeper goal of wanting to publish stuff? I.e., reaching as many people looking for that information as possible, without blasting it out or targeting and tracking audiences?
To have a positive impact on the world? Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with their data and everyone working there is still getting paid for their work...
Oh thank goodness we can rely on charity for our information economy
> Also, presumably NYT still has a business model unrelated to whatever OpenAI is doing with [NYT’s] data…
That’s exactly the question. They are claiming it is destroying their business, which is pretty much self-evident given all the people in here defending the convenience of OpenAI’s product: they’re getting the fruits of NYTimes’ labor without paying for it in eyeballs or dollars. That’s the entire value prop of putting this particular data into the LLMs.
Yep! I like having access to high-quality information and producing, collecting, editing, and publishing that is not free.
Much of it is only cost-effective to produce if you can share it with a massive audience. I.e., sure, if I want to read a great investigative piece on the corruption of a Supreme Court justice I can hypothetically commission one, but in practice it seems much, much better to allow people to have businesses that undertake such matters and publish their findings to a large audience at a low unit price.
Now what’s your argument for removing such an incentive?
Why did you specify that this stuff you like, you only like if it's "not free"?
The hidden assumption is that the information you like wouldn't be made available unless someone was paying for it. But that's not in evidence; a lot of information and content is provided to the public due to other incentives: self-promotion, marketing, or just plain interest.
I’ll restate it for clarity: I like high-quality information. Producing and publishing high-quality information is not free.
There are ways to make it free to the consumer, yes. One way is charity (Wikipedia) and another way is advertising. Neither is free to produce; the advertising incentive is also nuked by LLMs; and I’m not comfortable depending on charity for all of my information.
It is a lot cheaper to produce low-quality than high-quality information. This is doubly so in a world of LLMs.
There is ONE Wikipedia, and it is surely one of mankind’s crowning achievements. You’re pointing to that to say, “see look, it’s possible!”?
I contribute to Wikipedia, and I don't consider my contributions to be "charity"; I contribute because I enjoy it. Even in the age of printing presses, copyright law was widely ignored, well into the 20th century. The USA didn't join the Berne Convention until 1989 (and they promptly went mad with copyright).
Yes, there's only one Wikipedia; but there are lots of copies, and lots of similar efforts. Yes, there's one Wikipedia, like there's one Mona Lisa. There are lots of things of which there's only one; in that sense, Wikipedia isn't remotely unique.
Of course not. But paying the server bills won't magically produce the excellent content that you value so much. That's produced by volunteers.
There's a tendency among some people to take the nostrums of economists about the aggregate behaviour of populations as if they described human nature, and to then go on and conclude that because human behaviour in aggregate can be understood in terms of economic incentives, that an individual human can only be motivated economically. I find that an impoverished and shallow outlook, and I think I'm happier for not sharing it.
We absolutely need an information economy where people can research things and publish what they find without needing some deep pocketed sponsors. Some may do it for money, some may do it for recognition. Once AI absorbs all that information and uses it without attribution these incentives go away. I am sure OpenAI, Microsoft and others will love a world where they have a quasi monopoly on what information goes to the public but I don't think we want that.
I would guess the monetisation is going to be limited to either subscriptions or advertising, if your reputation allows people to especially value your curation of facts/reporting etc. The big issue with LLMs is the lack of reliability - it might be accurate or it might be a hallucination.
Personally, I think it would be a lot simpler if the internet was declared a non-copyright zone for sites that aren't paywalled as there's already a legal grey area as viewing a site invariably involves copying it.
Maybe we'll end up with publishers introducing traps/paper towns like mapmakers are prone to do. That way, if an LLM reproduces the false "fact", it'll be obvious where they got it from.
The cost of copying and publishing has been almost irrelevant to the need for copyright at least since the times of the printing press. In fact, when copying books was extremely expensive work, copyright was not even that needed - the physical book was about as valuable as the contents, so no money was there to be made from copying someone else's work vs coming up with your own.
All of this can be true (I don’t think it necessarily is, but for the sake of argument), but it’s legally irrelevant: the court is not going to decide copyright infringement cases based on geopolitical doctrines.
Courts don’t decide cases based on whether infringement can occur again, they decide them based on the individual facts of the case. Or equivalently: the fact that someone will be murdered in the future does not imply that your local DA should not try their current murder cases.
The issue here is that the case law is not settled at all and there is no clear consensus on whether OpenAI is violating any copyright laws. In novel cases like this where the courts essentially have to invent new legal doctrines, I think the implications of the decision carries a tremendous amount of weight with the judges and justices who have to make that decision.
Trying to prevent AI from learning from copyrighted content would look completely stupid in a decade or two when we have AIs that are just as capable as humans, but solely due to being made of silicon rather than carbon are banned from reading any copyrighted material.
Banning a synthetic brain from studying copyrighted content just because it could later recite some of that content is as stupid as banning a biological person from studying copyrighted content because it could later quote from it verbatim.
It's not exactly a synthetic brain though, is it? LLMs are more like lookup tables for the texts they're trained on.
We will not have "AIs as capable as humans" in a couple of decades. AIs will keep being tools used by humans. If you use copyrighted texts as input to a digital transformation, that's copyright infringement. It's essentially the same situation as sampling in music, and IMO the same solutions can be applied here: e.g. licenses with royalties.
The war on drugs has also been unwinnable from the start and yet they built an economy on top of it, with entire agencies and a prison industry. When it comes to the fabrication and exploitation of illegality, unwinnability may be a feature, not a bug.
Access to resources is hardly a new problem: when I was an NLP graduate student about a decade ago, one of our teachers had been scraping a major newspaper for years (and continued to do so) to make a corpus. The legality of that was questionable at best, yet it was used in academic papers, and a subset was used for training.
The same applies equally to images: Google got rich in part by making illegal copies of whatever images it could find. Existing regulations could be updated to cover ML models, but that won't stop bad or big-enough actors from doing what they want.
> We’re in a “mutually assured destruction” situation now
No, we aren't. Very good spam generators aren't comparable to mass destruction weapons.
Any piece of pie deemed too big for one person to eat will be split accordingly.
I don't think NYT, or any other industry for that matter, expects AI to go away: in fact, they likely prefer it doesn't, so long as they can get a slice of that pie.
That's what the WGA and SAG struck over, and they won protections ensuring AI-enhanced scripts or shows will not interfere with their royalties, for example.
Another way to look at it is to consider being stolen from part of the business model.
There is a massive amount of pirated content in China, but Hollywood is also making billions there at the same time; in fact, China surpassed NA as the #1 market for Hollywood years ago [1].
NYT is obviously different from Disney, and may not be able to bend its knees far enough, but maybe there can be similar ways out of this.
> We’re in a “mutually assured destruction” situation now, but instead of bombs the weapon is information.
We've always been in that situation. Computers made the copying, transmission and processing of information trivial since the day they were invented. They changed the world forever.
It's the intellectual property industry that keeps denying reality since it's such an existential threat to them. They think they actually own those bits. They think they can own numbers. It's time to let go of such insane notions but they refuse to let it go.
This argument is moot. Just because some countries - see China - steal intellectual property doesn't mean we should. There are rules to the games we play, specifically so we don't end up like them.
Ok, let’s address this from the standpoint of a node in the network of the thoughtscape. A denizen of the “inter”net, and also a victim of the exploitive nature of artists.
Media amalgamated power by farming the lives of “common” people for content, and attempts to use that content to manage the lives of both the common and the unique, under the auspice of entertainment. Which in and of itself is obviously a narrative convention that infers implied consent (I'd ask "to what?", facetiously).
Keepsake of the gods if you will…
We are discussing these systems as though they are new (ai and the like, not the apple of iOS), they are not…
This is an obfuscation of the actual theft that's been taking place (against us, by us, not others).
There is something about reaping what you sow written down somewhere, just gotta find it.
It can do, though. While the proper definition is "worthy of discussion / debatable", it can also refer to a pointless debate.
"Moot derives from gemōt, an Old English name for a judicial court. Originally, moot referred to either the court itself or an argument that might be debated by one. By the 16th century, the legal role of judicial moots had diminished, and the only remnant of them were moot courts, academic mock courts in which law students could try hypothetical cases for practice. Back then, moot was used as a synonym of debatable, but because the cases students tried in moot courts were simply academic exercises, the word gained the additional sense "deprived of practical significance." Some commentators still frown on using moot to mean "purely academic," but most editors now accept both senses as standard."
Countless Americans are happily 'stealing' intellectual property every day from other Americans by accessing two websites, SciHub and LibGen, which owe their very existence to being hosted in foreign countries with weak intellectual property protection, beyond the reach of US long-arm jurisdiction. Even on this website, using sites like archive.is (which would be illegal if they operated in the US) to bypass paywalls and access copyrighted material is common and rarely frowned upon. I doubt a culture of respecting copyright is as characteristic of "us" as you seem to think.
I see a complete economic collapse unless creators start getting paid both for their data upfront and royalties when their data is used in an LLM response.
While I didn't say anything about copyright (obviously our current copyright laws are completely ill-equipped to handle how LLMs work), feel free to replace data with whatever you like. writing, art, music, etc. It's all the same.
You've missed the point he was making -- that Chinese and Russian companies don't care about American copyright and will do whatever is in their interest.
And although you were being flippant, yes, Chinese LLMs are bad actors.
In contrast to child labor laws, which are intended and written to protect vulnerable people from exploitation, current copyright laws are tailored to the interests of Disney et al.
If they were watered down, I wouldn't see any moral or ethical loss in that.
Copyright law is far from perfect, but the concept is not morally bankrupt. It is certainly abused by large entities but it also, in principle, protects small content creators from exploitation as well. In addition to journalists, writers, musicians, and proprietary software vendors, this also includes things like copyleft software being used in unintended ways. When I write copyleft software, it is my intention that it is not used in proprietary software, even if laundered through some linear algebra.
I'm also far more amenable to dismissing copyright laws when there is no profit involved on the part of the violator. Copying a song from a friend's computer is whatever, but selling that song to others certainly feels a lot more wrong. It's not just that OpenAI is violating copyright, they are also making money off of it.
With the exception of source code availability, copyleft is mostly about using copyright to destroy itself. Without copyright (which I feel is unethical), and with additional laws to enforce open sourcing all binaries, copyleft need not exist.
So it is not good when people use copyleft as a justification for copyright, given that its whole purpose was to destroy it.
Source code availability (and the ability to modify the code on a device) is the most important part, IMO, regardless of RMS's original intention. Do you feel that it's ethical that OpenAI is keeping their model closed?
No, because I think such restrictions are unethical in the first place. However, in regards to training, I think it might be a necessary evil to allow companies to ignore copyleft, so smaller entities can ignore copyright to train open models.
On the other hand, you could also argue that if AI takes all financial incentives from professionals to produce original works, then the AI will lose out on quality material to train on and become worse. Unless your argument is there’s no need for anything else created by humanity, everything worth reading has already been written, and humanity has peaked and everyone should stop?
Like all things, it's about finding a balance. American, or any other, AI isn't free from the global system that exists around us: capitalism.
The whole "AI training blackhole" thing is a myth. As long as humans are curating the content generated by ML, the content generated is still valid training data. Remember, for every ML generated image you see online, someone had to go through countless attempts to get it to create exactly what they wanted.
Anecdotal but I know lots of creatives (and by creatives I also include some devs) who've stopped publishing anything publicly because of various AI companies just stealing everything they can get their hands on.
They don't mind sharing their work for free to individuals or hell, to a large group of individuals and even companies, but AIs really take it to a whole different level in their eyes.
Whether this is a trend that will accelerate or even make a dent in the grand scheme of things, who knows, but at least in my circle of friends a lot of people are against AI companies (which is basically == M$) being able to get away with their shenanigans.
Working enough that people and companies there exist, live, and are to some degree successful, yes. I've visited multiple times in the past few years and found it to be pretty normal.
My guess is that someone did a join on a has-many association without narrowing down the right side of the join, creating a Cartesian product in the result. That can be easy to miss in code review if you aren't super diligent about it. A minimal sketch of the failure mode (the schema is hypothetical):
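```ruby
# Author has_many :posts and has_many :comments. Joining both at once
# multiplies the rows: an author with 10 posts and 20 comments yields
# 10 * 20 = 200 joined rows before any aggregation runs.
Author.joins(:posts, :comments).count           # wildly over-counts
Author.joins(:posts, :comments).distinct.count  # COUNT(DISTINCT authors.id)
```

Splitting it into one query per association avoids the product entirely.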