
>> In the proposal, OpenAI also said the U.S. needs “a copyright strategy that promotes the freedom to learn” and on “preserving American AI models’ ability to learn from copyrighted material.”

Perhaps also a symmetric "freedom to learn" from OpenAI's models, with some provisions / naming conventions? U.S. labs are limited in this way, while labs in China are not.


It still warps my brain: they’ve taken trillions of dollars of industry and made a product worth billions by stealing it. IP is practically the basis of the economy, and these models warp and obfuscate ownership of everything, like a giant reset button on who can hold knowledge. It wouldn’t be legal, or allowed, if tech weren't seen as the growth path of our economy. It’s a hell of a needle to thread and it’s unlikely that anyone will ever again be able to model from data so open.


"IP" is a very new concept in our culture and completely absent in other cultures. It was invented to prevent verbatim reprints of books, but even so, the publishing industry existed for hundreds of years before then. It's been expanded greatly in the past 50 years.

Acting like copyright is some natural law of the universe that LLMs are upending simply because they can learn from written texts is silly.

If you want to argue that it should be radically expanded to the point that not only a work, but even the ideas and knowledge contained in that work should be censored and restricted, fine. But at least have the honesty to admit that this is a radical new expansion for a body of law that has already been radically expanded relatively recently.


> It was invented to prevent verbatim reprints of books

It was also invented to keep the publishing houses under control and keep them from papering the land in anti-crown propaganda (like the stuff that fueled the civil war in England and got Charles I beheaded).

Probably one of the biggest brewing fights will be whether the models are free to tell the truth or whether they'll be mouthpieces for the ruling class. As long as they play ball with the powers that be, I predict copyrights won't be a problem at all for the chosen winners.


That's why I am a big proponent of local, open-weights computation. They can't shut down a non-compliant model if you're the one running it yourself.


I agree this would be a positive direction, but something that gives me pause is the forced upgrades and hardware cycles of both Mac and Windows now. They both scan files on your system constantly for various reasons, so for this purpose you're really stuck on *nix variants, right?


That's what I do. I'm really sick of the OS no longer being mine.


"mouthpieces for the ruling class"

That's actually a great point. Judging from the current state of media, there is a clear momentum to take sides in moral arguments. Maybe the standard for models needs to be a fair use clause?


> It's been expanded greatly in the past 50 years.

Elephant in the room. If copyright and patent both expired after 20 years or so then I might feel very differently about the system, and by extension about machine learning practices.

It's absurd to me that broad cultural artifacts which we share with our parents' (or even grandparents') generation can be legally owned.


What AI companies are doing (downloading pirated music and training models) is completely unfair. It takes a lot of money (everything related to music is expensive), talent, and work to record a good song, and what AI companies do is just grab millions of songs for free and call it "fair use". If their developers are so smart and talented why don't they simply compose and record the music by themselves?

> not only a work, but even the ideas and knowledge contained in that work

AI models reproduce existing audio tracks when asked, although in a distorted and low-quality form.

Also it will be funny to observe how the US government ignores copyright violations by AI companies while issuing ridiculous fines to ordinary citizens for torrenting a movie.


Everything in tech is unfair. Music teachers replaced by apps and videos. Audio engineers replaced by apps. Albums manufacturing and music stores replaced by digital downloads. Custom instruments replaced by digital soundboards. Trained vocalists replaced by auto-tune. AI is just the final blip of squeezing humans out of music.


Not just music: models are trained on all types of art forms that have been created by humans across every medium, and businesses are now choosing to use content from AI rather than pay an artist.

Breakout success can still be achieved from humans who create brand new art styles that can't yet be replicated by an AI. These artists will reap the rewards until all of these works are added to the subsequent AI training models.


> AI models reproduce existing audio tracks when asked, although in a distorted and low-quality form.

So can my wife. Who should I call to have her taken away?


The RIAA.


> What AI companies are doing (downloading pirated music and training models) is completely unfair.

We work in an industry built on leveraging unfairness. Expecting otherwise on this forum is very odd.


>We work in an industry built on leveraging unfairness. Expecting otherwise on this forum is very odd.

Yet this forum is very quick to criticize other people and other industries for unfairness.


Is it? From my perspective it seems like the folks here mostly are part of the problem, even if there is diversity of opinions.


The problem here is that it's still illegal for me to make a backup copy of the stuff I bought, but they can do whatever they want.


“The Venetian Patent Statute of 19 March 1474, established by the Republic of Venice, is usually considered to be the earliest codified patent system in the world.[11][12] It states that patents might be granted for "any new and ingenious device, not previously made", provided it was useful. By and large, these principles still remain the basic principles of current patent laws.“

What are you talking about.


Patents and copyright are very different beasts.


The discussion was about IP though, which includes both of those.


As another commenter says, this is about IP, but even positing that copyright is somehow invalid because it’s new is incredibly obtuse. You know what other law is relatively new? Women’s suffrage.

I’m annoyed by arguments like the above because they’re clearly derived from working backwards from a desired conclusion; in this case, that someone’s original work can be consumed and repurposed to create profit by someone else. Our laws and society have determined this to be illegal; the fact that it would be convenient for OpenAI if it weren't has no bearing.


Also, a quick glance at the wikipedia page for "copyright" talks about the first law being put down and enforced in 1710. What are we even doing here?


Your argument that IP and copyright do not exist now because they did not exist in the past is bogus.

IP and copyright exist.


You are missing GP's point and misunderstanding what generative models are actually doing.

The late OpenAI researcher and whistleblower, Suchir Balaji, wrote an excellent article regarding this topic:

https://suchir.net/fair_use.html


Is it the same thing though? Even though Lord of the Rings, the book, has likely been used to train the models, you can't reproduce it. Nor can you make a derivative of it. Is it really the same kind of comparison as "Kimba the White Lion" and "The Lion King"?

https://abounaja.com/blog/intellectual-property-disputes


[flagged]


what if someone else takes your stuff and puts it on the internet unrestricted?

https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


Gearing up for a fight between the two major industries based on exploitative business models:

Copyright cartels (RIAA, MPAA) that monetized young artists without paying them much at all [1], vs the AI megalomaniacs who took all the work for free and used Kenyans at $2 an hour [2] so that they can raise "$7 trillion" for their AI infrastructure

[1] https://www.reddit.com/r/LetsTalkMusic/comments/1fzyr0u/arti...

[2] https://time.com/6247678/openai-chatgpt-kenya-workers/


Can't believe I'm actually rooting for the copyright cartels in this fight.

But that does make me think that in a sane society with a functional legislature I wouldn't have to pick a dog in this fight. I'd have enough faith in lawmakers and the political process to pursue a path toward copyright reform that reins in abuses from both AI companies and megacorp rightsholders.

Alas, for now I'm hoping that aforementioned megacorps sue OpenAI into a painful lesson.


> Can't believe I'm actually rooting for the copyright cartels in this fight.

The same megacorps are suing Internet Archive for their collection of 78rpm records. These guys would rather see art orphaned and die.


Yup, we live in a pretty depressing world.

More generally, the best we can hope for is to discourage concentrated power, both in government and corporate forms.


They're suing Internet Archive because IA scanned a bunch of copyrighted books to put online for free (e: without even attempting to get permission to do so) then refused to take them down when they got a C&D lol. IA is putting the whole project at risk so they can do literal copyright infringement with no consequences.


During covid, when everyone was told to stay at home and not do anything, the Internet Archive offered library books.

And what they actually did was violate the requirement to have a physical copy of each book they were lending.

As I understand it, they did not offer anything new that wasn't available to loan prior.

I could be wrong. But if I'm not, I see no reason to lambast IA.


It's not lambasting to communicate what happened. IA got a C&D, refused to comply, and got sued for copyright infringement. The courts sided with the publishers when IA tried to claim it was fair use (technologists seem to have a pattern of stretching the definition of fair use). They've put their entire project at risk because they've repeatedly ignored the law here. That's just what happened.



I should have "freedom to learn" about any Tesla in the showroom, any F-35 I see lying around an airbase, or the contents of anyone in the government's bank account.


According to this scheme, if you find a bug and can read the bank's data, then you can use it as you want.


Nope, have to feed it into an llm first, afterwards it's legitimate.


No need for a LLM. Humans always have their own neural networks in their heads. :)


Can this extend to every kid sued by the record industry for downloading a few songs?

Have we all been transported to bizarro land?

Different rules for billion dollar corps I guess.


Those cases did very poorly whenever they actually went to court (including the ones that were summarily dismissed by the courts, which technically never made it to court). They were much more of a mafia-style shakedown than an actual legal enforcement effort.

Same rules, but people are a lot less inclined to defend themselves because the cost of loss was seen as too high to even risk it.


Chinese AI must implement socialist values by law, but law is a much more fluid fuzzy thing in China than in the USA (although the USA seems to be moving away from rule of law recently).


> Chinese AI must implement socialist values by law

I don't doubt it but am interested to read a source? I know the models can't talk about things like Tiananmen Square 1989, but what does 'implementing socialist values by law' look like?


https://www.cnbc.com/2024/07/18/chinese-regulators-begin-tes...

"Socialist values" is literally the language that China used in announcing this.

Here is a recent article from a Chinese source:

https://www.globaltimes.cn/page/202503/1329537.shtml

Although censorship isn't mentioned specifically, it is definitely 99% of what they are focused on (the other 1% being scams).

China practices Rule by law, not Rule of law, so, you know, they'll know it's bad when they see it, and model providers will exercise extreme self-censorship (which is already true for social network providers).


> China practices Rule by law, not Rule of law

In practice the US is less different than you imply. For the vast majority of Americans, being sued is a punishment in and of itself due to the prohibitive costs of hiring a lawyer. In the US we have a right to a “speedy” trial, but there are many people sitting in jail now because they can’t afford the bail to get out. Speedy could mean months.

I say this because when we constantly fall so far short of our ideals, one begins to question if those are really our ideals.


No one has pure rule of law, but at least the USA has it as a goal. The Chinese government has stated explicitly that rule of law isn’t a goal, so it leads to a very different legal system from ours. You have to think much more deeply about the spirit of the law and the flippant intentions of the official class that has all the power (the judicial system isn’t allowed to check official power, or even interpret ambiguous or competing laws).


> The Chinese government has stated explicitly that rule of law isn’t a goal

Can you share where you saw this? I am also not aware of anywhere that the US has stated that rule of law is a goal. What you are referring to is more of a norm or tradition. And norms can and do change over time for better or worse.

You could argue that rule of law follows from the preamble to the constitution but that doesn’t explicitly mention rule of law either. It mentions various values like justice and tranquility.


All you have to do is read the Chinese constitution to figure it out. Freedom of speech, religion, press, are all there but aren’t meaningful rights since there is no enforcement of those rights. For the rest, here is a document that explains the concept in more detail. https://www.swp-berlin.org/10.18449/2021C28/

> The aim is to use the law as a political instrument to make the state more efficient and to reduce the arbitrariness of how the law is applied for the majority of the popu­lation, among other things, with the help of advanced technology. In some areas, for example on procedural issues, Beijing continues to draw inspiration from the West in establishing its Chinese “rule of law”. However, the party-state leadership rejects an independent judiciary and the principle of separation of powers as “erroneous west­ern thought”. Beijing is explicitly interested in propagating China’s conception of law and legal practice internationally, establishing new legal standards and enforcing its interests through the law. Berlin and Brussels should, therefore, pay special attention to the Chinese leadership’s concept of the law. In-depth knowledge on this topic will be imperative in order to grasp the strategic implications of China’s legal policy, to better understand the logic of their actions and respond appropriately.

This is mostly transcribed from those meetings (vs. a Western interpretation). You really need to understand this to get how the legal systems are different, and how party officials are basically given supreme power (only checked by their bosses).


Socialism and freedom of speech aren't mutually exclusive


So? US AI must implement US rules by law. AI models are heavily censored and tend to favor certain political viewpoints.


Which political viewpoints do you think that AI models currently favor?


Empirical research has been done that shows current AI models to be left-leaning.

Here is some (non-empirical) displayed data: https://trackingai.org/political-test

Here is some research on that matter: https://arxiv.org/abs/2502.08640 Here is more: https://www.sciencedirect.com/science/article/pii/S016726812...


I like how this "freedom to learn" should apply to models, but not real people..


It already applies to real people, doesn't it? I.e. if you read a book, you're not allowed to start printing and selling copies of that book without permission of the copyright owner, but if you learn something from that book you can use that knowledge, just like a model could.


Can I download a book without paying for it, and print copies of it? Stash copies in my bathroom, the gym, my office, my bedroom etc. to basically have a copy on hand to study from whenever I have some free time?

What about movies and music?


Yes, you're allowed to make personal copies of copyright works that you own. IANAL, but my understanding is that if you're using them for yourself, and you're not prevented from doing so by some sort of EULA or DRM, there's nothing in copyright law preventing you from e.g. photocopying a book and keeping a copy at home, as long as you don't distribute it. The test case here has always been CDs—you're allowed to make copies of CDs you legally own and keep one at home and one in your car.


> Yes, you're allowed to make personal copies of copyright works that you own.

That’s not the point. It’s about books you don’t own. Are you allowed to download books from Z-Library, Sci-Hub etc. because you want to learn?


To the best of my knowledge, no individual has ever been sued or prosecuted specifically for downloading books. As long as you're not massively sharing them with others, it's not an issue in practice. Enjoy your reading and learning.


Aaron Swartz, cofounder of Reddit and inventor of RSS and Markdown, was hounded to death by an overzealous prosecutor for downloading articles from JSTOR, with the intent to learn from them. He was charged with over a million dollars in fines and could have faced 35 years in prison.

He and Sam Altman were in the same YC class. OpenAI is doing the same thing at a larger scale, and their technology actually reproduces and distributes copyrighted material. It's shameful that they are making claims that they aren't infringing creators' rights when they have scraped the entire internet.

https://flaminghydra.com/sam-altman-and-aaron-swartz-saw-the... https://en.wikipedia.org/wiki/Aaron_Swartz


I'm familiar with Aaron Swartz's case, and that is actually why I phrased it as "books". In any case, while tragic, Swartz wasn't prosecuted for copyright infringement, but rather for wire fraud and computer fraud due to the manner in which he bypassed protections in MIT's network and the JSTOR API. This wouldn't have been an issue if he had downloaded the articles from a source that freely shared them, like Sci-Hub.


It would be incredibly naive to assume that the scraping done for these models did not at any point circumvent protections.

The fundamental contention is that both accessed, saved and distributed material that they didn't have a "right" to access, save, and distribute. One was made a billionaire for it and another was driven to suicide. It's not tragic, it's societal malpractice.


Will what OpenAI & others did serve as precedent for Alexandra Elbakyan of SciHub and avenge Aaron?

Cynically, I imagine it will not but I hope that it could.


You could argue that they are avenging him in doing exactly what he did, or worse, and not being punished for it. They are establishing precedent.


I'm responding specifically to this sentence:

> It's shameful that they are making claims that they aren't infringing creators' rights when they have scraped the entire internet.

Scraping the Internet is generally very different from piracy. You are given a limited right to that data when you access it, and you can make local copies. If further use does something sufficiently non-copying, then creator rights aren't being infringed.


Can you compress the internet including copyrighted material and then sell access to it?

At what percentage of lossy compression does it become infringement?


> Can you compress the internet including copyrighted material and then sell access to it?

Define access?

If you mean sending out the compressed copy, generally no, for the things people normally call compression.

If you want to run a search engine, then you should be fine.

> At what percentage of lossy compression does it become infringement?

It would have to be very very lossy.

But some AI stuff is. For example, there are image models with fewer parameters than source images. Those are, by and large, not able to store enough data to infringe with. (Copying can creep in with images that have multiple versions, but that's a small sliver of the data.)
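
A rough back-of-envelope illustration of that capacity point (the figures below are illustrative approximations for a Stable Diffusion-class model and a LAION-scale dataset, not exact counts):

    # How much model capacity exists per training image?
    params = 1.0e9        # assume ~1B model parameters
    images = 2.0e9        # assume ~2B training images
    bytes_per_param = 2   # fp16 weights

    bytes_per_image = params * bytes_per_param / images
    print(f"~{bytes_per_image:.0f} byte(s) of capacity per training image")  # ~1

    # Even a heavily compressed thumbnail is tens of kilobytes, so the model
    # has orders of magnitude too little capacity to store its training set.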


Commercial audio generation models were caught reproducing parts of copyrighted music in a distorted and low-quality form. This is not "learning", just "imitating".

Also, as I understand it, they didn't even buy the CDs with the music for training; they got it somewhere else. Why do organizations that prosecute people for downloading a movie not want to look into whether it is OK to build a business on illegal copies of copyrighted works?


I said "some" for a reason.


When you identify where the infringing party has stored the source material in their artifact.{zip,pdf,safetensor,connectome,etc}. In ML, this discovery stage is called "mechanistic interpretability", and in humans it's called "illegal."


It's not that clear cut. Since they're talking about taking lossy compression to the limit, there are ways to go so lossy that you're no longer infringing even if you can point exactly at where it's stored.

Like CliffsNotes.


It was overzealous prosecution of the break-in to a closet to wire up some Ethernet cables to gain access to the materials,

not the downloading with intent.

And apparently the most controversial take in this community is the observation that many people would have done the trial, plea, and time, regardless of how overzealous the prosecution was.


> breaking into a closet

"The closet's door was kept unlocked, according to press reports"

When's the last time a kid with no record, a research fellow at Harvard, got threatened with 35 years for a simple B&E?


They threaten

It's the plea or sentencing where that stuff gets taken into account for a reduction to community service.


I'm glad you still have that much faith in the system. That's much more faith than I have in the system (and more faith than I had in the system back then, too).


Wasn’t John Gruber the inventor of Markdown?


> for downloading articles from JSTOR, with the intent to learn from them

For context, according to sources, he downloaded 4.8 million articles.


Maybe he was about to train an LLM on them /s


35 years is a press release sentence. The way DOJ calculates sentences when they write press releases ignores the alleged facts of the particular case and just uses for each charge the theoretically maximum possible sentence that someone could get for that charge.

To actually get that maximum typically requires things like the person is a repeat offender, drug dealing was involved, people were physically harmed, it involved organized crime, it involved terrorism, a large amount of money was involved, or other things that make it an unusual big and serious crime.

The DOJ knows exactly what they are alleging the defendant did. They could easily look at the various factors that affect sentencing for the charge, see which apply to that case, and come up with a realistic number, but that doesn't make it sound as impressive in the press release.

Another thing that inflates the numbers in the press releases is that defendants are often charged with several related charges. For many crimes there are groups of related charges that for sentencing get merged. If you are charged with say 3 charges from the same group and convicted on all you are only sentenced for whichever one of them has the longest sentence.

If you've got 3 charges from such a group in the press release the DOJ might just take the completely bogus maximum for each as described above and just add those 3 together.

Here's a good article on DOJ's ridiculous sentence numbers [1].

Here's a couple of articles from an expert in this area of law that looks specifically at what Swartz was charged with and what kind of sentence he was actually looking at [2][3].

Why do you think Swartz was downloading the articles to learn from them? As far as I've seen, no one knows for sure what he was intending.

If he wanted to learn from JSTOR articles he could have downloaded them using the JSTOR account he had through his research fellowship at Harvard. Why go to MIT and use their public JSTOR WiFi access, and then when that was cut off hide a computer in a wiring closet hooked into their ethernet?

I've seen claims that he wanted to do was meta research about scientific publishing as a whole which could explain why he needed to download more than he could download with his normal JSTOR account from Harvard, but again why do that using MIT's public WiFi access? JSTOR has granted more direct access to large amounts of data for such research. Did he talk to them first to try to get access that way?

[1] https://web.archive.org/web/20230107080107/https://www.popeh...

[2] https://volokh.com/2013/01/14/aaron-swartz-charges/

[3] https://volokh.com/2013/01/16/the-criminal-charges-against-a...


He might have wanted other people to have access to the knowledge, and for free. In comparison, AI companies want to sell access to the knowledge they got by scraping copyrighted works.


Wow, just wow.


Truly wow. The sucking up to corporations is terrifying. This, when Aaron Swartz was institutionally murdered by the state for "copyright infringement". And what he did wasn't even for profit, or even 0.00001 of the scale of the theft that OpenAI and their ilk have done.

So it's totally OK to rip off and steal and lie through your teeth AND do it all for money, if you're a company. But if you're a human being, doing it not for profit but for the betterment of your own fellow humans, you deserve to be imprisoned and systematically murdered and driven to suicide.


Thank you for putting my sentiment into words. THIS. It's not power to the people, it's power to the oligarchs. Once you have enough power and, more importantly, wealth, you're welcomed into the fold with open arms. Just as Spotify built a library of stolen music: as long as wealth was created, there is no problem, because wealth is just money taken from the people and given to the ruling class.


CDs, software, and electronic media, yes. Physical books, no. You can't make archival copies.


Sure you can: you could take a physical book and painstakingly copy it one page at a time; that is totally fair use.


Leaving aside the broader discussion...

You cannot legally photocopy an entire book even if you own a physical copy.

Internet people say you can, but there's no actual legal argument or case law to support that.


> Internet people say you can, but there's no actual legal argument or case law to support that.

Quite the opposite. The burden of proof is on you to show a single person ever, in history, who has been prosecuted for that.

If nobody in the world has ever been prosecuted for this, then that means it is either legal, or it is something else that is so effectively equivalent to "legal" that there is little point in using a different word.

If you want to take the position that, "uhhhhhhh, there is exactly 0% chance of anyone ever getting in trouble or being prosecuted for this, but I still don't think its legal, technically!"

Then I guess go ahead. But for those in the real world, those two things are almost equivalent.


> If you want to take the position that, "uhhhhhhh, there is exactly 0% chance of anyone ever getting in trouble or being prosecuted for this, but I still don't think its legal, technically!"

> Then I guess go ahead.

That is exactly what I am saying.


Gotcha, so then you agree that there are exactly zero cases or evidence of anyone ever being punished for this, which is the most important part.

If you do this, you are not going to be held legally liable for anything.


At home? Without ever sharing it with anyone? I thought making backups of things that you personally own was protected, at least in the US. Could you elaborate on my apparent misunderstanding?


> Could you elaborate on my apparent misunderstanding?

One of the six exclusive rights of copyright holders is "to reproduce the copyrighted work in copies or phonorecords."

(In certain circumstances, the Fair Use doctrine contravenes this right, but reproduction in whole is not such a circumstance.)


I believe the post you are replying to is suggesting the copy is made by hand, one word at a time.


I don't see how that would be different, as the meaningful material is text not images.


Citation needed.


This is a specific exception in Australia Copyright law. It allows reproducing works in books, newspapers and periodical publications in different form for private and domestic use.

(Copyright Act 1968 Part III div. 1, section 43C) https://www.legislation.gov.au/C1968A00063/latest/text


It seems reasonably within the bounds described by fair use, but nobody's ever tested that particular constellation of factors in a lawsuit, so there's no precedent for hand copying a book.

17 U.S.C. § 107 is the fair use carveout.

Interestingly, digitizing and copying a book on your own, for your own private use, has also not been brought to court. Major rights holders seem to not want this particular fair use precedent to be established, which it likely would be, and might then invalidate crucial standing for other cases in which certain interpretations of fair use are preferred.

Digitally copying media you own is fair use. I'll die on that hill. It doesn't grant commercial rights, you can't resell a copy as if it were the original, and so on, and so forth.

There's even a good case to be made for sharing a digitally copied work purchased legally, even to millions of people, 5 years after a book is first sold: for a vast majority of books, after 5 years they've sold about 99.99% of the copies they're going to sell.

By sharing after the ~5 year mark, you're arguably doing marketing for the book, and if we cultivated a culture of direct donation to authors and content creators, it invalidates any of the reasons piracy is made illegal in the first place.

Right now publishers, studios, and platforms have a stranglehold on content markets, and the law serves them almost exclusively. It is exceedingly rare for the law to be invoked in defending or supporting an author or artist directly. It's very common for groups of wealthy lawyers LARPing as protectors of authors and artists to exploit the law and steal money from regular people.

Exclusively digital content should have a 3 year protected period, while physical works should get 5, whether it's text, audio, image, or video.

Once something is outside the protected period, it should be considered fair game for sharing until 20 years have passed, at which point it should enter public domain.

Copyright law serves two purposes - protecting and incentivizing content creators, and serving the interests of the public. Situations where a bunch of lawyers get rich by suing the pants off of regular people over technicalities is a despicable outcome.


> there's no precedent - hand copying a book, that is

Thank you! I had looked this up myself last week, so I knew this. I had long believed, as GP does, that copying anything you own without distribution is either allowed or fair use. I wanted GP to learn as I did.


For reference, here's the US legal code in question:

Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

The spirit seems apparent, but in practice it's been used by awful people to destroy lives and extract rent from artists and authors in damn near tyrannical ways.


Except you said "You can't make archival copies." and didn't provide a citation. That's quite a different claim than "there exists no precedent clearly establishing your right or lack thereof to make archival copies".


Congress expressly granted archival rights for digital media. If they wanted to do the same for books they could've done so. There's no law or legal precedent allowing it.

Given all this, "can't do it" is more probably accurate than "can do it". IANAL but it's not like the question is finely balanced on a knife's edge and could go either way.


Congress didn't explicitly disallow it either. You left that bit out. As such it comes down to interpretation of the existing law. We both clearly agree that doesn't (yet) exist.

> IANAL but it's not like the question is finely balanced on a knife's edge and could go either way.

I agree, but my interpretation is opposite yours. It seems fairly obvious to me that the spirit of the law permits personal copies. That also seems to be in line with (explicitly legislated) digital practices.

But at the end of the day the only clearly correct statement on the matter is "there's no precedent, so we don't know". I suppose it's also generally good advice to avoid the legal quagmire if possible. Being in the right is unlikely to do you any good if it bankrupts you in the process.


> Congress didn't explicitly disallow it either.

That's the whole point of copyright: only the owner of a copyright has the right to make copies. I don't see how it can be more explicit than that. It's a default-deny policy.

There is an archival exception for digital media, so obviously Congress is open to granting exceptions for backup purposes. They chose not to include physical media in this exception.


> only the owner of a copyright has the right to make copies.

You are conveniently omitting the provisions about fair use, which is strange since you're clearly aware of them. The only things copyright is reasonably unambiguous about are sale and distribution. Even then there are lots of grey areas, such as performance rights.

You are arguing that something is obviously disallowed but have nothing but your own interpretation to back that up. If the situation was as clear cut as you're trying to make out then where is the precedent showing that personal use archival copies of physical goods are not permitted?

> They chose not to include physical media in this exception.

That's irrelevant to the current discussion, though I'm fairly certain you realize that. Congress declined to weigh in on the matter which (as you clearly know) leaves it up to the courts to interpret the existing law.


> You are conveniently omitting the provisions about fair use, which is strange since you're clearly aware of them

Fair use didn't come up but I did mention it here: https://news.ycombinator.com/item?id=43356289. And there's no need for that tone. I'm not a copyright defender.

> That's irrelevant to the current discussion, though I'm fairly certain you realize that.

I said it because it was relevant.

> where is the precedent showing that personal use archival copies of physical goods are not permitted

> Congress declined to weigh in on the matter

There was no "matter" to "weigh in on". The answer to "Can you make a full, complete copy of a copyrighted work without authorization?" has been "Almost always no" from the beginning of copyright. Even the term "fair use" arose in a US legal precedent over a century after the first copyright laws in England. It became an actual part of US copyright law in the 1970s, less than 50 years ago.

"Fair use" is plausible for a library or archive to make full copies, and keep them safe and archived, since that's their job.

Fair use isn't why we have archival rights for electronic media. That right was written into the law after electronic media became a thing.

In my comment above I gave one example why "fair use" wouldn't work for archival copies of physical media made by individuals. An actual lawyer who's paid by the copyright mafia to care about this stuff can surely find more and stronger reasons.

FWIW someone in another comment pointed out Australian copyright law allows making a copy of books, newspapers, and periodicals for personal, domestic use. Which means: a) it can be done and b) even they had to spell it out specifically

> which (as you clearly know) leaves it up to the courts to interpret the existing law.

I don't agree but believe what you like.


I take the contrary view.

What part of fair use pertains to making a physical copy of the complete work?


You can make copies of things. You just can’t distribute them.


You're repeating upthread comments. And no, you can't. There's an archival exception for electronic media. If you want to make copies of physical media you either:

1. Can't

Or

2. Rely on fair use to protect you (archival by individuals isn't necessarily fair use)


It absolutely is fair use to copy a book for your personal archives.

The fair use criteria consider whether the use is commercial in nature (in this case it is not) and “the effect of the use upon the potential market for or value of the copyrighted work,” which for a personal copy of a personally owned book is nonexistent.

https://www.law.cornell.edu/uscode/text/17/107

You would get laughed at by the legal system trying to prosecute an individual owner for copying a book they bought just to keep.


> It absolutely is fair use to copy a book for your personal archives.

There's no legal precedent for this. See https://news.ycombinator.com/item?id=43356042

> the effect of the use upon the potential market for or value of the copyrighted work

A copyright holder's lawyer would argue that having and using a photocopy of a book keeps the original from wearing out. This directly affects the potential market for the work, since the owner could resell the book in mint condition, after reading and burning their photocopies.

> You would get laughed at by the legal system trying to prosecute an individual owner for copying a book they bought just to keep.

I mean maybe this is true. But the affected individual will have a very bad year and spend a ton of money on lawyers.


> No legal precedent

Why do you interpret this to mean "absolutely can't do this"? "No precedent" seems to equally support both sides of the argument (that is, it provides no evidence; courts have not ruled). The other commenters' arguments on the actual text of the statute seem more convincing to me than what you have so far provided.


I was responding to https://news.ycombinator.com/item?id=43356240 which said it "absolutely is fair use".

> The other commenters arguments...seem more convincing

Because you (and I) want it to be fair use. But as I already said in my comment, it potentially fails one leg of fair use. Keeping your purchased physical copy of the book pristine and untouched while you read the photocopy allows you to later, after destroying the copies you made, resell the book as new or like-new. This directly affects the market for that book.

Do you want to spend time and money in court to find out if it's really fair use? That's what "no precedent" means.


Multiple times in this thread you make the very confident assertion that this is not allowed, and that it is only allowed for electronic media. That is your opinion, which is fine. The argument that it is fair use is also an opinion. Until it becomes settled law with precedent, every argument about it will just be opinion on what the text of the law means. But you are denigrating the other opinions while upholding your own as truth.

And whether or not I am personally interested in testing any of these opinions is completely beside the point.


Copyright is a restriction on making unauthorized, full copies under almost all circumstances. Default deny. There's only one documented exception on the books right now which is electronic media. None of these are opinions.

The idea that photocopying a book for archival purposes is potentially fair use is an untested opinion. I'm not denigrating that opinion. I just think it's likely to fail as a legal argument in the unlikely event that it comes up. I'm not a copyright apologist.

I myself believed the "fair use for archival"/"format shifting" thing applied to all works for most of my life. I only learned there was no law or precedent like 10 days ago.


> Do you want to spend time and money in court to find out if it's really fair use?

No. I'd much rather pirate the epub followed by lobbying for severe IP law reform. (Of course by "lobby" I actually mean complain about IP law online.)

If there's no epub available then I guess it's time to get building. (https://linearbookscanner.org/)


You’re now arguing the assumption that without a precedent you don’t have the right to do something. That’s not how the law works. If you believe that the courts would laugh at a publisher trying to bring suit against you for doing this, then you believe you have the right to do it.

Such a case would not require a year or a ton of money to defend. In fact, the potential damages would be so small that you could sensibly do it in small claims court.


Copyright law is explicitly outside the scope of small claims and consumer tribunal systems.


> You’re now arguing the assumption without a precedent you don’t have the right to do something. That’s not how the law works.

I mean copyright law has always been "You can't make full copies for any reason (almost)". And you were the one saying "it absolutely is fair use [to make full personal copies]", which is quite a strong statement to make in the absence of precedent.

An archive could argue fair use to make full copies of physical works, because that's their role, and by keeping the copies locked away they don't harm the market for the works. These fair use factors don't apply to individuals. But IANAL and maybe that's wrong, who knows? I do know if it comes up the copyright mafia will fight it tooth and nail, and I'd put my money on them winning.

> the potential damages would be so small that you could sensibly do it in small claims court

The publisher would sue the infringer in small claims court? This seems very unlikely since the publisher would prefer to scare or bankrupt you into submission.

Or would the defendant have the lawsuit moved to small claims court? Are defendants allowed to do this?


You may copy, but you may not circumvent the copy protection.


Correct. For electronic media.


I'm moving the goalposts here, since it was not OpenAI (as far as we know): where does Meta training on torrented data fit into this?


That's not a one-to-one analogy. The LLM isn't giving you the book, it's giving you information it learned from the book.

The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?", and under US copyright law, the answer is: Yes.


> The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?"

The analogous scenario is actually "Can I read a book that I obtained illegally and face no consequences for obtaining it illegally?" The answer is "Yes": there are no consequences for reading said book, for individuals or machines.

But individuals can face serious consequences for obtaining it illegally. And corporations are trying to argue those consequences shouldn't apply to them.


> But individuals can face serious consequences for obtaining it illegally.

Can they? Who has ever faced serious consequences for pirating books in the US?


https://en.wikipedia.org/wiki/Aaron_Swartz

(Please no pedantry about how scientific papers aren't books)


Not to diminish the atrocity of what happened to Aaron, but is this a highly abnormal case of prosecutorial overzeal, or is it common for people to be charged and held liable for downloading and/or consuming (without distribution) copyrighted materials (in any form) without obtaining a license?

Asking because I genuinely don't know. I believe all I've ever read about prosecution of "commonplace" copyright violations was either about distributors or tied to the bidirectional nature of peer-to-peer exchange (torrents typically upload to others even as you download = redistribution).


Aaron Swartz downloaded a lot of stuff. Did he publish the stuff too? That would be an infringement. But only downloading the stuff? And never distributing it? Not sure if it’s worth a violation.


>Aaron Swartz downloaded a lot of stuff.

A tiny fraction compared to the 80+ terabytes Facebook downloaded.

>Did he publish the stuff too?

No.

> Not sure if it’s worth a violation.

Exactly.


There's no analogy here, because the scale of it takes it to a whole different level and degree, and for all intents and purposes we tend to care about level and degree.

Me taking over control of the lemonade market in my neighbourhood wouldn't ever be a problem to anyone, just a very minor annoyance; instead, if I managed to corner the lemonade market of a whole continent, it'd be a very different thing.


The better analogy is "can my business use illegally downloaded works to save on buying a license?" For example, can you use a pirated copy of Windows in your company? Can you use a pirated copy of a book to compute the weights of a mathematical model?


> Can I download a book without paying for it, and print copies of it?

No, but you can read a book, learn its contents, and then write and publish your own book to teach the information to others. The operation of an AI is rather closer to that than it is to copyright violation.

"Should" there be protections against AI training? Maybe! But copyright law as it stands is woefully inadequate to the task, and IMHO a lot of people aren't really treating with this. We need a functioning government to write well-considered laws for the benefit of all here. We'll see what we get.


But I can't legally obtain the book to read and learn from without me (or a library) paying for it. Let's start there first.


Yes, but the learning isn't constrained by those laws. If I steal a book and read it, I'm guilty of the crime of theft. You can put me in jail, try me before a jury, fine me, and put me in prison according to whatever laws I broke.

Nothing in my sentence constrains my ability to teach someone else the stuff I learned, though! In fact, the first amendment makes it pretty damn clear that nothing can constrain that freedom.

Also, note that the example is malformed: in almost all these cases, Meta et al. aren't "stealing" anything anyway. They're downloading and reading stuff on the internet that is available for free. If you or I can't be prosecuted for reading a preprint from arXiv.org or whatever, it's a very hard case to make that an AI can be.

Again, copyright isn't the tool here. We need better laws.


Sure, but OpenAI (same as Google, and Facebook, and all the others) is illegally copying the book, and they want this to be legal for them.

It's perhaps arguable whether it's OK for an LLM to be trained on freely available but licensed works, such as the Linux source code. There you can get in arguments about learning vs machine processing, and whether the LLM is a derived work etc

But it's not arguable that copying a book that you have not even bought to store in your corporate data lake to later use for training is a blatant violation of basic copyright. It's exactly like borrowing a book from a library, photocopying it, and then putting it in your employee-only corporate library.


> copyright isn't the tool here

It's not the only tool. I agree that "use for ML" should be an additional right.

What people are pissed about is that copyright only ever serves to constrain the little guys.

> If I steal a book and read it, I'm guilty of the crime of theft

You or I would never dare to do this in the first place.


> Meta et al. aren't "stealing" anything anyway

They were caught downloading the entirety of libgen.


One thing is downloading a pirated copy and reading it for yourself, and another thing is running a business based on downloading millions of pirated works.


If you buy it


No, even if I steal it. I can teach you anything I know. Congress shall make no law abridging the freedom of speech, as it were.


Yes, but this is not the right model. What OpenAI wants is to borrow a book, make a copy of it, and keep using that copy, in training their models. This is fully and simply illegal, under any basic copyright law.



Is the book online and accessible to your eyeballs through your open standards client tool, such that you can learn from seeing it?


Let's say Windows is downloadable from the Microsoft website. Can you use it for free in your company to save on buying a license? Is it OK to use illegal copies of works in a business?


Most books aren't. Unless you pay for them.


To the extent that this is how libraries function, yes.

The part of that which doesn't apply is "print copies", at least not complete copies, but libraries often have photocopiers in them for fragments needed for research.

AI models shouldn't do that either, IMO. But unlimited complete copies is the mistake the Internet Archive made, too.


I missed the part where OpenAI got library cards for all the libraries in the world.

Is having a library card a requirement for being hired over there?


I missed the part where we throw away rational logic skills

Have you never been to a public library and read a book while sitting there without checking it out? Clearly, age is a factor here, and us olds are confused by this lack of understanding of how libraries function. I did my entire term paper without ever checking out books from the library. I just showed up with my stack of blank index cards, then left with the necessary info written on them. I did an entire project on tracking stocks by visiting the library and viewing all of the papers for the days in one sitting, rather than being a schmuck and tracking it daily. Took me about an hour in one day. No library card required.

Also, a library card is ridiculously cheap even if you did decide to have one.


> Have you never been to a public library and read a book while sitting there without checking it out?

See my comment here: https://news.ycombinator.com/item?id=43355723. If OpenAI built a robot that physically went into libraries, pulled books off shelves by itself, and read them...that's so cool I wouldn't even be mad.


What about checking out eBooks? If you had an app that checked those out and scanned them at robot speed vs. human speed, that would be the same thing. The idea that reading something that does not belong to you directly means stealing is just weird and very strained.

theGoogs essentially did that by having a robot that turned each page and scanned the pages. That's no different than having the librarian pull material for you so that you don't have to pull the book from the shelf yourself.

There's better arguments to make on why ClosedAI is bad. Reading text it doesn't own isn't one of them. How they acquired the text would be a better thing to critique. There's laws for that in place now that does not require new laws to be enacted.


> If you had an app that checked those out and scanned them

You mean...made a copy? Do you really not see the problem?

> How they acquired the text would be a better thing to critique

Well...yeah that's what I said in the comment that started this discussion branch: https://news.ycombinator.com/item?id=43355147

This isn't about humans or robots reading books. It's that robots are allowed to violate copyright law to read the books, and us humans are not.


> You mean...made a copy? Do you really not see the problem?

In precisely the same way as a robot scanning a physical book is.

If this is turned into a PDF and distributed, it's exactly the legal problem Google had[0] and that Facebook is currently fighting due to torrenting some of their training material[1].

[0] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,...

[1] https://news.ycombinator.com/item?id=43125840

If the tokens go directly into training an AI and no copies are retained, that's like how you as a human learn — except current AI models are not even remotely as able to absorb that information as you, and they only make up for being as thick as a plank by being stupid very very quickly.

> It's that robots are allowed to violate copyright law to read the books, and us humans are not.

More that the copyright laws are not suited to what's going on. Under the copyright laws, statute and case law, that existed at the time GPT-3.5 was created, bots were understood as the kind of thing Google had and used to make web indexes — essentially legal, with some caveats about quoting too much verbatim from news articles.

(Google PageRank being a big pile of linear algebra and all, and the Transformer architecture from which ChatGPT gets the "T" being originally a Google effort to improve Google Translate).

Society is currently arguing amongst itself if this is still OK when the bot is a conversational entity, or perhaps even something that can be given agency.

You get to set those rules via your government representative, and make it illegal for AI crawlers to read the internet like that — but it's hard to change the laws if you mistake what you want the law to be for what the law currently is.


But you keep saying to read the books. There is no copyright violation in reading a book. Making copies starts to get into murky ground, but does not immediately mean breaking the law.


You might be thinking of someone else.


If I spent every last second of my life in a public library, I couldn't even view a fraction of the information that OpenAI has ingested. The comparison is irrelevant. To make the comparison somehow valid, I'd have to back up my truck to a public library, steal the entire contents, then start selling copies out of my garage.


Look, even I'm not a fan of ClosedAI, but this is ridiculous. ClosedAI isn't giving copies of anything. It is giving you a response it infers based on things it has "read" and/or "learned" by reading content. Does ClosedAI store a copy of the content it scrapes, or does it immediately start tokenizing it or whatever is involved in training? If they store it, that's a lot of data, and we should be able to prove that sites were scraped through the lawsuit discovery process. Are you then also suggesting that ClosedAI will sell you copies of that raw data if prompted correctly?
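
For what it's worth, tokenizing by itself doesn't throw any information away; it's a lossless, reversible encoding. A minimal sketch using OpenAI's open-source tiktoken tokenizer (the example string and encoding choice are just for illustration):

    import tiktoken

    # Tokenization maps text to integer IDs and back, losslessly.
    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
    ids = enc.encode("Does tokenizing count as storing a copy?")
    print(ids)               # a short list of integer token IDs
    print(enc.decode(ids))   # round-trips to the exact original text

So a tokenized corpus is still a fully recoverable copy of the text; the harder question is what the trained weights retain.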

I'm in no way justifying anything about GPT/LLM training. I'm just calling out that these comparisons are extremely strained.


Let's say OpenAI developers use an illegal copy of Windows on their laptops to save on buying a license. Is it OK to run a business this way?

Also, I think it is a different thing when someone uses copyrighted works for research and publishing a paper versus when someone uses copyrighted works to earn money.


I don't need a card to read in the library, nor to use the photocopiers there, but it's merely one example anyway. (If it wasn't, you'd only need one library, any of the deposit libraries will do: https://en.wikipedia.org/wiki/Legal_deposit).

You also don't need permission, as a human, to read (and learn from) the internet in general. Machines, by standard practice, do require such permission, hence robots.txt. OpenAI's GPTBot complies with robots.txt, and the company gives advice to web operators about how to disallow their bot.
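
Per OpenAI's published guidance, a site that wants to opt out of GPTBot crawling adds a standard robots.txt entry along these lines:

    User-agent: GPTBot
    Disallow: /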

How should AI be treated: more like a search index, or more like a mind that can learn by reading? Not my call. It's a new thing, and laws can be driven by economics or by moral outrage, and in this case those two driving forces are at odds.


We started with libraries and books; now you're moving the goalposts to websites.

Sidenote: I wouldn't even be mad if OpenAI built robots to go into all of the libraries and read all of the books. That would be amazing!


I started with libraries. OpenAI started with the internet.

The argument for both is identical; your objection is specific to libraries.

IIRC, Google already did your sidenote. Or started to; they may have had legal issues.


> The argument for both is identical

How so? I don't have to pay to read most websites. To read most books I have to pay (or a library has to pay and I have to wait to get the book).

> IIRC, Google already did your sidenote

Not quite. They had to chop the spines off books and have humans feed them into scanners. I'm talking about a robot that can walk (or roll) into a library, use arms to take books off the shelves, turn the pages and read them without putting them into a scanner.


They had humans turn the pages of intact books in scanning machines. The books mostly came from the shelves of academic libraries and were returned to the shelves after scanning. You can see some incidental captures of hands/fingers in the scans on Google Books or HathiTrust (the academic home of the Google Books scans). There are some examples collected here:

https://theartofgooglebooks.tumblr.com/


> How so? I don't have to pay to read most websites. To read most books I have to pay (or a library has to pay and I have to wait to get the book).

"or" does a lot of work, even ignoring that I'd already linked you to a page about deposit libraries: https://en.wikipedia.org/wiki/Legal_deposit

Fact is, you can read books for free, just as you can read (many but not all) websites for free. And in both cases you're allowed to use what you learned without paying ongoing licensing fees for having learned anything from either, and even to make money from what you learn.

> Not quite. They had to chop the spines off books and have humans feed them into scanners.

Your statement is over 20 years out of date: https://patents.google.com/patent/US7508978B1/en


> Can I download a book without paying for it

Yes, you can read books without paying, if that's how it is offered.

And you can photocopy books you own for your own personal use. But again... the analogy is remembering/learning from a book.


Owning a copy and learning the information are not the same. You can learn 2+2=4 from a book, but you no longer need that book to get that answer. Each year in school, I was issued a book for class, learned from it, and returned the book. I did not return the learning.

Musicians can read sheet music and memorize how to play it, and no longer need the printed music. They still have the information.


But you still need to buy the sheet music first; the AI labs used pirated materials to learn from.

There are two angles to the lawsuits that are getting confused. The largest one, from the book publishers (Sarah Silverman et al.), attacked from the angle that the models could reproduce copyrighted information. This was pretty easily quelled / RLHF'd out (it used to be that if ChatGPT started producing lyrics, a supervisor/censor would just cut off its response early; trying it now, ChatGPT.com is more eloquent: "Sorry, I can't provide the full lyrics to "Strawberry Fields Forever" as they are copyrighted. However, I can summarize the song or discuss its themes, meaning, and history if you're interested!")

But there's also the angle of "why does OpenAI have Sarah Silverman's book on their hard drive if they never paid her for it?" This is the lawsuit against Meta regarding Books3 and torrenting; it seems like they're getting away with "we never redistributed/seeded!", but it's unclear to me why that is a defense against copyright infringement.


Not only would the musician have to buy the sheet music first, but if they were going to perform that piece for profit at an event or on an album they'd need a license of some sort.

This whole mess seems to be another case of "if I can dance around the law fast enough, big enough, and with enough grey areas then I can get away with it".


I was handed sheet music every year in band, and within a few weeks had it memorized. Books with music are also available in the library.


As a student in a school band that debated whether to choose Pirates of the Caribbean vs Phantom of the Opera for our half time show, I remember the cost of the rights to the music was a factor in our decision.

The school and library purchased the materials outright. Again, OpenAI, Meta, et al. never paid to read them, nor borrowed them from an institution that had any right to share them.

I'm a bit of an anti-intellectual-property anarchist myself, but it grinds my gears that, given that we do live under the law, it is applied unequally.


>Can I download a book without paying for it

If you have evidence that OpenAI is doing this with books that are not freely available, I'm sure the publishers would absolutely love to hear about it.


We know Meta has done it. These companies have torrented or downloaded books that they did not pay for. Sources like The Pile, LibGen, and Anna's Archive were scraped to build these models.


>if you have evidence that openAI is doing this with books that are not freely available, i'm sure the publishers would absolutely love to hear about it.

Lol, so why are OpenAI challenging these laws?


Do you think OpenAI used fewer sources than Meta?


When it comes to real people, they get sued into oblivion for downloading copyrighted content, even for the purpose of learning. But when Facebook and OpenAI do it, at a much larger scale, suddenly the laws must be changed.



Swartz wasn’t “downloading copyrighted content…for the purpose of learning,” he was downloading with the intent to distribute. That doesn’t justify how he was treated. But it’s not analogous to the limited argument for LLMs that don’t regurgitate the copyrighted content.


It does apply to people? When you read a copy of a book, you can't be sued for making a copy of the book in the synapses of your brain.

Now, if you have eidetic memory and write out large chunks of the book from memory and publish them, that's what you could be sued for.


This is not about memory or training. The LLM training process is not being run on books streamed directly off the internet or from real-time footage of a book.

What these companies are doing is:

1. Obtain a free copy of a work in some way.

2. Store this copy in a format that's amenable to training.

3. Train their models on the stored copy, months or years after step 1 happened.

The illegal part happens in steps 1 and/or 2. Step 3 is perhaps debatable - maybe it's fair to argue that the model is learning in the same sense as a human reading a book, so the model is perhaps not illegally created.

But the training set that the company is storing is full of illegally obtained or at least illegally copied works.

What they're doing before the training step is exactly like building a library by going with a portable copier into bookshops and creating copies of every book in that bookshop.


But making copies for yourself, without distributing them, is different than making copies for others. Google is downloading copyrighted content from everywhere online, but they don't redistribute their scraped content.

Even web browsing implies making copies of copyrighted pages; we can't tell the copyright status of a page without loading it, at which point a copy has been made in memory.


Making copies of an original you don't own/didn't obtain legally is not fair use. Also, this type of personal copying doesn't apply to corporations making copies to be distributed among their employees (it might apply to a company making a copy for archival, though).


> But making copies for yourself, without distributing them,

If this was legal, nobody would be paying for software.


> When you read a copy of a book

They're not talking about reading a book FFS. You absolutely can be sued for illegally obtaining a copy of the book.


> when it comes to real people, they get sued into oblivion for downloading copyrighted content, even for the purpose of learning.

Really? Or do they get sued for sharing, as in republishing without transformation? Arguably, a URL providing copyrighted content is you offering a Xerox machine.

It seems most "sued into oblivion" cases are the reshare problem, not the get-one-for-myself problem.


This is why I think my array of hard drives full of movies isn't piracy. My server just learned about those movies and can tell me about them, is all. Just like a person!


These AI models are just obviously new things. They aren’t people, so any analogy about learning from the training material and selling your new skills is off base.

On the other hand, they aren’t just a copy of the training content, and whether the process that creates the weights is sufficiently transformative as to create a new work is… what’s up for debate, right?

Anyway I wish people would stop making these analogies. There isn’t a law covering AI models yet. It is a big industry at this point, and the lack of clarity seems like something we’d expect everybody (legislators and industry) to want to rectify.


A model cannot "learn" because it is not a human. What happens is that a human obtains "a free copy" of a copyrighted work, processes it using a machine, and sells the result.


> Model cannot "learn" because it is not a human.

Sure, that’s why don’t like the analogy.

> What happens is a human obtains "a free copy" of a copyrighted work, processes it using a machine and sells the result.

Right, so for example it is pretty common to snip small bits out of songs and use them in other songs (sampling). Maybe that could be an example of somewhere to start? But these ML models seem quite different, I guess because the "samples" are much smaller and usually not individually identifiable. And really the model encodes information about trends in the sources... I dunno. I still think we need a new law.


Totally agree. Except the current administration probably will interpret things the way they see fit ...


> just like a model could

Not really. You can't multiply yourself a million times to produce content at an industrial scale.


Can I pirate books to train myself?


If models can learn for free, then the models (training code, inference code, training data, weights) should also be free. No copyright for anybody.

And if you sell the outputs of your model that you trained on free content, you shouldn't be able to hide behind trade secret.


> just like a model could

It is not remotely the same, the companies training the models are stealing the content from the internet and then profiting from it when they charge for the use of those models.


> the companies training the models are stealing the content from the internet

Are you stealing a billboard when you see and remember it?

The notion that consuming the web is "stealing" needs to stop.


The question is whether it destroys the incentive to produce the work. That is the entire point of copyright and patent law.

LLMs do indeed significantly reduce the incentive to produce original work.


Are you stealing when using a pirated software to run a billion-dollar business?


We are not talking about billboards here; we are talking about copyrighted works, like books. If you want to do mental gymnastics and call downloading books without paying for them "consuming" the web, then go ahead, but don't pretend the rest of us will buy your delusion.


On the contrary, even telling people which billboards are posted about what, and how to get to them to look at them, is "how it works".

But the courts will get to clarify (in today's news):

https://www.reuters.com/legal/news-corp-sued-by-brave-softwa...


The more literature I consume, and the more I re-draft my own attempt, the more I see the patterns and tropes, with everyone standing on the shoulders of those who came before.

The general concept of "warp drive" was introduced by John W. Campbell in 1957, "Islands of Space". Popularised by Trek, turned into maths by Alcubierre. Islands of Space feels like it took inspiration from both H G Wells (needing to explain why the War of the Worlds' ending was implausible) and Jules Verne (gang of gentlemen have call-to-action, encounter difficulties that would crush them like a bug and are not merely fine, they go on to further great adventure and reward).

Terry Pratchett had obvious inspirations from Shakespeare, Ringworld, Faust (in the title!).

In the pandemic I read "The Deathworlders" (web fic, not the book series of similar name), and by the time I'd read too many shark jumps to continue, I had spotted many obvious inspirations besides just the one that gave the name.

If I studied medieval lit, I could probably do the same with Shakespeare's inspiration.


And when I "learn" a verbatim copy of pages of that book, then write those pages out in Microsoft Word and sell them, is that legal?


It doesn't; a real person can't legally obtain a copy of a copyrighted work without paying the copyright holder for it. This is what OpenAI is asking for: they don't want to pay for a single copy of a single book, and yet they want to train their models on every book in history (and every song, movie, painting, code base, and anything else they can get their hands on).


Do you know Numerical Recipes in C?

This discussion reminds me of it.


>you can use that knowledge,

Did OpenAI buy one copy of each book, or did they legally borrow the books and documents?

If you copy-paste from books and claim it is your content, you are plagiarizing. LLMs have been proven to copy-paste trained content, so now what? Should only Big Tech be exempt from plagiarism?


I would assume that the request is for it to apply to models in the way that it currently applies to humans.

If a human buys a movie, he can watch it and learn about its contents, and then talk about those contents, and he can create a similar movie with a similar theme.

If OpenAI buys a movie and shows it to their model, it's unclear whether the model can talk about the contents of the movie and create a similar movie with a similar theme.


Is OpenAI buying the movie, or just taking it?

Since "buying" a movie (as it currently applies to humans) is just buying a limited license to it for private viewing, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

Or OpenAI could buy movies the way Disney does, by buying the actual copyright to the film.


> Since "buying" a movie is just buying a license to it, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

That's exactly what already happens currently. Buying a movie on DVD doesn't give you the right to present it to hundreds of people. You need to pay for a public performance license or a commercial license. This is why a TV network or movie theatre can't just buy a DVD at Walmart and then show the movie as often as it likes.

Copyright doesn't just grant exclusive distribution rights. It grants exclusive use rights as well, and permits the owner to control how their work is used. Since AI rights are not granted by any existing licenses, and license terms generally reserve any rights not explicitly specified, feeding copyrighted works into an AI data model is a reserved right of the owner.


>Since "buying" a movie (as it currently applies to humans) is just buying a limited license to it for private viewing, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

the Reddit data licensing model


Somehow, I suspect OpenAI didn't "buy" all of the articles, books, and websites they crawled and torrented.


OpenAI didn't pay for most of the content it used.


This is basically "allow us to steal others' IP". It's hard not to treat Altman like a common thief.


Even more so, it only applies to initial model training by companies like OpenAI, not to other companies using those models to generate synthetic data to train their own models.


Not only that:

The model gets to train on the data of all humans.

But if you use the model as training data, OAI will say you're infringing its T&Cs.


Yeah, it's crazy. I also suspect they might not be confident in their defense against the NYT lawsuit; if they're found at fault, then it's going to be trouble.


It is hard to see how a court could decide that copyright does not apply to training LLMs without completely collapsing the entire legal structure for intellectual property.

Conceptually, AI basically zeros out existing IP, and makes the AI the only IP that has any value. It is hard to imagine large rights holders and courts accepting that.

The likely outcome is that courts rule against LLM creators/providers and they eventually have to settle on licensing fees with large corporate copyright holders similar to YouTube. Unlike YouTube though, this would open up LLM companies to class action lawsuits from the general public, and so it could be a much worse outcome for them.


Are there certain books that federal law prevents you from reading? Which ones?

Maybe terrorist manuals and some child pornography, but what else?




The pod was good apart from starting/spreading the rumor that high numbers of “bill to Singapore” orders were evidence that China was circumventing GPU import bans.


Don't look at it as such, mayhaps; look at it literally as a question of who will have GPU dominance in the future. (Obviously, whoever hits qubits at scale wins... but we are at this scale now, and control is currently set by policy, then bits, then qubits.)

Remember, we are witnessing the "Wouldn't it be cool if...?" cyberpunk manifestations of our cyberpunk readings of youth.

(I built a bunch of shit that spied on you because I read Neuromancer and thought "wouldn't it be cool if...", and then I helped build Ono-Sendai throughout my career...)


Can you expand your post and explain why?


Must be the rumours that DeepSeek has millions' worth of GPUs, I think, and that their claim of relatively cheap training is a psyop.


They probably meant "freedom to learn [through backpropagation]".

Companies like this were allowed to siphon the free work of billions of people over centuries, and they still want more.


Just in case, here's the list of these labels:

- UMG Recordings, Inc.

- Capitol Records, LLC

- Concord Bicycle Assets, LLC

- CMGI Recorded Music Assets LLC

- Sony Music Entertainment

- Arista Music

Taken from: https://cdn.arstechnica.net/wp-content/uploads/2025/03/UMG-v...


With long context sizes, AGI is not useless without vast knowledge. You could always put a bootstrap sequence into the context (think Arecibo Message), followed by your prompt. A general enough reasoner, with enough compute, should be able to establish the context and reason about your prompt.


Yes, but that just effectively recreates the pretraining. You're going to have to explain everything down to what an atom is, and essentially all human knowledge if you want to have any ability to consider abstract solutions that call on lessons from foreign domains.

There's a reason people with comparable intelligence operate at varying degrees of effectiveness, and it has to do with how knowledgeable they are.


Would that make in-context learning a superset or a subset of pretraining?

This paper claimed transformers learn a gradient-descent mesa-optimizer as part of in-context learning, while being guided by the pretraining objective, and as the parent mentioned, any general reasoner can bootstrap a world model from first principles.

[0] https://arxiv.org/pdf/2212.07677


> Would that make in-context learning a superset or a subset of pretraining?

I guess a superset. But it doesn't really matter either way. Ultimately, there's no useful distinction between pretraining and in-context learning. They're just an artifact of the current technology.


Isn't knowledge of language necessary to decode prompts?


0 1 00 01 10 11 000 001 010 011 100 101 110 111 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110

And no, I don't think knowledge of language is necessary. To give a concrete example, tokens from the TinyStories dataset (~1GB in size) are known to be sufficient to bootstrap basic language.
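
For concreteness, the sequence above is just all binary strings in shortlex order; a few lines of Python generate it (a minimal sketch, nothing model-specific):

  from itertools import count, product

  def shortlex_binary():
      # Enumerate all binary strings: shortest first, then lexicographically.
      for length in count(1):
          for bits in product("01", repeat=length):
              yield "".join(bits)

  gen = shortlex_binary()
  print(" ".join(next(gen) for _ in range(29)))
  # 0 1 00 01 10 11 000 001 ... 1110 (the sequence quoted above)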


I agree, they are only starting the data flywheel there. And at the same time making users pay $200/month for it, while the competition is only charging $20/month.

And note, the system is now directly competing with "interns". Once the accuracy is competitive (is it already?) with an average "intern", there'd be fewer reasons to hire paid "interns" (more expensive than $200/month). Which is maybe a good thing? Fewer kids wasting their time/eyes looking at the computer screens?


The interns of today are tomorrow's skilled scientists.


Just FYI: They did roll out Deep Research to those of us on the $20/mo tier at (I think) about the same time you made this comment.


The approach of "cutting funding and then observing whether anything critical fails or is impacted" only works if outcomes follow a normal distribution.

This is far from the case — many areas are characterized by heavy-tailed loss distributions, where extreme negative consequences could really ruin the day and erase any efficiency gains.
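
A toy illustration of the difference (my own sketch with made-up numbers, not real budget data): sample means are stable under a normal loss but erratic under a heavy-tailed one, so short observation windows can look fine right up until a rare event dominates:

  import numpy as np

  rng = np.random.default_rng(0)

  # 1000 independent "observation windows" of 100 outcomes each.
  normal_losses = rng.normal(loc=1.0, scale=0.3, size=(1000, 100))
  pareto_losses = rng.pareto(a=1.5, size=(1000, 100))  # heavy tail, infinite variance

  # Spread of the observed window means: small and stable vs. large and erratic.
  print("normal:", normal_losses.mean(axis=1).std())
  print("pareto:", pareto_losses.mean(axis=1).std())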


It should be possible to learn to reason from scratch. And the ability to reason in a long context seems to be very general.


How does one learn reasoning from scratch?

Human reasoning, as it exists today, is the result of tens of thousands of years of intuition slowly distilled down to efficient abstract concepts like "numbers", "zero", "angles", "cause", "effect", "energy", "true", "false", ...

I don't know what reasoning from scratch would look like without training on examples from other reasoning beings. As human children do.


There are examples of learning reasoning from scratch with reinforcement learning.

Emergent tool use from multi-agent interaction is a good example - https://openai.com/index/emergent-tool-use/


Now you are asking for a perfect modeling of the system. Reinforcement learning works by discovering boundaries.


Now rediscover all the plants that are and aren't poisonous to most people.


I've suggested that the long context should be included in the prompt.

In your particular case the prompt would look something like: <pubmed dump> what are the plants that aren't poisonous to most people?

A general reasoner would recover language and a relevant world model from the PubMed dump, and then proceed to reason about it to perform the task.

It doesn't look like a particularly efficient process.


Actually, I also think it's possible. Start with a natural-numbers axiom system. Form all valid sentences of increasing length. Use RL on a model to search for counterexamples or proofs. On a sufficient computer, this should produce superhuman math performance (efficiency), even at compute parity.
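
As a toy version of the first two steps, with brute-force counterexample search standing in for the RL-guided proof search (everything here is illustrative; a real system would manipulate formal proofs, not Python eval):

  ATOMS = ["0", "1", "n"]
  OPS = ["+", "*"]

  def exprs(depth):
      # All arithmetic expressions over {0, 1, n} up to a nesting depth.
      if depth == 0:
          yield from ATOMS
          return
      yield from exprs(depth - 1)
      for op in OPS:
          for a in exprs(depth - 1):
              for b in exprs(depth - 1):
                  yield f"({a} {op} {b})"

  def survives(lhs, rhs, trials=100):
      # "forall n: lhs == rhs", tested by counterexample search, not proof.
      return all(eval(lhs, {"n": n}) == eval(rhs, {"n": n}) for n in range(trials))

  # Form candidate sentences in order of increasing length, keep survivors.
  pairs = sorted({(a, b) for a in exprs(1) for b in exprs(1) if a < b},
                 key=lambda p: len(p[0] + p[1]))
  for lhs, rhs in pairs:
      if survives(lhs, rhs):
          print(f"conjecture: forall n, {lhs} = {rhs}")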


I wonder how much discovery in math happens as a result of lateral-thinking epiphanies. E.g.: a mathematician is trying to solve a problem, their mind is open to inspiration, and something in nature, or their childhood, or a book synthesizes with their mental model and gives them the next node in their mental graph that leads to a solution and advancement.

In an axiomatic system, those solutions are checkable, but how discoverable are they when your search space starts from infinity? How much do you lose by disregarding the gritty reality and foam of human experience? It provides inspirational texture that helps mathematicians in the search at least.

Reality is a massive corpus of cause and effect that can be modeled mathematically. I think you're throwing the baby out with the bathwater if you want to be able to do math in a vacuum. Maybe there is a self-optimization spider that can crawl up the axioms and solve all of math. I think you'll find that you can generate new math infinitely, and reality grounds it and provides the gravity to direct efforts towards things that are useful, meaningful, and interesting to us.


As I mentioned in a sister comment, Gödel's incompleteness theorems also throw a wrench into things, because you will be able to construct logically consistent "truths" that may not actually exist in reality. At which point, your model of reality becomes decreasingly useful.

At the end of the day, all theory must be empirically verified, and contextually useful reasoning simply cannot develop in a vacuum.


Those theorems are only relevant if "reasoning" is taken to its logical extreme (no pun intended). If reasoning is developed/trained/evolved purely in order to be useful and not pushed beyond practical applications, the question of "what might happen with arbitrarily long proofs" doesn't even come up.

On the contrary, when reasoning about the real world, one must reason starting from assumptions that are uncertain (at best) or even "clearly wrong but still probably useful for this particular question" (at worst). Any long and logic-heavy proof would make the results highly dubious.


A question is: what algorithms does the brain use to make these creative lateral leaps? Are they replicable?

Unless the brain is using physics that we don’t understand or can’t replicate, it seems that, at least theoretically, there should be a way to model what it’s doing with silicon and code.

States like inspiration and creativity seem to correlate in an interesting way with ‘temperature’, ‘top p’, and other LLM inputs. By turning up the randomness and accepting a wider range of output, you get more nonsense, but you also potentially get more novel insights and connections. Human creativity seems to work in a somewhat similar way.
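
For concreteness, here is roughly what those two knobs do at sampling time (a self-contained toy over made-up logits; the parameter names mirror common LLM APIs, and this is not any vendor's actual implementation):

  import numpy as np

  rng = np.random.default_rng(0)

  def sample(logits, temperature=1.0, top_p=1.0):
      # Higher temperature flattens the distribution: more randomness.
      probs = np.exp(np.asarray(logits) / temperature)
      probs /= probs.sum()
      # top_p keeps only the smallest set of tokens whose mass reaches top_p.
      order = np.argsort(probs)[::-1]
      cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
      keep = order[:cutoff]
      return rng.choice(keep, p=probs[keep] / probs[keep].sum())

  logits = [3.0, 2.0, 1.0, 0.5, 0.1]  # dummy vocabulary of five tokens
  print([int(sample(logits, temperature=1.5, top_p=0.9)) for _ in range(10)])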



I believe https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_... (Gödel's incompleteness theorems) apply here


Dogs are probably the best example I can think of. They learn through experience and clearly reason, but without a complex language to define abstract concepts. It's very basic reasoning, but they do learn and apply that learning.

To your point, experience is the training. Without language/data to represent human experience and knowledge to train a model, how would you give it 'experience'?


And yet dogs, to a very high degree, just learn the same things. At least the same kinds of things, over and over.

They were pre-designed to learn what they always learn, their minds structured to readily make, as puppies, the same connections that dogs have always needed to survive.

Not so for real reasoning, which, by its nature, does not have a limit.


> just learn the same things. At least the same kinds of things, over and over.

It's easy to train the same things to a degree, but it's amazing to watch different dogs individually learn and reason through things completely differently, even within a breed or even a litter.

Reasoning ability is always limited by the capacity of the thinker to frame the concepts and interactions. It's always limited by definition; we only push that limit farther than other species do, and AGI may eventually push it past our abilities.


There was necessarily a "first reasoning being" who learned reasoning from scratch, and then it's improved from there. Humans needed tens of thousands of years because:

- humans experience reality at a slower pace than AI could theoretically experience a simulated reality

- humans have to transfer knowledge to the next generation every 80 years (in a manner that's very lossy), and around half of each human lifespan is spent learning things that the previous generation already knew


The idea that there was “necessarily a first reasoning being” is neither obvious nor likely.

Reasoning could very well have originally been an emergent property of a group of beings.

The animal kingdom is full of examples of groups being more intelligent than individuals, including in human animals as of today.

It’s entirely possible that reasoning emerged as a property of a group before it emerged in any individual first.


I think you are focusing too much on the fact that a being needs to be an individual organism, which is kind of an implementation detail.

What I wonder instead is whether reasoning is a property that is either there or not there, with a sharp boundary of existence.


The dead organism cannot reason. It's simply survivorship bias. Reasoning evolved like any other survival mechanism.


Whether the first reasoning entity is an individual organism or a group of organisms is completely irrelevant to the original point. If one were to grant that there was in fact a "first reasoning group" rather than a "first reasoning being" the original argument would remain intact.


Did it kill them? y - must be unsafe n - must be safe

Do this continually through generations until you arrive at modern society.


Creating reasoning from scratch is the same task as creating an apple pie from scratch.

First you must invent the universe.


>First you must invent the universe.

That was the easy part, though; figuring out how to handle all the unintended side effects it generated is still an ongoing process. Please sit back and relax while we solve the few incidental events occurring here and there; rest assured, we are putting our best effort toward their resolution.


It is possible to learn to reason from scratch; that's what R1-Zero did, but the resulting chains of thought aren't legible to humans.

To quote DeepSeek directly:

> DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.


If you look at the benchmarks of DeepSeek-V3-Base, it is quite capable, even 0-shot: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#base-mod... This is not from scratch; these benchmark numbers indicate that the base model already had a large number of reasoning/LLM tokens in its pre-training set.

On the other hand, my take is that the ability to reason over a long context is a general capability, and my guess is that it can be bootstrapped from scratch, without training on all of the internet or distilling models that were trained on the internet.


> These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.

But we already know that is the case: the DeepSeek-V3 paper says it was post-trained partly with an internal version of R1:

> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

And DeepSeekMath did a repeated cycle of this kind of thing, mixing 10% of old, previously seen data with newly generated data from the last generation, in a continuous bootstrap.
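
Roughly, the loop looks like this (a sketch under my own reading of that description; generate_and_verify and finetune are hypothetical stand-ins, and only the ~10% replay fraction comes from the description above):

  # Hypothetical stand-ins; a real pipeline would sample solutions, keep
  # only verified ones, and run an actual finetuning step.
  def generate_and_verify(model, data):
      return [f"gen{model}:{d}" for d in data]

  def finetune(model, data):
      return model + 1

  def bootstrap(model, seed_data, rounds=4, replay_frac=0.10):
      # Each round: generate fresh data with the current model, mix in
      # ~10% previously seen data, and train the next generation on it.
      data = list(seed_data)
      for _ in range(rounds):
          new = generate_and_verify(model, data)
          replay = data[: max(1, int(len(data) * replay_frac))]
          data = new + replay
          model = finetune(model, data)
      return model

  print(bootstrap(0, ["problem-1", "problem-2"]))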


Possible? I guess evolution did it over the course of a few billion years. For engineering purposes, starting from the most advanced position seems far more efficient.


I've been giving this a lot of thought over the last few months. My personal insight is that "reasoning" is simply the application of a probabilistic reasoning manifold on an input in order to transform it into constrained output that serves the stability or evolution of a system.

This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.

But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.

Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains.

This doesn't mean a well-trained model can't reason about input it hasn't encountered before, just that the input needs to be in some way causally connected to the same laws which governed the input the manifold was trained on.

I'm sure we could create a fully generalized reasoning manifold which could handle anything, but I don't see how we possibly get that without first considering and encountering all possible inputs. But these inputs still have to have some form of constraint governed by laws that must be learned through sampling, otherwise you'd just be training on effectively random data.

The other commenter who suggested simply generating all possible sentences and training on internal consistency should probably consider Gödel's incompleteness theorems, and that internal consistency isn't enough to accurately model and interpret the universe. One could construct a thought experiment about an isolated brain in a jar with effectively unlimited neuronal connections, but no sensory connection to the outside world. It's possible, with enough connections, that the likelihood of the brain conceiving of true events it hasn't actually encountered does increase meaningfully. But the brain still has nothing to validate against, and can't simply assume that because something is internally logically consistent, that it must exist or have existed.


MMMU is not particularly high. Janus-Pro-7B is at 41.0, which is only 14 points better than random/frequent-choice guessing. I'm pretty sure their base DeepSeek 7B LLM would get around 41.0 on MMMU without access to images; this is a normal number for a roughly GPT-4-level LLM base with no access to images.


ChatGPT o1: https://chatgpt.com/share/678feedb-0b2c-8001-bd77-4e574502e4...

> Thought about large prime check for 3m 52s: "Despite its interesting pattern of digits, 12,345,678,910,987,654,321 is definitely not prime. It is a large composite number with no small prime factors."

Feels like the On-Line Encyclopedia of Integer Sequences (OEIS) would be a good candidate for a hallucination benchmark...


I think firmly marrying LLMs with a symbolic math calculator/database, so they can check things they don't really know "by heart", would go a long way towards making them seem smart.

I really hope Wolfram is working on an LLM that is trying to learn what it means to be a WolframAlpha user.


Can we stop with the "haha, LLMs can't do math" nonsense? You'll one-shot it every time if you tell it to use Python. You're holding it wrong.


Sorry, but this was ChatGPT/o1 with access to code execution (Python), and it spent almost 4 minutes reasoning. It ran a few checks with smaller numbers, all of which failed, and it proceeded to draw a wrong conclusion (with high confidence).


Of course it failed. Tell it to write a program.
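
For what it's worth, the program is a one-liner with a bignum-aware library (sympy here; per the thread above, the palindromic number is in fact prime, which is exactly what o1 got wrong):

  from sympy import isprime

  print(isprime(12345678910987654321))  # True: the number is prime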


Another Gorilla is the schools, teachers and state-approved recommendations, that extend their reach even into private schools.

Imagine my frustration when I discovered one day that my kindergartner had full access to a brand-new, shiny iPad during class. Despite complaints from parents, the teacher refused to reduce iPad usage (or even to activate Screen Distance and Screen Time controls on the iPad, or share usage statistics).

The only thing I've learned is that this is all in line with California's state-approved computer-literacy recommendations.


We specifically decided against the school closest to us because they hand out iPads in first grade. Even though the school is good, convenient, and very well ranked, I don't want my kid to have a tablet until much later. I despise tablets because of their focus on consumption over tinkering and creation, and I think they're a distraction in a classroom that shouldn't be there.

I do give my son access to a computer, but it's based on a MiSTer FPGA running the Amiga core, set up in such a way that he can explore and discover how things work, from a time when computers were still relatively open.


> The only thing that I've learned, this is all in line with California’s state-approved computer literacy recommendations.

That's seriously fucked up!


100% this. Our kids were required to bring laptops to school for no particularly good reason, then allowed to zombie out on them in the library during lunch and free periods. Infuriating.


I understand that it is mostly regulated at the state level. I'm not sure about other states, but the Computer Science Standards for California Public Schools (Kindergarten through Grade Twelve) also tend to be followed by private schools, so they can claim their programs meet state requirements.

This brings computers into the classroom, and once they're available, it is a slippery slope: it is easier for teachers to have students use semi-gamified "educational" apps than to engage the class themselves.

Examples for K-2 (https://www.cde.ca.gov/be/st/ss/documents/csstandards.pdf):

  K-2.CS.1 Select and operate computing devices that perform a variety of tasks accurately and quickly based on user needs and preferences.

  K-2.CS.2 Explain the functions of common hardware and software components of computing systems.

  K-2.CS.3 Describe basic hardware and software problems using accurate terminology.

  K-2.NI.4 Model and describe how people connect to other people, places, information and ideas through a network.

  ...

  K-2.AP.12 Create programs with sequences of commands and simple loops, to express ideas or address a problem

  K-2.IC.20 Describe approaches and rationales for keeping login information private, and for logging off of devices appropriately


Yes, we have similar metrics in NSW (Australia). Agreed on the dynamics. There are also a lot of fairly feral edutech entrepreneurs playing special-interest capture here; they obviously care more about selling their dubious education novelties than any one group cares about keeping them out. So our kids' schools are littered with semi-functioning "smart whiteboards" and a host of broken edutech apps.


There is a sister thread on HN currently asking why people homeschool. Welcome to the conversation.


I wish that "online coupon" price tags in stores would also be banned. I'm talking about those yellow price tags that show lower-than-"Club" prices, which are only valid if you collect a coupon online.

Like the FTC, I estimate that banning these would save U.S. consumers millions of hours they currently spend searching for and clicking on pointless coupons on their phones before making purchases. It would also increase happiness, as it's extremely annoying to pay $20 extra knowing that a lower price is available, if only you'd spent ten minutes struggling with a store's website on your phone.

Whoever invented this is evil and is destroying happiness.

