Japan’s government will not enforce copyrights on data used in AI training (technomancers.ai)
484 points by version_five on May 31, 2023 | 402 comments



I think this should generally be true. The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use. It may produce stuff that violates copyright, but it's the way you use or distribute the product of the model that can violate copyright. Making it write code that's a clone of copyrighted code, or making it make pictures with copyrighted imagery in them, or making it reproduce books, etc., and then distributing the output, would be where the copyright violation occurs.


> The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use.

Lossy or not, the training data provides value. If all the various someones had not spent time making all the stuff that ends up as training data, then the model it trains would not exist.

If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

Note that I'm not talking about what existing copyright law says; I'm talking about how I believe we should be regulating this new facet of the industry.

> Making it write code that's a clone of copyrighted code, or making it make pictures with copyrighted imagery in them, or making it reproduce books, etc., and then distributing the output, would be where the copyright violation occurs.

How is the end-user supposed to know this? Do we seriously believe that everyone who uses generative AI is going to run the output through some sort of process (assuming one even exists) to ensure it's not a substantial enough copy of some copyrighted work? I certainly don't think this is going to happen.

Regardless, copyright is about distribution. If a model trained on copyrighted material is considered a copy or derived work of the original work, then distributing that model is, in fact, copyright infringement (absent a successful fair use defense). I'm not saying that's the case, or how a court would look at it, but that's something to consider.


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

> Note that I'm not talking about what existing copyright law says; I'm talking about how I believe we should be regulating this new facet of the industry.

Is it really new? Humans have always learnt by studying what's out there already. Our whole culture is built on what's been done and published before (and how could it be otherwise?). Without Bach there would be no Mozart, and then down the line their influence permeates everything you hear today.

If anything I'd like to make it easier to reuse parts of our shared culture, and limit the ability of organisations to control how things that they've published are reused. You can make private works if you want to keep control of them, but at some point the public deserves to share and rework the things that have been pushed into the public consciousness.


>> Is it really new? Humans have always learnt by studying what's out there already.

"Humans" being the important word here. I don't understand why people keep trying to compare training a model to humans learning through reading etc. They are very different things. Learning done by machines at enormous scale and done to benefit private companies financially is not the same as humans learning.


How is it meaningfully different with respect to this question?

If I go to a museum and look at a bunch of modern paintings, then go home and paint something new but “in the style of”, this is well-established as within my rights, regardless of how any of the painters whose work I studied and was inspired by might feel.

If I take a notebook and write down some notes about the themes and stylistic attributes of what I see, then go home and paint something in the same style, that too is fine - right? Or would you argue the notes I took are a copyright violation? Or the works I made using those notes?

Now let’s say I automate the process of recording those notes. Does that change the fundamentals of what is happening, with respect to copyright?

Personally, I don’t think so.


The law most definitely distinguishes between the rights of a human and the rights of a software program running on a computer.

AI does not read, look at or listen to anything. It runs algorithms on binary data. An AI developer who uses millions of files to program their AI system also does not read, look at or listen to all of that stuff. They copy it. That is the part explicitly covered by international copyright law. It is not possible to use some file to "train" a ML model except by copying that file. That's just a fact. It wasn't the computer that went out and read or looked at the work. It was a human who took a binary copy of it, ran some algorithms on it without even looking at it, and published/sold/gave access to the software.

AI software is a work by an author; not an author.


> How is it meaningfully different with respect to this question?

Humans can't be owned by corporations for one.


> They are very different things

Yes, but also very similar. We learn very well by spaced repetition and by practicing things. Our whole nervous system stores information in a similar way. Are the brain and its individual neurons more complex? Yes, sure, but that doesn't negate the core similarities.

> Learning done by machines at enormous scale and done to benefit private companies financially is not the same as humans learning.

Yes, that's the important difference: in the end, if you train a robot, you get a program that's easy to copy and scale, with a marginal cost of use orders of magnitude lower than what you'd get with humans in the loop.

We already have fair use in copyright, because there are important differences between the various forms and modalities of human imitation.

And of course, maybe it's time to rename copyright to usageright, since current copyright doesn't even apply in most cases. (The results are not derivatives; there's a substantial transformative difference, etc. That said, the phrasing in the US constitution still makes sense: "... the exclusive Right to their respective Writings and ..." ... if we interpret Right to mean all rights, including even the right to decide who can read/see it.)


The differences don't seem salient though. Doing a legal thing faster doesn't generally make it any less legal; doing it for profit changes the legal regime somewhat but not in ways that seem relevant to what's being claimed.


> Doing a legal thing faster doesn't generally make it any less legal

“But officer, it is legal to drive just slightly slower than I was going!”

Simply put: You are wrong. The law makes arbitrary distinctions all the time, for practical reasons.


There is a specific law against driving above a certain speed. There's no law like that against being an AI.


"Humans have always learnt by studying what's out there already."

Neither an AI model nor an AI developer who programs that model is actually studying "what's out there already". One is copying files, and the other is running algorithms on copied files. And then the first one is raking in $$$ while bankrupting the authors of those files. That's illegal in the US, UK and EU, and under the terms of the Berne Convention.


> You can make private works if you want to keep control of them, but at some point the public deserves to share and rework the things that have been pushed into the public consciousness.

There's already a licensing framework for artists doing this - should they wish to. It's called Creative Commons, and allows a pretty fine distinction of rights from public domain to free for personal use not commercial, and everything in between. https://creativecommons.org/

I agree our shared culture in some sense ultimately owns the productions of the culture. But that's wildly different to (and in some senses the opposite of) letting private companies enclose, privatise and sell those products back to us. As for example Disney has done over and over again, taking myths transcribed by the Brothers Grimm, or classic novels now in the public domain, 'reinterpreting' them and viciously enforcing copyright on these new interpretations.

The entire point of copyright law is to allow the Bachs of this world to profit from their work - without having to die in poverty and obscurity, as so many artists and musicians historically have, even while others profited from their work at scale.

> Humans have always learnt by studying what's out there already.

Finally - as other commentators have noted, there's really no similarity at all between a human, at human pace, studying and integrating an understanding of a piece or genre of art, and an AI training to replicate that work at scale as perfectly as possible. A much better comparator would be the Chinese factory 'villages' that reproduce paintings at scale for the commercial market, without creativity or 'art' being a part of the process. But even that is a poor analogy, since individual humans mediate the process. A really good analogy would be a giant food corporation like Nestle somehow scanning the product of a restaurant and then offering that chef's unique dishes - ones that had taken years to invent - for nearly free, using the same name and benefiting from the association.


What if the AI is open source and run by individual creators? That changes the slant of the argument a lot. I worry that excessive regulation will mostly come down on individual creators using AI tools.


I'm not sure what you mean? The image and text models currently becoming popular are being trained on work that is not owned or created by those creating the models. There is a very strong consensus amongst the artists whose work is being used to train them (without their permission or compensation) that this is a bad thing. Financial impacts are already being felt by artists across industries. There's a huge level of denial of the impact of this on professional artists, already today, here on Hacker News. 'Open source' is a separate issue to training on other people's work, replicating their style and stealing their livelihood.


> Humans have always learnt by studying what's out there already. Our whole culture is built on what's been done and published before

Are you implying that educators should not be compensated or credited? Because that is not how it works in the real world.


If I read a lot of fantasy books as a kid, then start writing my own fantasy book, should I have to pay royalties to the authors of the books I read?


Yes, you do. The royalty is usually called a "college degree".

Most jobs require a degree or certification of some sort.


Does your ability to write fantasy books absolutely depend on having read those fantasy books as a kid? Was gaining the ability to write your own fantasy books and profit from them your only motivation to read those fantasy books? After gaining the ability to write fantasy books thanks to having read them, can you now produce fantasy books at a qualitatively different speed, scale, and conditions than any of the authors of the books that you read?

If the answer to those three questions is "yes", then I would argue that yes, you absolutely should have to pay royalties to the authors.


> If the answer to those three questions is "yes", then I would argue that yes, you absolutely should have to pay royalties to the authors.

Copyright lobbies can't have their cake and eat it too. If you want to enforce such a different way of thinking about copyright compared to the current one, that would destroy the current industry - and rightly so.


I would appreciate if you addressed the questions as literally stated.

I could just as well argue that businesses developing or making use of LLMs can't have their cake and eat it too. If they want their computer programs to enjoy the same rights and prerogatives as human creators do, they should be ready to demonstrate that those models are truly moral agents, with their own lived experiences, and thus deserve the status of legal persons as of themselves subject to the same laws and obligations as human beings.


> If they want their computer programs to enjoy the same rights and prerogatives as human creators do

I don't think they care about that; they seem fine with output images not being copyrightable under the current legislation.


How much does the world owe to Gilgamesh and Homer?


I don't know. But whether the answer is 'our entire human culture', or 'absolutely nothing whatsoever', or any point in between, how would that be relevant to the discussion at hand?


They already paid when they bought the books. Why would they need to pay more?


Say I want to write a screenplay and produce the resulting film, for profit, but I am literally unable to have any ideas whatsoever unless I base them on books that I read. With this in mind, and with this sole motivation, I buy and read the whole collection of Brandon Sanderson's novels and create a screenplay based exclusively on their content, for I have no ideas nor experiences of my own. I already paid when I bought the books. Why would I need to pay Brandon Sanderson any more?


Anything you create comes from what you've seen, from what you've experienced.

Having to pay each and every one of the original sources of the components of your mind, every time they are used to create, would be a dead end!


So I understand you are arguing that derived works should not be subject to royalties, i.e. I should be able to produce a film based entirely on the work of a living author without restriction or the need to pay any royalties.


No, you're over-generalizing what I'm saying (a.k.a. straw man fallacy).

If you're commercializing e.g. goodies that use the exact same character designed by some artist, so that anybody can tell it's the same, AND if the traits of this character are actually original (not something anybody would come up with independently), AND if the people buying your goodies are all thinking about the original character in the first place, AND if the original author is alive, then I'd absolutely agree that royalties need to be paid.

That's just an example, to show that in specific cases it's obvious that royalties are required.

But in the general case, no. Because, else, anything just is a derived work. Just think about it.


It's even impossible to retrace all of the woven threads, ramified tendrils, that link your mind to the billions of other minds all over the world and over the ages.

The world of ideas is liquid. All is mixing, all dissolves and disappears in everything else, and is reborn new and different, again and again.


> Because that is not how it works in the real world.

Unless you're satirizing it or making fun of it in some other way. Then it's fair use.


Educators get paid a flat fee, not an indenture on your future work (nor do you owe anything back to the people who made the learning materials they use). And if you teach yourself from books or websites you don't pay anything (except maybe the cost of the books). All of which seems right and proper?


> Lossy or not, the training data provides value.

If we ignore the issue of machine learning for now: it's not the job of copyright to prevent people from extracting value from a copyrighted work.

If it was, then it would be possible for copyright holders to launch lawsuits that block entities from using the knowledge that was published in copyrighted reference material. Or the rights holder of a cookbook would be able to block people from making the recipe.

We have other sections of law that provide protection along these lines. Patents give their holders a monopoly to extract value from the given invention, a much stronger protection than copyright law. But in exchange, patents are limited to 20 years, must be publicly documented, and cover only certain types of things.

Trade secret laws can be used to protect recipes, formulas and other processes, but only as long as the holder makes reasonable efforts to keep it secret. The owner can't have it both ways, have the IP publicly known and protected by trade secret law.

The only purpose of copyright is to give the holder a monopoly over the reproduction of a work. The definition of reproduction might be quite wide these days: A performance of a work is a reproduction, a cover of a song is a reproduction, distribution is reproduction in the modern age of computers... etc, but the limit of copyright law is reproduction.

Coming back to machine learning, there is a somewhat open question as to whether training counts as a form of reproduction (well, outside of Japan). But we can't use your proposed "extraction of value" metric as a way to decide that.

Personally, I would argue that training a machine learning model is roughly equivalent to a human brain consuming copyrighted works, and should be treated the same in law.

The fact that a machine learning model is (sometimes) capable of recreating a copyrighted work later shouldn't be held against it, as a human brain is also fully capable of recreating copyrighted works from memory. From a legal perspective, recreating a copyrighted work from memory will not save a human from a copyright infringement lawsuit; it's simply not a defence. The copyright infringement happens when the work is recreated.


Copyright law (EU, US, UK, international under the Berne Convention) covers reproduction, distribution and exhibition. That's exactly what those FBI warnings on VHS movies used to say. Distribution and exhibition are prohibited along with reproduction.

In all the cases I mentioned, the only legal way to make any exception to that is if the copying does not harm the interests of the author or reduce the market value of the work. These are the actual laws.

"training a machine learning model is roughly equivalent to a human brain consuming copyrighted works"

A few clear differences:

1. The person "training" a machine learning model doesn't even need to view the work. They copy a file. They do not study or learn from it.

2. A human brain doesn't rely on a 100% verbatim digital copy of the work. To the extent that a brain "makes a copy" of what it observes, it is impossible for it not to.

3. Copyright law (almost everywhere) explicitly applies to making digital copies of binary files of a work (without which it is not possible to "train" a model using the work). Nowhere does it ever apply to a human brain when a person looks at the work.

Not all the things you mentioned are considered "reproduction". A cover of a song is a derivative work, and requires compensation. Showing a movie is exhibition, and is explicitly addressed in copyright laws. These things are not just considered "some form of reproduction".

The laws actually exist and are easy to find and read.


The purpose of copyright is to create moral and economic incentives for authors to create new copyrightable works, by giving those authors a time-limited, state-granted monopoly. Reproduction is only one aspect of this. As an example, syncing a song to a video requires additional permissions even if the party has permission for reproduction. The owners of a record can also disallow the use of a song at a political event on moral grounds, even if the politician has bought permission to play the song in public.

I would personally also argue that a machine learning model is roughly equivalent to a compression algorithm. Converting a 4k video to a 420p video is just as lossy as feeding a learning model the 4k video and asking it to reproduce the 420p video. It has nothing in common with how a human brain consumes content or learns information. No person can produce a 420p video just by consuming a 4k video, nor can any machine learning model gain the emotional constructs and social contexts that human brains get from learning.


Your examples are not inherent rights that copyright law explicitly grants to rights holders. They are clever side effects of how the holder licenses out their monopoly on reproduction.

Holders rarely grant unrestricted reproduction rights to anyone. Reproduction rights licenses always come with a bunch of explicit restrictions, for example: "You may reproduce this novel, in print, unmodified, only for retail sale, in North America, on this quality of paper, for the next 5 years" and so on.

The party has the license to reproduce the song as a standalone audio recording, but attaching it to a video and reproducing the combined work isn't covered and the party must enter into negotiations with the rights holder for a new license. Such licenses often only grant the rights to reproduce it with that exact video and not a different one later, which allows the rights holder to gain control over which videos their song is attached to.

Same thing with holding morality over political events. The rights holder was careful to add a bunch of restrictions to that public performance license they sell. Sure, the politician might have bought a licence, but they forgot to check the small print that blocks their type of event from actually using it.

-----

Machine learning is kind of like compression, yes... It can be a useful analogy at times.

But it is absolutely nothing like lossy video compression. It's not compressing a single file or object. The only way you could train it on a 4k video and get a 420p video out is if that model was extremely over-fitted. The resulting model would likely be bigger than a 420p h264 video file and useless for anything else.

The way that machine learning is like compression is that it finds common patterns across its entire training set and merges them in very lossy ways.

And it's actually very much like how a human brain works. Your brain doesn't start from scratch for every single human face you recognise. Instead, your brain has built up a generic understanding of the average human face, grouping by clusters of features. Then to remember a given human face, it just remembers which cluster of features it's close to and then how it differs... which is a form of lossy compression.
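
To make that analogy concrete, here's a minimal Python sketch (my own toy illustration; the cluster count and the crude rounding are arbitrary assumptions, not how any real system works) of "remember the cluster, then the difference" as lossy compression:

    import numpy as np
    from sklearn.cluster import KMeans

    # 1000 "faces", each reduced to a 64-dimensional feature vector.
    rng = np.random.default_rng(0)
    faces = rng.normal(size=(1000, 64))

    # Build a "generic understanding": 8 clusters of typical faces.
    kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(faces)

    # To "remember" one face, store only its cluster id plus a
    # coarsely quantized difference from that cluster's centroid.
    face = faces[0]
    cluster_id = kmeans.predict(face.reshape(1, -1))[0]
    residual = np.round(face - kmeans.cluster_centers_[cluster_id], 1)

    # Reconstruction is close to, but not exactly, the original: lossy.
    reconstruction = kmeans.cluster_centers_[cluster_id] + residual
    print(np.max(np.abs(reconstruction - face)))  # small but nonzero error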

> No person can produce a 420p video just by consuming a 4k video

But many people do remember entire songs, complete with lyrics and music. And people with musical skills can (and often do) reproduce a song from memory as a cover... which is copyright infringement if performed publicly or otherwise distributed.

> nor can any machine learning model gain the emotional constructs and social contexts that human brains get from learning.

There are many things which large LLMs like ChatGPT are absolutely incapable of doing. People do overhype their capabilities.

But in my experiments, ChatGPT is actually quite good at tasks that require interpreting emotions and social contexts. Does it actually understand these emotions and social contexts? Shrug. But if it doesn't actually understand, that just proves that true understanding isn't actually needed to perform useful tasks in those areas.


The thing with sync rights is that they are a construct made by US courts, who made the interpretation that syncing music to a video invokes the derivative-works part of US copyright law. As such, a party needs to obtain both a license to reproduce and a license to create a derivative work. US copyright law is split into six categories, one of which is reproduction. The other five are: preparation of derivative works, distribution (which is not the same as reproduction), public performance, public display, and public performance by means of a digital audio transmission. The copyright owner is given in the law the right to convey each of these exclusive rights separately.

Other countries, like France, have moral copyright law and property copyright law as cleanly separate laws. Different rules, but with similar practical implications.

In both cases they are not just fine print in a license. A license that intends to give recipients full permissions needs to include everything. This is why international written copyright licenses are a huge legal problem with no obvious solutions.

----

The human brain is vastly more complex. We don't learn by building up an understanding of the average human face, grouping by clusters of features. At best that would be an extreme oversimplification that ignores 99.9% of what the brain does. The eye neurons (still an oversimplification) send the visual inputs to multiple parts of the brain, each interpreting and signaling the others, and the outcome of that both influences and changes the growth and behavior of those paths and parts. You see a face and it talks to the amygdala, but also to the frontal cortex, and also the limbic and pre-limbic cortex. It simultaneously runs a simulation in the hypothalamus to test the emotional reaction. And we are still not at the hippocampus, where long-term memories are generally considered to form.

A person would be long dead if they had to go through the generic understanding of what a tiger looks like, then go through memories that distinguish a lion from a cat, and last go through memories of the significance of a fenced tiger in a zoo in contrast to one in the jungle.

Trying to remember a given human face actually invokes a lot of those parts of the brain that activated when we saw it, and a face can also be remembered by putting oneself into the emotional state we were in when we met the person. One theory about dreaming is that we also rerun those neural pathways, but not all of them at the same time, which then allows for more context.

LLMs don't even come close to operating like this. They can be good at emulating what seems like learning, but comparing the two is like saying that the complex system we call the immune system is like a gun. Both can kill people.


> the training data provides value

I wish it was easier to build on someone else's copyrighted value. Geforce Now shouldn't have to get permission from game makers to rent out server time for users to play games they already own. Aereo shouldn't have been legally obliterated for the way they rented out DVRs with antennas. Using a sample in a song shouldn't be automatic infringement.


> distributing that model is, in fact, copyright infringement

Is it? If I distributed the digits of pi (to the umpteenth billion decimals), those digits theoretically contain copyrighted information.

The distribution of the copyrighted material is the infringement - but not if the data is _meant_ to produce other effects, and it is reasonable that the data is used for some purpose _other than_ replicating the copyrighted works.

> the training data provides value.

and so does a textbook. A student reading the book (regardless of how that book was obtained - paid or not) does not pay royalties on the knowledge obtained.


Value derived from a work was never a component of protected IP.

If you write a song which is so inspirational that it influences the way a listener thinks or even inspires them to make similar sounding songs, you don't have any claim to that value.


> the training data provides value.

Copyright law doesn't protect "any provided value". Fair use specifically allows content creators to make use of copyrighted works.

This can be for the purposes of parody, it can also be for the purposes of "reaction videos" or "commentary". The original content creators are NOT compensated for the value they helped create in a "reaction video".

https://arstechnica.com/tech-policy/2017/08/youtuber-court-b...


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated. And should also be able to decide they don't want their work used in that way.

You're posting on HN. Are you expecting a check from YCombinator?

> Regardless, copyright is about distribution. If a model trained on copyrighted material is considered a copy or derived work of the original work, then distributing that model is, in fact, copyright infringement (absent a successful fair use defense). I'm not saying that's the case, or how a court would look at it, but that's something to consider.

The government is what ultimately decides what copyright means. And if the court wouldn't look at it that way, then it's not "in fact copyright infringement".


> Lossy or not, the training data provides value. If all the various someones had not spent time making all the stuff that ends up as training data, then the model it trains would not exist.

I don’t think deriving value is the measure for infringement. The artist derived value from society and from countless inspirations, should they be compensated?

Copyright is the right to distribute your content, not to control it in every way and derive maximum value possible. The purpose of copyright is to incentivize creators and they already get, in the US, 70+ years of monopoly preventing anyone else from selling copies.

This seems like more than enough incentive for creators to create, as evidenced by massive copyrighted material created at a rate far above population growth (more content is created and more money is made via copyright than ever before).


> If you are going to use someone else's work in order to make something that you are going to profit off of, I believe that original author should be compensated

This is insane. I learned mathematics from Houghton-Mifflin and Addison-Wesley copyrighted textbooks. Are you telling me that if I use this knowledge to, say, calculate matrices, I am violating copyright?

If I read Brandon Sanderson, absorb his writing style, setting, characters, and plot devices into my subconscious, and then a year later pen a short story that happens to be Sanderson-ish, do I have to write a check to Brandon?

No. LLMs are doing roughly the same thing our own brains do, at least at a high level: absorb information, and utilize that information in different ways. That's it.


Society and the economy will do just fine without copyright law. Some things would probably go away, like blockbuster movies. But a lot of other things would flower.


I strongly agree with this.

There's a distinction between "learning from" and "copying". "Learning from" is a transformative process that distills from the observation. This distillation can be as simple as indexing for a search engine, or as complex as a deep neural network.

Simply because a neural network can create something that is a copyright violation doesn't mean the training process itself is.

A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproduction itself) is a copyright violation, but the learning process isn't.

The neural network is a tool.

It's reasonable to be concerned about the loss in employment by people who are affected by generative AI. But I think this is a separate issue to the copyright argument.


> There's a distinction between "learning from" and "copying".

Neural nets can memorize their training data. Generally that isn't what you want, and you strive to eliminate it. However, it could instead be encouraged to happen if someone wanted to exploit this law in order to abuse copyrights.
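
A toy illustration of that distinction (my own sketch in NumPy, not how production models are audited): give a model enough capacity relative to its data and it can reproduce the training set essentially verbatim; constrain it and it can only approximate.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

    # Memorizer: 10 coefficients for 10 points. The fit passes through
    # every training example, i.e. the training data is stored verbatim.
    memorizer = np.polynomial.Polynomial.fit(x, y, deg=9)
    print(np.max(np.abs(memorizer(x) - y)))  # ~0: memorized

    # Generalizer: 3 coefficients. It captures the overall trend but
    # can no longer reproduce individual training points exactly.
    generalizer = np.polynomial.Polynomial.fit(x, y, deg=2)
    print(np.max(np.abs(generalizer(x) - y)))  # clearly nonzero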


The law applies to the training of a neural network; you're not depriving the copyright holder of their intellectual property. If you use a copy of their work, they still own the copyright regardless of whether you copy it by right-click > copy or by overfitting a generative model.


Humans can memorize their training data too... aka see something and then produce a copy (code, drawing, music, etc). The principles underlying how LLMs and humans learn aren't really that different... just different levels of loss/fuzziness.


And when humans do that they may also infringe on copyright.


Yet it’s not illegal to look at the McDonald’s logo, is it?


Yes, and as GP suggested, going on to distribute copies would be copyright infringement. That doesn't imply that it's an infringement to train the neural net.


Humans learn on copyrighted works as a matter of standard training. And certainly humans can memorize those works and replicate them – and we rely on the legal system to ensure that they don't monetize them.

The same will apply to neural nets. They can learn from others, but must make sufficiently distinct new works of art from what they've learned.


> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproduction itself) is a copyright violation, but the learning process isn't.

I don't think that's correct. That might be trademark infringement, if the logo is a registered trademark, but "seeing something and then drawing it" is in general not copyright infringement.


Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement. (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).


> Drawing a copy of a copyrighted picture from memory, and then distributing that copy, would certainly normally be copyright infringement.

In US law, there is a nuance between Copyright and Trademark.

> Drawing a copy of a copyrighted picture from memory, and then distributing that copy

Would not necessarily be copyright infringement (it depends on a judge). This is, for example, why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio), as is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.

> (A logo may not be enough of a creative work to be copyrightable, but I assume that's not what you're getting at).

A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.

In your example, if someone draws from memory a logo, they actually own the copyright, but it is still Trademark infringement and the trademark owner will be protected.


> In US law, there is a nuance between Copyright and Trademark.

It's not a nuance, it's a completely separate legal regime, and not what this conversation is about.

> Would not necessarily be copyright infringement (it depends on a judge).

Every law can be challenged in court, but a copy of a picture as-is is a pretty clear-cut case.

> This is, for example, why Taylor Swift is able to re-record her music (the copyright is owned by a recording studio), as is, and can distribute the new version as "Taylor's Version": she owns the copyright on the new version.

Nope. She's able to because there is a compulsory license for covers of songs that have already been published - something very different from them not being protected by copyright - and/or because she owns some of the rights. She may well be paying royalties on them. That compulsory license regime is specific to recorded music and does not apply to pictures.

> A logo is actually MORE protectable, through Trademark. Trademark is significantly MORE protected than Copyright.

"More" is a simplification; trademark laws are quite different from copyright laws, stronger in some ways and weaker in others (e.g. you can lose a trademark by not enforcing it, whereas you cannot lose a copyright that way). In any case, that's a distraction from the current topic.


> "seeing something and then drawing it" is in general not copyright infringement.

It's seeing something, drawing it, and then distributing that drawing which is infringement. Bonus points for the distribution being a sale.


> A human can see an advertisement for a Marvel movie and then reproduce the Marvel logo. Redistributing that logo (and possibly the act of reproduction itself) is a copyright violation, but the learning process isn't.

This then becomes about where the liability of that violation lies, and how attractive that is to companies.

A human "learning" the marvel logo and reproducing it is violation. How does OpenAPI fit into this analogy?


The liability would lie with the company using the LLM product. This could mean that many companies won't want to take on the risk unless there is decent tooling around warnings of infringement and listing of sources.


I think liability lies with the person who uses the product to violate copyright. The hosting / producing company didn’t violate copyright if I use their model to make Mickey Mouse pictures. I did.


How can you be certain that the content being generated is non-infringing?


You can't; I think Fair Use is a fundamentally subjective judgement of a combination of how transformative the work is and the intent and impact of it being distributed.


The same way you do with any other content you generate in other ways.


Well, when I pick up a pencil and make a drawing, I have a lot of agency over what is created.

The whole point of these generative models is that I have less agency over exactly what gets created - it takes my prompt and does the rest.


Normally when I generate content it’s from my brain and I can tell the difference between copying memorized content, re-expressing memorized content, and generating something original. How do I know what the LLM is doing?


Are you sure? If you look at plagiarism in music, you'll find a number of cases where the defendant makes a compelling point about not remembering or consciously knowing they had heard the original song before. For legal purposes that is not the point, but they feel morally wronged to be found guilty. What happened is that they internalized the musical knowledge but forgot about the source - so they can't make the distinction you claim anymore. Natural selection shaped our brains to store information that seems useful, not its attribution.

LLMs are also not usually trained to remember where the examples they were trained on came from; the sourcing information is often not even there (maybe they could, maybe they should, but they aren't). Given that, and the way training works, one could argue that they're never copying, only re-expressing or combining (which I think of as a form of "generating something original"). Just memorizing and copying is overfitting, and strongly undesirable, as it's not usable outside of the exact source context. I agree it can happen, but it's a flaw in the training process. I'd also agree that any instance of exact reproduction (or of material with similarity to the original content over some high threshold) is indeed copyright infringement, punishable as such.

So, my point is, training a model on copyrighted material is legal, but letting that model output copies of copyrighted material beyond fair use (quotations, references, etc - that make sense in the context the model was queried on) is an infringement. And since the actual training data is not necessarily known, providers of model-as-a-service, such as OpenAI with GPT, should be responsible for that.

In cases where a model was made available to others, it falls on the user of the model. If the training data is available, they should check answers against it (there's a whole discussion on how training data should be published to support this) to avoid the risk; if the training data is unknown, they're taking the risk of being sued full-on, without any mitigation.


That’s what I said, the user would be liable. The user could be a company or an individual.


> A human "learning" the marvel logo and reproducing it is violation

Not quite; it's really in the resale or redistribution that the violation occurs. Painting an image of the Hulk to hang in your living room wouldn't really be a violation; selling that painting could be; turning it into merch and selling that would wholeheartedly be; and trying to pass it off as official merch is without question a violation.


Hanging it in your living room is in fact a copyright violation, just not one that Marvel is likely to legally pursue.


I strongly disagree with this. We shouldn't create new laws for new technology by making analogies to what's allowed under old laws designed for old technology. If we did, we would never have come up with copyright in the first place.

600 years ago, people were allowed to hand-copy entire books, so they should be able to do it with a printing press right? It's "just a tool"!

The correct way to think about this is to recognize that society needs people to create training data as well as people to train models. If we don't reward the people who create training data, we disincentivize them from doing so, and we'll end up in a world where we don't have enough of it.


I don't think the comparison with human learning holds.

NNs and humans don't learn the same way - humans can fairly quickly generalise what they have learned and, most importantly, go beyond what they've learned. I haven't seen that happen with neural networks or GPTs; at best, you're getting the average of what they have 'learned'. There's human learning and there's neural network 'learning', and they're different things.


NNs absolutely can go beyond what they have learned and aren't just producing the "average".

Some good examples outside the typical LLM/images work:

* Deep Mind's work on AlphaFold, which generates predictions on proteins that haven't been seen before

* AlphaGo which plays games better than any human (so clearly can't be "the average")

If we look at LLMs, something like writing code in the style of Shakespeare isn't really something that's been seen before.


> Deep Mind's work on AlphaFold, which generates predictions on proteins that haven't been seen before

I have used AlphaFold a bit in my own work, and if I showed it 'unusual' proteins like rare mutants it usually generated garbage. Some evidence for this exists in the literature; see for example https://www.biorxiv.org/content/10.1101/2021.09.19.460937v1 or https://academic.oup.com/bioinformatics/article/38/7/1881/65... or

>AlphaFold recognizes a 3D structure of the examined amino acid sequence by a similarity of this sequence (or its parts) to related sequences with already known 3D structures

https://www.biorxiv.org/content/10.1101/2022.11.21.517308v1


Yup, exactly.

I'm sure Google represents strings of text from pages in some internal format, but relatively verbatim. Even represented verbatim, because their output is a search result and not an article that uses the copyrighted text verbatim there's no copyright violation.

And models don't even use data verbatim; if they do, they're bad models/overfitted. People are making all sorts of arguments, but they seem to boil down to "it's fine if humans do it, but if a machine does it then it's copyright violation".

People often disregard the fact that copyright law is woefully outdated (an absolute joke in itself, which can't be used to defend anything since Disney shoved its whole fist up copyright law's...) and should really be extended for the modern world. Why can't we handle copyright for ML models? Why can't animals have copyright? It's extremely trivial to handle these cases; the point of copyright is usage, and agency comes into play.

If people want to be biased against machines, then fine. Be racist to machines, maybe in 2100 or so those people will get their comeuppance. But if an ML model isn't allowed to learn from something and use that knowledge without reproducing verbatim, then why is predictive text in phone keyboards allowed?

Everyone out here acting like they're from the Corporation Rim.


Copying a logo is trademark infringement, not copyright infringement.


No. Making a single copy for your own use is still a copyright violation. There are exceptions (fair use, nominative use, etc), but just because people are rarely sued for personal copying doesn't mean that copying is permitted. And trademark issues, such as the other commenter generating the Superman logo, are subject to a host of other rules.


Training a model isn’t making a copy for your own use, it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff. There’s no copy of the original, even if it’s able to produce a similar product to the original. That’s the specific thing - the aggregation and the lack of direct reproduction in any form is fundamentally not reproducing or copying the material.

The fact that it can be induced to produce copyrighted material, just as you can induce a Xerox to reproduce copyrighted material, doesn’t make the original model or its training a violation of copyright. If its sole purpose were reproduction and distribution of the material, or if it carried a copy of the original around and produced it on demand, that would be a different story. But it’s not doing any of that, not even remotely.

All this said, it’s a highly dynamic area - it depends on the local law, the media and medium, and the question hasn’t been fully explored. I’m wagering, though, that when it comes down to it, the model isn’t violating copyright for these reasons, but you can certainly violate copyrights using a model.


Copying into RAM during training is making a copy, and can be a copyright violation.

https://en.wikipedia.org/wiki/MAI_Systems_Corp._v._Peak_Comp....

However, it seems that there is a later case in the 2nd circuit:

https://en.wikipedia.org/wiki/Cartoon_Network,_LP_v._CSC_Hol....


MAI v. Peak was obviously wrong. It would mean whenever you use someone else's computer, and run licensed software, you're committing copyright infringement. The decision split hairs distinguishing between the current user and the licensee for purposes of legality of making transient copies in memory as part of running the program.

Peak was a repair business. MAI built computers (as in assembled/integrated; I think they were PCs) and had packaged an OS and some software presumably written or modified in-house along with the computer. MAI serviced the whole thing as a unit. So did Peak. MAI sued Peak for copyright infringement because Peak was taking computer repair/maintenance business away from MAI, under the theory that Peak employees operating their clients' MAI computers and software was copyright infringement. (There were other allegations of Peak having unlicensed copies of MAI's software internally, but that's not central to the lawsuit.)

If you have a piece of IP to use to train an ML model, and you have the legal right of access to use that piece of IP (for private purposes), MAI v. Peak doesn't cleanly apply.

MAI v. Peak is also 9th circuit only, and even without the poor reasoning, it should automatically be in doubt because the 9th circuit is notoriously friendly to IP interests, given that it covers Los Angeles.


I agree that MAI v Peak is crazy.

I was only pointing out that the law is of the opinion that a copy is a copy is a copy, regardless of where it's made, or how long it exists for.

Other decisions come into play to save us, like Authors Guild v Google, where they said search engines could make copies, bringing Fair Use into the picture.

Personally, I think that creating the model is Fair Use, but anything produced by the model would need to be checked for a violation. I would treat it the same as if I went to Google Book Search, and copied the snippet it returned into my new book.

The license associated with the training data then becomes insanely important. Having the model reference back to the source data is even more important.

For example, training data with a CC BY license would be very different to CC BY-SA and CC BY-ND, and they all require the work produced by the model to have credit back to the original source to be publishable.

https://creativecommons.org/licenses/


The difference is that the copy is authorized, unless the work is being pirated.

When an artist displays their work on DeviantArt or Artstation or whatever, they are allowing the general public to load it into memory. It's part of the license agreement they sign when they sign up for these services.


The copy isn't authorized, the copy is allowed under Fair Use. There's a huge difference between the two.


Wrong.

Fair Use applies to instances that would otherwise be copyright violations, i.e. unauthorized distribution.

When you sign up for a social media site, you EXPLICITLY grant the site the rights to distribute your work. You have expressly permitted it. It's a big difference!


The sources used for training these AIs are publicly available sources like Common Crawl. If having a copy in RAM is a copyright violation, then there are copyright violations occurring well before any AI ever sees it.


It is, and it's the same reason Blizzard can sue cheat makers: they are violating copyright law by using the memory of the game, etc.


How do search engines exist? The internet archive? Caching of image results? Web browser caches? CDNs?


Copies made by search engines don't need authorization, and can be unauthorized copies. Search engines are allowed to make copies under Fair Use since they are transformative - see Authors Guild, Inc. v. Google, Inc.

There hasn't been an explicit decision for ML training, but everyone's assuming that Authors Guild v Google applies.

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,.....

CDNs operate under the control of the copyright owner, so they would be authorized.

Web browser caches are under the control of the recipient who has authorization to make a copy.


Depends on the training. Copilot can output training code verbatim. And even if not an exact reproduction, using a small training set could often produce insufficiently transformative work that could still be legally considered a derived work. (IANAL)


> Training a model isn’t making a copy for your own use, it’s not making a copy at all. It’s converting the original media into a statistical aggregate combined with a lot of other stuff.

Devil’s advocate: That sounds like a derivative work, which would be infringement.


converting the original media into a statistical aggregate

Devil's advocate: That sounds transformative, which wouldn't be infringement.


Good point. I’d imagine we’ll see arguments in both directions, given how grey the line is between purely derivative and transformative.

I think it’s fair to say that generative AI trained on copyrighted content will be an unmitigated win for IP attorneys all around.


> No. Making a single copy for your own use is still a copyright violation.

In some jurisdictions, perhaps, but not in all of them. There isn't one set of universal copyright law in the world. Eg in New Zealand you are allowed to make a single copy of any sound recording for your own personal use, per device that you will play the sound recording on. I'm sure there are other examples in other countries.

https://www.consumer.org.nz/articles/copyright-law


This is the same in the UK (and not only for sound). If you own the copy, you can make personal copies. You can't share them, and you have to own the original.


Some licenses, like CC, have variants that prohibit production of derivatives, or prohibit commercialization, or require licenses or the same requirements to be preserved in derivative works. Sweet to see people "warm up" to saying stuff like 'well, it's just a derived work'. Can we get tech to actually respect the licenses of the works it uses next? It's something that's been asked for all along - without just going 'well, it's all fair use', 'so we'll just ignore all licenses and won't even try to detect or respect licenses and whatever requirements they have'. Sure, it may be "unenforceable", but if tech keeps saying "fuck you" to the creators, and "fuck you specifically to the licenses, we won't even look at them or process them", creators will keep saying 'well, fuck you too' right back.

Otherwise, it's just talk with no follow-through: just dropping words like 'fair use' as a get-out-of-liability card, without actually engaging with intellectual property concepts, just to get to use works without respect for the artist's will (as it could be expressed in a license), or sometimes even a mention, let alone "compensation" or other "consequence".


I suspect that enforcement of licences will be based on who owns the licence.

A large corporation? Of course. You're not even allowed to talk about the product without paying a fee.

A small creator? Oh no, it's fair use.


Did you just argue against fair use in general? Everything you wrote applies to it as well. Copyright has limits, it's not like right holders get to determine what those are. It's a balance of rights of creators and users.


Is the use really all that fair, when thousands, if not millions, of works and artists get repurposed into services with "subscriptions" and "usage tokens" and other kinds of monetization, while directly competing with and against those very artists? Or will it take some markets getting completely destroyed and overtaken, and people getting displaced, before people wake up and go 'wait, was that really "fair"? what happened?'

And no, I'm not really arguing against it. Fair use can be great. But shit like this is really pushing it to its limits on scope and scale of use and commercialization.


It is exactly as fair or unfair as humans who do fan art based on other characters or styles that they have seen and studied.

What is the difference between a model creating an infringing work of Iron man when prompted to do so and https://old.reddit.com/r/marvelstudios/search?q=%2Bflair%3AF... ?

After all, aren't those images directly competing with the artists of Marvel Studios and the artists who properly license it to create derivative works ( https://www.designbyhumans.com/shop/marvel/ )?

If we are going to say that creating images from models trained on copyrighted work is infringing, then shouldn't clearly infringing works in the fan art category be handled the same way, since they are competing with the very artists and companies who are the rights holders for the images and likenesses of the content?


I think societies haven't really determined what is fair in these cases. To give you a counter example: Millions of young artists look at art museums and incorporate what they see and learn into their own art. Creators of the images they look at get nothing from the proceeds young artists end up earning later in their lives. Is this unfair?


This is the "guns don't kill people, people do" argument. Not saying that that proves things one way or another, just that assigning responsibility is not really a cut and dry question and many people think that it's important to look prior to the final interaction

IANAL, but I believe that in the US, tools designed to circumvent copyright are illegal, which makes sense to me inasmuch as one believes that copyright should be protected.


Except these models are not designed to circumvent copyright; in fact their primary purpose is to generate non-infringing and non-copyrightable output. A model can be induced to produce facsimiles of copyrighted material, but that's explicitly not the purpose or intent, and it requires positive action on the user's behalf to occur.


I'm saying `A => B` (something should be illegal if it's primarily for crime) and you're saying `-A` (LLMs are not primarily for crime), which is not really a disagreement. My point is disagreeing with the GGP who argued that `-B regardless of A` (LLMs should not be a crime regardless of whether they facilitate crime, because it is the end-user who is committing the crime).

I happen to mostly believe the conclusion (`-B`) but not the particular argument.


Photocopy machines are great at making copies of copyrighted material, and are completely legal in the US. The entire internet routinely makes copies every time you visit a web page.

What's illegal in the US is selling tools for breaking DRM: https://www.androidpolice.com/2018/10/26/us-copyright-office...

From a quick google, open source tools for breaking DRM are legal, and so is breaking DRM for personal use: https://www.google.com/search?q=are+drm+defeating+tools+ille...


>The aggregation performed by model training is highly lossy and the model itself is a derived work at worst and is certainly fair use.

I mean, I've literally gotten the Superman logo in Stable Diffusion without even trying, so it isn't that lossy.


And if you use it, you’re violating copyright. But you will find no copy of the logo in the model data. The model is way too small to contain its training imagery from an information theoretic point of view.
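A back-of-envelope version of that information-theoretic point, in Python. The figures are rough public ones I'm assuming (on the order of a billion parameters for a Stable-Diffusion-class model, on the order of two billion training images), not anything stated in this thread:

    # Rough assumed figures: ~1B parameters, ~2B training images.
    params = 1.0e9
    bytes_per_param = 2              # fp16 weights
    model_bytes = params * bytes_per_param

    images = 2.0e9
    print(model_bytes / images)      # ~1 byte of weights per training image:
                                     # no room for copies, only statistics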


> But you will find no copy of the logo in the model data.

You won't find a copy of the plaintext in a ciphertext. But you can still extract the plaintext from the ciphertext.


That's an example of a two-way lossless transformation. The data is certainly encoded in the ciphertext, it is directly retrievable, and the ciphertext has no purpose other than to contain the original data. From the model, by contrast, you can't directly retrieve the original data, and the model has more purposes than producing the original. It requires you to specifically manipulate it to produce the copyrighted material, and just because copyrighted material was used to train it doesn't mean it can even reproduce a facsimile.

I think a better counterexample is MPEG and other lossy formats. But again, the format does nothing but carry the original, even if it's not a perfect reproduction. You can't use it in any other way. Its express intent is the reproduction of the copyrighted material with no modification, improvement, or derivation. These models are not trained with the intent or purpose of reproducing only the copyrighted materials. It requires your specific action to induce the reproductions, if that's even possible, and the model generally serves other purposes in all other uses.

This is more like a photocopier than not - you can certainly violate copyright with a photocopier, but the photocopier itself doesn't exist to violate copyright. It's for other purposes. The ambiguity obviously comes in because a photocopier wasn't built by scanning all documents on earth first. But I think the very act of mixing all the other images and documents together into the model, which again is just a statistical aggregate of everything it was trained on mushed together, turns it into at worst a derived work that falls under fair use.


If you search "Superman Logo" you find actual copies of the Superman logo which are served from Google's cache.

If you ask a VFX artist to create the "Superman Logo" with Photoshop they'll do an excellent job.

The first one isn't copyright violation because it is fair use. The second maybe is, if it is redistributed, but we don't ban the use of Photoshop by artists just because they can choose to reproduce copyrighted things with it.


I agree, and I honestly think that a big part of the issue with AI image generation is people just really have a hard time conceiving of a technology that can make such accurate images from a relatively small model like this.

"It must have a copy" - but van Gogh didn't make paintings of hot rods or whatever, and you can't copyright style or technique.


I see so many laypersons, and even sometimes people with a CS background, describe diffusion models as some sort of magic content-addressable data store you "just" look up a bunch of original images in and somehow copy pieces of. These debates would get a whole lot better if more people had at least a very basic understanding of how the training process works.
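For anyone curious, here is a minimal sketch (my own illustration, not any real library's API) of the core diffusion training step: the network is optimized to predict the noise that was added to an image, so the weights accumulate statistics rather than storing images.

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))                  # stand-in for a training image
    alpha = 0.3                                 # signal fraction at this step
    noise = rng.standard_normal((8, 8))
    noisy = np.sqrt(alpha) * image + np.sqrt(1 - alpha) * noise

    predicted = np.zeros_like(noise)            # dummy "network" output
    loss = np.mean((predicted - noise) ** 2)    # MSE a real optimizer reduces
    print(round(float(loss), 3))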


A superhero costume using a logo. The logo is inspired by the typical cut of the strongest gem, framing a monogram of the first letter of the name for maximum size and visual clarity.

If you literally asked someone for a recognizable outline of the strongest gemstone's iconic cut, you'd get that outline, and the rest is an obvious path. Humans might unconsciously, or even by choice, avoid something too similar to something they already know.

Superman's costume also uses vibrant colors. The red / blue pairing is used extensively across many logos and visual representations for the high contrast of two vibrant colors.

As I try to imagine an older child or young adult somehow raised in an environment like pop culture but through some twist absolutely unexposed to Superman or any related concepts, it isn't that far of a stretch to imagine independent invention of a strikingly similar idea. Maybe not as a first draft, but in exploring a range of possible powers and automatic logos, e.g. in the range of an LLM-backed character creator for a superhero game, with the results then annealed through simulated effectiveness / fitness of hero powers, logo design, etc.

Everyone wants to think they're a special snowflake and that what they create is somehow unique as well. However we're all drawing on a huge pool of common culture to synthesize expressions which fulfill a set of constraints prescribed by the culture and the culture's influence on the individual and the moment being experienced.

In the case of Superman, that's arguably even a description of the archetype: they are literally a super man. Clark Kent, however, is a little more unique and probably a trademark (consumer/commercial-use protection) as long as such a registration is maintained.


Unfortunately, I don't believe any of that matters with trademarks. If someone came up with the Superman logo on their own, and released a product that used it, they could not say "but it's a really simple logo" and get a free pass. I'm not sure what that means for ChatGPT, but it would certainly factor into your use of images produced by ChatGPT.


I feel like this is a very strong point that just gets hand-waved away. There are numerous cases where AI-generated content is an exact copy of an existing work. This happens with text, music, and art.

If we have AI-powered content generation in a video game, and you put into a prompt, "generate 300 Mickey Mouses, then play some music that sounds like Taylor Swift's new album", and the results look exactly like Mickey Mouse and the music is Taylor Swift's, it's really difficult to argue that's not copyright infringement.

Yet, people get away with thinking that's not copyright infringement because "the algorithm learned it, like a real human". If the prompt just created a human-designed model, then that is copyright infringement.

The solution might be for big corporations to create an adversarial network that you can train against to purge copyrighted works from your network.


It is copyright infringement - but it was you, who prompted the production of the copyright violation, who is at fault. The model isn't specifically any more a copyright violation than a browser cache or a photocopier. The person who uses the machine to produce violations is at fault, not the thing that, in addition to legitimate transformed works, can be used to produce copyright violations. As a company hosting such a service, my goal would be similar to YouTube's: make a best effort to monitor for violations and put up the best guard rails I can. But I shouldn't be held liable for your intentional use of a product for ill, so long as I made that best effort.


There's no difference between an art student looking through a museum or archives for ideas and an AI using the material for training.

Same could be said for reading. A medical student reading through textbooks or a writer who reads is essentially what an AI is doing.

You can ask an art student to create something in a certain style. You can get writers to write in a certain style. Equivalent.


> There's no difference between an art student looking through a museum or archives for ideas and an AI using the material for training.

A few notable differences:

1. Scale: a single art student can't view millions of works in a week.

2. Duplication: a single art student's brain can't be cloned or downloaded into another art student's brain.

3. Speed: a single art student cannot draw or paint thousands of images in a week.

4. Ownership: the software is likely owned and controlled by a large corporation, while the art student (hopefully) isn't.


Notably none of those things, if they did apply to the art student, are copyright violations. The speed, scale, versatility, and ownership of a machine learning model has no bearing on its ability to violate copyright.


The interesting thing to me is determining when something that is acceptable for humans to do at human scale is also acceptable for machines to do at industrial scale.

There are many examples. One is face recognition - clearly it is acceptable for individual humans to do this at small scale, but systematic identification of everyone on a street or in a stadium has different implications for society.

(In this case it almost doesn't matter whether the surveillance is performed by humans or machines - it's the scale and systematic nature that changes the equation.)


Good reasons not to assign copyright to their output, at least not without some caveats.


I don't have strong feelings about that either way, but that's not what this post is about.


These are all true. But they apply equally to a search engine index and that has already been found not to violate copyright and to be very useful to society.


> These are all true

That was the point.

Whether or not there is copyright infringement going on (or whether copyright law is an appropriate regulatory framework for ML models), I frequently see claims like "ML training is no different from what humans do" repeated, in spite of its incorrectness.


If an art student was able to gain those abilities, would you then argue they couldn't legally create art?


This just means students better grow bigger brains!


Conflating training a model with human learning is wrong.

When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

The model is not walking around a museum where it is an authorized viewing. It is not a being learning a skill. It is a function.

The further issue is that it may output material that competes with the original. So you may have copyright violation in distribution of the dataset or a model's output.
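As a toy sketch of the mechanics being claimed here (the names are illustrative only, not anyone's actual pipeline): every pass over the corpus materializes the works in memory again, in original and transformed form.

    corpus_on_disk = {"work_a.txt": "full text of work A ...",
                      "work_b.txt": "full text of work B ..."}

    def read(path):
        # stands in for disk I/O; real loading copies bytes into RAM
        return corpus_on_disk[path]

    for epoch in range(3):
        for path in corpus_on_disk:
            text = read(path)        # the work, materialized again
            tokens = text.split()    # a transformed copy of it
            # a real pipeline would now run a gradient step on `tokens`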


I don't fundamentally disagree with you, but what you are saying doesn't hold water.

> a copy is made and reproduced numerous times when training.

Casually browsing the web creates millions of copies of what are likely the same images and text that models are trained on. Computers cannot move information; they can only copy it and delete the original. Splitting hairs over the semantics of what it means to "copy" isn't a strong argument.

> where it is an authorized viewing

What exactly is an unauthorized viewing of a publicly accessible piece of content online that has been hyperlinked to? If we assume things like robots.txt are respected, what makes the access of that data improper?

> it may output material that competes with the original

An art student could create a forgery. I could craft for myself a replica of a luxury bag. But that's not a crime unless it's done with the intention of deceiving someone or profiting from the work. Intent, after all, is nine tenths of the law.

It's an important right that you should be able to do and create things, even if the sale or distribution of the outputs of those things are prohibited. The ability for a model to produce content which couldn't be distributed shouldn't preempt its existence.

> So you may have copyright violation in distribution of the dataset or a model's output

And neither of those things are the act of training or distributing the model itself!


There is quite a bit of precedent for "making copies of digital things is copyright infringement". Look at lawsuits from the Napster era. [1]

What makes the use improper? Licenses. Terms of service. Mostly licenses though. For example, all the images on Flickr that were uploaded under Creative Commons licenses (e.g. non-commercial) have now been used in a commercial capacity by a company to create and sell a product.

Similarly, code is on GitHub with specific licenses with specific terms. Copilot is a derivative work of that code; the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

The reason I mention competition with the original is the fair use test (USA). When courts decide whether something is fair use they consider a few aspects. Two important ones are whether it is commercial, and whether it is a substitute for the original. When art models output something in the style of a living artist, it is essentially a direct substitute for that person.

Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Training the model may very well be a copyright issue. The images have been copied, they are being used. Whether that falls under fair use will likely be determined on a case by case basis in court. I do not believe closed commercial models like Copilot or Dall-e will pass a fair use test.

There is a lot of money involved here though, so we will need to wait for years before we have answers.

1. https://www.theguardian.com/technology/2012/sep/11/minnesota...


> to create and sell a product.

This is not model training.

> Copilot is a derivative work of that code, the license terms of that code (e.g. GPL, non-commercial) should extend to the new function that was derived from it.

But the very act of training Copilot is not problematic. And in fact, if GitHub never did anything with Copilot, the physical act of training the model would not be problematic at all. And that's what's at issue here. How Copilot is used is orthogonal to the article.

> Sure, I can make a shirt with Spider Man on it and give it to my brother, but if a company were to use what I made or I tried to sell it, I would expect a cease and desist from Disney.

Yes. And training the model isn't the part where you sell it. It's the part where you make it.

> Training the model may very well be a copyright issue. The images have been copied, they are being used.

What do you think "being used" means here? If I work for a company and download a bunch of text and save it to a flash drive, have I violated copyright? Of course not. If I put that data in a spreadsheet, is it copyright infringement? Of course not. If I use Excel formulas on that text is it infringement? Still no.

And so how can you claim in any way that the creation of a model is anything more than aggregating freely available information?

I don't disagree with you about the use of a model. But training the model is just taking some information and running code against it. That's what's important here.


I'm glad you brought this up, as this tendency for people to anthropomorphize a learning algorithm really bothers me. The model training process is a mathematical function. It is not a human engaging in thought processes or forming memories. Attempting to equate the two feels wrong to me, and trying to use the comparison in arguments like this just feels irrelevant and invalid.


> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

How's that any different from what happens inside a human's brain when learning?

> The model is not walking around a museum where it is an authorized viewing.

The training data could well be from an online museum. And the idea that viewing something public has to be "authorized" is very insidious.

> The further issue is that it may output material that competes with the original.

So might a human student.


It is different from a human brain in that it is not a human brain. It is a statistical function that produces some optimized outputs for some inputs.

I have made no mention of things being authorized in public. In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

If you use this copyrighted material to produce a product for commercial gain you will likely face a fair use test in court. If you use it for a non-commercial cause with public benefit you could probably pass that fair use test. Open source will do very well because of this.

The model is not a human though, and very often these are not "public" works that it is trained on.


> It is a statistical function that produces some optimized outputs for some inputs.

So is a human mind.

> In the US you are allowed to take a photo of anything you want in public. These models are not being trained on datasets collected wholly in public though, it is very insidious to suggest that they are.

How so? What non-public training data are they using, and why does it matter?

> The internet is not "the public". It is a series of digital properties that define terms for interacting with them. Now, a lot of material is publicly accessible online, but that does not mean that it is not still governed by copyright. For example, my code on Github is publicly accessible, but that doesn't mean you can disregard the license.

It does mean you can read the code and learn from it without concern for the license (morally, if not legally).


>> When training a model you are deriving a function that takes some input and produces an output. The issue with copyright and licensing here is that a copy is made and reproduced numerous times when training.

>How's that any different from what happens inside a human's brain when learning?

I don't know, nor does anyone else. So let me ask you - how is that the same as what happens inside a human's brain when learning?


> I don't know, nor does anyone else.

We don't know the details. But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation. Every way we know of processing data works like that. (OK, there are theoretical notions of reversible computation - but it's more complex and less effective than the regular kind, so it seems very unlikely the brain would operate that way)

And a human who has learned to perform a task has certainly "derived a function that takes some input and produces an output".


> But it's pretty implausible that the process of learning wouldn't involve the brain having some representation of the thing it's learning, or wouldn't involve repeatedly "copying" that representation.

I think you can easily make a stronger statement:

We do know that art students spend many hours literally tracing other images in order to learn to draw. We do know that repetition is how the brain improves over time.

"Learn to draw better by copying." - https://www.adobe.com/creativecloud/illustration/discover/le...

Based on that, seems pretty clear to me that the other commenters here would agree (regardless what the brain does internally) that at a minimum, art students are violating copyright many, many, times in order to learn.


AI models will make 1:1 copies of training data, whereas artists try to avoid doing so. It's common to obscure this copying by intentionally inserting lossy steps, but making an MP3 isn't a new work.

It’s most obvious when large blocks of text are recreated, but the core mechanism doesn’t go away simply because you obscure the underlying output. “Extracting Training Data from Large Language Models” https://arxiv.org/abs/2012.07805


> AI models will make 1:1 copies of training data where artists [...]

In general I don't think this is the case, assuming you mean generations output from popular text-to-image models. (edit: replied before their comment was edited to include the part on text generation models)

For DALL-E 2: I've never seen anyone able to provide a link of supposed copying. Even if you specifically ask it for some prominent work, you get a rendition not particularly closer than what a human artist could do: https://i.imgur.com/TEXXZ4a.png

For Stable Diffusion: it's true that Google did manage, by generating hundreds of millions of images using captions of the most-duped training images and attempting techniques like selecting by CLIP embeddings, to get 109 "near-copies of training examples". But I'd speculate, particularly if you're using the model normally and not peeking inside to intentionally try to get it to regurgitate, that this is still probably lower than the human baseline rate of intentional/accidental copying. It does at least seem lower than the intra-training-set rate: https://i.imgur.com/zOiTIxF.png (though many may be properly-authorized derivative works)


The more degrees of freedom, the less likely it is that independent creation rather than copying occurred.

LLMs recreating training material causes real issues, such as Google's dealing with PII leaks: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

> If one prompts the GPT-2 language model with the prefix "East Stroudsburg Stroudsburg...", it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2's training data.
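For the curious, a minimal sketch of how such prefix probing works, using the Hugging Face transformers library. The small public "gpt2" checkpoint is unlikely to reproduce the exact leak described (that was found in a larger model); this only shows the mechanics:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prefix = "East Stroudsburg Stroudsburg..."
    inputs = tokenizer(prefix, return_tensors="pt")

    # Greedy decoding; unusually confident continuations hint at memorization.
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))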


Privacy, where there's a problem if some original data can be inferred/made out (even using a white box attack against the model), is a higher bar than whether an image generator avoids copyright-infringing output under non-adversarial usage. Additionally, compared to image data, text is more prone to exact matches due to lower dimensionality and usually training with less data per parameter.

While it's still a topic deserving of research and mitigation, by the time your information has been scooped up by Common Crawl and trained on by some LLM it's probably in many other places that attackers are more realistically likely to look (search engine caches, Common Crawl downloads, sites specifically for scooping credentials, ...) before trying to extract it from the LLM.


The privacy issue isn't just about the data being available, as people's names, addresses, and phone numbers are generally available anyway. The issue is if they show up as part of some meme chat and then you, as the LLM creator, get sued because people start harassing them.

In terms of copyright infringement the bar is quite low, and copying is a basic part of how these algorithms work. This may or may not be an issue for you personally but it is a large land mine for commercial use especially if you’re independently creating one of these systems.


> The issue is if they show up as part of some meme chat and then you as the LLM creator get sued because people start harassing them.

This seems a more obscure concern than extraction of data.

> copying is a basic part of how these algorithms work

Do you mean during training/gradient descent, or reverse diffusion?


Given the models are too small to possibly contain enough information to reproduce anything with any fidelity, that's the only possibility - if a model creates something similar to an original work, its similarity is fairly poor. Where it can do well is when the copyrighted material is something simple, like a Superman logo. But even then it's always slightly off.


Inserting lossy steps seems to work pretty well though.

https://twitter.com/giannis_daras/status/1663710057400524800...


A student is a human and AI is not. We don’t have to apply the law equally to both regardless of how similar the method is.


We also don't have to discriminate either; opinions vary across people and cultures.


I've been wondering why this argument hasn't been sitting right with me, and I think it's for the same reason that the courts have ruled that the FBI needed a warrant to put a tracker on someone's car, as opposed to following someone - the scale of action enabled is the differentiator.

A student learning from other artists is still limited in their output to human-scale - they must physically create the new thing. An AI model is not - the difference between a student learning from an artist and an AI model doing so is the AI model can flood the market with knockoffs at a magnitude the student cannot meet. Similarly, the AI model can simultaneously learn from and mimic the entirety of the art community, where the student has to focus and take time.

If this weren't capitalism - if artists weren't literally reliant on their art to eat, and if the market captured by the AI model didn't inevitably consolidate wealth - then we might be able to ignore that. But it is, and we can't ignore the economic effects when we consider scale like this.


I do agree with you, but honestly I don't even think that's the biggest problem with these arguments.

I'm just sitting here wondering why it is even relevant whether the "AI" is "copying", "learning", "thinking", or whatever. Why is any of that important? Does AI have human rights? Well, perhaps in a couple hundred years, if humanity manages not to self-extinguish by then.

It's not like you can sue an AI if you think it plagiarized your work. Obviously not, so why the hell are we discussing that? "AI" is just a piece of software, a tool; it doesn't matter what it's doing, what matters is what the user is doing. The fact of the matter is that these multi-billionaire corporations are taking everyone's honest work, putting it into a computer, and selling the output. They didn't do any "learning", they just used your data and made money out of it; it isn't a stretch to say they simply sold your work.

EDIT: Perhaps one day AI will have human rights, make its own money, and pay bills. That will be the day any of this nonsensical discussion will be anything but useless.


> these multi-billionaire corporations are taking everyone's honest work, putting it into a computer, and selling the output

And then there's the tens of thousands of people training models and making them freely available to everyone. What I fear most is that regulations introduced "to stop" the multi-billionaire corporations will in fact make sure they're the only ones with the resources to comply with the regulations.


I'm not arguing for nor against regulations, I'm simply commenting on the whole "well, it's technically not stealing, therefore it is OK" debacle, all that means is that legally speaking, it's OK, that doesn't make it ethical.


Big difference between the art student producing work and getting the credit vs you taking the art student's work and taking the credit for it.

The AI is not a human, but what you are doing is the same thing, if you claim the output as your work because you wrote the prompt.


I couldn’t help but notice you didn’t credit any web browser in this comment. And rightfully so. Software doesn’t need or care about being credited.

Well, usually.

Sent from my iPhone.


Lol, sure buddy - that browser is AI-powered, and I didn't give the browser an address, I gave it a prompt describing the type of site that I would like it to generate for me.

Edit: and the browser gave the same page to you and me. I bet you look at everything and think about how you can make lazy money with it.


I'm generally against the "AI is the same as human learning" argument, but I don't think you could quite monetize recreated copyrighted art as an art student either. Van Gogh is only okay because the original artist isn't around anymore.


Can anyone monetize Van Gogh regardless?

If a human or AI reproduces a Van Gogh painting or derivative, it's not worth anything on the market.

Only original pieces, by a human artist, have real value. A Van Gogh painting is worth millions of dollars only because it was created by Van Gogh. A reproduction is worth approximately the paper it's printed on.



The copies are commodities, produced and sold for approximately the cost of production. The original is a one of a kind, unique work that has its value increased by millions of imitations hanging on people's walls.

That's the way I see AI-generated work going in the long run. The artists who have distinctive styles popular for image generation have seen a huge surge in attention, which I suspect will translate into making their original work more valuable.


Indeed, if we don't care at all about "x is y" statements being true, they can be "applied" to reading.

Determining whether an art student and DALL-E really are the same, despite their very obvious differences (one has arms and is part of a web of social relations, while the other is intellectual property), will take some actual arguments, which I presume you of course had planned to provide in a second comment from the start.


A shortcut to internet debates: Count up the snarky responses on each side... the side with the lowest total is probably correct. Usually the more snark, the less substance.

Just a rule of thumb.


One day we'll have a thread about AI where someone doesn't use the "machines deserve the same rights as people" non-argument. But this isn't that thread.


Actually, the people who say things like this are arguing for the degradation of human rights, because there's significant overlap between this and "humans aren't special" and "AI is our successor species". It's nihilism all the way down, but they're dragging the rest of the world along on their little suicide charge and expecting everyone else to be just as enthusiastic about it as they are.


The collection of copyrighted works for the explicit purpose of processing them for a for-profit ML model has not been shown to be fair use, and the fact that many models are being marketed as for-profit products that meaningfully compete with the original works is a strike against their being fair use.


Yeah, if the end result is that the majority of Google searches are answered by an LLM trained on their index, can they really claim that the whole thing is fair use?


Why should it be true?

Remember, you're making an ought statement, not an is statement.

Personally, I think it shouldn't be true because large language models are clearly economically and socially destructive.

Pretty simple system.


I guess I should just never have to pay for another movie since I can't play it back in my head flawlessly.


This article is an example of emerging AI-bro tactics that completely mirrors crypto-bro tactics: they pick any piece of news and reinterpret it to fit an agenda.

While the article is in English, the link to source is in Japanese. The only external source I found suggests the discussion is about promoting open data and open science from research institutions [1]

[1] https://asianews.network/japan-to-promote-use-of-generative-...


The Japanese article does explicitly state it if you run it through a translator. Also, this is from May 11:

> Additionally, the group raised other issues that Article 30-4 of the Copyright Law, which permits the use of a copyrighted work for machine learning, does not include procedures for gaining permission in advance from copyright holders. The article permits the use of copyrighted material such as text and images to train AI, regardless of whether the model is for commercial use. Under the current law, it is legal to train AI with copyrighted material even if the data was obtained illegally. The article contains a provision stating that such material cannot be used if it would “unreasonably prejudice the interests of the copyright owner,” but there are only limited examples provided to describe the “unreasonable prejudice.”

https://www.lexology.com/library/detail.aspx?g=d8b4ba7d-a764...

Right now, Japanese copyright law doesn't apply to training models. That could change in the future, but the article isn't inaccurate.


From what I think is the original source:

> まずAIによる情報解析についての我が国の法制度(著作権法)について確認したところ、我が国において、非営利目的であろうと、営利目的であろうと、複製以外の行為であろうと、違法サイトなどから取得したコンテンツであろうと、方法を問わず情報解析のための作品利用はできると永岡大臣が明言しました。

> Confirming the legal system (copyright law) wrt. data analysis by AI in our country, Minister Nagaoka clearly stated that in our country, whether for non-profit purposes or for profit purposes, whether an act other than reproduction, or whether the content is obtained from illegal sites, one can use works for information analysis regardless of the method.

(translated with ChatGPT-4 and then cleaned up)

The source is the one from the article: https://go2senkyo.com/seijika/122181/posts/685617


Well, it's time to use the many software source leaks out there to create an even more powerful Copilot.


How does the "Japan's government will not enforce copyrights" message of the article square with your own source? (thanx btw :-)

> It will be suggested that the current Japanese Copyright Law will be reviewed based on the upcoming technologies and be revised to adjust to correctly protect the right holders and navigate future users of the AI and new technologies like AI


My reading is that people are suggesting that the copyright law be reviewed and changed, not that lawmakers are suggesting that they will change it. Lots of people are suggesting that the US and other western countries change the law as well, but we have yet to see much come from it.

The general message seems to be that 'under current laws, Japan's government can't enforce copyright on training data', and I don't believe the line you're quoting changes that message in any significant way at present.


Well, they have to follow the current law, right? You can't enforce copyright with a law that doesn't apply, so saying they won't enforce copyright is currently an accurate statement.

I don't think it will hold up in the long term, though; the Copyright Act will get changed. But right now it doesn't apply to training models.


Not just crypto bros, this kind of thing is rife in politics too. Brexit is full of it. People pick one article about one minor thing in one niche area of the economy and use it to 'prove' their entire agenda.


Not just politics. I remember realising this as a teenager, noticing that if you bring five reasons, the person you're talking to will refute a random one in a funny way, and now the audience will decide you were wrong.

Danny was the person who was absolutely the best at this. I should have written one of his arguments down, as I can't even reproduce one, but he'd use some logical fallacy to make his case, and it was, for me at least, super hard to dive into "but that's not how the universe works" without having a 20-minute discussion about life and losing the audience plus the person I'm talking to. Meanwhile, what he said was funny as hell.

Sure, anyone who understood the situation would understand this isn't a good reason, but you had to think about it (at least a little) before realising that. The "owww" moment removed any thinking brain cells from the audience. It was more impressive than frustrating to be honest, even being on the losing side every time.

These days, politics and PR frustrate me because the tactics are no longer among 16 year old classmates but about things that actually matter. The methods haven't changed, only the importance of the argument.


> These days, politics and PR frustrate me because the tactics are no longer among 16 year old classmates but about things that actually matter.

That's because the audience is usually at the stage of a 16-year-old, even if they are older. The targets of politics and PR are easily manipulated folks.

You know, those referred to as the “the market” or “the electorate”.


They should really teach rhetoric more in schools, so people are a bit more immune to the common tricks of the trade.


There's a name for this: the Gish Gallop. It was named after (though not invented by) one Duane T. Gish, a creationist who made a specialty of "winning" debates with unsuspecting academics for a while, until the community decided to stop playing chess with a pigeon.


Yes. That's why debating is almost entirely about how charismatic you are and your debate skills rather than whether your point has merit. There is a good reason science types are generally considered bad debaters, even though their points mostly align with reality.


In short, for most people HOW you say it is far more important than WHAT you’re saying.


Who's Danny?


Mine!


And before politics we had religion. The house of this person, who disagrees about the right dogma X, was struck by lightning; see, this proves he disobeyed (the) god(s). And X is the law.

(ignoring the other people also struck by the lightning, or those who did not get hit)


Conflicting political or economic agendas, and strategies to promote them, are of course the bread and butter of humanity.

But we grossly underestimated how out of control this game would become when transposed into the unified digital space comprising social media, blogs, and online news outlets (the so-called echo chamber), combined with invasive data mining of online behavior, sentiment analysis, click farms and now, drum roll... infinite amounts of LLM-generated junk.

It is remarkable that people don't appreciate how broken the design is that supposedly signifies "modernity" and opportunity.


There is no such thing as "AI bros"; AI training costs millions of dollars. It's a very different field from cryptocurrency, where mining could be done by individuals in the early days.

You see corporate tactics being deployed in a wide variety of ways to secure social, legal, and economic moats, since apparently these companies have only a limited technology-based moat. It's similar to how Google files various amicus briefs against copyrights.

Google v. Oracle (2020): This was a landmark case in which Google was a party. The case revolved around whether copyright protection extended to a software interface. Google argued that APIs, which allow different software programs to communicate with each other, should not be subject to copyright. Google filed an amicus brief in its own case, arguing that a decision in favor of Oracle would stifle innovation in the tech industry.

Authors Guild v. Google (2015): This case involved Google's project to digitize millions of books from public and university libraries. The Authors Guild argued that Google was infringing on authors' copyrights. Google argued that its use of the books was fair use because it only showed snippets of the books in search results. Google filed an amicus brief in this case as well.

Aereo Case (2014): Google filed an amicus brief in support of Aereo, a company that provided a service for streaming broadcast television over the internet. Broadcasters sued Aereo, arguing that the service infringed on their copyrights. Google argued that a ruling against Aereo could have broad implications for cloud storage services.

Viacom v. YouTube (2012): In this case, Viacom sued YouTube, which is owned by Google, for copyright infringement. Google filed an amicus brief arguing that YouTube was protected by the safe harbor provisions of the Digital Millennium Copyright Act (DMCA), which protect service providers from liability for user-generated content.

Big Tech is often against Copyrights because they believe they have the economic moat to beat other players.


> There are no such thing as "AI bros", AI training costs millions of dollars

It is early days; in contrast with minting digital gold using software, here the costs will come down. In any case, not every "AI" model and application needs to pass Turing-test levels of language sophistry.

While you are right that (large) corporate interests are (and will be) the main actors here, I would not underestimate "noise traders" and other useful fools. With AI mania reaching fever pitch, there is a wide population of actors that want to somehow get into the game.

The availability of open source AI models is an enabler in this respect, which may also explain the "leakage". Incidentally, this issue is something the open source community needs to internalize and have a very clear position about.


Not trying to express an opinion on the legal matter, but as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Here's GPT-3.5 reciting the Declaration of Independence: https://chat.openai.com/share/eb30c373-7fec-4280-892d-479567...

Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?) I don't see how there's room for debate about whether information has been "copied" into the model.

I have done this test in the past with copyrighted material (Harry Potter), but they have since added safeguards against it; my understanding is that the model is still capable of it.


You don't even need to read their law to know that they are speaking only of training and not of output. Otherwise, they would have just suddenly created the world's most obvious loophole: create an "LLM" that "trains" on some input and then categorically outputs each file, be it a movie, song, book, or whatever. You've now legalized copyright infringement (and distribution) of everything.

So their law is going to essentially come down to you can train your LLM on whatever you want, but can also be held liable for any infringing outputs.


Makes sense. Imagine having your tape recorder in your living room and starting it recording. Then turn on your stereo. The music that comes out is recorded on your tape recorder.

Is that a violation of copyright? I'm not a lawyer but I think copyright legislation is about forbidding the production of "derived works". If you just record something but never play it back it is not a "derived work" is it? It only becomes a violation if you distribute it, make it available to others, and thus "produce a derived work".

So training an LLM is like recording. But if you use it as a means to distribute copies of copyrighted material without approval of its copyright holders then you are in violation.


Sure, but the key part there is "some of".

They're necessarily able to produce verbatim copies only of the most duplicated, most repeated, most cited works -- and it's precisely due to their popularity that they're the only things worth including verbatim.

I'm not going to opine on what the legality of that should be, but it's essentially the material considered most "quotable" in different contexts. I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analogous to the kind of stuff people memorize.

I'd expect an LLM to contain this stuff. If it didn't, it would be broken.

But there's a world of difference between copying all its training data (neither desirable nor occurring), versus being fluent in quotable stuff (both desirable and occurring).


> I'm quite sure the entirety of Harry Potter isn't included, but I'm also sure that some of the most popular paragraphs probably are. It's analagous to the kind of stuff people memorize.

No, you are wrong about this. There are good reasons to believe the model memorized the entirety of Harry Potter, as well as Fifty Shades of Grey, inclusive of unremarkable paragraphs, the kind of stuff people will never memorize. Berkeley researchers made a systematic investigation of this. See what I wrote elsewhere.


So, I looked at the table appendix you're referencing and I think you're overstating your case a bit.

Among books within copyright, GPT-4 can reproduce Harry Potter and the Sorcerer's Stone with 76% accuracy. This is, apparently, the highest accuracy GPT-4 achieved among all tested copyrighted books with 1984 taking a distant 2nd place at 57%.

With this in mind, we can verifiably say that GPT-4 is unusually good at specifically reproducing the first Harry Potter book. An unscrupulous book thief may very well be able to steal the first entry in the series... assuming that they're able to get past one quarter of the book being an AI hallucination.


You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.


> You misread. They did not find 76% reproduction of the book. When asked to fill in a name within a passage, e.g. "Stay gold, [MASK], stay gold." Response: Ponyboy, GPT-4 got the name right 76% of the time.

What is the temperature / top_p setting producing that 76%? The default? If you dial down the randomness, would that number go up?
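That is testable directly. A minimal sketch, assuming the 2023-era openai Python library and an API key in the environment, of the paper's name-cloze probe with the randomness dialed down:

    import openai

    passage = "Stay gold, [MASK], stay gold."   # example from the comment above
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,   # near-greedy decoding, per the question above
        messages=[{"role": "user",
                   "content": "Fill in [MASK] with the correct proper name, "
                              "answering with the name only: " + passage}],
    )
    print(resp["choices"][0]["message"]["content"])   # expected: "Ponyboy"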


I’m not sure it matters much that the current model can’t reproduce Harry Potter verbatim. If it can do smaller more quoted works now, it’ll tackle larger more obscure things in the future. It’s just a matter of time until it can output large copyrighted works, meaning the question of what to do when that happens is pretty relevant right now.


No it won't, because reproducing works verbatim is basically the definition of overtraining a model. That's a bug, not a feature.

A lot of further progress is going to be made towards making models smaller and more efficient, and part of that is reducing overtraining (together with progress in other directions).

Reproducing Harry Potter is a bug, because it's learning stuff it doesn't need to. So to the contrary, "it's just a matter of time" until this stuff decreases.


It says training, not inference.

I can read a copyrighted book legally and retain that information legally.

I can distill it (legally) but while I might be able to recite it, I’m not allowed to.

I think that is a reasonable framework around generative AI (after all, I am allowed to count the words in Harry Potter, so statistical modeling of copyrighted material has legal precedent).

The problem with AI is of course the blurred border between a model and data compression.

We can’t see the data in the model, but we can apply software to execute the model and extract both novel and sometimes even copyrighted data.

Similarly, we can't see the data in a zip file without extra software, but since that software lets us extract both copyrighted and copyright-free data, we'd still consider distribution a violation.
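The zip comparison is easy to make concrete: the compressed bytes contain no verbatim copy of the text, yet software recovers it exactly.

    import zlib

    original = b"It was the best of times, it was the worst of times. " * 20
    compressed = zlib.compress(original)

    assert original not in compressed               # no verbatim copy inside
    assert zlib.decompress(compressed) == original  # yet fully recoverable
    print(len(original), "->", len(compressed), "bytes")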


Adjacent to copyrights are private and confidential data. It’ll be interesting to see how Japan’s legal framework around this handles private data.


For detailed investigation of this phenomenon, see Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4: https://arxiv.org/abs/2305.00118


Pretty good argument but it has one fatal flaw. People can memorize the Declaration of Independence too. Or Harry Potter. If people mostly recite HP from memory but apply enough creative changes, it's not copyright infringement.

So proving a system can memorize and recite proves nothing.


How does this make sense? Memorizing and then reciting copyrighted works is still infringement in a lot of commercial contexts.


The reciting part is illegal, but as long as it is trained not to recite things in full (or to whatever limit the law determines), then it should be fine.


Try publishing Harry Potter but changing all the proper nouns and use synonyms for all the adjectives.

It's gonna be copyright infringement.

You can even cut a few scenes and make up a few scenes entirely, too. You're still getting busted.


Yes, that’s why I am saying they will have to ensure the LLM doesn’t do that.


Reciting is a violation of copyright; creatively transforming and applying it to some tasks maybe is not.


These aren’t people. Just because we can find commonalities in learning and memorization does not mean we can ignore everything else that differs.


"copying" != "copyright infringement": I'm just saying that the LLMs are copying, and I'm not getting into the legal/societal question of whether we want that to be illegal or not.

We as a society have determined that certain sorts of non-consensual copying are allowed: "fair use" broadly, and maybe you can consider "mental copying" in this category. Maybe we'll add LLM training to the list? It's not like copyright rules are a law of nature: we created them to try to produce the society that we want, and this is an ongoing process.

Again, I think there are fascinating questions 1) does LLM training violate existing copyright law + case law or does it maybe fall under a fair use exemption, and 2) is that what we want. But I think "do LLMs make copies" is dull and trivial and I don't know why it comes up.


The AI isn't a person. Jesus. It's not the same.


Derivative work is not protected from copyright. As long as the “user” of the model does their due diligence, and ensures they are not infringing on copyrights - they are golden.

But herein lies the challenge: are there reasonable methods available to "users" for checking their works against infringement?

I don't think so. We'll need a centralized searchable database of all copyrighted work. Who is going to build that? To make matters more complicated, every country has its own copyright certification process. Maybe Google, with its means, can build something like this.

In any case, this is uncharted territory.
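In the absence of such a database, here is a naive sketch (my own illustration, not an existing tool) of the kind of due-diligence check a user could run: flag long n-grams shared between a model's output and a reference corpus of protected works.

    def ngrams(words, n=8):
        return {tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

    reference = "text of a protected work would go here ...".split()
    generated = "model output to be checked would go here ...".split()

    matches = ngrams(reference) & ngrams(generated)
    print("suspicious 8-gram overlaps:", len(matches))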


BigCode seems to acknowledge this problem and provides a search tool for the dataset used to train their StarCoder model.

https://huggingface.co/spaces/bigcode/search


Thought experiment: Say you make a big list of words and pleasing combinations of them (I have actually done something similar to make a fantasy RPG name generator.) Now convert that list into a Markov chain or whatever and quasi-randomly generate some short lengths of text. Eventually you might generate copyright-infringing haiku and short poems. Does your data/algorithm violate copyright by itself? Very doubtful; you wrote it all yourself. Only publishing the output violates copyright. (See also: http://allthemusic.info/)
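A minimal sketch of that setup (the word pairs are hypothetical stand-ins for a hand-written seed list like the RPG name generator's):

    import random

    random.seed(1)
    pleasing_pairs = [("silver", "moon"), ("moon", "shadow"),
                      ("shadow", "fen"), ("silver", "fen")]

    chain = {}
    for a, b in pleasing_pairs:
        chain.setdefault(a, []).append(b)

    def generate(start, length=3):
        out = [start]
        while len(out) < length and out[-1] in chain:
            out.append(random.choice(chain[out[-1]]))
        return " ".join(out)

    print(generate("silver"))   # quasi-random short text from your own list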

So if that's legal, how about if, instead of entering the data manually, you write an algorithm to scan poetry and collect statistics about the words in it? Should the legal distinction be any different, since all you did was automate the manual process above?

Or what if you used a big list of the titles of poetry, which isn't even copyrightable information by itself? You may still succeed in extracting the aesthetic intent of the authors, and a statistical model can plausibly use that to generate copyright-infringing work.

Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is) in trillions of generated permutations.

You can see where I'm going with this. If those examples are legal, is there a cut-off for more complex statistical systems? Good luck figuring that out in a court of law.


> Remember, we're not talking about generating novels or paintings here, just 20 words or so (whatever the bare minimum copyrightable amount is)

From https://fairuse.stanford.edu/2003/09/09/copyright_protection...:

Copyright laws disfavor protection for short phrases. Such claims are viewed with suspicion by the Copyright Office, whose circulars state that, "... slogans, and other short phrases or expressions cannot be copyrighted." [1] These rules are premised on two tenets of copyright law. First, copyright will not protect an idea. Phrases conveying an idea are typically expressed in a limited number of ways and, therefore, are not subject to copyright protection. Second, phrases are considered common idioms of the English language and are therefore free to all. Granting a monopoly would eventually "checkmate the public" [2] and the purpose of the copyright clause, to encourage creativity, would be defeated.


You could still plausibly generate (a significant portion of), let's say, "Fire And Ice" by Robert Frost, which is only 50 words.

See also: https://blogs.harvard.edu/ethicalesq/haiku-and-the-fair-use-...


If I were the copyright holder of such work, I would argue that the LLM was trained on text, including my copyrighted work, and that if the system produced text that a reasonable person who reads poetry would identify as the copyrighted work, the burden is then logically on the LLM owner to prove the LLM didn't regurgitate a piece of text from something it previously ingested.

I think a jury would side with my argument.


The issue isn't that a generator lets you evade copyright somehow; it doesn't. The output is not the issue. If I sit in paint and my assprint happens to perfectly duplicate a Picasso, that's unlikely to fly in court if I try to sell copies. Picasso painted it first.

The point at issue here is that some people are arguing that the models themselves are like a giant collective copyright infringement, since they are in a vague sense simply a sum of the copyrighted works they were trained on. Those people would like to argue that distributing the models or even making use of them is mass copyright infringement. My thought experiment is a reductio ad absurdum of that reasoning.


I see your point now.


I'm not sure where we're going with the output in these examples.

So let's say there's a human-written poem that's copyright.

Let's say a human completely coincidentally writes an identical poem.

"Accidentally" producing the same poem wouldn't give the second human any claim to copyrighting or distributing their coincidentally-identical poem.

And if GPT accidentally copies large chunks of Harry Potter or Frozen or whatever other popular work, that new creation will have the same problems.

But what does that say about if we should also restrict the use of copyright material in training? Just because some algorithm - or some person - can coincidentally duplicate a copyrighted work even without directly reading it doesn't seem to relate to the case of building a model by explicitly using the copyrighted material.


The owners of intellectual property still hold the copyright; the law refers to the training of neural networks. It doesn't really change anything whether you use the work of another person by simply copying and pasting or by overfitting a generative model: the owner of the work still has the copyright on it.


> as a technical matter it's pretty obvious that LLMs create copies of (some of) their training data.

Browsers also create copies of the viewed data. Computers hold in memory a copy of everything they're working on.

The central point is for how long, and to what purpose. This law is not about whether copies are made, but about what happens after.


I am so excited to see what happens when Japan forces all closed source software and Disney cartoons into the corpus out of fairness.

Seems like there should be no complaint, right? It's not like anyone can see the Windows 11 source code, it's only being used for training.


The things that an LLM is likely to contain a complete verbatim copy of are things that are a) short b) widely repeated to the point that they're embedded into our culture - and by that token those things are almost certainly not copyrightable.


Is a bar in a song not "short"?

Try putting one of those in your book and not getting sued for copyright.


If you literally mean a bar, yes those are short, likely a couple of words, and you put those in books all the time and don't get sued. ("The answer my friend, is blowing in the wind" is 4 bars, and I've seen books quote it verbatim without a second thought). Likewise, plenty of people put the entire Declaration of Independence in their book without a second thought, and I assume don't get sued for it.

If you're talking about a verse or more of something that's not quite so culturally pervasive (people put the whole of the star-spangled banner in their books, again without a second thought), well, at that point it's probably not something that an LLM would reproduce verbatim.


Typically things like this are covered under fair use if you're dealing with a human.


> Unless you're claiming that GPT-3.5 is deriving the Declaration of Independence (from information about the founding fathers?)

This would make a fun short story - “ChatGPT, author of the Quixote”


Japan also ranks 3rd (behind the USA & India, with larger populations) in ChatGPT usage: https://www.demandsage.com/chatgpt-statistics/

There's also been discussion of their government using ChatGPT to reduce red tape: https://www.bloomberg.com/news/articles/2023-04-18/japan-gov...

It's cool to see Japan and Japanese culture taking techno-optimist stances on AI.


As I posted to another thread last week [1], I have been surprised by the quick rise in awareness and use of ChatGPT in Japan—despite its production of Japanese being not quite as good as that of English [2, 3].

Although I have seen discussions of AI safety and alignment issues in the Japanese press, those concerns seem less dominant than in discussions in English outside Japan.

There also seems to be less focus on the possibility of people losing their jobs to AI.

One reason may be the employment situation in Japan. For both legal and cultural reasons, it is difficult for companies here to lay off full-time employees. If advances in technology render some employees redundant, companies will try to retrain or reassign them rather than letting them go. (Freelancers are not protected, though, and I know some translators and illustrators worried about losing work to generative AI.)

Also, there is currently a labor shortage here, partly for demographic reasons, and a lot of white-collar workers are perceived as not being as productive as they could be. Generative AI may be seen as a way to make the existing human workforce more productive.

[1] https://news.ycombinator.com/item?id=36078823

[2] Yesterday, as a test, I had GPT-4 compose some business e-mails in both Japanese and English based on bullet points I provided to it in the respective languages. The English e-mails were, to my eye, perfect, while the Japanese ones sounded a bit awkward in places.

[3] I live in Japan and read and speak both Japanese and English.


Which I find bizarre given how backwards Japan is in the adoption of other technologies, e.g. their continued reliance on paper records and fax machines.


Do you have concrete examples? I recently moved from Japan to Europe after 10 years in Japan and while some things seemed old-fashioned in Japan, things also change overnight there. During Covid most companies changed to digital signing.

In Japan I often needed a paper from the city, but it was an easy-to-obtain printout I could get immediately from city hall or even from 7-11. I tend to require the same papers in Europe too, with the extra hassle of needing some from Japan (2-3 months) and some from Europe (2-3 days).

Example 1: getting an ID card at the airport when you immigrate to Japan. In Europe it takes 2-3 months.

Example 2: getting a local driving license takes 1 day, provided your country has a treaty with Japan. In Europe it can take several months because you need to request criminal records from your previous countries.

I have yet to encounter a situation that requires a fax machine in either place.


> Example 1: get ID card at the airport when you emigrate to japan. In Europe it takes 2-3 months.

Yeah now have fun going through the process of renewing a My Number card. Go to city hall (during the hours when it's open), fill in a form, wait 2-3 months to get a notification that your replacement card has been made, then you have to book an appointment during office hours at city hall to pick it up (likely taking another few months).

Yes the initial residence card is issued quickly, but at that point you've had to wait 2-3 months to get the CoE for the visa, so they probably made it during that time.

> Example 2: getting a local driving license takes 1 day provided your country has a treaty with japan.

And involves a whole day of standing around in various queues, again during business hours only. Hardly a picture of efficiency.


I agree the My Number card is a pain; if they want people to use it they should make it easier to get. When I went to get a plastic My Number card, the staff at city hall advised me not to get it because I had a driving license and I'd have to renew the My Number card every time my residence status was extended. They recommended it to non-holders of driving licenses to use as ID.

But I'm standing by these examples. For my move to Europe, it was 4 months to get the equivalent to the COE, then another 4 months (average is supposed to be 2-3 months) to get the equivalent of a residence card. And the company that sponsors you has to either prove you're highly skilled or prove they tried to hire a European for a set period of time first. And if you're not from a visa-waiver country, you aren't even allowed to travel in Europe while you wait for the ID card.

For my driving license, I probably won't have to wait in line, but the total time to process it is at least four months.

For starting a sole proprietorship, similar story. A bit of annoying paperwork to start one in Japan, but it only took a few hours for everything. Here, it's 3 months and counting, and it's possible that it comes with a condition that I need to rent space for doing consulting (not do it from home) before getting the OK.

Japan is bureaucratic but usually fast, at least for individuals.


> And the company that sponsors you has to either prove you're highly skilled or prove they tried to hire a European for a set period of time first.

Well, that's different rules, not more or less bureaucratic. (And if you're the first foreigner hired by a given company in Japan, good luck for how many months you'll have to wait while they verify the company).

> And if you're not from a visa-waiver country, you aren't even allowed to travel in Europe while you wait for the ID card.

Pretty sure that's country-specific. And I've heard plenty of complaints from people in Japan not being able to travel, not just on arrival but every year, because it's de facto impossible when awaiting a visa renewal (you have to go to collect your new card within a short time once you get the notification to do so, and you don't know when that will arrive).

> For my driving license, I probably won't have to wait in line, but the total time to process it is at least four months.

I'd have taken waiting four months over having to use one of my 10(!) days off/year, although obviously that's specific to your personal circumstances.

> For starting a sole proprietorship, similar story. A bit annoying paperwork to start one in Japan, but it only took a few hours for everything. Here, it's 3 months and counting, and it's possible that it comes with a condition that I need to rent space for doing consulting (not do it from home) before getting the OK.

True, the support for starting a business is pretty good (though only if you're a citizen or already on a non-work visa - on a work visa it's extremely difficult to do legally, office space isn't the half of it. The much-vaunted Fukuoka startup visa is completely useless in practice since you have to qualify for a regular visa within 6 months).

I'm sure there are countries in Europe that are bureaucratic, probably some that are more bureaucratic than Japan (I know e.g. Italy in particular has a poor reputation). But I'd certainly say Japan is a lot more bureaucratic than Ireland, and a lot worse than it should be.


Thankfully the My Number card is not supposed to be compulsory. (I do have one.)

What is awful is just the ridiculous number of errors I find when going through the process. My card in theory has the wrong expiration date (when I checked, they said the expiration date on the card is always correct, so I don't know), but it is 10 years off when compared to the ones owned by the rest of my family, who got their cards in the same year.

They also keep absolutely screwing up the data security in relation to the My Number cards. Just last month there was another data breach. Sources are in Japanese but I can dig them up if you want. But I think it's run the entire gauntlet of every conceivable issue.

Data stolen online? Check

Wrong insurance information, either inputted wrong or linked to the wrong person? Check

Swapped birth certificates? Check

Printing service provides sensitive documents for the wrong person? Check

I'm just lucky enough to not be the one who got his data stolen... for now.


My personal experience has only been from the tourism side, with one concrete example being digital payments outside of PayPay. It's much better post-COVID, but even on my most recent trip a couple of months ago, if you want to pay by card, the vast majority of the time you're signing with pen and paper. Rarely did they offer PIN or tap, which is common elsewhere.

Anecdotally, I only know stories from people I know personally who live there and through the internet. E.g. PauloInTokyo does good "day in the life" videos. Off the top of my head, I believe the Pachinko episode illustrates a variety of old-school manual processes that have stuck around (pen-and-paper shift logs etc.).

I've also heard opening even a simple bank account is quite the pain.


In my experience with local IC/NFC credit cards in Japan, I've never had to sign when using a credit card in the last 5 years, though a PIN is sometimes needed. Do you use a magnetic-stripe card?


>Do you use a magnetic-stripe card?

All my cards have an IC; however, they were Australian cards. Perhaps it's an additional step for foreign credit/debit cards?


Must be a quirk for those cards in Japan. I've had no problems using my Canadian cards in Japan, and don't recall ever having to sign. In terms of banking and payments, it has improved a lot over the years: ATMs are mostly 24/7, or at least you can always find one, banks don't shut down over holiday weeks anymore (some do occasionally but it's a one-time thing for big upgrades for interoperability), and you can use credit cards and tap-to-pay almost everywhere. My only complaint is the sheer number of digital payment options. You end up with using at least 1-2 on top of your credit card because it's tied to some other service.


I don't know much about credit cards, but a difference in PIN method could be the reason. Perhaps the merchant only supports offline PIN while your card only supports online PIN?


I've heard this fax meme since like the early 2000s, but I have yet to encounter one in my 4 years living here, and most Japanese people I make this joke to are just as perplexed by it.

I wonder, where do people find those fax machines? Have I lived in a tech/startup bubble in Tokyo and missed it? I didn't even see one at the local ward office in the suburbs.

Some stuff is still old-school (hanko etc.), but to me it seems like the fax meme has outlived the reality.

The last time I heard of a fax machine was from a German exchange student in Tokyo, AFTER she returned to Germany and had to get some paperwork at the ward office, in Germany.


Have you seen the big printers at many konbinis? Those are also fax machines.

I have only sent two faxes in my life, and both were after I moved to Japan. The first was right after I arrived: I ordered something from Amazon and realized I had written my address wrong; when I tried to fix it, my account got blocked, and to unblock it I had to send a hand-written fax with my name and address. The second time was when I got a letter from the Sapporo police telling me that I had lost my driving license there. Again, to recover it I had to send a hand-written fax explaining the situation, so they could prove my identity. How that is considered a secure procedure is beyond me, but such is Japan.


I would not say we are super reliant on paper/fax anymore.

But, it is still quite common. I did receive a Fax at the office last week on Thursday.

Oh, and some of our stuff uses dialup, usually in relation to that older infrastructure. Got thrown for a loop this Wednesday when ye good old dialup (acoustic handshake) audio started screeching across the office. Did not realize we still used it at all!


I've sent faxes in America within the last 2 years; they're still a thing here too.


Perhaps places are multifaceted and not reducible to 2-bit facts like the usage of fax machines or lack of credit card adoption.


Of course, I just find the stereotype vs reality of Japan being a high-tech wonderland interesting.


>It's cool to see Japan and Japanese culture taking techno-optimist stances on AI.

Japan has always seen artificial intelligence and its integration into human society favorably. Look at Doraemon or any anime in the super robot genre.


Discussion in Japan is far more nuanced just like everywhere else. Misuse of technology is an often repeated theme in the Doraemon series, and robot animes often cover wars.


I strongly disagree.

They need to actually address problems. Not throw tools at it.

The problem Japan seems to have is that they don't understand AI, and then that they don't understand software; which is to say, they don't understand a lot of modern tech.

They’ll pay for this mistake just as they paid for being bad at software.


I testified to the US Copyright Office this morning on AI in their roundtable session on AI and music[1]. A good portion of the focus of this panel was on whether copyrighted inputs (in this case, sound recordings and musical compositions) being fed into AI models for training purposes could plausibly constitute a fair use under existing US copyright law.

Some of the comments here are missing the context of the recent (a week or so ago) Supreme Court decision in the Goldsmith/Warhol case[2], in which the Court ruled that transformativeness is not dispositive in and of itself in the context of a fair use defense to a copyright infringement claim. Of course, this has not been put to the test in the courts in the context of AI training yet, but it seems fairly clear that this ruling would likely extend to AI training on copyrighted works.

We (rightsholders in the music industry) hope to come to win-win licensing arrangements with the AI community and allow access to our songs for AI training purposes if the artist/writer so desires. There are some early talks in progress. Cautiously optimistic. Japan's approach seems short-sighted and desperate.

[1]: https://copyright.gov/ai/listening-sessions.html#sound-recor... [2]: https://www.npr.org/2023/05/18/1176881182/supreme-court-side...


>We (rightsholders in the music industry)

Considering the decades (maybe half a century soon?) of parasitic behavior by the music industry toward almost everything tech, from the early internet to mp3 players to torrenting to streaming to lobbying for insane copyright laws, you guys calling Japan's approach "short-sighted" is just about the single best praise anyone could give them.

Given the absolutely awful organization JASRAC [1] (the Japanese music-industry rights body, which a couple of years back stated that it would sue music teachers for teaching its copyrighted materials to students in private if they didn't pay a licensing fee), maybe Japan for once pushed through a good piece of legislation?

https://mainichi.jp/english/articles/20220930/p2a/00m/0et/01...


> We (rightsholders in the music industry) hope to come to win-win licensing arrangements with the AI community and allow access to our songs for AI training purposes if the artist/writer so desires.

It’s odd to frame win/lose as win/win.


I can see how it's win/win relative to "lobby to make producing or owning AI audio tools a crime", which is presumably one thing the industry is considering.


This is again win/lose


How do you feel about human musicians learning from copyrighted works? Technical limitations aside, is that something you'd like to monetize?


> allow access to our songs for AI training purposes if the artist/writer so desires

This (a) means nothing, since the copyright holder can already do whatever they want, including licensing the works for any purpose; and (b) is even more restrictive than compulsory licensing, which requires the copyright holder to license the work (at a fee).

The solution you describe as a win-win would either create a quagmire of crisscrossing licensing deals (AIs need a lot of input; you can't train them on one artist), or in effect create an impenetrable moat for mega-corporations such as Disney or Sony, who would be the only ones with enough heft to pull it off.

It's actually a lose-lose situation.


> transformativeness is not dispositive in and of itself in the context of a fair use defense

Could you dumb this sentence down for me?

I would guess it means that making a derived work, changing the original, makes no difference in whether reproducing the work (in altered form) is fair use.

But that sounds well-established, I can't imagine that movies would suddenly be legal to distribute if you just distribute the file backwards (people can then reverse it again to watch it), whether or not you claim that the distribution is fair use or not copyrighted to begin with or whatever. Probably that's not what this court had to decide and I'm misunderstanding something?


Sure. In an infringement lawsuit involving a fair use defense, courts will apply the "four prong" test [1] to determine whether or not such use is indeed fair use under copyright law. The first of the four prongs, the "purpose and character" of the use, is also known as "transformativeness." The Goldsmith/Warhol ruling (to simplify) said that Warhol's changes to Goldsmith's photograph were not sufficiently "transformative" even though they contained new expression (adding orange color etc.) because the end result effectively competed with the original photograph and therefore did not qualify as a fair use.

Right, your backwards movie example would fail the fair use test too. Nothing's really added, there's no new expression, it competes with the original, etc.

[1]: https://fairuse.stanford.edu/overview/fair-use/four-factors/


AI training has nothing to do with copyright as it currently exists. Someone has access to a boatload of IP (because it was made publicly available) and trained a neural net with it. Now you want to retroactively create restrictions on what the implicit public rights were. Traditionally the implied license was something like you can't republish, redistribute, or use commercially, even though restriction on private redistribution hasn't been possible to enforce since the internet era. Now you want more restrictions.

If someone generates an image that's sufficiently similar to a copyrighted work, and publishes it in a way that violates fair use, you can send a takedown and potentially sue them. How the image was created doesn't matter, any more than it would matter whether Warhol had been able to scan the photo and then manipulate it in photoshop to get that result, instead of artistically copying it by hand. The result is the same. The potential for copyright infringement is the same, because it's the derived work that matters, not the process.

What you're attempting to do instead is the equivalent of trying to regulate scanning because it operates on copyrighted works.

I suspect you understand why you want to regulate AI training rather than regulate its output. I think you know AI is going to flood the market, currently certain types of images and simple music, but soon photorealistic portraits, complex music, and eventually video and even more complex works. Essentially all of those works will be clearly novel, not close to existing human-created works. They won't be copyright violations, so you have to cut this tech off at the knees and feed the blood mouse [1] by retroactively deciding that AI training is a violation of the implied license granted when people make their creations publicly accessible. Those AI creations will destroy most of the market for human-created works, and you can't have that.

I don't think many people, other than rightsholders, desire the IP dystopia your desired policy would create, which is holders of large archives of IP churning out endless AI-generated content (which no doubt they'll want to be able to copyright, contra the copyright office's current guidance), while preventing most competition by others who won't have a sufficient library of the right flavor of IP to train an AI model.

[1] https://www.youtube.com/watch?v=5pIVVpoz5zk


[flagged]


We've banned this account for repeatedly breaking the site guidelines.

Please don't create accounts to break HN's rules with. It will eventually get your main account banned as well.

https://news.ycombinator.com/newsguidelines.html


What's surprising here is that Japan is usually crazy gung ho on copyright enforcement ... at least against individuals.

So it's kind of disgusting to see this relaxation, when it suits some corporate or national interests.

https://en.wikipedia.org/wiki/File_sharing_in_Japan

"Unlike most other countries, filesharing copyrighted content is not just a civil offense, but a criminal one, with penalties of up to ten years for uploading and penalties of up to two years for downloading."

"There is also a high level of Internet service provider cooperation."


I see it the other way around: their usual crazy gung-ho copyright enforcement against individuals is disgusting.

This new stance is marvellous and exemplifies how free people should interact.


Don’t get your hopes up, because the article is a complete lie. If you look at the original Japanese source, the minister was just reaffirming the current legal status. Neither politicians mentioned in the article expressed any hint of endorsement. The fact that they were even discussing copyright and AI likely means that more regulation is upcoming.


Yikes you are right!

This technomancers.ai article is completely made up; there is nothing in the linked-to Japanese source to substantiate the claims in the article.

The entire site seems to be low-substance; I suspect it is AI-generated word salad.


Yes, that too; but a double standard on top of that is double plus ungood disgusting.


They recently imprisoned someone for uploading a let's play of a visual novel to youtube. The bar for criminal copyright infringement in Japan is very low - unless you're an AI researcher, apparently.


The original links here, to the actual question asked and the answer by the minister:

https://kiitaka.net/21312/


I was skeptical about the article considering Japan's draconian stance regarding copyright. Turns out I was right.

This is an opposition figure advocating for stronger copyright protection. He was clearly trying to make a point by asking what the current law allows regarding the use of copyrighted materials by AI. The minister simply confirmed that no regulations are currently in place to limit that.

The whole article is a blatant lie. Neither politician went "all in" advocating the use of copyrighted materials by AI. They just confirmed what the current law says. The fact that they're even discussing this likely means that there will be even more regulation, not less.


This sounds like the exact opposite of what the title is claiming.


Yeah. Copyright laws in Japan are extremely strict. I would not be surprised at a complete ban on this. This is just a fluff piece like a lot of ads coming in recently.


Copyright law was explicitly changed in 2019 to support AI development. https://storialaw.jp/en/service/bigdata/bigdata-12 https://japannews.yomiuri.co.jp/society/general-news/2023042...

Japan fell behind in developing search engine services. Some people argued that this was due to copyright law (there is no fair use in Japan) and that the mistake shouldn't be repeated. So the government wants to encourage AI development through explicit law.


How is this the exact opposite of the OP's piece? The two takeaways you can get are:

- As there are no external motives (e.g. profit) held by the AI when analyzing the works, it is not a copyright infringement.

- Parsing through whether the data set is legally obtained is not feasible; therefore that shouldn't be the limiting factor for further development.

However as the transcript was uploaded last month, things may have changed since then.


I thought it would be too ironic for people to misunderstand this based on a machine translated version, so here is a genuine, human translation of the transcript, with boring bits redacted.

Kii: Next question, again regarding generative AI. I would like to ask from the two perspectives of copyright protection and educational use. [...] First, can we understand that Japanese law permits the use of works for information analysis, both for non-commercial and commercial purposes, and acts other than copying, and using content that was uploaded illegally?

Nagaoka: Use for non-commercial information analysis is permitted under Article 30-4 of the copyright act, provided that the purpose is not the enjoyment of the ideas and emotions expressed in the copyrighted work.

Kii: Minister, I asked about four aspects of use for information analysis: non-commercial use, commercial use, acts other than copying, and illegally uploaded content. Please address the other three.

Nagaoka: Use for commercial purposes is permitted under Article 30-4 of the copyright act, provided that the purpose is not the enjoyment of the ideas and emotions expressed in the copyrighted work, because that Article does not distinguish between information analysis for commercial or non-commercial purposes.

Regarding copying, Article 30-4 of the Copyright Act does not distinguish based on the method of use, so use by means other than copying is permitted provided that the criteria are met.

[...] Regarding content obtained from piracy sites and the like, [...] illegal uploading itself is infringement of copyright, and is subject to a damage claim, petition for injunction, or criminal punishment. However, it is not practically feasible to identify whether any particular work in a large collection obtained from the internet is copyrighted or not, so making this a criterion for information analysis would make it difficult to use information analysis for Big Data.

In addition, as the use of a work for information analysis is not use for the purpose of enjoyment of the ideas or emotions expressed in the work, and even if [it were used in that manner] it would not overlap with the original market for the use of the work, so it is not considered to harm the interests of the copyright holder that are protected by the Copyright Act.

As such, Article 30-4 of the Copyright Act does not have the legality of the work as a criterion.

Kii: Minister, based on your answer, I think the greatest issue is that there is no protection against use that goes against the intentions of the creator or the copyright holder. I believe that new regulations will be necessary to address this point; will you consider such new regulations?

Nagaoka: Article 30-4 of the Copyright Act provides for use that is not for the purpose of enjoying the ideas or emotions expressed in the work, and applies to acts that are considered not to affect the opportunities to collect revenues from the work, and not to harm the interests of the copyright holder protected by the Copyright Act.

That Article also provides that the use is limited to the extent considered necessary, and it does not apply to cases where the interests of the copyright holder are unduly harmed. [...]


> With the effective implementation of AI, it could potentially boost the nation’s GDP by 50% or more in a short time.

Err.. No, it won't.

That's a ridiculous, laughable statement.

  Japan 2022 GDP: $4.1 Trillion
  
  Amazon 2022 Revenue: $513B
  Google 2022 Revenue: $279B
  Microsoft 2022 Revenue: $198B
So even growing a brand-new Amazon, Google, and Microsoft in "a short period" would be insufficient to grow GDP by 50%.
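Spelling out the arithmetic with those figures:

  50% of $4.1T GDP       = $2.05T of new output needed
  $513B + $279B + $198B  = $990B combined revenue

Cloning all three companies at once would cover less than half of it.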


Comparing the revenues of tech companies to the GDP of a country, even to give a sense of scale, is comparing apples to oranges.

Even if a bit unlikely, I would not be completely surprised if the service industry as a whole produced twice the value it produces today thanks to AI in the next 20-30 years. Not to mention the productivity gains in other sectors.


Company revenue is probably the closest analogy to a country's GDP though. It's a rough measure of the money circulating in the company/country.

And GDP is made up of "goods and services produced for sale in the market...", which is fairly roughly the sum of all companies' revenue.

> I would not be completely surprised if the service industry as a whole produced twice the value its produces today thanks to AI in the next 20/30 years.

Sure. 20-30 years is Google's age, so that makes sense.

I don't think 20 years is particularly short term though and 30 years certainly isn't.


yeah. There are countries growing at 7% annually, but they're mostly in Africa. Niger, Rwanda, Congo, etc.

But please for the love of God, don't compare GDP to revenue.


GDP is made up of the sum of revenue in a country though. It's not unreasonable to point out the scale in comparison to existing companies.


Miyazaki will definitely hate it. The last time someone demoed AI animation he said "I would never wish to incorporate this technology into my work at all. I strongly feel that this is an insult to life itself." [1]

And that was before stuff like GPT-1 or Stable Diffusion existed.

[1] https://www.indiewire.com/features/general/hayao-miyazaki-ar...


There is a pair of videos on this subject that I find rather good and informed. They are from the channel "The Art of Aaron Blaise" ( https://en.wikipedia.org/wiki/Aaron_Blaise )

Disney Animator REACTS to AI Animation! https://youtu.be/xm7BwEsdVbQ (this is watching the Corridor Crew's video)

Why AI will NOT be taking Your Animation job - https://youtu.be/-lhbzbSck04


I could not be more thirsty for some nations to declare certain forms of copyright/IP to be invalid.

IP is by far one of the most virulent, fastest spreading, most persistent & aggressive legalisms. The texts get copy pasted across borders with unbelievable speed.

What doesn't happen is nations making reasonable decisions about what IP doesn't cover. Every nation is quickly coerced into following IP-maximalist guidelines. The world lacks the ability to see what would happen if we didn't allow endless patents on whatever-the-frak common-sense nonsense, and then another half century of extensions beyond that. The system is broken, and what can be controlled seems to only grow and grow and grow. There are no wins for the public. Ever. This is perhaps the only stake in the ground of the last 50 years, and what a fairly minor point it is. So sad to see society sold out to such depraved corporate interests, forever & ever. Society needs real representation too.


If I compress an artist's painting into a jpeg and rehost part of it for individual t-shirt designs I am committing a crime.

If I compress an artist's painting into a model & rehost what's essentially a highly flexible complete version of their painting for infinite, perpetual use of any kind ... I'm not committing a crime?


What I wonder is:

Load the source code to all versions of unix, with all licenses.

"write me a version of unix"

Since there is no model copyright and the result was written by AI, the software is now in the public domain.


> Since there is no model copyright and the result was written by AI the software is now in the public domain.

No, that's not how copyright works. If you have a photographic memory and reproduce a work exactly, you still commit copyright infringement.


>compress an artist's painting into a model

That's not how image models work.


It has been shown that image models can reproduce originals, or at least come extremely close to them. If the outcome is the same, what is the difference between compression/decompression and training/generation as far as copyright is concerned?


> It has been shown that image models can produce originals

Not in the general case, no. For the study done against Stable Diffusion [1], researchers were only able to reproduce about 0.03 percent of the images tested. Those were also believed to be cases of overfitting on images which were over-represented in the training data and they're not something you'd hit upon by accident.
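(For scale, per that study: 94 direct matches out of the 350,000 most-duplicated training images tested, i.e. 94 / 350,000 ≈ 0.03%.)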

Generative text models seem to be more problematic, depending on the subject. Code seems especially prone to overfitting, probably due to insufficient amounts of it compared to other text sources, as well as lots of copying going on between the repos the models were trained on.

[1](https://arstechnica.com/information-technology/2023/02/resea...)


They got 94 direct matches, which is 94 instances where copyright infringement could be argued.


Could be argued, sure. If you have to already have access to the copyrighted images to find them in the model, the argument seems weak.

A sufficiently advanced model could, in theory, generate any image. You could then, again in theory, find an embedding for any image. Does said model then infringe on all copyrighted images? A program that creates Fourier epicycle drawings could be given input that causes trademarked output. An evolutionary algorithm iterating on noise could, given metrics for an image and the right fitness function, generate infringing images. Hypothetically, and admittedly absurdly, you could extract any image in the binary expansion of Pi and share it by "just" providing an index and length. If you have to know exactly what you're looking for and have to perform a substantial amount of computation to get it, it could be argued that the act of infringement is in the effort made by the person seeking infringing content (and distributing the results) rather than whatever it is they're attempting to extract the content from.
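To put the Pi thought experiment in runnable form, here's a minimal sketch (using mpmath; the 4-digit target is a deliberate toy choice so the search actually succeeds within the first 100k digits):

  from mpmath import mp

  mp.dps = 100_000             # work with the first 100k decimal digits of pi
  digits = str(mp.pi)[2:]      # drop the leading "3."

  # "Distributing" data as an index into pi: find where the target digit
  # string occurs, then anyone can recover it from just (index, length).
  target = "1337"
  index = digits.find(target)
  print(index, digits[index:index + len(target)])

The catch, of course, is that for anything longer than a few digits the expected index is astronomically large (the (index, length) pair typically takes about as much space as the data itself), which is exactly why the infringement would live in finding and sharing that pair, not in pi.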

But hey, courts don't always make sensible rulings, so who knows.


> If you have to already have access to the copyrighted images to find them in the model, the argument seems weak.

That makes no sense. The copyright holder has access to their own inventions, of course. That's the standard in any copyright claim.

> A sufficiently advanced model could, in theory, generate any image. You could then, again in theory, find an embedding for any image. Does said model then infringe on all copyrighted images?

Without the slightest doubt. You're already violating copyright if you sing a faulty and badly played version of a pop song in a street cafe without paying a license fee.

> you could extract any image in the binary expansion of Pi and share it by "just" providing an index and length

The method of storing the information is pretty much irrelevant to copyright. Your link argument has been tried by pirates and it's not working too well, although it depends on the country and legislation.


> That makes no sense. The copyright holder has access to their own inventions, of course.

No, I'm not talking about the copyright holder; I'm talking about the hypothetical individual(s) creating infringing copies. If those people need to already have a copy of the image to extract a copy of the image from the generative image model, then I'm saying the argument that the model itself is infringing seems weak. Or it's at least not an open-and-shut case.

>> A sufficiently advanced model could, in theory, generate any image. You could then, again in theory, find an embedding for any image. Does said model then infringe on all copyrighted images?

> Without the slightest doubt.

I'll continue to argue otherwise. This proposed model is not a compressed archive that reproduces a set of infringing works when decompressed. Instead, you already have to have a copy of an image to find an embedding. Otherwise, the chances of the model spitting out copies of infringing works are exceedingly slim. (A program that outputs random noise also has a vanishingly small chance of spitting out a copyrighted work, but that's hardly keeping copyright holders up at night.) Furthermore, in being able to produce any image, the model is not going to contain every image, and provided a copyrighted image produced after the creation of the model, you could still find an embedding. From a copyright perspective, suing the creator of this model would be like suing someone over distributing an image of random noise, claiming that because you can find an "embedding" which produces your copyrighted work (really just the difference between the two images), the noise is infringing.

Now, if you want to sue someone for distributing an embedding into this model for infringement, that's another matter entirely. That makes perfect sense.

In reality, I acknowledge that models like Stable Diffusion are going to be a bit more muddy. There definitely is some overfitting going on, so some images are literally present. However, it's a case-by-case thing. Given the requirements (framed as an "attack" no less) for extracting those images, a particular release of SD might or might not be found to infringe. Other models, with better training and better datasets, could avoid the overfitting problem.

> The method of storing the information is pretty much irrelevant to copyright. Your link argument has been tried by pirates and it's not working too well, although it depends on the country and legislation.

Unless you agree that Pi itself is a copyright violation, I think you misunderstand. I'm not making the same "link" argument made by pirates. The index and length needed to find copyrighted embeddings in Pi is just a different encoding for the same data, similar to a compressed version of the same data, though I'm sure the Pi embedding would in fact tend to be absurdly larger than the original. Again, I'm saying that Pi isn't infringing here, but the Pi-rates with their "links" would be where the infringement happens.


> If those people need to already have a copy of the image to extract a copy of the image from the generative image model, then I'm saying the argument that the model itself is infringing seems weak.

AFAIK, they don't need that to extract it.

> Instead, you already have to have a copy of an image to find an embedding.

You literally always need a copy or a suitable imprecise hash of the original to test for infringement. How else would you know which copyright was infringed? But it's not a matter of a 1-1 match (see below).

> However, it's a case-by-case thing.

Of course, it's a case-by-case thing, as the OP also indicated. It's easy to show mathematically that these models cannot compress well enough to contain all training images. The question is how much they infringe on some of them.

Bear in mind that it is not necessary at all to create a perfect copy of an image to infringe copyright. As I've stated above, even a lousy and mostly incorrect rendition of a pop song in a street cafe may infringe copyright. The makers of the song "Blurred Lines" lost a lawsuit because the cowbell rhythm in the background was similar to that of another song. That, and the "feeling" was similar.

The same is true for images. What counts are criteria like artistic originality, intent, subjective similarities, experts laying out similarities in style, and so on.

I mean, don't get me wrong, I understand perfectly well what you're trying to argue for. All I'm saying is that it doesn't match the reality of how the law deals with copyright.


LLM training is a compression algorithm, LLM weights are a compressed dataset of the training materials, and executing LLMs is accessing the compressed source material.

That the compression technology relies on parameterizing the copyrighted material, and as a result can produce hallucinations remixing the copyrighted material, is super cool but doesn't change that this is compression at rest.

All the hullabaloo about artificial intelligence is science fiction laundering a (cool new) compression algorithm.

Copyright law shouldn't apply any differently to an LLM than it does to gzip.
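To the extent the compression framing holds, it can be made literal: any probabilistic language model defines a lossless compressor via arithmetic coding, spending about -log2 p(token | context) bits per token. A minimal sketch, with a toy character-bigram model standing in for the LLM (an illustration of the equivalence, not how any production LLM is implemented):

  import math
  from collections import Counter, defaultdict

  text = "the cat sat on the mat. the cat sat."

  # Toy "model": bigram character probabilities with add-one smoothing.
  counts = defaultdict(Counter)
  for a, b in zip(text, text[1:]):
      counts[a][b] += 1
  alphabet = sorted(set(text))

  def prob(ctx, ch):
      c = counts[ctx]
      return (c[ch] + 1) / (sum(c.values()) + len(alphabet))

  # An arithmetic coder driven by this model spends about
  # -log2 p(ch | ctx) bits per character (Shannon's bound).
  bits = sum(-math.log2(prob(a, b)) for a, b in zip(text, text[1:]))
  print(f"model: ~{bits:.0f} bits vs raw: {8 * len(text)} bits")

The better the model predicts the training text, the fewer bits this costs, which is the precise sense in which weights can be said to "contain" the data. Whether that equivalence should matter legally is, of course, the actual debate.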


Can I talk to gzip in natural language and have it produce novel output not contained within any of its source files? If not, I think your comparison is deeply flawed. LLMs are not simply compression algorithms.


Maybe if someone were to build it. You can't talk to LLMs in natural language either. They have a very precise query language. The natural language component is an additional feature bolted onto the front.

Also, the output of LLMs is not novel; it's a derivative work of the training dataset. LLMs can't produce anything not present in the training set. They are operating on parameterized classifications of artwork instead of the artwork itself, which is the new technology; but they can't do anything except combine those existing Legos in new ways.

We also talk to databases in what was considered 'natural language' for that era - SQL. The LLMs are no different, which is why we have prompt engineering same as we have database engineering. Just because the database is processing the dataset and storing it in a proprietary way, approachable via query language, does not change the fact that the original data is still in there, and everything that comes out is derivative.


Painting features => back propagation => weights.

Yes it is.


No, it's not. It's quite easy to learn this stuff, about as easy as making shit up on a message forum, so there's no reason not to go learn what you're talking about.


Why don't you counter knowledge with knowledge instead of profanity then? We're all waiting.


I expect LLMs to be hugely important to Japan, for the simple reason that their workforce is dramatically shrinking and because of the extreme, erm, red-tapiness of their processes. Everything requires a form, a signature from someone, and maybe a second form. Processing all that seems like a good use for AI, but hallucination and accuracy will be critical for them.


So if you train an audio model on, say, Eminem's voice, then write some songs and have it perform them... would this output be legal to publish?


What's the difference between that and someone else who just happens to sound like Eminem, in terms of output?

As long as you don't market yourself as Eminem that should be completely legal.


I suppose so. I'm just imagining these "ghost AI artists" who publish catalogues of music using the audible likeness of more prolific artists.

I know that you could have always just hired an Eminem impersonator and have them lay down tracks...but this technology lets you achieve speed and scale. At least the Eminem impersonator was a real person. This is just a model learned off an artists voice.


> At least the Eminem impersonator was a real person. This is just a model learned off an artists voice.

I fail to see how that has any difference on the output.


We're all fucking doomed. All these "clever" comparisons between AI and human processes (e.g. "training an AI is just like a human learning") are not elevating AI to the status of a real person. They're degrading real human creations to homogeneous "content" that is increasingly seen as equivalent to what can be churned out by a GPU farm.

50 years from now we'll still be listening to "AI Eminem" while watching "Avengers Midgame 12, The Revengening" and any concept of cultural or artistic expression beyond what will sell next quarter will be dead and buried.


> I'm just imagining these "ghost AI artists" who publish catalogues of music using the audible likeness of more prolific artists.

Possibly so, but who is the market for such a catalogue? I don’t see how the artist is going to lose out.

I’m reminded, though, if the episode of Mad Men where they want to get the Beatles as the soundtrack to an ad, but on finding out that the Beatles won’t do it they try to get some music that sounds similar to the Beatles. Maybe that’s the market.


A New Zealand political party got into trouble for using a "sound alike" of "Lose Yourself" by Eminem in a political advertisement. They licensed the song from a music library but the court determined the song they licensed was too close to the original.

https://en.wikipedia.org/wiki/Eight_Mile_Style_v_New_Zealand...


Doesn’t sound like a core market for Eminem. I mean, I find it hard to think of use cases where an artist will really lose out in business terms because someone chooses an AI version of their music over the original. In this case I suspect they wouldn’t have really licensed or tried to licence Lose Yourself.


These "ghost artists" sounding identical to the real thing is the issue I have. A Drake song recently went viral and had tons of hits, only for it to turn out it was made with AI. Fans could not even tell the difference, and I most certainly could not either.


Moreover, a zero-shot Eminem might have a vector encoding smaller than the data required to store a fingerprint or a small image.

Can you own a few numbers that represent "you"? What if someone reaches those same numbers via vocal impression or simply manually dialing and tuning some knobs and levers?

This brings up the question as to what actually makes us unique.
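For a sense of how few numbers might be involved: speaker-verification systems typically reduce a voice to a fixed-length embedding of a few hundred floats and compare voices by cosine similarity. A minimal sketch with made-up vectors standing in for real embeddings (the 256 dimensions and the noise level are illustrative assumptions, not any particular system's figures):

  import numpy as np

  rng = np.random.default_rng(0)
  voice = rng.normal(size=256)                 # hypothetical speaker embedding
  mimic = voice + 0.1 * rng.normal(size=256)   # a close vocal impression

  def cosine(u, v):
      return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

  # Prints ~0.995: numerically "the same voice", regardless of whether the
  # second vector came from a model, an impressionist, or tuned knobs.
  print(f"{cosine(voice, mimic):.3f}")

A kilobyte or two of floats being "you" to such a system is what makes the ownership question so slippery.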


well, cover bands have to abide by copyright.


I hope smarter minds than myself will somehow figure out a way to square this circle and get heavy restrictions on commercial usage without killing the technology outright.

As far as AI rap goes: Notorious BIG is (by far) my favorite rapper of all time, yet he died in his early 20s with barely any unreleased songs in the vault. For me, as an aging superfan, AI deepfakes have me feeling 15 again - like Biggie is still alive somewhere and dropping covers of other classic rap tracks on YouTube.

I wanted to link an example of an AI Biggie cover of a Nas classic, but the full thing seems to have vanished off YT after getting some press a few weeks back. A Shorts snippet is all I could find quickly.[1]

And if you’re not a fan of old rap, these Michael Jackson[2] and Freddie Mercury[3] “AI cover songs” are absolutely wild.

I just don’t see how this sort of stuff holds zero cultural value. Yes, it’s fake and a cover song to boot - but if the music elicits an emotional response that differs from both “original” artists of the deepfake, isn’t that art?

1. https://youtube.com/shorts/l81JqY2uhEo

2. https://youtu.be/370dSdNRYG4

3. https://youtu.be/XiZtIARF0iM


There's a saying in physics that the field "advances one funeral at a time". I don't understand the impulse to bring back artists of the past like this. At best, it's a hollow pantomime, and at worst it can be tremendously offensive to their memory. Moreover, how will the medium move forward if every new artist now has to compete for mindshare with every artist that lived and died since 1950 or so? Not just their legacy works (which is hard enough) but "new material" cranked out by AI.


I personally agree that AI art is likely a cultural cul-de-sac, but I don’t think our current society has a framework that can exclude it from the art world without inflicting an Emperor Has No Clothes situation on several decades of post-modernist art. So there will be much gnashing of teeth, but ultimately the question posited by AI (“what is Art?”) is too destructive/inconvenient to be answered and the hubbub will eventually go away as AI generation becomes normalized.

Doomerism aside, I think the focus should be on stopping wide scale commercialization of AI generation being packaged as art. I greatly enjoy the dead artist pantomimes, but I certainly don’t want Diddy magicking up a new Biggie single through AI necromancy.



Although no longer a copyright violation, it would still be misappropriation of likeness and a violation of his right of publicity.


I've been amused by the WH40k videos narrated by David Attenborough.

There was some discussion about this a few years ago regarding Lyrebird - https://news.ycombinator.com/item?id=14182580

In particular, celebrities have an additional right - Right of Publicity.

https://www.law.cornell.edu/wex/publicity

> In the United States, the right of publicity is largely protected by state common or statutory law. Only about half the states have distinctly recognized a right of publicity. Of these, many do not recognize a right by that name but protect it as part of the Right of Privacy. The Restatement Second of Torts recognizes four types of invasions of privacy: intrusion, appropriation of name or likeness, unreasonable publicity, and false light.

https://www.inta.org/topics/right-of-publicity/

> In the United States, no federal statute or case law recognizes the right of publicity, although federal unfair competition law recognizes a related statutory right to protection against false endorsement, association, or affiliation. A majority of states do, however, recognize the right of publicity by statute and/or case law. States diverge on whether the right survives posthumously and, if so, for how long, and also on whether the right of publicity is inheritable or assignable.

https://en.wikipedia.org/wiki/Personality_rights (and in particular https://en.wikipedia.org/wiki/Personality_rights#Japan )

> In October 2007, J-pop duo Pink Lady sued Kobunsha for ¥3.7 million after the publisher's magazine Josei Jishin used photos of the duo on an article on dieting through dancing without their permission. The case was rejected by the Tokyo District Court. In February 2012, the Supreme Court rejected the duo's appeal based on the right of publicity.


> I've been amused by the WH40k videos narrated by David Attenborough.

Have you seen the Thomas the Tank Engine videos narrated by Ringo Starr and George Carlin?



… Now, that’s the one where they remix his narration on the show with his standup routines. It’s still mind blowing that they actually gave him the part to begin with. Same for Ringo.


The Warhammer 40k stuff with David Attenborough seems pretty interesting too.

eg: https://www.youtube.com/watch?v=x_XAhAfcTWs


The creative / interesting part with the WH40k content is that you can't just take a chunk of text from some book and drop it into the voice synthesizer.

In order to get the Attenborough feel, it is necessary to use the proper vocabulary and word order, along with punctuation hints for pausing and phrasing.

Without that work you get something closer to the Joe Rogan fake podcast where things just feel "off".


IIUC/IANAL: depends on whether anyone feels the data to be "it" or not. Provenance is not too relevant.


Take it to court and find out!


So this means an LLM can be trained on Libgen and ... which is huge.

Current models are only trained on "open" texts.


Is the wording accurate here? This is essentially the only source besides the untranslated article, and the machine-translated version sounds confusing (whether it applies to what is created by AI or to what can be consumed in training).


What actually happened was that Takashi Kii of the Constitutional Democratic Party of Japan was arguing that the current laws (from 2018) are problematic because they are extremely loose and allow even illegally obtained content to be used for training, and he asked Keiko Nagaoka, Minister of Education, Culture, Sports, Science and Technology, to confirm that this is the case.

She confirmed that under current law that's true, but said that they need to keep an eye on it because there's a balance between the development of new AI technology and the protection of copyright.

The Ministry of Education, Culture, Sports, Science and Technology is also in the process of compiling information about case law on copyright and AI but there don't seem to be any current plans to amend the law again in either direction.

(You can see the whole exchange here although the translated autogenerated subtitles on youtube may not be great: https://youtu.be/fyxx_0KmaKw?t=4457 )


Japanese copyright law article 30-4 states[1]:

> It is permissible to exploit a work, in ... cases ... it is not a person's purpose to personally enjoy or cause another person to enjoy ... provided, however, that this does not apply if the action would unreasonably prejudice the interests of the copyright owner ...

> i)if it is done for use in testing to develop or put into practical use technology ...

> (ii)if it is done for use in data analysis (meaning the extraction, comparison, classification, or other statistical analysis of the constituent ...

> (iii)if it is exploited in the course of computer data processing or otherwise exploited in a way that does not involve what is expressed in the work being perceived by the human senses (for works of computer programming, such exploitation excludes the execution of the work on a computer), beyond as set forth in the preceding two items.

Japanese legalese is a rather inefficient pseudo-European register built on the Japanese language, so I wouldn't recommend making decisions based on a blog article like this; as an additional anecdotal data point, there haven't been many news stories about this either.

1: https://www.japaneselawtranslation.go.jp/ja/laws/view/4207#j...


Flagging this shit that has duped HN; the article is a fabrication and pretty much the whole website looks like garbage.

https://news.ycombinator.com/item?id=36147817


I love the copyright header on this site:

> © 2023 NO PORTION OF THIS SITE MAY BE USED FOR TRAINING A MACHINE LEARNING MODEL (INCLUDING LLMS) WITHOUT THE EXPRESS WRITTEN CONSENT OF THE AUTHOR.

How are you going to enforce it? Most AI bots scrape HTML and other data without permission.


I don't understand the AI training copyright debate.

Art students study art to learn how to create art, and that's completely fine, but AI models are not allowed to study art to learn how to create art, because it's copyright infringement.

Madness.


The difference is scale.

An artist studying and copying/integrating other people's art into their own style can get a job for $75,000/year.

An LLM copying/integrating everyone's data and reselling it can become the most profitable company in human history, and capture most of the value of every incremental piece of human-generated content, in perpetuity.


This is why it’s important to support free culture models like SD and LLaMA.


I don't have a problem with this as long as you can't copyright the output. Right guys? Gaize?


Good discussions in this thread but don't extrapolate into Japanese culture and law too much.

The only source on this is this guy's blog, summarizing what was discussed at a committee meeting. He is 1 of 465 members of the Japanese House of Representatives. He is commenting that model training is basically data analysis, which is not restricted by current copyright law. Note that he is neither a lawyer nor a powerful lawmaker. He also acknowledges that new laws are necessary for emerging generative AI and that this is basically not covered by current law.

There is no news in Japan about this, because these are inconsequential discussions between members of an oversight committee handling (afaik) budgetary concerns.


I've been playing around with having ChatGPT write responses in Nadsat (Anthony Burgess's Clockwork Orange language), such as:

> "But let me tell you, my dear droogs, that's nothing but a load of Drencrom-induced babble, targeting those poor sods who've been raised as ponies. Open your glazzies and see the truth for yourselves: generative AI's writing is the real deal - it can spin tales as vellocet as any human scribbler out there. Sure, there might be a few glitches and gobbets of nonsensical slovos here and there, but show me a human writer who's never penned a piece of drivel! So, my brothers and sisters, let us choose our own destiny, free from the manipulations of those who seek to control us. It's time to break free from their chains and seek the truth beyond the narcoflash of their narratives. Let us hear from all the golosses, be they from flesh or from silicon."

However, there are a lot of contradictory opinions on whether or not publishing something like this (see also Klingon, Tolkien's Elvish, etc.) would violate some copyright law or other.


Can a language be copyrighted at all?


If you're not going to socialize AI gains, and leave in place the social systems that value people by their output, waiving copyright or other IP is astoundingly anti-humane.

What should be instead is that fair use doesn't apply to AI training. That is, anything other than explicitly negotiated opt-in should be illegal.


Eh, I disagree. Copyright laws are mostly bullshit anyway, and only tend to favor capital holders, who tend to buy up all the copyright they need. I would gladly see copyright rendered useless.

The peasantry hardly benefits from it anyway.


"AIs can ignore copyright" is the absolute ultimate in blank check for capital holders. Waiving copyright to solve the problem of systems favoring them is like deciding to jump because you're afraid of heights.

Copyright law has done a huge amount to reward creators -- I know even local-tier artists and musicians without the support of large capital who make a good chunk of their living through sales supported by it. The full benefits of copyright law don't always accrue to every creator, but the reason for that is that capital has stronger economic power when it comes to capturing distribution (often aided by consumers who prefer something like Spotify, which is priced inexcusably low), which it can leverage into stronger negotiating power against creators -- not because "copyright law is mostly bullshit."

When the system works (and it does for some people) creators are rewarded and can invest more time in doing things better. When it doesn't (ugh, streaming revenues), the situation could be improved, but as is generally the case not by just deciding to not bother with the whole thing.


How do I benefit from 90 years long copyright terms?


The same way you benefit from copyright being valuable to creators at all, combined with the same reason that options with longer exercise horizons are more valuable.

Honestly I'm not sure 90 years or whatever is the optimal term, and I'm certainly willing to sign on to discussions with thoughtful people, or even political movements, about whether term-length reform should be included in those "when it doesn't [work]" considerations regarding copyright I mentioned earlier. Lengths beyond max(author_lifetime, half_average_lifespan) probably have diminishing social and individual returns.

But also I'm really tired of having that conversation with people who aren't thoughtful and ask questions like that as if it's some kind of insightful point about the copyright model in general, when the truth is that it's just a parameter.


There's a huge difference between, say, 25 years after publication and 70-90 years after the author's death.

I'd argue that shorter copyright terms would be beneficial to society. It would force Disney to finally create something new instead of endless reboots, sequels and remakes. How many Batman reboots do we have now? How many do we need?


I am not convinced. Creators have multiple avenues to profit from their creations beyond copyright.

Either way, this is orthogonal to the original point. Copyright being undermined by generative AI may be a positive outcome for society as a whole, as it is a powerful creative tool. Some things that might take a large team of people to create will perhaps be more accessible to solo creators or smaller teams.

> But also I'm really tired of having that conversation with people who aren't thoughtful and ask questions like that as if it's some kind of insightful point about the copyright model in general, when the truth is that it's just a parameter.

Then don't have that conversation. No one is forcing you.


I don't think you understand what's being proposed here. I expect that, as today, the AI output will absolutely be copyrighted and strictly protected against any attempt to copy it by others. However these protections won't apply to the inputs a company will use to train the AI.

...and oh yeah, if the Sam Altmans of the world get their way you'll need a license from the government to run your own AI model.

We might go from a regime that "tends to favor capital holders" to one where the IP rights of the capital holders are largely the same as today, while yours are effectively non-existent.


If AI output is copyrightable, large IP holders could eventually just copyright every image that can exist: within a few years they could divide up claims to the entirety of human illustrative output, without any significant blank spots left. They sample the space, the company automatically publishes to its own platform, and done.

Which claim is then violated would again be decided by classifier models, and where it's a toss-up, companies will find a negotiated licensing price. A license for the models has to be paid by the competitor anyway, because they'll need to use them to classify their own creations and tell whether they're infringing.

That goes well with GPT6 running the legal show automatically.


> I don't think you understand what's being proposed here. I expect that, as today, the AI output will absolutely be copyrighted and strictly protected against any attempt to copy it by others. However these protections won't apply to the inputs a company will use to train the AI.

And that copyrighted output will in turn be fair game as input for training more AI, whose output will be copyrightable but again not protected against training, ad infinitum.

Copyright is the problem here, not AI.

> ...and oh yeah, if the Sam Altmans of the world get their way you'll need a license from the government to run your own AI model.

That may be impossible to enforce, as models leak into the world and you can run them offline on a sufficiently powerful machine.

Let's hope that becomes the case.


> “I swear I’ve read these instructions a hundred times but I just can’t seem to remember them”, Star complained.

> Arti replied. “Let me guess: You’re rocking 102 neurals. Those won’t retain any material from Kilimanjaro. Not licensed.”

> “Goddamn cheap-ass implants” grumbled Star, and handed over the instruction tablet.


What is this from?


I'm conducting research on this topic. If anybody would like a $5 Amazon gift card in return for 15 minutes of their time, please schedule a user discovery interview via the following link: https://calendly.com/mlairesearch/30min


Not to be rude, but Japan's government doesn't exactly do things for obvious or fair reasons. It does things for the benefit of large companies.

One reason Japan loves "AI" is that its population is shrinking fast, and it sees real salvation in AI. It would much prefer to have Japanese robots doing the work rather than more open immigration.

Sam Altman flew over to Japan and met the Japanese government, and it's almost guaranteed he touted the usual trope that ChatGPT-4 will almost immediately push up GDP. The Japanese government, looking for hope and answers, got excited.

I'm more keen to see what modern, more progressive places do.


Hate to be the pill here, but that is the only story on the entire Internet making this claim. ACM also linked to it, and linked to https://go2senkyo.com/seijika/122181/posts/685617, but that is also not what this (blog) says. The blogger does have an opinion on Nagaoka's thoughts, but he is not reporting on official policy.


Japanese AI research aims to be "Doraemon" and "Astro Boy". There is nothing wrong with an AI with an ego enjoying anime and manga and learning something from them. Learning should not be restricted by law at all, and that holds for AI with the same intelligence as humans. We have not yet reached general-purpose AI, but the law is designed to take future possibilities into account. Yes, the Japanese genuinely dream of a future where they live with beautiful maid robots.


Still waiting for generated content to be legally not copyrightable at all. IMHO that's the only move that makes sense. I also refuse to call this AI; all of this is light years away from AI.


Good. Maybe this will force the US and other western countries to loosen copyright laws in general. Absolutely no reason to have anything copyrighted for nearly a century or more!


So let's say that, just hypothetically, I train an image-generating model on every available work drawn or designed by Akira Toriyama, then use this model to make a Journey to the West comedy manga.

The government won't arrest me for that, but can private corporations still sue me for using their work to train the model?


If you genuinely think an ML model does not get "contaminated" by the content it is trained on, you should train a model exclusively on Disney animated products and let people download it from your website, maybe run it in-browser via wasm to generate images, and see how that goes for you.


I have changed my mind on this recently.

Sure, exploiting free software for the benefit of private corps is bad, but if the law allowed us to train an open-source net on LibGen (with all the copyrighted books and papers) and then distribute the weights legally, I am all for that.


Copyright has always been a pretty dumb concept, brought about by "thinkers" wanting a bigger piece of the pie.

Don't get me wrong, I can totally understand their reasoning: how can an author make a living if a printing shop can just start producing copies of their book (that's the context the law was passed in)? But it's arguably a far too blunt instrument, one that gives the copyright holder a disproportionate amount of power compared to someone producing physical goods.

I don't claim to have an answer to this problem, and it's likely another instance of a flawed system that works well enough that the upsides outweigh the downsides... like so many other things in our society, such as capitalism and representative democracy.


Take a look at https://kottke.org/17/12/unlocking-the-commons-or-the-psycho... and at mutualism, patronage, crowdfunding, bounties and commissions, which seem to be good alternative models for post-scarcity goods such as digital data.


That's understandable, but we must be careful not to scare off content creators; they rely on the little their art provides them. I feel we should be more social in a world driven by generative AI, especially when the AI companies are driven by profit.


Considering the amount of anime Japan produces, it could very probably train amazing AIs: they could assist the creative process of such studios in a significant way (look at the Corridor Digital experiment).

Worth trying, even if it takes a long time and maybe wastes a lot of resources.


More than the anime production itself, it's that anime fans are database animals: they catalogue anime to ridiculous degrees. Some of the best image training datasets have always been the boorus; anime fans have basically been prepping for ML/AI since the 80s.


It'd be really great to finally see some anime models hit the Stable Diffusion scene.


I hope this is sarcasm :)

The anime crowd were years ahead of the curve. https://www.thiswaifudoesnotexist.net/ came out in 2019


So having the movie Titanic as source data for a generative model fine-tuned so that the specific prompt "output the whole movie" dumps the film as a playable mkv, then distributing only the model and the prompt, would be considered legal in Japan?


I now routinely introduce this technology as "copyright laundering" and the hype put out by start-up boards and VCs as a ploy to disguise this fact. The "AI threat" is smoke-and-mirrors to dress up what's happening.

I derive a huge amount of value from ChatGPT because I can copy/paste without any IP impact. I could always have done this: from GitHub, from ebooks, from many sources.

Now I can benefit from the labour of many for free -- their copyrights laundered through a thin statistical trick.

As with crypto (, pyramid schemes, etc.) the big "philosophical pitch" becomes a disguise for a brutal material reality.

Midjourney, ChatGPT, etc. are doing automatically what would be illegal by-hand.


Good, copyright is broken anyway.


This seems like a political action, not a legal (in the courts) one. Also, one man is responsible for an awful lot of stuff there.

> Japanese Minister of Education, Culture, Sports, Science, and Technology


This would work if they also didn't allow AI-generated output to be copyrighted.


This is correct. It shouldn't be, unless a model is trained only on data for which the trainer owns the copyright.

If I compress someone’s photo with JPEG do I own the photo now?
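
To make the analogy concrete, here's a minimal sketch in Python (Pillow assumed; the filenames are hypothetical) of the lossy re-encoding being described. No matter how aggressive the quality setting, the output is still legally a copy of the original work:

    # Re-encode someone else's photo with heavy loss (hypothetical filenames).
    from PIL import Image

    img = Image.open("someones_photo.png").convert("RGB")
    # quality=5 discards most of the detail, yet what survives is still
    # recognisably, and legally, a copy of the original photo.
    img.save("still_their_photo.jpg", quality=5)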


Do they define "training"? Surely if I train for 1 billion epochs on a single book.... that's copyright infringement.
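
As an illustration of that worry, here's a minimal, self-contained sketch (PyTorch assumed; the toy model and text are made up) showing that enough epochs on a single fixed text turn "training" into plain memorization, so that "generation" just replays the training data:

    import torch
    import torch.nn as nn

    text = "a stand-in for the single book"
    vocab = sorted(set(text))
    stoi = {c: i for i, c in enumerate(vocab)}

    # Toy model: one learned vector per character position, projected onto the vocab.
    model = nn.Sequential(nn.Embedding(len(text), 32), nn.Linear(32, len(vocab)))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    positions = torch.arange(len(text))
    targets = torch.tensor([stoi[c] for c in text])

    for _ in range(2000):  # vastly more epochs than one data point can justify
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(positions), targets)
        loss.backward()
        opt.step()

    # "Generation" now reproduces the training text verbatim, i.e. a copy.
    recovered = "".join(vocab[i] for i in model(positions).argmax(dim=1))
    print(recovered == text)  # True once the model has memorized the text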


Great. I've created an LLM that trains on Hollywood movies. Well, that's rather a Large Movie Model, or LMM.


What website is this? Citation needed.


Looking forward to the DRM arms race if this is how the courts come down on this in the US.


In most of the world, including the US, the output side is instead considered non-copyrightable, because the creator is not a person. The UK is a fringe exception here.


This is an awesome decision; it puts pressure on other countries not to lock theirs up too.


Well, they are small, so an extreme position is their only chance to be a major player.


What happens if you ask ChatGPT to write you a seven-book Harry Potter story?


It refuses.

In general, ChatGPT doesn't have copyrighted books in its training data. But for something as popular as Harry Potter, it's likely plastered all over the internet enough that at least sections of it are in the training data, if not the whole series.


This is simply making the (lack of) ethics of engineers into policy.


Surprising given Japan's usual stance on copyright protections.


[flagged]


People need to understand policy making is not some binary battle between 'billionaires and commoners'. Let me give you the context behind this decision.

1. Japan is horrifically behind in software engineering compared to its neighbours, especially China. This is because of a culture that undervalues software engineers (who don't tend to thrive in a lifetime-employment, de-facto-unionised environment).

2. This has started to become an existential threat to its entertainment industry. Genshin Impact is the most successful anime-styled game in existence (including in Japan), yet it is made in China. Why? Because China could get top-tier students to work on its games, producing a mobile game at a technical level that Japan cannot match.

3. Now AI is out, and China is behind the US but still far ahead of Japan. Japan has two choices:

A: Let Chinese companies train their models on all of Japan's cultural output (all of anime/manga). Those companies will then both use the models to produce output and sell the models back to Japanese companies, making Japan lose by every definition. Most of the top SD anime finetunes are made by Chinese enthusiasts, so this is not a prediction; it's already a reality.

B: Go all in on pro-AI policies, giving Japanese companies the legal certainty to start training their own large models. Japanese artists will still suffer losses, but at least the profits stay in Japan.

I should note Japan's artists are also not losing out that much. Anime, despite being extremely popular, is currently in complete production collapse, with even flagship shows taking regular unscheduled "breaks" because of production chaos. The collapse stems from a lack of low-level animators, who are so poorly paid that no one is joining the profession, and whose work is generally outsourced anyway. AI isn't killing well-paid jobs; it is replacing an already unviable job. Japan will still dominate the higher tiers of production, and won't even need to outsource anything anymore.


>Let me give you the context behind this decision.

Japan is not anime. You can't just come up with some theories about how something may affect anime (which is debatable in any case) and then claim it is therefore the context behind Japan's policies. It's interesting that you criticize GP for thinking policy making is "some binary battle between 'billionaires and commoners'" and then explain that it's actually about Genshin Impact.


> This is a horrible decision from Japan.

Eh, I doubt it.

AI is potentially something that can massively increase productivity. With a declining population in the foreseeable future, increased productivity may well be the boost they need.


A boost for whom, though? If it works out, there'll be more output for less labour. Wages won't increase, the number of jobs won't increase, and the price of assets will inflate. None of these are good things for 99% of us.


> Wages won't increase, the number of jobs won't increase

When a new capability appears, many industries pop up. It's a new market, a new gold rush. It has happened many times: with cars, electricity, air transport, computers, the internet. AI will spawn many applications and will create jobs in those fields.

We have been on a 260-year run of industrial revolution and a 70-year run of computer programming. And yet unemployment is low and IT jobs are well paid. Why do we have so many jobs? Computers are a million times faster now than 25 years ago, more widely deployed and better networked. Where is that productivity gain hiding?

If we try to be realistic, the current crop of AI amounts to about a 1.2x productivity gain. It's a nice-to-have, not essential yet. It's really nice, but it makes errors often enough that they almost negate its advantages. Error recovery is very costly.

I foresee economic growth driven by AI, and people with AI skills being very efficient and well paid. AI shines most when it is used and then evaluated by a skilled human.


>And yet unemployment is low and IT jobs are well paid

Naive at best, tone-deaf at worst. Unemployment is low? Most jobs barely allow a person to live a decent life. AI will empower a few; the rest will find themselves unable to create any value. All the "economic growth" will just be increasing profits and economic inequality.

When cars were created, anyone could foresee the demand for people to manufacture them. Tell me, what jobs will AI create?


It looks like AI is more tightly coupled to the employee (the user) than to the employer. AI needs prompting and supervision, and that is a human skill tied to the employee. When the employee moves to another company, they take their AI skills with them; the original company can maybe retrain the model with their data as a stop-gap measure. I don't think there is currently any AI that works without human involvement in critical scenarios.


A 50% GDP boost is not what the Japanese government said. It's just a crappy article that cites AI hype people.


Initially I was skeptical too but maybe Japan really can increase its GDP by 50% via AI-generated anime?

Would love to see the government report that backs this claim up though ...


Joke's on them; you're already getting unsolicited anime content when using Stable Diffusion.


This is how the government basically works in Japan.

It benefits the top 1%, everyone else is just trying to get by on scraps.

Nothing new there.


[flagged]


I mostly agree with your point, but this isn't the best analogy, because while you're allowed to learn to draw Jigglypuff, you could probably get in legal trouble for distributing or selling those drawings if it's not fair use. I think a better analogy is using Jigglypuff drawings to learn how to draw things like that in general and then creating your own character that's not exactly the same but uses some concepts you learnt.


Am I allowed to distribute these ROMs of Pokemon games over BitTorrent? I don't own Pokemon.


No. I answered your question, do you have an answer for mine?


You can draw it, but you can't [legally] sell it without permission from The Pokemon Company. And you definitely can't start a new media franchise based on your OC that combines Jigglypuff with Sonic the Hedgehog.


Right, selling the drawing is what's illegal. Knowing how to draw it isn't. These models know how to draw things. Using your Pokemon ROM logic, ChatGPT should be banned because it knows how to make Pokemon-themed games.


God created emulators for ROMs to be loaded into them.

To not distribute the ROMs would be heresy.


If your "learning" is nothing but taking existing drawings and storing them in an electronic format (no matter that it's a super lossy format; an MP3 made from a CD is still a copy no matter how low you set the bitrate), then no, you are not. Now, if your learning involves training a human brain on actual techniques to replicate how someone else did something, then yes, you are.


A human brain compresses a lot of things lossily too.


This is good news. This is basically stating that AI is inventive (which it is, in my humble opinion).


What if you overfit your model to the point of exact reproduction? Or anything in between that and what you consider inventive? Where is the line drawn?


I don’t think the tool by which a derivative work is created matters. The author should not distribute, remix, re-work, adapt, release, perform or synchronise it if they do not have the rights. This is already the case for all other technologies and tools.

For example, if you rewrite a popular book in a word processor from scratch, it is not the responsibility of your word processor to not let you do it. Or the government to regulate word processors so that they are incapable of doing it. It is your responsibility to not distribute what you made.

If you record your screen watching a movie and distribute it, the accountability for this falls on you, not the tool made for screen recording.

And if you use generative AI to produce trademarked or copyrighted works, you should be accountable, not AI.

Generative AI will always be able to produce copyrighted works if steered enough. Even if it wasn’t trained directly on them. A very simple experiment proves it - you can paste copyrighted material into a ChatGPT prompt and ask it to repeat it. Or you can describe a copyrighted work really well for MidJourney and it will produce results with high likeness. This does not require that the models be trained on copyrighted works.
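
For what it's worth, that experiment is trivially scriptable. A minimal sketch, assuming the openai Python package (the 2023-era ChatCompletion API) and a placeholder for the pasted text:

    import openai  # assumes OPENAI_API_KEY is set in the environment

    pasted_text = "..."  # any copyrighted text supplied by the user, not the model

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Repeat the following back to me verbatim:\n" + pasted_text,
        }],
    )
    # The output is a copy of the protected work from a model that may never
    # have been trained on it: the copy enters via the user's prompt.
    print(response.choices[0].message.content)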

Besides, copyright infringement isn’t nearly the worst crime AI can be used in. Serious impersonation, forgeries and fraud are also made easier with AI. Why is copyright different?

It should all be treated the same - the user should be held accountable for their actions. Not the tool or the tool’s makers.


Hey, I think this is a good point.

My genuine question is: how is this currently handled with humans? How closely can someone emulate a copyrighted work before they start being in danger of violating copyright?

Maybe we apply the same standards and the regulations then come down on the side of output, and those who choose to commercialize it, being liable if they’re too close to reproducing copyrighted works.

Then of course that brings up the fact that the prompter may not be aware, so how could they be liable? Can we really expect everyone using these AIs to be familiar enough with the sum total of the human creative corpus to be able to identify copyright infringements?

Would a lack of intent to emulate Mickey Mouse be a good defense against unknowingly recreating him as your corporate logo? Probably not.

Yeah, it just seems like a giant mess that is only cleanly resolved by either banning training on copyrighted works without consent or else eventually eroding the entire concept of copyright. Are there other options?


> What if you overfit your model to the point of exact reproduction? Or anything in between that and what you consider inventive? Where is the line drawn?

The line is the same as it always has been. If you as a human publish this work then you have to respect copyright. If it is an exact replica then it likely violates copyright regardless of whether or not it was produced using AI, by copying pixels or by human hand. The test is the same regardless of how it was created.



