Of course it's random and by chance: tokens are literally sampled from a predicted probability distribution. If you mean chance = a uniform distribution, you have to articulate that.
It's trivially true that arbitrarily short reconstructions can be produced by virtually any random process, and that reconstruction length scales with how similar the output distribution is to that of the target. This really shouldn't be controversial.
My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line?
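To make "quantifiable" concrete, here's a rough sketch with made-up per-token probabilities (not taken from any real model), just to show that sequence length and distributional similarity roll up into a single score:

```python
import math

# Hypothetical per-token probabilities that a baseline model (one that never
# saw the target file) might assign to each token of an exact reproduction.
# Length and distributional fit collapse into one number: more tokens, or a
# worse per-token fit, both push the log-probability of the match down.
def log10_prob_of_exact_match(per_token_probs):
    return sum(math.log10(p) for p in per_token_probs)

common_idiom  = [0.9] * 5    # short, highly predictable snippet
long_specific = [0.6] * 400  # long, only moderately predictable snippet

print(log10_prob_of_exact_match(common_idiom))   # about -0.2 (plausible by chance)
print(log10_prob_of_exact_match(long_specific))  # about -89  (essentially never by chance)
```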
> Of course, it's random and by chance - tokens are literally sampled from a predicted probability distribution.
Picking randomly out of a non-random distribution doesn't give you a random result.
And you don't have to use randomness to pick tokens.
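A minimal sketch of what I mean, with an invented next-token distribution (the numbers are illustrative, not from any real model):

```python
import random

# Made-up next-token distribution after some prefix -- illustrative only.
next_token_probs = {
    " range": 0.92,
    " enumerate": 0.05,
    " zip": 0.02,
    " potato": 0.01,
}

# "Random" sampling from this distribution still picks " range"
# the overwhelming majority of the time.
tokens, weights = zip(*next_token_probs.items())
samples = random.choices(tokens, weights=weights, k=1000)
print(samples.count(" range") / 1000)  # roughly 0.92

# And you don't need randomness at all: greedy decoding just takes the argmax.
greedy = max(next_token_probs, key=next_token_probs.get)
print(greedy)  # " range"
```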
> If you mean chance=uniform probability you have to articulate that.
Don't be a pain. This isn't about a uniform distribution versus some other generic distribution. This is about the very elaborate calculations that happen on a per-token basis specifically to make the next token plausible and exclude the vast majority of tokens.
> My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line?
Any reasonable line has examples from many models that cross it: very long segments that can be reproduced verbatim, because many models were trained in a way that overfits certain pieces of code and effectively memorizes them.
Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. This is trivially true: I could write "//" on one side of a coin and half the time it would land showing "//". Let's agree that's a lower bound.
I don't even disagree that there is an upper bound. Surely reproducing a repo in its entirety is a match.
So there must exist a line between the two that divides "too short" from "too long".
Again, on what basis do you draw a line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely by "reasonableness"?
There are very very long examples that are clearly memorization.
Like, if a model had been trained on all the code in the world except that specific example, the chance of it producing that snippet would be less than a billionth of a billionth of a percent. But that snippet was fed in so many times that it gets treated like a standard idiom and memorized in full.
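Back-of-the-envelope, with a purely assumed ~0.8 per-token probability for a baseline model that never saw the snippet:

```python
# A billionth of a billionth of a percent is 1e-9 * 1e-9 * 1e-2 = 1e-20.
# With an assumed ~0.8 per-token probability from a model that never saw
# the snippet, an exact ~220-token reproduction is already rarer than that.
p_per_token = 0.8
n_tokens = 220
print(p_per_token ** n_tokens)  # about 5e-22, well below 1e-20
```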
Is that a clear enough threshold for you?
I don't know where the exact line is, but I know it's somewhere inside this big ballpark, and there are examples that go past the entire ballpark.
I care that it's within the ballpark I explained in considerable detail. I don't care where inside the ballpark it is.
You gave an exaggerated upper limit, so extreme there's no ambiguity, of "entire repo".
I gave my own exaggerated upper limit, so extreme there's no ambiguity. And mine has examples of it actually happening: incidents so extreme they're clear violations.
Maybe an analogy will help: the point at which a collection of sand grains becomes a heap is ambiguous. But when we have documented incidents involving a kilogram or more of sand in a conical shape, we can skip refining the threshold and simply declare that yes, heaps are real. Incidents of major LLMs copying code, in a way that is full-on memorization and not just recreating things via chance and general code knowledge, are real.
You're the only person I've ever seen imply that true copying incidents are a statistical illusion, akin to a random die. Normally the debate is over how frequent and impactful they are, who is going to be held responsible, and what to do about them.
To recap, the original statement was, "Llm's do not verbatim disgorge chunks of the code they were trained on." We obviously both disagree with it.
While you keep trying to drag this toward an upper bound, I'm trying to illustrate that a coin with "//" on it reproduces a chunk of code. Again, I don't see much of a disagreement on that point either. What I continue to fail to elicit from you is the salient difference between the two.
I'm trying to find a scissor that distills your vibes into a consistent rule, and each time it gets rebutted as if I'm trying to make an argument. If your system doesn't have consistency, just say so.
I have a consistent rule. The rule is that if an LLM meets the threshold I set then it definitely violated copyright, and if it doesn't meet the threshold then we need more investigation.
We have proof of LLMs going over the threshold. So that answers the question.
Your illustrations are all in the "needs more investigation" area and they don't affect the conclusion.
We both agree that 1 token by itself is fine, and that some number is too many.
So why do you keep asking about that, as if it makes my argument inconsistent in some way? We both say the same thing!
We don't need to know the exact cutoff, or calculate how it varies. We only need to find violators that are over the cutoff.
How about you tell me what you want me to say? Do you want me to say my system is inconsistent? It's not. Having an area where the answer is unclear means the system is not able to answer every question, but it doesn't need to answer every question.
If you're accusing me of using "vibes" in a way that ruins things, then I counter that, no, I gave nice, specific, vanishingly small probabilities that are no more "vibes"-based than your suggestion of an entire repo.
> What I continue to fail to elicit from you is the salient difference between the two.
Between what, "//" and the threshold I said?
The salient difference between the two is that one is too short to be copyright infringement and the other is so long and specific that it's definitely copyright infringement (when the source is an existing file under copyright, copied without permission). What more do you want?
Just like 1 grain of sand is definitely not a heap and 1kg of sand is definitely a heap.
If you ask me about 2, 3, or 20 tokens, my answer is: I don't care, it doesn't matter, and don't pretend it's relevant to the question of whether LLMs have been infringing copyright or not ("verbatim disgorge chunks").