Their developers have intent. That intent is to give the perception of understanding/facts/logic without designing representations of any such thing, and with full knowledge that, as a result, it will routinely be wrong in ways that would convey malicious intent if a human did it. I would say they are trained to deceive, because if being correct were important, the developers would have taken an entirely different approach.
generating information without regard to the truth is bullshitting, not necessarily malicious intent.
for example, this is bullshit because it’s words with no real thought behind it:
“if being correct was important, the developers would have taken an entirely different approach”
If you are asking a professional high-stakes questions about their expertise in a work context and they are just bullshitting you, it's fair to impugn their motives. Similarly if someone is using their considerable talent to place bullshit artists in positions where they make liability-free, high-stakes decisions.
Your second comment is more flippant than mine, as even AI boosters like Chollet and LeCun have come around to LLMs being tangential to delivering on their dreams, and that's before engaging with formal methods, V&V, and other approaches used in systems that actually value reliability.
> customers at the national labs are not going to be sharing custom HPC code with AMD engineers
There are several co-design projects in which AMD engineers are interacting on a weekly basis with developers of these lab-developed codes as well as those developing successors to the current production codes. I was part of one of those projects for 6 years, and it was very fruitful.
> I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack.
HIP/ROCm is the prevailing interface for programming AMD GPUs, analogous to CUDA for NVIDIA GPUs. Some projects access it through higher level libraries (e.g., Kokkos and Raja are popular at labs). OpenMP target offload is less widespread, and there are some research-grade approaches, but the vast majority of DOE software for Frontier and El Capitan relies on the ROCm stack. Yes, we have groaned at some choices, but it has been improving, and I would say the experience on MI-250X machines (Frontier, Crusher, Tioga) is now similar to large A100 machines (Perlmutter, Polaris). Intel (Aurora) remains a rougher experience.
The point is that LLMs are never right for the right reason. Humans who understand the subject matter can make mistakes, but they are mistakes of a different nature. The issue reminds me of this from Terry Tao (LLMs being not-even pre-rigorous, but adept at forging the style of rigorous exposition):
It is perhaps worth noting that mathematicians at all three of the above stages of mathematical development can still make formal mistakes in their mathematical writing. However, the nature of these mistakes tends to be rather different, depending on what stage one is at:
1. Mathematicians at the pre-rigorous stage of development often make formal errors because they are unable to understand how the rigorous mathematical formalism actually works, and are instead applying formal rules or heuristics blindly. It can often be quite difficult for such mathematicians to appreciate and correct these errors even when those errors are explicitly pointed out to them.
2. Mathematicians at the rigorous stage of development can still make formal errors because they have not yet perfected their formal understanding, or are unable to perform enough “sanity checks” against intuition or other rules of thumb to catch, say, a sign error, or a failure to correctly verify a crucial hypothesis in a tool. However, such errors can usually be detected (and often repaired) once they are pointed out to them.
3. Mathematicians at the post-rigorous stage of development are not infallible, and are still capable of making formal errors in their writing. But this is often because they no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language.
The distinction between the three types of errors can lead to the phenomenon (which can often be quite puzzling to readers at earlier stages of mathematical development) of a mathematical argument by a post-rigorous mathematician which locally contains a number of typos and other formal errors, but is globally quite sound, with the local errors propagating for a while before being cancelled out by other local errors. (In contrast, when unchecked by a solid intuition, once an error is introduced in an argument by a pre-rigorous or rigorous mathematician, it is possible for the error to propagate out of control until one is left with complete nonsense at the end of the argument.)
The crossover can be around a matrix dimension of 500 (https://doi.org/10.1109/SC.2016.58) for 2-level Strassen. It's not used by regular BLAS because it is less numerically stable (a concern that becomes more severe for the fancier fast MM algorithms). Whether or not the matrix can be compressed (as sparse, via fast transforms, or data-sparse such as the various hierarchical low-rank representations) is more a statement about the problem domain, though it's true that a sizable portion of the applications that produce large matrices produce matrices amenable to data-sparse representations.
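For a sense of where both the speedup and the stability concern come from, here is a rough one-level Strassen sketch in Python/NumPy (the cutoff below is an arbitrary illustrative value near that crossover, not a tuned one, and it assumes the dimension is a power of 2):

    import numpy as np

    def strassen(A, B, cutoff=512):
        # Fall back to the regular (BLAS-backed) product below the cutoff,
        # as practical implementations do near the crossover point.
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Seven recursive products instead of eight: the source of the speedup,
        # and of the extra rounding error from all the additions/subtractions.
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1024, 1024))
    B = rng.standard_normal((1024, 1024))
    # Agrees with A @ B, but the worst-case error is larger than plain BLAS round-off.
    print(np.max(np.abs(strassen(A, B) - A @ B)))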
> It's trivially easy to find a real-world situation where conservation of energy does not hold (any system with friction, which is basically all of them)
Conservation of energy absolutely still holds, but entropy is not conserved so the process is irreversible. If your model doesn't include heat, then discrete energy won't be conserved in a process that produces heat, but that's your modeling choice, not a statement about physics. It is common to model such processes using a dissipation potential.
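To make that bookkeeping concrete (numbers made up for illustration): a block sliding to rest under kinetic friction loses kinetic energy, and exactly that amount shows up as heat, so the total is conserved once heat is part of the model, even though the process is irreversible.

    # Toy energy bookkeeping for a block sliding to rest under kinetic friction.
    m, v0, mu, g = 2.0, 3.0, 0.4, 9.81   # mass (kg), initial speed (m/s), friction coefficient, gravity (m/s^2)

    ke0 = 0.5 * m * v0**2                # initial kinetic energy (J)
    d = v0**2 / (2 * mu * g)             # stopping distance under constant deceleration a = mu*g (m)
    heat = mu * m * g * d                # work done against friction, dissipated as heat (J)

    # Both print 9.0 J: energy is conserved once heat is counted,
    # but entropy increased, so the process can't be run backwards.
    print(ke0, heat)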
Right, but I'm saying that it's all modeling choices, all the way down. Extend the model to include thermal energy and most of the time it holds again - but then it falls down if you also have static electricity that generates a visible spark (say, a wool sweater on a slide) or magnetic drag (say, regenerative braking on a car). Then you can include models for those too, but you're introducing new concepts with each, and the math gets much hairier. We call the unified model where we abstract away all the different forms of energy "conservation of energy", but there are a good many practical systems where making tangible predictions using conservation of energy gives wrong answers.
Basically this is a restatement of Box's aphorism ("All models are wrong, but some are useful") or the ideas in Thomas Kuhn's "The Structure of Scientific Revolutions". The goal of science is to go from concrete observations to abstract principles which will ideally predict the value of future concrete observations. In many cases you can do this, but not all. There is always messy data that doesn't fit into neat, simple, general laws. Usually the messy data is just ignored, because it can't be predicted and is assumed to average out or generally be irrelevant in the end. But sometimes the messy outliers bite you, or someone comes up with a new way to handle them elegantly, and then you get a paradigm shift.
And this has implications for understanding what machine learning is and why it's important. Few people would think that a model linking background color to the likelihood of clicking on ads is a fundamental physical quantity, but Google had one 15+ years ago, it was pretty accurate, and it made them a bunch of money. Similarly, most people wouldn't think of a model of the English language as a fundamental physical quantity, but that's exactly what an LLM is, and they're pretty useful too.
It's been a long time since I have cracked a physics book, but your mention of interesting "fundamental physical quantities" triggered the recollection of there being a conservation of information result in quantum mechanics where you can come up with an action whose equations of motion are Schrödinger's equation and the conserved quantity is a probability current. So I wonder to what extent (if any) it might make sense to try to approach these things in terms of the really fundamental quantity of information itself?
Approaching physics from a pure information-flow perspective is definitely a current research topic. I suspect we see little popsci treatment of it because almost nobody understands information to begin with, and trying to apply it to physics that also almost nobody understands is probably three or four bridges too far for a popsci treatment. But it is a current and active topic.
This might be insultingly simplistic, but I always thought the phrase "conservation of information" just meant that the time-evolution operator in quantum mechanics was unitary. Unitary mappings are always bijective functions - so it makes intuitive sense to say that all information is preserved. However, it does not follow that this information is useful to actually quantify, like energy or momentum. There is certainly a kind of applied mathematics called "information theory", but I doubt there's any relevance to the term "conservation of information" as it's used in fundamental physics.
The links below lend credibility to my interpretation.
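As a quick numerical illustration of that reading (just a sanity check, not a claim about how "conservation of information" is formalized in the literature): a unitary U = exp(-iHt) built from any Hermitian H preserves the norm of every state, i.e. total probability.

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    n = 8
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    H = (A + A.conj().T) / 2                  # random Hermitian "Hamiltonian"
    U = expm(-1j * H * 0.7)                   # time-evolution operator for t = 0.7 (hbar = 1)

    psi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    psi /= np.linalg.norm(psi)                # normalized state

    print(np.allclose(U.conj().T @ U, np.eye(n)))   # True: U is unitary (a bijection)
    print(np.linalg.norm(U @ psi))                  # 1.0: total probability is preserved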
They tripled the price of my grandmother's service over a five-year period despite no speed increases, so I figure they were going to charge you more regardless.
The LM industry valuation would be way smaller if they were not laundering behavior that would be illegal if a human did it. If "AI" were required to practice clean-room design (https://en.wikipedia.org/wiki/Clean_room_design) to avoid infringing copyright, we would laugh at the ineptitude. If people believed the FTC-CFPB-DOJ-EEOC joint statement was going to lead to successful prosecutions, the industry valuation would collapse. https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CF...
If you spend weeks drilling flash cards on copyrighted code, then produce pages of near-verbatim copies with the copyright stripped, any court would find you to have violated the copyright. A lot of people right now are banking on "it's not illegal when AI does it", and part of that strategy is to make "AI" out to be something more than it is. That strategy has many parallels to cryptocurrency hyping.
Even MIT licensed code requires you to preserve the copyright and permission notice.
If a human did what these language models are doing (output derivative works with the copyright and license stripped), it would be a license violation. When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec. LM developers could have similar practices, with separately-trained components that create an auditable intermediate representation and independently create new code based on that representation. The tech isn't up to that task and the LM authors think they're going to get away with laundering what would be plagiarism if a human did it.
Why can't AI do the same: copyrighted code -> spec -> generated code?
... and then execute copyrighted code -> trace resulting values -> tests for new code.
AI could do clean-room reimplementation of any code to beef up the training set. It could also make sure the new code differs from the old code at the n-gram level, so that even by chance it should not look the same.
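A rough sketch of what such an n-gram check could look like (the tokenization, the choice of n, and the example snippets are all arbitrary illustrations here, not any kind of legal standard):

    import re

    def ngrams(code, n=6):
        """Token n-grams of a source string; a crude proxy for near-verbatim reuse."""
        tokens = re.findall(r"\w+|[^\w\s]", code)
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap(original, generated, n=6):
        """Fraction of the generated code's token n-grams that also appear in the original."""
        gen = ngrams(generated, n)
        return len(gen & ngrams(original, n)) / len(gen) if gen else 0.0

    original = "for (int i = 0; i < n; ++i) { sum += a[i] * b[i]; }"
    generated = "for (int j = 0; j < n; ++j) { total += a[j] * b[j]; }"
    # Prints 0.0: simple renaming already defeats the check, so a low score
    # is weak evidence that the new code is genuinely independent.
    print(overlap(original, generated))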
Would that hold up in court? Is it copyright laundering?
Language models don't understand anything; they just manipulate tokens. It is a much harder task to write a spec (one that humans and courts can review, if needed, to determine it is not infringing) and then, with a separately trained tool, implement that spec. The tech just isn't ready, and it's not clear that language models will ever get there.
What language models could do easily is to obfuscate better so the license violation is harder to prove. That's behavior laundering -- no amount of human obfuscation (e.g., synonym substitution, renaming variables, swapping out control structures) can turn a plagiarized work into one that isn't. If we (via regulators and courts) let the Altmans of the world pull their stunt, they're going to end up with a government-protected monopoly on plagiarism-laundering.
> When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec.
Maybe at a FAANG or some other MegaCorp, but most companies around barely have a single dev team at all, or if they're larger barely have one per project.
There's a clear separation between the training process, which looks at code and outputs nothing but weights, and the generation process, which takes in weights and prompts and produces code.
The weights are an intermediate representation that contains nothing resembling the original code.
But the original content is frequently recoverable.
You can't just take copyrighted code, base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation.
From my (admittedly vague) understanding copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could consider that the lineage doesn't pass through models.
> But the original content is frequently recoverable.
What if we train the model on paraphrases of the copyrighted code? The model can't reproduce exactly what it has not seen.
Also consider the size ratio - 1 TB of code+text ends up as about 1 GB of model weights. There is no space to "memorize" the training set; it can only learn basic principles and how to combine them to generate code on demand.
The copyright law in principle should only protect expression, not ideas. As long as the model learns the underlying principles without copying the superficial form, it should be ok. That's my 2c
Machine learning neural networks have almost nothing to do with how brains work besides a tenuous mathematical relation that was conceived in the 1950s.
You can say that if you want to nitpick, but there are recent studies showing that neural and brain representations align rather well, to the point that we can predict what someone is seeing from brain waves, or generate the image with stable diffusion.
I think brain-to-neural-net alignment is explained by the fact that both are shaped by the same evolutionary process of language. We're not all that different from AIs; we just have better tools and environments, and evolutionary adaptation for some tasks.
Language is an evolutionary system, ideas are self replicators, they evolve parallel to humans. We depend on the accumulation of ideas, starting from scratch would be hard even for humans. A human alone with no language resources of any kind would be worse than a primitive.
The real source of intelligence is the language data from which both humans and AIs learn; model architecture is not very important. Two different people, with different neural wiring in the brain, or two different models, like GPT and T5, can learn the same task given the same training set. What matters is the training data, and it should be credited with the skills we and AIs obtain. Most of us live our whole lives at this level and never come up with an original idea; we're applying language to tasks just like GPT does.
I think this view is incredibly dangerous to any kind of skills mastery. It has the potential to completely destroy the knowledge economy and eventually degrade AI due to a dearth of training data.
It reminds me of people needing to do a "clean room implementation" without ever seeing similar code. I feel like a human being who read a bunch of code and then wrote something similar without copy/paste or looking at the training data should be protected, and therefore an AI should too.
I mean those consequences are why patent law exists. New technology may require new regulatory frameworks, like we've been doing since railroads. The idea that we could not amend law and that we need to pedantically say "well this isn't illegal now" as an excuse for doing something unethical and harmful to the economy is in my opinion very flawed.
Is it really harmful to the economy, or only to entrenched players? Coding AI should be a benefit to many, like open source is. It opens the source even more, should be a dream come true for the community. It's also good for learning and lowering the entry barrier.
At the same time it does not replace human developers in any application; it might take a long time until we can go on vacation and let AI solve our Jira tickets. Remember that self-driving has been under intense research for more than a decade now, and it's still far from L5.
It's a trend that holds in all fields. AI is a tool that stumbles without a human to wield it, it does not replace humans at all. But with each new capability it invites us to launch new products and create jobs. Human empowerment without human replacement is what we want, right?
Has anyone been able to create a prompt that GPT4 replies to with copyrighted content (or content extremely similar to the original content)?
I'm curious how easy or difficult it is to get GPT to spit out content (code or text) that could be considered obvious infringement.
Tempted to give it half of some closed-source or restrictively licensed code to see if it auto-completes the other half in a manner that obviously recreates the original work.
I don't know about GPT-4, but you could get ChatGPT to spit out Carmack's fast inverse square root, comments and all (I can't find the tweet though…)
I can reproduce when prompted all the lyrics to Bohemian Rhapsody, but my doing so isn’t automatically copyright infringement. It would depend on where, when, how, in front of what audience, and to what purpose I was reciting them as to whether it was irrelevant to copyright law, protected under some copyright use case, civilly infringing, or criminally infringing copyright abuse.
The same applies to GPT. It could reproduce Bohemian Rhapsody lyrics in the course of answering questions and there’s no automatic breach of copyright that’s taking place. It’s okay for GPT to know how a well known song goes.
If copilot ‘knows how some code goes’ and is able to complete it, how is that any different?
Yes, indeed. And it's not even a novel idea, my example is lifted straight from one of the references of the paper you've linked (namely, Kiselyov & Shan, "Lightweight Static Capabilities", 2007).
I personally think that taking an optimization (e.g. bounds-check elimination) and bringing it — or more precisely, the logic that verifies that this optimization is safe — to the source-language level is a very promising direction of research.