I read a lot of niggling comments here about whether Claude was really being smart in writing this GIF fuzzer. Of course it was trained on fuzzer source code. Of course it has read every blog post about esoteric boundary conditions in GIF parsers.
But to bring all of those things together and translate the concepts into working Python code is astonishing. We have just forgotten that a year ago, this achievement would have blown our minds.
I recently had to write an email to my kid’s school so that he could get some more support for a learning disability. I fed Claude 3 Opus a copy of his 35 page psychometric testing report along with a couple of his recent report cards and asked it to draft the email for me, making reference to things in the three documents provided. I also suggested it pay special attention to one of the testing results.
The first email draft was ready to send. Sure, I tweaked a thing or two, but this saved me half an hour of digging through dense material written by a psychologist. After verifying that there were no factual errors, I hit “Send.” To me, it’s still magic.
Not OP, but as a parent of multiple school-age kids, I'd say both of the following:
1. You're 100% right: there are privacy concerns.
2. I don't know if they could possibly be worse than the majority of school districts (including my kids') running directly off of Google's Education system (Chromebooks, Google Docs, Gmail, etc.).
You can generally register your child under an assumed name - at least, that is possible in my area. Families can choose this option if there is any threat to the security of their child.
It's important to differentiate concern (a feeling) from the choice to upload or not. In the calculus of benefits and risks, the feeling of concern (about potentially leaking PII/health information) may be outweighed by the benefit to education. Even if someone is concerned, they may still see the positives outweighing the risks. It's a subjective decision at the end of the day.
I should have clarified that I used Adobe Acrobat to redact his personal identifiers from the report before uploading it to Claude. I generally also prompt using fake names. It's not perfect, but it's better than nothing.
And, on another note, this may be foolish, but I generally trust well-funded organizations like Anthropic and OpenAI on the assumption that they have everything to lose if they leak private information from their paying users. Anthropic has a comprehensive and thoughtful privacy policy (https://www.anthropic.com/legal/privacy), which specifies that they do not use your data to train their models, other than to refine models used for trust and safety:
"We will not use your Inputs or Outputs to train our models, unless: (1) your conversations are flagged for Trust & Safety review (in which case we may use or analyze them to improve our ability to detect and enforce our Acceptable Use Policy, including training models for use by our Trust and Safety team, consistent with Anthropic’s safety mission), or (2) you’ve explicitly reported the materials to us (for example via our feedback mechanisms), or (3) by otherwise explicitly opting in to training."
As for defending against a data breach, Anthropic hired a former Google engineer, Jason Clinton, as CISO. I couldn't find much information about the relevant experience at Google that may have made him a good candidate for this role, but people with a key role in security at large organizations often don't advertise this fact on their LinkedIn profiles as it makes them a target. Once you're the CISO, the target appears, but that's what the big money is for.
Thanks for the vote of confidence. I led the Chrome Infrastructure Security Team, hardening for insider risk and generally defending against APTs, for the last 3 years at Google. Before that, I was on the Payments Security Team defending PII and SPII data up and down the stack. Indeed, I and the company take this very seriously. We're racing as fast as we can to defend against not only run-of-the-mill opportunistic attackers but also APTs. We've grown the security team from 4 to 35 people over the last year. I'm still hiring, though!
I have kind of a pet peeve about how people test LLMs like this these days.
They take whatever it spits out on the first attempt, and then they go on to extrapolate from that to draw all kinds of conclusions. They forget that the output it generated is based on a random seed. A new attempt (with a new seed) is going to give a totally different answer.
If the author had retried that prompt, the new attempt might have generated better code or much worse code. You cannot draw conclusions from just one answer.
That doesn't seem to be the case here. Reading through the article and Twitter thread, the impression I get is that moyix and the author between them spent a decent amount of time on this. A valid criticism could have been the use of Claude Sonnet, but based on the Twitter thread, it looks like Opus was what @moyix used.
Yes – it's a bit hard to follow the various branches of the thread on Twitter (I wasn't really intending this to be more than a 30 minute "hey that's neat" kind of experiment, but people kept suggesting new and interesting things to try :)), but I gave Claude Opus three independent tries at creating a fuzzer for Toby's packet parser, and it consistently missed the fact that it needed to include the sequence number in the CRC calculation.
Once that oversight was pointed out, it did write a decent fuzzer that found the memory safety bugs in do_read and do_write. I also got it to fix those two bugs automatically (by providing it the ASAN output).
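For anyone wondering what that oversight looks like concretely, here's a minimal sketch of the kind of packet builder involved. The field layout (2-byte type, 4-byte sequence number, 2-byte length, payload, trailing CRC32) is invented for illustration; Toby's actual format isn't reproduced in this thread.

    import os
    import random
    import struct
    import zlib

    def build_packet(seq: int, payload: bytes) -> bytes:
        # Hypothetical layout: 2-byte type, 4-byte sequence number, 2-byte length.
        header = struct.pack("<HIH", random.randrange(0, 8), seq, len(payload))
        # The easy mistake is crc = zlib.crc32(payload); here the parser is
        # assumed to want the sequence number covered as well.
        crc = zlib.crc32(header[2:6] + payload)
        return header + payload + struct.pack("<I", crc)

    def random_packet(seq: int) -> bytes:
        pkt = bytearray(build_packet(seq, os.urandom(random.randrange(0, 256))))
        # Occasionally corrupt a byte after the CRC is computed so some packets
        # still exercise the parser's error paths.
        if random.random() < 0.2:
            pkt[random.randrange(len(pkt))] ^= 0xFF
        return bytes(pkt)

Get the checksum wrong consistently and every input dies at the same early check, which is exactly why the first three attempts found nothing.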
> A new attempt (with a new seed) is going to give a totally different answer
Totally different...I'd posit 5% different, and mostly in trivialities.
It's worth doing an experiment and prompting an LLM with a coding question twice, then seeing how different it is.
For, say, a K-Means clustering algorithm, you're absolutely correct. The initial state is _completely_ dependent on the choice of seed.
With LLMs, the initial state is your prompt + a seed. The prompt massively overwhelms the seed. Then the nature of the model (predicting probabilities) and the nature of sampling (attempting to minimize surprise) mean there's a powerful forcing function towards answers that share much in common. This is true both in theory and, I think you'll see, in practice.
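A quick way to run that experiment: same prompt, two different seeds, then a rough similarity score. This sketch uses the OpenAI Python client; the seed parameter is only honored by some models, and the model name below is just a placeholder.

    import difflib
    from openai import OpenAI

    client = OpenAI()
    PROMPT = "Write a Python function that parses a GIF header and returns its width and height."

    def completion(seed: int) -> str:
        resp = client.chat.completions.create(
            model="gpt-4-turbo-preview",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,
            seed=seed,
        )
        return resp.choices[0].message.content

    a, b = completion(1), completion(2)
    print(f"similarity: {difflib.SequenceMatcher(None, a, b).ratio():.2f}")

SequenceMatcher is a crude measure, but it's enough to distinguish "totally different" from "mostly the same answer with different variable names."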
Depends on the question. If you ask for a small fact, you are going to get almost the same answer every time. But if it's not a factual question and the answer is supposed to be a long, tangled one, then the answer is going to depend on what the LLM said in the first lines, because it is going to stick with that.
e.g. the LLM might have said for some reason that writing a fuzzer like this isn't possible, and then gone on to present some alternatives for the given task.
I only have experience with GPT-4 via the API, but I believe at their core all these LLMs work the same way.
You're absolutely correct, in that it's never guaranteed what the next token is.
My pushback is limited to this: the theoretical maximal degenerate behavior described in either of your comments is highly improbable in practice, given reasonable parameters and a reasonable model.
I.e., it will not:
- give totally different answers just because the seed changed.
- say it is impossible X% of the time (where X > 5, say) and provide some solution the other (100 - X)% of the time.
I have integrated with GPT-3.0/GPT-3.5/GPT-4 and revisions thereof via API, as well as Claude 2 and, this week, Claude 3. I wrote a native inference solution that runs, among others, StableLM Zephyr 3B, Mistral 7B, and Mixtral 8x7B, and I wrote code that does inference, step by excruciating step, in a loop, on the web via WASM and in C++ as tailored solutions for Android, iOS, macOS, and Windows.
I still think it depends on the subject you are prompting about. If the LLM knows that thing very well, it will stick to the answer; otherwise it can go in a different direction depending on how its initial assessment differed.
Yesterday I asked it to write a simple VB script to show a reminder that I would schedule via the command line using Windows Task Scheduler. On the first attempt it suggested creating a separate VB file for each message, based on its initial reasoning that I couldn't pass arguments to a VB file like that. That didn't seem correct (confirmed via Google), so I resubmitted the same prompt; this time it said I could simply pass my reminder message as an argument to the VB script, and the subsequent code was based on that. (I don't know VB and had never used Task Scheduler before.)
This was GPT-4. You are not wrong about 'maximal degenerate behaviour', but an initially generated assumption can lead to different answers overall. Chain-of-thought prompting stems from this exact behaviour.
You could likely also combine the LLM with a coverage tool to provide additional guidance when regenerating the fuzzer: "Your fuzzer missed lines XX-YY in the code. Explain why you think the fuzzer missed those lines, describe inputs that might reach those lines in the code, and then update the fuzzer code to match your observations."
This approach could likely also be combined with RL; the code coverage provides a decent reward signal.
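A rough sketch of that loop for a Python target, using coverage.py. The fuzzer and target modules here are placeholders for the LLM-generated generator and the parser under test, and ask_llm() stands in for whatever model API you'd actually call.

    import coverage

    import fuzzer  # assumption: the LLM-generated fuzzer module, exposing generate_input()
    import target  # assumption: the parser under test, exposing parse()

    def ask_llm(prompt: str) -> str:
        """Placeholder for whatever model API you call (Claude, GPT-4, ...)."""
        raise NotImplementedError

    def missed_lines(n_inputs: int = 1000) -> str:
        cov = coverage.Coverage()
        cov.start()
        for _ in range(n_inputs):
            try:
                target.parse(fuzzer.generate_input())
            except Exception:
                pass  # crashes are the point; keep collecting coverage
        cov.stop()
        # analysis2 returns (filename, statements, excluded, missing, missing-as-string)
        _, _, _, _, missing = cov.analysis2(target)
        return missing

    feedback = (
        f"Your fuzzer missed lines {missed_lines()} in target.py. "
        "Explain why you think the fuzzer missed those lines, describe inputs "
        "that might reach those lines in the code, and then update the fuzzer "
        "code to match your observations."
    )
    new_fuzzer_code = ask_llm(feedback)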
It seems to overlook that the language model was developed using a large corpus of code, which probably includes structured fuzzers for file formats such as GIF. Plus, the scope of the "unknown" format introduced is limited.
The original test of the GIF parser does, but the VRML parser less so and the completely novel packet parser even less so. I'm not quite sure what you mean by the scope of the "unknown" format being limited – it's not the most complex format in the world, but neither is GIF.
Another test to check how much seeing the actual parser code helps is to have it generate a GIF fuzzer without giving it the code:
Why wouldn't you have an LLM write some code that uses something like libfuzzer instead?
That way you get an efficient, robust coverage-driven fuzzing engine, rather than having an LLM try to reinvent the wheel on that part of the code poorly. Let the LLM help write the boilerplate code for you.
They're actually orthogonal approaches – from what I've seen so far, the LLM fuzzer generates much higher-quality seeds than you'd get even after fuzzing for a while (in the case of the VRML target, even if you start with some valid test files found online), but it's not as good at generating broken inputs. So the obvious thing to do is have the LLM's fuzzer generate initial seeds that get pretty good coverage and then use a traditional coverage-guided fuzzer to further mutate those.
These are still pretty small-scale experiments on essentially toy programs, so it remains to be seen whether LLMs stay useful on real-world programs, but so far it looks pretty promising – and it's a lot less work than writing a new libfuzzer target, especially when the program is one that's not set up with nice in-memory APIs (e.g., that GIF decoder program just uses read() calls distributed all over the program; it would be fairly painful to refactor it to play nicely with libfuzzer).
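That hybrid is simple to wire up. Here's a rough sketch assuming the LLM-written generator is a Python module (called fuzzer here) exposing generate_input(): dump a few hundred seeds into a directory, then hand that corpus to a standard coverage-guided fuzzer such as AFL++.

    import pathlib

    import fuzzer  # assumption: the LLM-generated seed generator

    corpus = pathlib.Path("corpus")
    corpus.mkdir(exist_ok=True)
    for i in range(200):
        (corpus / f"seed_{i:04d}").write_bytes(fuzzer.generate_input())

    # Then, for a file-reading target:
    #   afl-fuzz -i corpus -o findings -- ./gif_decoder @@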
I don't think using a language model to generate inputs directly is ever going to be as efficient as writing a little bit of code to do the generation; it's really hard to beat an input generator that can craft thousands of inputs/second.
For one, it'd be really hard for an LLM to get the CRC32 right, especially when it's in a header before the data it covers.
Then again, this whole approach to fuzzing comes across as kinda naive; at the very least you'd want to use the API of a coverage-guided fuzzer for generating the randomness (and then almost always fix up the CRC32 on top of that, like a human-written wrapper function would).
Exactly. If I actually wanted to fuzz this I'd use libfuzzer and manually fix the crc32. An LLM would be useful in helping me write the libfuzzer glue code.
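For a Python target, that glue can be sketched with Atheris, which drives coverage-guided fuzzing on top of libFuzzer; the target module and the trailing-CRC32 layout below are assumptions, but the shape matches what a C/C++ LLVMFuzzerTestOneInput harness would do: let the engine mutate the body freely and fix up the checksum before calling the parser.

    import sys
    import zlib

    import atheris

    with atheris.instrument_imports():
        import target  # assumption: the parser under test, exposing parse()

    def TestOneInput(data: bytes) -> None:
        if not data:
            return
        # Fix up the checksum so mutated inputs aren't all rejected at the CRC check.
        crc = zlib.crc32(data).to_bytes(4, "little")
        try:
            target.parse(data + crc)
        except ValueError:
            pass  # "expected" parse errors; real crashes and ASan reports still surface

    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()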