Fuzz testing: the best thing to happen to our application tests (questdb.io)
103 points by bluestreak on Aug 17, 2023 | 24 comments



I was working on Distributed Services for AIX in 1986 and 1987, a distributed filesystem to compete with NFS. As this was being developed by a dev team, my colleague and I pondered how to test the system that we had architected.

There are so many possible states a system's file system can be in. Were the conventional tests going to catch subtle bugs? Here's an example of an unusual but not unheard-of issue: in a Unix system a file can be unlinked, and hence deleted from all directories, while remaining open by one or more processes. Such a file can still be written to and read by multiple processes until it is finally closed by every process holding an open file descriptor, at which point it disappears from the system. Does the distributed file system model this correctly? Many other strange combinations of system calls might be applied to the file system. Will the tests exercise these?

It occurred to me that the "correct" behavior for any sequence of system calls could be defined by just running the sequence on a local file system and comparing the results with running an identical sequence of system calls against the distributed file system.

I built a system to generate random sequences of file-related system calls and run them on both a local file system and the remote distributed file system I wanted to test. As soon as the outcomes diverged, the test would halt and save a log of the sequence of operations.

My experience with this test rig was interesting. At first discrepancies happened right away. As each bug was fixed by the dev team, we would start the stochastic testing again, and then a new bug would be found. Over time the test would run for a few minutes before failure, then a few minutes longer, and finally for hours and hours. It was a really interesting and effective way to find some of the more subtle bugs in the code. I don't recall if I published this information internally or not.
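
A rough sketch of that kind of differential loop, in Python rather than the original system. The operation set and the apply/snapshot helpers here are made-up stand-ins for the two file systems and a way to capture their observable state:

    import random

    # Toy differential tester: apply the same random operation sequence to a
    # reference implementation and the implementation under test, and stop as
    # soon as their observable states diverge. "reference", "under_test" and
    # "snapshot" are hypothetical stand-ins (directory listings, file
    # contents, returned errors, ...).
    OPS = ["create", "write", "read", "unlink", "rename", "close"]

    def random_op(rng):
        return (rng.choice(OPS), rng.randrange(16), rng.randbytes(8))

    def differential_test(reference, under_test, snapshot,
                          iterations=100_000, seed=None):
        rng = random.Random(seed)
        log = []
        for _ in range(iterations):
            op = random_op(rng)
            log.append(op)
            ref_result = reference.apply(op)
            dut_result = under_test.apply(op)
            if ref_result != dut_result or snapshot(reference) != snapshot(under_test):
                return log      # save the sequence that exposed the divergence
        return None             # no divergence found in this run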


That's very much my experience: an asymptotic reduction in bug incidence and a matching increase in runtime between subsequent errors, to the point that bugs become so rare you think you have fixed all of them. But that's usually an illusion: it's just that the error rate is now so low that you no longer observe incidents yourself. The only way to get past that hurdle is to do the bad thing: release, hope that you got it right, and hope that if you didn't, the incidents will not be too bad.

You could run many tests in parallel to reduce the chance of missing a bug, but it will never be completely zero. Writing bug-free software this way is hard. The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions. This usually gets you to a high level of confidence that things really do work as designed. But of course, that also isn't perfect, and residual risk (and residual bugs...) will always remain in any system of even moderate complexity. File systems are well above that level, especially distributed file systems.


I think there's room for a hybrid approach: not just fixing the bug a fuzz test found, but also treating the fuzz failure as a bug in the invariants themselves (either they were inadequate, or they were violated in a subtle way that went uncaught) and addressing that underlying issue.


Yes, good point: often it is the assumptions themselves that were broken. So effectively you are debugging both the real system and the monitoring system.


> The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions.

Any personal "war stories" you are willing to share explaining how you went about designing such a system? :-) Or any presentation of someone who did that?


I wrote about retrofitting such a system onto an existing one:

https://jacquesmattheij.com/all-programming-is-bookkeeping/

Which was written in the context of:

https://jacquesmattheij.com/saving-a-project-and-a-company/

And without which that project likely would not have seen a successful end result.

It really is a war story, and what I like most about it is how at the drop of a hat a team assembled to get the job done, cleanly tackled the problem(s) and transferred duties to a new crew.


The key to modern fuzzing is feedback, usually some kind of coverage measurement of the program under test. This allows the fuzzer to be much smarter about how it finds new code paths and discards inputs that don't extend coverage. This makes fuzzing find bugs a lot quicker.

Google have a project to do fuzzing on Linux system calls using coverage feedback: https://github.com/google/syzkaller
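
At its core the coverage-feedback loop looks roughly like this. This is a toy Python sketch; measure_coverage stands in for the compile-time instrumentation that real fuzzers such as AFL++, libFuzzer, or syzkaller use, and seeds is assumed to be a non-empty list of initial inputs:

    import random

    def mutate(data: bytes) -> bytes:
        """Flip a bit, insert a byte, or delete a byte at a random position."""
        buf = bytearray(data) or bytearray(b"\x00")
        pos = random.randrange(len(buf))
        choice = random.randrange(3)
        if choice == 0:
            buf[pos] ^= 1 << random.randrange(8)
        elif choice == 1:
            buf.insert(pos, random.randrange(256))
        elif len(buf) > 1:
            del buf[pos]
        return bytes(buf)

    def fuzz(target, measure_coverage, seeds, iterations=1_000_000):
        # measure_coverage(target, data) is assumed to run the target on the
        # input and return the set of edges/branches it hit.
        corpus = list(seeds)
        seen = set()
        crashes = []
        for _ in range(iterations):
            data = mutate(random.choice(corpus))
            try:
                edges = measure_coverage(target, data)
            except Exception as exc:
                crashes.append((data, exc))   # target misbehaved: keep the input
                continue
            if not edges <= seen:             # reached code no earlier input reached
                seen |= edges
                corpus.append(data)           # keep it as a new starting point
        return crashes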


I used this strategy for implementing a regex engine. I wanted to completely imitate the Lua pattern implementation, so I generated random patterns, ran them on random strings and compared the results.

It was very pleasant to work with such a system. Nowadays I would probably fuzz the patterns with AFL somehow.


For those of you in FDA regulated devices, my clients started receiving FDA NSE letters for not performing fuzz testing. For example, "Though you have provided penetration testing, it does not appear that you have addressed the other items identified such as static and dynamic code analysis, malformed input (fuzz) testing, or vulnerability scanning. This testing is necessary to assess the effectiveness of the cybersecurity controls implemented and to determine whether the residual risk of your device is acceptable."


It's excellent that they are doing that. Especially for embedded devices, because they tend to run lots of homebrew protocols, which are usually easy pickings.


If their penetration testing didn’t perform fuzzing, then you may want to look into a new pen test provider. Fuzz testing is standard on most pen tests (I do this professionally).


The relative rarity of input (pseudo-)randomization in software testing is nearly inexplicable to me, except that the vendor pays a very low cost for all but the most commonly reproducing bugs.


In the regular test suite (think CI) you want predictable results. Running it again and again on the same code should give the same results, so you can properly see which code change broke things. Maybe it's simpler to explain it the other way around: for every new path your fuzzer (or other randomized test) exercises, it also skips a path it tested in a previous run, and you probably want to add the failing paths it found to your regular test suite.

Don't get me wrong, we should have more randomization, but it's not good everywhere, which might explain why we don't have as much of it.


it's rather easy to have both randomness and reproducibility, though:

generate a random seed, log it, then create an RNG using that "random, but recorded" seed. make sure all randomness used in the test flows from that explicitly-seeded RNG.

then, have an escape hatch where if a seed is provided as an environment variable, it will use that instead of generating one.

if you have a failure occur, you can always re-run with the same seed as a way to reproduce the failure (assuming it was indeed caused by that random seed and not some other factor)

depending on how fast the tests are, it may also be possible to run them multiple times with different seeds. for example, your on-every-commit CI run might run once with a hardcoded seed of 42. or it might run once with a hardcoded seed and once with a random seed.

and meanwhile, you might have a nightly test run that runs that same test suite 100 or 1000 times, with a different random seed each time.
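
a minimal pytest-flavoured sketch of that pattern (the FUZZ_SEED variable name and the hex round-trip property are just stand-ins; adapt the environment variable and framework to taste):

    import os
    import random
    import pytest

    @pytest.fixture
    def rng():
        # Honour an externally supplied seed so a failure can be replayed with
        # FUZZ_SEED=<value>; otherwise pick one at random and print it, which
        # pytest shows alongside the failure output.
        seed = int(os.environ.get("FUZZ_SEED", random.randrange(2**32)))
        print(f"FUZZ_SEED={seed}")
        return random.Random(seed)

    def test_hex_roundtrip(rng):
        payload = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 64)))
        assert bytes.fromhex(payload.hex()) == payload   # stand-in property under test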


Any half-decent fuzzing setup will log what it did beforehand so you can replay it to the point of failure. This gets a lot harder when you do multiple such runs in parallel.


AFL++ logs the specific input that causes the crash. In theory at least replaying the input ought to trigger the crash reproducibly. (Sometimes not the case if the program has lots of threads or is event driven or otherwise stochastic).


That's all true, but at some point the combinations of paths explode. It is not possible to write tests for all the combinations, but it is possible to cover them eventually with some probability. Fuzzing covers more execution-path combinations over time.


hypothesis kind of solves this problem by adding each (minimized) failing input to a file and always running it thereafter

this is a little tricky to integrate into ci
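
for reference, hypothesis keeps the minimized failures in a local .hypothesis/ example database, which usually doesn't survive a fresh CI checkout; one common workaround is to pin a found failure explicitly once it has been minimized (the pinned input below is hypothetical):

    from hypothesis import example, given, strategies as st

    # Hypothesis shrinks a failing input and remembers it in its local example
    # database (.hypothesis/ by default). Pinning it with @example makes it
    # replay on every run regardless of whether that directory is present.
    @given(xs=st.lists(st.integers()))
    @example(xs=[2**31])   # hypothetical regression found by an earlier run
    def test_sort_is_idempotent(xs):
        once = sorted(xs)
        assert sorted(once) == once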


I love fuzzing as a technique and use it quite regularly; I'm even the maintainer of AFL++ in Fedora. But running AFL++ on even a single program occupies all the threads of a high-end AMD server for weeks. I'm running it locally, so I'm merely paying for the electricity. If it were a cloud instance it would cost a small fortune. I think this is one reason it is not used more widely. In addition, most CI systems assume the tests will run in a small, finite amount of time, not for weeks on end.

I will note that Google have a programme for doing fuzz testing on open source projects using compute from their cloud: https://google.github.io/oss-fuzz/


hardware people keep saying that for some reason

maybe someday software people will listen

that would be a good day


Hello, I'm Jaromir, one of the core engineers on the QuestDB team. I just noticed this blog post is trending! Andrei, the author, lives in Bulgaria and is probably already asleep. Happy to answer any questions the post left unanswered.


What tests did you have before adding fuzzing?


it's the usual spectrum of tests:

1. correctness: from small unit tests to relatively complex integration tests. they typically populate a test database and query it via various interfaces, such as REST or the Postgres protocol. we use Azure Pipelines to execute them, testing on macOS, Linux (both Intel and ARM) and Windows.

2. performance: we tend to use the TSBS project for most of our performance testing and profiling. fun fact: we actually had to patch it as the vanilla TSBS was a bottleneck in some tests. Sadly, the PR with the improvements is still not merged: https://github.com/timescale/tsbs/pull/186

edit: I thought I would link some of the more interesting tests: since QuestDB supports the Postgres wire protocol, we have to gracefully handle even various half-broken Postgres clients. But how do you write a test mimicking a client that generates invalid requests? No sane client will generate broken requests. So we use Wireshark to record the network communication between the broken client and our server and then replay this recorded communication in tests. Example: https://github.com/questdb/questdb/blob/3995c31210c70664d4b3...
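
To illustrate the replay idea only (QuestDB's actual tests are Java; the hex string below is just a Postgres SSLRequest message as a small example, not the capture from the linked test, and 8812 is assumed to be QuestDB's default Postgres wire port): the raw bytes exported from Wireshark are written to the server socket, and the test asserts the server answers or closes cleanly instead of hanging or crashing.

    import socket

    CAPTURED_HEX = "0000000804d2162f"   # example capture: a Postgres SSLRequest

    def replay_capture(host="localhost", port=8812):
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(bytes.fromhex(CAPTURED_HEX))
            return sock.recv(1024)       # server should respond, not hang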


(Sorry for hijacking the thread with a general fuzz question.)

I want to do fuzz testing on a library/framework written in C++. The actual target to test against is a simulator that takes input from a socket according to a network protocol. This simulator is built on both Linux and Windows.

What fuzzing frameworks would you recommend for this? There are quite a few, and it's not always easy for me (as a fuzzing beginner) to understand the differences between them. Some of them seem to be abandoned projects as well.



