I was working on Distributed Services for AIX in 1986 and 1987, a distributed filesystem to compete with NFS. As this was being developed by a dev team, my colleague and I pondered how to test the system that we had architected.
There were so many possible states that a system's file system can be in. Were the conventional tests going to catch subtle bugs? Here's an example of an unusual but not unheard of issue: in a Unix system a file can be unlinked and hence deleted from all directories while remaining open by one or more processes. Such a file could still be written to and read by multiple processes at least until it was finally closed by all processes having open file descriptors at which point it would disappear from the system. Does the distributed file system model this correctly? Many other strange combinations of system calls might be applied to the file system. Will the tests exercise these.
It occurred to me that the "correct" behavior for any sequence of system calls could be defined by just running the sequence on a local file system and comparing the results with running an identical sequence of system calls against the distributed file system.
I built a system to generate random sequences of file related system calls that would run these on a local file system and a remote distributed file system that I wanted to test. As soon as a difference in outcome resulted the test would halt and save a log of the sequence of operations.
My experience with this test rig was interesting. At first discrepancies happened right away. As each bug was fixed by the dev team, we would start the stochastic testing again, and then a new bug would be found. Over time the test would run for a few minutes before failure and then a few minute longer and finally for hours and hours. It was a really ingesting and effective way to find some of the more subtle bugs in the code. I don't recall if I published this information internally or not.
That's very much my experience: asymptotic reduction in bug incidence and matching increase in runtime between subsequent errors. To the point that they are so rare that you think you have fixed all of them. But that's usually an illusion: it's just that the error rate is now so low that you no longer observe incidents yourself. The only way to get past that hurdle is to do the bad thing: release and hope that you got it right and if you didn't that the incidents will not be too bad.
You could run many tests in parallel to reduce the chance but it will never be completely zero. Writing bug free software this way is hard. The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions. This usually gets you to a high level of confidence that things really do work as designed. But of course, that also isn't perfect and residual risk (and residual bugs...) will always remain in any system of even moderate complexity. File systems are well above that level, especially distributed file systems.
I think there's room for a hybrid approach, not just fixing the bug a fuzz test found, but treating the fuzz test error as a bug in invariants (either inadequate or violated in a subtle way that's not caught) and addressing that underlying issue.
Yes, good point: often it is the assumptions themselves that were broken. So effectively you are debugging both the real system and the monitoring system.
> The better way is to design it from the ground up with a bunch of instrumentation that keeps all of your invariants under close observation and that stops the moment anything is not according to your assumptions.
Any personal "war stories" you are willing to share explaining how you went about designing such a system? :-) Or any presentation of someone who did that?
And without which that project likely would not have seen a successful end result.
It really is a war story, and what I like most about it is how at the drop of a hat a team assembled to get the job done, cleanly tackled the problem(s) and transferred duties to a new crew.
The key to modern fuzzing is feedback, usually some kind of coverage measurement of the program under test. This allows the fuzzer to be much smarter about how it finds new code paths and discards inputs that don't extend coverage. This makes fuzzing find bugs a lot quicker.
I used this strategy for implementing a regex engine. I wanted to completely imitate the Lua pattern implementation, so I generated random patterns, ran them on random strings and compared the results.
It was very pleasant to work with such a system. Nowadays I would probably fuzz the patterns with AFL somehow.
For those of you in FDA regulated devices, my clients started receiving FDA NSE letters for not performing fuzz testing. For example, "Though you have provided penetration testing, it does not appear that you have addressed the other items identified such as static and dynamic code analysis, malformed input (fuzz) testing, or vulnerability scanning. This testing is necessary to assess the effectiveness of the cybersecurity controls implemented and to determine whether the residual risk of your device is acceptable."
That's excellent that they are doing that. Especially for embedded devices because there tend to be lots of homebrew protocols on those, and those are usually easy pickings.
If their penetration testing didn’t perform fuzzing then you may want to look into a new pen test provider. Fuzz testing is default on most pen tests (I do this professionally)
The relative rarity of input (pseudo-)randomization in SW testing is near inexplicable to me, except by the very low cost of all but the most commonly reproducing bugs paid by the SW vendor.
In the regular testsuite (think CI) you want to have predictable results. Doing them again and again on the same code should give the same results so you can properly see with which code change things got wrong.
Maybe it's simpler to explain it the other way around, for every new path your fuzzer(or other randomized test) tests, it also doesn't test a path it tested in a previous run and you probably want to add the failing paths it found to your regular test suite.
Don't get me wrong, we should have more randomization, but it's not good everywhere, which might explain why we don't have as much of it.
it's rather easy to have both randomness and reproducibility, though:
generate a random seed, log it, then create an RNG using that "random, but recorded" seed. make sure all randomness used in the test flows from that explicitly-seeded RNG.
then, have an escape hatch where if a seed is provided as an environment variable, it will use that instead of generating one.
if you have a failure occur, you can always re-run with the same seed as a way to reproduce the failure (assuming it was indeed caused by that random seed and not some other factor)
depending on how fast the tests are, it may also be possible to run them multiple times with different seeds. for example, your on-every-commit CI run might run once with a hardcoded seed of 42. or it might run once with a hardcoded seed and once with a random seed.
and meanwhile, you might have a nightly test run that runs that same test suite 100 or 1000 times, with a different random seed each time.
Any half decent fuzzing setup will log what it did prior so you can replay it to the point of failure. This gets a lot harder when you do multiple such runs in parallel.
AFL++ logs the specific input that causes the crash. In theory at least replaying the input ought to trigger the crash reproducibly. (Sometimes not the case if the program has lots of threads or is event driven or otherwise stochastic).
That all true but at some point the combinations of paths explode. It is not possible to write tests for all the combinations then it possible to cover them eventually with some probability. Fuzzing covers more execution path combinations over time.
I love fuzzing as a technique and use it quite regularly and I'm even the maintainer of AFL++ in Fedora. But running AFL++ on even a single program occupies all threads of a high end AMD server for weeks. I'm running it locally so merely paying for the electricity. If it was a cloud instance it would cost a small fortune. I think this is a reason it is not used more widely. In addition most CI systems assume the tests will run in a small finite amount of time, not run for weeks on end.
I will note that Google have a programme for doing fuzz testing on open source projects using compute from their cloud: https://google.github.io/oss-fuzz/
Hello, I'm Jaromir, one of the core engineers at QuestDB team. I just noticed this blog is trending! Andrei - the author - lives in Bulgaria and he is probably already sleeping. Happy to answer any question the blog left unanswered.
1. correctness: from small units tests to relatively complex integrations tests. they typically populate a test database and query it via various interfaces, such as REST or the Postgres protocol. we use Azure Pipelines to execute them - testing in MacoOS, Linux (both Intel and ARM) and Windows.
2. performance: we tend to use the TSBS project for most of our performance testing and profiling. fun fact: we actually had to patch it as the vanilla TSBS was a bottleneck in some tests. Sadly, the PR with the improvements is still not merged: https://github.com/timescale/tsbs/pull/186
edit: I thought I would link some of the more interesting tests: Since QuestDB supports the Postgres wire protocol we have to gracefully handle even various half-broken Postgres clients. But how do you write a test with mimicking a client generating invalid requests? No sane client will generate broken requests. So we use Wireshark to record network communication between the broken client and our server and then use this recorded communication in tests. Example: https://github.com/questdb/questdb/blob/3995c31210c70664d4b3...
(Sorry for hijacking the thread with a general fuzz question.)
I want to do fuzz testing on a library/framework written in C++. The actual target to test against is a simulator that takes input from a socket according to a network protocol. This simulator is built on both Linux and Windows.
What fuzzing frameworks would you recommend for this? There are quite a few and not always easy for me (as a fuzzing beginner) to understand the differences between them. Some of them seems to be abandoned projects as well.
There were so many possible states that a system's file system can be in. Were the conventional tests going to catch subtle bugs? Here's an example of an unusual but not unheard of issue: in a Unix system a file can be unlinked and hence deleted from all directories while remaining open by one or more processes. Such a file could still be written to and read by multiple processes at least until it was finally closed by all processes having open file descriptors at which point it would disappear from the system. Does the distributed file system model this correctly? Many other strange combinations of system calls might be applied to the file system. Will the tests exercise these.
It occurred to me that the "correct" behavior for any sequence of system calls could be defined by just running the sequence on a local file system and comparing the results with running an identical sequence of system calls against the distributed file system.
I built a system to generate random sequences of file related system calls that would run these on a local file system and a remote distributed file system that I wanted to test. As soon as a difference in outcome resulted the test would halt and save a log of the sequence of operations.
My experience with this test rig was interesting. At first discrepancies happened right away. As each bug was fixed by the dev team, we would start the stochastic testing again, and then a new bug would be found. Over time the test would run for a few minutes before failure and then a few minute longer and finally for hours and hours. It was a really ingesting and effective way to find some of the more subtle bugs in the code. I don't recall if I published this information internally or not.