# Design notes
Here are some ideas about how I (@stedolan) want Crowbar to end up, formed mostly from discussions with co-conspirator @yomimono and Hypothesis author @DRMacIver, reading Hypothesis docs and various papers (citations below), and some spelunking in afl-fuzz.c. I don't think any of these ideas are new, but I don't think they've been simultaneously stolen before.
This isn't about changing the API of Crowbar. All of the stuff below works with an unchanged `crowbar.mli` and more-or-less the same execution model (although I think it's worthwhile to redo the afl-fuzz functionality directly in OCaml, so the exact command for running tests will change; more on this below). In particular, `ppx_deriving_crowbar` should not need any changes.
This document consists of a list of testing algorithms, starting with naive QuickCheck and ending with what I'd like Crowbar to do. The `master` branch of Crowbar already does a few of these. The `hype` branch already does a few more, and Hypothesis already does an overlapping set of them. In back-of-the-envelope experiments, going further down this list has found bugs faster, although I need to do more thorough testing.
The starting point is completely naive QuickCheck:

```
N times {
  let t = generate a random testcase
  if t fails {
    print t
  }
}
```
The "generate a random testcase" is done according to a typed
generator that the user provides (see crowbar.mli
for how these
work), and the test is some user-provided function. A "test failure"
is a particular testcase that makes the function raise an
exception. It's hard to tell how big N
should be.
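For concreteness, here's roughly what this looks like with the existing Crowbar API. A minimal sketch; the generator and the property being tested are just examples:

```ocaml
(* A toy Crowbar test: [list int] is the typed generator, and the
   function is the user-provided test. [check false] raises, which is
   what a "test failure" means above. *)
let () =
  Crowbar.(add_test ~name:"rev twice" [list int] (fun l ->
    check (List.rev (List.rev l) = l)))
```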
The first problem (pointed out in the original QuickCheck paper) is that, for some reasonable-looking generators, the above algorithm will only manage to do 1 or 2 tests. The standard problematic example is this generator for binary trees (`type t = L | B of t * t`):

- with probability 1/3, choose `L`
- with probability 2/3, generate two trees `x`, `y`, and choose `B (x, y)`
Unfortunately, this generator has only a 50% chance of actually terminating. The standard solution (which the `master` branch of Crowbar currently uses, although it's had some bugs) is to bound the depth of the recursion.
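(Where the 50% comes from: the termination probability p satisfies p = 1/3 + (2/3)p², since a `B` node terminates only if both subtrees do. Rearranging gives 2p² − 3p + 1 = 0, i.e. (2p − 1)(p − 1) = 0, and the termination probability of a branching process like this is the smallest non-negative solution, so p = 1/2.)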
A better solution, which Hypothesis uses and the `hype` branch of Crowbar now uses, is to arrange generators so that when the source of random numbers starts returning mostly `0`, the generator produces a very small test case. In Hypothesis this is done manually, but the `hype` branch of Crowbar now does this automatically, by computing an approximation of the smallest value a generator can make, and using it to sort the arguments to `choose` (search for `small_example_size` in `src/gen.ml`).
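A sketch of the idea, not the actual `src/gen.ml` code (the generator representation here is a simplified assumption):

```ocaml
(* Each generator carries an approximation of the size of the
   smallest value it can produce, and [choose] sorts its alternatives
   so that alternative 0 -- the one picked when the random source
   returns 0 -- is the smallest. Assumes [gens] is non-empty. *)
type 'a gen = {
  small_example_size : int;    (* approx. size of the smallest value *)
  run : (unit -> int) -> 'a;   (* draw bytes from a random source *)
}

let choose (gens : 'a gen list) : 'a gen =
  let gens =
    List.sort
      (fun a b -> compare a.small_example_size b.small_example_size)
      gens
  in
  let n = List.length gens in
  { small_example_size = 1 + (List.hd gens).small_example_size;
    run = (fun src -> (List.nth gens (src () mod n)).run src) }
```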
Either way, the algorithm is:

```
N times {
  let t = generate a random smallish testcase
  if t fails {
    print t
  }
}
```
- *QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs*, Claessen and Hughes, 2000
- *Hypothesis documentation*, David R. MacIver, 2013-2018
The provided generator determines a particular distribution of possible test cases, and the QuickCheck algorithm takes N independent samples from this distribution. N will vary a lot: we might want to run a few tests on every commit, but maybe leave the tests running over the weekend for less frequent but more thorough testing. Here lies the second problem with the QuickCheck algorithm, which I think is less widely appreciated than the first: the correct distribution depends on N.
Let's divide the possible test cases into "small" and "big" tests, and suppose that there are only K possible small tests. There might be infinitely many big tests. For instance, in the binary tree example above, we could take a "small" test to be a binary tree with fewer than 4 leaves.
Suppose that sampling the generator produces a small test with probability p. We don't want to ignore the small tests, so p > 0. The most likely small test has probability at least p/K, so if N > K/p we expect that it will be tested at least once. However, if N is much greater than K/p we expect that the small tests will be tested many, many times. In fact, no matter how large N gets, a proportion p of the testing budget is wasted re-testing the K smallest tests that have already been exhaustively explored.
So, the optimal value of p depends on how large N is (among other things). However, we don't necessarily know N in advance (a standard value of N is "until the user gets bored and hits ctrl-C"), and it's not obvious how best to define "small" and calculate p.
An extreme solution to this problem is to check the small cases first: spend the first K tests exhaustively checking all of the small cases, then only generate big cases afterward (see SmallCheck and the very cool Lazy Smallcheck). I don't think this is the right approach, though: sometimes there are bugs that only show up on large examples, but show up on a decent proportion of large examples. Waiting until you've exhausted the small ones can be exponentially slower than trying some large ones early.
I think that a better solution is to use a random walk instead of independent sampling. Instead of generating each test case in the same way, generate later test cases by mutating earlier ones. The probability of mutating a small test case and making it large is much greater than the probability of mutating a large test case and making it small (because there are so many more large test cases than small ones), so if left to run for long enough this process will stop generating small test cases entirely, as desired. To make the distribution less fragile, there should be a collection of random walkers, rather than just a single one, giving something like:
```
q := a queue with a single small test case
N times {
  let t = remove item from q
  M times {
    let t' = mutate t
    add t' to q
    if t' fails {
      print t'
    }
  }
}
```
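In OCaml, the loop might look something like this (a minimal sketch; `mutate`, `fails` and `report` are assumed to be supplied by the caller):

```ocaml
(* Random-walk testing: each dequeued testcase is mutated M times,
   and every mutant goes back on the queue. *)
let random_walk ~initial ~mutate ~fails ~report ~n ~m =
  let q = Queue.create () in
  Queue.add initial q;
  for _ = 1 to n do
    (match Queue.take_opt q with
     | None -> ()
     | Some t ->
         for _ = 1 to m do
           let t' = mutate t in
           Queue.add t' q;
           if fails t' then report t'
         done)
  done
```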
I think there are better explanations of this idea, but they involve statistical mechanics and expander graphs and other things I don't understand. I'll update this once I've done some reading.
- *Why Is Random Testing Effective for Partition Tolerance Bugs?*, Majumdar and Niksic, 2018
## Mutating test cases

There are lots of ways to implement the `mutate t` step above. Short version: what Crowbar used to do is bad, what Hypothesis does is much better, and I think doing what Hypothesis does in an immutable style with sharing is even better.
Both Crowbar and Hypothesis represent test cases as strings with a compact binary format. For instance, this generator:

```ocaml
type t = L of Int32.t | B of t * t
let gen = fix (fun g -> choose [
  map [int32] (fun n -> L n);
  map [g; g] (fun x y -> B (x, y))])
```
can represent this tree:

```
      .
     / \
    .   0
   / \
  1   5
```

with this bitstring:

```
01 01 00 00 00 00 01 00 00 00 00 05 00 00 00 00 00
```
To parse the test case, `choose` generators use the first byte to select the case, while `map` generators parse each argument in sequence. In the `gen` generator, `00` selects leaves and `01` selects branches, so the testcase above parses as:

```
01 01 00 00 00 00 01 00 00 00 00 05 00 00 00 00 00
B  [B  [L -- int32: 1 --]  [L -- int32: 5 --]]  [L -- int32: 0 --]
```
So, these bitstrings have a fairly simple internal structure, with implicit boundaries between the blocks of bytes meant for one generator and the bytes meant for another.
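To make the format concrete, here's a sketch of a decoder for this particular `gen` (an illustration of the format, not Crowbar's actual parser):

```ocaml
type t = L of Int32.t | B of t * t

(* Decode one tree from a list of bytes, returning the leftovers. *)
let rec parse = function
  | 0x00 :: b3 :: b2 :: b1 :: b0 :: rest ->
      (* 00 selects a leaf; the next 4 bytes are a big-endian int32 *)
      let n = ((b3 * 256 + b2) * 256 + b1) * 256 + b0 in
      (L (Int32.of_int n), rest)
  | _ :: rest ->
      (* any other first byte selects a branch (in general, choose
         reduces the byte mod the number of alternatives) *)
      let x, rest = parse rest in
      let y, rest = parse rest in
      (B (x, y), rest)
  | [] -> failwith "ran out of bytes"

(* parse [0x01; 0x01; 0x00; 0x00; 0x00; 0x00; 0x01;
          0x00; 0x00; 0x00; 0x00; 0x05;
          0x00; 0x00; 0x00; 0x00; 0x00]
   = (B (B (L 1l, L 5l), L 0l), []) *)
```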
Crowbar (on `master`) mutates testcases by adding, removing or flipping bits in the input bitstring. (Actually, this step is outsourced to afl-fuzz, but that's what it does.) This is done without any knowledge of the implicit tree structure of the test case, so most changes to the bitstring destroy the testcase (e.g. changing `B` to `L` without deleting the rest of the branch). Often, what looks like a small change will affect the entire rest of the test case, and might well make it unparseable.
Hypothesis, on the other hand, generates testcases with knowledge of block boundaries. In particular, when a part of a testcase is being mutated, the bytes corresponding to the rest of the testcase do not get out of sync.
Hypothesis does a much better job of mutating test cases than Crowbar, but since its generators have an imperative-style API and are allowed to have side-effects, it must rerun the generator for each mutation of the test case. Crowbar's API is made of functional combinators, and demanding that generators be pure is something I think we can get away with in OCaml. So, if we use a data structure that explicitly stores all the sub-parts of generated testcases, then we can efficiently generate new testcases without having to re-build the entire structure.
This data structure exists on the `hype` branch of Crowbar as `'a Gen.sample`. The `Gen.mutate` algorithm picks a byte in the testcase uniformly at random, walks down the testcase tree until it finds the node starting at or containing that byte, and then replaces that node. This could involve replacing the entire testcase (if it happens to pick the first byte), but I argue that it's fast on average, at least for reasonably balanced testcases:
If the testcase is a balanced-ish tree of size n, with about c^k nodes (for some constant branching factor c) at depth k up to a maximum depth d = O(log n), then a node at depth k has about c^(d-k) nodes underneath it, so its expected size is:
```
1/n Σ_(k ≤ d) c^k · c^(d−k)
  = 1/n Σ_(k ≤ d) c^d
  = 1/n · d · c^d
  = O(log n)
```
In other words, a random node of a random tree has about O(log n) nodes underneath it. So, it should take about O(log n) time to mutate a test case, and more importantly about O(log n) extra space to store the mutated test case on the queue, instead of the O(n) it would take to store each testcase in its entirety.
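Here's a sketch of that mutation step (illustrative; the real `'a Gen.sample` stores more, e.g. the generated values). A node's own bytes (such as its `choose` byte) are assumed to precede its children's:

```ocaml
type node = { span : int; children : node list }
(* [span] counts every byte consumed by this subtree. *)

let child_bytes n = List.fold_left (fun acc c -> acc + c.span) 0 n.children

(* Replace the node starting at or containing byte [off]. Nodes off
   the path from the root to the replaced node are shared with the
   original testcase, so a mutation costs O(depth) extra space.
   [regen] generates a fresh random replacement for a node. *)
let rec mutate ~regen off node =
  let own = node.span - child_bytes node in
  if off < own then regen node        (* the byte belongs to this node *)
  else
    let rec go off acc = function
      | [] -> assert false            (* off < node.span, so unreachable *)
      | c :: cs ->
          if off < c.span then List.rev_append acc (mutate ~regen off c :: cs)
          else go (off - c.span) (c :: acc) cs
    in
    let children = go (off - own) [] node.children in
    (* recompute the span: the replacement subtree may differ in size *)
    { span = own + List.fold_left (fun a c -> a + c.span) 0 children;
      children }

(* Usage: mutate ~regen (Random.int root.span) root *)
```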
## Coverage instrumentation

The main difference between Crowbar and other QuickCheck-like systems is that Crowbar uses coverage instrumentation to guide the search for failing testcases. This extra information tells Crowbar where the edge cases in the code under test are, guiding testing toward tricky and possibly-buggy parts of the code.
The instrumentation is based on afl-fuzz's instrumentation for C. Roughly, this generates a random number at compile time for each conditional branch target (e.g. each function, each case of a `match`, each side of an `if`, etc.). Every time one of these targets is hit, the instrumentation uses a hash of the current and previous target to update a slot in a hash table (see lcamtuf's docs for more details). From Crowbar's point of view, the useful part is that after running a test case, we get a bitmap with some bits set corresponding to parts of the program that were exercised: if bits are set that were not set by previous test cases, then we've found a test case that triggers a previously-unexercised path.
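From the fuzzer's side, "did something new" is just a bitwise comparison against the accumulated bitmap. A minimal sketch (representing both bitmaps as byte buffers is an assumption):

```ocaml
(* [bitmap] is one run's coverage bitmap; [seen] accumulates every
   bit set by any previous run. Returns true if this run set a bit
   that no earlier run did, and folds the new bits into [seen]. *)
let did_something_new ~(seen : Bytes.t) ~(bitmap : Bytes.t) : bool =
  let novel = ref false in
  for i = 0 to Bytes.length bitmap - 1 do
    let b = Char.code (Bytes.get bitmap i)
    and s = Char.code (Bytes.get seen i) in
    if b land lnot s <> 0 then novel := true;
    Bytes.set seen i (Char.chr (s lor b))
  done;
  !novel
```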
The `master` branch of Crowbar uses afl-fuzz directly to generate testcases: afl-fuzz passes bitstrings to be parsed by Crowbar and retrieves the instrumentation results. This was easy to implement and gave promising results, but is not efficient. The major issue is the poor mutation strategy (described above), since afl-fuzz knows nothing about the structure of test cases and must re-derive their boundaries on each mutation. It's surprisingly good at this: faster than I'd expect, but slower than it could be. Additionally, there are some overheads in the IPC between the separate afl-fuzz and Crowbar processes.
## Fuzzing algorithms

So, the fuzzing algorithm looks roughly like this:

```
* q := a queue with a single small test case
* N times {
*   let t = remove item from q
*   M times {
*     let t' = mutate t
      if t' fails {
        print t'
      }
*     if t' did something new {
*       add t' to q
*     }
*   }
* }
```
In the `master` branch, the lines marked `*` are done in the afl-fuzz process. In the `hype` branch, all of this is in OCaml, which lets us play with alternatives to the core fuzzing algorithm. This is still being hacked on, but so far is taking ~10x fewer tests to find interesting behaviours than the `master` branch. There are a few variants of the core algorithm, which live in `src/fuzz.ml`.
None of these are particularly optimised, so they sometimes lose to afl-fuzz in time-to-bug. So far, I'm working on optimising #execs-to-bug. I don't think it'll be too hard to optimise the runtime later, since there's quite a bit of low-hanging fruit here.
### quickcheck

This one is simple random testing, ignoring instrumentation. It's there for comparison purposes.
### fuzz

This is a straight implementation of the pseudocode above, with `M = 3` and `N` infinite (looping until it finds a bug or someone hits Ctrl-C). Despite the lack of clever processing of instrumentation, it performs quite well, easily beating `master` on xmldiff but not on charrua. Generally, it seems that the impact of being smarter than afl-fuzz in mutating testcases outweighs the impact of being stupider than afl-fuzz in processing instrumentation.
### mcmc_fuzz

This is a more speculative algorithm, based on Markov chain Monte Carlo methods. The high-level idea is that Crowbar is trying to compute the exit code of the test, averaged over all possible inputs. If the test always passes, then the average exit code is `0`, but if it sometimes fails then the average is bigger than `0`.
It's inspired by some comments of Böhme et al. on viewing fuzzing as a Markov process. Roughly, some code paths (e.g. parser error handling) are much more likely to be triggered by uniform random testcases than others. For good testing, we want a mostly-uniform distribution on code paths, which we can achieve by adjusting the distribution of test cases so that the ones that trigger common code paths are generated less frequently than those that trigger edge cases.
`mcmc_fuzz` uses (a slightly dubious version of...) the Metropolis-Hastings algorithm. The idea is that we have a function that computes how interesting a test case is. In the belief that bugs are more likely to occur on interesting testcases, we want to construct an infinite sequence of random testcases drawn from a distribution proportional to interest (where a testcase that's twice as interesting is twice as likely to be chosen).

The M-H algorithm constructs such a sequence (under certain conditions) and is surprisingly simple to implement. The state of the algorithm is a single testcase T; each step mutates it to produce a testcase T'. If T' is more interesting than T, then the algorithm unconditionally switches to T'. Otherwise, it switches to T' with probability interest(T')/interest(T).
The interest function that `mcmc_fuzz` uses is based on a cumulative count of the number of times each instrumentation bit has previously been triggered. The interest of a testcase T is `1/r`, where `r` is the number of times that the rarest instrumentation bit triggered by T has occurred.
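As a sketch, the accept/reject step and this interest function might look like the following (names and representations are assumptions, not the code in `src/fuzz.ml`):

```ocaml
(* [counts.(i)] is the cumulative number of times instrumentation bit
   i has fired; [bits t] lists the bits triggered by testcase t. *)
let interest ~counts ~bits t =
  (* 1/r, where r is the count of the rarest bit triggered by t *)
  let r = List.fold_left (fun acc i -> min acc counts.(i)) max_int (bits t) in
  1.0 /. float_of_int (max 1 r)

(* One Metropolis-Hastings step: switch unconditionally if t' is more
   interesting, otherwise with probability interest(t')/interest(t). *)
let mh_step ~counts ~bits ~mutate t =
  let t' = mutate t in
  let ratio = interest ~counts ~bits t' /. interest ~counts ~bits t in
  if ratio >= 1.0 || Random.float 1.0 < ratio then t' else t
```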
There are a few theoretical issues with this approach:
- the variant of Metropolis-Hastings implemented assumes that the mutation distribution is symmetric: the probability of mutating T to T' should be equal to the probability of mutating T' to T. This is true if T and T' have the same size, but not if one is bigger (it's easier to go from a smaller testcase to a bigger testcase than the reverse, because there are more options when mutating a bigger one). M-H can handle certain asymmetric mutation distributions, but there's a correction factor to apply which is missing at the moment.
- since the cumulative instrumentation counts are updated after each test, the interest function keeps changing. I think this shouldn't affect stability, because it has a negative feedback effect: if the algorithm spends too long exploring a certain part of the program, it will underestimate how interesting those parts are until it's explored other parts equally thoroughly. Still, I don't know how to prove this.
- the rate at which a bit occurs is not the best model for how interesting that bit is, and counting the occurrences is not the best estimator of this rate. A better approach would be to do Bayesian inference to update the interest rates for each bit, rather than keeping a cumulative count.
Still, the hacked version that's there performs surprisingly well (better than `quickcheck`, worse than `fuzz`), considering that it has no queue and keeps almost no state. The main problem is that it regularly forgets about really interesting test cases. I think this problem is fundamental, but it might be worth coming back to this one eventually, or integrating it into another approach.
### corpus_fuzz

This one uses an algorithm closer to that of `fuzz`, but based on the interest function of `mcmc_fuzz`. The idea is that perhaps it's not a great idea to enqueue only the first testcase that triggers a new instrumentation point, as afl-fuzz and `fuzz` do, since it's not necessarily the most interesting one. Instead, `corpus_fuzz` starts with a corpus of one test case; each step, it mutates every entry in the corpus some number of times, and collects all of the mutations as the next corpus.
The number of times each entry is mutated is proportional to the interest of that entry. Some entries get mutated 0 times, which is the same as removing them from the corpus, so the corpus does not grow infinitely. The effect is that if many testcases triggering the same new bit are found on the same cycle, then they all get run in the next cycle. The effort allocated to the new bit is spread over all of the testcases.
There is a slight complication in how the mutation counts are computed. The interest is a float, and naively rounding it to an integer might discard interesting test cases: if there are 100 test cases that hit some bit, and that bit is interesting enough to be fuzzed 40 times, then each test case should be fuzzed on average 0.4 times, which rounds to 0. Instead of normal rounding, the computation uses random rounding: a number ending in .4 is rounded up with 40% probability and down with 60% probability. This means that about 40% of test cases with mutation count 0.4 will be mutated once.
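The trick is simple enough to fit in a couple of lines (a sketch, not the actual code):

```ocaml
(* Randomized rounding: 0.4 rounds up with probability 0.4 and down
   with probability 0.6, so expected mutation counts are preserved. *)
let random_round x =
  let lo = floor x in
  int_of_float lo + (if Random.float 1.0 < x -. lo then 1 else 0)
```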
To avoid ever forgetting about an instrumentation bit, `corpus_fuzz` keeps a table of the shortest testcase found so far that hits any given bit. If the new corpus does not contain something which hits a previously-hit bit, then the shortest testcase from the previous corpus that did hit it is copied over verbatim.
This strategy usually beats `fuzz`, but sometimes gets into an annoying steady state, where it hasn't found new bits and the keep-shortest-testcase mechanism ends up copying the previous corpus in its entirety. It can spend a long time in this state, since it keeps fuzzing the same testcases over and over again.
### qcorpus_fuzz

This one is the same algorithm as `corpus_fuzz`, but implemented using a continuous queue rather than batch-processing the entire corpus. It's similar to `corpus_fuzz` in #execs-to-bug, but generally better in time-to-bug, since some of the batch processing was quite expensive.
I'm also playing with an alternative means of avoiding forgetting instrumentation bits. So far it seems to get stuck in a steady state less often, but much more testing is needed.
### pqcorpus_fuzz

When testing charrua with `qcorpus_fuzz`, I noticed something quite stupid happening: when the algorithm finds a new interesting testcase, it gets added to the queue. But despite the fact that the algorithm considers the new testcase more interesting than anything else in the queue, it has to wait for the entire queue to be processed before getting another go.
So, `pqcorpus_fuzz` is the same algorithm as `qcorpus_fuzz`, but the queue is a priority queue (using psq), so that it always fuzzes the most interesting test case first. Since interest is based on occurrence counts, fuzzing an interesting testcase makes that testcase less interesting, so the algorithm does eventually move down through the queue. However, it exploits new finds much faster than the queue-based approaches. More testing!
## Simulated annealing

The amount that testcases are mutated by, and the total amount of time spent fuzzing the corpus, are currently more-or-less constant throughout fuzzing for each strategy. It would probably be better to do something like simulated annealing, so that we spend more time making smaller changes after we've fuzzed for a long time.
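An entirely speculative sketch of what a cooling schedule could look like (nothing like this exists in the code yet):

```ocaml
(* The mutation "temperature" decays as total executions grow, so
   later mutations touch a smaller fraction of the testcase. *)
let temperature ~execs = 1.0 /. (1.0 +. log (1.0 +. float_of_int execs))

let mutation_bytes ~execs ~size =
  max 1 (int_of_float (temperature ~execs *. float_of_int size))
```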
## Distribution smoothing

The time-to-bug in QuickCheck is exponentially distributed: if you've run QuickCheck for 1 second and it hasn't found a bug, then the expected time for it to find a bug is the same as if it had never run (the exponential distribution is memoryless). Having passed previous tests makes it neither more nor less likely to find a bug soon.
Ideally, smarter fuzzing approaches should do better than exponential: if the fuzzer hasn't found the bug in the first second, then that means it's already ruled out some uninteresting behaviour, and is more likely to find the bug in the next 10 seconds than it was originally, because it has a smaller space to search. However, they sometimes seem worse than exponential, by ruling out parts of the search space that do actually contain a bug (see e.g. the insane variance of the uunf test in the original Crowbar paper).
If we can't make fuzzing always better than exponential, we can cheat: after a certain amount of fuzzing, we can start a new fuzzer with blank state, and run it independently, occasionally transferring interesting testcases from the new to the old fuzzer. For an ideal fuzzer, this will only make things worse, but for a non-ideal fuzzer the downside is bounded and the upside is not.
## Tree splicing

The `hype` branch currently supports a cool data structure for test cases, tagging each component by the generator that built it. The idea is that we can splice together interesting test cases in a type-preserving way, mixing up testcases but matching data to generators properly. This isn't really exploited by the current code, though.
- *AFL-fuzz*, Zalewski/lcamtuf, 2013-
- *Coverage-based Greybox Fuzzing as Markov Chain*, Böhme, Pham, Roychoudhury, 2016
## Shrinking

When a bug is found, the first-discovered failing testcase is not necessarily the shortest or simplest, so most QuickCheck-like systems provide some means to shrink failing testcases.
- once again, do what Hypothesis does
## Fuzzing changed code

So far, everything assumes we're fuzzing a new piece of code, but generally you're fuzzing code which has been previously fuzzed but has changed slightly since then.
- reuse examples
- target fuzzing towards recently-changed bits of code?
- *Directed Greybox Fuzzing*, Böhme, Pham, Nguyen and Roychoudhury, 2017