
Design notes


Here are some ideas about how I (@stedolan) want Crowbar to end up, formed mostly from discussions with co-conspirator @yomimono and Hypothesis author @DRMacIver, reading Hypothesis docs and various papers (citations below), and some spelunking in afl-fuzz.c. I don't think any of these ideas are new, but I don't think they've been simultaneously stolen before.

This isn't about changing the API of Crowbar. All of the stuff below works with an unchanged crowbar.mli and more-or-less the same execution model (although I think it's worthwhile to redo afl-fuzz functionality directly in OCaml, so the exact command for running tests will change. More on this below). In particular, ppx_deriving_crowbar should not need any changes.

This document consists of a list of testing algorithms, starting with naive QuickCheck and ending with what I'd like Crowbar to do. The master branch of Crowbar already does a few of these. The hype branch already does a few more, and Hypothesis already does an overlapping set of them. In back-of-the-envelope experiments, going further down this list has found bugs faster, although I need to do more thorough testing.

The starting point is completely naive QuickCheck:

N times {
  let t = generate a random testcase
  if t fails {
    print t
  }
}

The "generate a random testcase" is done according to a typed generator that the user provides (see crowbar.mli for how these work), and the test is some user-provided function. A "test failure" is a particular testcase that makes the function raise an exception. It's hard to tell how big N should be.

Generator termination

The first problem (pointed out in the original QuickCheck paper) is that, for some reasonable-looking generators, the above algorithm will only manage to do 1 or 2 tests. The standard problematic example is this generator for binary trees (type t = L | B of t * t):

with probability 1/3, choose L
with probability 2/3, generate two trees x,y, and choose (B (x, y))

Unfortunately, this generator has only a 50% chance of actually terminating (the termination probability p satisfies p = 1/3 + (2/3)p², whose smallest root is 1/2). The standard solution (which the master branch of Crowbar currently uses, although it's had some bugs) is to bound the depth of the recursion.

A better solution, that Hypothesis uses and the hype branch of Crowbar now uses, is to arrange generators so that when the source of random numbers starts returning mostly 0, the generator produces a very small test case. In Hypothesis this is done manually, but the hype branch of Crowbar now does this automatically, by computing an approximation of the smallest value a generator can make, and using it to sort the arguments to choose (search for small_example_size in src/gen.ml).
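The following sketch shows the shape of that trick (illustration only: the types and function bodies here are invented, and only the name small_example_size matches src/gen.ml). Each generator carries an approximation of its smallest possible output, and choose sorts its alternatives by that approximation, so a random source that keeps returning 0 always picks the cheapest alternative:

(* Illustration of the "small when the random source says 0" idea. *)
type 'a gen = {
  small_example_size : int;        (* approx. size of the smallest value *)
  run : (unit -> int) -> 'a;       (* draw bytes from the random source *)
}

let choose (alts : 'a gen list) : 'a gen =
  (* sort so that index 0 is the alternative with the smallest minimal value *)
  let alts =
    List.sort (fun a b -> compare a.small_example_size b.small_example_size) alts
  in
  { small_example_size = 1 + (List.hd alts).small_example_size;
    run = (fun rand ->
      let i = rand () mod List.length alts in
      (List.nth alts i).run rand) }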

Either way, the algorithm is:

N times {
  let t = generate a random smallish testcase
  if t fails {
    print t
  }
}

QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs, Claessen and Hughes, 2000

Hypothesis documentation, David R. MacIver, 2013-2018

Sampling versus random walking

The provided generator determines a particular distribution of possible test cases, and the QuickCheck algorithm takes N independent samples from this distribution. N will vary a lot: we might want to run a few tests on every commit, but maybe leave the tests running over the weekend for less frequent but more thorough testing. Here lies the second problem with the QuickCheck algorithm, which I think is less widely appreciated than the first: the correct distribution depends on N.

Let's divide the possible test cases into "small" and "big" tests, and suppose that there are only K possible small tests. There might be infinitely many big tests. For instance, in the binary tree example above, we could take a "small" test to be a binary tree with fewer than 4 leaves.

Suppose that sampling the generator produces a small test with probability p. We don't want to ignore the small tests, so p > 0. The most likely small test has probability at least p/K, so if N > K/p we expect that it will be tested at least once. However, if N is much greater than K/p we expect that the small tests will be tested many, many times. In fact, no matter how large N gets, a proportion p of the testing budget is wasted re-testing the K smallest tests that have already been exhaustively explored.

So, the optimal value of p depends on how large N is (among other things). However, we don't necessarily know N in advance (a standard value of N is "until the user gets bored and hits ctrl-C"), and it's not obvious how best to define "small" and calculate p.

An extreme solution to this problem is to check the small cases first: spend the first K tests exhaustively checking all of the small cases, then only generate big cases afterward (see SmallCheck and the very cool Lazy Smallcheck). I don't think this is the right approach, though: sometimes there are bugs that only show up on large examples, but show up on a decent proportion of large examples. Waiting until you've exhausted the small ones can be exponentially slower than trying some large ones early.

I think that a better solution is to use a random walk instead of independent sampling. Instead of generating each test case in the same way, generate later test cases by mutating earlier ones. The probability of mutating a small test case and making it large is much greater than the probability of mutating a large test case and making it small (because there are so many more large test cases than small ones), so if left to run for long enough this process will stop generating small test cases entirely, as desired. To make the distribution less fragile, there should be a collection of random walkers, rather than just a single one, giving something like:

q := a queue with a single small test case
N times {
  let t = remove item from q
  M times {
    let t' = mutate t
    add t' to q
    if t' fails {
      print t'
    }
  }
}
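The same loop in OCaml, for concreteness (a sketch: generate_small, mutate, fails and print_testcase are stand-ins for the real generator, mutator and test runner):

(* Random-walk testing with a pool of walkers kept in a queue. *)
let random_walk ~n ~m ~generate_small ~mutate ~fails ~print_testcase =
  let q = Queue.create () in
  Queue.add (generate_small ()) q;
  for _ = 1 to n do
    let t = Queue.pop q in
    for _ = 1 to m do
      let t' = mutate t in
      Queue.add t' q;
      if fails t' then print_testcase t'
    done
  done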

I think there are better explanations of this idea, but they involve statistical mechanics and expander graphs and other things I don't understand. I'll update this once I've done some reading.

SmallCheck and Lazy SmallCheck: automatic exhaustive testing for small values, Runciman, Naylor and Lindblad, 2008

Why Is Random Testing Effective for Partition Tolerance Bugs?, Majumdar and Niksic, 2018

Good mutation strategies

There are lots of ways to implement the mutate t step above. Short version: what Crowbar used to do is bad, what Hypothesis does is much better, and I think doing what Hypothesis does in an immutable style with sharing is even better.

Both Crowbar and Hypothesis represent test cases as strings with a compact binary format. For instance, this generator:

type t = L of Int32.t | B of t * t
let gen = fix (fun g -> choose [
  map [int32] (fun n -> L n);
  map [g; g] (fun x y -> B (x, y))])

can represent this tree:

    .
   / \
  .   0
 / \
1   5

with this bitstring:

01  01  00  00  00  00  01  00  00  00  00  05  00  00  00  00  00

To parse the test case, choose generators use the first byte to select the case, while map generators parse each argument in sequence. In the gen generator, 00 selects leaves, and 01 selects branches, so the testcase above parses as:

01  01  00  00  00  00  01  00  00  00  00  05  00  00  00  00  00
 B  [B  [L  -- int32: 1 --] [L  -- int32: 5 --][ L  -- int32: 0 --]]

So, these bitstrings have a fairly simple internal structure, with implicit boundaries between the blocks of bytes meant for one generator and the bytes meant for another.
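To make those boundaries concrete, here is a hand-written parser for this particular generator's format (a sketch: it assumes one tag byte per choose and big-endian int32s, as in the worked example above; the real Gen module's encoding may differ in detail):

type t = L of Int32.t | B of t * t

(* Read a big-endian int32 starting at [pos]. *)
let get_int32_be s pos =
  let b i = Int32.of_int (Char.code s.[pos + i]) in
  Int32.(logor (shift_left (b 0) 24)
           (logor (shift_left (b 1) 16)
              (logor (shift_left (b 2) 8) (b 3))))

(* Parse one tree starting at [pos]; return the tree and the next offset.
   The first byte of each node is the choose tag: 00 = leaf, otherwise branch. *)
let rec parse s pos =
  match s.[pos] with
  | '\x00' -> (L (get_int32_be s (pos + 1)), pos + 5)
  | _ ->
    let x, pos = parse s (pos + 1) in
    let y, pos = parse s pos in
    (B (x, y), pos)

Applied to the example, parse "\x01\x01\x00\x00\x00\x00\x01\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00" 0 yields B (B (L 1l, L 5l), L 0l) and the offset 17.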

Crowbar (on master) mutates testcases by adding, removing or flipping bits in the input bitstring. (Actually, this step is outsourced to afl-fuzz, but that's what it does). This is done without any knowledge of the implicit tree structure of the test case, so most changes to the bitstring destroy the testcase (e.g. changing B to L without deleting the rest of the branch). Often, what looks like a small change will affect the entire rest of the test case, and might well make it unparseable.

Hypothesis, on the other hand, generates testcases with knowledge of block boundaries. In particular, when a part of a testcase is being mutated, the bytes corresponding to the rest of the testcase do not get out of sync.

Hypothesis does a much better job of mutating test cases than Crowbar, but since its generators have an imperative-style API and are allowed to have side-effects, it must rerun the generator for each mutation of the test case. Crowbar's API is made of functional combinators, and demanding that generators be pure is something I think we can get away with in OCaml. So, if we use a data structure that explicitly stores all the sub-parts of generated testcases, then we can efficiently generate new testcases without having to re-build the entire structure.

This data structure exists on the hype branch of Crowbar as 'a Gen.sample. The Gen.mutate algorithm picks a byte in the testcase uniformly at random, walks down the testcase tree until it finds the node starting at or containing that byte, and then replaces that node. This could involve replacing the entire testcase (if it happens to pick the first byte), but I argue that it's fast on average, at least for reasonably balanced testcases:

If the testcase is a balanced-ish tree of size n, with about c^k nodes (for some constant branching factor c) at depth k up to a maximum depth d = O(log n), then a node at depth k has about c^(d-k) nodes underneath it, so its expected size is:

  1/n Σ_(k ≤ d) c^k * c^(d-k)
= 1/n Σ_(k ≤ d) c^d
= 1/n c^d d
= O(log n)

In other words, a random node of a random tree has about O(log n) nodes underneath it. So, it should take about O(log n) time to mutate a test case, and more importantly about O(log n) extra space to store the mutated test case on the queue, instead of the O(n) it would take to store each testcase in its entirety.
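An illustration of the shape of this structure, specialised to the binary-tree generator above (the real, generic version is 'a Gen.sample in src/gen.ml, which also stores the generated values):

(* Each subtree caches its serialized size, so mutation can navigate by byte
   offset; replacing one subtree rebuilds only the spine above it and shares
   everything else with the old testcase. *)
type sample =
  | SL of Int32.t                  (* leaf: 1 tag byte + 4 int32 bytes *)
  | SB of int * sample * sample    (* branch: cached total size, children *)

let size = function SL _ -> 5 | SB (n, _, _) -> n
let branch x y = SB (1 + size x + size y, x, y)

(* Replace the node starting at or containing byte offset [off], building the
   replacement with [regen]. Only O(depth) new nodes are allocated. *)
let rec replace off regen t =
  if off = 0 then regen ()
  else match t with
    | SL _ -> regen ()                 (* offset falls inside this leaf *)
    | SB (_, x, y) ->
      let off = off - 1 in             (* skip this node's tag byte *)
      if off < size x
      then branch (replace off regen x) y
      else branch x (replace (off - size x) regen y)

A mutation step is then something like replace (Random.int (size t)) regrow t, where regrow reruns the generator for the chosen subtree.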

Instrumentation and coverage information

The main difference between Crowbar and other QuickCheck-like systems is that Crowbar uses coverage instrumentation to guide the search for failing testcases. This extra information tells Crowbar where the edge cases in the code under test are, guiding testing toward tricky and possibly-buggy parts of the code.

The instrumentation is based on afl-fuzz's instrumentation for C. Roughly, this generates a random number at compile time for each conditional branch target (e.g. each function, each case of a match, each side of an if, etc.). Every time one of these targets is hit, the instrumentation uses a hash of the current and previous target to update a slot in a hash table (see lcamtuf's docs for more details). From Crowbar's point of view, the useful part is that after running a test case, we get a bitmap with some bits set corresponding to parts of the program that were exercised: if bits are set that were not set by previous test cases, then we've found a test case that triggers a previously-unexercised path.
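In OCaml-flavoured pseudocode, the per-branch-target update and the "did something new" check look roughly like this (after lcamtuf's description; the real update is emitted by the compiler's instrumentation, and the constants are illustrative):

let map_size = 1 lsl 16                    (* afl-style 64 kB bitmap *)
let bitmap = Bytes.make map_size '\000'
let prev_location = ref 0

(* Executed at every branch target; [cur_location] is a random constant chosen
   for that target at compile time. *)
let hit cur_location =
  let slot = (cur_location lxor !prev_location) land (map_size - 1) in
  Bytes.set bitmap slot
    (Char.chr ((Char.code (Bytes.get bitmap slot) + 1) land 0xff));
  prev_location := cur_location lsr 1

(* After running a testcase: did it set any slot that no earlier testcase set? *)
let did_something_new cumulative =
  let fresh = ref false in
  Bytes.iteri (fun i c ->
      if c <> '\000' && Bytes.get cumulative i = '\000' then fresh := true)
    bitmap;
  !fresh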

The master branch of Crowbar uses afl-fuzz directly to generate testcases. afl-fuzz passes bitstrings to be parsed by Crowbar and retrieves instrumentation results. This was easy to implement and gave promising results, but is not efficient. The major issue is the poor mutation strategy (described above), since afl-fuzz knows nothing about the structure of test cases and must re-derive their boundaries on each mutation. It's surprisingly good at this: faster than I'd expect, but slower than it could be. Additionally, there are some overheads in the IPC between the separate afl-fuzz and Crowbar processes.

So, the fuzzing algorithm looks roughly like this:

*  q := a queue with a single small test case
*  N times {
*    let t = remove item from q
*    M times {
*      let t' = mutate t
       if t' fails {
         print t'
       }
*      if t' did something new {
*        add t' to q
*      }
*    }
*  }

In the master branch, the lines marked * are done in the afl-fuzz process. In the hype branch, all of this is in OCaml, which lets us play with alternatives to the core fuzzing algorithm. This is still being hacked on, but so far is taking ~10x fewer tests to find interesting behaviours than the master branch. There are a few variants of the core algorithm, which live in src/fuzz.ml.

None of these are particularly optimised, so they sometimes lose to afl-fuzz in time-to-bug. So far, I'm working on optimising #execs-to-bug. I don't think it'll be too hard to optimise the runtime later, since there's quite a bit of low-hanging fruit here.

quickcheck

This one is simple random testing, ignoring instrumentation. It's there for comparison purposes.

fuzz

This is a straight implementation of the pseudocode above, with M=3 and N infinite (looping until it finds a bug or someone hits Ctrl-C). Despite the lack of clever processing of instrumentation, it performs quite well, easily beating master on xmldiff but not on charrua. Generally, it seems that the impact of being smarter than afl-fuzz in mutating testcases outweighs the impact of being stupider than afl-fuzz in processing instrumentation.

mcmc_fuzz

This is a more speculative algorithm, based on Markov chain Monte Carlo methods. The high-level idea is that Crowbar is trying to compute the exit code of the test, averaged over all possible inputs. If the test always passes, then the average exit code is 0, but if it sometimes fails then the average is bigger than 0.

It's inspired by some comments of Böhme et al. on viewing fuzzing as a Markov process. Roughly, some code paths (e.g. parser error handling) are much more likely to be triggered by uniform random testcases than others. For good testing, we want a mostly-uniform distribution on code paths, which we can achieve by adjusting the distribution of test cases so that the ones that trigger common code paths are generated less frequently than those that trigger edge cases.

mcmc_fuzz uses (a slightly dubious version of...) the Metropolis-Hastings algorithm. The idea is that we have a function that computes how interesting a test case is. In the belief that bugs are more likely to occur on interesting testcases, we want to construct an infinite sequence of random testcases drawn from a distribution proportional to interest (where a testcase that's twice as interesting is twice as likely to be chosen).

The M-H algorithm constructs such a sequence (under certain conditions) and is surprisingly simple to implement. The state of the algorithm is a testcase T; each step mutates it to produce a candidate T'. If T' is more interesting than T, then the algorithm unconditionally switches to T'. Otherwise, it switches to T' with probability interest(T')/interest(T).

The interest function that mcmc_fuzz uses is based on a cumulative count of the number of times each instrumentation bit has previously been triggered. The interest of a testcase T is 1/r, where r is the number of times that the rarest instrumentation bit triggered by T has occurred.
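Put together, one step looks roughly like this (a sketch: mutate, run and the per-bit occurrence counts stand in for the real ones in src/fuzz.ml):

(* interest t = 1/r, where r is the occurrence count of the rarest
   instrumentation bit hit by t. *)
let interest counts bits_hit =
  let rarest = List.fold_left (fun acc b -> min acc counts.(b)) max_int bits_hit in
  1. /. float_of_int (max 1 rarest)

(* One Metropolis-Hastings step: accept a more interesting mutation outright,
   otherwise accept with probability interest(T')/interest(T). *)
let mh_step ~mutate ~run ~counts t =
  let t' = mutate t in
  let i  = interest counts (run t)
  and i' = interest counts (run t') in
  if i' >= i || Random.float 1.0 < i' /. i then t' else t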

There are a couple of theoretical issues with this approach:

  • the variant of Metropolis-Hastings implemented assumes that the mutation distribution is symmetric: the probability of mutating T to T' should be equal to the probability of mutating T' to T. This is true if T and T' have the same size, but not if one is bigger (it's easier to go from a smaller testcase to a bigger testcase than the reverse, because there are more options to mutate a bigger one). MH can handle certain asymmetric mutation distributions, but there's a correction factor to apply which is missing at the moment.

  • since the cumulative instrumentation counts are updated after each test, the interest function keeps changing. I think this shouldn't affect stability, because it has a negative feedback effect: if it spends too long exploring a certain part of the program, it will underestimate how interesting those parts are until it's explored other parts equally thoroughly. Still, I don't know how to prove this.

  • the rate at which a bit occurs is not the best model for how interesting that bit is, and counting the occurrences is not the best estimator of this rate. A better approach would be to do Bayesian inference to update the estimated rate for each bit, rather than keeping a cumulative count.

Still, the hacked version that's there performs surprisingly well (better than quickcheck, worse than fuzz), considering that it has no queue and keeps almost no state. The main problem is that it regularly forgets about really interesting test cases. I think this problem is fundamental, but it might be worth coming back to this one eventually, or integrating it into another approach.

corpus_fuzz

This one uses an algorithm closer to that of fuzz, but is based on the interest function of mcmc_fuzz. The idea is that perhaps it's not a great idea to enqueue only the first testcase that triggers a new instrumentation point, as afl-fuzz and fuzz do, since it's not necessarily the most interesting. Instead, corpus_fuzz starts with a corpus of one test case, and at each step it mutates every entry in the corpus some number of times, and collects all of the mutations as the next corpus.

The number of times each entry is mutated is proportional to the interest of that entry. Some entries get mutated 0 times, which is the same as removing them from the corpus, so the corpus does not grow infinitely. The effect of this is that if many testcases triggering the same new bit are found in the same cycle, then they all get run in the next cycle, not just the first. The effort allocated to the new bit is spread over all of the testcases.

There is a slight complication in how the mutation counts are computed. The interest is a float, and naively rounding it to an integer might discard interesting test cases: if there are 100 test cases that hit some bit, and that bit is interesting enough to be fuzzed 40 times, then each test case should be fuzzed on average 0.4 times, which rounds to 0. Instead of normal rounding, the computation uses random rounding: a number ending in .4 is rounded up with 40% probability and down with 60% probability. This means that about 40% of test cases with mutation count 0.4 will be mutated once.
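The rounding itself is tiny (a sketch):

(* Round up with probability equal to the fractional part, down otherwise, so
   the expected integer number of mutations equals the fractional target. *)
let randomized_round x =
  let low = floor x in
  int_of_float low + (if Random.float 1.0 < x -. low then 1 else 0)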

To avoid ever forgetting about an instrumentation bit, corpus_fuzz keeps a table of the shortest testcase found so far that hits any given bit. If the new corpus does not contain anything that hits a previously-hit bit, then the shortest known testcase that hits it is copied over verbatim from the previous corpus.

This strategy usually beats fuzz, but sometimes gets into an annoying steady state, where it hasn't found new bits and the keep-shortest-testcase mechanism ends up copying the previous corpus in its entirety. It can spend a long time in this state, since it keeps fuzzing the same testcases over and over again.

qcorpus_fuzz

This one is the same algorithm as corpus_fuzz, but implemented using a continuous queue rather than batch-processing the entire corpus. It's similar in #execs-to-bug to corpus_fuzz, but generally better in time-to-bug since some of the batch-processing was quite expensive.

I'm playing with an alternative means of avoiding forgetting instrumentation bits. So far it seems to get stuck in a steady state less, but much more testing is needed.

pqcorpus_fuzz

When testing charrua with qcorpus_fuzz, I noticed something quite stupid happening: when the algorithm finds a new interesting testcase, it gets added to the queue. Despite the fact that the algorithm considers the new testcase more interesting than anything else in the queue, it has to wait for the entire queue to be processed before getting another go.

So, pqcorpus_fuzz is the same algorithm as qcorpus_fuzz, but the queue is a priority queue (using psq), so that it always fuzzes the most interesting test case first. Since interest is based on occurrence count, fuzzing an interesting testcase makes that testcase less interesting, so this does eventually move down through the queue. However, it exploits new finds much faster than the queue-based approaches. More testing!
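A sketch of the interest-ordered scheduling (using a stdlib map keyed by interest here rather than psq's actual interface): always take the currently most interesting testcase, fuzz it, recompute its interest (which drops as the bits it hits become more common), and push it back.

module FMap = Map.Make (Float)

(* Queue entries are grouped into buckets of equal interest. *)
let push interest t q =
  FMap.update interest
    (function None -> Some [ t ] | Some ts -> Some (t :: ts)) q

let pop_most_interesting q =
  match FMap.max_binding_opt q with
  | None -> None
  | Some (_, []) -> None                 (* empty buckets are never stored *)
  | Some (i, t :: rest) ->
    let q = if rest = [] then FMap.remove i q else FMap.add i rest q in
    Some (t, q)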

more ideas

Simulated annealing

The amount that testcases are mutated by, and the amount of time spent fuzzing the corpus in total, are currently more-or-less constant throughout fuzzing for each strategy. It would probably be better to do something like simulated annealing, so that we spend more time making smaller changes after we've fuzzed for a long time.

Distribution smoothing

The time-to-bug in Quickcheck is exponentially distributed: If you've run QuickCheck for 1 second, and it hasn't found a bug, then the expected time for it to find a bug is the same as if it had never run. Having passed previous tests makes it neither more nor less likely to find a bug soon.

Ideally, smarter fuzzing approaches should do better than exponential: if the fuzzer hasn't found the bug in the first second, then that means it's already ruled out some uninteresting behaviour, and is more likely to find the bug in the next 10 seconds than it was originally, because it has a smaller space to search. However, they sometimes seem worse than exponential, by ruling out parts of the search space that do actually contain a bug (see e.g. the insane variance of the uunf test in the original Crowbar paper).

If we can't make fuzzing always better than exponential, we can cheat: after a certain amount of fuzzing, we can start a new fuzzer with blank state, and run it independently, occasionally transferring interesting testcases from the new to the old fuzzer. For an ideal fuzzer, this will only make things worse, but for a non-ideal fuzzer the downside is bounded and the upside is not.

Tree splicing

The hype branch currently supports a cool data structure for test cases, tagging each component by the generator that built it. The idea is that we can splice together interesting test cases in a type-preserving way, mixing up testcases but matching data to generators properly. This isn't really exploited by the current code, though.

AFL-fuzz, Zalewski/lcamtuf, 2013-

Coverage-based Greybox Fuzzing as Markov Chain, Böhme, Pham, Roychoudhury, 2016

Metropolis-Hastings algorithm

[below here, still being written]

Shrinking testcases

When a bug is found, the first-discovered failing testcase is not necessarily the shortest or simplest, so most QuickCheck-like systems provide some means to shrink failing testcases.

  • once again, do what Hypothesis does

CI and code changes

So far, everything above assumes we're fuzzing a new piece of code, but generally you're fuzzing code which has been fuzzed before and changed slightly since then.

  • reuse examples

  • target fuzzing towards recently-changed bits of code?

Directed Greybox Fuzzing, Böhme, Pham, Nguyen and Roychoudhury, 2017
