
Getting better mileage out of Z3


In large F* developments, such as mitls and HACL*, Z3 (and/or the way F* uses it) often ends up being a bottleneck in terms of proof stability and predictability. Though F* uses mechanisms such as hints to alleviate the problem, we have very little visibility into the root cause of such behaviors. This page lists some ideas around this problem (originally stemming from a discussion between Nik and Aseem while working on the AEAD proofs):

Hints

F* employs a hints mechanism to prune the Z3 context. For an F* query, when Z3 is able to find a proof, F* extracts what is called an unsat core from the proof. The unsat core is a list of lemmas that Z3 applied in order to prove the particular query. F* records this unsat core in a hints file (named filename.hints). The next time F* needs to prove the same query, it prunes the Z3 context based on this recorded unsat core: essentially, it removes from the context all the lemmas not present in the core, and sends the query to Z3 in this pruned context. Most of the time, Z3 is able to find the proof quickly, but sometimes the hints don't replay, and in that case F* falls back to the unpruned context and tries again.
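
For concreteness, this is driven by two command-line options: --record_hints records unsat cores into the hints file, and --use_hints replays them on subsequent runs (a minimal sketch; MyModule.fst is a placeholder, and fstar.exe --help has the authoritative option list):

fstar.exe --record_hints MyModule.fst    # first run: writes/updates MyModule.fst.hints
fstar.exe --use_hints MyModule.fst       # later runs: prune the context using the recorded cores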

Proposal - More refined fallback when hints fail

When hints don't replay, F* falls back to the complete, unpruned context. Instead, it could employ heuristics to try contexts that are bigger than the failed one but still smaller than the full context. For example, if a lemma from some module is mentioned in the unsat core, F* could add all of that module's lemmas to the context and retry.
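
As a rough illustration, the widening ladder could look like the following Python-style pseudocode (prove_with_context and the dotted lemma-name representation are hypothetical stand-ins, not F* internals):

def prove_with_widening(query, unsat_core, full_context, prove_with_context):
    # Step 1: replay the recorded unsat core alone (what hint replay does today).
    if prove_with_context(query, unsat_core):
        return True
    # Step 2: widen to every lemma from any module that contributed to the core.
    core_modules = {name.rsplit('.', 1)[0] for name in unsat_core}
    widened = [l for l in full_context if l.rsplit('.', 1)[0] in core_modules]
    if prove_with_context(query, widened):
        return True
    # Step 3: last resort, the complete unpruned context (today's fallback).
    return prove_with_context(query, full_context)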

Proposal - Slicing the context

F* could slice the context based on the query alone. For example, it could compute the transitive closure of the modules reachable from the module containing the query, and remove from the context all the modules that are not in this closure. It is unclear how much this would save; in the worst case we could end up with all the modules anyway. Other tools, such as Boogie, seem to employ this kind of slicing before sending the query to Z3.
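
The slicing itself is just reachability over the module dependence graph. A minimal sketch, assuming a deps map from module name to imported modules (the input format is hypothetical):

def module_slice(query_module, deps):
    """Modules reachable from query_module; facts coming from modules
    outside this set would be dropped from the Z3 context."""
    keep, todo = set(), [query_module]
    while todo:
        m = todo.pop()
        if m not in keep:
            keep.add(m)
            todo.extend(deps.get(m, ()))
    return keep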

Investigating failures across F* builds

Sometimes we have observed that seemingly harmless changes in F* (such as a config file change) end up making some hints not replayable, or make them require bigger timeouts. So far we don't understand why this happens. It could help to compare the hints files from the two runs to see if something changed there, or even the Z3 traces (?) to see if some quantifier instantiation patterns changed.
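
Since the hints file is plain JSON, even a crude textual diff of the hints recorded by the two builds is a cheap first step (a sketch that assumes nothing about the hints schema beyond it being JSON):

import difflib, json

def diff_hints(old_path, new_path):
    """Show which recorded entries (unsat cores, fuel settings, ...) changed
    between the .hints files produced by two different F* builds."""
    def pretty(path):
        with open(path) as f:
            return json.dumps(json.load(f), indent=2, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(pretty(old_path), pretty(new_path),
                                        fromfile=old_path, tofile=new_path))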

Notion of healthy modules

Slack discussion started by Nik:


A major source of pain in larger F* developments (HACL*, TLS, etc.) is that the context (aka environment) in which we try to prove a program can be enormous. We have very little transparency into what's in the context when presenting a proof to Z3 (it can often be a 10MB background theory) and unexpected lemmas in the environment can make otherwise simple proofs very time consuming, brittle, etc.

The problem of the unpredictable context is most troublesome when it impacts the use of infrastructural lemmas. Here are two examples:

  1. We rely heavily on being able to prove properties about sequences. Here's an example of a simple property that's easily proven in FStar.Seq.Properties. But, try proving this same lemma in Crypto.AEAD.Invariant and it takes WAY longer, or it may even fail to prove.
let lemma (#a:Type) (s_0:seq a) (s_1:seq a{length s_0 <= length s_1})
  :Lemma (requires (Seq.equal s_0 (slice s_1 0 (length s_0))))
         (ensures  (Seq.equal s_1 (Seq.append s_0 (Seq.slice s_1 (length s_0) (length s_1)))))
  = ()
  2. Proving properties about the footprint of a computation is really the bread and butter of our verification method. Here's a simple lemma about modifies clauses. Proving it in a standalone file on top of HyperStack, or even in Crypto.AEAD.BufferUtils (a file with few dependences), is no problem. But try proving it in the middle of Crypto.AEAD.MAC_Wrapper.Invariant and you'll fail, or at least have to wait a LONG time.
let weaken_modifies (s0:Set.set rid) (s1:Set.set rid)
                    (h0:HS.mem) (h1:HS.mem)
   : Lemma (requires (HS.modifies s0 h0 h1 /\
                      Set.subset s0 s1))
           (ensures (HS.modifies s1 h0 h1))
   = ()

I propose that, for (at least) each of our "infrastructural libraries" (Seq, Heap, HyperHeap, HyperStack, Buffer, List, Map, Set, UIntXX), we come up with a handful of simple lemmas of the flavor of the examples above.

Then, for each module in HACL*, TLS, etc., we measure the cost of proving these infrastructural lemmas with that module (and, necessarily, its transitive dependences) in the context. We could even do this at a finer grain, e.g., after each top-level definition we measure the cost of proving all the infrastructural lemmas.

I bet we'll find that some of our modules provide top-level definitions that end up "polluting" the prover context with quantifiers that confuse the solver.

A "healthy" module is one that does not adversely impact infrastructural proofs.
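
One way to prototype the measurement described above is to keep the canary lemmas as a snippet of F* source, splice it onto the end of each module in a scratch working copy, and time fstar.exe. A rough sketch (the splice-and-time approach and all names below are hypothetical, not existing tooling):

import subprocess, time

def measure_canaries(module_path, canary_src, fstar="fstar.exe", extra_args=()):
    """Append the infrastructural canary lemmas to the end of module_path
    (in a scratch working copy), verify the module, and report whether the
    canaries still prove and how long verification took in that context."""
    original = open(module_path).read()
    try:
        with open(module_path, "w") as f:
            f.write(original + "\n\n" + canary_src + "\n")
        start = time.time()
        ok = subprocess.run([fstar, *extra_args, module_path]).returncode == 0
        return ok, time.time() - start
    finally:
        with open(module_path, "w") as f:
            f.write(original)

Comparing against the module's baseline verification time (without the canaries appended) isolates the cost attributable to the canary proofs themselves.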


Tactics and manual proofs

Ultimately, none of the solutions above is foolproof. Perhaps a more principled solution is to use tactics to split the Z3 query into multiple goals and, for the goals that fail, prompt the user to provide a manual proof.
