We Built a Competitive SMT Solver with Zero Human-Written Code

Can an LLM write software that itself reasons? We tried. It mostly worked.


The debate over whether LLMs can truly “reason” fills countless blog posts and papers. But there’s a harder, less-explored question lurking underneath: can an LLM build software that reasons?

Not a chatbot. Not a code autocomplete. A full-blown automated reasoning engine—one that takes logical formulas as input, decides whether they are satisfiable, and, when they are not, produces a formal proof of why.

That’s what we built. An SMT solver. With zero lines of human-written code.

What’s an SMT Solver?

SMT (Satisfiability Modulo Theories) solvers are the workhorses of formal verification. Tools like z3 and cvc5 underpin software verification, hardware design, and security analysis. They decide whether logical formulas—possibly involving arithmetic, arrays, or uninterpreted functions—have a satisfying assignment.
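To make the decision problem concrete, here is a toy illustration of ours (not from the paper): deciding the QF_UF formula f(f(x)) = x ∧ f(x) ≠ x by brute force over a small finite domain. Real solvers use congruence closure rather than enumeration, but the question being answered is the same.

```python
from itertools import product

# Toy sketch (ours): decide f(f(x)) = x AND f(x) != x by enumerating
# all interpretations of the unary function f over a finite domain.
def satisfiable(domain):
    for x in domain:
        for fvals in product(domain, repeat=len(domain)):
            f = dict(zip(domain, fvals))          # one candidate interpretation
            if f[f[x]] == x and f[x] != x:
                return True
    return False

print(satisfiable([0, 1]))  # True: f that swaps 0 and 1 is a model
print(satisfiable([0]))     # False: with one element, f(x) = x always
```

A singleton domain forces f(x) = x, so the formula needs at least two elements—exactly the kind of fact a solver discovers symbolically.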

Building one from scratch is a graduate-level research project. Getting it competitive with mature solvers is harder still.

The Experiment

We used Claude Code (Sonnet 4.6) as our coding agent and gave it a single initial prompt pointing to a reference for the congruence closure algorithm. No code scaffolding. No starter template. Just: here’s what you need to implement, go.

The resulting solver—LLM2SMT—targets QF_UF (quantifier-free formulas with uninterpreted functions), a core theory at the heart of the DPLL(T) architecture.
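The congruence closure algorithm the agent was pointed at maintains equivalence classes of terms under the rule that equal arguments imply equal function applications. A minimal sketch of the idea, in our own words (the generated solver is considerably more elaborate):

```python
# Minimal congruence-closure sketch (ours, not the generated code).
# Terms are constants like 'a' or tuples like ('f', 'a').
class CongruenceClosure:
    def __init__(self):
        self.parent = {}

    def find(self, t):
        self.parent.setdefault(t, t)
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]  # path halving
            t = self.parent[t]
        return t

    def merge(self, s, t, terms):
        rs, rt = self.find(s), self.find(t)
        if rs == rt:
            return
        self.parent[rs] = rt
        # Congruence: f(a...) = f(b...) once arguments are pairwise equal.
        for u in terms:
            for v in terms:
                if (isinstance(u, tuple) and isinstance(v, tuple)
                        and u[0] == v[0] and len(u) == len(v)
                        and self.find(u) != self.find(v)
                        and all(self.find(x) == self.find(y)
                                for x, y in zip(u[1:], v[1:]))):
                    self.merge(u, v, terms)

# a = b entails f(a) = f(b):
cc = CongruenceClosure()
terms = ['a', 'b', ('f', 'a'), ('f', 'b')]
cc.merge('a', 'b', terms)
print(cc.find(('f', 'a')) == cc.find(('f', 'b')))  # True
```

Production implementations replace the quadratic rescan with use-lists and a pending queue, but the invariant—merging representatives and propagating congruences—is the same.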

The code is on GitHub: github.com/MikolasJanota/llm2smt

The Results

We benchmarked against the full QF_UF suite from SMT-LIB (7,000+ instances, 180s timeout):

Solver     Solved
z3          7,500
cvc5        7,494
LLM2SMT     7,468

LLM2SMT solved 99.6% as many instances as z3—without a single human-written line of solver code. Preprocessing alone added 113 solved instances.

It Wasn’t Magic

The first version was broken in ways both subtle and obvious:

  • No Boolean connectives. The agent understood SMT at a high level but didn’t implement and, or, not until pointed to the SMT-LIB Core theory specification.
  • Wrote its own SAT solver. Despite being told to use CaDiCaL, the agent improvised a toy SAT solver instead. Explicit re-prompting fixed this.
  • Ran without timeouts. Processes silently ran forever in the background. We added a standing instruction to always run under timeout.
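Handling the first item—Boolean connectives—comes down to the standard Tseitin encoding: each and/or/not node gets a fresh SAT variable and defining clauses, which are then handed to a back-end SAT solver such as CaDiCaL. A minimal sketch of ours (not the generated code):

```python
# Tseitin encoding sketch (ours): compile a formula tree over
# 'and'/'or'/'not' into CNF clauses for a SAT back end.
def tseitin(formula, clauses, fresh):
    """Return a literal equivalent to `formula`; append defining clauses."""
    if isinstance(formula, str):               # atomic proposition
        return fresh(formula)
    op, *args = formula
    lits = [tseitin(a, clauses, fresh) for a in args]
    if op == 'not':
        return -lits[0]
    v = fresh(formula)                         # fresh variable for this node
    if op == 'and':                            # v <-> l1 AND l2 AND ...
        for l in lits:
            clauses.append([-v, l])
        clauses.append([v] + [-l for l in lits])
    elif op == 'or':                           # v <-> l1 OR l2 OR ...
        for l in lits:
            clauses.append([-l, v])
        clauses.append([-v] + lits)
    return v

def make_fresh():
    table = {}
    def fresh(key):
        return table.setdefault(key, len(table) + 1)
    return fresh

clauses = []
root = tseitin(('and', 'p', ('not', 'q')), clauses, make_fresh())
clauses.append([root])   # assert the whole formula
print(clauses)
```

The resulting clause list is in DIMACS-style signed-integer form, which is exactly what incremental SAT interfaces like CaDiCaL's consume.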

Debugging required the same tools human developers rely on: fuzzing and delta-debugging. We had the agent build its own random formula generator and a differential tester against z3. Without these, each bug would have required expensive manual back-and-forth with the agent.
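The shape of such a generator is simple; here is a hedged sketch of our own (hypothetical, not the agent's actual generator), emitting random QF_UF instances in SMT-LIB syntax that can be fed to both solvers and have their verdicts compared:

```python
import random

# Toy random QF_UF generator for differential testing (our sketch).
def random_term(depth, rng):
    if depth == 0 or rng.random() < 0.4:
        return rng.choice(['a', 'b', 'c'])
    return '(f {})'.format(random_term(depth - 1, rng))

def random_formula(n_atoms, rng):
    atoms = ['(= {} {})'.format(random_term(2, rng), random_term(2, rng))
             for _ in range(n_atoms)]
    lits = [a if rng.random() < 0.7 else '(not {})'.format(a) for a in atoms]
    return ('(declare-sort U 0)\n'
            '(declare-const a U)(declare-const b U)(declare-const c U)\n'
            '(declare-fun f (U) U)\n'
            '(assert (and {}))\n(check-sat)\n'.format(' '.join(lits)))

rng = random.Random(0)
print(random_formula(3, rng))
# Write this to fuzz.smt2, run both solvers on it, and flag any
# disagreement between their sat/unsat answers.
```

Any disagreement is then shrunk by delta-debugging—dropping assertions while the disagreement persists—until a minimal failing example remains for the agent to fix.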

One bug, surfaced by the fuzzer, produced this commit message (written by the LLM):

link_bool_term_to_euf: add mutual-exclusivity clause {-eq_true, -eq_false} so the SAT solver cannot simultaneously merge __bool_true and __bool_false, which was causing false UNSAT for formulas with Bool-sorted UF arguments.

Not bad.

The Preprocessing Win

One benchmark family—the “equational diamond” problems—exposed a hard exponential blowup in standard DPLL(T). Given a formula like:

(x₁ = z₁ ∧ z₁ = x₂) ∨ (x₁ = v₁ ∧ v₁ = x₂)
...
x₁ ≠ x₁₀₁

a naive solver explores up to 2ⁿ branch combinations (one choice per diamond) before finding the contradiction. We described the problem structure in a single prompt and asked for a fix.

The agent devised a preprocessing technique on its own: for each disjunction, compute the equality closure of each branch, extract consequences common to all branches, and add them as unit clauses. For (x=y ∧ y=z) ∨ (x=w ∧ w=z), both branches imply x=z, so that fact gets added upfront. The diamond problems now solve instantaneously.
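The idea can be sketched in a few lines (our illustration, not the generated code): compute the equality closure of each branch with a union-find, then intersect the implied equalities across branches.

```python
from itertools import combinations

# Sketch (ours) of the preprocessing: equalities implied by every branch
# of a disjunction are asserted as unit facts.
def closure(eqs, vars_):
    parent = {v: v for v in vars_}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for x, y in eqs:
        parent[find(x)] = find(y)
    return {frozenset((x, y)) for x, y in combinations(vars_, 2)
            if find(x) == find(y)}

def common_consequences(branches, vars_):
    sets = [closure(b, vars_) for b in branches]
    out = sets[0]
    for s in sets[1:]:
        out &= s
    return out

vars_ = ['x', 'y', 'z', 'w']
branches = [[('x', 'y'), ('y', 'z')],   # x=y AND y=z
            [('x', 'w'), ('w', 'z')]]   # x=w AND w=z
print(common_consequences(branches, vars_))
```

The only common consequence here is x = z, which is exactly the unit fact that lets the diamond chain collapse without case splitting.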

The Hard Part: Lean Proofs

For unsatisfiable instances, we asked the agent to emit a formal Lean proof. This was the hardest part of the whole project.

The approach: SMT types and axioms become Lean axiom declarations, each theory lemma becomes a theorem discharged by Lean’s grind tactic, and the final False is assembled by bv_decide.
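As a toy illustration of that shape (our own example, not the paper's generated output), an UNSAT certificate for a = b ∧ f(a) ≠ f(b) could look like:

```lean
-- Toy sketch of the certificate shape (ours, not generated output).
-- SMT sorts, symbols, and input assertions become axioms:
axiom U : Type
axiom f : U → U
axiom a : U
axiom b : U
axiom h₁ : a = b          -- input assertion
axiom h₂ : f a ≠ f b      -- input assertion

-- Each theory lemma becomes a small theorem; the pipeline discharges
-- these with `grind`, though here plain congruence suffices:
theorem lemma₁ : f a = f b := congrArg f h₁

-- The final False (assembled by `bv_decide` over the propositional
-- skeleton in the real pipeline; trivial in this toy case):
theorem unsat : False := h₂ lemma₁
```

The division of labor matters: theory facts are closed lemma by lemma, and only the propositional assembly of those lemmas into False is left to the final tactic.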

The agent struggled to grasp the separation between theory reasoning (what grind handles) and propositional reasoning (what bv_decide handles). It kept trying to close propositionally non-trivial goals with grind, which failed silently. We eventually had to write a worked example of a correct Lean proof by hand—after which the agent incorporated the pattern correctly.

Of 3,969 attempted certifications, 285 produced verified Lean proofs; proof production revealed no solver bugs.

Lessons Learned

  • Specify input formats precisely. Ambiguity in the language spec (e.g., SMT2’s treatment of Booleans as both propositions and terms) causes subtle, hard-to-find bugs.
  • Enforce resource limits from day one. Any tool that can run forever will run forever.
  • Fuzzing + delta-debugging are essential. The agent fixes bugs well when given a minimal failing example; finding the example is the hard part.
  • Commit frequently; test for regressions. The agent can use git bisect to track down regressions when it has a dense commit history.
  • Expect jagged intelligence. The same agent that devised a clever preprocessing algorithm failed to simplify t = t to true. This uneven capability is a known phenomenon and a real challenge for reliability.
  • Formal proof generation is hard. The agent must model not just its own solver logic but the expectations of an external proof checker—a genuinely difficult meta-cognitive task.

The Bottom Line

Can LLMs build automated reasoning tools? Yes—with appropriate scaffolding. The qualifier matters: systematic testing, explicit resource controls, and careful input specification are not optional. Correctness doesn’t come for free and cannot be assumed.

Perhaps the most striking takeaway is this: LLMs don’t just reason—they can engineer systems that reason. That is a meaningful shift in what these models make possible.


LLM2SMT is described in our paper “LLM2SMT: Building an SMT Solver with Zero Human-Written Code,” on arXiv. Code: github.com/MikolasJanota/llm2smt