Formal methods and the future of programming

8 hours ago by Animats

I used to do proof of correctness work, decades ago.[1] We had more proof automation than many of the later systems. The easy stuff was solved by the first SAT solver, the Oppen-Nelson simplifier. The harder stuff used the Boyer-Moore prover, which uses heuristics and previous lemmas to guide the theorem prover. The Boyer-Moore prover had to be helped along by suggesting lemmas, which it would try to prove and which would then be used for later proofs. That was the tough human job.

Later systems seemed to have less automation. What tended to go wrong with formal methods was that the people doing them were too into the formalism. I was working on this for a commercial project that wanted bug-free code, not as an academic. The academic projects tended to get too clever. They had the mathematician's bias of wanting a terse notation and not much case analysis. That's not a good solution. You want lots of automated grinding and don't want to need much insight. The clever people kept inventing new logics - modal logic, temporal logic, etc, - to avoid verbosity in paper and pencil proof work. That wasn't really all that helpful. The basic truth of this business is that most of the theorems are rather banal.

As the Jane Street people point out, there's a big advantage in having control of the language. You want the verification statements integrated into the programming language. In many systems, they're embedded in comments, in a different syntax than the programming language, or even in completely separate files. This adds unnecessary work. It's good to see them pushing this forward.

We were doing this too early, over 40 years ago. It took about 45 minutes of compute back then to build up number theory from a cold start with the Boyer-Moore prover. Now it takes less than a second.

[1] https://archive.org/details/manualzilla-id-5928072/page/n3/m...

4 hours ago by nextos

I've worked in formal methods for quite a long time, and I disagree a bit with your statement that new logics are not helpful. Industrial logics are really practical and allow you to write all sorts of sophisticated properties that your system should satisfy in a very succinct way. Logic is to computer science and software engineering what calculus is to physics and mechanical or civil engineering [1, 2]. Things like LTL or, more recently, separation logic, have been incredible breakthroughs.

TLA+, which has gained quite a lot of popularity, is a testament to that. Model checking is eminently practical. The exciting thing now is that heavier formal methods, in particular theorem proving, might become cheap enough to use in regular systems software. Writing formal specifications for functions and getting them synthesized and proven correct by some SAT/SMT, theorem prover & LLM hybrid may become the norm in the not-too-distant future.

[1] On the Unusual Effectiveness of Logic in Computer Science. https://www.cs.rice.edu/~vardi/papers/aaas99.jsl.pdf

[2] From Philosophical to Industrial Logics. https://www.cs.rice.edu/~vardi/papers/icla09.pdf

2 hours ago by ebiederm

Have you looked at ADA Spark?

If you have does it match your intuition of how things should be done?

I am slowly working on something where I hope to integrate such a capability for the things that type systems can't handle quickly.

So I would be interested in perspectives of people who have been down this route before.

6 hours ago by winwang

Love this. I've shifted in the past few months to using highly expressive types in Scala 3 to have types carry more and more compile-time proofs (without macros, though a couple are warranted). Not only does it help with agentic test "sprawl", it seems to prevent agents from falling into lower-quality modes of operation. One of the more annoying things I've been preventing is what I call "noun accretion", where agents try and make a new monomorphic type for everything, instead of clearly genericizing when sensible. My bet is on formal-method-shaped tooling (including languages with strong type systems) to accelerate decent-quality agentic coding.

When I say "highly expressive types", I mean techniques I'd likely not want to ship in a typical codebase, unless the team was already on the typelevel programming bandwagon (i.e. having HKT and type functions being basic blocks already, not weird). Agents are better at "math" than basically most devs (even category-theory-pilled ones), at least in terms of knowledge. Better yet, they are decent at pedagogy, especially when considering they have "infinite" patience for dialogue.

In a more personal setting, I've had Codex Lean-ify a couple of my hobby proofs, it was extremely easy. Note: not saying it did this 100% "correctly" (gotta learn more Lean 4 to check more thoroughly), but it also seems to check for classic proof gotchas by default. Very excited for the future of formal methods.

12 hours ago by jdw64

In other words, because GEN AI a lot of code, the idea is to shift human value toward verification. Sometimes I think about what programming really is. In fact, learning programming itself is a huge challenge for a non English speaker like me. I have to rely on machine translation to understand English documents that have no translation. The materials in my language are about 5 to 6 years behind.

Now, since it's impossible to code review the tens of thousands of lines of code that AI produces, I see discussions about establishing an absolute rule like mathematical proofs. Reading this reminds me of Rust's borrow checker. In fact, after writing in Rust a few times, it often leads to bad practices where people use tricks to avoid the borrow checker.

Actually, when mathematical rigor goes too far, humans tend to find ways around it. An undereducated programmer like me is especially prone to that.

Looking back at this kind of attempt, it will probably result in writing code only for specific formalized answers. If it becomes that standardized, I'm not sure it will be able to respond to human needs.

I think these defensive programming attempts are fine, but I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship. Believing that over time, it will become good enough. Of course, for established industries where accuracy matters and the scope of work is well defined, like Jane Street, the approach in this article is correct. In other words, because there is enough data to adequately model the market's demands

But for social losers like me trying to make money, constantly moving from place to place looking for gold mines, these kinds of methodologies seem like a luxury.Established businesses with mature modeling need highly educated and specialized personnel to continuously optimize. But I know that realistically, I can't keep up with that demand. So I look for places where modeling is unstructured, but I'm not sure if I can use this approach even then.

an hour ago by orochimaaru

You have to see it from janestreet's perspective. They're an HFT and trading high volume (millions if not 10's of millions) of stock & instruments. There is no "fix". By the time you understand what's wrong you've lost billions.

So yeah - offensive may work in non-critical areas.

Fwiw - you already use defensive everywhere. Python, Java, etc. come with garbage collectors. It's verified that the code is executing your intent.

I was wondering when we would start seeing formal verification. It makes sense that we would go from worrying about implementation details to a scientific/mathematical description of the problems.

5 hours ago by ngruhn

> I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship.

Nice, I like the term too. But the paradigm is absolutely status quo in the industry. The thing is: with Gen AI the cost of "defensive programming" has gone way down, while the cost of (human) verification has gone way up. On the other hand, formal methods make verification cheap but come with massive implementation overhead (writing specs, types, proofs, and generally bending the implementation into a rigid framework). But Gen AI can automate all that laborious work. It's a match made in heaven.

5 hours ago by jdw64

You're right. I think the point we need to discuss here is, to be clear, dividing the contexts in which defensive programming is strong and where it should be aggressive.

I think the overhead of implementation is enormous, but I believe AI can write it. However, before even reaching the 'implementation' stage, that is, at the planning stage, sufficient data must be collected for the implementation to be safe.

In that respect, I think Jane Street already has enough data and modeling capabilities. However, I think it's a bit difficult to say whether this approach will take shape in many other domains.

In that sense, I also think that the reason many industries are doing this kind of fast deployment and experimental tooling might be a preparatory step for that kind of modeling. Have a good day Thanks for the good comment. It helped me think more sharply as well

31 minutes ago by majormajor

> I think the overhead of implementation is enormous, but I believe AI can write it. However, before even reaching the 'implementation' stage, that is, at the planning stage, sufficient data must be collected for the implementation to be safe.

I don't really follow here, I think once you know what you want the system to do, you could start to model it formally. What data do you think needs collection for planning here? Is it just for performance profiling of whatever algorithms are decided on? But that seems equally as relevant as if you do your initial implementation straight-up in code w/o any formal approach.

9 hours ago by stephenlf

> these kinds of methodologies seem like a luxury

Absolutely. The article acknowledges this. Jane Street is pretty uniquely equipped to benefit from this.

10 hours ago by brap

Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”.

I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.

I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.

I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.

Can anyone enlighten me?

5 hours ago by jcranmer

There's a famous quote from Dijkstra: "Program testing can be used to show the presence of bugs, but never to show their absence." The flaw of testing is that it can only test the behaviors that you think might be problematic. To actually reach into the category of proactively fixing behaviors that you didn't know could go wrong, you have to reach for more exotic tools in the toolkit. Fuzz testing is a start down this path; formal verification is a different start down this path.

Obviously, the quality of these sorts of tools is dependent on your ability to write a formal model that is comprehensive in allowing behavior you want to be acceptable and disallowing behavior you want to be unacceptable, and we're still surprisingly far from that in many fields.

10 hours ago by jlebar

> Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”. [...] Can anyone enlighten me?

A big difference is that formal methods allow you to use the "for all" quantifier.

For example, you might write a unit test that says "foo('abc') returns a string with no trailing whitespace".

But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".

This is a trivial example, but you could imagine something more complicated, such as "for any program P, compile(P) has the same behavior as P".

Of course, you have to define what "has the same behavior as" means!

7 hours ago by brap

>Of course, you have to define what "has the same behavior as" means

And that's really my issue, for example when you define "has no trailing whitespace", you are basically writing a piece of the implementation. Cover all behaviors, and you have basically re-implemented the function, no?

In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?

6 hours ago by bluGill

> In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?

In some cases, however quite often, the spec is much simpler. For instance, it's easy to say that after running sort on some list, that the result is sorted. However, it is very hard to come up with the algorithm to do that from the specification. Sometimes that is even a point. Bubble sort, quick sort, tim sort, we can go on and on. There's a huge number of different sorts that computer science have discovered over the years. They all should have the same result and so you should be able to prove they do the same thing. However, in the real world there are often reasons you would prefer one to another despite all meeting the same spec.

3 hours ago by nemo1618

I think the key is that, while you may think you have a full formal spec of f(), you actually do not. You have a program written in some language, and that language has its own spec, and the language is compiled to asm which has its own spec, and the asm executes on an architecture that has its own spec, and so on.

So when you write a function like:

  func hypot(x, y):
    return sqrt(x*x + y*y)

You might think you have "fully specified" hypot, but this is far from true! You have said nothing about what registers will be used, for example. This is not a problem; quite the opposite. It's the whole point of using high-level languages: they let you focus on what you care about. A spec is just a program in a very-high-level language.

6 hours ago by chadgpt3

Non trailing whitespace means the string doesn't end with a space. But foo is a function that converts an AST to a string, that's totally different. Or it's a function that loops until \0 and changes all spaces to +

The spec should be a summary of what the impl is supposed to do. You'd want more than just doesn't end with whitespace of course.

6 hours ago by undefined

[deleted]

5 hours ago by parthdesai

> But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".

Isn't that essentially property testing?

4 hours ago by Agingcoder

Formal methods allow you to prove that it works for all inputs, and not just for the small subset that will be sampled by property testing

It’s a proof, not a successful experiment.

4 hours ago by Jtsummers

Property testing is stochastic, which may be fine, but only gives you a statistically (hopefully) high chance of discovering a problem. If you use something like SPARK/Ada, you can actually embed a proof in the code so that you actually know that the code is correct (for what you've proven). PBT scales better than embedding proofs, though, and is highly effective in practice along with fuzzing.

3 hours ago by akoboldfrying

Unless you literally try every possible combination of inputs (which is usually infeasible), property testing can't give you mathematical guarantees about correctness. You can think of it as a halfway house between classic testing and formal verification:

Classic testing: A human comes up with some concrete example inputs for which they know the "right answers" (corresponding outputs). They write code that runs the code under test, gets its actual outputs, and compares them to the desired outputs.

Property testing: A human comes up with a precise way of randomly generating concrete example (input, desired output) pairs. They write some code to describe how to generate the pairs, often using a declarative DSL that describes only constraints on the inputs and outputs, with the understanding that anything not expressly forbidden is permitted, like "The input can be any list of between 0 and 100 integers each between -500 and 500" and "Every integer in the input must appear the same number of times in the output". They then write some more code (often a single line) to ask the computer to use this "spec" to randomly generate, say, 1000 such pairs, or as many pairs as can be checked in 1s. The computer generates the pairs itself, runs the code under test on each input and and checks its output matches the desired output.

Formal verification: A human comes up with a spec that typically describes conditions that must hold for all (input, output) pairs. This may look very similar to, or even exactly like the DSL used for property testing, though in general there are other conditions that can be expressed that cannot be checked with property testing even in principle -- for example, checking that the program always eventually terminates. The main difference is that the code under test is never actually run; instead, the computer analyses the source code itself to attempt mathematically prove that the stated conditions hold. How to actually accomplish this is a field of active research, but one basic approach is called "symbolic execution". To greatly simplify, if we forget about loops and conditionals for a moment, the idea is that we can write down things we know must be true after each statement executes, based on the things we knew must be true before it executed. So for example if x is a variable initially containing any integer (and we ignore overflow) then after the line

    x = x * x

runs, we know that x >= 0. To handle conditionals like

    if x > 50:
      x = 42

    something_afterwards(x)

the prover "forks" into two cases: One in which we know for certain that x > 50, one in which we know for certain that x <= 50. At the end of the if statement it then has the task of recombining what is known about the two cases. In this example, the first case lets us conclude that x = 42 by the end, while the second case lets us conclude that x <= 50 by the end, so it could conclude that x <= 50 either way by the time execution reaches something_afterwards(x). Handling loops is trickier but generally involves looking for invariants.

10 hours ago by y1n0

I don’t write software, but in asic and fpga design, formal method specifications enable the use of things like SAT solvers to determine if the underlying test article meets the specification.

You specify properties of the design and the tooling tests the design in a variety of ways to see if it can violate those properties.

Let’s say you have a system that controls turn signals. You can specify properties that say things like lanes that cross each other can’t both have a green light at the same time or even within some period of time of each other.

The tooling can exhaustively check the design for this and surface code traces that violate it.

10 hours ago by jaggederest

It's more than just "write the same tests just in a different way", because each test is largely independent, or grows to unmanageably large, and you have to do the work of figuring out the completeness of your test suite by e.g. branch coverage or other relatively crude overlap methods.

Statically proven type systems let you do this in a way that each contributing piece is checked in advance against all the other pieces, guaranteeing not just "this test passes" but, in combination, "all these tests create a reasonable, coherent whole", and it disallows incoherent models from being implemented in the code. An example of this is like Rust's match, which requires complete coverage of all the possible branches, but writ large across an entire codebase.

You're right that it does nothing for you if you make a fundamental logic or theory error, it can't catch that kind of error, but you'd be surprised how much more robust it is than "merely" e.g. Haskell's type system + ad hoc functional testing + property testing - which I would consider a quite strong system overall, but formally proven e.g. cryptography is a much higher bar.

11 hours ago by awinter-py

See previous kleppmann post https://martin.kleppmann.com/2025/12/08/ai-formal-verificati..., and yes, obviously anything that you can put in the typesystem or the linter, you should weigh doing so.

Hopefully we get more ergonomic ways to do this? Like of the tools listed in the post, dafny + iris are the closest to being industrial I think. And amzn S3 has a history of TLA use in-house I think. But we probably haven't seen the typescript in this space yet, a zero cost abstraction that drops into existing tools, and people genuinely prefer it to the old way.

(And custom linters are also still pretty bad to write. Like golangci-lint is a painful codebase, haven't tried semgrep but the rules engine seemed intimidating. I've yet to use an AST API that I liked)

an hour ago by smasher164

I think it's a good time to learn Lean. It positions itself as a proof assistant that's also good at practical programming. I'm not sure how mature the ecosystem is for the latter.

10 hours ago by spenrose

As usual, this paean to deductive reasoning (“formal methods”) leaves out its fundamental limit: how closely do the postulates and definitions fit the domain they purport to map? (“In theory, there is no difference between theory and practice. In practice ...”) My guess is that Jane Street maintains large code bodies where the mapping is 1:1, because the purpose of the code is to implement a deterministic algorithm. Many other coders work in such areas. But millions of us don’t: most UIs, most exploratory work, etc.

There is a movement parallel to formal methods to define acceptance criteria at high resolution but not logico-mathematically, which at least grapples with the mapping problem but can’t resolve it where the map isn’t the territory, which is most places. Has Google’s results page, with its extremely evolved internal optimization frameworks really hit an optimum? Could that prototype you whipped up to capture a hazy idea have better illustrated it? These questions are best answered by looking outside the system to what the system serves.

9 hours ago by benlivengood

Formal methods are precisely for the domains where the semantics are well-defined. Logical circuits (a lot of CPU components get formal verification), kernels, protocols, parsers, compilers, cryptography, security frameworks, concurrency primitives, etc. all benefit a lot from verification.

4 hours ago by sn9

Most UIs in practice boil down to state machines which are extremely amenable to formal verification.

Hillel Wayne's writing is a good starting place to learn more: https://www.hillelwayne.com/formally-specifying-uis/

9 hours ago by petra

Google's results page isn't well defined. But probably 90%+ of the code below it is well defined.

And in some cases, where the result isnt well defined, it can be learned, so it's not about sitting and thinking what would make sense.

6 hours ago by addaon

I've been playing with related ideas recently, and can talk about them at length... but one thing that's surprised me is just how effective frontier models (ChatGPT-5.5 in particular) are at completing certain manual proofs in the Roqc (né Coq) proof assistant. The proofs aren't always pretty, but ChatGPT can often prove something in minutes and 10 - 100 iterations that would take me, a human who has limited but non-zero proof assistant experience but significant domain experience in the lemmas being proven, much much longer.

The reason I think this is relevant and interesting in the context of the (short) article is because it challenges basic assumptions about how certain formal method tooling works. Frama-C, Ada/SPARK, etc are focused on generating proof obligations that can be automatically discharged by CVC5, Z3, etc; with a relatively painful fallback of manually proving the obligation in Rocq. The most common workflow seems to be to discover that an obligation is "hard" (not automatically discharged), and to restructure the set of visible lemmas and assertions at the obligation generation point in the code to try to make it "easy"; or even restructuring the code to try to make it easy. Which, in a world where Roqc proofs are doubly expensive (first because they're a challenge for a human to write; second because they tend to require quite a bit of maintenance churn as the code and proof around them evolves), makes sense. But if Roqc proofs are "automatically discharged with AI in the loop", this cost delta goes away -- and it becomes possible to think about writing proofs the same we (usually) write code, with human readability and clarity as the first goal, and [compiler|prover] optimization a distant second.

Which I find quite exciting, although I haven't fully digested the implications yet.

6 hours ago by bluGill

How often will AI look at something that can't be proven because you change requirements and decide to change the code and your requirements to make the proof easier though? I haven't played with proofs at all, but I do know that occasionally when I say, this test failed after making a change, AI will just change the test instead of making the code pass both the old test and the new test which is most often my intent.

5 hours ago by addaon

That's up to the harness. Right now, my harness doesn't have any tools that would allow the agent to change contracts. Also, agents working on a Rocq proof don't have write access to the C+ACSL code; they can review it, and propose structural changes to the coordinator, but all they can do is iterate on the Roqc proof or give up. The weakest part of the harness today is that the agents that do C+ACSL work can modify both C and ACSL, even though they're usually run with the explicit intent to not change the semantics of the C code (although it's totally within bounds, and often required, for them to change e.g. `foo & 3` to `foo % 4`) -- so there's weakness here, but it's bounded, and since the contracts live in header files that they can never write, it's worked so far. It's a starting point, at least; I certainly wouldn't say this is a mature, or even good, tool chain, but still learning.

2 hours ago by zahlman

> how effective frontier models (ChatGPT-5.5 in particular) are at completing certain manual proofs in the Roqc (né Coq) proof assistant. The proofs aren't always pretty, but ChatGPT can often prove something in minutes and 10 - 100 iterations that would take me, a human who has limited but non-zero proof assistant experience but significant domain experience in the lemmas being proven, much much longer.

... How do you know that the proofs are themselves correct?

2 hours ago by addaon

With the proof checker.

Formal methods and the future of programming

Daily Digest