Whenever proof tools are brought up, I tend to wonder how one goes from the proofs as written to actual code that can run in production in a way that guarantees the proof holds - how does that happen?
The linked page even says "It’s the software equivalent of a blueprint" - how does one go from that blueprint to something that functions?
TLA+ has no direct path to code. But it is a great way to test-drive your _idea_, to force you to dwell on your problem before beginning to code, to look for states that you never thought about even in a simplified setting.
Having gone through the exercise, I find I have greater clarity while coding, and have a better idea of where to place assertions and where to look for invariants. Often I iterate between prototyping with code, then playing with TLA+, then back to prototyping before throwing it all out and starting from scratch.
----
Then there are tools like Coq that convert a proof to running code automatically. For example, see the Verdi framework for doing Raft consensus: https://github.com/uwplse/verdi-raft.
This is clearly not for the faint of heart. The wonderful thing about this exercise is that it forces you to be precise about your specification, and holds your feet to the fire until your proof meets the spec. At the end of the day however, you have no guarantee that the specification (and the simplifications made to get coq going) are enough to capture enough interesting aspects of the real world. Still, for safety-critical and security-critical systems, it is always comforting that someone's gone through the pain of thinking it through to an extra level.
TLA+ could, in principle, be used to directly verify code (e.g. [1]), and has been used like that "in the lab", but then it suffers from the same issues that all code-level, deep specification tools, like Coq, suffer from -- barely being able to get within three orders of magnitude from "ordinary" software.
High-cost end-to-end verification of very small, very specialized programs by experts is a different practice altogether from low-cost, high-level design verification. TLA+ could be used for either, but the biggest bang-for-the-buck, especially in non-super-specialized cases is obviously the latter.
I think that converting C to TLA+ is too low level; TLA's state space would explode before anything of consequence can be explored. There have been similar academic efforts to convert from SPIN's Promela to C, but nothing interesting has been shown.
So, we have TLA+ on the one hand, which is too abstract, and C on the other, which is useless when translated to TLA. I think that a better solution is to have a DSL tailored to the domain, which can generate C for performance as well as a TLA+ model. Anil Madhavapeddy's PhD dissertation is a good example:
https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-775.html
> TLA's state space would explode before anything of consequence can be explored
You're confusing TLC with TLA+. TLC will work for small programs, but you can still use the TLA+ proof assistant (TLAPS) just as you can use Coq (and for the same definition of "can").
> So, we have TLA+ on the one hand, which is too abstract, and C on the other, which is useless when translated to TLA.
TLA+ is not "too abstract". It is mathematics. It can describe a system at the level of general algorithms, or at the level of logic gates. In fact, it particularly excels at expressing abstraction/refinement relationships between descriptions of the same system at various levels. For example, you can establish a refinement from the C code to a higher-level spec with the proof assistant, and then use TLC to model-check the higher-level spec. But still, TLA+ is not likely to be any better than Coq, or any other deep specification and verification tool. Unfortunately, none of them can feasibly verify software of common size.
True that. I have not really been able to get anything beyond the most trivial to work with TLC, so I just stick to TLA+.
> TLA+ is not "too abstract". It is mathematics.
I understand. I'd wager that is the chief reason why most programmers don't even begin to investigate it. Spin feels easier because it is operational and because it can incorporate C.
> I have not really been able to get anything beyond the most trivial to work with TLC, so I just stick to TLA+.
Really? I personally have checked quite elaborate TLA+ specifications with TLC, and I know others have, too. Sure, like all verification tools it's got its limits, but even for a complex distributed system spec, TLC could reasonably check it with, say, four nodes.
> I'd wager that is the chief reason why most programmers don't even begin to investigate it.
I'm not sure Spin is used more than TLA+ these days, especially when it comes to "mainstream" software.
Thank you for this explanation, without which I would have continued searching for a mirage. Everything makes a lot more sense now. It would be great if TLA+ made this clear (or at least clearer) right from the start.
I've actually built something like this for TLA+ and C#, which I call Kayfabe. I'll hopefully present it this October at the next TLA+ conference.
High-level, there are two modes:
1. Checking execution traces from running code against the model (see notpron's work for this idea)
2. Outputting every state visited by the model checker in a finite instantiation, then computing paths to step the real C# through so it'll visit every state or transition at least once. This is very, extremely computationally expensive, but I'm running real code in prod that I've tested this way.
Hopefully I can get my algorithms to be efficient and if so I'll try to open-source my efforts.
You are asking the wrong question. While there are various ways to check conformance to the spec, none of them guarantees it. In fact, no one knows how to do that using any tool whatsoever. People sometimes refer to end-to-end verification successes (verifying all the way from the spec to the code or even machine code) like seL4, but forget how tiny that codebase is, how long it took specialized experts to write and verify it, and what concessions had to be made to make that possible.
So let me ask you this: how do you guarantee that your tests cover all possible case? You can't (and they probably don't). So why do you use tests at all? Because the question isn't how to guarantee the software is perfect -- something we can't do, at least not currently -- but how to make it better. And just like tests can make your software better even if it's not perfect, a TLA+ specification can, too.
How do you go from a blueprint to the program? The same way you go from an algorithm in a book to a program. You can make the specification as detailed or as high-level as you like, and focus on whatever you think is the most important parts. Going from spec to code is the easy part.
TLA+ was created so you can quickly model / specify a design, check it works under your assumed invariants and maybe explore alternatives up front, giving you a plan for how your system should work in theory.
Like you mentioned it's the software blueprint, and you go from a blueprint to a building by actually building the thing - TLA+ is not made to codegen your product.
Would you say that it's closer to "pseudocode you can use to test process assumptions" than a description of the final structure that a program would have?
Yes apart from it's not pseudocode but mathematical formulas.
Your building blueprints aren't made from bricks and mortar because it would just take way longer and be expensive to experiment with, however with software when designing we often use the same tools as building without thinking of the cost like that.
TLA+ and other lightweight modelling languages allow you to experiment with your design and verify that it works without all the tedium of programming languages and their environments / libraries / compilers etc.
huh, I guess I’m going to have to try it because I just realized that I had a hidden assumption that code was definitionally the lightest way to describe a problem
A lot of people use PlusCal, which is basically pseudocode that compiles to TLA+. Depending on what you're modeling, PlusCal or directly writing TLA+ might be a better fit.
Look at idris - proofs are expressed using dependent types which allow full propositional logic and implementations exploit these proofs using a haskell like language that allows imperative + purely functional code: https://www.idris-lang.org/
Different proof tools inhabit different areas on the "specification to implementation" spectrum (problem/solution). Since it's possibly to perfectly implement a bad specification, proving a specification valid has value, even if it isn't formally proving your implementation of that spec.
Typically, before implementing a feature it'd be nice to have a design for it. Nothing that takes 6 months, though. TLA+ would be really helpful there, especially if you become familiar with it, so that you can answer the "what if" questions.
I think I will give it a try. It might definitely help building better distributed systems.
Its always a joy to watch Leslie Lamport. He's one of the few people in computer science that are really worth looking up to. Everything he says and does is incredibly eloquent and purposeful.
Every time this frontpages I feel super shitty about not getting back to it. I've learned a _lot_ about how to teach TLA+ in the past three years, and there's a lot I know I can improve about this guide. One of these days, I promise!
Does anyone have personal/anecdotal experience with TLA+ being used in a professional setting? I'm interested in hearing stories/use-cases of practical application.
I've always been intrigued by formal methods but when I try to think of an application of TLA+ my problems are either too trivial to need formal analysis or too complicated for formal analysis to appear feasible.
Yes indeed. Seen it in use in AWS across a number of critical components.
The team I was on in AWS used it for the implementation of a particular component that needed to operate at large scale and had absolutely zero tolerance of failure. Two engineers learned TLA+, started writing up the logic and discovered various ways in which the logic would work on small scale but wouldn't scale up to parallel operation across multiple servers. It's fair to say TLA+ saved our bacon there before we'd written a single line of service code.
In the end we actually got the component out to production faster than predicted. The TLA+ code gave solid shape for the actual production code, and we were quickly able to prove that the code was valid and that things were working correctly.
In my current role in Oracle Cloud Infrastructure, there is actually a small dedicated team of engineers that help services use TLA+ for components, headed by Chris Newcombe, one of the authors of the TLA+ whitepaper that came out of Amazon (https://lamport.azurewebsites.net/tla/formal-methods-amazon....)
My current team has just started exploring it for use for a component that similarly has little tolerance for error. It's not in the same territory of "if it fails it's catastrophic", thankfully, but if it fails it's a major inconvenience for customers.
I used it for the first time in firmware design at MSFT a few years ago, and I've used it since in the areas of protocol design. I don't use it frequently but it's a tool I'm glad I took the time to learn, and have to hand when I need it.
My experience is that, using TLA+, or other tools like it, encourages you to think precisely about the problem and at the same time exclude unnecessary details. That alone I find useful in terms of the insights it gives e.g. surfacing interesting invariants, or alternatively, invalidating them. In any case, you use the model-checker (TLC) to test these assumptions quite easily, and then bring these invariants into your code, either in the form of assertions, contracts or perhaps in property-based testing. In any case, I would say the process of writing such a specification yields benefits. Perhaps this can best be summed up by something I think Leslie Lamport once said: "If you're not writing when you're thinking, you only think you're thinking."
I should add that despite claims about TLA+ being particularly well-suited to concurrency problems, my implementation targets have never been 'concurrent' in the 'multi-threaded' sense. Rather, they have been single-threaded event-driven state machines, and I personally find that it excels in this space. However, as you will find, in TLA+, concurrency is just a matter of abstraction...
I have found that, because TLA+ doesn't 'generate code', many colleagues struggle to see the tangible benefit. I personally think that there is a gap between the whiteboard and the editor which tools like TLA+ fill very nicely. It's not a panacea or a magic bullet, and like any tool you must still exercise judgement about when to use it, and how to use it effectively, which includes understanding the limitations of the tooling (TLC cannot check arbitrary specifications).
But if I find myself faced with a potentially tricky algorithm, or indeed want to understand an existing algorithm, I reach for TLA+.
Not really a professional setting, but I teach TLA+ to the above average kids in my class (I teach 2 hours a week) in high school (15 / 16 yo). I started this a few weeks ago as an experiment, but the results are great. The students are enthousiastic, and while they need a bit of help, it's not too hard to get into for simpeler projects. So far a student has used it to model his rubiks cube solving bot can do every move a human can. Furthermore, we are currently using it for proving their anti collision system for multiple robots navigating a grid is sound.
One of my coworkers used TLA+ to prove a change to a service my teams runs was correct.
The service was essentially a key/value store. If one region didn't know about a key, it would recursively ask the other regions in parallel for the data.
Eventually we needed to add support for deleting data. We deleted data that hadn't been read in a few months.
We used TLA+ to prove our region-sync code and deletion code interacted properly along with regular get/put operations.
There's a paper describing the use for railway applications [0].
A few years back I had an interesting discussion with one of its authors and was shown the real model and code. Interestingly, they generated C from the TLA+ spec and this code generator was "unproven". However, the software itself had high coverage requirements and needed to be tested extensively (100% code/branch/MCD coverage, I don't quite remember).
One of the points he made was that the upfront cost pays for itself by avoiding bugs that are hard to debug.
Started using it at work, mostly to verify lock-free code. After verifying a new code for multi producer multi consumer (MPMC) lock free queue, tests were still not passing on iOS in debug... So trust in the model allowed us to start asking if compiler was broken, and ... it was, apparently in Apple's fork of llvm, as they don't really use upstream llvm as-is, 128-bit atomic compare exchange (DCAS) is broken in debug in Xcode 8.
In general it is quite useful exercise to even just model existing code line-by-line and see what possible states it can have. I'm still learning though, next step would be to try TLAPS.
We've used it at various times to verify the design of some database kernel internals e.g. I/O scheduling. This is partly to give you confidence that the design is not broken, but on rare occasion it also surfaces design defects, like non-obvious race conditions.
Defects surfaced in the model may not imply defects in the software e.g. obscure race conditions in the model that cannot actually occur in implementation due to details not captured in the model. But it forces you to explicitly acknowledge and investigate that.
I highly recommend learning TLA+ before writing any papers with math involving set logic (foreach, exists, etc), if only to teach you standard mathematical syntax for those things (like “{ s \in S : \A t \in s : func(t) }“, which is valid TLA, and will also emit a human readable formula if pasted into Latex).
This would have saved me tons of work in hindsight.
Hillel Wayne (who wrote this guide) also has a great blog with many posts related to this topic: https://hillelwayne.com/
Leslie Lamport (who created this language, has written many seminal papers on distributed computing and is a Turing award winner) has also published an excellent book and video series introducing the design and practical usage of TLA+, you can find the video series at: http://lamport.azurewebsites.net/video/videos.html
I wonder what's preventing formal methods from becoming a more standardized practice when designing complex interconnected systems or microservices.
I mean, I know they are not as "friendly" as writing tests using mocha or pytest, but why can't someone build an npm version of it and use it in CI? Is there a technical challenge to doing that? Perhaps I'm missing something.
They're pretty hard to learn. Formal methods are relatively abstract and the feedback loop is a lot less obvious--in TLA+/PlusCal, you aren't writing instructions, you're defining state spaces and relationships between/across them. Since it's not something you generally use often, it also gets rusty quickly. In a lot of dev jobs, it's just not worth the cost.
I know AWS uses it heavily in distributed algorithm/system domain. I’m wondering if this is also useful for verification in other domains like data modeling and machine learning. So far I haven’t seen too much TLA used outside the core system/infra engineering/database groups
Preface, I don’t know TLA+ or Coq, but I get the impression from examples, that it is strongly reminiscent of test-driven-development.
I can see the advantage, but I wonder if there’s potential for writing the proofs in the target language, verifying the logic, then building the code, and verifying the code meets the proofs.
Is this something anyone is looking into? I could imagine a macro in Rust for example that is TLA+, that takes standard test input, verifies the logic, then you feed in the real implementations.
Potentially. I’ve dabbled in TLA+ a few times and have had mixed results; one project turned out spectacularly well, one ran into a problem where I really couldn’t figure out how to model the problem, and one was a failure but TLA+ proved that it was impossible to build a reliable system based on the hard requirements (interfacing with hardware from a 3rd party vendor).
I agree with you in the sense that, after the spectacularly good project, it would have been great to run code gen and get a skeleton to fill in.
I disagree with you in the sense that TLA+ as a language forces you to think about the problem you’re modelling from a very different perspective than how you’d actually write it in an OO/FP/whatever language, and in my experience so far, that has been a very powerful thinking tool. TLA+ doesn’t constrain you to thinking about “how would I write this in Java”, but more in a “how should the communication patterns in this system work” way.
One of my favorite war stories was losing a client after we spent two days trying to model their problem and discovered they didn't actually know what they wanted and they decided to just scrap the project and do something else
> more in a “how should the communication patterns in this system work” way.
I guess that’s the point where I see so much overlap with tests. It seems like we’re talking about test inputs and test outputs.
Even with IPC or Distributed systems I often break problems down in a way that makes them testable in a local system with repeatable results. It’s not 100% perfect b/c you can’t account for all failure scenarios.
I do see your point though about constraints of the target language.
Testing one execution is a start but isn't enough in a nondeterministic distributed system. If you need to check that an invariant is held across all possible executions (e.g. node 3 is down and node 4 is suddenly faster than node 5), you could brute force that yourself but TLA+ has it built in.
That's exactly it. A test verifies one potential execution path. A TLA+ model verifies every execution path, in every ordering that those events could happen.
One of the tricky parts I found was figuring out what granularity to use when writing specifications.
The claim I read is that formal methods aren't tractable past X LoC (IIRC it was in the thousands). So you can directly verify a small application/module, or you can verify the high-level design of a larger thing, then write code to that spec. Can't find the source anymore though.
> we use = when declaring variables, while in the algorithm itself we use :=. Outside of variable definition = is the comparison operator. Not equals is written as /=
I feel like some syntax that works some way only in some section isn’t ideal. Not a knock, it’s just apparently a problem only <your favorite language> gets right.
The linked page even says "It’s the software equivalent of a blueprint" - how does one go from that blueprint to something that functions?