A hands-on introduction to static code analysis

UncleMeat · on May 4, 2020

It's good to see discussions of static analysis, but I often feel that these blog posts do a disservice to the techniques. The post leads by mentioning applications like bugfinding and security vuln detection but the examples here are barely above local syntactic checks. This is the common scenario in the majority of blog posts I see about static analysis, probably because it is just much easier to put together a quick write up on AST-linting. Heck, this article has a diagram that directly states that an AST is the input to a static analysis module, but that is true only for some kinds of things!

AST level analysis is certainly useful. Everybody should be using some sort of style checker. But AST pattern matching is such a different technique from the stuff used to do bugfinding that I worry these blog posts will give the wrong impression about what static analysis can and can't do.

I'd love to see blog posts about interprocedural pointer analysis, for example.

rj722 · on May 4, 2020

Article author here. Agreed that the post merely scratches the surface of static analysis -- it was aimed at an audience looking for an introduction. The scope of the examples in this post had to be limited for this reason.

Inter-procedural pointer analysis -- yes, a lot trickier than these, but definitely juicier. Will try to write a post on it in the coming weeks.

UncleMeat · on May 4, 2020

I think limiting the scope is fine in general. But one small suggestion would be to make it more clear that this is just one very simple technique. This does not come across at all in the blog post. The diagram you show, for example, seems to state that this is just how static analyses work - they are given ASTs to work with. Or at the very least include some examples of semantic properties. It seems incongruent when you describe static analysis as understanding the behavior of the program without running it and then use examples that are about syntactic style violations.
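To make the syntactic/semantic gap concrete, here is a minimal sketch (not the article's code) using Python's standard `ast` module. The function name `literal_zero_divisions` and the two sample programs are made up for illustration. The check is a pure AST pattern match: it flags division by the literal `0`, and it is blind to the same bug once the zero flows through a variable.

```python
import ast

def literal_zero_divisions(source):
    """Return line numbers where the divisor is the literal 0 (pure pattern match)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.BinOp)
                and isinstance(node.op, (ast.Div, ast.FloorDiv, ast.Mod))
                and isinstance(node.right, ast.Constant)
                and type(node.right.value) in (int, float)
                and node.right.value == 0):
            hits.append(node.lineno)
    return sorted(hits)

# Same bug twice: once visible in the syntax, once only visible by tracking values.
SYNTACTIC = "x = 10 / 0\n"
SEMANTIC = "y = 0\nx = 10 / y\n"

print(literal_zero_divisions(SYNTACTIC))  # [1]
print(literal_zero_divisions(SEMANTIC))   # []
```

Catching the second case means reasoning about which values `y` can hold at the division, which is where dataflow-style analyses come in.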

onemoresoop · on May 4, 2020

The article is great, and it is clearly intended for beginners. Everything is explained at a beginner's level, which is good. A second part would be very welcome.

kungato · on May 5, 2020

Hey, I'll be doing a college project on static analysis, and while I'm familiar with semantic analysis wrt compiling, I was wondering if you might drop a few more terms like interprocedural pointer analysis so that I have more techniques to research.

itsspring · on May 4, 2020

I want to read more on this topic. Have you written about this anywhere, or do you have a pointer/suggestion?

chas · on May 4, 2020

This article gets more into actual analysis of program state and execution: http://matt.might.net/articles/intro-static-analysis/

If you want to go deeper, Principles of Program Analysis is a popular reference: https://www.amazon.com/dp/3540654100/

dtornabene · on May 5, 2020

I would not recommend POPA to people wanting to go down this road; it's an extremely difficult text. Personally, just my 2 cents here, a far more useful text would be Practical Binary Analysis, https://practicalbinaryanalysis.com/ The Cousots' text is fascinating but requires a level of mathematical maturity at virtually post-doc researcher levels.

chas · on May 5, 2020

Principles of Program Analysis isn't the Cousots' text, but it does make significant use of abstract math. In particular, it uses tools from order theory[0] to describe many program analysis algorithms as finding fixpoints of functions between lattices[1].

This is useful because it reduces many program analysis design questions to questions of which lattice to use. It also allows you to compare algorithms by comparing their lattices, which makes it easier to see how algorithms are related.

The cost is that this approach will be pretty alien if you don't have experience with abstract algebra or related fields. If you do have that experience, I don't think it requires mathematical maturity beyond an undergraduate level.

[0] http://matt.might.net/articles/partial-orders/

[1] https://en.wikipedia.org/wiki/Lattice_(order)
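For readers who have not seen the lattice framing before, here is a toy sketch of it, not taken from the book. It assumes a flat "sign" lattice (`bot` below `-`, `0`, `+`, all below `top`) and a hypothetical one-variable loop `x = init; while ...: x = x + step`. The analysis result at the loop head is the least fixpoint of one function on that lattice, found by iterating from the bottom element.

```python
BOT, NEG, ZERO, POS, TOP = "bot", "-", "0", "+", "top"

def join(a, b):
    """Least upper bound in the flat sign lattice: bot <= {-, 0, +} <= top."""
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP

def add(a, b):
    """Abstract addition on signs."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else TOP  # (+)+(+) is +, (-)+(-) is -, anything else is unknown

def fixpoint(f, start=BOT):
    """Kleene iteration: apply f from the bottom element until the value stops changing."""
    current = start
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt

def sign_at_loop_head(init, step):
    # Abstract value of x at the head of:  x = init; while ...: x = x + step
    return fixpoint(lambda head: join(init, add(head, step)))

print(sign_at_loop_head(POS, POS))   # +
print(sign_at_loop_head(ZERO, POS))  # top
```

The second result shows the "which lattice" design question in miniature: this lattice has no "non-negative" element, so joining `0` and `+` jumps straight to `top`. A richer lattice would keep more precision at the cost of more cases to handle.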

dtornabene · on May 5, 2020

You're correct on the Cousot text, thank you. I stand by the assertion that if people want to go beyond the simple PA described in the article a far better and more approachable text is the binary analysis one I listed. Practical hands on experience that doesn't require a math major in uni is a good thing!

saagarjha · on May 4, 2020

The kinds of analyses mentioned here are typically grouped under "linting"; more advanced static analysis tools will typically do things like dataflow analysis.

dmos62 · on May 4, 2020

I too would be interested in interesting static code analyses (that are beyond linting).

saagarjha · on May 4, 2020

Terms which you might find useful to search for are "dataflow analysis", "abstract interpretation", and "taint checking". A basic background in compiler optimization would generally be helpful.
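As a rough illustration of the last term, here is a deliberately tiny taint-checking sketch over Python source, built on the standard `ast` module. The source and sink sets (`input`, `eval`, `exec`) and the name `find_tainted_sinks` are assumptions chosen for the example. It only handles top-level, straight-line statements with simple assignments; real tools also handle branches, loops, function calls and the heap.

```python
import ast

SOURCES = {"input"}       # assumption: calls whose result is untrusted
SINKS = {"eval", "exec"}  # assumption: calls that must not receive untrusted data

def is_tainted(expr, tainted):
    """An expression is tainted if it reads a tainted name or calls a source."""
    for node in ast.walk(expr):
        if isinstance(node, ast.Name) and node.id in tainted:
            return True
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in SOURCES):
            return True
    return False

def find_tainted_sinks(source):
    """Return line numbers where tainted data reaches a sink (straight-line code only)."""
    tainted, findings = set(), []
    for stmt in ast.parse(source).body:  # statements in program order
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in SINKS
                    and any(is_tainted(arg, tainted) for arg in node.args)):
                findings.append(node.lineno)
        if isinstance(stmt, ast.Assign):
            value_tainted = is_tainted(stmt.value, tainted)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    # Assignment propagates taint, or clears it on overwrite
                    if value_tainted:
                        tainted.add(target.id)
                    else:
                        tainted.discard(target.id)
    return findings

PROGRAM = '''\
name = input()
greeting = "hello " + name
safe = "1 + 1"
eval(safe)
eval(greeting)
'''

print(find_tainted_sinks(PROGRAM))  # [5]
```

Note that nothing on line 5 looks suspicious syntactically; the finding comes from following the value from `input()` through `name` and `greeting`.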

g_delgado14 · on May 4, 2020

Any beginner friendly articles on more advanced analysis that you'd recommend?

saagarjha · on May 4, 2020

Don't have any materials to point to, sadly. Most of the knowledge in this field is locked up in papers and tools; I was lucky to learn most of what I know from a graduate class taught by a professor working on static analysis in V8 and working with/on software security tooling. To begin with, I'd suggest first brushing up on compiler optimizations (which is largely separate from parsing) and that should lead you to dataflow analysis techniques.

kaidon · on May 4, 2020

Maybe a bit tangential, but still interesting:

https://cacm.acm.org/magazines/2010/2/69354-a-few-billion-li...

chas · on May 5, 2020

I think Matt Might's intro is relatively beginner-friendly depending on your familiarity with Scheme: http://matt.might.net/articles/intro-static-analysis/

jjtheblunt · on May 4, 2020

https://en.wikipedia.org/wiki/Static_single_assignment_form

UncleMeat · on May 4, 2020

While computing phis for SSA does require dataflow analysis, SSA itself is not tremendously useful. The natural follow-up to this would be "so what?" Something like live variable analysis is probably a much better first introduction to dataflow analysis since its application is much more obvious.

SSA is also not even universal among IRs for static analysis at this point. Heap-SSA is growing in popularity for complex dataflow problems involving fields.
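Since live variable analysis is suggested above as a first dataflow problem, here is a minimal sketch of it. It assumes a hand-written control-flow graph where each basic block is summarised by a `(use, define)` pair of sets; the block names, the `BLOCKS`/`SUCC` tables and the toy loop are made up for illustration. It is a backward analysis: information flows from successors to predecessors until nothing changes.

```python
def liveness(blocks, succ):
    """blocks: name -> (use, define) sets; succ: name -> list of successor names.
    Returns (live_in, live_out) computed by iterating to a fixpoint."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, define) in blocks.items():
            # live_out is the union of live_in over all successors
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            # live_in: used here before being redefined, or live afterwards and not redefined
            inn = use | (out - define)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# entry: i = 0; total = 0; unused = 42
# loop:  if i < n goto body else goto exit
# body:  total = total + i; i = i + 1
# exit:  return total
BLOCKS = {
    "entry": (set(), {"i", "total", "unused"}),
    "loop": ({"i", "n"}, set()),
    "body": ({"total", "i"}, {"total", "i"}),
    "exit": ({"total"}, set()),
}
SUCC = {"entry": ["loop"], "loop": ["body", "exit"], "body": ["loop"], "exit": []}

live_in, live_out = liveness(BLOCKS, SUCC)
print(sorted(live_out["entry"]))                       # ['i', 'n', 'total']
print(sorted(BLOCKS["entry"][1] - live_out["entry"]))  # ['unused']
```

The application is the obvious one mentioned above: `unused` is assigned in `entry` but is not live afterwards, so the store is dead. (Working at block granularity is only safe here because `entry` does not read its own definitions; a statement-level version avoids that caveat.)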

dtornabene · on May 5, 2020

Going to drop a toplevel comment and say that while this is interesting (sincerely!), if people are interested in deeper tools/techniques, the book Practical Binary Analysis is excellent; it ends with taint checking and symbolic execution techniques, and uses Pin. https://practicalbinaryanalysis.com/

Also worth checking out is BAP, the Binary Analysis Platform, which is the successor project to BitBlaze, and is one of the most fascinating binary analysis frameworks out there for my money. It was the only one of the DARPA CGC entries that ran on real binaries, not the much less complicated ones developed specifically for the challenge.

https://github.com/BinaryAnalysisPlatform/bap

saagarjha · on May 5, 2020

I’m unsure of what you mean: while I did not participate in CGC personally, IIRC they used a custom platform that teams had to retool for. How would an entry that runs “on real binaries” be useful for this situation?

dtornabene · on May 5, 2020

Because the test binaries they used were not close enough to reality to test finding real vulnerabilities, and BAP can run on real binaries, which seems useful if you want to learn static binary analysis.

flohofwoe · on May 4, 2020

Slightly tangential to what the article is about, but at least in the C/C++ world, the most important change to make static analysis popular for "the rest of us" was probably Xcode's decision to integrate clang analyzer right into the Xcode UI under a menu item (Xcode doesn't do many things right, but this is definitely one of the very good features).

This way, analyzing the code is a simple "button press" and works out of the box on every Xcode project.

Soon after, Microsoft followed suit in Visual Studio (even though in my experience, the MS analyzer doesn't catch quite as many things as the clang analyzer).

Before that, static analyzers were those no doubt useful but obscure "magic tools" which were very hard to integrate into an existing build process.

Even the most useful tool will be ignored when it is hard to use.

saagarjha · on May 4, 2020

Somewhat annoyingly, the static analyzer that ships with Xcode doesn't seem to be packaged separately, e.g. in the command line tools…

flohofwoe · on May 4, 2020

Hmm, command-line clang accepts a --analyze option here ("Apple clang version 11.0.0"), and this seems to give additional output over the regular warnings. I'm not sure if that's the same thing as the analyzer integrated into Xcode, but some sort of static analyzer seems to be there.

saagarjha · on May 4, 2020

Oh, I will have to try that. Thanks for sharing!

tasty_freeze · on May 4, 2020

Same with the profiling tools.

pwaivers · on May 4, 2020

Thanks for this article, dolftax! I followed all the examples on my machine with no problem, and I learned some new stuff.

I have a question: how difficult is it to implement the AST? It seems like that is the bulk of the work for this kind of static code analysis.

rj722 · on May 5, 2020

"Crafting Interpreters" by Bob Nystrom (https://craftinginterpreters.com). Although the book falls short in covering static analysis (obviously), the implementation of an AST is covered in detail.

saagarjha · on May 4, 2020

For this kind of (read: simple) static analysis, yes, the design and generation of the AST is the dominant factor. For more advanced techniques, you'd usually start from an existing AST so you don't have to deal with parsing, and then build on that.

ecuaflo · on May 4, 2020

For "Detecting unused imports", why not record the line numbers on the first pass as well? Then we don't need to traverse the tree again
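A single-pass version along those lines might look like the sketch below. This is not the article's implementation; `unused_imports` and the sample source are made up for illustration, and it assumes a whole-module check with no handling of `__all__`, re-exports, or names used only inside strings.

```python
import ast

def unused_imports(source):
    """Return {name: line} for imported names never referenced, using one tree walk."""
    imported = {}  # bound name -> line number of the import statement
    used = set()   # every plain name referenced anywhere in the module
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports bind unknown names; skipped in this sketch
                # `import a.b` binds `a`; `import x as y` binds `y`
                bound = alias.asname or alias.name.split(".")[0]
                imported[bound] = node.lineno
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return {name: line for name, line in imported.items() if name not in used}

SAMPLE = (
    "import os\n"
    "import sys\n"
    "from json import loads as parse\n"
    "print(os.getcwd())\n"
)

print(unused_imports(SAMPLE))  # {'sys': 2, 'parse': 3}
```

Recording the line number next to each imported name during the walk means the report can be built from the dictionary alone, with no second traversal.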