Unfortunately, this presents only rather old, naive notes on decompilers, with little discussion on even the advances made by Mike Van Emmerik's decompiler work, let alone newer developments like Phoenix, DIVINE, or TIE. There's actually been a lot of work on things like variable, type, and structure recovery in the past decade, but it's usually not presented as decompilers but rather static binary analysis.
Probably the easiest way to find relevant papers is to do reverse-citation on Cifuente's work ("Reverse Compilation Techniques" and "Structuring Decompiled Graphs").
All of the papers I've mentioned can be found in citations from those two sources.
As for the state-of-the-art, the basic disassembly algorithm has been well-known for a while: recursive disassembly, which is the sort of obvious algorithm you'd think of yourself if left to your own devices. The practical answer is generally "use IDA," because indirect control flow boils down to "know every single pattern that compilers generate," and IDA already knows every single pattern. Structuring control flow graphs is more or less complete in the same sense as parsing is, which is to say that we know a few different ways to do it which all suck in their all special way, but you're not going to find fertile ground to do new things, as the existing solutions are generally good enough.
The real hard part of decompilation remaining, then, is variable and type recovery. We sort of know how to do this if you assume structs, unions, and arrays don't exist (which is to say, it's more or less solved if you were to try this on CIL or JVM bytecode), but those are a massive stumbling block in native executable decompilation. I rather suspect this is going to remain unsolved until we get a good pointer and dataflow analysis on binaries that can handle integer/pointer duality and type unsafety, which puts it at least a decade away in my estimation.
Pattern-independent structuring (from the above paper) is implemented by fcd, an LLVM-based native program optimizing decompiler: https://github.com/zneak/fcd/
Related blog posts have some details and examples:
There is some very shady SEO going on here. The French translation link goes to a Wordpress blog with a very bad translation of the introduction. The root of the translation website is a completely unrelated Dutch used car website.
Other pages on the translation blog follow the same pattern: an old webpage (this HN entry is from 2009, but other pages go back to 2000) is edited years later to link back to a bad translation in some random language.
This is very far from my area of expertise, so I'm curious to know if this is well-known, and if those pages were hacked or if this was inserted by the owner himself.
EDIT: This is fascinating. I'm finding dozens of similar blogs, with links spanning at least a thousand hosts, which I'm assuming were hacked. This is a big network, with sophisticated wording designed to look as innocent as possible.
The content seems to be written by a hobbyist decompiler developer. Unfortunately, the state of the art that it presents seems to rely heavily on the Program Transformation Wiki pages, which appear to have become rather defunct (its pages, on further research, appears to have been primarily maintained by Mike Van Emmerik, and both Van Emmerik and Cristina Cifuentes appears to have moved on from binary decompilation a while ago).
As a result, the academic state of the art that it's relying on is a decade out of date. While decompilation is in nowhere near a solved problem, there have been advances in variable and type recovery in academic literature that the author of these pages doesn't seem to be aware exists.
The content of the pages mostly seems to be the sort of basic material that I wouldn't even bother citing in a research paper. I'm not sure there's any information in there that you wouldn't find in, say, the dragon book.
I'll bite: is there anything novel presented over these book-like pages, or it's just an introductory text?
The lack of navigation syncing up with the page I'm on makes it hard to tell if I'm 10% or 90% through the content. If there was a printer-friendly view, it would be much easier to evaluate whether this content is relevant to me or not.
My recollection of IDA is that it completely keels over if you give it structures or arrays (unless you take the time to manually teach it about every little thing), and a colleague of ours has confirmed that Hex-Rays does a horrible job here as well in a similar situation. The exact example was:
typedef struct { int x, y; } Point;
int main() {
int i;
Point p[5];
for (i = 0; i < 5; ++i) {
p[i].x = 1;
p[i].y = 2;
}
return p[0].y;
}
which Hex-Rays turned into:
int __cdecl main(int argc, const char argv, const char envp) {
int v3; // ebp@0
signed int v4; // ecx@1
int v5; // eax@1
v4 = 0;
v5 = v3 - 40;
do {
(_DWORD )(unsigned int)v5 = 1;
(_DWORD )(unsigned int)(v5 + 4) = 2;
v5 += 8;
++v4;
} while (v4 < 5);
return (_DWORD )(unsigned int)(v3 - 36);
}
(Apologies for any typos, I'm transcribing this from a second computer).
To be fair, most decompilers or binary analysis tools tend to have massive problems in this regard. It's only really in the past decade or so that the research community has started to shift from heuristic-heavy, pattern-matching based approaches to more principled, general-purpose algorithms, and IDA Pro is best characterized as "every heuristic ever developed."
Well yes and no. You normally need to teach IDA once, and sometimes it needs a reminder. This means setting a proper prototype on the function before you attempt to decompile it.
> Binary Format Support: ELF, PE-COFF exe, DOS exe, a.out, MachO, raw binary, etc.
Sadly as far as I aware hex-rays is almost useless for DOS exe since it's not support 16-bit code and many applications are 16-bit. Is something changed in past few years?
I would be really happy if software I wanted to study had 32-bit version, but it doesn't. Though I may overestimate how many 16-bit only applications are there.