Famous Last Words: You can't decompile a C++ program
The Great Jack Schitt writes "I've always heard that you couldn't decompile a program written with C++. This article describes how to do it. It's a bit lengthy and it doesn't seem like the author usually writes in English, but it might just work (haven't tried it, but will when I have time)."
A C/C++ decompiler that totally worked would be the Holy Grail of crackers. Unfortunately it is actually impossible to get everything back because lots of info is lost on compilation.
Nevertheless there are tools out there that attempt to decompile programs; I think of them more as ways of making assembly more readable.
Note, a lot of them wouldn't work on hand-written assembly, because they rely on knowledge of how certain compilers compile various things; e.g. there was a Delphi decompiler available.
graspee
I can't count the number of times I've been frustrated with the performance or process of an application that I had to interface with, and just wondered: *why* in god's name, or *what* in god's name are they doing in there.
but it'll look like this
class a
{
public:
void b(int c);
void d(int e);
private:
int g;
int h;
};
int main()
{
a f;
f.b(23);
int x; x=0; x++;
if(x > 3) goto j;
f.d(x); x++;
if(x > 3) goto j;
f.d(x); x++;
if(x > 3) goto j;
f.d(x);
j: f.b(42);
return 0;
}
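For comparison, here is a sketch of one source form that a loop-unrolling compiler could plausibly have turned into the goto version above. The original source isn't shown in the thread, so the for loop, the `Recorder` class and both function names are assumptions made purely for illustration.

```cpp
#include <vector>

// Hypothetical stand-in for class "a": it records every call so the two
// versions can be compared afterwards.
struct Recorder {
    std::vector<int> calls;
    void b(int c) { calls.push_back(c); }
    void d(int e) { calls.push_back(-e); }  // negated to tell d() from b()
};

// A guess at what the programmer might have written.
std::vector<int> original_guess() {
    Recorder f;
    f.b(23);
    for (int x = 1; x <= 3; ++x) {
        f.d(x);
    }
    f.b(42);
    return f.calls;
}

// The shape a decompiler might hand back: loop unrolled, gotos, no names.
std::vector<int> decompiled_shape() {
    Recorder f;
    f.b(23);
    int x; x = 0; x++;
    if (x > 3) goto j;
    f.d(x); x++;
    if (x > 3) goto j;
    f.d(x); x++;
    if (x > 3) goto j;
    f.d(x);
j:  f.b(42);
    return f.calls;
}
```

Both functions make the same sequence of calls, which is all a compiler is required to preserve. A `while` loop or three hand-written calls would fit the binary equally well.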
He's quite right.
... now there are infinite possible combinations of what a and b can be ... but without the correct variable names, or the commenting that went along with the code (assuming there was some) ... the decompiled output is going to be pretty much useless / extremely difficult to understand
Take a sum within a program, for example (a+b)=1000
Here is some code that supposedly decompiles... not that I've tried it.
Quote from the FAQ:
I would have posted AC but they have me blocked out for some reason...
Davak
What's to say you need something as readable as the original? I worked at InterAct Accessories/GameShark, essentially as a 'reverse engineer', for a few years before they went under. Without getting yet another C&D from them in the mail due to a post on Slashdot (I don't even think they could send one now they're out of business?), all I can say is sometimes when hacking a game it benefits an engineer to decompile the application and be able to set breakpoints and watch execution flow while the game is running on, for example, a PlayStation 2. Sure it's going to be a lot of nearly unreadable C++ mixed with Assembly, but if you can watch the execution flow as you do something, it can be useful.
Of course a lot of naive people think decompiling would allow you to take an application and start writing patches for it, in that case you are right, it's going to be pretty useless. However it's not entirely useless for all situations. I'm sure the WINE guys might get some use out of it.
..There's a-dooin's a-transpirin'
[insert joke about it being hideously ugly with templates here.]
(I did not read the article itself because it is, of course, slashdotted)
The cake is a pie
A library we were basing a major portion of our code on had a bug in it (a Listener class failed to implement EventListener if I remember correctly) which kept our code from working. Removed offending classes from archive, decompiled, fixed, and recompiled.
It's educational...the ol' "how'd they do that?". I've never taken code and used it but I found it instructional to look at how someone made a Swing text area from scratch, e.g.
The challenge...one program I installed had an "enter registration key" prompt and I was curious how that was handled (turned out to be a static string). Then there was this applet that was the core of a company's business. Free, or pay and get more features. As it turns out the control of the features all resided in the applet, so change a couple of switch and if/then statements and voila, administrative privileges. Didn't use it for evil, much... :) They've since come out with a new version and I've been too busy using my mad java skillz on contract work to take a look at their code.
Looking at security was instructional too, though, for when I was project lead on a commercial Java app I knew what worked and what didn't (we ended up using the Wibu key).
Even with complete original source code, understanding a non-trivial C++ application is very difficult. Source derived from an optimized executable is going to be a LOT rougher. No real function names, module names, variable names, or comments. Use of standard libraries (STL, MFC, Boost) is likely highly obscured as well. A tool like this would probably produce source that looks more like a C/machine language hybrid rather than normal C++. The primary use of something like this is if you are looking for a very specific piece of logic such as a password check or an encryption operation or protocol details. When were these famous last words anyway?
Uh, no. Compilation produces assembly, and then the (sometimes integrated) assembler assembles it into machine language (not binary). Forget what switch it is, but gcc even lets you see what asm code it is generating.
You can decompile any program. A compiled program is just your high-level program translated into machine language. There is no sort of magical encryption or similar transformation that it undergoes once you compile it.
All you need to do is read in the bytes of any binary program, interpret the bytes as their machine language equivalents for whatever platform you are using, and then convert your MOV statements to assignment operators, JMP statements to higher level loop structures, etc.
Of course, you won't retain the names of identifiers, which are referred to only by memory locations in a compiled program; and some control structures might be rearranged due to compiler optimization and the lack of machine language equivalents, but the meat and potatoes of it is all right there.
It's by no means easy to accomplish, especially with higher and higher level programming languages, but impossible? humbug! =)
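The first step described here, mapping individual instructions onto C-like statements, can be sketched in a few lines. This is a toy under stated assumptions: the three-operation instruction set, the `Instr` record and the `lift` function are all made up for illustration and don't correspond to any real machine or tool.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical three-operation toy instruction set, not real x86.
enum class Op { Mov, Add, JmpIfLess };

struct Instr {
    Op op;
    std::string dst;  // destination register or memory slot
    std::string src;  // source operand (register, slot, or constant)
    int target;       // jump target index, used only by JmpIfLess
};

// Translate each instruction into one C-like statement. Identifier names
// are gone, so operands stay as raw slots such as "mem_10".
std::vector<std::string> lift(const std::vector<Instr>& code) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Instr& in = code[i];
        const std::string label = "L" + std::to_string(i) + ": ";
        switch (in.op) {
        case Op::Mov:
            out.push_back(label + in.dst + " = " + in.src + ";");
            break;
        case Op::Add:
            out.push_back(label + in.dst + " += " + in.src + ";");
            break;
        case Op::JmpIfLess:
            out.push_back(label + "if (" + in.dst + " < " + in.src +
                          ") goto L" + std::to_string(in.target) + ";");
            break;
        }
    }
    return out;
}
```

Feeding it a move, an add and a backwards conditional jump yields an assignment, a `+=` and an `if (...) goto`. Recognising that goto as a loop is the harder part, and this sketch doesn't attempt it.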
excuse me..!!
just leave Delphi out of it. Delphi is a true OOP language; you should do some research before coming up with a gross generalization like that.
The thing is, the point of a decompiler is to make the code readable. If you don't particularly care how readable the code is, then your standard disassembler is usually good enough.
Incidentally, you can't even theoretically create a perfect disassembler, at least on the x86 instruction set. The nature of the complex instruction set means that an arbitrary string of bytes can be decoded into a wide variety of programs, especially when you throw in the possibility of self-modifying code, and all that other garbage. It's a little better on RISC with fixed, word-aligned instruction sizes. Some minor problems would still exist, but they wouldn't be much of a hindrance to a practical "good-enough" disassembler.
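A small sketch of the variable-length problem: the same bytes decode into different instruction streams depending on where decoding starts. The decoder below is a toy that knows only three real x86 opcodes (0x90 NOP, 0xC3 RET, 0xB8 MOV EAX, imm32). The `decode` function and its output strings are assumptions for illustration, not how any real disassembler works.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a byte stream linearly from `start`. Anything other than the three
// known opcodes is reported as "(unknown)" and skipped one byte at a time.
std::vector<std::string> decode(const std::vector<std::uint8_t>& bytes,
                                std::size_t start) {
    std::vector<std::string> out;
    std::size_t i = start;
    while (i < bytes.size()) {
        const std::uint8_t op = bytes[i];
        if (op == 0x90) {
            out.push_back("nop");
            i += 1;
        } else if (op == 0xC3) {
            out.push_back("ret");
            i += 1;
        } else if (op == 0xB8 && i + 4 < bytes.size()) {
            // The next four bytes are an immediate operand, not opcodes.
            out.push_back("mov eax, imm32");
            i += 5;
        } else {
            out.push_back("(unknown)");
            i += 1;
        }
    }
    return out;
}
```

For the bytes `B8 90 90 90 C3 C3`, starting at offset 0 gives a `mov` followed by a `ret`, while starting at offset 1 gives three `nop`s and two `ret`s. Without knowing the real entry point, a static tool can't tell which reading is intended.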
Not to say that creating a workable disassembler is impossible. However, usually more valuable is a debugger with a disassembled output. In this case, you know the program counter's value, so you can deterministically disassemble the program (up to a point). This is generally all you really need to do reverse engineering. Throwing in a decompiler on top of all this generally doesn't help somebody who is fairly experienced reading a disassembly, although I suppose it could be of help to somebody who's more familiar with C++ than assembly mnemonics.
On the other hand, it's not that hard for somebody to pick up just enough assembly to figure out what's going on, especially if they're technically sophisticated enough to be going to all the trouble of stepping through the program to try and figure out how it works.
So just to reiterate, decompilers are generally not all that valuable.
First off, there is no "decompiling" going on here. That would imply that you will end up with code having a semi-resemblance to the original code - which is certainly not happening. What is going on here is simply just another compilation phase. This time, instead of an object file target compliant with the system ABI, you are getting a C/C++ file target which should theoretically be compilable into a program that will generate the same output for the same runtime input. The scope of effort and implications barely overlap as they are so vastly different.
Of course, with C++, being a strongly typed language that resolves so many things at compile time, decompilation is not possible for any non-trivial example (which all the examples in the link were- indeed they didn't use any C++ features at all). This is even ignoring the effects of compiler optimizations. The C++ language is far more expressive than the output dialects of the compiler making the whole idea of decompiling silly. C, on the other hand, is basically a platform-independent assembly language which is why the one-to-one examples of C and asm output seem to imply one can move back and forth between the two at will. Still this is a mistaken impression.
Now - is compilation from object code to (non-equivalent but functionally similar) C code useful and interesting? Certainly. And all compiler developers and most hard core debuggers can do this pretty much at will. It's the only way to check the correctness of your compiler and its generated code and, in desperate circumstances, can give you some clue as to what an existing application for which you have no source is doing. This is called reverse engineering, btw, NOT decompilation. Unfortunately the material pointed to here provides absolutely no new insights and is quite rudimentary at best. Anyone intimately familiar with their compiler and environment already has more knowledge than this paper provides. Really doesn't justify a slashdot posting but I guess whoever posted it simply isn't a C/C++ developer.
A friend of mine work(ed) with a company in Kingston, ON that was spun off from Queen's University. Their sole purpose and business model is to take whatever binaries and source a company has available, run it through their cluster of analysis systems, and produce a "clean" update of the system. As per usual, there is about 10-15% of the produced code that needs some hand inspection and tweaking to complete the task.
Their "big" business was the Y2K work, as their software isn't limited to just reverse-engineering, but can also refactor the re-engineered code (e.g. change all "year" values in the system from 2 digit to 4 digit, updating all related I/O formatting functions, overlay structures, etc.)
On the flip side, their stuff involves complex pattern matching and heuristics that put any other system I've heard of to shame. It requires clusters of systems running for days to do the initial code analysis. (OTOH, it probably took years to create the original code.)
I can't provide more specifics on the company because they're having some legal issues with co-investors.
I do not fail; I succeed at finding out what does not work.
The article (link provided for those who don't read URLs) is wrong, even in the first section.
The title of the first "chapter" is "Why is c++ Decompiling possible?". But immediately he lists "what is totally loss when you compile a program and what stays there".
In the Lost column he puts templates and classes. The remains list has things like function calls and local variables.
Well, guess what? Those things that are "lost" are everything that distinguishes C++ from C. If you don't have classes (meaning no inheritance or virtual functions either) and don't have templates either, then you're really just programming in "a better C", not C++.
So all his approach can hope to "decompile" is C code. Which is something we've seen done in various forms for decades.
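To make the point concrete, here is a rough sketch of how a simple class looks once the class-ness is gone. The `Counter` class and the C-style names `Counter_c`, `field_0` and `Counter_add` are invented for illustration. A real compiler emits mangled symbols and may inline the whole thing.

```cpp
// What the programmer writes: a class with a member function.
class Counter {
public:
    explicit Counter(int start) : value_(start) {}
    int add(int n) { value_ += n; return value_; }
private:
    int value_;
};

// Roughly what survives compilation: a plain record and a function taking
// the object's address as a hidden first argument. Access control
// (public/private) leaves no trace in the generated code.
struct Counter_c {
    int field_0;
};

int Counter_add(Counter_c* self, int n) {
    self->field_0 += n;
    return self->field_0;
}
```

Calling `add` on a `Counter` and `Counter_add` on a `Counter_c` with the same starting value gives the same results, so a tool looking only at behaviour has no reason to prefer the class form.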
Well, you can decompile every binary program at least to assembler code, so why shouldn't it be possible with C++?
There's a huge difference between disassembling and decompiling. With assembly, you generally have a 1 to 1 correspondence between machine language instructions and assembly instructions. That is, one specific instruction you feed to the assembler becomes one specific assembled instruction. Sometimes it's more complicated than this, but only slightly.
Now look at C, where one line of code could be arbitrarily many opcodes, depending on the complexity of the logic within that line (and the length of the line). Now suddenly, instead of looking at one instruction and translating it back to its equivalent, your decompiler has to look at possibly hundreds of instructions, parse them logically and figure out where each line starts, and ends, and what the logical purpose of each set of instructions is. Then there's dealing with structures (or in C++, objects) where you have to come up with a definition for how data is laid out based solely on the instructions for dealing with that data.
That's quite a bit more complicated. I sure as hell couldn't do it. I know I could write an assembler or disassembler, I might be able to write a simple compiler, but there's no way in hell I could write a functional decompiler.
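The structure problem can be sketched like this: the source talks about named fields, while the machine code only reads integers at fixed offsets from a base address. The `Point` struct and both function names are assumptions for illustration. A real binary would contain a literal offset such as 4 where this sketch uses `offsetof` to stay portable.

```cpp
#include <cstddef>
#include <cstring>

// The source-level view: named fields.
struct Point {
    int x;
    int y;
};

int sum_named(const Point* p) {
    return p->x + p->y;
}

// Closer to what the instructions say: load an int at offset 0 and another
// at a fixed offset from the same base pointer. Nothing here states that
// the two values belong to one structure or what they are called.
int sum_by_offset(const void* base) {
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    int first;
    int second;
    std::memcpy(&first, bytes + 0, sizeof first);
    std::memcpy(&second, bytes + offsetof(Point, y), sizeof second);
    return first + second;
}
```

A decompiler has to work backwards from the second form to something like the first, guessing field boundaries and types from how each offset is used.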
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
I know this guy. The sad thing is, he lives in the US, and as far as I know he's a native English speaker; I just can't understand a thing he says. I read this "book" a week or two ago when he finished it. I thought this was a very rough draft, but I guess not. I couldn't help but laugh at some things, like its irrelevance to C++ in general. He should have just used C, since he never even mentions a class.... Well, to be fair, he did mention classes when he describes what is lost in the compilation process, which is untrue, especially if it is a polymorphic class. In fact, I didn't see one thing in this article that would set it apart from one written on the same subject, except using C.
For a laugh, look at his other tutorials. Surprisingly, his "book" here is among some of the better material. Most have to do with C++, and some assembly, and some even cover the same material in this lengthy and pointless article. I especially like his tutorial on using Macros in C++, a concept so backwards and wrong it shouldn't even have to be mentioned. Sure, macros have uses, but with C++, you have real inline functions and constant variables, so why use them for anything besides #include? Anyway, his other works can be found on pscode.com.
What all this boils down to here, is that nothing new is said here. Not only that, but what is said is presented and worded so poorly that anyone reading it is either going to die of laughter or confusion. If you want to read something on reverse engineering, pick up the dragon book, an assembly book, a good disassembler, and some of the very nice documents on cracking software. Many of these are written by people who will be years ahead of you no matter how hard you work, people who actually know what they're talking about.
- Mik Mifflin
Been doing it for twenty years. It is easy to do.
Stop trying to use logic... actually do it.
This is such a grossly misinformed statement, I don't even know where to begin. Assembler and machine language ("binary") are semantically identical. You can go back and forth from assembler to machine code all day and still have the same thing. All you lose when going from human- or compiler-generated assembly to disassembled machine code is labels and comments.
With C++ or any high-level language, there are zillions of ways a compiler might interpret the code - just as long as the machine code effectively does what the C code says. Even identifying what compiler was used will not help - there are just so many ways to say the same thing in C. for, while, goto, case, it's all syntactic sugar that disappears when you compile.
You can make a decompiler which identifies various code structures and converts them to high-level representations, but it can't EVER know what the original source looked like.