Slashdot Mirror


Famous Last Words: You can't decompile a C++ program

The Great Jack Schitt writes "I've always heard that you couldn't decompile a program written with C++. This article describes how to do it. It's a bit lengthy and it doesn't seem like the author usually writes in English, but it might just work (haven't tried it, but will when I have time)."

82 of 479 comments

  1. You can't by Anonymous Coward · · Score: 5, Insightful

    Information is lost in compilation. You can never reconstruct the exact original source. You end up with valid C++ that has no more human-understandable information than the equivalent machine code.

    Like turning hamburgers into cows...

    1. Re:You can't by Morologous · · Score: 5, Funny

      Like turning hamburgers into cows...

      I'm going to use that line.
    2. Re:You can't by jezzgoodwin · · Score: 2, Informative

      He's quite right.

      Take a sum within a program, for example (a+b)=1000. There are infinitely many combinations of what a and b can be, and without the correct variable names or the comments that went along with the code (assuming there were some), the decompiled output is going to be pretty much useless, or at least extremely difficult to understand.

    3. Re:You can't by NewbieProgrammerMan · · Score: 5, Funny

      Heh. You're assuming that you're attempting to decompile something that had human-understandable source to start with. :)

      --
      [b.belong('us') for b in bases if b.owner() == 'you']
    4. Re:You can't by cperciva · · Score: 5, Funny

      We're talking about C++ here, not perl.

      Compiled C++ code can't be decompiled into anything approximating the readability of the original; compiled perl code can.

    5. Re:You can't by antis0c · · Score: 4, Informative

      What's to say you need something as readable as the original? I worked as essentially a 'reverse engineer' at InterAct Accessories/GameShark for a few years before they went under. Without getting yet another CND from them in the mail due to a post on Slashdot (I don't even think they could send one now they're out of business?), all I can say is that sometimes when hacking a game it benefits an engineer to decompile the application and be able to set breakpoints and watch execution flow while the game is running on, for example, a PlayStation 2. Sure, it's going to be a lot of nearly unreadable C++ mixed with assembly, but if you can watch the execution flow as you do something, it can be useful.

      Of course a lot of naive people think decompiling would allow you to take an application and start writing patches for it; in that case you are right, it's going to be pretty useless. However, it's not entirely useless for all situations. I'm sure the WINE guys might get some use out of it.

      --

      ..There's a-dooin's a-transpirin'
    6. Re:You can't by capnjack41 · · Score: 4, Insightful
      And then on top of that, the compiler optimizes that code, so calculations are no longer the straightforward and intuitive things they used to be; now they're a series of out-of-order, smaller calculations that are harder to recognize. They're efficient as hell but barely reversible.
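
      A small illustration of that point, as a sketch rather than the output of any particular compiler: optimizers commonly replace an unsigned division by a constant with a multiply and a shift. The function names below are made up for this example; the constant 0xCCCCCCCD is ceil(2^35 / 10).

```cpp
#include <cassert>
#include <cstdint>

// What the programmer wrote: a plain division by ten.
std::uint32_t divide_by_ten_source(std::uint32_t x) {
    return x / 10;
}

// Roughly what an optimizer may emit instead: multiply by a "magic"
// reciprocal, ceil(2^35 / 10) = 0xCCCCCCCD, then shift right by 35 bits.
// Nothing in this form says "divide by ten" to a casual reader.
std::uint32_t divide_by_ten_optimized(std::uint32_t x) {
    const std::uint64_t magic = 0xCCCCCCCDull;
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(x) * magic) >> 35);
}

// Hypothetical self-check: both forms agree on small values and on the
// extremes of the 32-bit range.
void check_agreement() {
    for (std::uint32_t x = 0; x < 1000; ++x) {
        assert(divide_by_ten_source(x) == divide_by_ten_optimized(x));
    }
    assert(divide_by_ten_optimized(4294967295u) == divide_by_ten_source(4294967295u));
}
```

      A decompiler that sees the second form has to recognize the idiom before it can print `x / 10` again, which is the kind of pattern matching the thread is arguing about.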

      I'll RTFA when it comes back to life :).

    7. Re:You can't by Anonymous+Brave+Guy · · Score: 2, Funny
      Compiled C++ code can't be decompiled into anything approximating the readability of the original; compiled perl code can.

      Yep; compiled Perl already approximates the readability of the original pretty well anyway. :-p

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    8. Re:You can't by Anonymous Coward · · Score: 2, Informative

      The thing is, the point of a decompiler is to make the code readable. If you don't particularly care how readable the code is, then your standard disassembler is usually good enough.

      Incidentally, you can't even theoretically create a perfect disassembler, at least on the x86 instruction set. The nature of the complex instruction set means that an arbitrary string of bytes can be decoded into a wide variety of programs, especially when you throw in the possibility of self-modifying code, and all that other garbage. It's a little better on RISC with fixed, word-aligned instruction sizes. Some minor problems would still exist, but they wouldn't be much of a hindrance to a practical "good-enough" disassembler.
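
      To make the variable-length problem concrete, here is a toy sketch, not a real disassembler. It understands only three genuine x86 opcodes (0x90 nop, 0xC3 ret, 0xB8 mov eax, imm32); the function names and the sample byte string are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy linear decoder for three x86 opcodes only:
//   0x90 = nop (1 byte), 0xC3 = ret (1 byte), 0xB8 = mov eax, imm32 (5 bytes).
// Every other byte is reported as unknown data.
std::vector<std::string> decode_from(const std::vector<std::uint8_t>& bytes,
                                     std::size_t start) {
    assert(start <= bytes.size());
    std::vector<std::string> out;
    std::size_t i = start;
    while (i < bytes.size()) {
        const std::uint8_t op = bytes[i];
        if (op == 0x90) {
            out.push_back("nop");
            i += 1;
        } else if (op == 0xC3) {
            out.push_back("ret");
            i += 1;
        } else if (op == 0xB8 && i + 4 < bytes.size()) {
            out.push_back("mov eax, imm32");  // opcode plus 4 immediate bytes
            i += 5;
        } else {
            out.push_back("db ?");
            i += 1;
        }
    }
    return out;
}

// Sample bytes chosen for the example: B8 90 90 90 90 C3.
std::vector<std::uint8_t> sample_bytes() {
    return {0xB8, 0x90, 0x90, 0x90, 0x90, 0xC3};
}
```

      Decoding the sample from offset 0 gives a mov followed by a ret; decoding the same bytes from offset 1 gives four nops and a ret. Both readings are valid, and only knowledge of the real entry point tells you which one the CPU will see.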

      Not to say that creating a workable disassembler is impossible. However, usually more valuable is a debugger with a disassembled output. In this case, you know the program counter's value, so you can deterministically disassemble the program (up to a point). This is generally all you really need to do reverse engineering. Throwing in a decompiler on top of all this generally doesn't help somebody who is fairly experienced reading a disassembly, although I suppose it could be of help to somebody who's more familiar with C++ than assembly mnemonics.

      On the other hand, it's not that hard for somebody to pick up just enough assembly to figure out what's going on, especially if they're technically sophisticated enough to be going to all the trouble of stepping through the program to try and figure out how it works.

      So just to reiterate, decompilers are generally not all that valuable.

    9. Re:You can't by Paradise+Pete · · Score: 5, Funny
      I'd prefer... Like turning shit back into pizza.

      Clearly you haven't tried Domino's.

    10. Re:You can't by ryanr · · Score: 2, Insightful

      Who said the point of the exercise was to turn the code back into the original C++?

    11. Re:You can't by len_harms · · Score: 2, Interesting

      You probably could get very close.

      With straight C++ classes you probably could get something back resembling them. VC, which is the compiler he used, is a very regular compiler. I haven't looked at what VC does to templates, but I would be willing to bet it transforms them into type-specific classes and then into C. You would just need to run the preprocessor and see what it did to them.

      Inline functions, though, would be impossible to get back. But then again, they are inlined, so the code would be there, just not necessarily in the original form.

      The VC compiler is just a transform engine. It transforms from C++ to C to PCODE to ASM. Of course, that's five-year-old info, from when I used to care about what the compiler was doing to my code. Templates are probably similar.

      I'm sure the code that came back out of this thing would be UGLY. But if you look at the end of most exes shipped these days, most developers do not even bother stripping the exe anymore. You probably could even get back MOST of the class names and function names, maybe even the variables.

    12. Re:You can't by Waffle+Iron · · Score: 4, Funny
      Like turning shit back into pizza

      Here's how:

      Flush shit down toilet -> let shit mellow at sewage plant -> strain shit residue out of bottom of sewage vat -> haul to field -> spread on grass -> grass grows -> cow eats grass -> pull cow's udder, direct milk into bucket -> ferment milk to cheese -> shred cheese -> spread on dough -> Pizza!

    13. Re:You can't by Llywelyn · · Score: 3, Funny

      This reminds me of a statement I saw on /. a long time ago:

      Python: Executable Pseudocode
      Perl: Executable line-noise

      --
      Integrate Keynote and LaTeX
    14. Re:You can't by jkorty · · Score: 4, Insightful
      Information is lost in compilation. You can never reconstruct the exact original source

      So what? Doing reasonable interpolations in context is what brains are for. Example: IIRC, when the Morris Worm appeared in 1988, Gene Spafford examined the binary and reverse-engineered the C code, sprinkling it with meaningful comments and good variable and function names. When the original source became available, his turned out to be a cleaner program than the original. That is, he not only recreated the original in every way that counts, he overshot and did better than the original.

    15. Re:You can't by kevquinn · · Score: 2, Insightful
      It is perfectly possible to reverse-engineer a meaningful source from a given binary. It's certainly not easy, and of course you won't end up with the same variable names etc (unless the author kindly left in heaps of debug symbols etc), but that hardly matters. The point is that it is possible. Even templates are possible to decompile, given enough incentive; after all it's just fancy pattern matching.

      With regard to the original article: well, that was a bunch of obvious guff really; what you'd expect from high-school geeks of the type I was, some number of years ago. Of note is that it claimed to decompile C++, when actually it talked only of rather trivial C constructs, something that is a well understood practice already.

      Some relatively recent classic decompilation work was done by Cristina Cifuentes who put together a C decompiler that worked to a significant degree for common DOS-based compilers of the time. Effectively the job of "decompilation" can be thought of as "compilation" - instead of compiling C into ASM, you think of compiling ASM into C. Not as daft as it sounds, honest. You can download "dcc" from the above site to investigate further.

      Boomerang is a sourceforge project attempting to create a decompiler. Worth a look, as well.

      It's worth noting that there are a number of ways to "cheat". For example, it's often trivial to discover what compiler was used to generate a given object code, and there are usually masses of common library-type code that give you a leg up. Add to that the fact that a piece of code was generated by a compiler, and the problem of discovering what a given piece of object code does is drastically simplified: compilers add huge amounts of structure and predictability to the generated object code that can be absent in free-form handwritten assembler (and few people do that anymore!), and much can be made of this.

      On the code/data issue mentioned by others in this thread - although separating code/data in general from mixed binaries can be considered hard, in reality it's often quite feasible and even simple. After all, the CPU manages to work it out. Again, the fact that there are so many short-cuts you can take really helps.

      Of course, a quick cruise around the cracking community will turn up all sorts of ways and means to shortcut this sort of problem...

      Here are the results of a quick googling:

  2. Oop by Suffering+Bastard · · Score: 5, Funny

    it doesn't seem like the author usually writes in English

    Surely he now understands the English infinitive "to be Slashdotted".

    --
    "Molest me not with this pocket calculator stuff."
    - Deep Thought
  3. Why not? by bazik · · Score: 5, Insightful

    I've always heard that you couldn't decompile a program written with C++.

    Well, you can decompile every binary program at least to assembler code, so why shouldn't it be possible with C++?

    Maybe he meant "you can't decipher the source of a C++ program" ;)

    --
    One by one the penguins steal my sanity...
    1. Re:Why not? by BJH · · Score: 2, Interesting

      Actually, not quite true. Assembly code is usually considered to mean the mnemonic code intended for human (well, semi-human) consumption, whereas machine language is the actual binary opcodes and arguments.

      So, he's sort of right - you can decompile any binary program to assembler. It's usually called disassembly rather than decompilation, though.

    2. Re:Why not? by NoMoreNicksLeft · · Score: 2, Informative

      Uh, no. Compilation produces assembly, and then the (sometimes integrated) assembler assembles it into machine language (not binary). Forget what switch it is, but gcc even lets you see what asm code it is generating.

    3. Re:Why not? by GlassHeart · · Score: 2, Insightful
      you can decompile every binary programm at least to assembler code

      No. Assuming we're talking about software disassemblers here, not every program can be reliably disassembled. Disassemblers work mainly by following the execution paths of already disassembled code, so that they know exactly where a subroutine begins. In many instruction sets, instructions have variable length, and not starting your decoding on the right byte will be a big mistake that cascades on to the next instructions. Now, knowing this, all we have to do is to change the execution path without the disassembler knowing. A function pointer (address loaded at run-time) already presents a serious problem to a disassembler, but simply asking the user to enter the instruction address to jump to will completely defeat the automatic disassembler. There's no way for the disassembler to know what the user will enter, and hence where the program will go to next.

      Humans will still be able to disassemble your program, of course. However, you still won't get the original assembly source back. Assembly languages usually support macros and pseudo-instructions that improve readability, but have no correspondence in assembled form.

    4. Re:Why not? by arkanes · · Score: 2, Interesting

      Not directly, but inputting, say, the name of a function or command to call, looking that up in a table of function pointers, and executing the pointed-to function amounts to the same thing.
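
      A minimal sketch of that pattern, with handler names and table keys invented for this example. The call target is picked from a table at run time, so a tool that only follows direct calls has no single answer for where `dispatch` goes next.

```cpp
#include <cassert>
#include <map>
#include <string>

// Two hypothetical handlers; the names are assumptions for this example.
int add_one(int x) { return x + 1; }
int double_it(int x) { return x * 2; }

using Handler = int (*)(int);

// Looks the name up in a table of function pointers and calls the result.
// Returns -1 when the name is unknown.
int dispatch(const std::string& name, int arg) {
    static const std::map<std::string, Handler> table = {
        {"inc", add_one},
        {"dbl", double_it},
    };
    const auto it = table.find(name);
    if (it == table.end()) {
        return -1;
    }
    assert(it->second != nullptr);
    return it->second(arg);  // indirect call: the target is only known at run time
}
```

      If `name` comes from user input, the set of possible targets is still the table's contents, which is why a human (or a smarter tool that finds the table) can recover them even though a naive path-following disassembler cannot.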

    5. Re:Why not? by seanadams.com · · Score: 2, Informative

      This is such a grossly misinformed statement, I don't even know where to begin. Assembler and machine language ("binary") are semantically identical. You can go back and forth from assembler to machine code all day and still have the same thing. All you lose when going from human/compiler-generated assembly (vs disassembled machine code) is labels and comments.

      With C++ or any high-level language, there are zillions of ways a compiler might interpret the code, just as long as the machine code effectively does what the C code says. Even identifying what compiler was used will not help: there are just so many ways to say the same thing in C. for, while, goto, case: it's all syntactic sugar that disappears when you compile.

      You can make a decompiler which identifies various code structures and converts them to high-level representations, but it can't EVER know what the original source looked like.
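
      As an illustration of the "many ways to say the same thing" point, here are three functions (names invented for this example) that compute the same sum with for, while, and goto. Whether a given compiler emits identical machine code for all three is an assumption that depends on the compiler and its flags; the point is only that the loop keyword is not something a decompiler can read back.

```cpp
#include <cassert>

// Sum of 1..n written with a for loop.
int sum_for(int n) {
    int s = 0;
    for (int i = 1; i <= n; ++i) {
        s += i;
    }
    return s;
}

// The same computation written with a while loop.
int sum_while(int n) {
    int s = 0;
    int i = 1;
    while (i <= n) {
        s += i;
        ++i;
    }
    return s;
}

// The same computation again with explicit labels and gotos, which is
// closest to the compare-and-jump shape a disassembly would show.
int sum_goto(int n) {
    int s = 0;
    int i = 1;
loop:
    if (i > n) goto done;
    s += i;
    ++i;
    goto loop;
done:
    return s;
}

// Hypothetical self-check that the three forms agree.
void check_equivalence() {
    for (int n = 0; n <= 50; ++n) {
        assert(sum_for(n) == sum_while(n));
        assert(sum_for(n) == sum_goto(n));
    }
}
```

      A decompiler looking at the compiled result has to pick one of these shapes to print, and it has no way to know which one the author typed.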

  4. Re:Why by czion3 · · Score: 2

    Because you lost the source and forgot to make a backup.

  5. hmm by Graspee_Leemoor · · Score: 5, Informative

    A c/c++ decompiler that totally worked would be the Holy Grail of crackers. Unfortunately it is actually impossible to get everything back because lots of info is lost on compilation.

    Nevertheless there are tools out there that attempt to decompile programs; I think of them more as ways of making assembly more readable.

    Note, a lot of them wouldn't work on hand-written assembly, because they rely on knowledge of how certain compilers compile various things, e.g. there was a Delphi decompiler available.

    graspee

    1. Re:hmm by deranged+unix+nut · · Score: 2, Insightful

      The problem is that there are quite a few people out there that assume that just because it is in binary form, that it can't be figured out. For example, they will use XOR to "encrypt" data stored inside the program, or assume that their secret algorithm is safe because it is compiled.

      The barrier to entry is definitely raised, but it is always possible to figure out what the compiled code is doing given enough time and effort. In fact, I've even heard of people who patch operating system kernel code without the source...
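
      A sketch of why single-byte XOR "encryption" protects so little. The function names, the sample string, and the key are all assumptions for this example: the same operation scrambles and unscrambles, and with only 256 possible keys anyone who can guess a fragment of the plaintext can recover the key by brute force.

```cpp
#include <cassert>
#include <string>

// XOR every byte with a single-byte key. Applying it twice with the same
// key returns the original data.
std::string xor_with_key(const std::string& data, unsigned char key) {
    std::string out = data;
    for (char& c : out) {
        c = static_cast<char>(static_cast<unsigned char>(c) ^ key);
    }
    return out;
}

// Try all 256 keys and return the first one whose output contains a
// guessed plaintext fragment (a "crib"). Returns -1 if none matches.
int recover_key(const std::string& scrambled, const std::string& crib) {
    for (int key = 0; key < 256; ++key) {
        const std::string candidate =
            xor_with_key(scrambled, static_cast<unsigned char>(key));
        if (candidate.find(crib) != std::string::npos) {
            return key;
        }
    }
    return -1;
}

// Hypothetical self-check: scrambling twice is the identity.
void check_round_trip() {
    const std::string plain = "password=hunter2";
    assert(xor_with_key(xor_with_key(plain, 0x5A), 0x5A) == plain);
}
```

      Someone reading the binary does not even need the brute force: the key and the XOR loop are sitting in the code next to the data they protect.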

    2. Re:hmm by jackb_guppy · · Score: 5, Interesting

      I wrote reverse compilers on IBM midrange equipment, where there are no stacks and self-modifying code is VERY commonplace. It is easy to do:

      Create a program that performs / understands the opcodes for the processor and addressing, and that follows both sides of a branch.

      Now "run" the program; that maps out all the opcode and data areas.

      Once done, look at the assembler equivalent and map out common subroutines and function calls. Data storage becomes very clear. Lastly, common storage will show external and internal common structures, so the naming of fields becomes visible.

      It is straightforward, though it can be time consuming, and it is very helpful in understanding hidden or lost information.
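
      The "follow both sides of a branch" step can be sketched on a made-up instruction set. Everything here is hypothetical (the four opcodes, their encodings, the sample image); it is not the IBM midrange tooling described above, only the general worklist idea for separating code bytes from data bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction set:
//   HALT (1 byte), NOP (1 byte),
//   JMP target (2 bytes), BRANCH target (2 bytes, conditional).
constexpr std::uint8_t HALT = 0x00;
constexpr std::uint8_t NOP = 0x01;
constexpr std::uint8_t JMP = 0x02;
constexpr std::uint8_t BRANCH = 0x03;

// Walk every path reachable from the entry point and mark the bytes that
// belong to instructions. Bytes never reached stay marked as data.
std::vector<bool> map_code_bytes(const std::vector<std::uint8_t>& image,
                                 std::size_t entry) {
    assert(entry < image.size());
    std::vector<bool> is_code(image.size(), false);
    std::vector<std::size_t> worklist{entry};
    while (!worklist.empty()) {
        std::size_t pc = worklist.back();
        worklist.pop_back();
        // Stop at the end of the image or at bytes already visited.
        while (pc < image.size() && !is_code[pc]) {
            const std::uint8_t op = image[pc];
            if (op == HALT) {
                is_code[pc] = true;
                break;
            }
            if (op == NOP) {
                is_code[pc] = true;
                pc += 1;
                continue;
            }
            if ((op == JMP || op == BRANCH) && pc + 1 < image.size()) {
                is_code[pc] = true;
                is_code[pc + 1] = true;  // the operand byte is part of the instruction
                if (op == BRANCH) {
                    worklist.push_back(pc + 2);  // fall-through side
                }
                pc = image[pc + 1];  // taken side
                continue;
            }
            break;  // unknown opcode or truncated instruction: abandon this path
        }
    }
    return is_code;
}

// Sample image: code at 0..3 and 5..6, data bytes at 4 and 7.
std::vector<std::uint8_t> sample_image() {
    return {BRANCH, 5, NOP, HALT, 0x2A, JMP, 2, 0xFF};
}
```

      Computed jumps and self-modifying code are exactly where this simple walk runs out of information, which is presumably why the poster simulates the opcodes rather than just scanning them.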

  6. Re:Why by Morologous · · Score: 3, Informative

    I can't count the number of times I've been frustrated with the performance or process of an application that I had to interface with, and just wondered: *why* in god's name, or *what* in god's name are they doing in there.

  7. sure you can go from asm - c++ by Anonymous Coward · · Score: 5, Informative

    but it'll look like this

    class a
    {
    public:
    void b(int c) { g = c; } // placeholder body (assumed)
    void d(int e) { h = e; } // placeholder body (assumed)
    private:
    int g;
    int h;
    };

    int main()
    {
    a f;
    f.b(23);

    int x; x=0; x++;
    if(x > 3) goto j;
    f.d(x); x++;
    if(x > 3) goto j;
    f.d(x); x++;
    if(x > 3) goto j;
    f.d(x);
    j: f.b(42);

    return 0;
    }

    1. Re:sure you can go from asm - c++ by rsheridan6 · · Score: 4, Funny

      My girlfriend just read that over my shoulder and said "Is that a poem?"

      --
      Don't drop the soap, Tommy!
  8. Decompile this! SlashDot Effect! by lems1 · · Score: 2, Funny

    Yeah, but they should know how to decompile the Slashdot effect first... another one down. Anybody with a mirror or Google Cache link?

    --
    This sig can be distributed under the LGPL license
  9. Re:Why by Anonymous Coward · · Score: 5, Insightful

    You need reasons?

    1) Finding backdoors
    2) Testing security
    3) Fixing bugs
    4) Adding features
    5) Discovering copyright violations
    6) Interfacing to non-supported clients

    Pretty much anything and everything you would do if you had the source.

  10. Re:Intresting by Morologous · · Score: 3, Funny

    *BBBBRRRRRRTTTT*

    Incorrect! Spelling Nazi may have been the answer you're looking for.

  11. Re:Why by p4ul13 · · Score: 4, Insightful

    You could be updating a program for your company for which the source is lost.

    --
    Paul Lenhart writes words!
  12. Inline functions, templates and decompilation by truth_revealed · · Score: 4, Insightful

    Sure, you can decompile an optimized and symbol-stripped C++ program, but you'd never get it back in the original compact form of the source, as you do with the Java class file decompilers, due to the heavy use of inline functions and templates in C++. A C program, sure, but decompiling C++ is not terribly useful.

  13. Re:Intresting by Anonymous Coward · · Score: 2, Funny


    Its write hear in my Oxbrige Enlish Dictionairy. What are you on about?

  14. let's get back to basics by 1nv4d3r · · Score: 5, Funny

    Hell, I'd be happy if the people working for me could consistently compile their c/c++. I need a new job...

  15. Speculation Code by Davak · · Score: 5, Informative
    Considering the entire post is evidently based on speculation...

    Here is some code that supposedly decompiles... not that I've tried it.

    Quote from the FAQ:


    [35.4] How can I decompile an executable program back into C++ source code?

    You gotta be kidding, right?

    Here are a few of the many reasons this is not even remotely feasible:
    * What makes you think the program was written in C++ to begin with?
    * Even if you are sure it was originally written (at least partially) in C++,
    which one of the gazillion C++ compilers produced it?
    * Even if you know the compiler, which particular version of the compiler was
    used?
    * Even if you know the compiler's manufacturer and version number, what
    compile-time options were used?
    * Even if you know the compiler's manufacturer and version number and
    compile-time options, what third party libraries were linked-in, and what
    was their version?
    * Even if you know all that stuff, most executables have had their debugging
    information stripped out, so the resulting decompiled code will be totally
    unreadable.
    * Even if you know everything about the compiler, manufacturer, version
    number, compile-time options, third party libraries, and debugging
    information, the cost of writing a decompiler that works with even one
    particular compiler and has even a modest success rate at generating code
    would be significant -- on the par with writing the compiler itself from
    scratch.

    But the biggest question is not how you can decompile someone's code, but why
    do you want to do this? If you're trying to reverse-engineer someone else's
    code, shame on you; go find honest work. If you're trying to recover from
    losing your own source, the best suggestion I have is to make better backups
    next time.

    I would have posted AC but they have me blocked out for some reason...


    Davak

  16. To all those, who think it's useless... by SharpFang · · Score: 4, Interesting

    Well, it isn't. Sure, if you're so lazy you want to have the source rebuilt from binaries with one click, complete with comments, makefile and documentation, that's of no use. But imagine the program does some very clever trick. Something you ooh about: "How the hell does he do that? It's impossible!" You want to include that trick in your code. You need it. So you have three options:

    1) Try to design it from scratch. Helluva work, you don't know where to start.
    2) Look into the binary. If you're an ASM guru, you MAY succeed. But ASM from high-level languages is hell to read.
    3) Decompile the puppy, look for that piece through what looks like piles of junk but is way more readable than ASM, and find it. Then just rewrite it in pretty fashion, changing variable names and functions to your needs, and include it in your own software.

    It's "the best of the worst", a last resort at finding a solution to a small problem. Not a way to edit the source and add a single feature to the original program, like removing print protection from Acrobat Reader. The decompiled program most probably won't be possible to compile. You won't make a cow from hamburgers. But with some luck you may find out the cow was a bull and got killed by a truck.

    --
    45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    1. Re:To all those, who think it's useless... by pVoid · · Score: 3, Insightful
      Neat tricks are generally one of these three things:

      A hidden API call - which can be easily found via ASM listings

      A nice little algorithm - which can be found in comp sci books

      An elegant piece of code - which can *not* be decompiled from ASM

      So no, I disagree with you.

  17. You're right, that is nonsense. by Anonymous Coward · · Score: 5, Funny

    I damn well know computers. I have been working with them since 1904, when the Black Man made the first computer out of a peanut. I now work for Cray research making 18 figures.

    I can scratch a superscalar CPU out of silicon with a pocket knife. I even have friends who can write major programs in binary code (yes, just 1s and 0s)... even though writing a simple "hello world" program can amount to 92,752 bits. I fail to realize that this ability does not a good computer scientist make. Things like intelligent design and research make a CS good.

    The parent post is fluff. It's stupid, the man is flamboyant and exaggerating. He clearly has no real education in computer engineering and does not recognize that any executable code can be reverse-engineered or decompiled. Especially since every language (save interpreted languages like Java) is compiled to machine code: specific, unambiguous, structured code. "Decompiling" this is only really a matter of translating it into your language of choice.

    So, Mr. Proud American, please get off your imaginary high horse. You're not fooling anyone.

  18. Re:Why by Kadagan+AU · · Score: 2, Insightful

    I agree, that should have been modded insightful, not funny. We have a ton of in-house apps that we don't have source for anymore, and it would be really nice to be able to update them without having to rewrite the entire thing.

    --
    This space for rent, inquire within.
  19. Reverse-engineering programs written in C/C++ by Anonymous Coward · · Score: 2, Interesting

    I've done some reverse-engineering on programs written in C/C++ (Intel x86). After a while you learn how to recognize different things like virtual function calls, while/for-loops, switch and stuff like that. However, it's a totally different thing to decompile to C++. It may be possible to decompile compiled code to C, but don't expect that it will look much like the original source, especially if the code was optimized by the compiler :)

  20. Templates by ucblockhead · · Score: 4, Informative
    He won't be able to regenerate any templates. If a program makes heavy use of templates, the "C++" he "decompiles" to is going to be hideously ugly.

    [insert joke about it being hideously ugly with templates here.]

    (I did not read the article itself because it is, of course, slashdotted)
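
    A sketch of what is lost, with all names invented for this example. The template exists only in the source; after instantiation the binary effectively holds separate, type-specific functions, and nothing marks them as having come from one generic definition.

```cpp
#include <cassert>

// What the author wrote: one generic function.
template <typename T>
T largest(T a, T b) {
    return (b < a) ? a : b;
}

// Roughly what a decompiler could hand back: one function per type that
// was actually used. These names are hypothetical stand-ins for whatever
// labels a tool would generate.
int largest_int(int a, int b) {
    return (b < a) ? a : b;
}

double largest_double(double a, double b) {
    return (b < a) ? a : b;
}

// Hypothetical self-check: the hand-expanded copies behave like the
// template instantiations they stand in for.
void check_instantiations() {
    assert(largest(3, 7) == largest_int(3, 7));
    assert(largest(2.5, 1.5) == largest_double(2.5, 1.5));
}
```

    With two types this is merely repetitive; with a program built on the STL the number of near-identical copies is what makes the output so ugly.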

    --
    The cake is a pie
  21. Java Decompiler? by mindstrm · · Score: 2, Interesting

    Anyone recommend a java decompiler known to work on the most recent versions of java, properly?

    Something that will literally give me code I can re-compile immediately?

    1. Re:Java Decompiler? by anonymous+loser · · Score: 2, Interesting

      JAD is a godsend. I wrote a very complex optimization method that was extremely effective in a couple of circumstances. A couple of years later, those circumstances turned up again, only in a different language. I couldn't find the source code anywhere, just the class file that had my great method in it. So, JAD came to the rescue; it gave me a bunch of code that used d1,d2,d3,... as my variables, but I already had a basic understanding of what each variable's role was, so it wasn't a problem for me to reverse-engineer my own code and finally port it to another language. I also made several back-ups of the source code this time. :-)

  22. Reverse engineering has its uses... by sheetsda · · Score: 4, Insightful

    There seem to be a lot of people in this story saying "shame on you for reverse engineering". It has its uses: how else would viruses, worms, and trojans be analyzed to figure out what they do and how they do it?

  23. Re:This is nonsense by scottking · · Score: 2, Funny

    well, when SGI lays you off this week, you're going to have plenty of time to learn how to create programs in binary, just like your friends.

    --
    scott king
  24. Re:Why by Call+Me+Black+Cloud · · Score: 4, Informative
    As a Java programmer I find it very useful to decompile class files from time to time. Reasons I've done so:

    A library we were basing a major portion of our code on had a bug in it (a Listener class failed to implement EventListener if I remember correctly) which kept our code from working. Removed offending classes from archive, decompiled, fixed, and recompiled.

    It's educational...the ol' "how'd they do that?". I've never taken code and used it but I found it instructional to look at how someone made a Swing text area from scratch, e.g.

    The challenge...one program I installed had an "enter registration key" prompt and I was curious how that was handled (turned out to be a static string). Then there was this applet that was the core of a company's business. Free, or pay and get more features. As it turns out the control of the features all resided in the applet, so change a couple of switch and if/then statements and voila, administrative privileges. Didn't use it for evil, much... :) They've since come out with a new version and I've been too busy using my mad java skillz on contract work to take a look at their code.

    Looking at security was instructional too, though, for when I was project lead on a commercial Java app I knew what worked and what didn't (we ended up using the Wibu key).

  25. A good decompiler shows you what was written by crovira · · Score: 4, Interesting

    not the source's lies.

    Losing source code and var names (name spaced globals aka statics and scoped locals) allows the cracker (these are rarely hacking tools, they're mostly cracking tools,) to focus on what the machine actually was told to do instead of smothering it with shades of meaning which interfere with understanding the code.

    C++ or Java or Smalltalk, or almost any highly structured language using machine code libraries or virtual machines result in structured blocks of code and heap and stack allocation.

    A good decompiler can take the machine code, peel away the name spaces and code calls, extract the patterns in the code and the hacker/cracker can read the patterns instead of wasting time on the code.

    Forensic analysis work is extremely useful at telling you what happened when something dies but it is no good at telling you how something worked. For that you need code traces.

    Map those code traces onto the structure the decompiler reveals and you understand the program better than the authors/coders.

    --
    MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
  26. Anyone want to decompile SCO? by pchown · · Score: 4, Funny
    You might decompile one file and find a comment like this at the top:

    * This program is free software; you can redistribute it and/or
    * modify it under the terms of the GNU General Public License
    * as published by the Free Software Foundation; either version
    * 2 of the License, or (at your option) any later version.
    ;-)
  27. Re:Why by neonstz · · Score: 2, Insightful

    What you need is a decent source control/backup system, not a decompiler.

  28. Re:Why by IamTheRealMike · · Score: 2, Insightful

    This might be useful for when trying to make apps run in Wine. Occasionally disassembly is the only way to figure out why the app crashes light years away from the nearest API call etc.

  29. misleading... by bismarck2 · · Score: 4, Informative

    Even with complete original source code, understanding a non-trivial C++ application is very difficult. Source derived from an optimized executable is going to be a LOT rougher. No real function names, module names, variable names, or comments. Use of standard libraries (STL, MFC, Boost) is likely highly obscured as well. A tool like this would probably produce source that looks more like a C/machine language hybrid rather than normal C++. The primary use of something like this is if you are looking for a very specific piece of logic such as a password check or an encryption operation or protocol details. When were these famous last words anyway?

  30. Decompiling to C++ is like... by Call+Me+Black+Cloud · · Score: 3, Insightful

    ...trying to rebuild a wrecked sand castle just by looking at the grains of sand. You can't. Compilers throw away a lot of information needed by people but not necessary for the machine. Compilers optimize the code to run more efficiently and that's a one-way street. Sorry to burst your bubble but trying to reconstruct original source is like trying to herd cats.

    Thank you, thank you. I'm Mr. Metaphor and I'll be here all week.

    1. Re:Decompiling to C++ is like... by davidstrauss · · Score: 2, Funny
      Thank you, thank you. I'm Mr. Metaphor and I'll be here all week.

      Calling yourself Mr. Metaphor is like using metaphor instead of analogy, which, in your case, is as incorrect as a cow marking its territory with cow pies and instituting an elaborate cow-tipping territory defense program.

  31. Re:Why by isorox · · Score: 2, Insightful

    shutting the barn door after the horse bolted?

  32. thanks for nothing. by twitter · · Score: 3, Interesting
    If you're trying to reverse-engineer someone else's code, shame on you; go find honest work.

    Shame on you Davak, you should go find honest code. There's nothing wrong with trying to understand how things work. Some people are stuck with legacy equipment or code they can't replace easily and this is their only option for improvement or even fixing it. Those people would be better off if free code were available. Sometimes the only way to make that free code is to understand the original code. There's nothing wrong with reverse engineering software, ever. Republishing someone else's binary is not legal, but it's not immoral. If the code were honest to begin with, the reverse engineer part would not be required. These days, it's cheaper to throw out the dishonest code and hardware and buy some hardware that's well understood. If you make hardware or software, I hope you understand the implications for your product - I'm not buying it.

    --

    Friends don't help friends install M$ junk.

  33. Useful for compatibility reasons by wilddur · · Score: 2, Interesting

    In Europe, reverse engineering is legal for compatibility purposes, that is, to enable your software to work with other people's software (mainly Microsoft's).

    If you do the reverse engineering in Europe, you could develop compatible software and then export it to the US. So it may be great news for us. In fact, it is becoming really complicated to develop software for or in the US: patents, legislation, compatibility. It seems that more lawyers than programmers are needed to write anything more complicated than HelloWorld.exe.

    There is a need for tools that enable compatibility between programs, or we will end up with a monopoly on all kinds of programs. (And it is illegal to use your OS monopoly to obtain a monopoly on, let's say... web browsers.)

  34. Re:Why by Lumpy · · Score: 4, Insightful

    Why would you want to do this unless you were stealing source?


    nice try.

    You must be either Bill Gates, Steve Ballmer or someone who works for the BSA.

    How am I to tell if your closed source program isn't full of my GPL code that you blatantly stole and are trying to rob me blind by STEALING my IP? Being a closed source advocate as you seem to be, you are for me trying to detect IP theft and the illegal STEALING of my code by PIRATES, right?

    Ok, I'm going overboard to make my point... I have EVERY right to use tools in a good and legal way. Why not outlaw hammers, as anyone can perform a very grisly and horrible murder with one... Or better yet, only allow licensed contractors to have hammers! As we know, the unlicensed public is only going to do very evil things with tools!

    See my point now? A tool is exactly what it looks like... a tool. It can be used for good and evil, and I don't have any respect for the self-righteous like you condemning what I do before I even do it.

    people with attitudes like you are what cause all the pain and suffering in this world...... STOP IT!

    --
    Do not look at laser with remaining good eye.
  35. Slight misunderstanding here . . . by Selanit · · Score: 2, Informative
    If you're trying to reverse-engineer someone else's code, shame on you; go find honest work.

    Shame on you Davak, you should go find honest code ...
    If you read carefully, you'll note that the "honest work" sentence is NOT Davak's. It is still indented as part of the blockquote, and therefore is the final section of the passage he was quoting from that C++ FAQ. The last sentence that is actually Davak's is his comment about wishing to post as an anonymous coward, presumably to avoid situations like this one. Since AC posting wasn't working for him, it might have been a good idea to italicize the quoted passage to set it off clearly from the rest of the post. Oh well, too late now.
  36. Decompilation = halting problem by Wizard+of+OS · · Score: 3, Insightful

    Why do people keep thinking that decompilation is possible? In short: decompiling a computer program is solving the halting problem. Period.

    The long version: In a compiled computer program there is no distinction for either code or data. Every byte in memory can be data, but it can also be executed as valid computer code.

    Now, the catch is that during compilation, data and code are mixed in the resulting binary. For instance take the compilation of a 'case' statement. There are several ways of compiling a case:
    - you can write it as a list of IFs, which decompiles perfectly well
    - you can write it as a jump, based on the case expression.
    The fun part about the second possibility is that it's far more efficient, but it poses a problem: when decompiling this you have to know where the bounds of the case lie. What's the furthest jump that can be made? It's a jump based on a calculated value, so you should know which values are possible. But for that, you need to run the program, and more specifically, you must run all possible execution paths.

    This can be rewritten as the instance of the halting problem: can a computer find out for any program whether or not it will halt? It is proven that a computer program cannot be written to do this task. Neither can a computer program decompile any other computer program.
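
    To make the two lowerings concrete, here is a minimal C++ sketch of a dense switch next to a hand-written table dispatch that mimics what a jump-table lowering does. The handler names and return values are made up for illustration, and real compiler output will differ. Note the explicit range check in front of the indexed call: compilers typically emit something like it to reach the default case, and when it is present the table bounds are visible without running the program. The hard cases are computed jumps with no such check.

```cpp
#include <array>
#include <cstddef>

// Hypothetical handlers standing in for the bodies of each case.
static int handle_zero()  { return 10; }
static int handle_one()   { return 20; }
static int handle_two()   { return 30; }
static int handle_other() { return -1; }

// Source-level view: a dense switch over a small integer.
int dispatch_switch(unsigned selector) {
    switch (selector) {
        case 0:  return handle_zero();
        case 1:  return handle_one();
        case 2:  return handle_two();
        default: return handle_other();
    }
}

// Hand-written imitation of a jump-table lowering:
// one range check, then an indexed indirect call.
int dispatch_table(unsigned selector) {
    using Handler = int (*)();
    static const std::array<Handler, 3> table = {handle_zero, handle_one, handle_two};
    if (selector >= table.size()) {
        return handle_other();  // out of range: the default case
    }
    return table[selector]();
}
```

    Both functions return the same value for every selector; only the second exposes the table and its bound explicitly.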

    --
    If code was hard to write, it should be hard to read
    1. Re:Decompilation = halting problem by johannesg · · Score: 2, Insightful
      Your understanding of the consequences of the halting problem is incomplete. It is not a proof that it is impossible to determine for any given program whether or not it stops in finite time. It is merely a proof that there exists a class of programs for which this determination cannot be made. However, there are also many programs for which it can easily be determined whether or not they stop in finite time, and the same thing is true for decompilation.

      Furthermore, there is nothing saying that it has to do a 100% perfect job. Decompilation is already accepted to be imprecise; using some common sense (intuition if you want) to fill in some gaps is not an invalid method.

      The problems that face decompilation that stem from real-world issues are far, far greater than this (rather theoretical) problem. For example, decompiling any STL-based source to a useful state will be far more difficult than a simple jumptable.

  37. Of course you can decompile C++ by TapeLeg · · Score: 3, Informative

    You can decompile any program. A compiled program is just your high-level program translated into machine language. There is no sort of magical encryption or similar transformation that it undergoes once you compile it.

    All you need to do is read in the bytes of any binary program, interpret the bytes as their machine language equivalents for whatever platform you are using, and then convert your MOV statements to assignment operators, JMP statements to higher level loop structures, etc.

    Of course, you won't retain the names of identifiers, which are referred to only by memory locations in a compiled program; and some control structures might be rearranged due to compiler optimization and the lack of machine language equivalents, but the meat and potatoes of it is all right there.

    It's by no means easy to accomplish, especially with higher and higher level programming languages, but impossible? humbug! =)
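
    As a rough sketch of the mechanical translation described above, the following C++ maps a made-up three-instruction set onto C-like statements. The `Op`, `Instr`, and `lift` names and the instruction set itself are hypothetical, not any real platform's encoding. It stops at `goto`; turning jumps back into loops is a separate and harder step.

```cpp
#include <string>
#include <vector>

// A toy instruction set, invented for this example.
enum class Op { Mov, Add, Jmp };

struct Instr {
    Op op;
    std::string dst;  // destination register, or the target label for Jmp
    std::string src;  // source register or literal; unused for Jmp
};

// Translate each instruction into one C-like statement.
std::vector<std::string> lift(const std::vector<Instr>& code) {
    std::vector<std::string> out;
    for (const Instr& i : code) {
        switch (i.op) {
            case Op::Mov: out.push_back(i.dst + " = " + i.src + ";");  break;
            case Op::Add: out.push_back(i.dst + " += " + i.src + ";"); break;
            case Op::Jmp: out.push_back("goto " + i.dst + ";");        break;
        }
    }
    return out;
}
```

    Feeding it `MOV r0, 5`, `ADD r0, r1`, `JMP L1` yields three statements with register names in place of identifiers, which is the "no names retained" problem in miniature.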

  38. Re:It's the other way around by Minna+Kirai · · Score: 2, Insightful

    When you think about it, the higher level the language is, the easier it should be to "decompile".

    No, no, no. This is both empirically untrue, (Do you see many ML or even C++ decompilers out there?) and theoretically insensible.

    The higher level a language is, the more changes there will be between the original source code and the assembly. Thus the more source data that will have been discarded by the original compiler, which is data the decompiler cannot reconstruct.

    The reason Java decompilers work relatively well is not because Java's a high-level language (it isn't, really), but because the output program is at such a high level! Instead of working from binary code, a Java decompiler gets a more presentable bytecode, packed with the names of classes and methods. (Also, because optimization of Java programs is supposed to happen after compilation at a JIT stage, the bytecodes won't be as obfuscated as the output of a normal C++ optimizing compiler)

    The closer the original source was to asm, the more the individual coder's style will be reflected in the asm

    When decompiling, the "individual coder's style" is exactly what you're trying to get!

    the more the obvious patterns the compiler uses every time for given constructs will be present.

    Good compilers don't use "obvious patterns". Their transformation functions are very sensitive, so a tiny change in the input source (expanding a loop from 3 times to 4) can cross an optimization threshold and totally change the appearance of the output.

  39. I couldn't help it by Fnkmaster · · Score: 4, Funny
    Neo: Do you always look at it in binary?


    Cypher: Well you have to. The compilers work for the construct program. But there's way too much information to decode the Matrix. You get used to it. I...I don't even see the code. All I see is an array, function pointer, integer. Hey, you uh... want a drink?


    Neo: Sure.


    Cypher: You know, I know what you're thinking, because right now I'm thinking the same thing. Actually, I've been thinking it ever since I got here. Why, oh why didn't I sell my VA Linux stock?... Good shit, huh? Cowboy Neal makes it. It's good for two things, degreasing Perl code and killing brain cells.

  40. Decompiling is possible, but hard by Animats · · Score: 4, Interesting
    Decompilers are rare, but possible. The first good one, decades ago, decompiled IBM 1401 assembler programs into COBOL. There's a commercial business, The Source Recovery Company, still doing that for legacy mainframe programs.

    C decompilers exist; here's one. There are others. Most aren't very good. It's a hard problem.

    Without debugging information, decompilation tends to result in code with arbitrary variable and function names, of course. But you get names when a DLL or .so is entered, so at least you get the program's major interfaces. Minimal C++ decompilation could be done by adding vtable recognition to a C decompiler.

    A more difficult problem is recognition of idioms. Things like "for" statements tend to decompile as lower level constructs. That's OK as a first step. You need some internal representation. Initial decompilation might represent all transfers of control with "goto"; higher level recognition then deals with that.

    The key to doing a good job is "optimization", finding more concise source code that will generate the object code. The key to this problem is defining an internal representation that can represent any valid machine-language program, and which can be modified as higher level information about the program is discovered. The first step is usually to start at the starting address and build a code tree by following calls, like a good debugger does. Then you start to improve on the code tree, doing things like this:

    • Recognition of function calls. Each function call should be decompiled, and all calls to the same function checked to ensure they have the same calling sequence. Then a prototype can be generated and placed in a header file.
    • Recognition of fixed-format structures. Figuring out how big a structure is can be tough, but at least fixed-format ones should be fully recognized. All references to the structure should be checked for type consistency, and a structure definition generated.
    • Recognition of "for", "while", and "switch".
    • Once constructors and destructors have been found, the structure of derived objects can be figured out. Now class definitions can be generated.
    • Once class member functions have been identified, the most restrictive protection ("private", "public", "protected") that will work should be attached. Similarly, "const" can be inserted for all arguments not seen to be modified.

    Decompilation won't always succeed. But you should find all the places where the code is doing something the compiler doesn't understand, and get code back for everything else.
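
    One piece of the "for"/"while" recognition step can be sketched as back-edge detection on a control-flow graph: an edge that jumps to a block still on the depth-first search stack closes a loop. This is a minimal sketch under assumptions, not how any particular decompiler does it. Blocks are plain integer indices, and `Graph` and `find_back_edges` are names invented here.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Control-flow graph as an adjacency list: g[b] lists the successors of block b.
using Graph = std::vector<std::vector<int>>;

// Depth-first walk. state: 0 = unvisited, 1 = on the DFS stack, 2 = finished.
static void visit(const Graph& g, int node, std::vector<int>& state,
                  std::vector<std::pair<int, int>>& back) {
    state[node] = 1;
    for (int next : g[node]) {
        if (state[next] == 1) {
            back.emplace_back(node, next);  // jump to an ancestor: loop candidate
        } else if (state[next] == 0) {
            visit(g, next, state, back);
        }
    }
    state[node] = 2;
}

// Return every (from, to) edge that closes a cycle reachable from entry.
std::vector<std::pair<int, int>> find_back_edges(const Graph& g, int entry) {
    std::vector<int> state(g.size(), 0);
    std::vector<std::pair<int, int>> back;
    if (entry >= 0 && static_cast<std::size_t>(entry) < g.size()) {
        visit(g, entry, state, back);
    }
    return back;
}
```

    For a graph where block 1 tests a condition, block 2 is the body that jumps back to block 1, and block 3 is the exit, the only back edge is 2 -> 1, so blocks 1 and 2 are candidates for a "while".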

    It's a big job, and somebody ought to do it. Among other things, it would be a valuable tool for finding compiler bugs.

  41. Re:It's the other way around by KingRamsis · · Score: 2, Informative

    excuse me..!!
    just leave Delphi out of it, Delphi is a true OOP language you should do some research before coming up with a gross generalization like that.

  42. Halting is a red herring by yerricde · · Score: 2, Insightful

    Now, the catch is that during compilation, data and code are mixed in the resulting binary.

    Not last time I checked. My compiler emits at least four segments in a compiled program: .text (program code), .rodata (initialized data marked as 'const'), .data (initialized data), and .bss (zero-filled data, which is run-length encoded). Segments .text and .rodata are also write-protected.

    Yes, there is a halting problem, but this isn't it. Segments make distinguishing code from data straightforward. I understand that the few programs that make platform-specific API calls to write-enable .text would be harder to disassemble (and subsequently decompile), but do most user programs make such calls?

    Besides, even if the halting problem were relevant, the halting problem can be solved in a real computer, which has limited memory and is thus a linear bounded automaton rather than a Turing machine.
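
    The bounded-memory argument can be illustrated with a toy machine whose entire state is one byte: there are only 256 states, so either it halts or it revisits a state and loops forever. This is a sketch with an invented `halts` helper and a caller-supplied `step` function; it needs C++17 for `std::optional`. The same argument on a real machine is true in principle but hopeless in practice, since the number of states is 2 to the power of the number of bits of memory.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// step maps the current state to the next one, or to std::nullopt to halt.
using Step = std::function<std::optional<std::uint8_t>(std::uint8_t)>;

// Decide halting for a deterministic machine with a one-byte state.
bool halts(std::uint8_t start, const Step& step) {
    std::vector<bool> seen(256, false);
    std::optional<std::uint8_t> state = start;
    while (state) {
        if (seen[*state]) {
            return false;  // same state twice: it will repeat forever
        }
        seen[*state] = true;
        state = step(*state);
    }
    return true;  // step reported a halt
}
```

    The loop runs at most 256 times, because each iteration either halts, detects a repeat, or marks a new state.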

    --
    Will I retire or break 10K?
  43. Re:Why by Crashmarik · · Score: 3, Interesting

    That list can also double as 6 things your vendors don't want you to be able to do.

    I have always felt the greatest problem with closed source was it forced you to trust someone who you were fairly certain had only one skill and that was salesmanship.

    It of course raises the interesting question: if you find a copyright violation in commercial software, is your evidence void because the license agreement usually excludes all reverse engineering?

  44. Re:Is This Really C++ by Specialist2k · · Score: 2, Funny
    And how do you compile the following statement using a C compiler?

    cout << "member 1: " << local_struct.member1 << endl;

  45. Re:Why by Crashmarik · · Score: 2, Interesting

    No, it's not.

    It may be a violation of the license agreement, which would be a violation of a civil contract. The enforceability and applicability of said agreements have been a point of contention for nearly 30 years now.

  46. Research Company in Kingston, ON by msobkow · · Score: 2, Informative

    A friend of mine work(ed) with a company in Kingston, ON that was spun off from Queen's University. Their sole purpose and business model is to take whatever binaries and source a company has available, run it through their cluster of analysis systems, and produce a "clean" update of the system. As per usual, there is about 10-15% of the produced code that needs some hand inspection and tweaking to complete the task.

    Their "big" business was the Y2K work, as their software isn't limited to just reverse-engineering, but can also refactor the re-engineered code (e.g. change all "year" values in the system from 2 digit to 4 digit, updating all related I/O formatting functions, overlay structures, etc.)

    On the flip side, their stuff involves complex pattern matching and heuristics that put any other system I've heard of to shame. It requires clusters of systems running for days to do the initial code analysis. (OTOH, it probably took years to create the original code.)

    I can't provide more specifics on the company because they're having some legal issues with co-investors.

    --
    I do not fail; I succeed at finding out what does not work.
  47. Re:Why by Dylan+Zimmerman · · Score: 4, Interesting

    Nope. It (probably) wouldn't be admissible because of the part that says no reverse compiling. Reverse engineering is something totally different.

    Reverse engineering is taking a black box and figuring out what it contains by giving it test inputs and watching the outputs. There are a few other things considered reverse engineering, but that describes most of it.

    Of course, all of this ignores the fact that EULAs have never been tested in court. They could be proven invalid as contracts fairly easily since the exchange of goods occurs before you ever see the EULA and most stores don't accept returns of opened software. Therefore, if you don't agree to the EULA, you still have the right to use what you purchased.

    On an interesting side note, various free trade laws specifically protect reverse engineering.

  48. Article is mistitled. by Minna+Kirai · · Score: 2, Informative

    The article (link provided for those who don't read URLs) is wrong, even in the first section.

    The title of the first "chapter" is "Why is c++ Decompiling possible?". But immediately he lists "what is totally loss when you compile a program and what stays there".

    In the Lost column he puts templates and classes. The remains list has things like function calls and local variables.

    Well, guess what? The things that are "lost" are everything that distinguishes C++ from C. If you don't have classes (meaning no inheritance or virtual functions either) and don't have templates either, then you're really just programming in "a better C", not C++.

    So all his approach can hope to "decompile" is C code. Which is something we've seen done in various forms for decades.
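
    For what it's worth, here is roughly what "classes are lost" means for a virtual call, as a hedged sketch: the C++ class on top, and a C-style struct with an explicit function-pointer table below that behaves the same way. The layout shown (`ShapeC`, `ShapeVTable`, `vptr`) is illustrative and simplified; actual vtable layout depends on the compiler and ABI.

```cpp
// Source-level C++: a class hierarchy with one virtual function.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

int call_sides_cpp(const Shape& s) { return s.sides(); }

// C-style view of the same idea: the object carries a pointer to a
// table of function pointers, and the call goes through that table.
struct ShapeC;

struct ShapeVTable {
    int (*sides)(const ShapeC*);
};

struct ShapeC {
    const ShapeVTable* vptr;
};

static int triangle_sides(const ShapeC*) { return 3; }
static const ShapeVTable triangle_vtable = {triangle_sides};

ShapeC make_triangle() { return ShapeC{&triangle_vtable}; }

int call_sides(const ShapeC& s) { return s.vptr->sides(&s); }
```

    A decompiler that recognizes the table-pointer-then-indirect-call pattern can at least guess that the second form started life as the first, though the class and method names are gone.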

  49. The linked "book" itself by p3d0 · · Score: 2, Insightful

    I haven't seen any comments on the linked "book" itself yet. In short: it sucks hard. Go take a look and try not to laugh.

    --
    Patrick Doyle
    I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
  50. From the author by opcodevoid · · Score: 5, Interesting
    I didn't realize my article was getting any feedback because people are posting it here instead of on pscode.

    Anyway, I've seen a lot of people saying decompiling is impossible or at least not practical. Well, that is not true. Decompiling C++ is very practical because of high-level keywords (if, while, for), local variables, and parameters. All of these generate certain instructions that are similar on every platform and just about every processor.

    I am also extending the article to 92 pages in total, which will cover OOP, the CRT, and a whole bunch of other stuff.

  51. Why not indeed. by fishexe · · Score: 2, Informative

    Well, you can decompile every binary programm at least to assembler code, so why shouldnt it possible with C++?

    There's a huge difference between disassembling and decompiling. With assembly, you generally have a 1 to 1 correspondence between machine language instructions and assembly instructions. That is, one specific instruction you feed to the assembler becomes one specific assembled instruction. Sometimes it's more complicated than this, but only slightly.

    Now look at C, where one line of code could be arbitrarily many opcodes, depending on the complexity of the logic within that line (and the length of the line). Now suddenly, instead of looking at one instruction and translating it back to its equivalent, your decompiler has to look at possibly hundreds of instructions, parse them logically and figure out where each line starts and ends, and what the logical purpose of each set of instructions is. Then there is dealing with structures (or in C++, objects), where you have to come up with a definition for how data is laid out based solely on the instructions for dealing with that data.

    That's quite a bit more complicated. I sure as hell couldn't do it. I know I could write an assembler or disassembler, I might be able to write a simple compiler, but there's no way in hell I could write a functional decompiler.

    --
    "I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
  52. Re:Why by i_am_nitrogen · · Score: 3, Insightful

    One really good reason I haven't seen mentioned yet is writing a Linux driver for a piece of hardware only supported in Windows, such as the DXR3/Hollywood+ or the MyHD/WinTV-HD/etc. For these projects where the hardware manufacturers either can't or won't offer any help, the only way to support the hardware is by disassembling the Windows driver and figuring out the algorithms used by reading the disassembly and/or watching the interactions between the driver and the code. Fortunately for the MyHD driver project, the MyHD software is distributed without any EULA.

    BTW: Nice job getting all those responses with two lines...

  53. The author can't take honest criticism. by mark-t · · Score: 3, Interesting
    I posted the following remark about 20 minutes ago on pscode, and when I just checked back there I found that the remark had been surreptitiously removed (I still had a backup of what I had written in my cache):

    Nice try, but no. All this article ultimately describes is how to write high level language code that does the same thing as particular groups of assembly instructions, which is meaningless to a high level language programmer because knowing all the individual steps of a process is nowhere near as important as understanding what the process actually *IS*. This is something that no automated decompilation process can uncover because the responsibility for that understanding falls on the programmer, not the computer. Since code that only replicates functionality, but does not convey meaning to the programmer is not maintainable, the entire process of decompilation would be wasted. One would probably be better off spending their time figuring out how to do it themselves (with, perhaps, some help from standard reverse engineering, if needed).

    Not only does the author completely fail to realize that the technique he is describing doesn't remotely qualify as decompilation, and is nothing but normal reverse engineering, but he figures that the appropriate response to negative criticism is to remove evidence of it rather than attempt to intelligently respond. I noticed that my vote of 1 of 5 was still intact on his voting page, though.

    I was originally surprised when I first read the article that someone would think it had merit enough to write about, but having some insight into the mindset of the author that I did not have before (offered by his rapid censorship of my remarks), my surprise has waned completely.

  54. Re:Decompilation = halting problem == boloney by jackb_guppy · · Score: 4, Informative

    Been doing it for twenty years. It is easy to do.

    Stop trying to use logic... actually do it.

  55. Define decompile... by BagMan2 · · Score: 2, Insightful

    It's relatively easy to come up with the set of C statements that would mimic a particular set of asm statements that you wish to decompile, but the end result would be a C program that was not much easier to read than the original asm was. Changing various assembly operations into C operations does not get you back the information you really need.

    The symbolic names make up the bulk of the lost information, but often times programmers will organize a sequence of code in a certain way to make it easier to understand. The compiler will often rearrange that code in a manner that makes it easier for the computer to understand. Compilers will do screwy things like increment a variable on the stack, while holding the original in a register for later usage. Where the original C code might have had the variable increment at the end like this:

    while (x < 10)
    {
        // do a bunch of stuff using x
        x++;
    }

    The way the compiler optimizes register usage may cause the assembly to actually increment x just after doing the conditional, then hold the non-incremented value in a register for use down below. The decompiled asm might look like this:

    while (x < 10)
    {
        int temp = x;
        x++;
        // do stuff with temp
    }

    While this may seem like a trivial difference in the C code, it can often distort the intent of the algorithm. When a C programmer sees a construct like the latter, they naturally assume that the temp variable was used because more natural constructs would not work. The C programmer then wastes time mulling it over only to discover that it was just dumb.
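
    Here is a compilable version of the two loops, assuming the "stuff" is just adding x to a running total (the function names are made up here). Both return the same result for every input; only the shape of the loop body differs, which is exactly the kind of difference a reader of decompiled code has to see through.

```cpp
// The loop as a programmer would naturally write it.
int sum_as_written(int x) {
    int total = 0;
    while (x < 10) {
        total += x;  // "do a bunch of stuff using x"
        x++;
    }
    return total;
}

// The same loop as it might come back from a decompiler:
// the increment has moved up and a temporary holds the old value.
int sum_as_decompiled(int x) {
    int total = 0;
    while (x < 10) {
        int temp = x;
        x++;
        total += temp;  // "do stuff with temp"
    }
    return total;
}
```

    Starting from 0, both functions add up 0 through 9 and return 45.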

    I am currently on a project where I am maintaining some pretty poorly written code. I can't tell you how much time I waste looking at a particularly ugly algorithm trying to figure out why they are doing all these screwy things, only to discover they were just idiots.

    My point is, that the compiler and optimizer are going to mangle the logical order of the code in such a manner that it will be far more difficult to read.

    Like I said at the beginning, simple translation of assembly to C is easy, getting back the meaning that gives the endeavor any value at all is much more difficult.