Slashdot Mirror


Reverse Engineering Large Software Projects?

stalebread queries: "Me and a team of other students have been tasked with reverse engineering a massive C/C++ (mostly C) computer game of about half a million lines. We have most of the source, but no clue of how to approach a task of this magnitude. Anyone have suggestions of programs, or techniques we could use to understand the structure of the game?"

25 of 104 comments

  1. Flowcharting might help by eric2hill · · Score: 3, Informative

    Since you're probably proficient with C++, try a flowcharting solution to give you a high-level map of all the classes. Maybe that will help.

    --
    LOAD "SIG",8,1
    LOADING...
    READY.
    RUN
  2. oh boy by QuantumG · · Score: 3, Informative

    I presume you mean reverse engineering in the program understanding sense. In which case the way to go about it is to sit down and read the source code, taking notes as you go. You should then set yourself some maintenance tasks - modifying the source code is the best way to find out if you understand it or not.

    --
    How we know is more important than what we know.
  3. Reverse Engineer or Refactor/Port? by linuxtelephony · · Score: 4, Interesting

    It sounds like you are wanting to refactor the code, or port it to another platform. If you are missing some of the code, then you'll have to reverse engineer that portion of it.

    As for how to approach it - I think it depends on the size of your team, and what goals you set for the effort. Are you just wanting to learn? Or do you want to improve performance? Or make it work on another platform? What are the goals for this project?

    Once you know those details, they might give you an idea where to begin.

    --
    . 62,400 repetitions make one truth -- Brave New World, Aldous Huxley
    1. Re:Reverse Engineer or Refactor/Port? by QuantumG · · Score: 5, Informative

      Yeah, the "most of the source code" part is a bit scary. If they really are talking about reverse engineering from executables they are in for a hell of a time. The state of the art is a project I work on now and then, Boomerang, and it isn't for the faint of heart. I've been hearing for years about people who are working on decompilation tools that are integrated into IDA Pro but I've yet to see it. The time where you can enter a binary, press a button and get back compilable, maintainable source code is still a long long way off. But that's good, friends of mine do commercial decompilation work.

      --
      How we know is more important than what we know.
  4. It could help... by itistoday · · Score: 2, Insightful

    To understand how games are made in the first place. What kind of a game is it? Is it a single-player game, or a multiplayer game? If it's multiplayer you'll have to watch out for code designed to keep the game logic at a fixed rate; all other code will be built on top of that. Single-player games, on the other hand, don't have to worry about all the intricacies of keeping the various game clients in sync.

    So it really depends on the kind of game it is. Since I'm assuming you know this, I would suggest trying to first think how you would write the game yourself, and then see if you find any similarities between your ideas for the engine structure and the game's.
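
    For reference, the "fixed rate" logic mentioned above often takes the shape of an accumulator loop. What follows is only a rough Python sketch of that general pattern, not anything taken from the game in question; the function name and the 16 ms step are made up for illustration. Spotting a loop of this shape in the C source is one way to find the heart of the engine.

```python
def logic_updates_per_frame(frame_times_ms, step_ms=16):
    """Simulate a fixed-timestep loop (hypothetical helper, for illustration).

    frame_times_ms: how long each rendered frame took, in milliseconds.
    Returns how many fixed-size logic updates ran during each frame.
    """
    accumulator = 0  # elapsed time not yet consumed by game logic
    updates = []
    for frame_ms in frame_times_ms:
        accumulator += frame_ms
        ran = 0
        # Logic always advances in identical step_ms slices, no matter
        # how long or short the frame itself was.
        while accumulator >= step_ms:
            accumulator -= step_ms
            ran += 1  # a real engine would call its update function here
        updates.append(ran)
    return updates


print(logic_updates_per_frame([33, 10, 40]))  # prints [2, 0, 3]
```

    Integer milliseconds are used here only to keep the arithmetic exact; a real engine would more likely use a high-resolution timer.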

  5. A UML reverse-engineering tool by Burz · · Score: 2, Informative

    One like Rational Rose. It can create iconographic models of programs from source code.

    Other UML tools exist, like Argo and Umbrello, but I'm not sure if they reverse engineer.

  6. Source navigator by Mr2cents · · Score: 3, Informative

    http://sourcenav.sourceforge.net/

    I like to use it when browsing through code, you can search and browse as much as you like. It will still take an effort though.

    --
    "It's too bad that stupidity isn't painful." - Anton LaVey
  7. Profiling! by redelm · · Score: 2, Informative
    First run the code under a profiler. This will give you some idea of where it spends its time. Running under a first-class debugger (SoftICE?) will also help because you can haul off stack-traces and see what's been called from where.

  8. Re:Legal? by redelm · · Score: 2, Interesting
    I would presume that the code came from a liquidation/auction/takeover and the human capital that produced it is no longer available. First, I would try to hire one of the original sw architects to do some consulting. Who knows? They might have some email files that could be considered "part of the software".

  9. lots of Mountain Dew.... by warpSpeed · · Score: 3, Funny
    Oh, yeah, and hohos! Never underestimate the power of the hoho.

  10. I believe the instructor is assigning... by Burz · · Score: 2, Informative

    ...a maintenance task, not a coding task. S/he is probably looking for a UML model, as I implied elsewhere in the thread. IBM Rational, Gentleware, Borland and some FOSS projects have software just for this sort of thing: Modeling all of the classes, structs, member variables and functions along with displayable relationships (using arrows, lines, and nesting).

    What's more, some of these tools can be used to modify programs within the model, and then update the source code (forward-engineering). They can also create tables/databases from your persistent entity classes, represented with their own DBMS variety of UML icons...and can even update the actual database (sometimes directly, other times with DDL scripts) and track/display relationships between tables, and with the classes that use them.

    UML tools will seldom be able to reverse-engineer information about procedural code (declarations, conditionals, etc.), although this can usually be modeled by hand when such detail is necessary.

    1. Re:I believe the instructor is assigning... by QuantumG · · Score: 2, Funny

      Yep, lots of luck finding a single one of these tools that works on C code. Although making pretty pictures can certainly be a good way to get an overview of the software, and maybe students need that kind of assistance. Personally I think something like C-Scope is more than enough.

      --
      How we know is more important than what we know.
  11. Re:Legal? by jericho4.0 · · Score: 2, Informative
    Why, yes! It is legal. In fact, the right to reverse engineer a piece of software or hardware for interoperability is protected in the US, IIRC. Hence Intel clones, PC clones, Samba, etc.

    But the article poster has access to the source code, something not usually associated with 'reverse engineering'. Products are still protected by patents, copyright and trademarks, and writing Samba (for example) after seeing Microsoft's code would open one up to legal woes.

    IANAL, or USian.

    --
    "A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
  12. Use our tool :) by mr_tenor · · Score: 2, Interesting

    www.cse.unsw.edu.au/~drt

    Not that I'm biased or anything. The idea is to monitor the program while it's running and use the call graph to generate sequence diagrams and such. Feedback and ideas for further research welcome :)

  13. Have most of the code? by mnmn · · Score: 2, Insightful

    It is not 'reverse engineering' if you already have the code. So you'll be reverse engineering the part that you don't have the code for, and making sense of the code that you do have.

    Draw flow charts. Then assign a separate person to each module to make sense of it. Next you'll do what you plan to do....

    Make mods for it? Make a clone? Rewrite the code and sell the code? Recompile and port to Linux?

    --
    "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  14. There are some automatic UML generators by rgbe · · Score: 2, Interesting

    There are some automatic UML generators that will give you an overview of the code, or parts of the code:
    http://droogs.org/autodia/

  15. Re:Legal? by Macphisto · · Score: 3, Insightful

    "Human capital"? What are you, an alien overlord of some sort?

  16. Re:WTF? by kisielk · · Score: 3, Insightful

    Just because you have the code doesn't mean you know how the system is assembled and how all the components work together. "Reverse Engineering" is pretty loosely defined, but if you take it literally, it's just that: reversing the engineering process. From the description of the question, the poster is looking to take the finished product (the source for this game) and move back up to the high-level design phase. This means analyzing the module interconnections, class hierarchy, and that sort of stuff. It doesn't necessarily mean they want to "port" or "compile" it.

  17. Cross-reference first: Doxygen is your friend by treerex · · Score: 4, Informative

    It sounds like you are unable to build the complete system and run it, since you're missing functionality. This removes the possibility of using runtime tracing tools.

    The first thing I would do is run something like Doxygen over it to generate a cross-referenced description of the structures. It won't give you a global view of things, but it will give you a decent browsable view of the code itself. Another response mentioned GNU GLOBAL which may work better for you. Yet another possibility is LXR, though it may not work as well in C++. Regardless, a nice thing about Doxygen is that, when used with GraphViz, you can get useful diagrams generated showing class containment and file inclusion graphs.
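
    If Doxygen and GraphViz aren't set up yet, a crude file inclusion graph can be had with a few lines of script. This is only a sketch under assumptions: the regular expression, the list of file suffixes and all function names are mine, it only looks at quoted (project-local) includes, and it knows nothing about include paths or conditional compilation.

```python
import re
from pathlib import Path

# Matches project-local includes such as: #include "render/mesh.h"
# Angle-bracket (system) includes are deliberately ignored.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)

# Assumed set of source suffixes; adjust to the codebase at hand.
SUFFIXES = {".c", ".cc", ".cpp", ".h", ".hpp"}


def local_includes(source_text):
    """Return the quoted headers included by one source file's text."""
    return INCLUDE_RE.findall(source_text)


def include_edges(root):
    """Yield (file, included_header) pairs for every C/C++ file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SUFFIXES:
            text = path.read_text(errors="replace")
            for header in local_includes(text):
                yield path.relative_to(root).as_posix(), header


def to_dot(edges):
    """Render (source, target) pairs as Graphviz DOT text."""
    lines = ["digraph includes {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)
```

    Writing the result of to_dot(include_edges("path/to/source")) to a file and feeding it to GraphViz's dot gives a picture to start scribbling on. On half a million lines it will be a hairball, so it probably pays to run it one subdirectory at a time.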

    After you have that, get out your paper and pencil, and start drawing and manually tracing things. That's how I go about coming up to speed on new code I can't execute and step through. Eventually transfer that knowledge into a text file (or, nowadays, a wiki) so that others can benefit from it.

  18. Re:Legal? by redelm · · Score: 2, Informative
    Alien overlord? I love it!

    "Human capital" is a rather common economics term to refer to those skills and knowledge that enable an employee to produce the desired works. Use the wiki, Luke. In this case, it is the experience and serenity which makes the Tao Master of programming worth several novice salaries :)

  19. Resources For the Code Janitor by sohp · · Score: 4, Informative

    I applaud your professor or thesis advisor or whoever for this real-world task. Here's a few resources which I wouldn't do without:
    Code Reading: The Open Source Perspective
    Object-Oriented Reengineering Patterns
    Reading Computer Programs: Instructor's Guide and Exercise
    Tips for Reading Code

  20. A Couple of suggestions by jschmerge · · Score: 2, Informative

    I've been through this sort of exercise several times in my career so far. 500k LOC is too much for a small team to get a handle on in any reasonable amount of time, so don't feel too helpless... Your professor is throwing you guys to the wolves and seeing what you are able to accomplish.

    As for the actual suggestions, read on:

    First, you'll need a tool to generate some form of cross reference for the entire codebase... I'd recommend Doxygen (hack the config file to generate the inheritance and call graphs). This will speed up your ability to read the code; being able to look up the interface to any class with a couple of clicks in a web browser will make life a lot less painful.

    Next, find a text editor/IDE that's good at navigating large projects. This is a must. I personally do this with vi and ctags (although many people will tell you that there are better alternatives). Being able to look at more than one source file at a time is a good thing (tm).

    These are the two primary tools that you'll need. There are some other pointers that I can give too:

    • Become intimately acquainted with the project's build system. The separation of components into separate directories/libraries/modules will give you a great deal of insight into the overall program's structure. You'll be able to accomplish a lot of this by watching a complete build of the project progress. The other place to look is in the project's Makefile(s). I'd bank on the fact that most code stuck in bottom level subdirectories is code that you'll be able to treat as black boxes.
    • As you become more familiar with the codebase, you'll find that you keep coming back to certain source files to look something up. Understand that these files are the ones that are probably the most important. It may help you to keep a web browser pointed to the crossreference material for these files, or memorize their content.
    • Don't get bogged down in understanding every bit of the source. Probably 90 percent of the code in the project is used to do things that you really don't have to ever care about. A good example of this is a project I recently inherited, consisting of about 20,000 LOC. Four thousand lines of code in this project were there just to read XML config files into very simple data structures.
    • If you are having a difficult time figuring out how a piece of the code works, you may want to try running it in a debugger and stepping through the execution. I'm not a huge fan of doing this, but I know people who swear by it.
    • Import the source for the project into some form of version control system. This will afford you the luxury of being able to modify the code without fear of breaking anything too badly.
    • If you have access to the developer's source code repository, sometimes commit histories can give you a lot of insight into why things in the code are the way they are.
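
    As a complement to reading the Makefiles, a quick tally of source lines per top-level directory gives a first impression of where the bulk of the code lives. A minimal sketch, assuming the tree is organised into per-component directories; the suffix list and function names are my own and not part of any tool mentioned here.

```python
import os
from collections import Counter

# Assumed set of source suffixes; adjust to the codebase at hand.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".h", ".hpp")


def top_level_component(relative_path):
    """Map 'engine/render/mesh.c' to 'engine'; files in the root map to '.'."""
    parts = relative_path.replace("\\", "/").split("/")
    return parts[0] if len(parts) > 1 else "."


def count_lines(root):
    """Yield (relative_path, line_count) for each source file under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(SOURCE_SUFFIXES):
                full = os.path.join(dirpath, name)
                with open(full, errors="replace") as handle:
                    yield os.path.relpath(full, root), sum(1 for _ in handle)


def lines_per_component(entries):
    """Sum line counts per top-level directory."""
    totals = Counter()
    for relative_path, line_count in entries:
        totals[top_level_component(relative_path)] += line_count
    return totals
```

    Calling lines_per_component(count_lines("path/to/source")).most_common() lists the biggest components first. Raw line counts include comments and blank lines, so treat the numbers as a rough map rather than a measurement.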

    Anyway, good luck!

  21. I'm assuming that you have the source as a guide by ACORN_USER · · Score: 2, Informative
    My assumption is that you're to reverse engineer the software, but have been given fragments of the source as a guide, yet still have to show your methodologies so as to prove that you didn't just re-write the source.

    I'd start by actually reading the source - building it if you can. Run profilers on it and try to get some kind of visual representation of the underlying code tree. If you have source, try using something like DOXYGEN to autogen some documentation (and structure) out of it. Someone mentioned Rational - you can get a trial license. Try to understand what the code does. For the most part games are straightforward, in that you have objects that have specific behaviours. You can try to establish the object hierarchies and see if you can redefine these to make more sense - or just be different.

    For the fragments of source you don't have - try using tools such as truss to track flow of what is going on. GDB is your friend and you probably want to try running it through the debugger - especially if the extracts you were given were compiled without stripping the symbols. nm is also another useful one at trying to get an idea of the symbols in your binary and establishing 'from meaningful names' what on earth goes on inside.

    Push your binaries through a disassembler like ldasm or datarescue - win. NASM also has a disassembler. Try and get a feel for what is going on.

    Now comes the hard part - it's not called reverse 'engineering' for nothing. You've done the reverse bit. It's now time to engineer a solution which shows that you've gone through the 'reverse' bit. It can be your view on how the code should work. Don't be afraid to reuse resource files/bitmaps, etc. That's allowed. It's the code which counts. You'll probably find that the assignment gave you something which was sub-optimal, in either design or processing - or both. It's your turn to write it the way it should have been written. I'll leave the 'team dynamic' to you. Don't let one person have all the fun. Probably you - it's good to give others a chance. See what people are interested in and allocate the work load. Just be prepared to fix everyone's bugs the night before submission - it's not so bad - it's 'fun.'

  22. Massive? by idries · · Score: 2, Interesting

    First of all, this is not a massive code base for a commercial computer game; it's about average. Many games get into the 1-2 million lines of code range. Having said that, most games also have teams that are probably much larger than your group of students.

    I'm not exactly sure what you're trying to do here. As many ppl have said reverse engineering something that you already have the source for is not really reverse engineering at all. However if I make the (somewhat suspect) assumption that your objective is to examine the code and extract some kind of high-level understanding of the entire engine which you can then demonstrate in some way, I would advise you to think again. Most games (again, I am assuming that you have a commercially developed code base of some kind) are a giant mess with no overall design or direction in the code.

    Generally you'll find that a few sub-systems have been implemented with some kind of clean design (although not necessarily in a coordinated manner) and then the rest of the game is just a mass of glue code that holds these pieces together. During the original implementation no-one will have had the kind of general overview that you're looking for, each member of the team will know their specific area or areas, and how that part interfaces to the next, but no-one will know how all of them work together. Trying to summarize how all the systems work together will either give you something very high-level (and essentially meaningless) or something so complex that it's almost as hard to understand as the source (and not suitable to give to your professor as 'proof of understanding').

    My advice would be to choose one or more parts of the game and try to gain an understanding (in whatever manner you choose) of those areas. One of the best ways to choose these areas is to look at the USP (unique selling points) of the game itself. Some areas of the game will have been very important to the final product, while others will have been done just because they had to. For example, if the game is an RTS with a focus on the tactical aspect of the single player experience, then the scripting and AI systems will have been very important (and made as good as possible) while the sound engine will not have been very important (and made just good enough). The parts of code which are important to the actual gameplay will have had much more time and attention spent on them and will probably be far more interesting. Having said that, the most important parts of the game will also have had more ppl working on them and they may well contain much less readable code.

    Perhaps you should give us some more info on what exactly you want to do, so that we can give you more relevant advice?

  23. Reversing Std C by TheDracle · · Score: 3, Informative

    It's pretty simple, just time consuming. I've seen a few reverse engineering books floating around: "Reversing," "Exploiting Software." Since it's mostly stdC, it shouldn't be nearly as difficult to reverse engineer. Other languages can make things more complicated (Multiple calling mechanisms, more dynamic memory allocation, etc..).

    Tools:

    OllyDbg - Awesome usermode debugger, probably better suited than softice for this particular task. You can add assembly wherever you want, and it will create patches for the exe that can be automagically applied. It's also FREE.

    Numega Softice - Just in case you need to bring in the big guns.

    IDA Pro - Best reverse engineering tool available. Lots of extension scripts to do anything imaginable..

    TSearch - Can search memory at runtime, set breakpoints, disassemble code on the stack, and dynamically insert new assembly at runtime. Nice for understanding the flow of the software as it runs, and identifying interesting variables and structures.

    REC Decompiler - Awesome decompiler that produces a high level representation of the code. Not a replacement for your brain, but can save a lot of time tracing over assembly code to understand the purpose of a function.

    WinPCap & Ethereal - For reversing game protocols, and understanding client-server interaction. Sometimes it's nicer to just figure out where the host name/IP string is located in the binary and replace it with 127.0.0.1, then write a little proxy program to sit in between the client and the server.

    HVIEW: Hex editor with the ability to disassemble.

    (Use Cygwin or mingw for the following) strace: Traces signals, system calls, and spits them out to the screen.

    nm: Dump binary symbol table and names.

    I've definitely forgotten a plethora of other useful tools (especially the binutils ones), but the above consist of some of my favorites.
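
    The "little proxy program" mentioned under WinPCap & Ethereal could look roughly like the sketch below. It is written in Python for brevity and rests on assumptions: the game speaks plain TCP, one connection at a time is enough, and the names pump and serve_once are mine. It just copies bytes in both directions and keeps a log of what went past.

```python
import socket
import threading


def pump(src, dst, label, log):
    """Copy bytes from src to dst until src closes, recording each chunk."""
    while True:
        data = src.recv(4096)
        if not data:  # empty read: the sending side has closed
            break
        log.append((label, data))
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)  # pass the end-of-stream along
    except OSError:
        pass  # destination already gone


def serve_once(listener, upstream_addr, log):
    """Accept one client on listener and relay it to the real server."""
    client, _ = listener.accept()
    upstream = socket.create_connection(upstream_addr)
    # One thread per direction: server -> client here, client -> server below.
    back = threading.Thread(
        target=pump, args=(upstream, client, "server->client", log))
    back.start()
    pump(client, upstream, "client->server", log)
    back.join()
    client.close()
    upstream.close()
```

    To use it, bind a listening socket on 127.0.0.1 at the port the patched binary connects to, then call serve_once(listener, (real_host, real_port), log) and inspect log afterwards. A UDP-based game would need a different relay.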

    For a game, you'll probably be dealing mostly with OllyDbg, HVIEW, REC, and winpcap/proxy. I'd recommend using nm to get a list of all of the symbols in the program, and then maybe split up and assign each student some number of symbols to understand and rewrite in C. Then they can use HVIEW or OllyDbg to navigate to those symbols, and try translating them. If they have a difficult time, have them use REC to get a higher level representation they can cheat off of.
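
    Splitting the nm symbol list among the students is easy to script. A small sketch, assuming nm's default three-column text output (address, type letter, name) with no demangling; the function names and the student names in the test are invented for illustration. Type letters T and t mark symbols in the code section, which is where the functions are.

```python
def defined_functions(nm_output):
    """Extract names of code symbols from default `nm` text output.

    Lines look like '0000000000401136 T main'. Undefined symbols have no
    address column, so they split into two fields and are skipped.
    Demangled C++ names containing spaces are not handled here.
    """
    names = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] in ("T", "t"):
            names.append(fields[2])
    return names


def assign_round_robin(symbols, people):
    """Deal symbols out like cards, so workloads differ by at most one."""
    work = {person: [] for person in people}
    for index, symbol in enumerate(sorted(symbols)):
        work[people[index % len(people)]].append(symbol)
    return work
```

    Dealing alphabetically is the crudest possible split. Grouping by name prefix or by address range would probably keep related functions with the same person, but that depends on how the symbols are named.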

    -Jason Thomas.