Tools For Understanding Code?
ewhac writes "Having just recently taken a new job, I find myself confronted with an enormous pile of existing, unfamiliar code written for a (somewhat) unfamiliar platform — and an implicit expectation that I'll grok it all Real Soon Now. Simply firing up an editor and reading through it has proven unequal to the task. I'm familiar with cscope, but it doesn't really seem to analyze program structure; it's just a very fancy 'grep' package with a rudimentary understanding of C syntax. A new-ish tool called ncc looks promising, as it appears to be based on an actual C/C++ parser, but the UI is clunky, and there doesn't appear to be any facility for integrating/communicating with an editor. What sorts of tools do you use for effectively analyzing and understanding a large code base?"
I hear that the commentator guys are finishing a new product that, instead of commenting your code, is able to comment others'.
I've always found that stepping through the debugger at runtime is a decent way to start making sense of a large code base. Easier, anyway, than trying to read static code printouts. Just set a breakpoint at a point of interest, fire up the application, and use it as a starting point. You get a sense for program flow and it's a great way to generate questions--lots of them. (What does class SuchAndSuch do? It looks like the application is handling remoting in such-and-such a fashion; is that right?) You can also choose one aspect of the architecture and selectively ignore or step over other aspects, building up your understanding one aspect at a time. In my case, with Visual Studio as a development environment, I can hover the mouse cursor over variable names to see their current values. In the case of variables of a certain type, like datasets or XML structures, I can use realtime visualizers to browse the contents and get a much better feel for what's going on.
If there's no one at your company that can help answer your questions and bring you up to speed, I feel for you - your employers ought to know enough to give you some extra margin. It can be very hard to take over a large code base without some human-to-human handover time.
Also, is it an object-oriented system? I assume that it's not, based on your post, but you don't say either way. If it is, the important aspects of program flow often live in the interactions between classes and objects and the business logic is decentralized. OO is great, but it can be harder to reverse-engineer business logic because it's distributed among various classes. A debugger that lets you step through running code is almost essential in this case.
For C++ code, Doxygen can be useful, as it shows the class inheritance. As requested, it uses a (rudimentary) parser. It works with several other languages too, although I can't vouch for its utility for them.
Ne mæg werig mod wyrde wiðstondan, ne se hreo hyge helpe gefremman.
I use the Mark I eyeball, grep, emacs, and of course, the little gray cells.
(and GET OFF MY LAWN).
You should really be sitting down and attempting to understand the code, ASAP. Asking Slashdot for fancy tools isn't really going to help you. The real barrier here is your own brain.
If it's in a language that Doxygen can understand, that's the tool I would HIGHLY recommend.
google exuberant ctags and learn how to use the resulting tags file(s) with vim or your editor of choice
I'd give my right arm to be ambidextrous.
Printouts and colored markers.
The Kruger Dunning explains most post on
Sometimes it's hard to follow execution, especially in a large codebase. It's made even more difficult when a smug jackass wrote it to be as terse as possible.
Sorry I don't have an open source tool for you, but I've used Understand for C++ in the past and it was pretty helpful. To me, the most useful piece of information for understanding a large codebase is a browseable call graph. I'm sure there are simpler tools out there that generate a call graph, but this is the only one I've used with C++.
Sometimes tools like Rational Rose or Enterprise Architect are successful at reading in the code and building a UML model that you can then attempt to parse through. I'm not familiar with the use of either, but I know it can be done, with mixed results depending on the size and complexity of the code being analyzed. Both tools are fairly expensive though, I believe.
One might as well ask, why are you posting smarmy retorts when you clearly didn't understand the question? The question was about understanding the program, not the underlying language.
I don't care if it's 90,000 hectares. That lake was not my doing.
Sorry about that.
Why have 1 person driving a backhoe when you could employ 20 with shovels?
ETrace : Run-time tracing http://freshmeat.net/projects/etrace/
This book is worth a read http://www.spinellis.gr/codereading/
Draw some static graphs of functions of interest using CodeViz http://freshmeat.net/projects/codeviz/
Write lots of notes, preferably on paper with a pen rather than electronically.
Help children born unable to swallow - www.tofs.org.uk
Yes. Understanding code is one of the things you hire tools for.
...
Wait, were you talking about software?
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Creating small demo apps that use the code can also help.
mhack
Building a better ribosome since 1997
I use Doxygen for C code, and it is really helpful. One of its most useful features is that it generates caller and callee graphs for all functions. You can also browse the code itself in the generated HTML pages, and the function calls are turned into links to the implementation. Data structures and file includes are also pictorially graphed for easy browsing.
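To make the caller/callee idea concrete, here is a minimal sketch of what such a graph contains. It is not how Doxygen works internally, and it analyzes a toy Python snippet rather than C, since Python's standard `ast` module makes the parsing step short. The function names in `SOURCE` are made up for illustration.

```python
import ast
from collections import defaultdict

# Hypothetical toy program; the names are invented for this example.
SOURCE = '''
def main():
    cfg = load_config()
    run(cfg)

def load_config():
    return parse("app.conf")

def parse(path):
    return {}

def run(cfg):
    parse("extra.conf")
'''

def build_call_graph(source):
    """Map each top-level function name to the set of plain names it calls."""
    graph = defaultdict(set)
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            graph[node.name]  # touch the key so leaf functions appear too
            for inner in ast.walk(node):
                # Only direct calls by name are recorded; method calls and
                # calls through variables are ignored in this sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    graph[node.name].add(inner.func.id)
    return graph

def invert(graph):
    """Turn a caller -> callees map into a callee -> callers map."""
    callers = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].add(caller)
    return callers

callees = build_call_graph(SOURCE)
callers = invert(callees)
print(sorted(callees["main"]))   # ['load_config', 'run']
print(sorted(callers["parse"]))  # ['load_config', 'run']
```

The "callers" view is usually the more useful one when you land in an unfamiliar function and want to know who depends on it.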
If the system you need to understand has a really big undocumented architecture, then this presentation might be useful to you (there is a research paper, but it's not free yet). In it, the authors present a systematic method of extracting the underlying architecture of the Linux kernel.
GNU Global is able to generate a set of HTML pages from C/C++ source code. This tool has helped me several times. All member variables, functions, classes and class instances are hyperlinks. It provides an easy way to examine source code. It also provides tags for several text editors (for Vim and Emacs especially). http://www.gnu.org/software/global/
Seriously folks, having spent large chunks of my working life having to decipher the mess of those who came before me I cannot stress enough the importance of clear comments, variable/function names, and consistent and readable syntax. AND WRITE F@#$%ing HUMAN READABLE DOCUMENTS DESCRIBING FUNCTIONAL REQUIREMENTS, ALGORITHMS USED, LESSONS LEARNED, ETC.
Calling all your variables "pook" or the like may be very cute, but does not help me figure out what the heck the function is supposed to do or why I would ever want to call it. Yes it's a pain. Yes we're all under time deadlines and want to get it working first and go back and document it later. And yes, it WILL bite you in the ass (ever heard of karma? your own memory can go and then you have to decipher your OWN code!).
That said, if you have inherited a code base from someone who ignored the above, go through and generate the documentation yourself. Write flow charts and software diagrams showing what gets called where and why. Derive the equations and algorithms used in each piece and figure out why the constant values are what they are. Finally, start at the main function or reset vector (I do a lot of microcontroller development) and trace the execution path.
Get the guys who use it to explain what they're trying to do, read the code for a couple of days and then have them show you how they use the application. Then plan on six months to a year to get to the point where you can look at buggy output and know immediately where the failure is occurring. In the meantime just work in it as much as you can and don't try to redesign major parts of it until you know what it's doing.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
It is unlikely that your job is really to 'grok it all'. Most likely there are specific issues that need to be solved - stop panicking and pick the simplest one on the list and start working on it.
In a similar position to you, I followed Brooks's advice to study the data structures and found it good. Also just running the application under a debugger, inserting breaks in important looking code and then having a look at the call stack when that code was used also proved enlightening. A good debugger also lets you explore the data structures.
When smart-asses tell you "Bill would have fixed that in ten minutes." I recommend replying "I never met Bill, why do you think he left?"
Namgge
Emacs and etags are your friend. Meta-. zips to the function under the cursor. C-s for incremental search. Meta-x grep-find for any other search.
Also, run the program with a debugger and step through it. Or put some print statements in key places and see what it produces.
I find that's all I ever need.
There. Now go play some cool javascript games!
I'm afraid you've set yourself an almost impossible task. IME, there are no shortcuts here, and it's going to take anywhere from a few months to a couple of years for a new developer to really get their head around a large, unfamiliar code base.
That said, I recommend against just diving in to some random bit of code. You'll probably never need most of it. Heck, I've never read the majority of the code of the project I work on, and that's after several years, with approx 1M lines to consider.
You need to get the big picture instead. Identify the entry point(s), and look for the major functions they call, and so on down until you start to get a feel for how the work is broken down. Look for the major data structures and code operating on them as well, because if you can establish the important data flows in the program you'll be well on your way. Hopefully the design is fairly modular, and if you're in OO world or you're working in a language with packages, looking at how the modules fit together can help a lot too. Any good IDE will have some basic tools to plot things like call graphs and inheritance/containment diagrams; if not, there are tools like Doxygen that can do some of it independently.
If you're working on a large code base without a decent overall design that you can grok within a few days, then I'm afraid you're doomed and no amount of tools or documentation or reading files full of code will help you. Projects in that state invariably die, usually slowly and painfully, IME.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
The best programmers I've ever worked with didn't have degrees. But some of the worst ones did.
See:
:)
http://www.stack.nl/~dimitri/doxygen/
and:
http://uml.sourceforge.net/index.php
These tools allow you to 'visualize' a codebase in several very helpful ways.
One important way is to generate connection graphs of all functions.
These images can look like a mess, or a huge rail yard with hundreds of connections.
The modules, libraries, or source files that are a real jumble of crossconnected lines are a clear indication of where to start clean up activities.
Good luck!
That should be good for a laugh or three.
They'll be out of date, full of inconsistencies and incomplete.
Then you'll be reading the code only to discover that people's idiosyncrasies and personalities definitely affect their coding styles. (There's even some gender bias where women tend to set a lot of flags [sometimes quite needlessly] and decide what to do later in the execution while men code as if they knew where they were going all the time, just that when they get there, they're missing some piece of information or other.)
If you read code developed by a whole team of people, you'll get to know them, intimately.
Good luck. You'll be at the bar in no time... I kept the stool warm for you.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
I'm appalled by some of the comments that imply that the poster may not be fit for the job.
A few years back I had to maintain a large module written in C#. I had about 200K lines of code, 50 classes, zero documentation, zero comments, zero error logging support, and I was expected to find and fix bugs and add functionality the day after the module was handed over.
So if you were never in this position, just STFU. Yeah, the code is there, but what is this flag for? Is this part really used, or is it obsolete? What are the side-effects of using that method? And so on...
Eventually, I learned it, especially after some intensive debugging sessions, but it was frustrating to say the least. I would have loved to have some aiding tools.
Why? I can write crap and you can clean it up. This is Division of Labor, which is the basis of our civilization.
Apart from Understand for C++, I'd also suggest SourceMonitor - http://www.campwoodsw.com/sm20.html It will at least quickly point you to potentially problematic parts (long functions, deep nesting, etc.).
I would suggest a slight variation on the theme. Fire up the application, start it on one of its typical tasks, and then interrupt it in the debugger to catch it. While the process is stopped mid-flight, take note of the call stack to see which classes and methods are being used. Maybe step through a few calls, then let the program run some more.
By doing this repeatedly, you will quickly get a sense for which parts of the code see the most action, and would provide the most obvious places to start studying the code base, and provide the best bang-for-buck return on your time.
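The "interrupt it repeatedly and look at the stack" idea can be automated. Below is a rough sketch of that in Python, assuming CPython (it relies on `sys._current_frames()`); `hot_loop`, `rarely_used` and `workload` are hypothetical stand-ins for the real application. For a native program the equivalent would be pausing it in the debugger by hand, as described above.

```python
import collections
import sys
import threading
import time

def sample_stacks(target_thread_id, counts, stop, interval=0.005):
    """Periodically record every function found on the target thread's stack."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        while frame is not None:
            counts[frame.f_code.co_name] += 1
            frame = frame.f_back
        time.sleep(interval)

def hot_loop(seconds):
    # Hypothetical stand-in for the busy part of the application.
    end = time.perf_counter() + seconds
    total = 0
    while time.perf_counter() < end:
        total += 1
    return total

def rarely_used():
    # Hypothetical stand-in for code that barely runs.
    return 42

def workload():
    rarely_used()
    hot_loop(0.3)

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks, args=(threading.get_ident(), counts, stop))
sampler.start()
workload()
stop.set()
sampler.join()

# Functions that show up in the most samples are where the time goes.
for name, hits in counts.most_common(5):
    print(f"{hits:5d}  {name}")
```

The exact counts vary from run to run; what matters is the ranking, which points at the code worth reading first.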
Hey, Windows users, there is no such thing as "forward" slash, there is only slash and backslash.
I used to work at a company with a lot of Pascal and C code... It was extremely common (as in, all but a few) for programs to be written entirely in one code file. These files would go on for 20,000 lines or more. So many lines in fact that after the compiler had imported the header files at the top of the file that they would be over 65,000 lines long and the debugger would crap out because it had exceeded the int that it used for line number counting.
Sadly this isn't a joke.
Pete/Petri "damn, my chainsaw is clogged with 1's and 0's again." --clyde
Run these commands (or put them in a script):
ctags *
gtags
htags -Fan
It will create a ~\HTML folder with all the functions/variables cross-referenced. Open the file index.html or mains.html in your browser. If you're not running Linux, I think these utilities are included in cygwin http://www.cygwin.com/
Enjoy,
It's just the normal noises in here.
I'll plug my own open-source project for this:
Browse-by-Query -- it won't help with C/C++ (sorry for the original questioner), but it will handle Java or C#.
It dumps the code into a database and lets you query it to find the relationships.
I'm biased, of course, but I've found it's just the thing to understand how a particular piece of functionality in an unfamiliar code base fits into the big picture.
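As a generic illustration of the "dump the code into a database and query it" idea, here is a small sketch using SQLite from Python. This is not Browse-by-Query's schema or query language; the `calls` table and the function names in it are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")
# Hypothetical call relationships, as a code-analysis pass might extract them.
db.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("main", "load_config"),
        ("main", "run"),
        ("load_config", "parse"),
        ("run", "parse"),
        ("run", "render"),
    ],
)

def transitive_callers(db, function):
    """Return every function that can reach `function`, directly or indirectly."""
    rows = db.execute(
        """
        WITH RECURSIVE up(name) AS (
            SELECT caller FROM calls WHERE callee = ?
            UNION
            SELECT calls.caller FROM calls JOIN up ON calls.callee = up.name
        )
        SELECT name FROM up ORDER BY name
        """,
        (function,),
    )
    return [name for (name,) in rows]

print(transitive_callers(db, "parse"))  # ['load_config', 'main', 'run']
```

Once the relationships are rows in a table, questions like "what could break if I change this function?" become one query instead of an afternoon of grep.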
That doesn't always work for a code base with millions of lines of atrociously written code. I've worked with code where it is absolutely not feasible to step through everything.
It seems like in those cases I end up working from effects... I note some program behavior and then try to find exactly what causes that behavior, which can be surprisingly difficult if you are dealing with the "right" kind of code. After a while, though, the patterns begin to emerge in the system as a whole.
In fact, it nicely highlights the difference between "software engineers" and "code monkeys". Code monkeys just dive in; they never pause to think. In fact ... they tend to avoid thinking. It's not their strong point. After all ... they're paid to code, right? Not to think. Software engineers on the other hand, look before they leap and spot the places where they need to pay attention first. And they're systematic about it.
In fact, a software engineer will happily spend a day or two putting the right tools in place, *including* a full backup and a proper version management system for when he's going to have to touch anything.
The first thing you want to know about a new code base (after you find out what it's supposed to be doing) is its structure. Tools like Doxygen (see previous posts) show you that structure *far* quicker and *far* more reliably than any amount of dumb code-browsing can. And besides ... once you do it, you've got that documentation stashed away securely instead of milling around incoherently in your head (you'll have completely forgotten most of what you read by next month) or on disorganised pieces of note paper.
The second thing is to figure out if it calls any "large" functionalities like subroutine libraries or even stand-alone programs like databases, let alone if it makes operating system calls. The call-tree will give you an excellent view, and the linker files can complete the picture. You wouldn't be the first maintenance programmer who found out after months that his application critically depends on some other application he wasn't told about.
The third thing is to see where your code does dirty things. Let the compiler help you. Just compile your application with warnings on and have a look at what the compiler comes up with. You might be surprised (and horrified). Then compile with the settings used by your predecessor and check that your executable is bit-for-bit identical to what's running (you wouldn't be the first sucker who's given a slightly-off code base).
If performance is at all important, then running the whole thing for a night on a standard case under a good profiler will also tell you lots of important things. Starting with where your code spends its time, where it allocates memory and how much, and where the heavily-used bits of code are. All neatly written down in the profiler logs.
Finally, run your application with a tool to detect memory management errors the first chance you get. Useful tools are Valgrind (in a Linux environment), Purify (expensive, but probably worth it) under Windows, and sundry proprietary utilities under Unix. Just about 90% of the errors made in C programs come from memory management problems, and half of them don't show up except through memory leakage and overwritten variables (or stacks .. or buffers .. or whatever). You'll need all the help you can get here, and as far as these errors are concerned, dumb code browsing is useless. Just keep your head when looking at reports from such tools ... they can throw up false positives. Ask around on a forum with specific questions if you're allowed, or ask your supervisor. After all ... you showed due diligence.
When you know all that (if you have the tools in place, all of this can be done within 1 day + 1 overnight run + 1 hour reading the profiler output), go ahead and trace through the code in a debugger. You'll be in a *far* better position to judge what you should be reading.
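For the profiling step mentioned above, this is roughly what "run a standard case under a profiler and read the log" looks like, sketched with Python's built-in `cProfile` and `pstats` rather than a C profiler. The functions `slow_sum`, `fast_path` and `standard_case` are hypothetical placeholders for the real application.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Hypothetical hot spot.
    return sum(i * i for i in range(n))

def fast_path():
    # Hypothetical cheap call.
    return 1 + 1

def standard_case():
    fast_path()
    return slow_sum(200_000)

profiler = cProfile.Profile()
profiler.enable()
standard_case()
profiler.disable()

# Write the five most expensive entries (by cumulative time) to a text report.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists call counts and times per function, which is the kind of evidence that tells you which parts of the code deserve a close read.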
...if you don't understand the language? Yes, it's hard to understand questions when you don't understand the language.
I'm sure you can find some remedial English classes if you look.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
If your project is object oriented, you may be able to get your UML modeling tool to import the code and visualize the classes. When you do this, you'll probably get a HUGE diagram that seems just as unwieldy as looking at the code. The trick is to apply a filter to the model, so you're not overwhelmed with detail. Your UML tool should be able to do that for you.
I recommend focusing on all interface classes first. This can give you a remarkably sane picture of a system, and will help you divide up the code into more conceptually meaningful chunks.
The tool I use is Enterprise Architect, which does quite a lot of heavy lifting yet is still inexpensive enough for me to own a personal copy.
"We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
Error 'Format Conversion Error, converting from Y2K to Z2L' added to module x1
Error 'Out of Memory Banks' added to module x2
Error 'Object Expected; found adjective instead' added to module x3
Error 'bitbucket 95% full; please empty' added to module x4
Added 1,000,042 to some random value in module x5
Added 5,555,555 to some random value in module x6
Not only will you learn about the code, you'll make a great impression on your boss, when, within minutes, you are able to resolve some mysterious problem that has never happened before.
The best tool is your brain, applied liberally. Here's some thoughts to put in it
Feathers, Michael. Working Effectively with Legacy Code, Chapter 16 especially.
Spinellis, Diomidis. Code Reading: The Open Source Perspective, Chapter 10 lists some tools for you.
My own thoughts now. First, don't trust the comments, they are probably outdated. Second, if it's a big code base, forget the debugger. Write some little unit test cases that exercise the sections of code you need to understand, and assert what you think the code is supposed to do.
Finally, unless you are cursed with a codebase which is not kept in version control (in which case, ugh, time to start the jobhunt up again maybe), then take a look at the revision history. See what changes have been made to the area you are working on. With luck, someone will have put in a revision message that points you towards greater understanding of why a change was made, which will in turn nudge you towards knowing the purpose of the section of code that was changed.
An *excellent* strategy and thorough explanation. Especially the bit about stopping to think and devise a plan rather than just diving in headfirst. All spot on!
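The "little unit test cases" suggestion above might look like the following sketch, written with Python's `unittest`. The function `legacy_shipping_cost` is a hypothetical stand-in for an undocumented routine; the point is that each test writes down what you believe the code does, and a failing test tells you your belief was wrong.

```python
import unittest

def legacy_shipping_cost(weight_kg, express=False):
    # Hypothetical stand-in for an undocumented function in the code base.
    cost = 5 + 2 * weight_kg
    if express:
        cost *= 2
    if weight_kg > 20:
        cost += 10
    return cost

class ShippingCostCharacterization(unittest.TestCase):
    """Each test records one observed behaviour of the legacy function."""

    def test_light_parcel(self):
        self.assertEqual(legacy_shipping_cost(1), 7)

    def test_express_doubles_cost(self):
        self.assertEqual(legacy_shipping_cost(1, express=True), 14)

    def test_heavy_surcharge_is_added_after_doubling(self):
        # 5 + 2*25 = 55, doubled for express = 110, plus surcharge = 120.
        self.assertEqual(legacy_shipping_cost(25, express=True), 120)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    ShippingCostCharacterization)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The third test is the interesting kind: it pins down an ordering detail (surcharge after doubling) that a comment or a guess could easily get wrong.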
The only thing I could possibly add is to say "gather resources to understand the *purpose* of the system", either through documentation or by speaking with project management and/or end users. If you can learn the business rules and processes, that will be an enormous help in understanding the code's design.
There are two kinds of hard problems in programming: problems that are hard because they require ingenuity and deep thought, and problems that are hard because they require weeks of unraveling someone else's garbage.
There are some horrible programmers out there and I have on many occasions been tasked with cleaning up their messes. In your situation I would suggest either a) try to figure out if it would take less time for you to implement it in a clean and maintainable way or b) find someone else you can hire who knows the code base or at least is more familiar with the specific problem.
If you can't do a or b then you're screwed. In that situation, personally, I would either quit, ask for a different project, or print out the whole source code and sit back with a pen and start studying and commenting - one of the few tasks for which I still prefer dead trees.
I've since added both cscope and freescope, as well as the old Red Hat Source Navigator for good measure.
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.
Tests are indeed very good to understand a code base. Nearly all the last year I was working on a code base that nobody understood completely, although I had someone to ask about the general code structure. Writing tests helped me to understand what some parts of the code actually do. And where I needed to change things I could make myself sure that I didn't break anything.
Another great tool is valgrind+KCachegrind - it gives you really nice call trees. Vtune can do something similar as well, but IMHO the output is not as good as in KCachegrind. The only problem, of course, is that valgrind makes your program very slow and it is, AFAIK, not available on MS Windows. Vtune, OTOH, runs the program at normal speed, but its calltree output is ugly, at least on Linux.
If these two options are not for you then you might add a trace output to each function. IMO this is better than using a debugger - especially in C++ with BOOST and STL, where a lot of stepping goes through inline functions. With proper logging levels you can get a very useful output to see what's going on. It helps to understand the code, and it also helps if you hit a bug.
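A minimal sketch of per-function trace output with logging levels, in Python rather than C++ (where a macro or a small scope-guard object would play the same role). The `traced` decorator and the two message-handling functions are made up for illustration.

```python
import functools
import logging

log = logging.getLogger("trace")

def traced(func):
    """Log entry and normal exit of func at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        # Note: if func raises, no "leave" line is written in this sketch.
        log.debug("leave %s -> %r", func.__name__, result)
        return result
    return wrapper

@traced
def parse_header(raw):
    # Hypothetical helper.
    return raw.split(":", 1)

@traced
def handle_message(raw):
    # Hypothetical entry point.
    key, value = parse_header(raw)
    return key.strip(), value.strip()

# DEBUG shows the full trace; raising the level to INFO silences it.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
handle_message("type: ping")
```

Because the trace goes through the logging level, it can stay in the code and be switched on only when you are investigating something.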
It's more like a carpenter asking for a nail gun because it's quicker, less tiring, with less chance of damaging themselves. Any carpenter with any sense would ask for one, just like any coder with any sense would ask for these tools.
It's inexpensive, and scales astonishingly. I've spent the last two years in it, and it's just how I audit code nowadays.
You can browse your code, following dependences and definitions. You can also construct queries to isolate what statements can affect a particular variable, and a bunch of other tricks based on static analysis. There's a programming interface too.
Other good ways to get your head around code (speaking as a software engineer, rather than a guy promoting his company):
Well ... some good points, and some I'd say are too detailed at this point.
I totally agree with point (1). I forgot to mention it since I assumed (always a bad thing) that the author actually could compile and run the thing. An important point to keep in mind. Thanks for bringing it up.
Points (2)-(5) however all come after you've understood the basic structure of your code base.
Next, I'd say that a fairly junior software engineer trying to tackle a large unknown code-base without proper tools is doomed to failure no matter what. So the step from "If you're in a rush" and "You are in a paid job and expected to deliver predictable results." to "forget about tools you're not familiar with and just dive in" is an exercise in self-delusion and a recipe for disaster. Nothing less. It's like someone rushing out of the house and sprinting for work because they don't know where they put the car keys or their bus ticket and feel they are too much in a rush to search for them.
Besides, producing automated documentation is a good way to communicate. The tool communicates the structure of the code-base to you, and you can use e.g. the call-graphs to (efficiently) communicate the complexity (or otherwise) of the code-base to your supervisor. It also communicates to him how you are approaching the problem, which is likely to be a plus.
Now suppose the codebase is really difficult. A competent software engineer is, like any other kind of engineer, co-responsible for making actual and potential trouble spots *visible* to management. Preferably before they explode. Although it's popular wisdom to despise Management, if you, the hands-on person, don't tell Management of the problems, you ensure that they're driving blind. You rob them of the chance to do anything about it before the problem becomes so acute that even they'll have to notice. They will recognise it if you do and keep it in mind when they have to assess you. Depend on it. Besides, you just happen to be the only one who can tell them, and you fail in your responsibility if you don't. Part of a software engineer's job is to *communicate*. Now you can't give your supervisor any honest estimate of how well you have the new code base under control before you get to know it. And tools really really help you save time and allow you to get a much better overview.
Communication works both ways. If, with all the tools you use, you are unable to understand the code-base, you lack one or perhaps two elements that distinguish a basic software engineer from a good or even a great one. Talent and experience. And you should be honest with yourself and your supervisor about that too. If the job really is too hard for you, have the guts to own up before you mess up and thereby save yourself and your company a lot of trouble. And believe me ... there are lots and lots of good jobs in software development / maintenance that can be done without a surfeit of either. Such is the power of engineering.
Now Doxygen (or similar tools) may be unfamiliar to the author, but such tools really work. Besides, I've seen students download, understand, and use Doxygen in less than 1 hour after they were told about it.
I had a very wise undergrad EE prof who said on the first day of design class that we needn't worry about the many "complicated" things that we would have to design during the course because we had already completed all of our circuit analysis courses. He said it's much harder to figure out the details of someone's design than to design it yourself. Same applies here in software. I've been there working with others' undocumented code and quite frankly it was infrequently that I left the project with more respect for the programmer. Here I'll just say what I learned from the experiences, as useless as it might be.
If the coding style used is appropriate you stand some chance. Lines of code don't matter much when behavior is sufficiently complex that you cannot list the states and events that trigger execution and state change let alone keep track of them in your head long enough to understand their context.
I once had a similar problem with some legacy OS9 c code that performed a simple communication task and updated a monitor. With no documentation from the writers I was to "simply add some new data to be collected and display it." The problem with this 3000 loc was that it was written as a state machine with no modularization - next to impossible to follow in a debugger. What I wanted to do is run a performance analyzer along with the code but I was told that was "out of budget". This would have told me at least the parts of code being executed frequently and I could start to associate the external events with the code processing.
On very large applications like AT&T's RNS (residential account management for BellSouth) that exceed million-lines-of-c-code the only thing that made the application workable for new features was the fact that it was created in a CMM III product environment thus it was well documented in design, development, testing, feature changes, bug fixes, etc. Even with all of this the number of processes and related data stores still showed a lot of bleed over and function duplication (there was no simple way to determine if a function was in existence that already did what you needed and even harder to determine if it was state data dependent and thus unusable in certain other states). Attempts by us (contract coders mainly) to get the company to allow us to build a function-finding-tool/database to eliminate this problem fell on mostly deaf ears.
Because of this we had to depend on the longer-lived of the system architects to get an idea of where functionality existed. There were many times though when no one knew and weeks had to be spent reverse engineering communication structures, working out what the heck undocumented stretches of code did, re-writing the documentation correctly and then starting to implement the feature or correct the problem that had "been there for years." Management did not like the time taken to repair poor coding as this was not included as one of our trackable metrics and therefore not in our feature/bug's budget (since it was not considered to be either).
RNS sounds bad but it was a breeze compared to that tightly optimized state machine code without documentation. So, my recommendations are:
1) If it is stream-of-thought-code (kind of like Faulkner's The Sound and the Fury), not modularized, not documented, tell your manager that it most likely will have to be re-designed to understand it fully. That means do an essential model of its processing and data stores, use-cases, objects and events or whatever rigorous methodology you prefer. Then use that to re-write it. If management doesn't want to do that then you do not work for a company interested in maintainable code but one that wants a cheap fix. I would leave as soon as you get from them what they took from you in suckering you into the place.
2) If it is structured and/or developed in a "self-documenting-language" like Ada, Modula, Eiffel, etc. that forces structure (or at least makes it easier to write structured rather than unstructured), finish documenting it properly a
Be as you would have the world become.