Learning and Maintaining a Large Inherited Codebase?
An anonymous reader writes "A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't. I spend a huge amount of time finding the right place to make a change, far more than I do changing anything. How would you learn such a big hunk of code? And how discouraged should I be that I can't seem to 'get' this code as well as the original developers?"
If you don't have access to the original developers and they didn't document it you're going to just have to spend a lot of time reading the code. =\
"Ubuntu" -- an African word, meaning "Slackware is too hard for me". - stolen from Dan C alt.os.linux.slackware
Try to single-step it in debugger from the beginning up to main loop.
Coding etudes
So you have been handed the steamin' pile o' code, it is great that you are very cautious and deliberate when modifying it. Make a set of regression tests, that is, make a set of test data and procedures and expected results to ensure original functionality that is still desirable is still working and no other errors introduced. It is hard, much more tedious than just creating new code with few constraints.
Doxygen is your friend. run it over the source code and keep the HTML handy for searches and cross references.
If it's Perl or VB, you might want to consider self-immolation as a first step.
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
First of all, 30-40,000 lines of code is not lots of code. Try, 250,000 of code.
To start, use a good programming editor/environment (Xcode, Vslick, Visual Studio, etc.) that gives you the ability to easily go to definition or references to variables, functions, structs and such. Run some sort of profiler or flowchart type program on it to get a high level view of the code and how it fits together. If you can get the person(s) who worked on it before you to give you an idea of it fits together.
Fight Spammers!
(And then shoot him.)
I find that if the other programmer wrote it in such a way where it's too complex for me to follow, I'm not the one who's a moron.
Anything ranging from just sketching out some informal package diagrams on some paper (I quite like using an A3 sketchpad) to something more like Code City which can work with code in smalltalk, java, and c++. There are UML diagram makers, of course, but automated diagrams like that probably need to be edited.
In fact, it is not the finished diagram that helps so much as the drawing of it, which is why paper and pencil is so good. Or a vector graphics package.
I'll echo some earlier comments.
Set up an execution environment with debugger, and run several typical scenarios and trace them with debugger. Get the feel of the big-picture execution scenarios/paths.
It will take time for your brains to get comfortable with it, though. And the details, when you look into them, will throw odd stuff at you. But that's the nature of our work.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
For culinary folks...
The time and money you spend tracing and inserting noodles in the spaghetti will end up being larger than the time it takes to cook a new batch (no pun intended).
For auto folks...
The time and money you spend bondo-ing, welding, rewiring, duct-taping, and C'n'Cing parts for the car will end up being larger than the time it takes to design and build a new car. (Although restoring an old/vintage car for the sake of nostalgia is a much more pleasing experience than buying a new one).
Gain an understanding of the purpose of each pivotal region. Know what your desired result should be, then begin the rewriting endeavor.
'We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.' RPF
PL/SQL or cobol or whatever they throw at me I poke, prod, and play with it in a test environment. Someone up above mentioned pencil and paper to draw out how everything relates and that is a very good practice I've found to just get to know things. It's not instant but it helps more then you initially think. Also I use Open Office Draw to map out things as well. :P
~~ Behold the flying cow with a rail gun! ~~
I inherited a code base of 1.5 million lines of code at the last job I was at. Thankfully I wasn't the only one responsible for it. My advice to the original poster is to add lots of logging information. Log statements should document what the code is doing at any point in time and tell you where it is doing it. If it's java you can get the stack trace from anywhere--this is very handy for logging.
Ha, ha! Just 4 months ago I joined a project with a code base of about 500k lines. I would call that (the 500k lines one) intermediate in size. There are code bases with many millions of lines. I now feel pretty comfortable finding things in it. And I mostly use find and grep.
When 1person suffers from a delusion,it is called insanity.When many people suffer from a delusion,it is called religion
Very well, sir. Here's your 40,000 lines of Perl from the late 90s. It's mostly regex to parse revisions 30 through 451 of our in-house provisioning system. Oh, and BTW don't screw up like the last guy who had this job. He provisioned 32767 customers with tier-1 service, and it was the director's job to explain why we either had to let them have it for the remainder of the year, or else deal with the CR issues.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
And then you re-implement it in the latest language.
Deleted
I have inherited projects and do my best to convince management that a pause is needed to document the code. Personally I try to flowchart the functionality and cover a couple of office walls with Visio printouts. Later on I can use such work to add detail and further documentation.
I inherited some code where the developer used names of girlfriends in variable names, it was just dumb and completely unprofessional. I didn't worry so much about keeping track of those, I was more worried about a change in one spot having unintended (and perhaps unknown until too late) consequences. Rather than spend time fixing problems, I thought it best to do some up-front documenting to at least provide a path to successful maintenance.
When I left the project, the manager had a binder of documentation and almost cried.
I had an English professor who always said, "Structure is the key to understanding." He was talking about literature, but I think the same is true for programs as well.
Try to understand the structure of the program. What is the basic flow? It should have an initialization routine, a main loop, and a shutdown routine. Find out roughly where they are, then focus on the main loop. Usually there will be one piece of code that is central, and it will occasionally pass control into other large pieces of the program. Sometimes there will be more than one main loop, and control switches back and forth between the various main loops. If the program is event drive, this will make a difference in the structure.
If you are just trying to make a small change, try to find the sequence of events that will lead up to where that change needs to be made. Follow the sequence of execution until you get to the line you need to change. If you are changing a single variable, sometimes it's helpful to do a search and find all the places that variable is used, to make sure your change won't have any side effects. This may seem time consuming, but it can save 10 times more in debugging.
Learn to follow code execution with your eyes, without running a debugger. One thing that separates good coders from not so good coders is the ability to follow code that isn't being executed.
Qxe4
Oh yeah, well I just inherited a code base of 2.8 trillion lines of assembly code, and I have to read it over a 12.734 baud VAX connection! Why, back in my day...
Anyway... I've taken on a few large-scale software projects before, and my approach has always been "read twice, hack once". I agree with the the parent, and I'll add a note: for the love of everything sacred and unholy, use revision control, and don't trust it -- that is, back up incessantly. Document the hell out of your process. Once you've really learned the system, you might want to back out some of the newbie mistakes that you're making right now.
And yes. Learning a big system takes a lot of time -- you should be reading much more than writing until you've learned it. I find it helpful to diagram dependencies / draw up finite state machines.
yeah, the clown always creeped me out as well.
Mod me down, my New Earth Global Warmingist friends!
Am I off base here? What do you think about intermediate variables that are not strictly necessary?
I can't say you're off base per se (I don't have nearly enough production dev experience to make statements like that, and even if I did, I couldn't speak for everyone), but my personal style is not quite the complete opposite of yours.
I pretty heavily use intermediate variables. Why? A couple big reasons. One, if you give the temporary variables decent names, they serve as additional documentation. Two, if you're debugging, you can look at those intermediate values in a debugger (or log them) much easier than you could if they weren't explicitly stored somewhere. In most graphical debuggers you can just hover the mouse over a variable and see its value; if you didn't have that variable, you'd have to enter the expression in the immediate window or set up a watch or something like that.
As someone who recently passed off a pile of code of about that size in poorly written and poorly documented php to someone.. All I can say is I'm very very sorry, and I had *no idea* my personal side project would work better than the original commercial offering and be declared 'mission critical' three months before I left for greener pastures..
Identify each major portion of functionality. If you are working with a sales/billing system you would probably end up with : Orders, Invoices, Payments, Admin.
Go through each of those portions and identify the major portions. Orders: Order headers, Order details, business logic, ui logic, reports, datalayer, etc. Repeat until reduced into easily consumable units.
Pick and stick to an SDLC. Use whatever fits the situation and the resources. For a small project (under 100k lines of code) you should be good by yourself. Anything more and you'll have to involve at least 1 other person for testing. For medium (100k-500k lines) you'll need an additional dev...For large projects (500K-5M lines) you'll need a project manager, lead dev, 2 devs, 1 test, and a UAT team...For larger projects you'll have something unique and frightening to the specifics of the software project and corporation/agency making it...anyway, I digress...
Go through each subdivision line-by-line and re-write it yourself (even if you aren't going to put your re-written version into production); the only way you're going to truly understand what is going on is if you do it yourself. Use whatever language you are most comfortable with or is most appropriate to the task (or languages), it does not need to be the same as the original.
Verify that for a given input, your version produces an exact output.
Take a deep breath. It's not a race. It's a one-to-one functional mapping of your software (your mindspace) and the original software (the other developer(s) mindspace(s)). The code probably will not be straight forward. It has also been battle-scarred and will be warty. Changes of initial requirements through time and feature enhancements (feature creep) will have taken it's toll on what may have originally been something simple or even elegant. It's something of a niche mindset and if it is not for you, there exist many other exciting things to be programming.
Ultimately, if you do as outlined above, you'll solve many problems, be able to make whatever changes you like, and in so doing have a way to present your design as a replacement if you want...Or not, if you don't; for 30-40k lines parallel development makes sense, in a way, for one person.
Just out of curiosity, what is your opinion of a "Large" codebase then?
My first programming job was on an enterprise system that was over 7 million lines of just C++ code by the time I left, not including SQL stored procedures, web server code for the reporting system, and surely other code stuff that I can't recall. The entire development team for the system was something like 45 programmers. So to many of us, 30-40 klocs does not seem like a large codebase at all.
That said, I've also inherited code in the 10-50 kloc area of magnitude that was far more of a challenge/nightmare to decipher and maintain than that 7 million line system was. Code maintainability has more to do with good system architecture and coding standards than it has to do with the size of the code base; without those you system will likely collapse under its own bloat long before it can grow to millions of lines.
Momentarily, the need for the construction of new light will no longer exist.
I currently maintain several million lines of perl. It's not hard, it mostly just works, and when it doesn't, it's not that hard to figure out where it's broken IFF there is a consistent repro case for the problem.
If you have a proper development/production divide, there shouldn't be any weird production issues unless you or your predecessor missed some test cases. If you don't have test cases, that's a problem, if you don't have a properly firewalled and complete development environment, that's a problem, the code itself? Shouldn't be a problem.
Realities just a bunch of bits.
Medium size is 250 to 750 thousand lines of code (one person can still understand how it all works). Big is 1 to 10 million lines of code. Really big is >10 million.
I have worked on code bases of all of those sizes, and I like the medium size the best -- it's big enough to be interesting, and small enough that you can understand it all.
One that I've worked on (over 25 million lines) is just too big for my tastes -- over 3 hours to do a clean recompile is excessive.
Ian Ameline
Not without variables, but without unnecessary ones. For example, someone might write:
int a;
int b;
int c;
int d;
int e;
int f;
int g;
a = dropBox1.Value;
b = dropBox2.Value;
c = dropBox3.Value;
d = dropBox4.Value;
e = a + b;
f = c + d;
g = e * f;
result.Value = g;
While I would write:
result.Value = ( dropBox1.Value + dropBox2.Value ) * ( dropBox3.Value + dropBox4.Value );
If both the original developers and the knowledge they had have been lost, then it is probably already too late to perform any major maintenance on this code base. The project has already entered its “servicing” stage.
At that point, you basically have two possible approaches that actually work: you can restrict maintenance to small-scale changes, which may be sufficient if the goal is just to keep the project ticking over for a while, or you can accept The Big Rewrite (which isn’t so big in this case) in order to get a project that can be properly maintained.
If you want to go down the tactical changes path, there are a couple of approaches to finding your way around the code.
If you’re familiar with the general field of the software, just not this particular code, then you can work top-down. Start with the key, high-level concepts you know the program implements, and try to find the code that represents those:
Hopefully, if the code has a reasonable modular design and you just don’t know what it is yet, this sort of approach will identify the organisation of the code at a very coarse level, but then you can try to break down each area in more detail the same way.
Alternatively, you can work bottom-up. Find a significant starting point, such as:
Examine the code near that point. Look at what kinds of data it works with. Look at what functions it calls, and what functions call it. Try to figure out the wider significance of the code you started with, and the other code to which it relates. Then move up a level: what is the purpose of all of that code collectively? Repeat until you’ve explored as far as you need to.
After some other discussions about these topics, I recently wrote up a couple of articles with some more background information than I’ve given here — link in my sig if anyone’s interested (though be warned that they are pretty long).
"A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't."
A couple of times in your career? You must be lucky. Most jobs you can get coding will always involve taking over someone else's code.
In my experience, design patterns are your best friend, bearing in mind that most of the code base will always remain a black box to you.
For example, when I was doing some health insurance work, I had inherited a code base that was substantially larger than 30 or 40 thousand lines of code. The objective was to make the code that used an older, fixed-length record format work with the newer X837 EDI format, which is basically XML but almost without any tags to help you figure out where the data begins and ends. Suffice it to say that the task was to figure out how to smoothly stick a square peg in a round hole.
The task itself determined the design patterns, of which an adapter pattern was the most used. The type of pattern in turn dictated what in the code to look for in order to implement it, and (of course) how the new code would be built. For example, since we were using an adapter pattern, the first order of business was to find out how the data was represented in the code base, and then trick the "black box" into using your own spiffy, new representation of the data.
For the most part I didn't have to care all that much how the application handled the data as long as I got the right data into a form the application would accept in my adapater.
Perl is like the matrix. At a certain point, after you've stared at it long enough, it all just makes sense.
"I assumed blithely that there were no elves out there in the darkness"
As someone who has done probably 90% of his work in maintenance programming, let me give you my tips:
BTW, the fact that you have a hard time understanding this code may be more a reflection on the original authors' coding skills than on your abilities; any idiot can write code that "just works"; it takes a lot of thought, time and effort to write code that is maintainable, and more often than not, the original coders were short on at least one of those (if not all three). Here's hoping you have the time to follow my above tips; they take a lot of time, but can be worth it if you really need to maintain the code. It's funny to note that apart from the first one, most of those tips apply equally well to developing software from scratch. If the code already has a change tracking system, unit tests, a build/run/test system, *and* automated testing, consider yourself lucky and just start picking apart the unit tests.
Nathan's blog
What do you think about intermediate variables that are not strictly necessary?
Use them if they make things clearer for someone reading the code, otherwise don't. For example, you can write:
screen.displayName = user.firstName + user.lastName;
or you can write
String fullName = user.FirstName + user.lastName;
screen.displayName = fullName;
Thus making it more clear to someone reading that you are trying to use the full name. That is probably not the best example because anyone would probably understand that user.firstName + user.lastName is the full name, but I think you can see the main point, that sometimes it can be easier to read a few meaningfully named intermediate variables than a long equation. If it isn't easier to read, don't do it. But really when I read code, or even write it, I am willing to conform to either way of doing it if someone else feels strongly about it, because that is far less important than things like flexibility of major structures in the code.
Qxe4
When I joined a group that had a 2 Million SLOC program, I learned the most by fixing defects. It gave me a good reason to go traipsing through the codebase. It's painful, but it gives you purpose while reading the code. Just plain reading it gets boring.
I call that a module. Large is anything over 1,000,000 LOC. Step up.
he ports of GNU utilities to Windows [sourceforge.net] or Cygwin [cygwin.com] or even your own company's Interix [wikipedia.org] and Services for UNIX [wikipedia.org] products?
I had Win7 and Vista Ent with Services for Unix I downloaded, and it just did not feel right or work right. The command line utilities work, in part, because the whole OS in Unix is basically a tree of text files. windows isn't, and so, the utilities tend to be less effective. Plus, some gotchas like how Windows handles open files with applications, its all different.
I thought interix would be the ultimate, but it instead it taught me the opposite. If you want unix, use unix. It's that simple.
This is my sig.
Really. A guy asks a question for help and all of these people keep telling him 30-40,000 lines of code isn't much.
That's a lot of code to get your arms around if you didn't write it. It's not the end of the world, but it is a sizeable task, and is the type of topic that few professional journals or books will ever be written about.
Having been in similar situations, I my advice would be:
1) Try to get an understanding of the history of the code. Who wrote it? Why? How many developers? How long has it been around? Do people love it or hate it? Is there a version control system in place you can use for information?
2) Look at it from a technical viewpoint. Is is complete? Does it compile and run? How many languages are used? Are there interfaces with other systems you need to know about? What dependancies are there? How easy is it to setup a test server? What parts are well coded? What parts stink up the joint?
3) Dig for functional documentation. What does it do? For whom does it do it? What business needs does it support? How mission critical is it?
4) Meet with the business owners. Seriously. This helps you do two things: #1-- Define the real business need (which may be different than what was understood by the previous developers), and #2-- Set appropriate expectations about maintenance. You'll work hard to maintain and keep it working, but you are working from a disadvantaged position. It is important they know this and support you in your efforts, rather than complain loudly when something doesn't work.
5) Plan to remove the dead weight. There's always a lot of dead weight in these near-abandonded projects. Get an idea how to simplify things and plan your work in phases.
6) Setup real test and development servers. Yeah, you know that wasn't already done.
7) Use version control. But you know this. It's 2010, and no developer worth his/her salt would code a paying project without version control. Right?
8) All fixes will take much longer than if you wrote the code, so be careful with estimating time.
well that depends on how many developers we are talking about. The original question seems to indicate that the author has inherited the codebase. The need for this question wouldn't exist if the person were on some large team.
For one or two or five people, 40K lines is a sizable codebase, especially if it has been poorly maintained / designed.
blah blah blah
What do you think about intermediate variables that are not strictly necessary?
My general rules of thumb:
1) I don't care how many variables are declared, so long as each makes sense on its own. Like another poster's example, 'fullName' is perfectly acceptable (especially for i18n/l10n aware code that may have different rules for generating a name).
2) I ABSOLUTELY HATE clever arithmetic / pointer arithmetic / expressions all crunched into one line that can be split out. Example: in C-like languages that support pre- and post-increment, I expect the code to use only one or the other consistently, and never mix it with another expression. So this is fine:
i++; ...but this I can't stand:
j = i + 4;
j = ++i + 4;
#2 I picked up from a very experienced developer who pointed out that making the code harder to read is never worth it, the compiler produces the same code as the easy-to-read version. And that making code that looks 'too easy to be clever' is quite a bit harder than making code that looks 'too clever to always work'.
I am currently working with a mission-critical codebase, which is written in PHP and has absolutely no cohesive design to it. Well, unless you consider making everything static and unnecessarily inheriting other classes and overwriting static variables willy-nilly a cohesive design. There are business rules just everywhere and API requests everywhere and all kinds of calls that overwrite static variables. If you don't methodically trace logic it's really easy to get lost. What makes it worse is that there are many many variables that are named very similarly and you don't really know which one is right and which one is just going to get overwritten in some method call you are not looking at right now. And if this software fails, the worst case scenario is that my company makes no money. It really has made my life over the last few weeks pretty horrid. Fortunately I enjoy the job and the co-workers and am well respected there. Otherwise, it wouldn't be worth the aggravation.
My advice: communicate your difficulties to everyone who will listen (refrain from complaining or bellyaching, just communicate). If you inherit something like this, and it is mission critical, then you need to take as long as it takes to get it right. That's right, AS LONG as it takes. Take the time to document everything. Bother the crap out of anyone who can help you. You are responsible for doing your job, and part of doing your job is figuring out how to maintain this beast. And in order to do that, you need to use every resource at your disposal. If anyone wants to rush you along, you need to communicate the difficulty and the importance of the task. If you have been working at a place for a while and have done a good job to date, then they should trust you. If you're brand new, then you'd better hope someone there values your opinion and doesn't merely think you are incompetent. If you are asked to make enhancements, don't refactor until you understand the code. So make enhancements, leaving the potentially crappy code in place, even copying it if necessary. Steadfastly resist the temptation to refactor until you understand the entire piece that you are trying ti refactor. Don't remove seemingly unnecessary variables, and don't reduce seemingly redundant database calls. That comes later when you actually know what you are doing in there. IOW, if you have to navigate a lion's den by touch, don't stop to groom the sleeping lion (unless of course, that is your given task.)
The word inherit seems to imply that either the original maintainer no longer works there or has moved on to a different position. This means that it's you on the hook to figure it out. You've gotta dig in, buckle down, and get to it.
blah blah blah
That *totally* depends on the code base and the way the OP thinks. Sometimes they're a complete waste of time. Others...not so much.
I've worked with plenty of programmers who see pretty much every software problem in terms of FSMs. One size does not fit all.
Seeing software problems in terms of Flying Spaghetti Monsters? Ah, so that's where the "spaghetti code" term comes from!
This is Slashdot. Common sense is futile. You will be modded down.
It somewhat depends on the language used - some languages are easier to penetrate than others. And some languages does more in 10 lines than other languages do in 100.
But anyway - to learn the code you may have to find a starting point (there is usually at least one logical point to start) and then make a flowchart in PowerPoint or something for the general structure. It's no point trying to get into the finer details, just a general sense of flow. You will get things wrong in the beginning, but don't worry. And you may end up finding a lot of dead code too.
When you have a satisfactory overview of the code it's time to really swim and drink the code. Many programmers have a tendency to accept that "it works" and stop there. By throwing the code into the compiler at maximum warning level and then try to fix all warnings you will be even more involved. And if you aren't satisfied you can take on the code with code analysis tools like Splint (for C) or FindBugs (for Java).
And don't forget that the commands "find" and "grep" in *NIX are your friends. Other environments usually have other tools, and IDE:s have their own, so you don't have to install Cygwin or something to get a grip on things.
And if you think that you don't understand the code well enough - try to port it to another operating system or other language.
Of course - this takes a lot of time and consumption of your favorite hacking beverage.
And yes - I'm involved as a single developer in a system with about 400k lines of code written in Java, and it was ported from an older system written in C, C++, Basic, Java, DCL...
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
you're using windows 7 and batch files? use powershell, more powerful than any of the unix based shells (that i've seen) and there must be something wrong with your system...or non-OS processed using up CPU & memory....because i've used windows 7 on below minimum spec machines ( 1 GHz CPU and 512 MB ram and the command prompt was still very responsive.)
Source Insight lets you browse source code - very useful for largish codebases. It's much quicker than findstr or grep because it has an index rather than having to search the whole thing. It's not free of course but I'd never go back to findstr having used it.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
As a result, although the new functionality worked fine, the application still suffered for the "spaghetti" code of patches upon patches of years of various developers adding additional capabilities, but no one ever addressed the reliability of the application. The support group for this application was clearly frustrated with years of late night calls and hours and hours spent trying to correct errors.
About 6 months ago I was tasked with essentially "cloning" the application for new business purposes. I proposed porting the application to a newer, more modern language (java). It took a lot of selling (i.e. convincing management and other developers that the end result would run just as fast, be easier to maintain and have more reliability), but I was able to get them to buy off on it.
The rewrite was completed about 3 months ago and the results were better than i had hoped for. I was able to complete the rewrite in the same amount of time allocated for the original "enhancement" project. The application actually runs faster than the old one, has yet to crash (it runs 24x7), and the code is well structured and easy to maintain. We're now in the position that if/when another "enhancement" is requested to the old application, we can simply clone the new java version and completely replace the old app. Given the results of the last project, it won't be a hard sell (especially to the support group) to go the java route.
I know this is a long post, but the bottom line is that sometimes (more often than many realize), recoding an old application in a modern language and bringing it into the 21st century rather than patching old code can pay off dividends beyond the basic added functionality.
Sometimes the light at the end of the tunnel is the headlight of an oncoming train.
If you surf on over to Krugle.com, you will see that they now offer a free evaluation copy as a standard product. If you want to get a feeling for what can be done with the tool, just check out Krugle.org, where lots of open-source projects are indexed online. I would definitely recommend using the free evaluation tool as a way of speeding your high-level understanding of any new-to-you code base.