Learning and Maintaining a Large Inherited Codebase?
An anonymous reader writes "A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't. I spend a huge amount of time finding the right place to make a change, far more than I do changing anything. How would you learn such a big hunk of code? And how discouraged should I be that I can't seem to 'get' this code as well as the original developers?"
Yes it's still a bitch to maintain it. But 30k to 40k is by no means large.
If you don't have access to the original developers and they didn't document it you're going to just have to spend a lot of time reading the code. =\
"Ubuntu" -- an African word, meaning "Slackware is too hard for me". - stolen from Dan C alt.os.linux.slackware
Try single-stepping it in a debugger from the beginning up to the main loop.
Coding etudes
You are not them; your brain solves problems differently. I have found that by creating subs in areas where they have not used them, you can begin to rewrite the code little by little. Other than that, poring over it or using a debugger to step through the calls is your best bet for full understanding.
So you have been handed the steamin' pile o' code; it is great that you are very cautious and deliberate when modifying it. Make a set of regression tests: that is, make a set of test data, procedures and expected results to ensure that the original functionality that is still desirable still works and that no other errors are introduced. It is hard, much more tedious than just creating new code with few constraints.
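One way to build such a regression suite is a characterization (golden-master) test: record what the code does today, then flag any case whose output later changes. Here is a minimal sketch in Python. `legacy_price` is a hypothetical stand-in for an inherited routine, and the JSON file layout is just an assumption for illustration.

```python
import json
import tempfile
from pathlib import Path


def legacy_price(quantity, unit_cost):
    """Hypothetical stand-in for an inherited routine nobody fully understands."""
    total = quantity * unit_cost
    if quantity >= 10:  # undocumented bulk discount discovered in the old code
        total = total * 0.9
    return round(total, 2)


def record_golden(func, cases, path):
    """Run func on every case and save today's outputs as the expected results."""
    results = [{"args": list(args), "expected": func(*args)} for args in cases]
    Path(path).write_text(json.dumps(results, indent=2))


def check_golden(func, path):
    """Re-run the saved cases and return those whose output no longer matches."""
    failures = []
    for entry in json.loads(Path(path).read_text()):
        actual = func(*entry["args"])
        if actual != entry["expected"]:
            failures.append((entry["args"], entry["expected"], actual))
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        golden = Path(tmp) / "golden.json"
        record_golden(legacy_price, [(1, 2.5), (10, 3.0), (12, 1.99)], golden)
        # An empty list means the behaviour is unchanged.
        print(check_golden(legacy_price, golden))
```

Record the golden file once before touching anything, keep it under revision control, and re-run `check_golden` after every change. If a difference is intentional (a real bug fix), re-record and note why.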
Doxygen is your friend. Run it over the source code and keep the HTML handy for searches and cross references.
Make it your personal mission to soak the code in comments, refactor it where appropriate, et cetera. Diagramming it can help, too. Do all the things they should have done before giving it up; this will help you find what all of the functions do, and discover the important ones.
Just out of curiosity, what is your opinion of a "Large" codebase then?
If it's Perl or VB, you might want to consider self-immolation as a first step.
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
First of all, 30-40,000 lines of code is not lots of code. Try 250,000 lines of code.
To start, use a good programming editor/environment (Xcode, Vslick, Visual Studio, etc.) that gives you the ability to easily go to definitions of or references to variables, functions, structs and such. Run some sort of profiler or flowchart-type program on it to get a high-level view of the code and how it fits together. If you can, get the person(s) who worked on it before you to give you an idea of how it fits together.
Fight Spammers!
(And then shoot him.)
I find that if the other programmer wrote it in such a way where it's too complex for me to follow, I'm not the one who's a moron.
Anything ranging from just sketching out some informal package diagrams on some paper (I quite like using an A3 sketchpad) to something more like Code City, which can work with code in Smalltalk, Java, and C++. There are UML diagram makers, of course, but automated diagrams like that probably need to be edited.
In fact, it is not the finished diagram that helps so much as the drawing of it, which is why paper and pencil is so good. Or a vector graphics package.
The best way to figure out how the code does action X is to run it under a debugger while it does the action, inspecting how the data structures in the program change, setting breakpoints where the decisions are made to see what happens, etc. You get to see dynamically what the program is doing step by step with the computer keeping track of it for you, instead of puzzling it out from a static listing. Running the code that way is a much faster way to gain understanding than simply reading the code.
Thank you for validating my decision to get the hell out of IT.
The only way to learn the code is to work with it. Simply reading through it won't help, you have to go try to change things and see what works and what doesn't.
The main thing that bothers me when working with other people's code is the sheer number of variables they use. I tend not to declare a new variable unless it is absolutely necessary (and in object oriented programming variables other than pointers are almost never necessary). It seems like code written this way is easier to read and understand (and significantly smaller). This is Slashdot, so there are a lot of other programmers out there. Am I off base here? What do you think about intermediate variables that are not strictly necessary?
I wouldn't try too hard with a codebase as small as 30-40k lines, but for an actually large codebase, there are a bunch of different things that can help:
- Examine a class or function hierarchy and call graph. If you have tools to do so and the codebase is set up for it, go ahead. If not, set up the tools and codebase to be processed for this - you'll learn stuff about the code just by hooking these tools up.
- Pick medium-level routines in the code base that you are interested in and run the application in the debugger with breakpoints set on them. Take a look at the callstacks, step through the callers, look at the arguments, etc.
- You can also get a bunch of knowledge of the structure of the app by single stepping in the debugger - "step over" to see the high level control flow, and "step into" subsystems you want to explore.
- Documenting the existing code using a tool such as doxygen can help you learn it while at the same time providing useful documentation for other team members.
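If no call-graph tool is set up yet, a rough one is not much code. The sketch below assumes the inherited code is Python and uses the standard `ast` module; for C or C++ a tool like doxygen does the equivalent job. It is purely static and name-based, so it will miss dynamic dispatch and cannot tell apart two methods with the same name. `SAMPLE` is made-up code for illustration.

```python
import ast
from collections import defaultdict


def build_call_graph(source):
    """Map each function defined in `source` to the names it calls.

    Calls made inside nested functions are attributed to the outer
    function as well, which is good enough for a first overview.
    """
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    func = child.func
                    if isinstance(func, ast.Name):         # plain call: foo()
                        graph[node.name].add(func.id)
                    elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                        graph[node.name].add(func.attr)
    return {name: sorted(callees) for name, callees in graph.items()}


# Made-up sample standing in for a file from the inherited codebase.
SAMPLE = '''
def main():
    cfg = load_config()
    run(cfg)

def run(cfg):
    for item in cfg.items():
        process(item)

def load_config():
    return {}

def process(item):
    print(item)
'''

if __name__ == "__main__":
    for caller, callees in build_call_graph(SAMPLE).items():
        print(caller, "->", ", ".join(callees))
```

Functions that call nothing (like `load_config` here) simply do not appear as callers, which is a cheap way to spot the leaf routines.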
I'll echo some earlier comments.
Set up an execution environment with a debugger, run several typical scenarios, and trace them with the debugger. Get the feel of the big-picture execution scenarios/paths.
It will take time for your brains to get comfortable with it, though. And the details, when you look into them, will throw odd stuff at you. But that's the nature of our work.
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
For culinary folks...
The time and money you spend tracing and inserting noodles in the spaghetti will end up being larger than the time it takes to cook a new batch (no pun intended).
For auto folks...
The time and money you spend bondo-ing, welding, rewiring, duct-taping, and C'n'Cing parts for the car will end up being larger than the time it takes to design and build a new car. (Although restoring an old/vintage car for the sake of nostalgia is a much more pleasing experience than buying a new one).
Gain an understanding of the purpose of each pivotal region. Know what your desired result should be, then begin the rewriting endeavor.
'We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.' RPF
PL/SQL or COBOL or whatever they throw at me, I poke, prod, and play with it in a test environment. Someone up above mentioned pencil and paper to draw out how everything relates, and that is a very good practice I've found to just get to know things. It's not instant, but it helps more than you initially think. Also I use Open Office Draw to map out things as well. :P
~~ Behold the flying cow with a rail gun! ~~
2000 lines can be enough to throw you off!
I think it is just like learning anything. Keep at it.
The most important thing is whether you have an efficient way to look at what effect any changes you make have. Any effort you put into that is probably not going to be wasted. (Might be unit tests? Sounds like they did not come with the code.)
Stephan
http://stephan.sugarmotor.org
Getting something that allows you to browse code more efficiently certainly helps. There are tools for doing that.
Another trick is to compile in debug mode, run the code inside a debugger, then break and watch the function call stack. This can help understand deeply nested code some more.
In the long run however nothing substitutes practice using the codebase. Even an author can get lost if he spends some years away from the code... Either you just do not remember anymore, or the code was changed so much by someone else's edits it gets hard to recognize. Or both.
If the code does not have consistent coding style standards, run it through an indenting program. You may lose the revision control history, but you certainly get a more than reasonable return from it being easier to parse manually. If it does have a consistent coding style standard, even if it is something you are not used to, it is probably better to keep it that way.
Clean up code by refactoring common code blocks out, or by doing other refactoring that reduces the line count and/or increases readability. Make sure the refactored version is functionally equivalent to the non-refactored version, unless you are fixing a bug. Even if you are fixing a bug, document the change just in case something actually relies on bug-for-bug compatibility.
If you do not have time to do cleanups just keep adding the functionality you need. Eventually you will have read enough code that you will know the codebase. If you do not need to add any more functionality, who cares anyway?
Get a copy of Working Effectively With Legacy Code. It'll help you get tests around the code base that give you the confidence to be able to change it without breaking anything.
One million lines is starting to feel big.
I inherited a code base of 1.5 million lines of code at the last job I was at. Thankfully I wasn't the only one responsible for it. My advice to the original poster is to add lots of logging information. Log statements should document what the code is doing at any point in time and tell you where it is doing it. If it's Java, you can get the stack trace from anywhere, which is very handy for logging.
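The same trick works outside Java. As a sketch, Python's standard `logging` module can attach the current call stack to any log record via `stack_info=True`, so each message shows what is happening and how execution got there. The logger name and the `apply_discount` / `checkout` functions below are hypothetical.

```python
import io
import logging


def make_logger(stream):
    """Build a logger whose records show the function name and line number."""
    logger = logging.getLogger("legacy.trace")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(funcName)s:%(lineno)d %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def apply_discount(logger, total):
    # stack_info=True appends the call stack to this record, so the log
    # shows which caller reached this point.
    logger.debug("apply_discount called with total=%r", total, stack_info=True)
    return total * 0.9


def checkout(logger):
    """Hypothetical caller, here only so it shows up in the logged stack."""
    return apply_discount(logger, 100)


if __name__ == "__main__":
    buffer = io.StringIO()
    checkout(make_logger(buffer))
    print(buffer.getvalue())
```

Stack dumps are noisy, so it is probably best to enable them only around the spot you are investigating and strip them once you understand the path.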
30k-40k... I am working on a project with ~2 million lines of code spread across C#, SQL & HTML/Javascript/CSS. Mind you, there are 8 developers working on it, but each one of us has to pretty much know the entire thing.
Ha, ha! Just 4 months ago I joined a project with a code base of about 500k lines. I would call that (the 500k lines one) intermediate in size. There are code bases with many millions of lines. I now feel pretty comfortable finding things in it. And I mostly use find and grep.
When 1person suffers from a delusion,it is called insanity.When many people suffer from a delusion,it is called religion
unless they used a God class for everything.
Seriously... if there is a lack of documentation, then you just have to start reading the source code, starting at main(). Then look at each object and read its constructors.
And start documenting it. Add comments in the code, create inheritance diagrams and sequence diagrams.
It will be tedious but you will come out of it a better programmer.
You mean they didn't comment all their code? *gasp*
Very well, sir. Here's your 40,000 lines of Perl from the late 90s. It's mostly regex to parse revisions 30 through 451 of our in-house provisioning system. Oh, and BTW don't screw up like the last guy who had this job. He provisioned 32767 customers with tier-1 service, and it was the director's job to explain why we either had to let them have it for the remainder of the year, or else deal with the CR issues.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
And then you re-implement it in the latest language.
I have inherited projects and do my best to convince management that a pause is needed to document the code. Personally I try to flowchart the functionality and cover a couple of office walls with Visio printouts. Later on I can use such work to add detail and further documentation.
I inherited some code where the developer used names of girlfriends in variable names, it was just dumb and completely unprofessional. I didn't worry so much about keeping track of those, I was more worried about a change in one spot having unintended (and perhaps unknown until too late) consequences. Rather than spend time fixing problems, I thought it best to do some up-front documenting to at least provide a path to successful maintenance.
When I left the project, the manager had a binder of documentation and almost cried.
I had an English professor who always said, "Structure is the key to understanding." He was talking about literature, but I think the same is true for programs as well.
Try to understand the structure of the program. What is the basic flow? It should have an initialization routine, a main loop, and a shutdown routine. Find out roughly where they are, then focus on the main loop. Usually there will be one piece of code that is central, and it will occasionally pass control into other large pieces of the program. Sometimes there will be more than one main loop, and control switches back and forth between the various main loops. If the program is event driven, this will make a difference in the structure.
If you are just trying to make a small change, try to find the sequence of events that will lead up to where that change needs to be made. Follow the sequence of execution until you get to the line you need to change. If you are changing a single variable, sometimes it's helpful to do a search and find all the places that variable is used, to make sure your change won't have any side effects. This may seem time consuming, but it can save 10 times more in debugging.
Learn to follow code execution with your eyes, without running a debugger. One thing that separates good coders from not so good coders is the ability to follow code that isn't being executed.
Qxe4
There is a tool called grep which is very useful.
Consider yourself a new explorer in the developing field of Software Archeology. And if you're a programmer, consider that the task is listed under the heading of "jobs for programmers". Try to make it so that the next programmer to deal with the code has a few more advantages than you.
Oh yeah, well I just inherited a code base of 2.8 trillion lines of assembly code, and I have to read it over a 12.734 baud VAX connection! Why, back in my day...
Anyway... I've taken on a few large-scale software projects before, and my approach has always been "read twice, hack once". I agree with the parent, and I'll add a note: for the love of everything sacred and unholy, use revision control, and don't trust it -- that is, back up incessantly. Document the hell out of your process. Once you've really learned the system, you might want to back out some of the newbie mistakes that you're making right now.
And yes. Learning a big system takes a lot of time -- you should be reading much more than writing until you've learned it. I find it helpful to diagram dependencies / draw up finite state machines.
yeah, the clown always creeped me out as well.
Mod me down, my New Earth Global Warmingist friends!
That is indeed a heinous scenario, but don't conflate "obfuscated" with "large".
Couldn't agree more. Even 4-6 million lines is probably fairly common and still not a big issue. One is more inclined to enter the "cut the cruft" mode sooner rather than later when it's at that point.
Run it and step through it. Also, use doxygen (http://www.stack.nl/~dimitri/doxygen/) to highlight keywords, create hyperlinks to follow functions, and describe the data structures.
Good lord, you're not going to eat'em afterward, are you?
I am article submitter O.P. and not retard I am programmer with Master DEgree in Computer Science from Indian Institude of Technology and If I am retard why does IBM give me 40.000,00 lines of code? American IBM cannott do it so they give it to me because of my education in India
IBM paies me 2 Mexican paysos for every line of code I fix that American coder screw up and I need food and room like American does. If American wants money than American should do job correct the first time and not have to send it to INdia to get all the work done correct. As AMerican teenager say DONT HATE THE PLAYER HATE THE GAME
You're not a kernel hacker.
As someone who recently passed off a pile of code of about that size in poorly written and poorly documented php to someone.. All I can say is I'm very very sorry, and I had *no idea* my personal side project would work better than the original commercial offering and be declared 'mission critical' three months before I left for greener pastures..
I just took the easy way out and quit. I had inherited about 30K lines of PHP code that was written by my boss. Shortly after inheriting this spaghetti mess I ran a grep across the source: the word "function" did not occur a single time in the entire source tree. To top it all off, I was not to rework any of it, only maintain it, as it was going away. I did end up installing it on about 5 new machines, so going away anytime soon was not going to happen. On top of all that, I would run into about 20 blocks of if statements per file, and most database calls etc. had the error-suppressing @ in front of them. I found it much easier to just hand it back to the boss and quit.
Got Code?
Identify each major portion of functionality. If you are working with a sales/billing system you would probably end up with: Orders, Invoices, Payments, Admin.
Go through each of those portions and identify the major portions. Orders: Order headers, Order details, business logic, ui logic, reports, datalayer, etc. Repeat until reduced into easily consumable units.
Pick and stick to an SDLC. Use whatever fits the situation and the resources. For a small project (under 100k lines of code) you should be good by yourself. Anything more and you'll have to involve at least 1 other person for testing. For medium (100k-500k lines) you'll need an additional dev...For large projects (500K-5M lines) you'll need a project manager, lead dev, 2 devs, 1 test, and a UAT team...For larger projects you'll have something unique and frightening to the specifics of the software project and corporation/agency making it...anyway, I digress...
Go through each subdivision line-by-line and re-write it yourself (even if you aren't going to put your re-written version into production); the only way you're going to truly understand what is going on is if you do it yourself. Use whatever language you are most comfortable with or is most appropriate to the task (or languages), it does not need to be the same as the original.
Verify that for a given input, your version produces exactly the same output as the original.
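That check can be automated by feeding both versions the same inputs and collecting every disagreement. Below is a minimal sketch in Python. `legacy_total` and `rewritten_total` are hypothetical stand-ins for an original routine and your rewrite of it; in practice the original may have to be driven through a command line or a file interface instead.

```python
import random


def legacy_total(items):
    """Hypothetical original: a loop with a special case picked up over the years."""
    total = 0
    for price, qty in items:
        if qty == 0:
            continue
        total += price * qty
    return total


def rewritten_total(items):
    """Hypothetical rewrite that is supposed to behave identically."""
    return sum(price * qty for price, qty in items)


def find_mismatches(original, rewrite, inputs):
    """Return (input, original_output, rewrite_output) for every disagreement."""
    mismatches = []
    for value in inputs:
        expected = original(value)
        actual = rewrite(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches


def random_orders(count, seed=0):
    """Generate reproducible pseudo-random orders as lists of (price, qty) pairs."""
    rng = random.Random(seed)
    return [
        [(rng.randint(1, 500), rng.randint(0, 20)) for _ in range(rng.randint(0, 8))]
        for _ in range(count)
    ]


if __name__ == "__main__":
    # An empty list means no disagreement was found on these inputs.
    print(find_mismatches(legacy_total, rewritten_total, random_orders(1000)))
```

Random inputs only show agreement on what was generated, so it is worth adding the awkward cases you found while reading the original (empty input, zero quantities, and so on) to the list by hand.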
Take a deep breath. It's not a race. It's a one-to-one functional mapping of your software (your mindspace) and the original software (the other developer(s) mindspace(s)). The code probably will not be straightforward. It has also been battle-scarred and will be warty. Changes of initial requirements through time and feature enhancements (feature creep) will have taken their toll on what may have originally been something simple or even elegant. It's something of a niche mindset and if it is not for you, there exist many other exciting things to be programming.
Ultimately, if you do as outlined above, you'll solve many problems, be able to make whatever changes you like, and in so doing have a way to present your design as a replacement if you want...Or not, if you don't; for 30-40k lines parallel development makes sense, in a way, for one person.
I inherited 30k lines of code when I started work "wet behind the ears". It was ActionScript code (so no typing), spaghetti at its best. Probably not the best code to look at as a beginner. I also had inherited another 20k of clean Java code; that was probably the only thing I felt very happy about. I agree with the AC: 30 to 40k is no big deal. As a fresh programmer, I had inherited 50 kloc.
Just out of curiosity, what is your opinion of a "Large" codebase then?
My first programming job was on an enterprise system that was over 7 million lines of just C++ code by the time I left, not including SQL stored procedures, web server code for the reporting system, and surely other code stuff that I can't recall. The entire development team for the system was something like 45 programmers. So to many of us, 30-40 klocs does not seem like a large codebase at all.
That said, I've also inherited code in the 10-50 kloc area of magnitude that was far more of a challenge/nightmare to decipher and maintain than that 7 million line system was. Code maintainability has more to do with good system architecture and coding standards than it has to do with the size of the code base; without those your system will likely collapse under its own bloat long before it can grow to millions of lines.
Momentarily, the need for the construction of new light will no longer exist.
I currently maintain several million lines of Perl. It's not hard, it mostly just works, and when it doesn't, it's not that hard to figure out where it's broken IFF there is a consistent repro case for the problem.
If you have a proper development/production divide, there shouldn't be any weird production issues unless you or your predecessor missed some test cases. If you don't have test cases, that's a problem, if you don't have a properly firewalled and complete development environment, that's a problem, the code itself? Shouldn't be a problem.
Realities just a bunch of bits.
People are more likely to be awed by your programming skills if you can help with this person's problem, instead of trying to impress people with the size of the programs you've worked on.
30-40K is nothing. One person should be able to handle that easily. Although I can imagine for an inexperienced programmer it can be too much. I remember the first 'large' program I wrote in school -- it was 400 lines.
10 years ago I had to port 1.5 million lines from one UNIX to another. Well that's a large project.
Medium size is 250 to 750 thousand lines of code (one person can still understand how it all works). Big is 1 to 10 million lines of code. Really big is >10 million.
I have worked on code bases of all of those sizes, and I like the medium size the best -- it's big enough to be interesting, and small enough that you can understand it all.
One that I've worked on (over 25 million lines) is just too big for my tastes -- over 3 hours to do a clean recompile is excessive.
Ian Ameline
It floats... they all float down here...
Don't be discouraged. It's not like English where everyone writes in a familiar way. Everyone writes code a little differently and it is hard to go through it. Even with good commenting it can be difficult. Just persist and hope that you can contact one of the original authors.
Get a copy of Michael Feathers' book "Working Effectively with Legacy Code".
I taught a grad / undergrad course using this book. We took a real open-source program as the class project, and the teams made significant changes to it. I thought it worked well.
Pat
And the answer is obvious. UTSL. And since it's now mine anyway, I tend to walk around and see how things work, find places where things don't work so well, and refactor them. It's quite a lot of work, often meaning touching the same code several times to come up with something more modular, more compact, more efficient. Lots of work is ``enabling'' work. Clean up something, see what that exposes or enables some larger change to be put through. After a while change requests become simpler and faster.
If you want to see how this really works, take projects with lots of fresh graduate or even freshman code in them to poke through. It's not hard, it's just lots of work. But then, what are you being paid for, anyway?
...because I actually enjoy going through someone else's code? I roll up my sleeves and, using print statements and/or a debugger, I diagram object relationships, flow, data structures...anything I can think of. It's like figuring out a puzzle. Of course, I've had the luck of never inheriting a total pile of crap. But give me anything from not-perfect-but-serviceable on up, and I not only can deal, but I'll have a good time doing so.
Check it out, it's called Code Browser. It's a lightweight and powerful editor that allows you to visualize, structure, link, organize, comment and edit code.
It's my favorite one for very large projects with hundreds of files and thousands of lines.
From the project's description:
"Code Browser is a folding text editor for Linux and Windows, designed to hierarchically structure any kind of text file and especially source code. It makes navigation through source code faster and easier."
"Code Browser is especially designed to keep a good overview of the code of large projects, but is also useful for a simple css file. Ideal if you are fed up of having to scroll through thousands of lines of code. "
Have fun!
There are a couple of questions that you should ask yourself:
First I would find out how the program was designed, that is: is it bottom-up or top-down? Some languages offer better facilities for writing programs in one style or the other and some problems are solved better in one style or the other. Try to think like somebody who was given the task of "implementing X."
If they chose bottom-up, the developers might have been competent enough to refactor code as they were writing it. How would somebody start implementing X from the bottom-up? Start deep down in the hierarchy of abstractions with the fundamental abstract data types that drive most of the program. If file timestamps are accurate, they should be able to tell you what the oldest module of the program is. Start there, then move on to the next layer that interfaces with that code. Wash, rinse, repeat.
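Listing the source files oldest-first is quick to script. The sketch below is in Python; the default suffixes are an assumption, so adjust them to the language at hand. As the parent says, it only works if the timestamps are accurate: a fresh checkout from revision control usually stamps every file with the checkout time, in which case the revision history is the better source.

```python
import os
import tempfile
from pathlib import Path


def files_oldest_first(root, suffixes=(".c", ".h")):
    """Return the source files under root, sorted by modification time, oldest first."""
    paths = [
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    ]
    return sorted(paths, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Tiny demonstration on a throwaway directory with faked timestamps.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "list.c").write_text("/* core data type */\n")
        (root / "ui.c").write_text("/* added much later */\n")
        os.utime(root / "list.c", (1000, 1000))
        os.utime(root / "ui.c", (2000, 2000))
        for path in files_oldest_first(root):
            print(path.name)
```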
If it's top-down, find the design documents. If they're unavailable, reverse engineer them from the current code base. Is it a clean design? Ask yourself if they were competent enough to come up with it right away. How many people were working at the time and what were their levels of proficiencies? Hope that the most proficient programmers were assigned the most difficult modules. At some point integration must have happened. Find the spots where it did. Those are module boundaries. Read each module's code progressing "along the boundaries."
The more accurately X was defined in the first place, the later you're going to run into the uglies, the WTFs. That's going to be inevitable. Every code base has WTFs and OMGs.
Ultimately, you must read all code to understand all code. That shouldn't come as a surprise.
Chris Eineke
I've never gotten tier 1 service for anything. But, for all intents and purposes, really, who cares?
He once inserted random mutations into his code, just so he could have the experience of debugging.
When I was programming we did every project in 5 lines of code, or less, period. Anything more than that was just fancy stuff!
>30-40 thousand lines ?
You must be kidding. This is a tool. Business Apps are one scale more: 300-400 thousand lines.
As for your question: Hire the original developer.
No documentation->Code is not worth a cent. In the business world the documentation should be in the code. approx 1:1 code and comments line.
These are the apps that control your money, phone calls, insurance and your airplane tickets. Those are the real apps.
My first programming job was also about 7 million lines of code - all assembly code. There were 5 of us maintaining it, and for some of the object code we were maintaining we didn't have matching source (which isn't hopeless in assembly programming, fortunately, just time consuming and annoying).
You can just read through 30 klocs in a few months, not a big deal, really. But for a larger codebase you have to learn how to do bugfixes without understanding the entire system. You can often find the source of an error by searching for an error message in the code, then working backwards (assuming you have an error message!). You won't be able to prove that your bugfix won't break something else.
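Searching for the message is what `grep -rn "message" .` does. Where grep is not at hand, a few lines of Python do the same job. This is a minimal sketch: the suffix list and the sample message in the demo are assumptions, and it only finds messages that appear literally in the source, not ones assembled from pieces at run time.

```python
import tempfile
from pathlib import Path


def find_message(root, message,
                 suffixes=(".py", ".c", ".h", ".java", ".pl", ".php")):
    """Return (path, line_number, line) for each source line containing message."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if message in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Tiny demonstration on a throwaway tree.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "billing.py"
        sample.write_text(
            'def pay():\n    raise ValueError("invoice total mismatch")\n'
        )
        for path, lineno, line in find_message(tmp, "invoice total mismatch"):
            print(f"{path}:{lineno}: {line}")
```

Once you have the line, the next step is the "working backwards" part: find who calls that function, and under what condition the error branch is taken.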
For adding new features, it really sucks if there isn't some architecture-level documentation to give you the big-picture understanding. Details around a given bug are one thing, but just finding the right APIs to use when adding some new feature can really suck without good comments or an architecture doc.
Stepping through the code with a debugger while you do some normal tasks will really help you understand the organization of the mainline code. Lacking good docs, it's the best way to get started.
Socialism: a lie told by totalitarians and believed by fools.
I inherited a product with a code base of a few hundred thousand lines of code when I was a fairly new software engineer. To make it worse, it was cross platform (AIX/Windows/Linux/HUX) with something like 20 nested make files. The code was essentially a business service application. My solution was to talk to the consumers of the product and learn what each service call was supposed to do. I then wrote a set of test suites for the application. I had to continually update the suite as I went along, but it definitely exposed unexpected couplings or other strange behaviors in the code. I also ended up converting the project over to an ant based build script (ant was brand new at the time). It definitely taught me what the code was doing and how it was doing it.
It's not just architecture and coding standards. What I find, is that up-to-date documentation is very important. Not so much details about lines of code, but the general design, control flow and design decisions.
RogerWilco the Adventurous Janitor
If both the original developers and the knowledge they had have been lost, then it is probably already too late to perform any major maintenance on this code base. The project has already entered its “servicing” stage.
At that point, you basically have two possible approaches that actually work: you can restrict maintenance to small-scale changes, which may be sufficient if the goal is just to keep the project ticking over for a while, or you can accept The Big Rewrite (which isn’t so big in this case) in order to get a project that can be properly maintained.
If you want to go down the tactical changes path, there are a couple of approaches to finding your way around the code.
If you’re familiar with the general field of the software, just not this particular code, then you can work top-down. Start with the key, high-level concepts you know the program implements, and try to find the code that represents those:
Hopefully, if the code has a reasonable modular design and you just don’t know what it is yet, this sort of approach will identify the organisation of the code at a very coarse level, but then you can try to break down each area in more detail the same way.
Alternatively, you can work bottom-up. Find a significant starting point, such as:
Examine the code near that point. Look at what kinds of data it works with. Look at what functions it calls, and what functions call it. Try to figure out the wider significance of the code you started with, and the other code to which it relates. Then move up a level: what is the purpose of all of that code collectively? Repeat until you’ve explored as far as you need to.
After some other discussions about these topics, I recently wrote up a couple of articles with some more background information than I’ve given here — link in my sig if anyone’s interested (though be warned that they are pretty long).
I have been given projects of this nature and the best approach is to document what is obvious and then use bug fixing as a way in to the code. While it won't give you a complete picture, it should help you understand what is immediately important, and serve as guide posts for knowing more in the future. Generally I try not to spend too much time trying to understand everything, since it's a waste of time, unless that knowledge is guaranteed to serve you - sometimes the client just wants it to be tweaked once in a while, so it probably is not worth the time if you can't charge them for it.
To sum up: give yourself a general picture and then concentrate on the details only when it matters.
Jumpstart the tartan drive.
> I find it helpful to [...] draw up finite state machines.
Unless his entire code is written in regular expressions (which, albeit, *would* be a total bitch to maintain), I don't think finite state machines are going to be very helpful.
"A couple of times in my career, I've inherited a fairly large (30-40 thousand lines) collection of code. The original authors knew it because they wrote it; I didn't, and I don't."
A couple of times in your career? You must be lucky. Most jobs you can get coding will always involve taking over someone else's code.
In my experience, design patterns are your best friend, bearing in mind that most of the code base will always remain a black box to you.
For example, when I was doing some health insurance work, I had inherited a code base that was substantially larger than 30 or 40 thousand lines of code. The objective was to make the code that used an older, fixed-length record format work with the newer X837 EDI format, which is basically XML but almost without any tags to help you figure out where the data begins and ends. Suffice it to say that the task was to figure out how to smoothly stick a square peg in a round hole.
The task itself determined the design patterns, of which an adapter pattern was the most used. The type of pattern in turn dictated what in the code to look for in order to implement it, and (of course) how the new code would be built. For example, since we were using an adapter pattern, the first order of business was to find out how the data was represented in the code base, and then trick the "black box" into using your own spiffy, new representation of the data.
For the most part I didn't have to care all that much how the application handled the data as long as I got the right data into a form the application would accept in my adapter.
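A rough sketch of that adapter idea, in Python for brevity. Everything here is hypothetical: `legacy_process` stands in for the "black box" that only understands a fixed-length record (a 20-character name followed by an 8-digit amount), and `ClaimAdapter` wraps a newer dict-style record so the old code never notices the difference. The real field layout in the project described above is not known.

```python
NAME_WIDTH = 20    # assumed layout of the old fixed-length record
AMOUNT_WIDTH = 8

def legacy_process(record: str) -> str:
    """Stand-in for the inherited black box; it only accepts fixed-width text."""
    name = record[:NAME_WIDTH].strip()
    amount = int(record[NAME_WIDTH:NAME_WIDTH + AMOUNT_WIDTH])
    return f"{name}:{amount}"

class ClaimAdapter:
    """Presents a new-style record in the shape the legacy code expects."""

    def __init__(self, claim: dict):
        self._claim = claim

    def to_fixed_width(self) -> str:
        amount = str(self._claim["amount_cents"])
        if len(amount) > AMOUNT_WIDTH:
            # Refuse to silently truncate money; the old format cannot hold it.
            raise ValueError("amount does not fit the legacy record")
        name = self._claim["name"][:NAME_WIDTH].ljust(NAME_WIDTH)
        return name + amount.rjust(AMOUNT_WIDTH, "0")

new_style_claim = {"name": "Jane Doe", "amount_cents": 12500}
result = legacy_process(ClaimAdapter(new_style_claim).to_fixed_width())
```

The point of the design is that all knowledge of the old layout lives in one small class, so the rest of the legacy code can stay a black box.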
Sure, but the medical policy must have been ridiculous to cover all the RSIs from the scrolling.
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Hint: monads are two-state machines. Learn them.
Perl is like the matrix. At a certain point, after you've stared at it long enough, it all just makes sense.
"I assumed blithely that there were no elves out there in the darkness"
Somehow, I suspect that the original developers don't remember most of it either.
Unless you work with it every day, little by little, you forget.
First you forget the tricky parts.
About the only thing you remember after a few years is the general structure.
If you work with it every day, soon you will know it better than the original developers.
As someone who has done probably 90% of his work in maintenance programming, let me give you my tips:
BTW, the fact that you have a hard time understanding this code may be more a reflection on the original authors' coding skills than on your abilities; any idiot can write code that "just works"; it takes a lot of thought, time and effort to write code that is maintainable, and more often than not, the original coders were short on at least one of those (if not all three). Here's hoping you have the time to follow my above tips; they take a lot of time, but can be worth it if you really need to maintain the code. It's funny to note that apart from the first one, most of those tips apply equally well to developing software from scratch. If the code already has a change tracking system, unit tests, a build/run/test system, *and* automated testing, consider yourself lucky and just start picking apart the unit tests.
Nathan's blog
It depends on the code quality.
40k lines of spaghetti, undocumented code may be a nightmare.
1M lines of good and documented code may be even easy to deal with, depending on what you're going to do.
First off, all of us engineers or good programmers take a lot of pride in our work... This can sometimes be a problem.
The real issue is that a company had 40K lines of code written and didn't staff it properly to maintain it.
First, they should make sure the guy who wrote it didn't leave. Work conditions, payscale....
Secondly, they should have had a transition plan. Either some 'slack' working on the same project.
So that is your starting point. It is not your problem that you inherited this large codebase and have no idea how it works.
Don't take it personally if you make a change and crap happens. Just make a change, hopefully there are testers... if you cause a bug, enjoy the CRs...
That's how companies want to run their software department. That's how you behave.
You will only learn the code by working with it. You will get CRs, grep files for what you're looking for, make a change and deal with the after effects
After a few months, you'll start to get the hang of it. After a year, you'll be good...
That's just life in poorly run software companies :P
I know this'll get modded troll, but boy are you a douche.
When I joined a group that had a 2 Million SLOC program, I learned the most by fixing defects. It gave me a good reason to go traipsing through the codebase. It's painful, but it gives you purpose while reading the code. Just plain reading it gets boring.
Dear Sir,
We have recently been placed in charge of inheritance of 40,000 loc, I have the
privilege to request your assistance to maintain the henceforth mentioned sum.
The above sum resulted from a contract, executed, commissioned and written five
years (5) ago by a foreign contractor. This action was however intentional and
since then the source has been in a suspended terminal awaiting the fg command.
We are now ready to transfer the source overseas and that is where you come in.
It is important to inform you that as outsourced servants, we are forbidden to
debug foreign code; that is why we require your assistance. You will be required
to debug and analyze the code and transfer the bug free code to our central
repository after which we will reimburse you for your time with post it notes
and slightly dated coffee creamer.
We are looking forward to doing this business with you and solicit absolute
confidentiality from you in this transaction. Please acknowledge receipt of
this letter, using the above Telefax number for more details regarding this
transaction. Also endeavor to send the requested information.
I call that a module. Large is anything over 1,000,000 LOC. Step up.
Step 2: Print out all the code (in very small font) and paste the code up on the wall
Step 3: Identify all the classes, functions, DBs, etc.
Step 4: Create a visual map (on a white board) of how they're all linked together.
Step 5: PROFIT!
That wasn't so hard, now, was it? :)
he ports of GNU utilities to Windows [sourceforge.net] or Cygwin [cygwin.com] or even your own company's Interix [wikipedia.org] and Services for UNIX [wikipedia.org] products?
I had Win7 and Vista Ent with Services for Unix I downloaded, and it just did not feel right or work right. The command line utilities work, in part, because the whole OS in Unix is basically a tree of text files. Windows isn't, and so the utilities tend to be less effective. Plus, there are some gotchas like how Windows handles files that applications have open; it's all different.
I thought Interix would be the ultimate, but instead it taught me the opposite. If you want Unix, use Unix. It's that simple.
Using a static analysis tool like FindBugs - and fixing all the problems it finds - is a great way to get to know all sorts of corners of a big codebase.
(and incidentally increase quality).
[humor]
You could always ask Microsoft... that sounds like almost every piece of software they currently (or even previously) sell or sold (with of course differences in the amounts of code). Maintaining it properly seems to be working fi.....
...ummmm, never mind. Perhaps you should simply learn how to market it really well, kill the competition via anti-competitive actions, and kludge on a thing or two so that you can claim you've "improved" it.
[/humor]
StarTrekPhase2 - The Five Year Mission Continues!
I've maintained code in the 30-40Kloc range that was "large" and really sucked to understand. Fix one bug, create two new ones. I maintain one such code base still, most modules have a McCabe Cyclomatic Complexity of over 100. Can't refactor/rewrite/redesign, management won't approve it. The original authors are long gone. I embed lots of debug in the code and turn on the debug output on sections that I'm working on.
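One way to get that "debug output only on the section I'm working on" effect, sketched with Python's standard `logging` module. The logger names (`legacy.billing`, `legacy.reports`) and the `compute_invoice` function are invented for the example; the original poster's language and mechanism are not stated.

```python
import logging

logging.basicConfig(format="%(name)s %(levelname)s %(message)s")

# Quiet by default for the whole (hypothetical) legacy package.
logging.getLogger("legacy").setLevel(logging.WARNING)

billing_log = logging.getLogger("legacy.billing")
report_log = logging.getLogger("legacy.reports")

def enable_debug(section: str) -> None:
    """Turn on verbose output for one section without touching the others."""
    logging.getLogger(section).setLevel(logging.DEBUG)

# Today we are chasing a bug in billing, so only billing gets noisy.
enable_debug("legacy.billing")

def compute_invoice(items):
    billing_log.debug("compute_invoice called with %d items", len(items))
    total = sum(items)
    billing_log.debug("total=%s", total)
    return total
```

Because the debug calls stay in the code, the next bug in the same area costs one `enable_debug` call instead of another round of sprinkling print statements.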
I use a program called lxr or linux cross reference. I even extended it for myself to handle embedded sql code. And because it runs on a web server it allows a whole team to browse the code.
This is what I've done once upon a time.
1. Tell management the code is a completely undocumented, unmaintainable, unstructured piece of Dukakis.
2. Offer to rewrite the code completely. Chances are they will agree. Of course it depends on how large the code is.
3. Rewrite the code. Make sure it's an undocumented, unmaintainable, unstructured piece of Dukakis.
4. Resign.
Really. A guy asks a question for help and all of these people keep telling him 30-40,000 lines of code isn't much.
That's a lot of code to get your arms around if you didn't write it. It's not the end of the world, but it is a sizeable task, and is the type of topic that few professional journals or books will ever be written about.
Having been in similar situations, my advice would be:
1) Try to get an understanding of the history of the code. Who wrote it? Why? How many developers? How long has it been around? Do people love it or hate it? Is there a version control system in place you can use for information?
2) Look at it from a technical viewpoint. Is it complete? Does it compile and run? How many languages are used? Are there interfaces with other systems you need to know about? What dependencies are there? How easy is it to set up a test server? What parts are well coded? What parts stink up the joint?
3) Dig for functional documentation. What does it do? For whom does it do it? What business needs does it support? How mission critical is it?
4) Meet with the business owners. Seriously. This helps you do two things: #1-- Define the real business need (which may be different than what was understood by the previous developers), and #2-- Set appropriate expectations about maintenance. You'll work hard to maintain and keep it working, but you are working from a disadvantaged position. It is important they know this and support you in your efforts, rather than complain loudly when something doesn't work.
5) Plan to remove the dead weight. There's always a lot of dead weight in these near-abandoned projects. Get an idea how to simplify things and plan your work in phases.
6) Set up real test and development servers. Yeah, you know that wasn't already done.
7) Use version control. But you know this. It's 2010, and no developer worth his/her salt would code a paying project without version control. Right?
8) All fixes will take much longer than if you wrote the code, so be careful with estimating time.
Anecdota from the 2.6.13 source tree:
./drivers/usb/media 28846
./net/ipv6 28901
./fs/jfs 29103
./fs/reiserfs 29268
./mm 29446
./drivers/usb/gadget 29453
./drivers/char/drm 31944
./drivers/scsi/aic7xxx 32463
./drivers/isdn/hardware/eicon 33054
./drivers/atm 33462
./arch/alpha/kernel 34150
./drivers/net/sk98lin 34598
./drivers/ieee1394 34683
./arch/i386/kernel 35251
./arch/sparc64/kernel 35293
./arch/ia64/kernel 36738
./drivers/usb/serial 38002
./sound/pci 38576
./kernel 39278
./drivers/video/console 39445
./drivers/pci/hotplug 39969
None of these, by any measure, are large
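For anyone who wants the same kind of per-directory numbers for their own tree, a small sketch follows. How the figures above were actually produced is not stated, so this is only one plausible way: it counts raw lines (blank lines and comments included) in `.c` and `.h` files and credits each file to its directory and every parent directory. The function name `lines_per_directory` is made up.

```python
import os
from collections import Counter
from pathlib import Path

def lines_per_directory(root, suffixes=(".c", ".h")) -> Counter:
    """Count raw lines per directory, including everything beneath it."""
    root = Path(root)
    suffixes = tuple(suffixes)
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        for fname in filenames:
            if not fname.endswith(suffixes):
                continue
            # Read as bytes so odd encodings in old code cannot break the count.
            with open(Path(dirpath) / fname, "rb") as fh:
                n = sum(1 for _ in fh)
            counts["."] += n
            # Credit the file to ./a, ./a/b, ... so parents include children.
            for depth in range(1, len(rel.parts) + 1):
                counts["./" + "/".join(rel.parts[:depth])] += n
    return counts
```

Something like `sorted(lines_per_directory("linux-2.6.13").items(), key=lambda kv: kv[1])` would then give a listing ordered by size, similar in shape to the one above.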
well that depends on how many developers we are talking about. The original question seems to indicate that the author has inherited the codebase. The need for this question wouldn't exist if the person were on some large team.
For one or two or five people, 40K lines is a sizable codebase, especially if it has been poorly maintained / designed.
blah blah blah
I get this exact situation occasionally. The only possible answer is time. The longer you work with it the more clear it becomes. You may even end up liking their style.
If you can work a lot of overtime at the beginning that will help. That is exactly what I do but then again I have the free time and don't have kids to watch after and such.
Bite the fekken bullet if you can and put in a few 60 hour weeks.
I could not disagree any more with your statement. As a consultant, I have designed and personally coded more than a dozen projects that were much larger than what the poster had. Also, it is often simply impractical for the developers to stay, because it is just not cost effective to do so. People generally will pay to have the new system in place, but rarely want to pay a lot to maintain it. My experience is that it is generally best for me to move on, and let someone else maintain the system. To be honest, some systems have turned out poorly (typically due to late exploration of the requirements), but generally the codebases are quite simple for even a novice to maintain.
However, I have found thru experience that the key to a good codebase is how it is segmented. Abstracting complexities is extremely important to a well maintained codebase. Meaning, in my opinion, the ideal design is one where you have hundreds of simple objects (although OO principles are not critical) that make up a very complex system.
In short, I have seen a number of systems just like what the poster is talking about. They are generally poorly designed, hard to maintain, and typically very difficult to find/fix bugs in. If there is not a business case to re-design the system, however, then it is typically best to slowly start segmenting and abstracting the codebase until you can reliably predict how it will perform in the field.
I recently listened to an excellent Software Engineering Radio podcast on this very subject: Episode 148: Software Archaeology with Dave Thomas
This guy has a lot of good pointers. (No pun intended. ;)
I feel sorry for you! I quit a really nice job about 12 years ago because I was fed up with Windows. I have been much happier since! Before then, I used Cygwin, which is freely available and offers the goodness of bash, find, grep and many other tools written by software developers for software developers. Recently I had the misfortune of having to write a short batch file on Windows 7 on a fairly powerful 4-processor machine with 8 GiB of RAM and noticed that a) the terminal (DOS?) window felt really unresponsive and b) that copying and pasting in it was bizarrely clunky. What's up with "mark"? Also, I am not someone who would claim that bash is an even remotely sane scripting environment, which is why I switched to Python for most of my scripting needs, but Windows batch scripts are a friggin' nightmare! It seems that rather than improving on "sh", Microsoft decided to come up with something far worse. I now live in an OS X and Linux world and am much less frustrated. And you're right, of course, it does take time and effort to become familiar with a new code base. I felt pretty intimidated when I started out at my current job with a new build system and a new programming language. Now I feel like I can fix any bug in it and I have already added several new features! :)
When 1person suffers from a delusion,it is called insanity.When many people suffer from a delusion,it is called religion
I am currently working with a mission-critical codebase, which is written in PHP and has absolutely no cohesive design to it. Well, unless you consider making everything static and unnecessarily inheriting other classes and overwriting static variables willy-nilly a cohesive design. There are business rules just everywhere and API requests everywhere and all kinds of calls that overwrite static variables. If you don't methodically trace logic it's really easy to get lost. What makes it worse is that there are many many variables that are named very similarly and you don't really know which one is right and which one is just going to get overwritten in some method call you are not looking at right now. And if this software fails, the worst case scenario is that my company makes no money. It really has made my life over the last few weeks pretty horrid. Fortunately I enjoy the job and the co-workers and am well respected there. Otherwise, it wouldn't be worth the aggravation.
My advice: communicate your difficulties to everyone who will listen (refrain from complaining or bellyaching, just communicate). If you inherit something like this, and it is mission critical, then you need to take as long as it takes to get it right. That's right, AS LONG as it takes. Take the time to document everything. Bother the crap out of anyone who can help you. You are responsible for doing your job, and part of doing your job is figuring out how to maintain this beast. And in order to do that, you need to use every resource at your disposal. If anyone wants to rush you along, you need to communicate the difficulty and the importance of the task. If you have been working at a place for a while and have done a good job to date, then they should trust you. If you're brand new, then you'd better hope someone there values your opinion and doesn't merely think you are incompetent. If you are asked to make enhancements, don't refactor until you understand the code. So make enhancements, leaving the potentially crappy code in place, even copying it if necessary. Steadfastly resist the temptation to refactor until you understand the entire piece that you are trying to refactor. Don't remove seemingly unnecessary variables, and don't reduce seemingly redundant database calls. That comes later when you actually know what you are doing in there. IOW, if you have to navigate a lion's den by touch, don't stop to groom the sleeping lion (unless of course, that is your given task.)
The word inherit seems to imply that either the original maintainer no longer works there or has moved on to a different position. This means that it's you on the hook to figure it out. You've gotta dig in, buckle down, and get to it.
Ever seen that demotivational poster that says "It could be that the purpose of your life is only to serve as a warning to others." ?
Well, that was me on my last project. I inherited a codebase of about 1.2 million lines of antiquated C code, written by a dozen or so different people over the course of a dozen or so years, for half a dozen different projects. For your benefit, here are a few dos/don'ts that I learned the hard way:
1. DO NOT try to be a hero and learn the code inside/out all by yourself. Going in, I had a long history of doing exactly that on numerous smaller projects. Turns out 1.2 million lines was WAY beyond my ability to grasp just by poring over the source code. The whole time I was trying to decipher this massive, seemingly amorphous blob of code all by myself, there were at least 2 or 3 of the previous developers sitting a couple floors up. All I had to do was ask for help, but for a variety of reasons (they are very busy people, I don't want to come off as being incompetent, my own overconfidence, etc), I didn't use that resource nearly as much as I could (and should) have.
2. DO NOT try to learn the code bottom-up, by diving straight in and trying to put it all together one piece at a time like a giant jigsaw puzzle. Get a good, solid big picture view in your head first. Draw it out. Data flow, logic flow, UML diagrams, whatever it takes for you to really understand it at a high level, before you start reading source code line by line, function by function, class by class.
3. DO NOT be afraid to make a few assumptions, at least initially. Yes, this may well mean that your high-level mental picture of the code may have some errors that you will need to fix later on, but you need to use your time efficiently. If you can reasonably discern what a given module, file, or function does without having to read every line of code, go ahead and pencil it in on your high level diagram and move on. If you see a source code file named reset_xyz_board.c, you can be reasonably sure it's resetting the "xyz" board. No need to fully grasp every little detail right off the bat. There will be time for that later, if and when it becomes necessary. But keep in mind that with any sufficiently large codebase, there are going to be numerous dark corners that you never end up seeing anyway. Why waste time meticulously mapping out every single one of those dark corners when, in all likelihood, you are only ever going to modify a tenth of the code, or maybe a quarter at the most? The more time you waste obsessing about every minute detail, the less time you will have to truly understand the code from a high level.
4. DO get help from your team! I don't mean the previous developers. If the codebase is large enough that you don't feel you can learn the code all on your own, chances are you aren't the only person assigned to the project. If you are the only person, and your bosses refuse to get you help, then good luck. Otherwise, enlist your fellow developers to help you figure the damn thing out, before you all go off trying to write new code. In my case, I was the team lead and started off with 3 other developers on my team. I was foolish enough to let my ego get in the way, thinking that it somehow wasn't "right" for a team lead to have to rely on his team to help him figure out the existing code (which I probably could have done if not for mistakes 1-3, but that's beside the point). I wanted to be the guru who had a better, clearer understanding of the code than the rest of my team. Why? Because I figured that was part of my role as a team leader, and I didn't think they would respect me as much if I didn't know more than they did. Let's face it, programmers are a meritocratic bunch. Ranks and titles don't equate to respect. Your fellow programmers will invariably treat you with a level of respect that is in direct relation to their estimation of your
That's nothing. Now if all of the files in the project are 30-40 thousand lines of copy-ghetti, well all I can say is good luck. May the refactoring be with you.
The first thing you do is get everything under source code control. If it already is, good. You should have a clearly-marked branch that shows where you started hacking on it, so you can easily tell what pre-dates you.
And by the way, I highly recommend the Git version control system. Among its many great features, it lets you use a version control system that is only on one computer, and get things right before you "push" your changes up to the group server. Thus you have the full power of a version control system, and the freedom to use it, without worrying about breaking things for anyone else. Best practice use of Git: on your local machine, make a new "branch", check out the branch, and do your experimenting in that. If you suddenly, urgently need to fix a bug in the main code, you switch away from your branch to the main branch, do what you must, then switch back to your new branch when convenient. If the branch doesn't work out, you can just delete it. If it works out, you can merge it. (By the way, the above is true of any "distributed" version control system, not just Git.)
Several others have told you to start with unit tests. If the code base already has a set, start by studying them. If the code base does not have unit tests, write some.
Presumably you inherited a working system. The unit tests will put a definition on what "working" currently means. When you change the code, if you introduce a bug, you want one or more of your unit tests to detect the bug and let you know, before you share your updated code with anyone else. Unit tests are some work to set up, but they provide huge peace of mind for you once you have a good set.
And, whenever you are asked to fix a bug (whether you caused it or not), you add a unit test that tests for that bug. Over time the unit tests will become more and more valuable.
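A minimal sketch of what such tests might look like, using Python's standard `unittest`. The function `legacy_discount` is a made-up stand-in for a piece of inherited code; the first two tests simply pin down what it does today, and the third is the kind of test you would add alongside a (hypothetical) bug fix so the bug cannot quietly come back.

```python
import unittest

def legacy_discount(total_cents: int) -> int:
    """Stand-in for inherited code: orders of 10000 cents or more get 10% off."""
    if total_cents >= 10000:
        return total_cents - total_cents // 10
    return total_cents

class LegacyDiscountTests(unittest.TestCase):
    # These two record current behaviour, whether or not it is "right".
    def test_small_order_unchanged(self):
        self.assertEqual(legacy_discount(9999), 9999)

    def test_threshold_gets_discount(self):
        self.assertEqual(legacy_discount(10000), 9000)

    # Added with a hypothetical bug fix about rounding on odd totals.
    def test_rounding_regression(self):
        self.assertEqual(legacy_discount(10005), 9005)

def run_tests() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LegacyDiscountTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In day-to-day use you would more likely run `python -m unittest` from the project directory; `run_tests` is only there so the example is easy to drive from a script.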
I also second this advice by npsimons. Try various automated tools that check for memory leaks and such. If they find bugs, fix the bugs (in your private branch) and then make sure that the fixed version passes the unit tests. You will learn the code base as you find and fix the bugs, and you will improve the stability of the code.
If you find any particularly important variables or data structures, you might want to add some assert statements that check those values in the Debug build. In the Release build, the asserts don't even get compiled in, so they are "free", but if you run the debug build, the asserts can find bugs for you. For example, if you have a crucial handle to some resource, and the handle is getting clobbered, put asserts all through the code that assert that the handle hasn't been clobbered yet, then run the debug build and see where the assert fires. This may not save you time if the clobbering bug only happens once, but you never take the asserts out, so the asserts can find a bug for you if you accidentally re-introduce the bug. (Note that this implies you will want to run your Debug build under the unit tests, in addition to your Release build. The asserts can fire and show you where a bug is, but you need the code to run, and if you have good code coverage from your unit tests, that will happen.)
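Here is a small sketch of the assert idea. The post above is talking about Debug and Release builds of compiled code; the closest Python analogue, used here as an assumption, is that `assert` statements are skipped when the interpreter runs with `-O`. The `ResourceTable` class and the step functions are invented for illustration.

```python
class ResourceTable:
    """Holds a crucial handle that nothing is supposed to overwrite."""

    def __init__(self, handle: int):
        self.handle = handle
        self._expected = handle

    def check(self) -> None:
        # Skipped entirely under `python -O`, so it costs nothing there.
        assert self.handle == self._expected, f"handle clobbered: {self.handle!r}"

def good_step(table: ResourceTable) -> None:
    table.check()

def buggy_step(table: ResourceTable) -> None:
    table.check()
    table.handle = 0  # simulated clobbering bug

def find_clobbering_step(table: ResourceTable, steps):
    """Run steps in order; return the name of the first one after which the
    assert fires, or None if the handle survives (or asserts are disabled)."""
    for step in steps:
        try:
            step(table)
            table.check()
        except AssertionError:
            return step.__name__
    return None
```

With checks before and after each step, the assert fires right next to the code that did the damage instead of three modules later.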
Good luck.
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
That *totally* depends on the code base and the way the OP thinks. Sometimes they're a complete waste of time. Others...not so much.
I've worked with plenty of programmers who see pretty much every software problem in terms of FSMs. One size does not fit all.
Add comments as you review and work with the code. The exercise will help you learn and provide documentation when maintaining or rewriting it. Once you're familiar with the code you won't have the perspective of someone new to it, so start now.
Really. A guy asks a question for help and all of these people keep telling him 30-40,000 lines of code isn't much.
That's a lot of code to get your arms around if you didn't write it. It's not the end of the world, but it is a sizeable task, and is the type of topic that few professional journals or books will ever be written about.
No kidding! 40KLoCs is a bunch of code - especially if it's poorly organized. I can only think of one project I've done that was that large, and if I were to do it again it'd probably shrink by 25-30%. I'd put a bunch of code into a library or 2 and reduce the number of moving parts.
But that's also how I'd tackle this kind of thing: organize it, document the hell out of it, and unit test everything you can. Which to me translates into "make it yours."
I was saddled with a ton of code at one point. It looked like it had been banged-out by the proverbial army of monkeys with typewriters, and they sure as hell didn't write Hamlet. It was pure spaghetti code, written by people who shouldn't have had access to an Etch-a-Sketch, let alone a computer.
I couldn't read it. It was COBOL for Pete's sake, and I couldn't read it. It just didn't make sense. I had to go find several DOZEN of those old IBM flowchart pads and a template, and chart-out every single instruction. Even then it didn't make any sense.
Finally, I took all the flowcharts and spread them out on the main computer room floor, a-la A Beautiful Mind, and went crawling around on them with a big fat red marker. My first break was when I realized roughly 60% of the code was "dead:" it would never, ever be branched to. After striking-out all the dead code, I then wrestled with the file I/O, until I realized that whoever had written it had no concept of a buffer: the code would read a record, get a field, read the same record, get another field, etc.
In the end, I trashed roughly 78% of the code and then re-wrote what was left. One program went from 64 pages to sixteen, then on the re-write went down to four. Yup; FOUR. Run-time for that same program went from sixteen HOURS to 32 MINUTES. Then I re-wrote it again, this time in 4GL, and the four pages became a half-page. THEN I had to go to the Big Boss and tell him that whoever had written the original code had rigged the program to generate falsified fiscal information. Yup; the thing lied right through its teeth. You should have seen the reaction.
Whole thing took about three months, untold amounts of coffee, and three bottles of Maalox. Have fun with your own code.
Regards;
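The "red marker on the flowcharts" step in the story above is essentially a reachability check: anything that cannot be reached from an entry point is dead. A minimal sketch, assuming you have already written down which block branches to which; the paragraph names in `FLOW` are invented and have nothing to do with the actual program described.

```python
from collections import deque

def reachable(graph: dict, entry_points) -> set:
    """Breadth-first walk of everything that can be branched to."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for target in graph.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def dead_code(graph: dict, entry_points) -> list:
    """Blocks that appear in the graph but can never be reached."""
    all_nodes = set(graph) | {t for targets in graph.values() for t in targets}
    return sorted(all_nodes - reachable(graph, entry_points))

# Hypothetical flowchart: block name -> blocks it can branch to.
FLOW = {
    "MAIN": ["READ-REC", "REPORT"],
    "READ-REC": ["GET-FIELD"],
    "GET-FIELD": [],
    "REPORT": [],
    "OLD-TAX-CALC": ["OLD-ROUNDING"],
    "OLD-ROUNDING": [],
}

unused = dead_code(FLOW, ["MAIN"])
```

Note that `OLD-ROUNDING` is called by something, yet it is still dead because its only caller is dead too; that is the part that is easy to miss when reading code by eye.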
This *totally* deserves a mod up
This site I just googled: http://buytaert.net/cms-code-base-comparison has an interesting (not sure if accurate, but you can wc -l all the files in the latest if you want) comparison on CMS systems.
Wordpress has around 60.000 lines - not too much according to you - and first I somewhat agreed.
To write a module/plugin is relatively easy because docs are OK most of the time. But to MAINTAIN for example the entire WP codebase and knowing every little detail is a different thing IMO.
We have to maintain a similar size ASP JSCRIPT site (around 40k lines last time I checked), and who knows how much more for the native WIN components..... our decision was to rewrite the whole thing in PHP, and the rest in probably JAVA or C with perl for some data processing.
Well, you have to imagine how happy we are with the completely undocumented code that has no comments, and updates sometimes come in the form of unexplained set of files in a cute zip package. A diff would show 10000 changed lines, and since it does not follow the MVC model, you have a lot of html/design embedded in the code (in an ugly way really)....... no explanation on what was changed, not even a list of functions......
Well, what I am just trying to say is that I can see how a small project can span over 3-5000 lines you know by heart, but how someone else's crappy 40k code can be a nightmare at the same time.
By the way, the language is also a factor..... 40k lines of perl can be a lot to read ( considered "write only" by many), while 2 mouse gestures can generate a few-hundred lines easily in any visual IDE.........
just my 2c really.......
Once for over 1 year my sole job was to maintain the most lucrative product for the company (millions/year). There were numerous other products with newer technology but this was a legacy system comprised of a C++ socket based service and numerous front end scripts and middle tier C++ components (~15-20k lines of code in all those aforementioned technologies). Any wrong change could cost thousands of dollars / day if not more. There were bug fix projects and enhancement projects. I learned that you learn the code one-bug-fix-at-a-time. The first goal is to get it working. Second goal is to break it on purpose and generally play around with the system. Also become very intimate with a debugger. It will make or break you. I didn't have the luxury of having the 'original developers' around (they were fired) so there was no prior knowledge. You are looking to keep your job for a while aren't you? Those who can do maintenance work (everyone wants to work on the new and latest code and coolest projects) will be employable till time ends. It is not glory work. Having done it for over 1 year on the same project I can tell you that the maintenance coder is not in it for the glory but rather for the satisfaction of a job well done AND for a steady paycheck.
Seeing software problems in terms of Flying Spaghetti Monsters? Ah, so that's where the "spaghetti code" term comes from!
This is Slashdot. Common sense is futile. You will be modded down.
... because clearly, you like setting the bar impossibly high for yourself.
You will never know the code as well as the original developer, so stop trying. For very old cases (>10 years), that developer was also the analyst who gathered the requirements, further cementing you as a 3rd-bit player in the drama. Let it go.
You *can* maintain someone else's code, though, if you can do a few things:
-dispense with ego
-learn to *read* code, especially as a reviewer
-ask lots of questions
As a maintenance programmer, you have to be fearless about asking questions, even if they dead-end you. You asked. You were thrust into a bad spot, you do your best to figure out where you're at. Assess the situation. There's no rush to fix anything, it's not like the problem's going anywhere and no one is hiring clueless mission-critical coders.
Start small. Start really small, like just reading the code as you might in a code review and see if you can spot trends. If you've been doing this awhile, you can start picking up on the strengths and weaknesses of the author(s). At the very least you can start to immerse yourself in the style and convention, making translation to the actual algorithms easier, i.e. what's this bit doing? I'm not embarrassed to say I've professionally reviewed code that I could never write -- it was VB and ASP -- but I know what object-oriented code should look like, should be capable of doing, and this wasn't it. It wasn't even good procedural/iterative code... but that's beside the point. The point is, I know when to use a while loop, a for loop, and when to unroll the loop. It's the kind of knowledge that comes in handy no matter what language I'm looking at. Declarative? No problem, it's set-based thinking and straight Boolean logic. Functional? Fine, let's start busting down the parentheticals. It's also about moving data into a register, eventually.
So, you start small, you read the code, you trace some data by hand, a little, and then... run the fuckin' thing with a debugger, step by step, and watch the data move. If it takes you all day to run it once, you're entirely ready on day two to start messing with it. You've likely done what only the original developer has ever done, and that's seen data at the top run straight through to the bottom.
--#
I currently maintain several million lines of perl. It's not hard,
Bow to your superior wisdom. I look at ~3 lines of perl and my brain overloads.
it mostly just works, and when it doesn't, it's not that hard to figure out where it's broken IFF there is a consistent repro case for the problem.
Ah...you just lost a huge degree of the admiration I was feeling.
All the interesting problems I've run across in my career did not have consistent repro cases. If they had, they'd have been easy to fix.
If you have a proper development/production divide, there shouldn't be any weird production issues unless you or your predecessor missed some test cases.
This sentence made me wish for troll mod points.
Even if (and that's a big if) you (much less your predecessor) managed to convince your boss that spending time writing unit tests was worth the time/money, you missed some test cases.
If you don't have test cases, that's a problem, if you don't have a properly firewalled and complete development environment, that's a problem, the code itself? Shouldn't be a problem.
Automated unit tests would make the OP's life easier, to a degree. But they wouldn't make this code base any easier to learn. I feel like I'm feeding a troll here, but someone mod'd this up. So someone actually thinks you were saying something worthwhile, and I just don't see it.
I find the most instructive way is to see a real call stack from an application.
When I was doing Java a great tool was TogetherJ - you could point it at a method and say, show me all the possible calls this method can make. This can yield a really huge visual document (that I printed out on a plotter) but it was really useful for peering into the application.
If you don't have a tool like that, the next best thing is picking some interesting things way down in the bowels of the application, and get a call stack at that point (either breakpoint while you are running or some kind of log). Do that in a few places, and you start to have a sense of how things flow.
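A minimal sketch of the "get a call stack from deep in the application" idea, in Python for illustration. The three functions are hypothetical stand-ins for layers of a real application, and `log_call_stack` is an invented helper name; only `traceback.extract_stack` comes from the standard library.

```python
import traceback

def log_call_stack(label):
    """Print and return the current call chain as 'outer -> inner' names."""
    # extract_stack() lists frames oldest-first; the last entry is this
    # helper itself, so it is dropped.
    frames = traceback.extract_stack()[:-1]
    chain = " -> ".join(frame.name for frame in frames)
    print(f"[{label}] {chain}")
    return chain

# Hypothetical layers of an application, for illustration only.
def handle_request():
    return load_order()

def load_order():
    return query_database()

def query_database():
    # Somewhere "way down in the bowels": record how we got here.
    return log_call_stack("query_database")

chain = handle_request()
```

Dropping a call like this into a handful of low-level functions and reading the logged chains side by side gives roughly the picture described above, without a breakpoint session.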
The thing I like to understand in any application, my own or others, is data flow - knowing how calls reach each other can help you understand better how data flows through the system.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
During a co-op job I worked on a very large multi-platform app (several million lines of code)
The team had an LXR setup to do project-wide searching; however, it was aging and having problems, and was a bit difficult to work with.
As a side project intended for a report once I was back on campus, I set up OpenGrok, which worked brilliantly, was reasonably easy to configure, and was nicer to use once we got it set up. The team liked it enough that they switched to it permanently.
Both are open source, and were built to handle large code bases (LXR was built for the Linux kernel, OpenGrok for when Sun open sourced Solaris).
Another one I had tried, which was very easy to set up, was Gonzui. It's also open source, but didn't really handle the huge codebase as well as OpenGrok or LXR. For under 100k lines, it's probably fine, and the ease of setup may be worth it.
All three provide a web interface, and do indexing as a separate process from search, so we would re-index the code base nightly. This works very well for larger teams, though it might be overkill for what you need.
Heck, that's less than a month's output, if you're working on a well-designed project.
The cesspool just got a check and balance.
Just out of curiosity, what is your opinion of a "Large" codebase then?
That depends on the language, but anything starting above a quarter of a million starts to get large. Consider the Linux kernel - not a typical distro, or the dev tools, or even a minimal bare-to-the-bones distro, but the kernel. The 2.6.0 kernel is over 5 million lines. Later kernels are twice as large. 30-40K is about the lower threshold of a mid-size stand-alone system or a component in a much larger system. For example, at one job I worked on a component that was about 200K LOC. That was one piece in a distributed system containing several dozen vertical components on top of vertical layers of stuff summing up to several dozen million LOCs. This is only considering source code. Once you start considering configuration files, deployment and installation scripts, it gets more complex.
There is now a classification for ultra large systems that in the near (and very likely) future could easily go into the billions, posing new challenges on project management, source control, and just about anything relating to the question "who the fuck knows what this gigantic shit is supposed to do."
Now, difficulty of maintenance is not just a function of code size, but also code structure and organization and documentation.
You can work with a monster system that is in the millions of LOCs and not have a substantial problem implementing new functionality or bug fixes, and then in another job you have to maintain poorly written JSPs that collectively are in the 50-100k (with the latter job being a mutant klingon bitch.)
For larger code bases, I use the command line version of glimpse to search through the code. While there are a few open source code search engines, I find glimpse with a few formatting scripts works just fine.
To understand the code, you need to read the code. You'll also need to index the code so you can bounce around it to read, since the limit of most people's stack is only a few items.
Next, figure out the dead wood. Don't remove it yet.
Next, learn what the heck the thing is supposed to do. Work out what it is supposed to do from the things the code interfaces to. Talk to users and/or the business owners. Talk to the authors of the code. Speak to the problem domain experts.
Next, make sure that you know when it works. Regression tests are your friend here. You need both global tests to make sure you didn't break anything in the large, as well as unit tests, to make sure you didn't break anything in the small.
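One way to get a "do I still know it works" check before understanding the code is a record-and-compare (sometimes called golden or characterization) test. The sketch below is an assumption-laden illustration: `legacy_format_price` is a hypothetical stand-in for an inherited function, the file name is invented, and it only handles outputs that survive a JSON round trip.

```python
import json
import os
import tempfile

def legacy_format_price(cents):
    # Hypothetical stand-in for an inherited function nobody fully understands.
    return "$%d.%02d" % (cents // 100, cents % 100)

def check_against_golden(func, inputs, golden_path):
    """Record outputs on the first run; afterwards list inputs whose output changed."""
    actual = {repr(x): func(x) for x in inputs}
    if not os.path.exists(golden_path):
        # First run: whatever the code does today becomes the reference.
        with open(golden_path, "w") as f:
            json.dump(actual, f, indent=2, sort_keys=True)
        return []
    with open(golden_path) as f:
        expected = json.load(f)
    return sorted(k for k in actual if expected.get(k) != actual[k])

golden = os.path.join(tempfile.mkdtemp(), "format_price.golden.json")
inputs = [0, 5, 199, 100000]
first_run = check_against_golden(legacy_format_price, inputs, golden)   # records
second_run = check_against_golden(legacy_format_price, inputs, golden)  # compares
print(first_run, second_run)
```

The recorded file says nothing about whether today's behaviour is right, only whether it changed, which is usually the question a maintainer needs answered first.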
Next, start to remove the deadwood to make sure it conforms to the spec. This can be an excellent way to learn how the code works, but also is fraught with danger. Why is that extra field always '0'? Remove it. Could be nobody notices, or it is critical for the parser for the consumer of the data to continue working. Learn what matters and why. This step may not be feasible in some environments.
Assume everything will take 3x what you think it will. There are often hidden dependencies, no matter how clueful the original author was. Odds are he/she/it wasn't clueful (playing the numbers), which means 3x is too optimistic.
Resist the urge to recast it in your own image. It won't help as much as you think it will. Rewriting from scratch is often a waste of time, even if it seems like a good idea at the time. I've been burned by this several times, often with only so-so results.
Plan on spending extra time documenting and speculating what the code should be like. Chances are this won't be the only time you have to do this.
I've also found it useful in learning to read code to read, say, the 4.3 BSD network code then read the annotated books on the topic. It is big enough to be interesting, and small enough to keep in your head. The linux kernel books cover something that's really too big to learn from easily.
Nobody teaches this anymore, but that's another rant.
233 comments and not one mention of ctags or cscope yet.
XML causes global warming.
I've been in a similar situation myself, though thankfully not (as sounds possible for you) by myself, and I learned one thing above anything else.
Never, ever, trust your memory. As soon as you figure something out, write it down. Right that second, while it's still fresh in your mind exactly what you learned. It doesn't matter as much how you write it down (commenting the code, a separate text document, or for that matter keeping a notebook and pencil close to hand), just that you do. If you don't, you will run across the sinking feeling that you already figured this problem out before, and since you don't remember what the answer was, you're about to do it again. It will also help others that you work with, and even if you don't right now, it's quite possible that you will.
To fight the war on terror, stop being afraid.
cscope or GNU Global are great for learning how code works. They are much more efficient than using find and grep.
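A toy illustration of why an index beats repeated find and grep: scan the tree once, then answer "where does this name appear?" from memory. This is not how cscope or GNU Global are implemented, it does not tell definitions from uses, and the two C files it writes are hypothetical examples.

```python
import os
import re
import tempfile
from collections import defaultdict

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_index(root, extensions=(".c", ".h")):
    """Map each identifier to a list of (relative path, line number) pairs."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    # set() so a name repeated on one line is recorded once.
                    for word in set(IDENTIFIER.findall(line)):
                        index[word].append((rel, lineno))
    return index

# Hypothetical two-file code base, written to a temp directory for the demo.
root = tempfile.mkdtemp()
with open(os.path.join(root, "motor.c"), "w") as f:
    f.write("int motor_start(void) {\n    return encoder_reset();\n}\n")
with open(os.path.join(root, "encoder.c"), "w") as f:
    f.write("int encoder_reset(void) {\n    return 0;\n}\n")

index = build_index(root)
hits = sorted(index["encoder_reset"])
print(hits)
```

The real tools add what this lacks (symbol kinds, callers versus callees, editor integration), but the pay-once, query-many trade-off is the same.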
5) Plan to remove the dead weight. There's always a lot of dead weight in these near-abandoned projects. Get an idea how to simplify things and plan your work in phases.
There's a lot of anti-IDE rhetoric going on, but I rely heavily on mine, Eclipse (for Java programming). I also rely on Vim, TextPad, less, and so on depending on the task. But for this particular question and the point about dead weight, leverage your IDE to clean house. You can play with compiler and static analysis flags to remove things like unused private methods, imports, variables, or whatever is applicable to your language. If the formatting is inconsistent, run a formatter that pleases your eye (assuming there isn't a group standard for that ... another religious programmer's topic).
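To make the "let the tooling find the dead weight" point concrete outside Eclipse, here is a rough sketch that swaps in Python and its standard `ast` module to flag imports that are never referenced. It is deliberately naive (it ignores `__all__`, re-exports, and names used only inside strings), and the sample source is invented.

```python
import ast

def unused_imports(source):
    """Return the sorted names that are imported but never referenced."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import os.path" binds the top-level name "os".
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports cannot be checked this way
                    imported.add(alias.asname or alias.name)
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)

# Hypothetical module text with two unused imports.
sample = (
    "import os\n"
    "import sys\n"
    "from json import dumps, loads\n"
    "print(os.getcwd(), dumps({}))\n"
)
report = unused_imports(sample)
print(report)
```

Treat the report as a list of candidates to review, not a list of things to delete.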
Other parts of Eclipse that I rely on especially when I'm in another team's code (we have about 2m real LOCs):
* Call hierarchy [ctl-alt-H]
* References [ctl-shift-G]
* Class hierarchy [f4]
Where I am, some of these conveniences are becoming more difficult to leverage as Spring and its XML configurations define object relationships.
I think the chances are very, very good that the original developers had the same problem you have.
I work on a 300KLOC codebase. It's mid-size. I started working on it about six months into a complete rewrite, and it's been over two years, so I know the code. But it took three months until I knew the codebase well.
40KLOC? You should be able to pick that up in a month, full-time. I've learned an open source project of that size in my spare time in a few weeks.
It's a great way to make sure the code works the way you expect, and when it doesn't you can learn how it actually works. Often you will find that this will expose huge flaws in the original code too.
After that, it's a source of documentation, sort of.
Enjoy!
More should. It's a small part of the problem, but it does help.
Help stamp out iliturcy.
The Software Engineering Radio podcast at http://www.se-radio.net/ had a great show with Dave Thomas from the Pragmatic Programmers on this.
Well, for my 2 cents, I've been working on a project by myself for the past 6 months, starting from scratch, and it's up to about 85,000 lines of code, and I would classify that as medium-scale. It all depends on what your perspective is I suppose.
But, like you said, a well organized 85k lines is a lot smaller than a poorly written/organized 40k lines.
... this is the norm here in the last 7/8 years or so. doin' it differently - unit test, good design & practices, honesty, long term planning etc. - are the best strategies to get you bumped out of the project.
Oh no, there's no documentation...oh wait yes there is...it's on this single sheet of A4....in swahili. Perfectly normal introduction to the new work environment in my experience. Grow a set, and hope the guy who wrote it wasn't actually a genius because it's a hell of a lot easier fixing the fuckups of regular developers.
30k-40k is not a large pile of code.
If you're having problems, either the code is poorly written and documented or you have risen to a job that is at the limit of your capacity.
I suggest refactoring the code a bit. It will tidy things up, making you able to get to grips with it, and give you a tour of the code while you're at it.
Under 500,000 lines of code should not pose any size issues for most good programmers. It is poorly maintained spaghetti code that will fuck you up at even 5,000 lines.
If you get crappy code and you have to do any serious amount of work on it, your best path is to refactor.
It somewhat depends on the language used - some languages are easier to penetrate than others. And some languages do more in 10 lines than other languages do in 100.
But anyway - to learn the code you may have to find a starting point (there is usually at least one logical point to start) and then make a flowchart in PowerPoint or something for the general structure. There's no point trying to get into the finer details, just a general sense of flow. You will get things wrong in the beginning, but don't worry. And you may end up finding a lot of dead code too.
When you have a satisfactory overview of the code it's time to really swim and drink the code. Many programmers have a tendency to accept that "it works" and stop there. By throwing the code into the compiler at maximum warning level and then trying to fix all warnings you will become even more involved. And if you aren't satisfied you can take on the code with code analysis tools like Splint (for C) or FindBugs (for Java).
And don't forget that the commands "find" and "grep" in *NIX are your friends. Other environments usually have other tools, and IDEs have their own, so you don't have to install Cygwin or something to get a grip on things.
And if you think that you don't understand the code well enough - try to port it to another operating system or other language.
Of course - this takes a lot of time and consumption of your favorite hacking beverage.
And yes - I'm involved as a single developer in a system with about 400k lines of code written in Java, and it was ported from an older system written in C, C++, Basic, Java, DCL...
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Flow chart it. Crawling through the code is the best way to learn it.
A few years ago I was the maintainer for a couple of Unix services. They ran on embedded machines and were 25-30k lines a pop. The best thing I found to cope with it was getting the code into a nice IDE (the CDT for Eclipse) and using a visualization package to understand how all the data structures were laid out; I think I used Graphviz.
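A small sketch of the Graphviz approach: write down which structure points at which, emit DOT text, and let Graphviz draw it. The struct names are hypothetical, and how the poster actually produced their diagrams is not stated; this is just one way to get from notes to a picture.

```python
def to_dot(graph_name, references):
    """Render {owner: [referenced structures]} as Graphviz DOT text."""
    lines = [f"digraph {graph_name} {{"]
    for owner, targets in sorted(references.items()):
        for target in targets:
            lines.append(f'    "{owner}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical data structures found while reading the code.
layout = {
    "connection": ["socket_buffer", "session"],
    "session": ["user_record"],
}
dot_text = to_dot("structs", layout)
print(dot_text)
```

Saving the output as `structs.dot` and running `dot -Tpng structs.dot -o structs.png` produces the drawing; regenerating it as the notes grow is cheap.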
Look for things like misspellings, undefined behavior, indentation screwups, and so on.
The reason is, if there's a lot of these, that's a big clue to you that you have to be MUCH more careful with the code, because it is probably crap. Stupid comments? Probably crap. Explanations of things that are a bit surprising, with citations or justification? Maybe not so bad. Comments that are visibly out of sync with the code? Bad. Consistent naming convention? Good. Inconsistent naming convention? Bad. Tons of copy and paste? Bad.
Knowing whether code is good or bad does you a ton of good in understanding it. If you know the code is crap, you have a better chance of guessing how some idiot will have gotten it wrong. If you know the code is good, you can often guess how someone would have tried to make it robust and/or maintainable.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
My advice to you is to start drinking heavily.
A lot of good advice above, but there's a political aspect to this which is very important.
How you do will very much depend on the expectations that management has. They very often assume maintaining code is much easier than writing it in the first place which is of course the exact opposite of the reality. Make sure you talk with them about expectations up front, about how soon they expect you to be competent in the code.
In all honesty, I worked somewhere that had inherited about 800,000 lines of code, with 5 very sharp guys, and no one understood the code even remotely close to the original authors after 2 years of supporting it.
Companies need to understand that unless they pair program, when they lose the programmer, and he didn't have a partner, they lose the code. You might as well just rewrite it because it will take a new person as long to understand the old stuff, longer perhaps because he is not learning it in the orderly progression that was there when it was written.
There's a maxim, "He whose work is the most incomprehensible gets the most respect". The suits and pin heads who run software companies fall for this 100% so the worst programmers, who by luck of the draw got to write the first spaghetti mess, are glorified while the maintenance programmers are seen as little more than janitors.
I make it a rule to go on unemployment before accepting a job as a maintenance programmer. Avoid it at almost any cost! It's a thankless job that usually ends in frustration and tears unless you have a VERY understanding manager or circumstances have granted you a very successful product that needs few fixes or enhancements.
Often companies hire programmers when they are behind or they lose people because they overworked them. Thus you are coming into a bad situation, already behind with everyone expecting miracles.
The 'size' of the code really boils down to what needs to be examined/changed. If you have a billion lines of code that are rock-solid and a million other that may need to be modified -- that's a big difference. Programming is all about localized knowledge.
That's nothing. I work with 50K+ loc projects and because of that I won't offer any real solution to your problem. I just want you and everybody else to know it and so I write it here.
Not necessarily end to end, but leave yourself a trail of breadcrumbs
as you trace through and learn the code stories.
If you can write about it accurately, you understand it. If you
can't, you have to dig deeper in that area til you comprehend it enough
to summarize it and its quirks accurately.
I had a prof once who shall remain nameless, though he claims to
have "invented" modules. But he did have some good advice. He said,
even if you just hacked together some code (or someone else did), you
can retrofit software engineering standards onto it by going through it
and writing the design document after the fact (assuming the crap didn't
come with one.) This not only leaves a legacy of a maintainable project,
but allows you to understand the essence of the software and the
important decisions that were made in the construction of the software.
Where are we going and why are we in a handbasket?
When I inherit such a monster I just start studying it. Too bad the environment of today's open plan office doesn't allow concentration necessary to learn code. This will doom our planet in a hundred years. If only women in the office could be required to shut up for at least 20 minutes out of every hour.
This has happened to me several times, and again just recently. I'm not sure how many lines of code it was this time (I don't really care), but several thousand files (I do care about the structure). 'This is your new project, we have some stuff we need done ASAP'. The big constraints are:
- They want you to start doing stuff right away. That's usually a given.
- Therefore you do not have time to fully understand this code. You do not have time to do a full dissection. Just give up the idea that you can even do so in the short term; that will just paralyze you.
- Very little useful documentation. Read it if there is any, but keep in mind that it is usually out of date and therefore a filthy lie.
What you need is a good understanding of the parts of this code that are important right now and some high level overview. If you knock off enough of the little things you will end up learning the whole thing. In this way you gain enough confidence to move forward. So, get cracking:
- Make a safe copy. If you're lucky it's already in version control. If not, do it yourself. Check in your test stuff fairly frequently (not in the main trunk!) because you will be breaking things often at first.
- Use cscope or any other tool you like that will let you hop around the code like hyperlinks. cscope lets you do the following very important things: find the definition for this thing (method, structure, #define, whatever). Find all places that are calling something. Find some text anywhere in the code base. Find a file anywhere in the code base. You need this integrated into your editor so you can do all this without thinking - you can be cruising along, hit a reference to an unfamiliar but important looking datatype or method and just hit a few keys and go to the definition, wherever it is. And then pop back. If you're using Visual Studio then this is already built in, as much as I hate VS otherwise. cscope is an easy addition to emacs, I imagine for vi too. As a last resort, stand alone cscope, but it is so much slower than having it in your chosen editor.
- Add plenty of debugging printfs in areas of code you're interested in. #define a macro for it so you can turn them on or off easily. You can run it under a debugger, but I usually find that takes much longer to step through unless you know exactly what you're looking for already. And with the printfs you will soon develop a feel for what's going on and what values you expect to see. Debugging printfs are like a heartbeat for the code.
- Take notes, in a wiki or whatever you prefer, on the general structure of the program - mostly which areas of the code do critical things that you're interested in, like common/engine/pp.c contains the paper path motor and encoder logic. Or anything else important you find.
- Start solving problems. You won't learn the whole codebase at once by zeroing in on a specific issue to fix, but you will learn subsystems fairly well that way. There should be sufficient separation of logic unless the code is hopelessly broken (which is possible). That's the big thing. Don't worry and get paralyzed if you don't understand it all right away, just work on understanding the bits you need right now and eventually you'll build up a picture of the whole thing.
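The debugging-printf bullet above describes C printfs behind a #define. The sketch below shows the same on/off idea in Python instead, so it is an analogue rather than the poster's macro: the `LEGACY_DEBUG` environment variable, the `dbg` helper, and `advance_paper` are all invented for illustration.

```python
import io
import os
import sys

# Assumed switch: run with LEGACY_DEBUG=1 in the environment to see the output.
DEBUG_ENABLED = os.environ.get("LEGACY_DEBUG") == "1"

def dbg(fmt, *args, enabled=None, stream=None):
    """Write one tagged debug line if enabled; return True when a line was written."""
    if enabled is None:
        enabled = DEBUG_ENABLED
    if not enabled:
        return False
    out = stream if stream is not None else sys.stderr
    out.write("[dbg] " + (fmt % args) + "\n")
    return True

def advance_paper(steps):
    # Hypothetical routine sprinkled with debug lines, printf-style.
    dbg("advance_paper: steps=%d", steps)
    position = steps * 2  # made-up encoder math
    dbg("advance_paper: position=%d", position)
    return position

# Demo: force the output on and capture it instead of writing to stderr.
buffer = io.StringIO()
dbg("position=%d", 42, enabled=True, stream=buffer)
captured = buffer.getvalue()
result = advance_paper(4)
```

Because every call goes through one helper, the "heartbeat" can be switched off for production without touching the call sites.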
I realize there are people who are going to freak out at the idea that you would go in and poke at things before you fully understand everything, but unless you have the luxury of unlimited time, that's not an option. Someone up above suggested writing unit tests for existing code, which is a good idea in general, but is probably far more time consuming than you have been given time for. Try writing unit tests for the area you are working on right now if you have the time. It's possible the codebase is so broken that the little changes you are making here are having adverse effects elsewhere, but all you can do is try. Eventually as you knock off issues you'll gain confidence and knowledge and before you know it people will be coming to you with questions about the codebase.
break the code :-)
it's the same as dismantling your dad's radio/car/computer to see what's inside and how it works and re-assembling it, only to find out there is still one piece left
Medium size is 250 to 750 million lines of code (one person can still understand how it all works). Big is 1 to 10 billion lines of code. Really big is >10 billion.
I have worked on code bases of all of those sizes, and I like the medium size the best -- it's big enough to be interesting, and small enough that you can understand it all.
One that I've worked on (over 25 billion lines) is just too big for my tastes -- over 3 years to do a clean recompile is excessive.
---
Someone always has to be the biggest and the veriest, don't they? ...
Have to share painful past tale. I inherited a ~30K line app that ran on an embedded system of which we had two copies of the hardware, both in production. These were used to do PIN block translations from an acquirer network to the bank/verifier networks & associated security stuff using strange-o NCR security processing equipment plugged into some kind of Intel OEM chassis with an 80186 board running the embedded app, talking to a mainframe over a pair of 48Kbps SDLC links (and the SDLC protocol was part of the app) powered by the i82530 SCC.
I inherited the code because the author, who was a friend of mine, went home with the flu one day and dropped dead 3 days later -- aged 29.
Worst part...I couldn't even rebuild the current binary that was in production from the code I found on his PC. But I spent time trying to understand the code base...and it was hard, especially without being able to run it on anything.
I had a manager who simply wouldn't listen to me until I had printed all the code out -- which was pointless, and I was too stubborn to just print the code out and say...ok, that was a waste of time, now what.
A middle manager was appointed who came in bursting with enthusiasm. She *did* print out all the code...um, using MS Word as an editor so that it "looked nice" (i.e., appropriately girly girly choice of fonts).
She was very keen and said...I'm sure we can go through this in a morning. Well, I was secretly thrilled when she was on the verge of tears by teatime. We never got on top of that system -- our management wouldn't consider my suggestion that we redo the thing on a normal PC & use Linux and change the comms stuff to TCP/IP. But one of the other disgruntled people from the company saw the gap, quit, started his own consultancy & after not too long, showed me the same SP stuff that was remotely manageable over X11.
I think the biggest trouble is with knowing why things were done. You will look at the code and see that decisions appear to have been made arbitrarily. You'll scratch your head wondering "they had 3 design options but they chose this one, why?". You need to understand the use cases to know the why. It's not always obvious, because many times it's based on tribal knowledge that was obvious at the time but not now, so no one thought to document it.
Ask around and find out if any of the higher ups from the original project still remain. Set up an interview with them to get the project history and go over the use cases. When you go back to the code, you'll better understand why things were done.
Camping on quad since 1996.
Use a source browser program and you can easily find things and understand code written by others in very little time.
Here are some links to source browsers:
http://linguistico.sf.net/wiki/doku.php?id=software_libero:programmazione#browser_di_sorgenti
A former Windows div Microsoftie says: shindex, baby, shindex! If you don't know what that is, ask the guy or gal in the next office over. And then be prepared to spend the next week or so troubleshooting permissions problems until it works across all versions you care about. But after that, you're golden and there's no faster way to search the source.
And yeah, I agree with the suggestion of installing SOME *ix toolset. I'm partial to unixutils because they seem lighter weight than some of the subsystem-based solutions like Cygwin or SUA aka Interix.
Actually, permission rights were never the problem for me. As a build engineer, I had full read and write permission to every windows code base. And, I had a keyword that would override any Product Studio that might be there.
Basically, I could have checked in a "Hello World" dialog into explorer.exe without any approval from anyone... as a build engineer, one needs that kind of power.
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
Thar: fixeth'd it fer ya!
You've inherited a fairly large (30-40 thousand lines) collection of code ...
30k to 40k lines of code is not large by any means of measurement.
A programmer running mad will churn that out in a year or less. But perhaps that is your problem ...
Anyway, as a hint toward understanding, I suggest debugging it. Perhaps you can find old bug reports (hopefully fixed in the meantime) and try to play them back with a debugger, set some well-placed breakpoints, and get an idea. OTOH, I fear your program is just plain old C, so it might be hard to grasp in debug mode nevertheless.
Good luck.
angel'o'sphere
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Many times my job is to help companies that had this kind of problem:
"We need to fix a bug/add a new feature to this huge code base"
Most of the time it's a few hours max.
The "secret" of how I do it:
1. Don't say "Who's that as&$## that wrote this code?" (it'll not help you) :) "Good luck and may the force be with you"
2. Don't say "Why did he code it that way? It could be done with much less code elsewhere" (it'll not help you either)
3. Your job is to find how to add the requested change while not changing too much code. Always remember: Every line of code that you change = tons of new problems.
4. Tools needed: Notepad / vi / pico / nano, Windows Explorer Search of XP / 'find' in unix, and the compiler is all I need. For this kind of job I don't spend my time installing IDEs and doxygen.
5. Last thing to remember before starting to work: Try to avoid adding additional libraries that depend on other libraries / special system features. I try to find open source / free and small code. Pure / close to pure ANSI C/C++ is the best. Few source files - best!
6. First step: Re-compile everything if it doesn't take too much time and run the compiled code to check that you have all the environment needed. If re-compiling would take too much time (I had a project that took over 24 hours to re-compile...) compile only the relevant modules.
7. First thing to do: try to break the code into modules but ignore any module not related to your task. Write a text file with only the relevant things that you find.
8. Try to find the smallest change to make to the code. It can be a crazy change but the most important thing: it must be the smallest change.
9. Pray that it'll work
Whaaaaat? Why does the person doing the builds need write access to ANY of the code base? That makes no sense!
For a site about things like basic rights, Slashdot users sure do like to censor "dissent".
it is a sizeable task, and is the type of topic that few professional journals or books will ever be written about.
Right, no one has ever written a single book on that topic.
(...) What do you think about intermediate variables that are not strictly necessary?
Obviously your example is exaggerated, but I wonder the same thing as you. I used to declare too many variables, I think. I read the code from Triplify (just 500 lines of PHP) and it was interesting to see how they did more inline. I think it was neat, but sometimes confusing. I haven't found a balance yet, though.
Ooooh yeah.
Four #includes and a line that starts "int main" is all you need.
"Whaaaaat? Why does the person doing the builds need write access to ANY of the code base? That makes no sense!"
Not sure how you do it, but I tag the source before building it.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
40K Sizeable? Hell no.
I picked up a one million line codebase with one other engineer. Sure we don't know it inside out, but we're able to work with it. I'm never going to know it like I wrote it myself, but well enough to maintain and add functionality, sure.
Already stated, but my 2 cents:
- use a good IDE with fast reference lookup (e.g. right-click on a function call => "follow")
- use a profiler to see a flowchart or UML for a high-level overview
- start commenting the classes and rename them if their names are unclear. There are nice tools out there (depending on the language) which create DocBlocks for everything first, and then you can use Doxygen to generate a nice overview of everything.
And about the question of how bad you are: one of my IT lecturers had worked at BMW, where they had tested the efficiency of their programmers on new code and on code written by other developers. When the same developer had to extend or change other developers' code, he was a hundred times slower than when writing code of his own. That was around 2001, if I'm not mistaken.
Check out Code Rocket - this is what it's for.
Now I've heard everything. Mission critical and PHP in the same sentence.
You need to
1. Understand the business rules.
You need to know what the system / application does before you can begin making changes to the code.
2. Get an overview of the system design / code structure. If there is any (otherwise it is going to be very difficult).
Break down the system into use cases and try to see what part of the code each case covers.
That should give you an idea of the business logic and the class structures (assuming it is not one big bowl of spaghetti).
3. Create a working document with your diagrams and development plans.
Put all your observations on a whiteboard, paper or a napkin as needed. But remember to draw it in Visio, Word or OpenOffice.Writer too.
You don't have to do this all at once. It can be done as you move into the code to fix bugs or when making changes.
It will probably take you between 6 and 18 months to get fully acquainted with 30-40K lines of code.
It also depends on how hard business is pushing you. The more pressure on bug fixing and system changes, the less time you will have to learn about the system as a whole.
Even though 30-40K lines isn't that much it is probably more than a one man job.
If it is a business-critical system, it is more likely to be a 2-3 headcount job.
You should have your own exit strategy ready, and get out of there in case the business won't take your challenges seriously.
Anyway, I hope they pay you well.
Good luck with it.
Reading the code is no good. Instead you should learn what the class names and function prototypes are. You can get a pretty good picture of the code just by looking at the functions.
I am starting work on an extremely large code base with globally scattered teams. It scares the hell out of me and makes me want to retire even though I have been writing code as an EE for 35 years, mostly real time, hardware centric. This new one is a gigantic GUI based, distributed nightmare.
I think the tools, if not the applications, have reached a complexity that challenges the best and brightest. To make it worse, there is a tendency toward less mentoring and training. The entire prospect of multi-tasking between complex products, regularly switching between them, is inefficient because you tend to lose focus. Management expectations are untenable as the bug count exponentiates. The entire profession needs to step back because the limit of human capability has been reached with this paradigm.
I even found a bug in slashdot as I was typing this missive !!
Source Insight lets you browse source code - very useful for largish codebases. It's much quicker than findstr or grep because it has an index rather than having to search the whole thing. It's not free of course but I'd never go back to findstr having used it.
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Whaaaaat? Why does the person doing the builds need write access to ANY of the code base? That makes no sense!
First, the build tools are in the code base... we're not just running "make" here, there's a hojillion scripts doing a hojillion things every which way...Windows goes through a crazy amount of pre and post processing...
Next, this is Windows... it's a critical build... the build MUST be pushed out every day, and include as many checkins as possible.
Someone pushes out a checkin that breaks the build... 100 other people made checkins and their code didn't break the build, and they need to test their code now... We can't just say, "sorry, build broke, we're scrapping, Person XY needs to fix the break, and then we'll start again." Because the build takes 14 HOURS!!!
So, it's the builder on duty's job to revert the checkin and then restart the build, hopefully, you will have enough time to make the build finish by 9:00am tomorrow, when people start arriving at work.
You may be happy with your 3-4 hour compiles, and builds that can sit around broken, because 100 people aren't depending upon you for that build... meanwhile the real build engineers have to deal with serious shit.
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
"Whaaaaat? Why does the person doing the builds need write access to ANY of the code base? That makes no sense!"
Not sure how you do it, but I tag the source before building it.
Windows also has a bunch of metadata that each build generates along the way, and this metadata gets checked in during the build process...
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
I've taken over some code projects in the 10,000 - 100,000 line range. Usually they'd gone through several hands at that point and I think that's the hardest part. I mean, if you have code written by one person you get used to their syntax and style. When the code has been passed around, you run into a lot of different approaches to problems.
My advice would be to:
a) backup the code when you first get it. That way if you screw up, you can go back to the original.
b) read through the code for a while without a mind to change anything. Just get used to reading it and see if you can figure out what's going on
c) decide to change something minor. Something completely trivial and then do it. Any really little, pointless thing. It'll teach you how to find details in the code.
d) Do NOT change a bunch of stuff because there is "the right way" to do something and the code is doing it "the wrong way". Oftentimes there are things in old code projects which work in a fine balance and you don't want to change them. You may regret it later.
One thing I enjoy about taking over older code bases is learning how people used to do things. I've looked at code ported from old UNIX or DOS boxes and it's interesting to see how they got around memory or file restrictions. Definitely a great learning opportunity.
'I am currently working with a mission-critical codebase, which is written in PHP..'
Sorry, I'm sure you have valuable points to make, but I stopped reading at this point because I was laughing too much to continue.
Mission Critical. PHP. *wipes eyes, sighs*. Good one.
You have my deepest sympathies.
The freely available book "Reengineering Patterns" (http://scg.unibe.ch/download/oorp/) contains practical advice and shows systematic ways to tackle these situations from a variety of angles. Without knowing more details about your problem it is hard to recommend concrete steps, but _do_ read the book in any case.
Specifically limiting yourself to "reading code" and relying on the likes of "grep" is (as far as I'm concerned) behaviour of a Code Monkey, not a Software Engineer.
As a result, although the new functionality worked fine, the application still suffered from the "spaghetti" code of patches upon patches from years of various developers adding additional capabilities, and no one ever addressed the reliability of the application. The support group for this application was clearly frustrated with years of late night calls and hours and hours spent trying to correct errors.
About 6 months ago I was tasked with essentially "cloning" the application for new business purposes. I proposed porting the application to a newer, more modern language (java). It took a lot of selling (i.e. convincing management and other developers that the end result would run just as fast, be easier to maintain and have more reliability), but I was able to get them to buy off on it.
The rewrite was completed about 3 months ago and the results were better than I had hoped for. I was able to complete the rewrite in the same amount of time allocated for the original "enhancement" project. The application actually runs faster than the old one, has yet to crash (it runs 24x7), and the code is well structured and easy to maintain. We're now in the position that if/when another "enhancement" is requested to the old application, we can simply clone the new java version and completely replace the old app. Given the results of the last project, it won't be a hard sell (especially to the support group) to go the java route.
I know this is a long post, but the bottom line is that sometimes (more often than many realize), recoding an old application in a modern language and bringing it into the 21st century rather than patching old code can pay off dividends beyond the basic added functionality.
Sometimes the light at the end of the tunnel is the headlight of an oncoming train.
This will kill two birds with one stone. Write unit tests for the codebase. You will learn the code and learn what it's supposed to do well while you're doing it. Further you'll be in a better position to make changes without breaking the existing functionality.
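To make the suggestion concrete, here is a minimal sketch of what such "pin down what it does today" tests could look like in Python. Everything named here is made up for illustration: `legacy_total_cents` stands in for whatever inherited function you are trying to understand, and the bulk discount is an invented example of the kind of undocumented rule you tend to discover while writing the tests.

```python
import unittest


def legacy_total_cents(quantity, unit_cents):
    """Hypothetical stand-in for an inherited function we did not write."""
    total = quantity * unit_cents
    if quantity >= 10:
        # Invented example of an undocumented rule: 10% off bulk orders.
        total -= total // 10
    return total


class PinLegacyTotal(unittest.TestCase):
    """Each test records behaviour observed by running the old code,
    so a later change that alters it fails loudly."""

    def test_small_order_has_no_discount(self):
        self.assertEqual(legacy_total_cents(2, 500), 1000)

    def test_bulk_order_gets_ten_percent_off(self):
        self.assertEqual(legacy_total_cents(10, 500), 4500)

    def test_discount_starts_at_exactly_ten(self):
        self.assertEqual(legacy_total_cents(9, 500), 4500)
```

Saved as a module, this runs with `python -m unittest`. The third test is the sort of thing you only learn by poking at the boundary: 9 items at full price happen to cost the same as 10 discounted ones.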
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
If you surf on over to Krugle.com, you will see that they now offer a free evaluation copy as a standard product. If you want to get a feeling for what can be done with the tool, just check out Krugle.org, where lots of open-source projects are indexed online. I would definitely recommend using the free evaluation tool as a way of speeding your high-level understanding of any new-to-you code base.
As a fellow build engineer, I always find it interesting to hear about the processes at Microsoft. One of the books I read prior to taking my first build position was "The Build Master: Microsoft's Software Configuration Management Best Practices" by Vincent Maraia. It was interesting to read about the type of processes that come out of a build that does take 14 hours and has hundreds of people working on the codebase.
One of the concepts I liked quite a bit was "The Gauntlet". I can't remember if this was used on Windows, or if it was specific to the Visual Studio team, but it was pretty slick in detecting what change actually broke the build. Though I heard the system would get backed up from time to time, causing lots of delays.
With the number of large code bases that Microsoft and other companies maintain, it still surprises me how primitive most build systems are. Only recently have companies started to release build-specific products, most only suitable for small codebases, or built for java/web development environments. I guess the problem is that large products are pretty unique in their build requirements. I work in the games industry, and most of our code build times are measured in minutes these days when the proper hardware is thrown at the problem along with distcc/incredibuild. The time-consuming processes tend to be more related to game content now, things like lighting levels, or generating AI pathing information.
I just started a few months ago a job where I'm maintaining an old embedded system (an isdn gateway, old technology) that is supposedly written in C++, but is actually bad C.
It has no comments and no documentation of any kind. Indentation is broken beyond repair. A lot of functions are several thousand lines long, while most files are in tens of thousands of lines.
All I needed to deal with it was generate tags. Once you've got the tags, you can jump to a declaration or definition easily anywhere inside the code base. That, combined with grepping all the files of the project for the right strings or regular expressions (the system does a lot of logging, so I can just grep for the log message to find the relevant piece of code), makes the job doable.
But then, it's still a boring job with little opportunity to shine. I'm personally leaving whenever I can afford to move again.
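For anyone without a tags generator handy, the tag-jumping idea above can be roughly approximated in a few lines of Python. This is only a sketch under loud assumptions: `build_tags`, `load_tree`, `DEF_RE` and `KEYWORDS` are names I made up, and the regular expression is a crude heuristic for C-style definitions that start in column 0. It is not how real tag generators parse code, and it will miss or misreport plenty of cases.

```python
import re
from pathlib import Path

# Control-flow words that the crude pattern could mistake for function names.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof", "else"}

# Heuristic: a line starting in column 0 with type words, then a name and "(",
# and no ";" afterwards (a ";" would make it a prototype or a call).
DEF_RE = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*$")


def build_tags(sources):
    """Map function name -> list of (filename, line number).

    `sources` is a dict of filename -> file contents.
    """
    tags = {}
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = DEF_RE.match(line)
            if match and match.group(1) not in KEYWORDS:
                tags.setdefault(match.group(1), []).append((filename, lineno))
    return tags


def load_tree(root, suffixes=(".c", ".h", ".cpp")):
    """Read every source file under `root` into a filename -> contents dict."""
    return {
        str(path): path.read_text(errors="replace")
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix in suffixes
    }
```

Something like `build_tags(load_tree("src"))["main"]` would then list where `main` is defined, assuming the sources live under a directory called `src`.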
You find out what it's supposed to do according to functional spec, and you write a test-suite against it. Two birds with one stone.
Religion is what happens when nature strikes and groupthink goes wrong.
You shouldn't feel bad about not easily understanding all parts of a large code base. I've been programming for over 10 years, and there are some systems that have been in production for more than 7 years and that I am still in charge of maintaining. When I have to go back and change something it is very difficult; it is almost like someone else programmed it, and it is tough to remember how things work. The problem is not that I am dumb now, the problem is I wasn't as good a programmer then and there was no budget/time for decent documentation.
meanwhile the real build engineers have to deal with serious shit.
Real Engineers use conditional compilation so they don't have to recompile every single stinking row of code every night.
"I don't know, therefore Aliens" Wafflebox1
I haven't read through all the posts and there are some great suggestions and strategies that have been outlined.
I've been through the same situation quite a few times in my career.
Have you been able to track down any of the project artifacts developed as the software was being created?
Business requirements, functional requirements, use cases, design docs, database designs, user guides, etc.
I know these documents, if they exist, can be out of date, incomplete, or puzzle pieces for how the software has evolved over time.
However, what may exist might be able to provide a high level picture of the software from different perspectives and shed some light on little nuances.
Just last summer I took over a project with over 250,000 lines of code. It was a complete disaster of a codebase, a total Rube Goldberg machine... but somehow, after years of poking and prodding and band-aids and what-not, it WORKED...however, even the tiniest code change took weeks to happen because the code was so badly written. The project had a ton of turnover through the years, and from the looks of it many of the coders used conventions from different languages they were familiar with, copy/paste all over the place, bad structure, fragile inheritance schemes, etc., etc.
So, I did the only thing that made sense. Started completely from scratch, picking out the parts that were usable as we went. We haven't finished yet, but I haven't looked back...
"If at first you don't succeed, lower your standards."
use revision control, and don't trust it -- that is, back up incessantly.
Dude... Find a better revision control system.
It's true. You'd think the compiler could just do the same, but it can't, not always. i++ can have consequences. ++i never does. But frankly, if you aren't doing billions of these a second...
I work at a major software company with millions of lines of code in our software repository. A lot of the developers here favor Source Insight (www.sourceinsight.com/). It is an excellent code browser for complex code bases.
Just FYI, you wouldn't need cygwin anyway. There is the minimal GNU system for windows, which is a native port of some basic GNU tools to windows.
I use it quite a lot with mostly satisfactory results.
You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
few (fy)
adj. fewer, fewest
Amounting to or consisting of a small number: one of my few bad habits.
Being more than one but indefinitely small in number: bowled a few strings.
n. (used with a pl. verb)
An indefinitely small number of persons or things: A few of the books have torn jackets.
An exclusive or limited number: the discerning few; the fortunate few.
There isn't one. All software has bugs. Even your revision control system. Don't trust it. Make backups. Redundancy is the only way to be safe.
Okay, first off, 40k lines isn't big, and unless they did a really horrible job of naming and organizing the parts, it shouldn't be hard to tackle. I'm dealing with a 1.5 million line assortment of legacy code, and while it's taken a while to suss out, I have a pretty decent grasp of where everything is.
It's a Zen thing. You look it over until you identify the top level units, then work your way down. Most applications have a framework. If you can't find a starting point, figure out which code is the most outward facing, read through the high level functions, and dig downward. Make notes. Look for comments along the way. If you don't see comments, write some. Absorb. No one understands any system immediately; if you become one with the code, you too can be a master.
Nowadays, I'm the architect, and while there's still more code written before I got there 5 years ago than since I arrived, I understand just about all of it. I've been programming for 20 years, and this makes the fourth time I've started with a million+ line system and ended up being one of the experts.
Patience, grasshopper.
*** *** You're just jealous 'cause the voices talk to me... ***
Results 1 - 10 of about 1,070,000 for "legacy code"
Let me preempt your next comment: "but how many of those are 'professional journals or books'?". Well, 2,640 of those are from the journal of the ACM. That's just a bit more than few now, isn't it? Looks like you have some reading besides dictionary.com to do.
If you don't have a repro case for a problem, you are getting way ahead of yourself trying to fix it, as even if you fix it, you won't KNOW that you've fixed it.
Without tests (and note, I did not specify automated unit tests, those are handy and speed things up, but I personally prefer end to end integration tests when dealing with a system I didn't write) you can't figure out how a system is intended to work, at which point understanding how it does work usually isn't helpful, and can actually be harmful as you internalize a model of how it does work as how it should work. It hides bugs from you, and often leads to your internal model being horribly flawed (from the perspective of what the program should do).
Does your boss have expectations about what the system does? If so, and if they tell you those expectations, you have tests. Sure, they are the manual integration kind and probably underspecified, but it's a starting point.
And yes, tests make a code base easier to learn as they give you something to trace through and a basis for reasoning about how the code base should work. Fleshing out and automating those tests refines that understanding.
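As a rough illustration of turning stated expectations into end-to-end checks, here is a small Python sketch. All of it is hypothetical: `legacy_shipping_cents` stands in for the system's outward-facing entry point, and the `EXPECTATIONS` table stands in for what a boss or a spec says should happen. The point is that the table describes intended behaviour, so any mismatch is a ready-made repro case rather than something to absorb into your mental model.

```python
def legacy_shipping_cents(weight_kg):
    """Hypothetical stand-in for the inherited system's entry point."""
    if weight_kg <= 1:
        return 500
    return 500 + 200 * (weight_kg - 1)


# (description, input, expected result) as stated by the people who use
# the system. These rows are invented for the example.
EXPECTATIONS = [
    ("letter", 1, 500),
    ("parcel", 3, 900),
    ("free shipping over 20 kg", 25, 0),
]


def check_expectations(system, expectations):
    """Run every expectation and return the ones the system does not meet."""
    failures = []
    for description, given, expected in expectations:
        actual = system(given)
        if actual != expected:
            failures.append((description, given, expected, actual))
    return failures


if __name__ == "__main__":
    for failure in check_expectations(legacy_shipping_cents, EXPECTATIONS):
        print("MISMATCH:", failure)
```

In this made-up case the first two rows pass and the third is reported, which tells you either the expectation is wrong or the code has never implemented it. Either answer teaches you something about the system.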
Realities just a bunch of bits.
The CFT utility (C Function Tree Generator) provides a summary of the functions and calling hierarchy.
The CST utility (C Structure Tree Generator) gives a summary of the data structures and how they are nested.
I don't think these utilities have been updated for quite a while. Can anyone suggest more modern versions of tools that do similar code analysis and reporting?
Yeah: that's a use of the word large with which I wasn't familiar.
Break it up into what appear to be the logical sub-components, test by making libraries, and seeing how things link together and headers are included, until you have sufficiently manageable pieces.
But, even in the aggregate, one or two read throughs should get you a "feel" for the code.
In Liberty, Rene
Size is relative. A well-organized, commented, documented, 200k-300k program is by no means large. Even for a single newstart developer. But 10k-20k of, say, badly written Perl might require a few sessions of therapy afterwards.
Believing something doesn't make it true. Not believing something doesn't make it false.
No, my next question would be:
"Is hanging out on Slashdot looking to cherry-pick a phrase out of context for the sole purpose of telling someone, anyone, that they are wrong, a lonely life?"
I'll stick with my opinion that the submitter's question was A) A good question, B) Worthy of honest response and discussion, C) Germane to an area that gets less coverage than it deserves.
And that your response added nothing worthy to the discussion.
Check out the excellent article Code Spelunking Redux: Is it getting any easier to understand other people’s code?, and learn to love Doxygen and DTrace (if your language is supported).
May I suggest reviewing the FAMOOS Object Oriented Reengineering Handbook. Ignore the age (1999) and consider the approaches.
FAMOOS Handbook: http://scg.unibe.ch/download/projectreports/FamoosHandbook.pdf
I feel like I'm feeding a troll here, but someone mod'd this up. So someone actually thinks you were saying something worthwhile, and I just don't see it.
On the other hand, some people just are that lucky, and never have to wade through three or four feet of someone else's leftover muck.
The higher the technology, the sharper that two-edged sword.
I was thinking more of the perennial "It works fine on my machine" problem.
Then a tester (or worse, a client) installs it, and there's some Terrible Thing that happens pretty much at random that you don't have any way to get enough information about to reproduce.
We had a case a few years back where, every 2 or 3 months, all our machines at one client would quit responding to input. They'd have to shut down production, hard reset the server, and then all the clients.
We spent months trying to repro this in-house (the client was in France, so flying someone out to their site just wasn't in the budget...although wasting those months probably cost more in the long run).
We finally narrowed the problem down to e-m interference between some machine they only used about once a month (so it still didn't happen every time) and our wireless network.
The solution, which took one guy a weekend, was to switch our communication protocol from TCP to UDP.
It's kind of hard to predict test cases for that sort of thing.
I could not disagree anymore with your statement.
If you're not disagreeing anymore, presumably that means you're agreeing. Or something.
The higher the technology, the sharper that two-edged sword.
I am article submitter O.P. and not retard I am programmer with Master DEgree in Computer Science from Indian Institude of Technology and If I am retard why does IBM give me 40.000,00 lines of code? American IBM cannott do it so they give it to me because of my education in India IBM paies me 2 Mexican paysos for every line of code I fix that American coder screw up and I need food and room like American does. If American wants money than American should do job correct the first time and not have to send it to INdia to get all the work done correct. As AMerican teenager say DONT HATE THE PLAYER HATE THE GAME
The good news is that he's not a technical writer.
The higher the technology, the sharper that two-edged sword.
When on separate lines (or as separate expressions like 'for (... ; ... ; i++)') they should compile to the same code for C and Java.
They compile to different things in C++ for non-primitive types when operator++/operator-- are defined, such that pre-increment has a (slight) performance gain.
Of course you should make backups, but sitting around gnawing your fingernails in terror isn't really necessary.
Assuming that you only use the basic tools, but when you are a *NIX nerd you will soon get accustomed to a lot of the other tools too.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
Ahah, you see my point! I don't gnaw my fingernails in terror because I back up regularly!
I used to port Japanese RPG games into English for Working Designs, which were similar if not larger code bases. All the comments were in Japanese, and frequently many of the tools used to build the product and assets were missing.
The way I dealt with it was to only focus on the problem I was trying to solve, and not worry about the rest of the code. The poster who said to backup the code in a VCS was right on... once you know you have a stable base to go back to, you can try all the changes you want.
If you approach the code with a goal, you can then think about likely places where that code would be. Grep is your friend. If the code has embedded strings, you can search for those strings. Otherwise, you can find the handles for those strings, and search for those. If it is some sort of I/O or database access, search on those call names. Frequently there are naming conventions that you can learn and use to find stuff.
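The "find the string, then find its handle, then find who uses the handle" hop can be sketched in Python like this. Big assumption: the resource format here (`HANDLE "text"`, one per line) is invented for the example, as are the names `find_handles` and `find_uses`. A real project's string tables will look different and need a different first step.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def find_handles(resource_text, phrase):
    """Return the handles of resource lines whose text contains `phrase`.

    Assumes a made-up format of one `HANDLE "text"` entry per line.
    """
    handles = []
    for line in resource_text.splitlines():
        if phrase.lower() in line.lower():
            head = line.split('"', 1)[0]  # everything before the quoted text
            match = IDENT_RE.search(head)
            if match:
                handles.append(match.group(0))
    return handles


def find_uses(sources, handle):
    """Return (filename, line number, stripped line) for whole-word uses.

    `sources` is a dict of filename -> file contents.
    """
    word = re.compile(r"\b" + re.escape(handle) + r"\b")
    hits = []
    for filename, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if word.search(line):
                hits.append((filename, lineno, line.strip()))
    return hits
```

So you start from the message you saw on screen, get back a handle such as `STR_SAVE_FAILED`, and then search the code for that handle. The whole-word match keeps longer names that merely start with the handle out of the results.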
The earlier poster's idea about putting breakpoints in and/or stepping through the code was right; this is also a very useful practice. It is much easier to follow the flow if you can step through it, especially with C++, where inheritance can often leave one baffled as to which code will actually run.
Following main (or your language equivalent) and then drilling down is sometimes useful, but it is often easier to find the bottom and work your way back out.
Watchpoints can be really useful. Find a variable with a value that interests you, and put a watchpoint on it so that it will break when that memory is accessed. A great way to see which routines are involved.
If all else fails, pepper the code with print (or logging) statements and see what shows up. Try to narrow down what you are looking for.
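In a language with decorators you can do the peppering without editing function bodies. A minimal Python sketch, where `traced` is a name I made up and `print` is just the default sink (swap in a logger's method if you have one):

```python
import functools


def traced(log=print):
    """Build a decorator that reports each call and its return value."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log(f"-> {func.__name__} args={args} kwargs={kwargs}")
            result = func(*args, **kwargs)
            log(f"<- {func.__name__} returned {result!r}")
            return result

        return wrapper

    return decorate


@traced()
def add(a, b):
    """Toy function standing in for a routine you want to watch."""
    return a + b
```

Calling `add(2, 3)` prints one line on entry and one on exit. Put it on the handful of functions you suspect, run the scenario, and narrow down from what shows up. Note that an exception skips the exit line, which is itself a clue.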
Another useful technique is to comment out a section of code and see where the compile breaks to find dependencies.
As you figure stuff out, add comments. Perhaps also keep a file of notes when you find stuff or figure out how things work.
As long as you are focused on solving a particular problem, the code base isn't so unreasonable, because you don't care about most of it. As you knock down each problem, you learn a little more about the structure of the code.
Remember, programming is the art of breaking problems down into smaller problems until they disappear.
We actually have the developers tag the build after checkin of any completed defects.
Granted, we don't have a dedicated build person, but if we did they wouldn't have checkin rights to the codebase.
For a site about things like basic rights, Slashdot users sure do like to censor "dissent".
Heh, I'm a developer, the cvs "gatekeeper", and the primary builder in a shop of 20 odd programmers.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
PS: Developers also use change tags but I was talking about the build tag.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Why would you want to archive stuff that can be reproduced by a build?
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Sorry, but all the people giggling saying 40k lines of code is "small" are just hothead braggers. Aside from maybe the final graduation project or program for your undergraduate thesis, how many of you with CS or CIS 4-year degrees ever had to write even close to 40k while in college? None of ya. Only to then graduate, work for a while, and then act like anything under a million lines is small? Whatever.
If your particular language is supported, an IDE can provide very easy navigation of the code. You can branch through the code.
And doxygen can generate call trees and stuff which can help you.
Of course, the nature of your target language should support these (i.e. statically and strongly typed languages).
So your opinion is that my suggesting two books to read, followed by notes on how to find the extensive academic library on this subject, added nothing worthy to the discussion? Interesting.
I corrected one point in your otherwise useful commentary, admittedly in a somewhat snarky fashion--that was meant to be a bit of a joke by the way, which you didn't take well. The cherry picking out of context started when you decided to pick on one word I used, rather than considering that perhaps my quick spoof suggesting literature in this area was just alluding to a larger issue in how you described it. You don't quite seem to have gotten that still; "less coverage than it deserves" is just not a defensible position, given that there are in fact two major books and thousands of research papers on this very specific topic.
One of my first inheritances at work was about 40k lines of code. One big ass file. Assembly, so I had no chance with Doxygen. Lots of macros to build totally different things depending on definitions and phase of the moon. Half of the comments were in French, which is erm, French to me. And each and every bit of RAM in the target was in use. "You only need to add this little feature." And be quick, because the system is already sold. And it is overdue, too, because someone in sales forgot to place a development job for the "small" change.
And now tell me that 40k lines is small and easy...
Two links to the same book doesn't equal two actual book links.
Perhaps rather than using the word "few" I should have calculated whatever small percentage of books are available on the topic.
But then again, I have a life.
In my experience, the client (software, hardware and wetware) must be considered part of the repro case until demonstrated otherwise. I don't know how many bugs we've tracked down to interesting browser behaviors when certain windows accessibility features are turned on.
I will admit that as I do web and infrastructure development, I probably have a leg up on those doing traditional software deployment.
Realities just a bunch of bits.
There are many things I'd do and some are dependent on the language as some things make more sense in some languages and less in others.
First thing I'd do is get all the existing documentation I can find including the end user documentation of how to use the software.
I'd next try to break the software down by modules, subroutines, functions, library routines, etc. to get an idea of what does what. I'd also try to determine variable usage, such as local vs global variables and where things are defined.
If the above is not already documented I'd work on creating the documentation so I don't have to refigure things out each time I dig into the code for something.
The code style of the previous people who worked with the code can be very important. Some languages are easier to write obscure code in than others. If the code is NOT documented or the documentation is obsolete I'd start working on the inline documentation. Anyplace that the code is very obscure or poorly written I might look into rewriting so the code is easier to document and easier to read.
Don't trust any of the documentation until you've made sure it is up to date.
At one of my jobs the package I was hired to maintain, support, and enhance had been modified on a per-customer basis, so some variables had different meanings in different versions. Some features were implemented differently in different systems to meet different customers' differing and conflicting needs. In some cases the mainline module code would look the same but the differences would be hidden in the subroutines. This was made even more complicated by being a multiuser application that did its own file locking. The original application had been single user, so there was more than one method of doing file locking, some of which was based on which files were in which 'partition'. The system only allowed locking entire 'partitions' at one time. As customers grew to need multiple disks with multiple partitions, the multiuser locking would erratically fail, corrupt data, deadlock, etc.
Look for the tools people mentioned that can help you more easily figure out how things work. There were no tools for the system I worked on, so I had to create my own (proprietary non-standard OS and interpreted language). My boss complained about some of the time I spent working on the tools until he saw how they were saving time and making it easier to make changes.
Don't be afraid to look for tools to make your life easier. Don't be afraid to write your own if there is a good reason to do so.
The system I worked with was about 200 programs / customer with about 200 subroutines (sometimes unique for a customer) in each system.
I don't really have a good answer to the problem. I just graduated college 4 months ago, got hired to work on a code base that is around a million lines of code, and it's not easy. If you're lucky the code maintains a certain amount of consistency and you find a lot of similarity in various objects/modules. I found that spending some one-on-one time with my debugger (gdb) has really helped me to get a handle on the structure (request/response socket classes, model/view controllers, cache dbs, etc). Patience is your best tool.
http://sourcemaking.com/antipatterns/spaghetti-code
True, it does have awk, sed and a few other handy utilities but it doesn't include other things you might like, such as python.
You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
A good measure of a programmer's competence is based on measuring two temporal differences.
First, measure the time it takes from when he starts the job until he makes the comment "I really think we need to rewrite this".
Second, measure the time it takes from the first point until he realizes that it's better to maintain what you have since it's too big of a job and even if it were "done properly", deadlines would screw up the new codebase as well.
meanwhile the real build engineers have to deal with serious shit.
Real Engineers use conditional compilation so they don't have to recompile every single stinking row of code every night.
Real codebases actually have this dependency information written out so that one can do incremental builds. The Windows codebase however does not have such information declared. Any change could potentially affect anything else in the build.
Again, I already stated: the Windows codebase and build process is not a "product" and thus does not receive the attention that it should get for shine and polish.
We had one guy working on building an accurate dependency graph when I left... he was trapping the syscalls to report which files each compile was using, and dumping it into a large database (we're talking over 4GiB).
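To make the idea concrete, here is a rough sketch of what such a dependency database buys you once you have it. This is not how the actual Windows tooling worked; the trace format, the file names, and both functions are invented for illustration. It assumes the trace lists every file each compile read, including headers pulled in indirectly, so a direct lookup is enough.

```python
from collections import defaultdict

def build_reverse_deps(trace):
    """Map each input file to the set of targets whose compile read it."""
    used_by = defaultdict(set)
    for target, files_read in trace.items():
        for path in files_read:
            used_by[path].add(target)
    return used_by

def targets_to_rebuild(changed_files, used_by):
    """Return the sorted targets that read at least one changed file."""
    affected = set()
    for path in changed_files:
        affected |= used_by.get(path, set())
    return sorted(affected)

# Hypothetical trace: target -> files the compiler opened while building it.
TRACE = {
    "kernel.obj": ["kernel.c", "types.h", "config.h"],
    "shell.obj": ["shell.c", "types.h"],
    "net.obj": ["net.c", "config.h"],
}

if __name__ == "__main__":
    used_by = build_reverse_deps(TRACE)
    print(targets_to_rebuild(["config.h"], used_by))
```

With that table, a change to one header means rebuilding only the targets that actually read it, not the whole tree. Generated files that feed later build steps would need a transitive walk, which this sketch leaves out.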
Again... spaghetti code is enormously difficult to maintain, and when that spaghetti code is Windows, one cannot simply dump months or years of work into sidetracking to make things work the way they're supposed to.
I mean, that's why Windows Vista was so late. x86-64 came out and Microsoft basically said, "porting WinXP to x86-64 will cost us just as much money as porting Server 2003 cost... let's just dump the work we had already, and start from Server 2003".
I totally agree with you, and as I already stated the Windows codebase is a horrible mess... but we had to make it work... that's what they paid us for.
WARNING! This girl exceeds the MAXIMUM SAFE standards established by the FDA for BRATTINESS
Why would you want to archive stuff that can be reproduced by a build?
The things like the publics need to be available to the other build machines in the distributed system. The easiest way to do that was to check it in, and then push the data back out.
Recall, we're talking about just maintaining code... I didn't write it, and so I don't really know how everything that was going on worked. I had a good overview of what was going on, and some deep introspection into some very limited areas (namely those that broke).
Honestly, there was a ton of retarded shit that went on in the code base, and I would have done it entirely differently... but I wasn't the original designer, I was just a maintainer.
That's certainly a decent option, but obviously for large codebases with many different combinations of actions that can be performed it may become unwieldy.
Personally, I'd argue one of the best things someone can do to help themselves in this situation is to learn design patterns, and learn to recognise them.
Even if people don't specifically follow design patterns, they often do so unintentionally, because this is really the beauty of design patterns: they are common solutions to common problems. If you can start to recognise design patterns, then you find you are no longer looking at lines and lines of code, but you are looking at the bigger picture, beginning to see what sections of code do in general, and can then get to grips with the role of these more abstract components in the larger system and understand how it works.
You will still have to figure out how the algorithms in each component work, but you should at least be able to understand how those components fit in the bigger picture and their effect on the system as a whole.
"The things like the publics need to be available to the other build machines in the distributed system."
;)
Ahhh, I missed the distributed part. I've worked on some very large systems that took hours to build but I've never actually come across a distributed build during my 20yrs in the industry. The system I'm looking after at the moment builds win32, x64 and ia64 all on the same box, it uses a single python script with a cvs tag as a parameter to do the lot. The *nix builds are also kicked off from a single makefile/tag but run on separate boxes for the half a dozen flavours we produce.
"Honestly, there was a ton of retarded shit that went on in the code base, and I would have done it entirely different"
A wise man once told me that source code is like shit, everybody else's stinks.
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
And once you're done with that, create UML sequence diagrams of various use cases. Make sure to use a debugger to create the sequence diagrams. After you're done, you'll know parts of the code better than the original authors.
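If the code happens to be Python, part of this can be automated instead of stepping by hand. The sketch below is one possible approach, not a standard tool: it uses `sys.setprofile` to record which function called which while one use case runs, then prints the calls as PlantUML-style sequence lines. `record_calls`, `to_plantuml`, and the three sample functions are all hypothetical names.

```python
import sys

def record_calls(func, *args, **kwargs):
    """Run func and return (caller, callee) pairs for Python-level calls."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires for Python functions; C builtins arrive as "c_call".
        if event == "call" and frame.f_back is not None:
            calls.append((frame.f_back.f_code.co_name, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return calls

def to_plantuml(calls):
    """Render the recorded calls as PlantUML-style sequence diagram text."""
    lines = ["@startuml"]
    lines += [f"{caller} -> {callee}" for caller, callee in calls]
    lines.append("@enduml")
    return "\n".join(lines)

# Hypothetical use case standing in for a path through the legacy code.
def validate(order):
    return bool(order)

def save(order):
    return len(order)

def handle_request(order):
    if validate(order):
        save(order)

if __name__ == "__main__":
    print(to_plantuml(record_calls(handle_request, ["item"])))
```

This records function names only, so two functions with the same name in different modules would blur together. For real code you would probably want to include the module or class as well, and filter out library calls.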
Have fun and good luck!!
I worked at Microsoft and several other large corporations.
Let me tell you: Anything less than 1 million lines of code is small.
You just need the right tools in order to handle large code bases.
First things to do:
1. Use source code version control (CVS, Svn, Git, Hg). Perform code reviews before check-in.
2. Use automatic compilation (make, ant, maven).
3. Use a build machine that pulls the source code from the Svn machine and builds it automatically through ant or maven, sending email in case of failure, to detect compilation problems early.
4. Use an issue tracker to remember things to do and to keep track of time.
5. Use automatic unit tests (Junit and the like). Unit test every single method for all border cases. A test failure is a build failure and should be looked at immediately.
6. Refactor mercilessly once everything has been unit tested. Avoid repeated code like the plague. Repeated code makes maintenance difficult, if not impossible.
7. Use AOP or proxies for logging and security.
8. Make sure no package has more than 10 classes, no class has more than 10 methods, no method has more than 10 lines. Refactor, refactor, refactor.
9. Make sure no class has more than 3 instance variables, no method has more than 3 parameters and no method has a loop AND a conditional sentence.
10. Make sure everything is specified only once (the DRY principle).
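Limits like the ones in items 8 and 9 are easy to check mechanically. The list above is clearly talking about Java tooling, so take this as a sketch of the idea only: it assumes Python source and uses the standard `ast` module (Python 3.8 or later for `end_lineno`). `check_limits`, the two constants, and the sample source are made up for illustration.

```python
import ast

MAX_LINES = 10   # "no method has more than 10 lines"
MAX_PARAMS = 3   # "no method has more than 3 parameters"

def check_limits(source):
    """Return (function name, problem) pairs for functions over the limits."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length counts the "def" line through the last line of the body.
            length = node.end_lineno - node.lineno + 1
            args = node.args
            params = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
            if length > MAX_LINES:
                problems.append((node.name, f"{length} lines"))
            if params > MAX_PARAMS:
                problems.append((node.name, f"{params} parameters"))
    return problems

# Hypothetical source to check.
SAMPLE = '''
def ok(a, b):
    return a + b

def too_many(a, b, c, d):
    return a + b + c + d
'''

if __name__ == "__main__":
    for name, problem in check_limits(SAMPLE):
        print(f"{name}: {problem}")
```

Hooked into the build from item 3, a check like this turns the rules into something that fails loudly instead of something people are supposed to remember. Note that it counts `self` as a parameter for methods, which you may or may not want.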
This is a problem I have encountered several times in the past, inheriting reasonably large, poorly documented code bases. It can be an interesting personal challenge, deciphering someone else's code, but not when you are working to a timescale.
I became so frustrated that I decided it was time to try and do something about it.
As a result, we (myself and a couple of other developers) have developed a new software tool which aims to cut through legacy code, to visualise it in an abstract way, and to allow you to build a picture of what it's doing quickly and efficiently.
In simple terms our tool (named 'Code Rocket') is a detailed design documentation tool - kind of like doxygen, but taking documentation a step further.
We use it to prevent the code from becoming a legacy nightmare in the first place (by ensuring it is structured and documented to a high standard but with limited overheads for software developers during development) and to reverse engineer the meaning of any existing legacy code to guide us through an understanding of it. There are many other side benefits as it turns out relating to project management, review, communication, but the main thing is that I now feel a little more comfortable when presented with a batch of legacy code to investigate. I also agree with the recommendations of building in unit tests.
If anyone is interested in checking out our tool, you'll find it at the following web site: http://www.rapidqualitysystems.com/
"The things like the publics need to be available to the other build machines in the distributed system."
Ahhh, I missed the distributed part. I've worked on some very large systems that took hours to build but I've never actually come across a distributed build during my 20yrs in the industry. The system I'm looking after at the moment builds win32, x64 and ia64 all on the same box, it uses a single python script with a cvs tag as a parameter to do the lot. The *nix builds are also kicked off from a single makefile/tag but run on separate boxes for the half a dozen flavours we produce.
Yeah, Windows Server 2003 takes over 5,000 tasks to get everything done for x86, ia64, and x86-64, just for the English, Japanese and German localizations only.
This was across, say... 12-ish machines, I believe...
"Honestly, there was a ton of retarded shit that went on in the code base, and I would have done it entirely different"
A wise man once told me that source code is like shit, everybody else's stinks. ;)
Oh, I wholly agree. I look back at my crap from earlier days and I go "holy crap, did I write this?" The main project that I've been working on the most for getting close to 10 years now has been rewritten at least like 3 times... I'm at major version number 3, and no one else even uses it!
But each time, I refine the techniques, and the models, until now I would say it's pretty nice.
1 - Convince your customer that the platform the project is based on is deprecated, and they need a complete rewrite in a fancy completely new technology, so they can have all the functionality previous programmers denied them, and more. For example, you could migrate from Cobol to Java, from Java to C#, from C# to Go...
2 - Be sure to do it with fancy shiny graphics, so the customer can see there are advantages to the new stinky pile of crap with half of the previous functionality you have sold them for a lot of money.
3 - When they complain about bugs and lack of functionality, sell them a maintenance contract. They will sign it, or lose the huge pile of money they have already wasted on the rewrite.
4 - Profit!
I am currently working with a mission-critical codebase, which is written in PHP and has absolutely no cohesive design to it. ... There are business rules just everywhere and API requests everywhere and all kinds of calls that overwrite static variables. ... If you inherit something like this, and it is mission critical, then you need to take as long as it takes to get it right. ... Don't remove seemingly unnecessary variables, and don't reduce seemingly redundant database calls. ...
What a lame and sorry state of affairs. It makes me wonder if you work for Honda or Toyota: "Sorry sir, we knew the pedal wasn't working right, we knew you could die, but we couldn't afford to recognize we screwed it up".
Nope, that, sir, is life at a startup. You have to deal with crap like that, but as a trade-off you get to work in a fairly unregulated environment and with cool new tech. You just gotta quit whining, get to work, and hopefully enjoy the challenge.
blah blah blah
You need to grab a copy of "Working Effectively with Legacy Code" by Michael Feathers and read it carefully. That will help.
And yes, 40kloc is not a lot at all (unless these are huge Perl regular expressions *evil grin*).
I was unemployed for the first quarter of 2002 and found some by-the-hour contract work maintaining an old Win16 application. Hideous tangle of C code making up a very vertical application.
I wound up using SciTools' Understand to figure out unwieldy code bases. Honestly I never paid for it; as I said, it was a short-term, per-hour contractor job and they weren't paying for tools, so I used the demo until the demo period ran out. (And this wasn't a $200 an hour kind of contract.)
I've also had Indian consulting firms, as part of their claim that "they can analyze and understand our code base" hand me a report that I'm pretty sure was the output of that product.
In any case, something like that is a good starting point.
I guess Visual Studio now has some of that sort of thing built in, but a proper just-for-that tool may suit you better depending on language and style.
http://www.scitools.com/
The preferred solution is to not have a problem.