When Making a Comprehensive Retrofit of your Code...
chizor asks: "My
programming team is considering making some sweeping changes to our
code base (150+ perl CGIs,
over a meg of code) in the interest of consistency and reducing
redundancy. We're going to have to make some hard decisions about
code style. What suggestions might readers have about tackling a
large-scale retrofit?" Once the
decision has been made for a sweeping rewrite of a project,
what can you do to make sure things go smoothly and you don't run
into any development snags...especially as things progress in the
development cycle?
I'm sure a bunch of code nazis will disagree with me (please note the clever way I attempt to pre-emptively undermine their arguments by labeling them as 'nazis') but sometimes massive engineering re-writes are not necessary.
Your tangled mass of spaghetti code paths are probably full of almost incomprehensible little design decisions and seemingly out of place declarations and functions, but most of those were probably added as specific fixes for bugs encountered under real-world use.
Most companies that decide to massively re-engineer their code (do a big rewrite) usually end up regretting it because it forces them to re-fix the problems that caused the original strange looking code in the first place.
Does your CGI nest work? If so, maybe you should leave it alone. If you are fixing specific problems, then go ahead, but if this is a generalized attempt to fix the 'not invented here' syndrome that plagues engineers (who will almost universally agree that it is easier to write code then it is to read it), perhaps you should reconsider.
Don't stop maintaining the old code code until the new code is on solid ground. No matter how sure you are that you can do it, the new code might never come through.
The Gardener
--
1) Know what you are doing. 2) Know what the code does.
Both are expedited by good documentation. This is so important, it deserves to be written thusly:
DOCUMENTATION!
Write everything down. If the code is not commented, figure out what it does and write it down. When you add a line or a module, write down what it is supposed to do. Declarations? Write those down too. Document everything so that you can figure everything out, both now and down the road when you decide to fix something else.
This is the voice of experience. I have had to reverse-engineer my own code 6 months after I wrote it because I failed to document anything. Learn from my mistake.
I have a strong belief in the Second Amendment.
A ton of people will tell you a ton of things, never having retrofited anything.
- Do not undervalue the investment you have learning your existing coding language. New challenges await you if you jump on a new language like java. Make the jump if you are excited about learning about the new language.
- If you use your existing coding language you will literally fly through the retrofit. Do it piece by piece. Make all those changes first, then test app, then make next set of changes then test. The simple fact is, most wasted time is spent on bugs not working on performance, and you've already knocked down a lot of bugs, don't let them pop back up by blowing everything up. There are books on this.
- Sometimes blowing everything up is worth it. Do it right this time. Realize it won't be as perfect as you might think it will be.
- Remember there are countless open source and shareware products that tried to create TNG with a total rewrite, got nowhere, and ended up improving their existing product. Remember the lesson, bite off what you can chew.
- Spend a week poking around researching possibilities. I do this all the time, bookmark things I think are important. Then for the next project you've got all the little things you might forget at your fingertips. Optimzations/Tools/Paradigms. Think you know it all? You'd be suprised at what is out there and what you missed. And what you spent a month in house re-inventing. This one's important.
- Use open source software. Nothing beats free. Nothing is more fun. Java's ugly standardization history makes me puke... the BS Sun has pulled with Java is staggering. That the Java Lobby swallows it and loves it even more so. This is irrelevant to your question, and not fair to the Lobby, but I like to give them a hard time.
- Colorary to Java. You need less abstract design then you think. Endless object hierarchies will weigh you and your app down... Their are books on this too.
- You need more documentation then you think. Ever found code someone ELSE wrote too EASY to follow. I don't think so. Especially if you are using perl and someone is enjoying the line noise capabalities perl allows. Perl has 20 ways to do EVERYTHING, you may not know the latest or twistiest. Document as you WRITE the code. Do not leave at the end of the day without catching up the docs. A week of documenting is the worst form of hell, avoided with a minutes worth of clarification each time you write a function/class.
- Hardware is cheap.
Anyways, have fun... and good luck. Be interested to read what others have to say.
Alright, maybe I posted a little too soon, but shouldn't "flamebait" be attracting flaming responses? I don't see any...
Anyway, if I'd spent a little more time thinking about the advice side of it, taking a look at appropriate programming methodologies (like Extreme Programming advocated in another thread here) would be one piece I'd advocate. Given the size of the code (1 MB = about 20-30,000 lines?) there's no need for major heavy-weight processes here. More important I'd say is sitting down and figuring out in the appropriate level of detail what exactly your system is doing right now - you can do this using UML diagrams which seems to be becoming a standard, though the main use we've found is to try to get an overall view of things which we then throw out when we get into the details again.
The other thing to do along these lines is look for your use of standard patterns within your code - the Design Patterns book is extremely helpful if you're moving to an object-oriented framework at all; following well-known patterns and indicating clearly what you are doing can make your code much easier for others to follow.
Energy: time to change the picture.
Your tangled mass of spaghetti code paths are probably full of almost incomprehensible little design decisions and seemingly out of place declarations and functions, but most of those were probably added as specific fixes for bugs encountered under real-world use.
Yes, and if they're cryptic and uncommented, they are worthless. Eventually one of these incomprehensible, magical fixes will stop working. Perhaps the bug it works around is fixed. Perhaps how the function is being used changes to previously unexpected behavior. Some poor engineer will look at the little big of magic, scratch his head, and be forced make a blind decision about how to fix it. Perhaps he can change the code while leaving the bit of magic in working, but he can't be certain, since he doens't understand it. If the collection of cryptic tweaks becomes dense enough, any attempt to fix a bug or add a feature becomes highly risky.
On a related note, don't let this happen to you. If you add one of these strange little fixes, for the sake of the programmer that follows you, document them. Just a little "Need to toggle the Foo because Qux 1.4 is correctly fails to do so" will bring tears of joy to the eyes of future programmers.
Large overhauls are usually mistakes. Details in the previous code are lost. If the overhaul takes non-trivial time, people become frustrated that two weeks ago they had a working (if problematic) system and today most of the system doesn't work.
Instead, make small incremental changes. Pick something lots of code is replicating and attempt to unify it into a shared code base. Spend some time documenting key parts of the code. Pick a particularlly hairy class or function and untangle some of the worst bits. These sorts of changes can reveal minor bugs, build up to significant improvements, and leave you satisfied at the end of the day that you improved things.
If a signficant overhaul is necessary, try to overhaul portions while maintaining the existing bits.
1) Identify common functionality.
2) Encapsulate in libraries
3) Be sure to extract enough generality that you don't have special case functions
4) Don't extract so much generality that functional interfaces become unwieldy.
5) Write everything in the same language.
6) Find any complex pieces or algorithms. If they can be simplified or re-written, do it. If not, save it so you don't need to debug it again.
7) Throw everything else away.
The first thing you definitly have to do is Setting up a test suit for regressoin testing.
For those not familair with the term 'regression test':
Program a set of so called "test drivers", programms calling your code: routines/scripts.
Define test data, either in a DB or in flat files, used by those driver programs.
The test programs and test data needs to work with the old code, of course.
As the new code should behave similar you only need to adjust PATH or script names to let the test programs work with the new code.
Plan your project by defining which test cases(test program plus test data) should work at a planned milestone with the changed code.
After making changes rerun all tests.
Well, there is a lot more you could do, but that above is minimum (basic software engineering, sorry no art involved here).
Regards,
angel'o'sphere
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Most of the computer industry now calls this Refactoring.
I would HIGHLY recommend the book "Refactoring" by Martin Fowler.
There is a number of things you should do here.
1. Document your plan and come up with an official PROPOSAL document. Allow others to comment on this document and incorporate fix all relevant issues.
I started using this under the Apache Jetspeed project and now a lot of other Apache projects are accepting this practice.
It really allows the community to become involved in your changes and encourages constructive feedback and involvement.
2. Break this into phases. You should NOT attempt to do this all at once. Each phase should be isolated and should consist of one unit of work.
Each phase should be branched off of CVS, worked on, stabilized, brought back into HEAD and tagged. You should then RUN this code in a semi-deployment role for a period of time to correct all issues which WILL arise with the updated code.
After this you can then start your next phase.
3. UNIT TESTS! If management (assuming you have management) has approved the time for this type of refactor then you need to take the time and write Unit Tests for each major component.
It is important that Unit Testing can sometimes be just as hard, if not harder, than the actual development itself.
In some situations you can avoid Unit Testing, some here are going to call me crazy for saying this but it is true. In a lot of high level applications, which are NOT used as libraries by other applications, you can bypass Unit Testing in order to increase development time. This is a dangerous practice but it is often outweighed by the extra functionality you will end up with in your product.
Anyway. Good luck!
Kevin
I second all that has been said about making sure that you really need to do this and that it is worth the time and risk. One sign that you may need to do so is an excessive reopened bug rate, where fixing one bug often creates another bug due to side effects and component interactions. If you decide that it is, then the three keys to success will be modularity, incremental rollout, and unit tests.
Modularity is probably what you're already thinking about. Go over the old code base, in a code review, and find where the same thing is done over and over either with copy-and-paste code -- the bane of crap engineers -- or with different code that serves the same ends. Look for repeated sequences in particular. Create a new library that encapsulates those pieces of code.
Incremental rollout is vital. Only replace small parts of your system at a time, doing complete retests frequently. Don't write a new encapsulated routine and then roll it out to each of the three dozen places in which it appears in the whole code base. Write the whole function library, with unit tests, and then start applying it to separable modules one by one, retesting as you go. Otherwise I guarantee the whole thing will fall apart and you won't be able to tell why. Ideally, you might set a threshold on the rate of replacement of old modules and work primarily on creating new modules with the abstracted logic.
Unit tests are crucial because, as noted, the messiness of your old code probably conceals a lot of necessary logic. We had this great phenomenon on Apple's Copland where people who had never used the old OS managers were rewriting them in C or C++ from the assembly source. When they saw something in the assembly they didn't understand, they just ignored it. Guess what -- the new managers didn't have any backwards compatibility. The only answer to this is to have a thorough unit test for any module that you replace, against which you can test the new version. This also confers other quality benefits, but during a rewrite it's critical.
Finally, once you have replaced a significant number of your modules, you will find that new levels of abstraction appear. The average size of each function or method will have shrunk considerably, and now it becomes possible to see new repeated code sequences that were not visible due to the old cruft. Move these into your new library modules and start using them in continuing replacement work. In addition, start going back -- slowly and incrementally -- through the already converted modules and replacing the repeated sequences with calls to the new abstractions.
Finally, figure out how you got into this mess in the first place. The worst programmer habit I know of is copy-and-paste coding instead of using subroutines. You can tell people not to do it, but some always will. Those people should be bid farewell -- you can't afford their overhead. Other common problems include lack of planning and review, a code first and think later mentality. Start moving your organization up the levels of the CMM and you may find that you wind up with fewer modules that need replacement.
Hope this helps.
Tim
First of all, I think its important to realize that you have a medium-sized website and not a big software project. Therefore, some of the above comments recommending refactoring, UML, and eXtreme programming may be a bit overkill.
Web programming != software development! Its usually done at a much faster pace. Even if an object-oriented approach is taken, you are still probably talking about simple function libraries rather than complex C++ or Java classes. Again, overkill.
150 files is still a small enough project to be managed by one or two decent coders. Actually, I just looked at the amount of stuff I've written over the years for my online bookstore and its more like 500 files and over 4 megs of code. I don't feel like its too much of a job to manage this codebase by myself.
So, here are my recommendations.
You probably have gotten better at programming since the time you started your project. Take a few of the most recent CGIs you have written and compare them to the first ones you wrote. You just might notice a glaring difference in the quality. Also, the first pages you wrote are likely to be among the most important in your project, yet they are also likely the worst quality-wise.
Regardless of what language you program in, I think its important that you can tell whats going on in the program by reading the comments. If a manager can understand what a program does by reading the English bit, there's a good chance other programmers will be able to jump in and help as well. One specific rule I also follow: if you do regexes, say IN ENGLISH what those regexes do. I say this because regexes are one of the hardest things to read.
Look for any code that can be "factored out" of your scripts and put those into function libraries. Then include those in your program. The only problem with this occurs when you have huge function libraries that slow down your scripts when you include them. In that case you would logically separate your functions into different files. I have included very common functions in different include files, so I can make the actual code compiled or interpreted as small as possible.
Consider using a flowcharting tool as an aid to programming and/or documenting your code.
Standardize how you name variables and functions, write comments, identation, and spacing.
Be sure and include the date you write your scripts in the comments, in case the filesystem wipes this out.
I'm sure theres other things I've left out, but following the above guidelines have helped me do exactly what you are trying to do: manage a growing codebase. But don't forget, this is web programming, not rocket science, and some of the above suggestions may be more trouble than they are worth. Keep it simple.
No, Thursday's out. How about never - is never good for you?
Refactoring is not re-writing. Refactoring is a methodical approach to moving, redesigning, and making the code cleaner. It's not throwing away and starting from scratch. The two words are wholly semantically different.
In his book, Refactoring, Martin Fowler talks about how code "smells". He identifies a whole slew of reasons why code can "smell" bad, such as having a method that's too long, having a method on the wrong object, incorrect use of polymorphism, etc. He then outlines "Refactoring Patterns" which can be applied to certain "smells" to make the code more managable.
The last project I managed was a 3 month re-write of over 75,000 lines of code for an auction component that had gotten way out of control. The original component integrated with a very very poorly written (yet still very expensive) auction engine, and our task was to take it out.
Because there was no time allotted to proper requirements gathering, the powers that be decided that we could use the existing system as a requirements base, and just make the new system "act" like the old one.
So I was thrown two things: A staff that was inexperienced in design, and a lack of requirements. We were re-writing from scratch -- throwing all of the code away and starting again. I'm one of the only ones on my team that actually knows what refactoring means, let alone how to apply it.
So we built the new system based on a solid design. We even did the design in UML. But that only goes so far. Given a properly designed system, there is still another lower step -- how to actually implement the code. Invariably, the design doesn't take into account helper methods that are necessary, other objects, etc. Because we were on a tight schedule, I left it up to each developer to design these new tidbits.
If they had known what they were doing, everything would have been fine, but there are several places where the code is absolutely abysmal. It's not that it will perform poorly or is totally unmanagable, but in some places it's close.
Granted, the business rules for the system are some of the most complex I've ever seen, but if the developers had known more about refactoring, they might have made better choices.
There's a whole notion of Quality here that should be discussed as well. Most good has a good enough quality, and though it looks ugly, I'd throw it away. There's an interesting interview by Joel (a former microsoft programmer) which tackles these problems of Quality head on. I also reccommend reading Zen and the Art of Motorcycle Maintenence: An Inqiry into Values for a pretty good discussion of the whole notion of quality.
I'd say just keep the codebase unless you are totally switching languages (or in our case taking out a 3rd party integration). Otherwise if you re-write you'll wish you hadn't. The new codebase will end up being in the same state as the old codebase, it'll just take a little bit more time to get there.
I'd also reccommend just doing a performance test of the site, and find out which modules really need to be re-written. The 80/20 rule will invariably apply -- 80% of the execution time will be spent in 20% of the code -- so at a minimum just re-write the 20% of the code that's executed on the most frequent paths. Leave the rest alone. Re-writing it will just be a waste of time, and you have to fix all the old bugs that were re-introduced during the re-write.