Slashdot Mirror


Ask Slashdot: What Tools To Clean Up a Large C/C++ Project?

An anonymous reader writes: "I find myself in the uncomfortable position of having to clean up a relatively large C/C++ project. We are talking ~200 files, 11MB of source code, 220K lines of code. A superficial glance shows that there are a lot of functions that seem to be doing the same things, a lot of 'unused' stuff, and a lot of inconsistency between what is declared in .h files and what is implemented in the corresponding .cpp files. Are there any tools that will help me catalog this mess and make it easier for me to locate/erase unused things, clean up .h files, and find functions with similar names?"

9 of 233 comments

  1. If you don't know what it does, don't touch it. by BlueKitties · · Score: 5, Interesting

    Seriously, you never know when some previous programmer made a "duplicate" function to do something bizarre, like force a particular initialization order of static class member variables between translation units. Sometimes deleting pointless code can do... terrible things. Just be careful, test your changes, etc.
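
    As a rough illustration of the kind of "pointless" function meant here, the sketch below uses the construct-on-first-use idiom, where an accessor that looks like a redundant wrapper actually pins down initialization order. All names (Config, Logger, init_log) are hypothetical and not taken from the project in question.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names throughout; none come from the project discussed above.

// Records the order in which the objects below are constructed.
std::vector<std::string>& init_log() {
    static std::vector<std::string> log;  // built on first call
    return log;
}

struct Config {
    Config() { init_log().push_back("Config"); }
    int verbosity = 1;
};

// Looks like a pointless wrapper, but calling it guarantees that Config
// exists before anything that depends on it.
Config& config() {
    static Config instance;
    return instance;
}

struct Logger {
    Logger() : verbosity(config().verbosity) { init_log().push_back("Logger"); }
    int verbosity;
};

Logger& logger() {
    static Logger instance;
    return instance;
}

// Returns true when Config was constructed before Logger.
bool config_initialized_first() {
    logger();  // touching Logger first still forces Config to be built first
    const std::vector<std::string>& log = init_log();
    return log.size() == 2 && log[0] == "Config" && log[1] == "Logger";
}

// Small usage check.
void demo_init_order() {
    assert(config_initialized_first());
    assert(logger().verbosity == 1);
}
```

    Replacing config() with a plain global would compile fine, but the order guarantee would then depend on which translation unit happens to be initialized first.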

    --
    "Sorrow is better than laughter, for by sadness of face the heart is made glad." [Ecclesiastes 7:3]
  2. Unit tests by Midnight+Thunder · · Score: 4, Interesting

    While I dislike writing unit tests, I have to admit they are useful in protecting your butt when something breaks, since the test should catch it first. Of course you need to decide whether in a particular scenario they add value or just make your manager happy.

    In a case like yours, you can make code modifications and hope nothing breaks, or build unit tests and ensure that you don't break any of them when refactoring. Initially, rather than just ripping out the seemingly duplicate methods, rip out or tweak their implementations and have them call what seems like the right method to provide the common functionality. If your unit tests show breakage, then you know that you missed something.
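
    A minimal sketch of that approach, assuming two hypothetical string helpers (trim and strip_whitespace) that appear to be duplicates: the suspected duplicate keeps its name and signature but forwards to the chosen implementation, and tests written against its old behavior show whether anything was missed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical example: two helpers that look like duplicates.

// The implementation chosen as the single source of truth.
std::string trim(const std::string& s) {
    std::size_t begin = 0;
    std::size_t end = s.size();
    while (begin < end && std::isspace(static_cast<unsigned char>(s[begin]))) ++begin;
    while (end > begin && std::isspace(static_cast<unsigned char>(s[end - 1]))) --end;
    return s.substr(begin, end - begin);
}

// Formerly a separate implementation; now a thin wrapper, so existing
// callers keep compiling while the tests confirm the behavior matches.
std::string strip_whitespace(const std::string& s) {
    return trim(s);
}

// Tests written against the old strip_whitespace behavior before its
// body was replaced.
void test_strip_whitespace() {
    assert(strip_whitespace("  abc  ") == "abc");
    assert(strip_whitespace("abc") == "abc");
    assert(strip_whitespace(" \t\n") == "");
    assert(strip_whitespace("a b") == "a b");
}
```

    Once the tests have passed for a while, the wrapper can be deleted and its callers switched to trim in a separate, easily reverted commit.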

    If you do things wholesale, then you are likely to break something in an unmanageable way. Oh and make sure things are version controlled ;)

    --
    Jumpstart the tartan drive.
    1. Re:Unit tests by gstoddart · · Score: 5, Interesting

      I've maintained several legacy code bases over the years.

      And I will flat out tell you that unit tests have VERY limited utility in terms of understanding a mess of code you inherited. At least, in the beginning.

      Sure, you can start with a couple of basic premises, and you can convince yourself those basic premises still work.

      But the initial grokking of your code, understanding all places where a function may be used, understanding all of the tricky bits and gotchas, trying to understand why there are 9 functions which look like they do the same thing? That takes some time and effort, and quite possibly some tools.

      Unit tests are great for starting to build up a few things, and move towards better stuff ... but in a system which has several hundred (or several thousand) functions and interactions, resulting in really large numbers of code paths ... having a few unit tests describing the stuff you understand doesn't mean all of the stuff you don't understand wasn't broken, simply because you don't know what you don't know.

      So it is important to understand your new unit tests on legacy code are, at best, a VERY incomplete view of your code. That will improve over time, but you could potentially need to write a few thousand of them to be sure you're not breaking anything in the big picture.

      "If you do things wholesale, then you are likely to break something in an unmanageable way. Oh and make sure things are version controlled ;)"

      Oh, yes .... This .. for the love of god, this.

      You should learn how to tag branches and the like in your version control so you can identify a baseline of "before I ever touched anything" and then be able to cleanly build everything which predates you, as well as building your "after refactoring this part".

      Branching/tags/whatever your version control calls it -- that doesn't take up much space, so use them often, and consistently. Let the tool do the heavy lifting of keeping track of what you've changed.

      You do NOT want to find yourself unable to build it as it existed, or identify all of the diffs between what you started with and what you have.

      --
      Lost at C:>. Found at C.
  3. Looks like a reverse engineering project by prefec2 · · Score: 4, Interesting

    Modularize the software. There are a lot of tools that analyze static dependencies in the code, which can help you identify components. You could also use a run-time analysis tool, for example Kieker, which was initially for Java but has an extension for C/C++.

  4. Answer: read slashdot for long enough by plcurechax · · Score: 5, Interesting

    See the "Working Effectively with Legacy Code" book review (2008), covering the book of that title by Michael Feathers (PDF article) on that very topic.

    There is even a summary of key points at Programmers @ StackExchange. Hundreds if not thousands of programmers' blogs address this very topic.

    You're welcome. Now get back to work.

  5. DXR, the code indexer by Grincho · · Score: 5, Interesting

    Wow, what an easy pitch. :-) At Mozilla, we've put together a tool called DXR ( https://github.com/mozilla/dxr... ). It indexes your code and lets you do text and regex searches. But if you can get your project to build under clang, you can really have some fun, with queries that find...

    * Calls of a function (great for dead code removal)
    * Uses of a type
    * Overrides of a method
    * Uses and definitions of macros
    * etc., etc., etc. There are something like 24 different structural queries you can do.

    Because all of this is informed by the internal data structures of the clang compiler, it's nigh on 100% accurate (aside from more dynamic behaviors like sticking function pointers in a table and passing them around). You can also explore a hyperlinked version of the source, bouncing from #include to #include and drilling into methods.
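
    To illustrate the function-pointer caveat, here is a small sketch with hypothetical handlers that are never called by name. A query for direct calls would plausibly report none for them, yet they are reached at run time through the table.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical handlers: nothing in this file calls them by name.
int handle_open(int arg)  { return arg * 2; }
int handle_close(int arg) { return arg * 3; }

struct Command {
    const char* name;
    int (*handler)(int);
};

// The only references to the handlers are these table entries.
const Command kCommands[] = {
    {"open", handle_open},
    {"close", handle_close},
};

// Looks up a handler at run time; returns -1 for an unknown command.
int dispatch(const char* name, int arg) {
    for (const Command& command : kCommands) {
        if (std::strcmp(command.name, name) == 0) {
            return command.handler(arg);
        }
    }
    return -1;
}

// Small usage check.
void demo_dispatch() {
    assert(dispatch("open", 1) == 2);
    assert(dispatch("close", 1) == 3);
    assert(dispatch("unknown", 1) == -1);
}
```

    When a function shows no callers, it is worth also searching for places where its address is taken before deleting it.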

    Here's how to set it up: https://dxr.readthedocs.org/en...
    Here's our production instance you can play with: https://dxr.mozilla.org/mozill...

    If you run into trouble, pop into #static on irc.mozilla.org, and we'll be happy to help you.

  6. Beware of 'cleanups' by plopez · · Score: 4, Interesting

    Anecdote from the mists of time:

    There was this C program which had been around a while and had undergone some evolution and maintenance. The decision was made to 'clean it up'. There was a data structure, an array I think, which was unused in a subroutine; let's call it subroutine A. So it was removed. On the next test runs of the application, the program suddenly started core dumping. After some agonizing debugging, the crash was discovered to come from another subroutine; let's call it subroutine B.

    Subroutine B contained an array that a loop ran past the end of. But subroutine A was loaded just prior to B and allocated memory for the unused data structure. That memory had provided enough space to absorb the out-of-bounds writes from subroutine B, but once it was removed, subroutine B began overwriting subroutine A, causing the crashes.
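
    The effect can be modeled without invoking undefined behavior by treating one flat array as "memory". This is only a sketch: the sizes, the layout, and the sentinel value are invented for illustration, and a real overrun would not be this predictable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A model of the anecdote using one flat array as "memory", so the overrun
// stays within defined behavior. Sizes and layout are invented.
constexpr std::size_t kBufferSize = 4;  // capacity of subroutine B's array
constexpr std::size_t kUnusedSize = 4;  // size of the "unused" array
constexpr std::size_t kOverrun = 2;     // elements the buggy loop writes past the end
constexpr int kSentinel = 42;           // stands in for data the program relies on

// Returns true when the important value survives B's buggy loop.
bool survives_overrun(bool keep_unused_array) {
    std::array<int, 16> memory{};

    // Layout: [B's buffer][optional unused array][important value]
    const std::size_t important =
        kBufferSize + (keep_unused_array ? kUnusedSize : 0);
    memory[important] = kSentinel;

    // The bug: the loop writes kOverrun elements past B's buffer.
    for (std::size_t i = 0; i < kBufferSize + kOverrun; ++i) {
        memory[i] = -1;
    }

    return memory[important] == kSentinel;
}

// Small usage check.
void demo_overrun() {
    assert(survives_overrun(true));    // the unused array absorbs the overrun
    assert(!survives_overrun(false));  // after the "cleanup" the data is clobbered
}
```

    In the model, the unused array acts as accidental padding, which is why removing it exposes a bug that was always there.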

    It was good that the crashes were easily reproducible, or it could have been one of those intermittent things that drive people insane. An automated tool may not catch things like that since they may not show up until run time. It is C/C++ we are talking about now, isn't it?

    --
    putting the 'B' in LGBTQ+
  7. Comment removed by account_deleted · · Score: 4, Interesting

    Comment removed based on user account deletion

  8. Could it be a threading issue like a deadlock? by Paul+Fernhout · · Score: 4, Interesting

    Debugging code that prints or logs may act to synchronize access to some data structure. Sometimes that can prevent a deadlock or illegal pointer access as a side effect:
    http://stackoverflow.com/quest...
    http://en.wikipedia.org/wiki/D...

    So yes, complex programs can act in strange ways from seemingly minor changes.

    I spent a couple years helping maintain a large complex multi-threaded app (which included message passing between the apps, for another layer of fun) which supported 24X7 operations where a minute's downtime could cost millions of dollars in some situations, and it was not easy. The code base was easily 10X to 100X the size of what the poster of the story is tasked with maintaining. Versions of the code had been in production for over fifteen years. Much of the code had been ported from C++ & Tcl to Java (although C++/Tcl systems remained), but the threading model was somewhat different between the two, and the port had not taken account of all the differences.

    It would have been nice to be able to rewrite some key parts of the system to make them more maintainable, but there was never enough time for that in a big way -- and realistically, bigger rewrites likely introduce new issues. Still, eventually we got most of the worst deadlocks and memory leaks and similar such things fixed and the system got to the point where people stopped even remembering off-hand the last time a core part of the system needed to be rebooted (previously a fairly frequent event).

    But each deadlock could involve days, weeks, or even months of study and discussion, adding log statements, writing tests, lab tests, analyzing quite a few multi-gigabyte log files (and writing tools to help with that including visualizing internal message flow), and so on. And, same as you mention, hardware and OS issues could interact with it all, making some things hard to duplicate under virtual machines for developers. One thing is that to the end user, a system that is more stable may not look that different than one that is less so -- there are no new features, so it is not obvious what is being paid for.

    Although obviously if the program you support core dumps from a bad address or stack overflow, rather than just freezing up, it is probably something else. Still, even then, a bad pointer address can sometimes come from one thread freeing a data structure when another thread is still using it. The original C++ in the above-mentioned project generally was highly reliable, but it still had some odd issues too. In one rare case, memory was freed in an unexpected way under certain conditions by other code running in the same thread but in code nested way deep with essentially recursive calls processing complex messages. I finally also traced part of that to what looked like maybe a bug in a supporting third-party library (a RogueWave data structure). Because that C++ code had been in production for years, and we were loath to change it at the risk of introducing new issues, we mostly "fixed" that issue by making changes elsewhere in the system to prevent that component from getting the pattern of data that it had trouble handling. But we would not have known exactly what to change elsewhere without a lot of analysis.
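
    For the "freed while still in use" pattern, one common way to make the failure detectable in modern C++ is a non-owning std::weak_ptr that has to be locked before use. This is not what the project described above did (it worked around the issue elsewhere); it is only a single-threaded sketch with a hypothetical Message type.

```cpp
#include <cassert>
#include <memory>

// Hypothetical message type; the real project's types are not shown above.
struct Message {
    int id;
};

// A consumer that holds a non-owning reference to a message.
struct Consumer {
    std::weak_ptr<Message> pending;

    // Returns the message id, or -1 if the message has already been freed.
    int read_id() const {
        if (std::shared_ptr<Message> msg = pending.lock()) {
            return msg->id;  // msg keeps the object alive within this scope
        }
        return -1;
    }
};

// Returns true when the consumer sees the message while it is alive and
// detects its absence after the owner releases it.
bool detects_freed_message() {
    auto owner = std::make_shared<Message>(Message{7});
    Consumer consumer{owner};
    const bool seen_while_alive = consumer.read_id() == 7;
    owner.reset();  // the owner frees the message
    const bool missing_after_free = consumer.read_id() == -1;
    return seen_while_alive && missing_after_free;
}

// Small usage check.
void demo_weak_ptr() {
    assert(detects_freed_message());
}
```

    With a raw pointer, the second read would touch freed memory; here it turns into an explicit "not available" result that the caller has to handle.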

    Sadly, just as we got it mostly working well, the new shiny thing of a mostly COTS system that did something similar came along to replace much of it (at a much bigger expense than maintaining the old, but granted with some nice new features).

    As I saw someone else comment recently about a "stable" OS, the end user generally cares more about how much work a system lets them get done, not how "stable" it is. A reboot can be acceptable, depending on the situation and the alternatives, even if not desirable. Erlang code is probably the master at that approach of rebooting code when it fails. :-)

    --
    A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.