Undocumented Open Source Code On the Rise
ruphus13 writes "According to security company Palamida, the use of open source code is growing rapidly within businesses. However, the lack of documentation and understanding of how the code works can increase the vulnerability and security risks the companies face. OStatic quotes Theresa Bui-Friday saying, 'In 2007, Palamida's Services team audited between 300M to 500M lines of code for F500 to venture-backed companies, across multiple industries. Of the code we reviewed, Palamida found that applications written within the last five years contain 50% or more open source code, by a line of code count. Of that 50% of open source code, 70% was undocumented. This is up from 30% in 2006.' How can businesses protect themselves and still draw on open source code effectively?"
The only reason why we don't see an article "Undocumented Commercial Software On the Rise" is because the public cannot see how badly documented the commercial software is.
Well it's better than closed source where you can't even audit the code...
(except nobody is accountable for open source, so you can't even sue the culprit)
The original article is an ad for a service that looks at code for you. But it's a real problem.
A basic problem with open source is that once you get beyond the top 50 or so projects, the quality is usually crap. Look at the source from a few random projects on SourceForge. There aren't that many real "community" projects, where multiple programmers are working on the same code. The long tail isn't very good.
How do you measure something like how well things are documented with a percentage? Some code simply doesn't need documentation. Other code needs plenty. Is 0% a 1:1 relationship between lines of code and lines of comments? That whole thing seems a bit strange. They could certainly back it up if they wanted to, but that'd be too much effort.
-Devin Jeanpierre
They talked about looking at 300m LOC. I'd hope 70% was "undocumented". 70% of most code is just common-everyday stuff that doesn't NEED to be documented in the sense that comments are completely wasteful. It's the "glue code" that needs to be documented, and the non-intuitive stuff, and stuff that is done for a reason that, on first glance, looks like the writer had a brain fart, but, in this special case, makes sense, or "corner case" situations.
Do *NOT* "insert comments like "for (i=0; i
This has NOTHING to do with "multi-national sites" or any of that.
This has EVERYTHING to do with clearly stating the rules and ENFORCING those rules.
The rules do not enforce themselves. Someone, somewhere has to approve the code that goes in.
The problem is that management does NOT understand code and will happily farm out the work to anyone who says that they can produce X lines for $Y. Without oversight. The less oversight, the less expensive the project is. Which means bigger bonuses for those same executives.
They talk about how much of the open-source code is undocumented. I notice that they don't bother to mention how much of the in-house code is also undocumented. My experience as a software engineer is that their in-house code's probably at least as poorly documented as the open-source stuff. And if the business finds this state of affairs acceptable for their in-house code, why's it any more of a problem for the open-source parts?
I've also found that when the business does get a consultant in who demands documentation, they usually demand something that's completely useless for the actual developers. E.g., they demand UML models for all the software. Well, that's nice and all, but most of what's in the UML you can see by glancing at the class definitions. The things a developer needs, like what the methods are supposed to do, what gotcha caused a particular way of doing it to be picked, and what assumptions the code's making about its inputs and outputs, have no place in a UML model.
How much of that "documented" code in the article was documented correctly? Good code is easy to understand by good programmers. Documenting things is just another dependency to fall out of sync. Would you rather fire up a few neurons to grok some code yer working on or spend hours pulling your hair out only to eventually figure out your documentation was wrong?
Documentation should be used sparingly and be as tightly woven into the development process as possible. The programmer should document their code when necessary, as soon as they think of it, not at some later pass. Provisions for inline documentation should be used. When a programmer modifies some code, they are more likely to also modify the documentation when it is immediately adjacent. The probability that documentation will remain in sync is the inverse of the square of that distance.
hit closer to home perhaps? A quick glance at some of those code snippets and they can be easily missed. Now place them in large applications with thousands upon thousands of lines of code and who knows how long it'll take to find them.
I tried to get into documenting software for Ubuntu. I wanted to help but I didn't really have any programming skills, I think the most complicated stuff I've ever done is scripts for mIRC and some HTML.
After really sitting down with some programs, I realized I just had no idea where to start. There was certainly more to be said than who made the program and what license it was under like many programs have in their 'help' and 'about' menu, but it really does get to be an enormous task and it's a certain amount of responsibility because the few people that will read the documentation first will take everything it says to heart.
I might try again, but I'm going to be sure I really have the time to do it and the patience to read through source code. mangu is right: even though I don't know how to program, it's not hard to figure some things out, and sometimes there are vital comments 'between the lines'.
I have noticed that more programs (included in Ubuntu) now have the information I need when I care to look for it. I generally check documentation for command line arguments and such in case --help won't tell me everything, or anything at all. At least someone's getting the job done.
"Most people, I think, don't even know what a rootkit is, so why should they care about it?"
That said, the "70%, up from 30%" numbers are absurd. There is no way that the failure rate to document use of open source code more than doubled in 2007.
What I'm listening to now on Pandora...
Undocumented code eh? Better send it back to Mexico.
We need some kind of national fence to keep these undocumented source codes out of the country!
They're taking our jobs!
Even if it was documented, are you going to trust the documentation? If you want to make sure it's doing what you think it's doing, read the damn code, that's what it's there for.
Frequently I want to have a look at somebody's source, and normally I find a comment section at the beginning of a file, and I'm happy about this. So when reading beyond the first /* I find the GPL blurb, which is great. Then upon reading a bit further I come across the */, so I'm thinking OK, I'm fine to read his code since it is GPLed. Any description of the code the programmer could have put into the initial comment is not there, probably because the code is new.
When going further into the code I would expect to find some explanation about what each function does even if it is just a construction site.
OK, so that is missing; maybe at least the names are descriptive. What do I get? Something not much better than O_oOo_0(int a,...)
This is just not what I need a GPL blurb for. Why don't you omit that too and just keep us guessing.
Don't worry nobody will touch your precious code - ever!
Je me souviens.
I've recently been trying to implement streaming video in a cross-platform system, and the main open-source libraries, ffmpeg and Live555... well, Live555 at least has unstable release packages. That's the best I can say about their project management. :/
I for one welcome our undocumented overlords with open arms.
me: "I installed this module, and it borked the kernel."
buddy: "Did you RTFM?"
me: "I can't. there isn't one."
Anybody want my mod points?
The money shot:
...applications written within the last five years contain 50% or more open source code, by a line of code count
Again, open source is not any more risky than any other kind of code. What is risky is not documenting your use of 3rd party code...here are some quick examples: OpenSSL, phpBB, xt:commerce
So what they're basically saying is that if you use OSS tools in your company, someone should probably be keeping track of them and patching them as needed.
Should this not hold for *all* software you've deployed? Few programs are immune to eventual obsolescence (including ongoing bugs and security problems), so if you think you're safe just because you're running a bought-and-paid-for solution that you've subsequently ignored, you're probably in trouble.
That being said, I wonder about this:
I get the impression that we're not getting the full story here. If their code audit showed that 50% of software X was copypasta from sourceforge, that would be something (you probably have crappy developers plus possible legal hell if there were copyright infractions).
On the other hand, if they figured "hey, your hello world program uses library Y, which is 2 million lines that we don't think is documented properly," then the "application" does not *contain* 50% or more open source code, but rather *references* a certain amount of open source code, which is probably a meaningless statistic.
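To put a number on why the "references" reading makes the statistic meaningless, here is a rough sketch. The 10-line hello world is a made-up figure for illustration, the 2 million lines come from the comment above, and `open_source_share` is a hypothetical helper, not anything Palamida is known to use.

```python
def open_source_share(own_lines, third_party_lines):
    """Fraction of total lines that are third-party, counted naively by line."""
    total = own_lines + third_party_lines
    if total == 0:
        raise ValueError("no lines to count")
    return third_party_lines / total

# Hypothetical hello world (10 lines) linked against a 2-million-line library.
share = open_source_share(10, 2_000_000)
print(f"{share:.4%} of the 'application' is open source")
```

Counted that way, almost any small program that links a big library is "more than 50% open source", which says nothing about what the developers actually wrote or copied.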
"Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth
I would be interested to know what languages you have used.
I have found Perl to be very well documented, even though it appears to be on a decline or leveled off on the number of developers and active projects.
Meanwhile, I have looked into using Rails and found it a great example of shitty code practices. I've stated this very case to the development community, and they pretty much dismissed my statements as those of an inexperienced developer unwilling to "go the distance".
I hope this might be slightly helpful in getting people like the Rails community to either understand that they really do need documentation or get companies to throw aside Rails as POS software that is so lacking in documentation that it's a greater burden to have it than to use the alternatives.
There is an excellent case where if you have a highly experienced and knowledgeable developer then you maybe don't care. But if you have to replace this developer with one less knowledgeable or want to expand your development team, you suffer a huge start up cost of trying to bring someone up to speed at your expense.
Specifically, the Rails plug-ins are documented with oversimplified tutorials that aren't even available for free, so you have to make an extra effort to find the documentation for the software that you download, since the two aren't in the same location. Restful Authentication is one example in particular.
Add to that the documentation in Ruby DBI. There isn't any. The documentation says to see Perl DBI for documentation. Consider that this is a reference to a different language with different syntax, that some of the Perl methods aren't possible in Ruby, and that Ruby DBI likewise has methods that aren't available in Perl. WTF? This is documentation?
Film at 11.
Seriously, this isn't news. I don't care what context you're talking about, programmers often skip over documenting their work. That's largely due to the pressures of how much time they have to work on something (either imposed by The Boss or other time commitments).
I am officially gone from
Palamida is full of crap. I am a senior employee at a comprehensive security services company, and prior to this job, filled a similar role at an extremely large software company most of us are aware of. One of our services is to review source code for our customers. We can review approx 1200 to 1500 lines of source code per day per analyst. Therefore, *complete* source code reviews are very rare, as they're long and expensive, and targeted ones are performed instead. However, 300 to 500 million lines of source code would take quite a while to review. To be more precise, using the lower bound of their stated numbers and the upper bound of what a highly skilled analyst can achieve, 300,000,000 lines of code / 1500 lines of code reviewed per day = 200,000 man days of review. Assuming that they are fairly stingy with paid time off, their staff will work about 50 weeks per year. 200,000 man days works out to 40,000 work weeks, which at the rate of 50 work weeks per year would result in 300,000,000 lines of code taking 800 man years. As it was stated that this 300 to 500 million lines of source code review occurred in 2007, and not from 1207 to 2007, it can be presumed that they are either: A) time travelers, B) employing more highly skilled security analysts than any other company on Earth, C) doing a piss-poor job using automated source code scanning tools and checklist monkeys, D) full of shit, or E) C and D. My bet is on E. By way of illustration, Windows Vista has somewhere in the neighborhood of 50 million lines of code, and it took the combined resources of multiple security vendors over 6 months to perform even a very limited and targeted source code review. This one company is claiming that they were able to review 10 Vista-sized software projects in one year?
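For anyone who wants to check the arithmetic above, a quick sketch. The constants are the ones stated in the comment; the five-day work week is an assumption implied by 200,000 man days becoming 40,000 work weeks, and `review_effort` is just an illustrative name.

```python
# Back-of-the-envelope check of the manual review estimate.
LINES_AUDITED = 300_000_000      # lower bound of the claimed 300M-500M lines
LINES_PER_ANALYST_DAY = 1_500    # upper bound of the stated review rate
DAYS_PER_WORK_WEEK = 5           # assumption implied by the comment's numbers
WORK_WEEKS_PER_YEAR = 50

def review_effort(lines, lines_per_day=LINES_PER_ANALYST_DAY):
    """Return (man_days, work_weeks, man_years) for a line-by-line review."""
    man_days = lines / lines_per_day
    work_weeks = man_days / DAYS_PER_WORK_WEEK
    man_years = work_weeks / WORK_WEEKS_PER_YEAR
    return man_days, work_weeks, man_years

man_days, work_weeks, man_years = review_effort(LINES_AUDITED)
print(f"{man_days:,.0f} man days, {work_weeks:,.0f} work weeks, {man_years:,.0f} man years")
# prints "200,000 man days, 40,000 work weeks, 800 man years"
```

Plugging in the upper bound of 500,000,000 lines at the same rate gives roughly 1,333 man years, so the conclusion only gets worse.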
Of the proprietary code I have been privy to, so much of it was poorly documented and commented.
But seriously, the trend needs to stop, as it creates an excuse to ban open source from the workplace.
putting the 'B' in LGBTQ+
Gotta love this place. At the time of this posting, there are 11 comments modded 3 or higher. Of those, only ONE makes any reference to the act of documenting where the code is coming from (which is what the article is about). All the rest are talking about writing documentation for code, or commenting code as it's written. Way to miss the ball, guys! This article is addressing you specifically, yet you have no idea what they're even saying because you can't be bothered to try to listen. Nice.
I agree, that metric sounded much too high for open-source. The 70% up from 30% was actually the incidence of undocumented Hispanic workers in the US's past few years. Statistical cross-contamination?
That is the reason why you should always buy all software from Microsoft.
It will come fully documented!
You are right, if you compare commented lines of code
with total lines of code, the number is possibly plausible.
How often do you comment stuff like
{
}
i++;
You are being MICROattacked, from various angles, in a SOFT manner.
Any software product without good documentation is difficult to use. Proprietary programs are potentially a lot worse than open source because you don't have an easy option for figuring out the hard bits for yourself.
My example would be Activant Eclipse (formerly Intuit) - the ERP software the company I work for uses. It's expensive, performs poorly (even on the expensive IBM mini that is required), is buggy, is largely undocumented, is hard/expensive to customize and is completely required to keep our business running. We pay for an expensive support contract and support is still often a joke - complicated issues are usually figured out by our staff because their support staff isn't able to do much. A horrible painful experience.
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
The definition of each instruction is very well defined by the CPU's instruction set.
Engineering is the art of compromise.
The more interesting statistic is the number of products being shipped that use open source code (e.g. GPL) where they are violating the license, have not released any of the code they are required to and have not acknowledged their use of open source code (sometimes they will comply after a lot of pressure, sometimes they never comply)
Comments that are wrong or outdated are worse than comments that are missing. Both are crap compared to well written unit tests. I didn't notice anywhere in the article a mention of whether or not this analysis took unit testing into consideration. Perhaps Palamida's VP of Product Marketing should quit telling people to build software like it's 1999 and perhaps rebuild some of the tools they're trying to sell to the public so they check for test coverage. I'll bet open source developers could go out over the next 3 months and start adding: /* Palamida's coverage tools are crap */
to every function and method in C/C++ apps, and they could get a glowing reversal on what wonderful improvements have been made to documentation.
I swear. People who think about software like this should be kept away from computers. And perhaps sharp objects for their own good...
*** Sigs are a stupid waste of bandwidth.
Documentation!?
We don't need no steenkin documentation!
Of course I didn't RTFA... why would I do that? You really are new here aren't you? Don't let my UID fool you.
The source can tell you WHAT the program is doing, but it doesn't tell you WHY. Especially when the programmer did something tricky.
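A small sketch of the difference, with everything in it made up for illustration: `retry_delays`, the 30-second cap, and the proxy timeout are hypothetical, not taken from any real project.

```python
def retry_delays(attempts):
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    delay = 1
    for _ in range(attempts):
        # A WHAT comment would say "cap the delay at 30", which the code
        # already says. The WHY is what the code cannot tell you:
        # (hypothetical) an upstream proxy drops connections that sit idle
        # for 60 seconds, so a longer wait would force a full reconnect.
        delays.append(min(delay, 30))
        delay *= 2  # exponential backoff
    return delays

print(retry_delays(7))  # prints [1, 2, 4, 8, 16, 30, 30]
```

Anyone can read the `min(delay, 30)`; nobody can read the proxy timeout out of it.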
Slow down, cowboy! It has been 4 hours since you last posted. You must wait another few hours.
"Meanwhile, I have looked into use Rails and found it a great example of shitty code practices."
As someone using Ruby and Rails on a regular basis, I have to say that this is my experience too.
I got involved in a flamewar with developers of a well-known Rails application over my contention that documentation was required. Their position is that you don't need any documentation if you have unit tests; you can simply read the unit tests to work out what the supported API is.
Riiiiight.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Let me add PHP to the list! PHP makes it wonderfully easy for inexperienced programmers (read: non-programmers) to build something that looks and feels like it works, but often without the coder knowing HOW (or why) it works.
Well, no, I say tomato(e) but the point is that the story itself doesn't even make it clear what it's about. There's no reason to make the point that "open source" code is poorly, if ever, described in plain ENGLISH text (lack of proper training is the fault), but there IS a reason to make a point about using "open source" code in places where you would not expect to find it. Now, don't confuse "open source" with "GPL". One is a hideously communistic plot created by a madman, and the other is, basically, public domain ("BSD" and other "licenses" notwithstanding). Microsoft uses "open source". But then "open source" uses "open source" so this just became a story that has no ending. I'm outta here --- zing !!
Do you think you could document your claim that perl use has "declined or leveled off"?
Perl hype has leveled off, now that O'Reilly is focussing
on selling books about other things, but perl usage remains
pretty high, as far as I can tell.
I do agree with you about the high quality of perl
documentation. Perl has always attracted people who like
to write about their code, and one advantage of having
a reputation as a "write-once" language is that people
bend over backwards to document (both internally and externally).
(You want to stay away from people who use phrases like
"self-documenting code".)
"How can businesses protect themselves and still draw on open source code effectively?"
By paying the open-source developers.
I absolutely cannot document any decline. I do think the hype has leveled off. But I was in a meeting at YAPC::NA two years ago where they were discussing the decline in new programmers coming into the community.
Your examples try to make a point, but miss the mark. The point of the article is not about understanding what the code in question does, it's about maintaining the code - there is a big difference between understanding *what* code does vs. *how* the code does it. A one line comment before a block of unreadable code doesn't help to debug the code. A one line comment in front of a block of semi-readable code at least helps a little.
The real problem is undebuggable and unreviewable code presenting security risk. Did you use your pointers and buffers correctly? Did you check inputs and deal with errors correctly? Does your code follow the contract specified by all the annotations?
If you write a block of code with an annotation which says that you will initialize an output parameter, but on an error fail to initialize that output parameter, and do not throw an exception, then you can cause bugs upstream. Similarly, if you do not understand the contract of a method and do not check it for errors, you can also cause bugs upstream. Documenting these things is a basic requirement for experienced and inexperienced developers alike, and code reviewing is an important part of the development process (as is testing).
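A minimal sketch of that kind of contract, under some loud assumptions: Python has no output parameters or annotation checker, so a mutable dict stands in for the output parameter and the docstring stands in for the annotation. `parse_port` and `connect_target` are hypothetical names invented for the example.

```python
def parse_port(text, out):
    """Parse a TCP port number from a string.

    Contract (hypothetical): out["port"] is ALWAYS set before returning.
    Returns True and stores the port on success; returns False and
    stores None on failure. No exception is raised for bad input.
    """
    out["port"] = None  # honor the contract on every path, including errors
    try:
        port = int(text)
    except ValueError:
        return False
    if not 0 < port < 65536:
        return False
    out["port"] = port
    return True

def connect_target(text):
    """Caller that checks the documented return value before using the output."""
    out = {}
    if not parse_port(text, out):
        return "invalid port"
    return f"connecting to port {out['port']}"

print(connect_target("8080"))   # prints "connecting to port 8080"
print(connect_target("abc"))    # prints "invalid port"
```

Drop the first assignment in `parse_port` and the error paths leave `out` untouched, which is exactly the upstream bug described above; skip the check in `connect_target` and the caller breaks its half of the contract instead.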
The decline of Perl is a myth. A graph of CPAN uploads vs. time shows a dramatic increase in the last couple of years, and 2008 is already ahead of the entirety of 2007.
http://blog.timbunce.org/2008/03/08/perl-myths/
I remember that YAPC discussion. As I recall it, the point was that the average age of Perl developers was increasing, indicating a decline in younger programmers entering the community. Perl is rarely a programmer's first language, so this isn't entirely surprising. PHP is taking Perl's place as the newcomer's first web programming language (which is OK in my opinion -- PHP is easier to learn)