How Open Source Could Benefit Academic Research
dp619 writes "Ross Gardler, of Apache Fame, has written a guest post on the Outercurve Foundation blog advocating that universities accelerate the research process through a collaborative sharing and development of research software while examining reasons why many have been reluctant to publish their source code. Quoting: 'These highly specialized software solutions are not rarely engineered for reuse. They are often hacks to answer a specific question quickly. ... What many academic researchers fail to understand is that this specialization problem is not unique to research projects. Most software developers will seek to provide an adequate solution to their specific problem, as quickly as possible. They don't seek to build a perfect, all-purpose, tool set that can be reused in every conceivable circumstance. They simply solve the problem at hand and move on to the next one. The difference is that open source developers will do this incremental problem solving using shared code. They will share that code in incremental steps rather than wait until they've built the complete system they need but is too specific for others to use. Other people will reuse and improve on the initial solution, perhaps generalizing it a little in the process. There is no need to share the details of why one needs a 'green widget' nor is there any reason to prevent someone modifying it so it can be either a 'green widget' or a 'blue widget.'"
Open Source by default has the benefit of many eyes checking for errors, contributing ideas, but things get sour when too many people commit.
Too many chefs etc... etc...
I think this will change when there's an open record of who came up with an idea first. Wouldn't it be quite a bit harder to say "I came up with that" if we don't know about your ideas until after your paper is published?
This is a false premise, IMO. By default, all changes are incremental. Dumps happen when there's poor coordination between parties involved and no one's really sure of what they're working on.
There needs to be oversight by competent, impartial people who can pinpoint conflicts early, look for logical problems (this is harder when you're too involved in your own research/programming/souffle) and most importantly, let ideas through that actually contribute to the overall understanding of the subject.
This doesn't apply to code per se, but really any research: If the global warming controversy is any hint, research is prime real estate for astroturfing. Empirically observed fact is no match for the perceived reality of the ignorant when special interests are involved. Open source with research wouldn't just be beneficial to the programming aspect, but it would also ensure we're not walking into a wall with critical thought.
If computers were people, I'd be a misanthrope.
Shameless plug: http://perk.cs.queensu.ca/software We do exactly this. Our software is open source for anyone to use/test/fix. We do use SVN to maintain some control over the code that is committed, but overall it works quite well. We have just launched some projects on github; it's a new experiment and we're interested to see how it turns out.
Sometimes, when you publish the code you used to develop new Biochemistry or Genetics solutions, you find that other scientists in other countries use your code to reverse engineer what you are working on - your results, if you will - to eliminate dead ends and publish a paper on what you invested years finding a solution for, but before you submit your paper that they "effectively" stole.
We had that happen when we deposited ligand results a few times, until we learned to stop submitting such things until AFTER we were approved for print.
This is one reason for hesitancy that I can agree to. Just because I wrote code, doesn't mean I want you to have it, if I haven't published the end result.
After it's in print, you're welcome to have the code. Not before.
-- Tigger warning: This post may contain tiggers! --
The discoveries, algorithms and parameters generated by publicly-funded research are locked behind the paywalls of for-profit publishers. Those publishers won't publish an article unless the academic SURRENDERS THEM THE COPYRIGHT OF THEIR RESEARCH PAPER FOR FREE. The only reason these publishers have survived is because academics want their research published in the most prestigious (read 'expensive') journal they can find. Academics could benefit from 'open-sourcing' their research too.
"Academic publishers charge vast fees to access research paid for by us."
http://www.guardian.co.uk/commentisfree/2011/aug/29/academic-publishers-murdoch-socialist
"Academic papers are hidden from the public."
http://www.badscience.net/2011/09/academic-papers-are-hidden-from-the-public-heres-some-direct-action/
There are many open-source research software efforts already, and of course it would be good to see this become more widespread. These range from small-scale individual researcher one-off efforts to broad multi-institution efforts that are well-maintained over years. The software that I develop in the course of my mathematical research is available freely from our webpages, with intermittent downloads. And I still get inquiries about using it, to which I just say that it's all on our webpages already.
One barrier to broader efforts in the US is that science agencies (at least the National Science Foundation) generally support research proper, rather than development of tools. Oddly, I am much more likely to get a grant to work out research that perhaps 20 to 50 people may be interested in than I am to get a grant to develop research tools that may be useful in furthering research to a few hundred researchers. Nevertheless, it is more common that universities and funding agencies expect data and software from research to be freely available. Many people drag their feet on these requirements as they are worried that some other researchers will use their tools to scoop them, but I think these instances are very rare.
It's psychosomatic. You need a lobotomy. I'll get a saw.
The software I have written for my odd specialized purposes is similar to the software my colleagues write: It's spaghetti code written with custom libraries which are not better than common ones and it has no documentation at all.
We could open-source it, but then you'd just bitch about how poorly it's constructed.
We don't have time to open-source our code. Heck, I've had people ask to use software I've made and I've regretted giving it to them because I then am obligated to explain to them how to use it.
Maybe it is not as simple as pick zero / one / two. The purposes of writing software for research and engineering software for reuse are so different that it doesn't make sense to try and compare them. Going back to the summary:
No. What the author of the article fails to understand is that software is not the point of research - it is a side-effect, and I say that as someone whose field is CS. We do not write software in academia because we want the software - we simply want the data about its behaviour that we can get from it. It doesn't matter if business / hobbyists / academics have in common an approach that builds software for the least effort. In the first two cases the software is being written because there is a need for it to be used. In the latter case it simply needs to exist in some form long enough for some data to be collected and then it is obsolete. This difference in purpose is so vast that it renders the rest of the argument in the article as not even wrong.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
The Journal of Statistical Software is an electronic journal that publishes software. It tends to publish R packages because that's where the development is mostly happening these days, but it will publish any language. The refereeing process checks that the software works as well as that it is a good contribution. It has a reasonable reputation, far above the junk journals on Beall's list (Google it if you don't know what that is), though not as high as the better mathematical journals in the area. The R Journal has a similar goal, but it's newer, and the reputation isn't there yet.
I review grants, and I give a lot more credit to software published somewhere like JSS or the R Journal than to software available on someone's web site.
So some academics do get credit for this.
At least in the particle physics community, practically all anyone uses is open-source code. The most common are GEANT4 for simulating particles interacting with matter, and ROOT which handles data analysis. Both are maintained by dedicated people at CERN.
As to more specialized code, any time I've ever asked someone about their analysis, no matter what institution or relation (or lack of) to me, they've always been happy to share their source code with me. Usually with many caveats about quality, but it's there. The problem for us has always been knowing who to ask, so a dedicated central repository could be interesting.
Maybe a model like the arxiv.org could work. Almost everyone these days puts preprints of upcoming papers on the arxiv. Since there's no review system, you also get lots of garbage from crazies, but it's generally not hard to weed out if you know at least a little about the subject matter of your search, and trivial if you know the relevant big names in your field. In the same vein, a huge code repository where anyone could upload their junky scripts, tagged by name and subject/function/whatever, might work better than it would seem at first glance.
Most academics are under tremendous pressure to keep anything of potential commercial value closed; releasing code as open-source generally requires permission from above. (In fact, I know of one professor of biology who had to fight to get a line in his contract explicitly allowing him to open-source everything.) And it's not like most of them need encouragement; none of us are getting rich off NIH grants (well, most of us aren't) and we effectively hit a salary ceiling early in our careers, so the prospect of a few thousand dollars extra in licensing revenue is more than most can resist. In several cases that I'm aware of, the licensing money is used to support research activities - sometimes enough to pay for an entire employee, or pay for meetings that wouldn't happen otherwise. Note that in many cases the code itself is still available, just not under a license that allows distribution, which usually makes it difficult or impossible for anyone who wants to build on your work to do so.
Of course it's not always this simple - junior researchers have very little control, so many of us end up releasing code under proprietary licenses when we'd much rather open-source everything. I also know of many cases where paranoia and competitiveness, rather than avarice, are at fault - in these cases, the code itself is hidden and the software released as binary-only (which as far as I'm concerned should be unacceptable for anything published in a peer-reviewed journal, regardless of the license used). Regardless, there are simply too many incentives to retain full control.
This is a completely idiotic situation, of course, and it has been holding back science for years - I know of multiple cases where university researchers were effectively doing R&D for private companies (not always willingly!) with very little in return. I've also seen researchers prevent widespread adoption of their work (and hamper their career advancement) because of tight-fisted behavior. One asshole even charges other academics to obtain his software, with the result that some people avoid using it altogether. Frankly, since I have to deal with this bullshit on a near-daily basis, as far as I'm concerned a repeal of the Bayh-Dole act (and its equivalents in Europe), at least where software is concerned, would be a huge leap forward for academic computational research. The bonus I get from licensing fees is simply not worth the trouble and missed opportunities.
So for clarification, I think you missed the point of what the GP was trying to say. The statement in the summary as written suggests that highly specialized software solutions are commonly engineered for reuse.
Based on the context of the summary, it should probably say either:
1. These highly specialized software solutions are not engineered for reuse.
or
2. These highly specialized software solutions are rarely engineered for reuse.
Citeseer and Google Scholar contain a large number of freely accessible scientific papers. Many journals have open access policies. Many researchers publish their results on arxiv before sending them anywhere else. IEEE and ACM let their members access papers (IEEE policy at http://www.ieee.org/publications_standards/publications/subscriptions/prod/mdl/mdl_overview.html . ACM's policy at https://campus.acm.org/public/qj/profqj/qjprof_control.cfm?form_type=Professional . SIAM's policy http://www.siam.org/membership/individual/benefits.php ). So ok, it is not free, but it's not really expensive either if you are actually interested. Most researchers publish preprints on their website. If they don't, drop them an email and they'll send you a preprint (if I had not put it on my website, I would send a preprint.)
Assuming you could not find it. And the author is a jerk. And you don't want to pay for it. You can still stop by a university library where you will be able to download it using the university's subscription or photocopy it if the library has a paper edition.
Finally, we are not looking to send our papers to the most expensive journal. To the most prestigious certainly, but the price has nothing to do with it. Arguably, one of the most prestigious journals in CS is ACM Computing Surveys. It is an ACM journal, so all ACM members can read it online for the price of their subscription. Hardly the most expensive journal.
That being said, I'd rather we only published in open-access journals and ditched the publishers. But that's not realistically going to happen anytime soon.
Why would researchers publish their code? They have only one target - to get their *papers* published in reputable venues. More often than not, such venues are closed and paywalled, so it is not surprising that they do not enforce (in fact they discourage it, to say the least) opening up bits of research.
Some researchers would be happy to publish their code anyway (as a matter of principles, or to promote themselves through non-academia channels) but at best they would be frowned upon by their superiors for mis-allocating their resources. At worst, they would be accused of undermining team efforts (by disclosing too much information or exposing inconvenient assumptions to competing researchers) or risking legal conflicts with publishers (copyright).
As mentioned earlier, the code written as a part of research is often poor. This is caused by the same underlying mechanism - getting as many papers published with as little work as possible. That is not (only) about procrastination. The effort that would go into making the code better is better spent on another project.
As usual, "you get what you test for". In case of publicly funded academic projects this means "plenty of good enough papers and nothing more".
"What the author of the article fails to understand is that software is not the point of research - it is a side-effect, and I say that as someone whose field is CS."
(disclaimer: I am working as a postdoc for some US university)
The article in general is clueless. You are of course right. Researchers don't care about their code. I want to know if a design works, if an algorithm works or if it does not. That's why I end up writing code. Once my report/paper/thesis/grant application is written I do not care about the software anymore.
I'd love to produce proper software. But most researchers do not have a clue how to make good software. Software engineering is not our job. We typically do not know how to write *really* good software. That type of skill is not commonly found in grad students. You'll need a postdoc or a professor to do it well. PhD time is valuable, it *is* worth a lot of money. None of the money that comes from grants pays for software development. Even if it did, my career would certainly advance more if I do research instead of software. (With occasional exceptions like "This is the holy grail. We need it done well.")
The only other option would be to pay a software engineer. Grants typically do not cover that. Some do, but most don't.
The final option would be to get somebody else to cover the software development cost. That can happen, but that's very rare. You'll need to find a company that needs the edge the software will bring, that actually wants to work with academia and that is ok with publishing the source code (so potentially losing the edge the project brings them.) That can happen, but do not count on it.
Finally, even assuming there is a useful software framework close to something I am interested in, what will be the investment cost for me to get into that software? Recently I was looking at Android programming for adding a calendar type. That stuff is ridiculously complicated with dozens of concepts and objects and all. And I am talking about a freaking calendar. All-encompassing software tends toward an overly engineered design. If it takes me more time to get into the software than to get my job done, why should I use it?
People should perhaps have a look at where open source actually started. In any case, there are reasons not to publish source that aren't nefarious: you haven't written up all the papers yet and don't want to get scooped, you don't want to spend a lot of time answering questions about it, etc. I think most academics really have these tradeoffs under control.
None of the money that comes from grants pays for software development. Even if it did, my career would certainly advance more if I do research instead of software.
This depends on the field. In particle physics, where we have massive computational challenges, grants can specifically fund software development. In fact when I was a grad student there were even permanent positions called physics programmers, and software development certainly can be very good for your career as long as it is combined with physics analysis - at least it has not hurt me so far. As for "needing a postdoc or a professor to do it well" I very much beg to differ - and I say that as a professor! Programming skills vary considerably at all levels but good grad students, while lacking experience, can be a step or two ahead of their older colleagues in terms of modern programming savvy, since those colleagues are sometimes prone to the FORTRAN++ coding style!
Once my report/paper/thesis/grant application is written I do not care about the software anymore.
Again this varies by field. Monte Carlo simulators for particle physics have a life well beyond any one project and in fact can be projects in themselves. In fact you are reading this page using a software technology developed at CERN to assist particle physics research - the world wide web. So even if you don't care about it anymore, sometimes software developed for research can be amazingly useful outside that research.
Not sure what the intent of this article is, since academic research already uses a lot of open source software, far more than industry does. Knowing how to navigate a POSIX system is practically a requirement. Researchers also produce a lot of open source software. In my experience, software mostly falls into two categories: quick, hacked together scripts for analyzing data in a specific way, and complex simulations. The quick scripts generally aren't shared because it would take just as long to explain them to someone else as to rewrite them, and making a manual is simply a waste of time. But the quick scripts are written in a high level language which promotes the sharing of snippets of code, like math functions, commonly used analysis, plotting routines... The workstations at a lab are networked together and typically these little snippets get shared around the work group, and seem to find their way to other groups through collaborations and stuff.
Simulation codes are usually written in FORTRAN and are always distributed in source code form, because workstations have diverse architectures and typically a user will have to modify the program to fit his or her needs. Nobody cares about licenses and such, though you should probably include the code author in your coauthor list of a paper you publish using the code.
"Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!"
The big fish (and I've worked for them) don't, and it's likely they got that way by protecting their turf. Science is cutthroat.
46 & 2
Yes - posts written in the middle of the night are not entirely coherent :)
This guy, who wrote an extremely useful and powerful piece of OSS software that is widely used in the graphics community, said it very well in his blog:
http://meshlabstuff.blogspot.com/2010/03/assessing-open-source-software-as.html/
Basically, you are an idiot if you invest any time at all in such things. Papers are all that count. OSS software? You wrote something that hundreds of other researchers depend on for their daily work? Get lost, that professorship goes to someone else. Someone else who was a Real Man, and wrote Papers! Lots of them!
I've developed quite a bit of code in the process of my PhD that I'm in the process of open-sourcing on github (backed with a website here I've developed with open-access scientific protocols - no code there yet; getting clearance).
As others have mentioned the big anti- to this is the problem of publication. If I put my software up there free to use, there is nothing to stop someone else swooping in and using it to pre-empt the results I've spent time writing the software to accomplish (I'm helped slightly by working an obscure angle on an equally obscure field). Further, opening software up to outside contributions opens all sorts of issues with authorship, credit, etc. As it stands I can publish a paper on the software with my name on it - but if I had 20 or so contributors are they all going to want their names on there? All solvable problems - there is typically a threshold of contribution for acknowledgement; and the fact of contribution is preserved right there in the git log. But not something most people want to spend time thinking about.
There is also a slight over-enthusiasm for patents - the first reaction I get to showing off my software is "you should patent it!" on the idea that I would get stinking rich. That's unrealistic for most software, but open-sourcing it immediately scuppers that as a future possibility. When you're funded via various grant organisations it can get more complex (everyone has to agree). I'm lucky enough to be funded by the Wellcome Trust who are pro open-access - and I'm hoping that will translate to pro open-source too.
Python coder | PyQt Applications | Writer
We release all of our code (if not written for a company) used in our projects, or created during research. However, we are a software engineering group, and computer scientists more often open source their work. In recent years it has become mandatory to do so, as otherwise your claims are not backed. If you publish results and do not provide the means to reproduce the results, you are a blabber.
But most of the code produced in research is lousy, as it is written just to prove something, not to actually be used. If you want to produce better quality code and documentation, the artifacts must be used by others and you have to start to incorporate agile concepts to develop the tools.
Well, there's always the CRAPL license that was made for exactly this kind of source code release, and IMNSHO publishing the source code with the paper should be a must, because it's only science if it is reproducible. I work in image processing and more often than not, papers are missing parameters, the description of the implementation is ambiguous, and as a result just reproducing the result of such a paper is impossible without contacting the authors. (The data used is yet another story.) I do not care if the code is production ready or if I would have to rewrite it from scratch, if I could at least have a look at the tweaks that are not in the paper because the authors didn't deem them important enough and the reviewers didn't notice that the published algorithms are not really reproducible - or worse, the reviewers told the authors that "these are standard filters, so there is no need to publish the parameters".
...but writing OSS software for education and academia and distributing it to other universities around the country and world has been my job for the last few years.
http://www.xkcd.com/354/
I completely agree with you. And I try to publish code as often as I can. Though, I do not believe the original article is about getting the code out.
I think the original article is about getting the code into a shape where it can be reused and built upon in the same way open source software is. Any code released under CRAPL is probably not in a shape where it can reasonably be built upon. Most of the code I have published is not in very good shape.
http://www.pdfernhout.net/open-letter-to-grantmakers-and-donors-on-copyright-policy.html
"Summary: Foundations, other grantmaking agencies handling public tax-exempt dollars, and charitable donors need to consider the implications for their grantmaking or donation policies if they use a now obsolete charitable model of subsidizing proprietary publishing and proprietary research. In order to improve the effectiveness and collaborativeness of the non-profit sector overall, it is suggested these grantmaking organizations and donors move to requiring grantees to make any resulting copyrighted digital materials freely available on the internet, including free licenses granting the right for others to make and redistribute new derivative works without further permission. It is also suggested patents resulting from charitably subsidized research also be made freely available for general use. The alternative of allowing charitable dollars to result in proprietary copyrights and proprietary patents is corrupting the non-profit sector as it results in a conflict of interest between a non-profit's primary mission of helping humanity through freely sharing knowledge (made possible at little cost by the internet) and a desire to maximize short term revenues through charging licensing fees for access to patents and copyrights. In essence, with the change of publishing and communication economics made possible by the wide spread use of the internet, tax-exempt non-profits have become, perhaps unwittingly, caught up in a new form of "self-dealing", and it is up to donors and grantmakers (and eventually lawmakers) to prevent this by requiring free licensing of results as a condition of their grants and donations."
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.