How Open Source Could Benefit Academic Research
dp619 writes "Ross Gardler, of Apache fame, has written a guest post on the Outercurve Foundation blog advocating that universities accelerate the research process through collaborative sharing and development of research software, while examining reasons why many have been reluctant to publish their source code. Quoting: 'These highly specialized software solutions are not rarely engineered for reuse. They are often hacks to answer a specific question quickly. ... What many academic researchers fail to understand is that this specialization problem is not unique to research projects. Most software developers will seek to provide an adequate solution to their specific problem, as quickly as possible. They don't seek to build a perfect, all-purpose, tool set that can be reused in every conceivable circumstance. They simply solve the problem at hand and move on to the next one. The difference is that open source developers will do this incremental problem solving using shared code. They will share that code in incremental steps rather than wait until they've built the complete system they need but is too specific for others to use. Other people will reuse and improve on the initial solution, perhaps generalizing it a little in the process. There is no need to share the details of why one needs a 'green widget' nor is there any reason to prevent someone modifying it so it can be either a 'green widget' or a 'blue widget.'"
"These highly specialized software solutions are [not] [rarely] engineered for reuse."
Pick one. Not both.
Open Source by default has the benefit of many eyes checking for errors, contributing ideas, but things get sour when too many people commit.
Too many chefs etc... etc...
I think this will change when there's an open record of who came up with an idea first. Wouldn't it be quite a bit harder to say "I came up with that" if we don't know about your ideas until after your paper is published?
This is a false premise, IMO. By default, all changes are incremental. Dumps happen when there's poor coordination between parties involved and no one's really sure of what they're working on.
There needs to be oversight by competent, impartial people who can pinpoint conflicts early, look for logical problems (this is harder when you're too involved in your own research/programming/souffle) and most importantly, let ideas through that actually contribute to the overall understanding of the subject.
This doesn't apply to code per se, but really any research: If the global warming controversy is any hint, research is prime real estate for astroturfing. Empirically observed fact is no match for the perceived reality of the ignorant when special interests are involved. Open source with research wouldn't just be beneficial to the programming aspect, but it would also ensure we're not walking into a wall with critical thought.
If computers were people, I'd be a misanthrope.
Shameless plug: http://perk.cs.queensu.ca/software We do exactly this. Our software is open source for anyone to use/test/fix. We do use SVN to maintain some control over the code that is committed, but overall it works quite well. We have just launched some projects on github; it's a new experiment and we're interested to see how it turns out.
Sometimes, when you publish the code you used to develop new Biochemistry or Genetics solutions, you find that other scientists in other countries use your code to reverse-engineer what you are working on - your results, if you will - to eliminate dead ends and publish a paper on what you invested years finding a solution for, before you can submit your own paper, which they "effectively" stole.
We had that happen when we deposited ligand results a few times, until we learned to stop submitting such things until AFTER we were approved for print.
This is one reason for hesitancy that I can agree to. Just because I wrote code, doesn't mean I want you to have it, if I haven't published the end result.
After it's in print, you're welcome to have the code. Not before.
-- Tigger warning: This post may contain tiggers! --
What if your pursuer is the Federal Government and they have every intention of putting you "out of business"?
The discoveries, algorithms and parameters generated by publicly-funded research are locked behind the paywalls of for-profit publishers. Those publishers won't publish an article unless the academic SURRENDERS THEM THE COPYRIGHT OF THEIR RESEARCH PAPER FOR FREE. The only reason these publishers have survived is because academics want their research published in the most prestigious (read 'expensive') journal they can find. Academics could benefit from 'open-sourcing' their research too.
"Academic publishers charge vast fees to access research paid for by us."
http://www.guardian.co.uk/commentisfree/2011/aug/29/academic-publishers-murdoch-socialist
"Academic papers are hidden from the public."
http://www.badscience.net/2011/09/academic-papers-are-hidden-from-the-public-heres-some-direct-action/
There are many open-source research software efforts already, and of course it would be good to see this become more widespread. These range from small-scale individual researcher one-off efforts to broad multi-institution efforts that are well-maintained over years. The software that I develop in the course of my mathematical research is available freely from our webpages, with intermittent downloads. And I still get inquiries about using it, to which I just say that it's all on our webpages already.
One barrier to broader efforts in the US is that science agencies (at least the National Science Foundation) generally support research proper, rather than development of tools. Oddly, I am much more likely to get a grant to work out research that perhaps 20 to 50 people may be interested in than I am to get a grant to develop research tools that may be useful in furthering research to a few hundred researchers. Nevertheless, it is increasingly common for universities and funding agencies to expect data and software from research to be freely available. Many people drag their feet on these requirements as they are worried that some other researchers will use their tools to scoop them, but I think these instances are very rare.
It's psychosomatic. You need a lobotomy. I'll get a saw.
This is a great idea, but it won't happen until the NIH requires it. We release our software as open source, but we're definitely not the standard in our field. Almost everyone else keeps the source closed.
Let's see here, am I going to focus on collaborating with others on building re-usable software components, which does NOT get me closer to publishing my paper (the only thing people are rewarded for in academia) or am I going to focus on my research, which is?
*cackles*
What the hell did you think I was gonna do, sonny?
The software I have written for my odd specialized purposes is similar to the software my colleagues write: It's spaghetti code written with custom libraries which are not better than common ones and it has no documentation at all.
We could open-source it, but then you'd just bitch about how poorly it's constructed.
We don't have time to open-source our code. Heck, I've had people ask to use software I've made and I've regretted giving it to them because I then am obligated to explain to them how to use it.
> Open Source by default has the benefit of many eyes checking for errors, contributing ideas, but things get sour when too many people commit.
This is why I don't contribute to OS. The barrier to entry is too high. I have useful changes, but I have no idea what the process is to submit in some cases, and people complain about style.
> I think this will change when there's an open record of who came up with an idea first. Wouldn't it be quite a bit harder to say "I came up with that" if we don't know about your ideas until after your paper is published?
Nope. What counts is a paper. Anyone can string together an idea and put it online. The difficult part of research is proving you're right and convincing others that it's a good idea. I don't get grants based on my commits, I get them based on my publications. If I put my stuff online before it's published, others can publish on it first. Sure, I can complain on my blog 'Oh, I had it first', but that means exactly 0 to grant committees.
> This is a false premise, IMO. By default, all changes are incremental. Dumps happen when there's poor coordination between parties involved an no one's really sure of what they're working on.
Right, which is exactly what you have when someone builds something they think is useful to other people. In research you might try a half a dozen ways to do something, and when it works, then you have something to share. At this point, it's probably a big steaming dump.
> There needs to be oversight by competent, impartial people who can pinpoint conflicts early, look for logical problems (this is harder when you're too involved your own research/programming/souffle) and most importantly, let ideas through that actually contribute to the overall understanding of the subject.
You have no idea how real research works. It's very easy to go down a very interesting rabbit hole with problems, things you can't foresee. This results in weird code adaptations to get the job done.
The Journal of Statistical Software is an electronic journal that publishes software. It tends to publish R packages because that's where the development is mostly happening these days, but it will publish any language. The refereeing process checks that the software works as well as that it is a good contribution. It has a reasonable reputation, far above the junk journals on Beall's list (Google it if you don't know what that is), though not as high as the better mathematical journals in the area. The R Journal has a similar goal, but it's newer, and the reputation isn't there yet.
I review grants, and I give a lot more credit to software published somewhere like JSS or the R Journal than to software available on someone's web site.
So some academics do get credit for this.
At least in the particle physics community, practically all anyone uses is open-source code. The most common are GEANT4 for simulating particles interacting with matter, and ROOT, which handles data analysis. Both are maintained by dedicated people at CERN.
As to more specialized code, any time I've ever asked someone about their analysis, no matter what institution or relation (or lack of) to me, they've always been happy to share their code source with me. Usually with many caveats about quality, but it's there. The problem for us has always been knowing who to ask, so a dedicated central repository could be interesting.
Maybe a model like the arxiv.org could work. Almost everyone these days puts preprints of upcoming papers on the arxiv. Since there's no review system, you also get lots of garbage from crazies, but it's generally not hard to weed out if you know at least a little about the subject matter of your search, and trivial if you know the relevant big names in your field. In the same vein, a huge code repository where anyone could upload their junky scripts, tagged by name and subject/function/whatever, might work better than it would seem at first glance.
There is a vast body of near open source software that is used for research. I am the lead of a software development group and all of our software is available for free to anyone who wants it. We even provide installation support. Add to that all of the software from CERN, hbook, PAW, cernlib, ROOT, GEANT, the list is long. And don't get me started on XML, browsers, web servers and other software that was developed for research and is freely available.
Most academics are under tremendous pressure to keep anything of potential commercial value closed; releasing code as open-source generally requires permission from above. (In fact, I know of one professor of biology who had to fight to get a line in his contract explicitly allowing him to open-source everything.) And it's not like most of them need encouragement; none of us are getting rich off NIH grants (well, most of us aren't) and we effectively hit a salary ceiling early in our careers, so the prospect of a few thousand dollars extra in licensing revenue is more than most can resist. In several cases that I'm aware of, the licensing money is used to support research activities - sometimes enough to pay for an entire employee, or pay for meetings that wouldn't happen otherwise. Note that in many cases the code itself is still available, just not under a license that allows distribution, which usually makes it difficult or impossible for anyone who wants to build on your work to do so.
Of course it's not always this simple - junior researchers have very little control, so many of us end up releasing code under proprietary licenses when we'd much rather open-source everything. I also know of many cases where paranoia and competitiveness, rather than avarice, are at fault - in these cases, the code itself is hidden and the software released as binary-only (which as far as I'm concerned should be unacceptable for anything published in a peer-reviewed journal, regardless of the license used). Regardless, there are simply too many incentives to retain full control.
This is a completely idiotic situation, of course, and it has been holding back science for years - I know of multiple cases where university researchers were effectively doing R&D for private companies (not always willingly!) with very little in return. I've also seen researchers prevent widespread adoption of their work (and hamper their career advancement) because of tight-fisted behavior. One asshole even charges other academics to obtain his software, with the result that some people avoid using it altogether. Frankly, since I have to deal with this bullshit on a near-daily basis, as far as I'm concerned a repeal of the Bayh-Dole act (and its equivalents in Europe), at least where software is concerned, would be a huge leap forward for academic computational research. The bonus I get from licensing fees is simply not worth the trouble and missed opportunities.
At least for physics...
Solid state: Quantum ESPRESSO, Abinit, GPAW
Particle physics: the famous Geant4
Molecular dynamics: LAMMPS, Amber
etc... etc.. etc... I could post links all day.
I'm not really sure who this article is directed at, biologists I guess...
I recall my advisor using code from the popular Numerical Recipes textbook in research. He was surprised when I showed him the introductory notes indicating that the code could be used in research, but only distributed in such projects as binaries (can't show the code).
Citeseer and Google Scholar contain a large number of freely accessible scientific papers. Many journals have open access policies. Many researchers publish their results on arxiv before sending them anywhere else. IEEE and ACM let their members access papers (IEEE policy at http://www.ieee.org/publications_standards/publications/subscriptions/prod/mdl/mdl_overview.html . ACM's policy at https://campus.acm.org/public/qj/profqj/qjprof_control.cfm?form_type=Professional . SIAM's policy http://www.siam.org/membership/individual/benefits.php ). So ok, it is not free, but it's not really expensive either if you are actually interested. Most researchers publish preprints on their website. If they don't, drop them an email and they'll send you a preprint (if I had not put it on my website, I would send a preprint.)
Assuming you could not find it. And the author is a jerk. And you don't want to pay for it. You can still stop by a university library where you will be able to download it using the university subscription, or photocopy it if the library has a paper edition.
Finally, we are not looking to send our papers to the most expensive journal. To the most prestigious certainly, but the price has nothing to do with it. Arguably, one of the most prestigious journals in CS is ACM Computing Surveys. It is an ACM journal, so all ACM members can read it online for the price of their subscription. Hardly the most expensive journal.
That being said, I'd rather we only published in open-access journals and ditched the publishers. But that's not realistically going to happen anytime soon.
Why would researchers publish their code? They have only one target - to get their *papers* published in reputable venues. More often than not, such venues are closed and paywalled, so it is not surprising that they do not enforce (in fact they discourage it, to say the least) opening up bits of research.
Some researchers would be happy to publish their code anyway (as a matter of principles, or to promote themselves through non-academia channels) but at best they would be frowned upon by their superiors for mis-allocating their resources. At worst, they would be accused of undermining team efforts (by disclosing too much information or exposing inconvenient assumptions to competing researchers) or risking legal conflicts with publishers (copyright).
As earlier mentioned, the code written as a part of research is often poor. This is caused by the same underlying mechanism - getting as many papers published with as little work as possible. That is not (only) about procrastination. The effort put into making the code better is better spent on work on another project.
As usual, "you get what you test for". In case of publicly funded academic projects this means "plenty of good enough papers and nothing more".
...of the year?
I'm in a research group that released an influential piece of software as open source. Making the source code available didn't really improve anything: the software was broken and is still broken as far as I can tell. However, many research projects have built off of this (broken) software and our group has been scooped several times, meaning months of work were lost, by groups using our own software. There is absolutely zero reason to collaborate when the code is widely available. Case in point: the software quality may have gotten better, but there are literally dozens of forks, so it's not as if it's feeding back into the main project.
It's just not worth it.
People should perhaps have a look at where open source actually started. In any case, there are reasons not to publish source that aren't nefarious: you haven't written up all the papers yet and don't want to get scooped, you don't want to spend a lot of time answering questions about it, etc. I think most academics really have these tradeoffs under control.
None of the money that comes from grants pays for software development. Even if it did, my career would certainly advance more if I do research instead of software.
This depends on the field. In particle physics, where we have massive computational challenges, grants can specifically fund software development. In fact when I was a grad student there were even permanent positions called physics programmers, and software development certainly can be very good for your career as long as it is combined with physics analysis - at least it has not hurt me so far. As for "needing a postdoc or a professor to do it well" I very much beg to differ - and I say that as a professor! Programming skills vary considerably at all levels, but good grad students, while lacking experience, can be a step or two ahead of their older colleagues in terms of modern programming savvy, since those colleagues are sometimes prone to the FORTRAN++ coding style!
Once my report/paper/thesis/grant application is written I do not care about the software anymore.
Again this varies by field. Monte Carlo simulators for particle physics have a life well beyond any one project and in fact can be projects in themselves. In fact you are reading this page using a software technology developed at CERN to assist particle physics research - the world wide web. So even if you don't care about it anymore, sometimes software developed for research can be amazingly useful outside that research.
Not sure what the intent of this article is, since academic research already uses a lot of open source software, far beyond its use in industry. Knowing how to navigate a POSIX system is practically a requirement. Researchers also produce a lot of open source software. In my experience, software mostly falls into two categories: quick, hacked together scripts for analyzing data in a specific way, and complex simulations. The quick scripts generally aren't shared because it would take just as long to explain them to someone else as to rewrite them, and making a manual is simply a waste of time. But the quick scripts are written in a high level language which promotes the sharing of snippets of code, like math functions, commonly used analysis, plotting routines... The workstations at a lab are networked together and typically these little snippets get shared around the work group, and seem to find their way to other groups through collaborations and stuff.
Simulation codes are usually written in FORTRAN and are always distributed in source code form, because workstations have diverse architectures and typically a user will have to modify the program to fit his or her needs. Nobody cares about licenses and such, though you should probably include the code author in your coauthor list of a paper you publish using the code.
IS the GRANDFATHER and GRANDMOTHER of 'open source' and largely to this very day ... unpaid.
Why do C, C++ and UNIX exist, and all the technologies therein exist and flourish? Open Exchange.
Remember Mosaic ?
Remember Netscape ?
Remember Spark ?
OH ! All of those pesky ... US Patents !
Want to 'Walk The Internet' ? in first person ? in real time ? Over the Earth city by city and country by country ?
It was done ! And YOU had nothing to do with THAT !
Feeling Posh Posh ... Angry ... Bitter ... Cheated are you ... possums ? :)
"Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!"
The big fish (and I've worked for them) don't, and it's likely they got that way by protecting their turf. Science is cut throat.
46 & 2
Isn't science about peer review and repeatability of any test settings? How can you do that without source code?
Things like this are what separates real academic science from secret proprietary commercial "science".
This should be one pillar of cleaning up science alongside killing the dirty secretive for profit journals. Fortunately I believe both things are already underway.
This guy, who wrote an extremely useful and powerful piece of OSS software that is widely used in the graphics community, said it very well in his blog:
http://meshlabstuff.blogspot.com/2010/03/assessing-open-source-software-as.html/
Basically, you are an idiot if you invest any time at all in such things. Papers are all that count. OSS software? You wrote something that hundreds of other researchers depend on for their daily work? Get lost, that professorship goes to someone else. Someone else who was a Real Man, and wrote Papers! Lots of them!
I've developed quite a bit of code in the process of my PhD that I'm in the process of open-sourcing on github (backed with a website here I've developed with open-access scientific protocols - no code there yet; getting clearance).
As others have mentioned the big anti- to this is the problem of publication. If I put my software up there free to use, there is nothing to stop someone else swooping in and using it to pre-empt the results I've spent time writing the software to accomplish (I'm helped slightly by working an obscure angle on an equally obscure field). Further, opening software up to outside contributions opens all sorts of issues with authorship, credit, etc. As it stands I can publish a paper on the software with my name on it - but if I had 20 or so contributors are they all going to want their names on there? All solvable problems - there is typically a threshold of contribution for acknowledgement; and the fact of contribution is preserved right there in the git log. But not something most people want to spend time thinking about.
There is also a slight over-enthusiasm for patents - the first reaction I get to showing off my software is "you should patent it!" on the idea that I would get stinking rich. That's unrealistic for most software, but open-sourcing it immediately scuppers that as a future possibility. When you're funded via various grant organisations it can get more complex (everyone has to agree). I'm lucky enough to be funded by the Wellcome Trust who are pro open-access - and I'm hoping that will translate to pro open-source too.
Python coder | PyQt Applications | Writer
We release all of our code (if not written for a company) used in our projects, or created during research. However, we are a software engineering group, and computer scientists more often open source their work. In recent years it has become mandatory to do so, as otherwise your claims are not backed. If you publish results and do not provide the means to reproduce the results, you are a blabber.
But most of the code produced in research is lousy, as it is written just to prove something, not to actually be used. If you want to produce better quality code and documentation, the artifacts must be used by others and you have to start to incorporate agile concepts to develop the tools.
...but writing OSS software for education and academia and distributing it to other universities around the country and world has been my job for the last few years.
http://www.xkcd.com/354/
http://www.pdfernhout.net/open-letter-to-grantmakers-and-donors-on-copyright-policy.html
"Summary: Foundations, other grantmaking agencies handling public tax-exempt dollars, and charitable donors need to consider the implications for their grantmaking or donation policies if they use a now obsolete charitable model of subsidizing proprietary publishing and proprietary research. In order to improve the effectiveness and collaborativeness of the non-profit sector overall, it is suggested these grantmaking organizations and donors move to requiring grantees to make any resulting copyrighted digital materials freely available on the internet, including free licenses granting the right for others to make and redistribute new derivative works without further permission. It is also suggested patents resulting from charitably subsidized research also be made freely available for general use. The alternative of allowing charitable dollars to result in proprietary copyrights and proprietary patents is corrupting the non-profit sector as it results in a conflict of interest between a non-profit's primary mission of helping humanity through freely sharing knowledge (made possible at little cost by the internet) and a desire to maximize short term revenues through charging licensing fees for access to patents and copyrights. In essence, with the change of publishing and communication economics made possible by the wide spread use of the internet, tax-exempt non-profits have become, perhaps unwittingly, caught up in a new form of "self-dealing", and it is up to donors and grantmakers (and eventually lawmakers) to prevent this by requiring free licensing of results as a condition of their grants and donations."
A 21st century issue: the irony of technologies of abundance in the hands of those still thinking in terms of scarcity.