Open Data Needs Open Source Tools
macslocum writes "Nat Torkington begins sketching out an open data process that borrows liberally from open source tools: 'Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things. From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to. So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?'"
Is Eclipse not open source?
The organizational challenges are likely a nasty morass of situation-specific oddities, special cases, and unexpectedly tricky personal politics; but OSS technology has clear application.
Most of the large and notable OSS programs are substantially sized codebases distributed and developed across hundreds of different locations. If only by sheer necessity, OSS revision control tools are up to the challenge. That won't change the fact that gathering good data about the real world is hard; but it will make managing a big dataset with a whole bunch of contributors and keeping everything in sync a whole lot easier. Any of the contemporary (i.e. post-CVS, distributed) revision control systems could do that easily enough. Plus, you get something resembling a chain of provenance (at least once the data enter the system) and the ability to filter out commits from people who you think are unreliable.
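To make the provenance point concrete, here is a minimal Python sketch. It is not tied to any real revision control system; the `Change` record, the contributor names, and the `trusted_view` helper are all hypothetical. It only shows the idea of replaying a change history while skipping contributors you don't trust, and keeping the author attached to every value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One hypothetical change to a dataset, as a revision log might record it."""
    author: str      # who submitted the change
    record_id: str   # which record it touches
    value: str       # the new value for that record


def trusted_view(changes, distrusted):
    """Replay changes in order, skipping any author listed in `distrusted`.

    Returns {record_id: (value, author)} so each value keeps its provenance.
    """
    dataset = {}
    for change in changes:
        if change.author in distrusted:
            continue  # filter out contributors we consider unreliable
        dataset[change.record_id] = (change.value, change.author)
    return dataset


# Made-up history for illustration only.
history = [
    Change("alice", "station-1", "12.5"),
    Change("mallory", "station-1", "99.9"),
    Change("bob", "station-2", "7.1"),
]

view = trusted_view(history, distrusted={"mallory"})
for record_id, (value, author) in sorted(view.items()):
    print(record_id, value, "from", author)
```

A real system would do this at the level of commits rather than single values, but the principle is the same: provenance only helps if it travels with the data.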
it can get bugs fixed or at least identified much faster (many eyes),
So then why were there all those buffer overflow issues and null pointer issues in the Linux kernel before Coverity ran its scan on the code? Why did that Debian SSH bug exist for over 2 years if this is true?
A perfect example of collaboration with a massive dataset:
http://www.openstreetmap.org/
What if we ran an open data project like an open source project? What would this look like?
Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.
People could start by documenting their data in standardized formats, like DDI 3.
Anyone here use Madagascar?
http://www.reproducibility.org/
Interesting problem. Several things come to mind:
1) The Pragmatic tip "Keep knowledge in Plain Text" (from the Pragmatic Programmer book, which also brought us DRY). You can argue whether XML, JSON, etc. are considered 'plain text', but the spirit is simple - data is open when it is usable.
2) tools like diff and patch. If you make a change, you need to be able to extract that change from the whole and give it to other people.
3) Version control tools to manage the complexity of forking, branching, merging, and otherwise dealing with all the many little 'diffs' people will create. Git is an awesome decentralized tool for this.
4) Open databases. Not just SQL databases like Postgres and MySQL, but other database types for other data structures like CouchDB, Mulgara, etc.
All of these things come with the power to help address this problem, but they also come with a barrier to entry, in that their use requires skill not just in the tool but in the problem space of 'data management'.
The problem of data management, as well as the job of pointing to one set as 'canonical', should be in the hands of someone capable of doing the work. Perhaps there is a skillset worth defining here - some offshoot of library sciences?
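As a small illustration of points 1 and 2, here is a sketch using Python's standard `difflib` module on a plain-text CSV. The file names and town records are made up; the point is only that a change to plain-text data can be extracted as a small diff and handed to someone else.

```python
import difflib

# Two versions of a hypothetical plain-text dataset, one line per record.
old = [
    "id,name,population\n",
    "1,Springfield,30000\n",
    "2,Shelbyville,25000\n",
]
new = [
    "id,name,population\n",
    "1,Springfield,30500\n",
    "2,Shelbyville,25000\n",
    "3,Ogdenville,12000\n",
]

# Extract the change from the whole, in unified diff form.
patch = list(difflib.unified_diff(
    old, new, fromfile="towns.csv (old)", tofile="towns.csv (new)"))

# Separate record-level additions and removals from the file headers.
added = [line for line in patch
         if line.startswith("+") and not line.startswith("+++")]
removed = [line for line in patch
           if line.startswith("-") and not line.startswith("---")]

print("".join(patch))
```

The same trick works badly on binary formats, which is a large part of why the plain-text advice matters for data.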
High-level: Save your differences from day to day, bittorrent those differences to others, merge back in differences from others. Low-level: OMG, we used different table-names.
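The "merge back in differences from others" step can be sketched as a three-way merge over keyed records. This is a simplified Python illustration, not how any particular tool does it: the `three_way_merge` function and the record keys are hypothetical, values are assumed never to be `None`, and it assumes both sides share the same keys and schema, which is exactly the low-level problem mentioned above.

```python
def three_way_merge(base, mine, theirs):
    """Merge two edited copies of a keyed dataset against their common ancestor.

    Each argument maps record key -> value; a missing key means "no such record".
    Returns (merged, conflicts), where conflicts lists keys that both sides
    changed in different ways. Conflicting keys keep their base value.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(key), mine.get(key), theirs.get(key)
        if m == t:
            chosen = m      # both sides agree (including both deleting it)
        elif m == b:
            chosen = t      # only their copy changed
        elif t == b:
            chosen = m      # only my copy changed
        else:
            conflicts.append(key)
            chosen = b      # leave the original until someone resolves it
        if chosen is not None:
            merged[key] = chosen
    return merged, conflicts


# Made-up records for illustration.
base = {"r1": "10", "r2": "20"}
mine = {"r1": "11", "r2": "20", "r3": "30"}
theirs = {"r1": "10", "r2": "25"}

merged, conflicts = three_way_merge(base, mine, theirs)
print(merged, conflicts)
```

If the two copies had renamed their columns or tables independently, none of the keys would line up and this would fail quietly, which is why agreeing on names has to come first.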
Shh.
I just don't think it is possible to build such a useful dataset. I work in parallel computing, from a theoretical scheduling perspective.
Every paper you see is interested in a slightly different model, which needs slightly different parameters or looks at slightly different metrics.
Although I would love to have a database that provides the instances of all those guys as well as their implementations and results, I do not believe it is going to happen. Since every scientist needs different parameters, they will all end up with different databases. This removes the point of having such a database to begin with.
However, it is obvious to me that we want the data that were used to generate the results to be available so that reviewers can have a look at them.
The NCBI has a lot of open data sets that they maintain and update regularly. My favorite is MEDLINE, a dataset of medical literature metadata (abstracts, titles, etc). Not quite open source, but available to researchers under a free (essentially) non-commercial attribution license.
There are good analogies between open source and open data. The key one is community participation. Large data sets will likely have problems and inconsistencies. These are going to be exposed by people using the data in odd and unexpected ways, so having a good mechanism for user feedback and improving the data is key, as is versioning and sane schema evolution.
There is a nice series on open source government data on the Freedom to Tinker blog:
http://www.freedom-to-tinker.com/
Having lots of eyes looking at code is no substitute for running tools like Coverity on your software, along with test-driven development. Humans can easily miss problems with code that a tool or smoke test can uncover.
Jesus was a compassionate social conservative who called individuals to sin no more.
What if we ran an open data project like an open source project? What would this look like?
Every time someone asked about the data, they'd get a reply of RTFM
Whenever someone didn't like the data they'd fork it with their own approved data
MS would issue a white paper saying why closed source data is better and cheaper
Every time someone announced some new data, RMS would yell "That's GNU!!!!!"
I'm a consultant - I convert gibberish into cash-flow.
What I've been saying for ages is that the biggest problems for the open data movement are mostly found inside Government agencies. Until the open data promoters can establish a cohesive pitch, based around solving goals for the agency in question, then these technical solutions are a waste of time. Nat's latest 'open source' model for open data will only excite those already sold on the idea.
Most of the people who need convincing as to why they should get on board the open data train, need to be sold on the benefits to *them*, not the benefits to the technical community.
The real problem is the lack of a standardized language between different scientists and agencies. It's really up to the funding sources (such as the NCI) to come up with the standards; otherwise you end up with standards that, while technically better, only a few follow, e.g. chembank.broad.mit.edu. Further, having multiple "standards
There is already an extensive system in place for reviewing and communicating "open" data--peer reviewed publication. If you want to ensure that your data, analysis, and conclusions are part of the collective memory, then publish it in plain language (probably English). "If it isn't published, you didn't do it."
One of the biggest problems is that these datasets are often very large, causing bottlenecks with downloading the data as well as sharing results or variations of the data.
I noticed that BioTorrents is a new open source BitTorrent tracker aimed especially at sharing legal open access datasets and software.
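For anyone wondering why BitTorrent suits big datasets: the original protocol splits a file into fixed-size pieces and publishes a SHA-1 hash for each one, so a piece fetched from any peer can be checked on its own. Below is a rough Python sketch of that idea using only `hashlib`. It is not BioTorrents code, the helper names are hypothetical, and the tiny piece size is chosen just to keep the example small.

```python
import hashlib


def piece_hashes(data: bytes, piece_size: int):
    """Split data into fixed-size pieces and return the SHA-1 hex digest of each."""
    return [
        hashlib.sha1(data[start:start + piece_size]).hexdigest()
        for start in range(0, len(data), piece_size)
    ]


def verify_piece(piece: bytes, index: int, hashes) -> bool:
    """Check one downloaded piece against the published hash list."""
    return hashlib.sha1(piece).hexdigest() == hashes[index]


# Stand-in for a large dataset: 10,000 bytes cut into 4,096-byte pieces.
dataset = b"abcdefghij" * 1000
hashes = piece_hashes(dataset, piece_size=4096)

good_piece = dataset[4096:8192]
bad_piece = b"X" + good_piece[1:]   # simulate corruption in transit

print(len(hashes), verify_piece(good_piece, 1, hashes), verify_piece(bad_piece, 1, hashes))
```

Because each piece is verified independently, a corrupted or interrupted download only costs you one piece rather than the whole multi-gigabyte file.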
Isn't this what http://sciencecommons.org/ is all about: Freeing data to open up collaboration and revive the sexiness that science is!
If it were open source data, after a while would it have more eye-candy and little added functionality, and would the mailing list be flooded with flame wars over meaningless minutiae? Or not?
What if we ran an open data project like an open source project? What would this look like?
Wikipedia. With all the inherent problems of self-proclaimed authorities
Who do not have commit access.
That is one of the keys to running an open source project well: you, being the giant with some source code, let everybody stand on your shoulders so they can see farther. And you let others stand on their shoulders so they can see even farther still.
But you don't let just about anyone become part of your shoulders. Especially not if that would weaken your shoulders (i.e. bad code or citation-free encyclopaedia entries).
That's the difference between Open Source projects and the Wikipedia project: Wikipedia lets the midgets stand on the shoulders of the giant, even if that makes the giant shorter rather than taller. Well-run open source projects don't let that happen. And poorly run open source projects don't exist due to survivor bias ;-)
Semantic Web technologies (in particular RDF, a graph-structured data format) are ideally suited for publishing data. Also, these technologies facilitate the integration of separate pieces of information; integration is what you want to do if thousands of people start publishing structured data. Linked Data (RDF using HTTP URIs to identify things) is already used by the NYT and the UK government to publish data online.
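To show why the graph model makes integration easy, here is a toy Python sketch that represents RDF-style triples as plain tuples. It uses no RDF library, and the `http://example.org/` URIs, the data, and the `objects` helper are all made up. When two publishers use the same URI for the same thing (and there are no blank nodes), combining their data is just a set union.

```python
EX = "http://example.org/"  # hypothetical namespace for illustration

# Publisher A: basic facts about a city, as (subject, predicate, object) triples.
graph_a = {
    (EX + "city/1", EX + "name", "Springfield"),
    (EX + "city/1", EX + "population", "30500"),
}

# Publisher B: different facts about the same city, identified by the same URI.
graph_b = {
    (EX + "city/1", EX + "name", "Springfield"),
    (EX + "city/1", EX + "mayor", EX + "person/7"),
    (EX + "person/7", EX + "name", "J. Quimby"),
}

# Integration is a set union; the duplicate triple collapses automatically.
merged = graph_a | graph_b


def objects(graph, subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Follow a link across what used to be two separate datasets.
mayor = objects(merged, EX + "city/1", EX + "mayor")[0]
print(objects(merged, mayor, EX + "name"))
```

The hard part in practice is getting publishers to agree on the URIs and predicates, not the merge itself.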
Looked at the CKAN software (www.ckan.net)? They run their own knowledge archive, and the software also powers the UK data.gov.uk site. RESTful API and Python client.
Open source encourages laziness (because there are 1mil others out there who can fix it later/better, so good enuf is enuf for now), it can get /interesting/ bugs fixed or at least identified much faster (many eyes), it promotes collaboration /in a clique, outside of which you just get told to 'fix it yourself'/, and it's a terrible training ground for skills development as there is just code, no doco.
The main point of the OpenDAP project is to facilitate remote collaboration on data, and there are already a few organizations that use it to share data. I've used the Python variant for NetCDF files and found it pretty happy, and the web interface is clean. The best part of the OpenDAP project is probably that the data doesn't need to be downloaded/copied to be processed, which is really important for anyone who can't afford the racks of hard drives some of these datasets need.
open source modern art: laser taggi
Is what we do on the fusor forum for amateur high energy scientists. It's not perfect, but we basically share in the same manner as open source software all that we do, and it's working fine for us. We help the newbies when we can, or tell them to search the extensive archives for when that question has been asked and answered before, post data, pictures of our gear and all that. It's a good crowd, but a small site, so don't all go there at once....it won't take it and this isn't funded by some large outfit, it's just our own money. Real names are universally used there -- this site is for real work, not kiddie flame wars. There's not much moderation, but jerks lose the ability to log in quickly. Here is the open source fusor forum for you to check out. This is mostly a bunch of old guys having some fun, and helping some new guys get into the game. All sorts of advice and data shared openly and all in one place. Far from perfect, but a good start, I'd say. Check out the "recent threads" link which is as close to slashdot format as it gets on that site.
Why guess when you can know? Measure!
Perhaps /. could lead the way by providing an open database of their stories and comments (license changes would be needed with opt-out).
Then again, I might just think that because I'd rather have a different interface to the same info rather than the one I'm stuck with.
They lost me when I read "Open source discourages laziness (because everyone can see the corners you've cut)".
Whoever said that hasn't seen a lot of open source GUIs lately. Then they had the nerve to say open source products make bugs more likely to be identified because more people are looking at them. But how many of those people know what they're looking at? And is the core group, that knows what they're looking at, any bigger than some for-profit's programming team?
Where did it come from, and what is it supposed to represent?
It's probably just cause I'm an electronics geek with a fondness for "hollow state", but that thing sure looks like the business end of a "magic eye tube" to me.
For those who have no idea what a magic eye tube is:
http://www.magiceyetubes.com/eye02.jpg
http://en.wikipedia.org/wiki/Magic_eye_tube
Collaboration, archiving, openness, trolls.