Open Data Needs Open Source Tools
macslocum writes "Nat Torkington begins sketching out an open data process that borrows liberally from open source tools: 'Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things. From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to. So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?'"
Is Eclipse not open source?
The organizational challenges are likely a nasty morass of situation-specific oddities, special cases, and unexpectedly tricky personal politics; but OSS technology has clear application.
Most of the large and notable OSS programs are substantially sized codebases distributed and developed across hundreds of different locations. If only by sheer necessity, OSS revision control tools are up to the challenge. That won't change the fact that gathering good data about the real world is hard; but it will make managing a big dataset with a whole bunch of contributors and keeping everything in sync a whole lot easier. Any of the contemporary (i.e. post-CVS, distributed) revision control systems could do that easily enough. Plus, you get something resembling a chain of provenance (at least once the data enter the system) and the ability to filter out commits from people who you think are unreliable.
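To make the provenance and filtering point concrete, here is a minimal Python sketch. It is not how any particular revision control system works internally; the `Change` record, the `apply_trusted` helper, and the sample station readings are all hypothetical names invented for illustration. It only shows that once every change carries an author, skipping contributors you distrust and recording who last touched each record is straightforward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One submitted edit to the dataset (hypothetical structure)."""
    author: str      # who submitted the change
    record_id: str   # which record it touches
    value: str       # the new value for that record


def apply_trusted(base, changes, distrusted):
    """Apply changes in order, skipping any from distrusted authors.

    Returns the resulting data plus a provenance map that records
    which author last modified each changed record.
    """
    data = dict(base)
    provenance = {}
    for change in changes:
        if change.author in distrusted:
            continue  # filter out commits from unreliable contributors
        data[change.record_id] = change.value
        provenance[change.record_id] = change.author
    return data, provenance


# Made-up sample data, purely for illustration.
base = {"station-1": "12.5"}
changes = [
    Change("alice", "station-2", "9.1"),
    Change("mallory", "station-1", "999"),
    Change("bob", "station-1", "12.7"),
]

data, provenance = apply_trusted(base, changes, distrusted={"mallory"})
print(data)
print(provenance)
```

The provenance here only starts once a change enters the system, which is the same caveat as above: nothing in the tooling can vouch for how the original measurement was gathered.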
A perfect example of collaboration with a massive dataset:
http://www.openstreetmap.org/
What if we ran an open data project like an open source project? What would this look like?
Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.
People could start by documenting their data in standardized formats, like DDI 3.
Interesting problem. Several things come to mind:
1) The Pragmatic tip "Keep knowledge in Plain Text" (from The Pragmatic Programmer, the book that also brought us DRY). You can argue whether XML, JSON, etc. count as 'plain text', but the spirit is simple - data is open when it is usable.
2) tools like diff and patch. If you make a change, you need to be able to extract that change from the whole and give it to other people.
3) Version control tools to manage the complexity of forking, branching, merging, and otherwise dealing with all the many little 'diffs' people will create. Git is an awesome decentralized tool for this.
4) Open databases. Not just SQL databases like Postgres and MySQL, but other database types for other data structures like CouchDB, Mulgara, etc.
All of these things come with the power to help address this problem, but they also come with a barrier to entry: using them requires skill not just in the tool, but in the problem space of 'data management'.
The problem of data management, as well as the job of pointing to one set as 'canonical', should be in the hands of someone capable of doing the work. Perhaps there is a skillset worth defining here - some offshoot of library sciences?
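As a rough illustration of points 1 and 2, the sketch below uses Python's standard `difflib` module on a small plain-text CSV. The town names and figures are made up, and `towns.csv` is just a label for the diff header, not a real file. `unified_diff` produces the kind of change you could hand to someone else, and `ndiff` plus `restore` shows that either version can be recovered from the delta alone.

```python
import difflib

# Two versions of a small plain-text dataset (made-up sample rows).
old = [
    "id,name,population\n",
    "1,Springfield,30000\n",
    "2,Shelbyville,25000\n",
]
new = [
    "id,name,population\n",
    "1,Springfield,30720\n",
    "2,Shelbyville,25000\n",
    "3,Ogdenville,8000\n",
]

# Extract just the change, in the familiar unified diff format.
patch = list(difflib.unified_diff(old, new, fromfile="towns.csv", tofile="towns.csv"))
print("".join(patch))

# ndiff keeps enough context to rebuild either side from the delta.
delta = list(difflib.ndiff(old, new))
rebuilt_new = list(difflib.restore(delta, 2))
rebuilt_old = list(difflib.restore(delta, 1))
```

This only works this neatly because the data is line-oriented plain text, which is the argument for point 1 in the first place.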
High-level: Save your differences from day to day, bittorrent those differences to others, merge back in differences from others. Low-level: OMG, we used different table-names.
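The high-level half can be sketched in a few lines of Python, under one big assumption: everyone keys their records the same way, which is exactly what the low-level half says goes wrong in practice. The `diff` and `merge` helpers and the sample records are hypothetical, and values are assumed to be non-`None` strings so that `None` can mark a deletion.

```python
def diff(old, new):
    """Return {key: new_value} for added or changed keys; None marks a deletion."""
    changes = {k: v for k, v in new.items() if old.get(k) != v}
    changes.update({k: None for k in old if k not in new})
    return changes


def merge(base, mine, theirs):
    """Three-way merge of keyed records.

    Keys changed differently on both sides are reported as conflicts
    and keep their base value in the merged result.
    """
    merged = dict(base)
    conflicts = []
    ours, other = diff(base, mine), diff(base, theirs)
    for key in sorted(ours.keys() | other.keys()):
        if key in ours and key in other and ours[key] != other[key]:
            conflicts.append(key)
            continue
        value = ours[key] if key in ours else other[key]
        if value is None:
            merged.pop(key, None)   # deleted on one side
        else:
            merged[key] = value     # added or changed on one side
    return merged, conflicts


# Made-up sample records.
base = {"a": "1", "b": "2", "c": "3"}
mine = {"a": "1", "b": "20", "c": "3", "d": "4"}
theirs = {"a": "10", "b": "21"}

merged, conflicts = merge(base, mine, theirs)
print(merged)
print(conflicts)
```

Only the output of `diff` needs to travel between collaborators; how it gets there (BitTorrent or otherwise) is a separate question.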
Shh.
Having lots of eyes looking at code is no substitute for running tools like Coverity on your software along with test-driven development. Humans can easily miss problems with code that a tool or smoke test can uncover.
Jesus was a compassionate social conservative who called individuals to sin no more.
What if we ran an open data project like an open source project? What would this look like?
Wikipedia. With all the inherent problems of self-proclaimed authorities
Who do not have commit access.
That is one of the keys to running an open source project well: you, being the giant with some source code, let everybody stand on your shoulders so they can see farther. And you let others stand on their shoulders so they can see even farther still.
But you don't let just about anyone become part of your shoulders. Especially not if that would weaken your shoulders (i.e. bad code or citation-free encyclopaedia entries).
That's the difference between Open Source projects and the Wikipedia project: Wikipedia lets the midgets stand on the shoulders of the giant, even if that makes the giant shorter rather than taller. Well-run open source projects don't let that happen. And poorly run open source projects don't exist due to survivor bias ;-)
Looked at the CKAN software (www.ckan.net)? They run their own knowledge archive, and the software also powers the UK data.gov.uk site. RESTful API and Python client.
The main point of the OpenDAP project is to facilitate remote collaboration on data, and there are already a few organizations that use it to share data. I've used the Python variant for NetCDF files and found it pleasant to work with, and the web interface is clean. The best part of the OpenDAP project is probably that the data doesn't need to be downloaded or copied to be processed, which is really important for anyone who can't afford the racks of hard drives some of these datasets need.