Slashdot Mirror


Open Data Needs Open Source Tools

macslocum writes "Nat Torkington begins sketching out an open data process that borrows liberally from open source tools: 'Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things. From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to. So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?'"

20 of 62 comments

  1. eclipse? by toastar · · Score: 3, Informative

    Is Eclipse not open source?

    1. Re:eclipse? by Monkeedude1212 · · Score: 3, Informative

      Who modded him offtopic?
      Eclipse has an open source Data Tools Platform

  2. Well... by fuzzyfuzzyfungus · · Score: 2, Insightful

    The organizational challenges are likely a nasty morass of situation-specific oddities, special cases, and unexpectedly tricky personal politics; but OSS technology has clear application.

    Most of the large and notable OSS programs are substantially sized codebases distributed and developed across hundreds of different locations. If only by sheer necessity, OSS revision control tools are up to the challenge. That won't change the fact that gathering good data about the real world is hard; but it will make managing a big dataset with a whole bunch of contributors and keeping everything in sync a whole lot easier. Any of the contemporary (i.e. post-CVS, distributed) revision control systems could do that easily enough. Plus, you get something resembling chain of provenance (at least once the data enter the system) and the ability to filter out commits from people who you think are unreliable.
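
A minimal sketch of that last point, assuming a toy commit log rather than any real revision control system: replaying the history while skipping distrusted authors gives both a filtered dataset and a per-record note of who last set each value. The `Commit` class, the `apply_commits` helper, and the sample names and values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author: str   # who made the change (hypothetical contributor name)
    key: str      # identifier of the record being set
    value: float  # new value for that record


def apply_commits(commits, distrusted):
    """Replay commits in order, skipping authors listed in `distrusted`.

    Returns (dataset, provenance), where provenance maps each record key
    to the author of the commit that last set it.
    """
    dataset, provenance = {}, {}
    for commit in commits:
        if commit.author in distrusted:
            continue  # filter out contributors we consider unreliable
        dataset[commit.key] = commit.value
        provenance[commit.key] = commit.author
    return dataset, provenance


# Made-up history for illustration only.
history = [
    Commit("alice", "station-1", 12.5),
    Commit("mallory", "station-1", 99.0),
    Commit("bob", "station-2", 7.25),
]
data, who = apply_commits(history, distrusted={"mallory"})
```

A real distributed revision control system records far more (parents, timestamps, signatures); the sketch only shows why having an author attached to every change makes this kind of filtering possible.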

  3. Open Street Map by Anonymous Coward · · Score: 3, Informative

    A perfect example of collaboration with a massive dataset:

    http://www.openstreetmap.org/

  4. Already being done by kiwimate · · Score: 4, Insightful

    What if we ran an open data project like an open source project? What would this look like?

    Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

    1. Re:Already being done by viralMeme · · Score: 3, Informative

      > Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about ..

      Wikipedia isn't an open source project, it's an online collaborative encyclopedia. Mediawiki on the other hand is the software project that powers Wikipedia.

    2. Re:Already being done by mikael_j · · Score: 3, Insightful

      I don't think kiwimate was saying that Wikipedia is an open source project, just that Wikipedia is a great example of an open data project run like an open source project.

      /Mikael

      --
      Greylisting is to SMTP as NAT is to IPv4
    3. Re:Already being done by musicalmicah · · Score: 4, Insightful

      What if we ran an open data project like an open source project? What would this look like?

      Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

      Gee, you make it sound so terrible when you put it like that. It also happens to be an amazing source of information and the perfect resource for an initial foray into any research topic. It's a shining example of what happens when huge amounts of people want to share their knowledge and time with the world. Sure, it's got a few flaws, but in the grand scheme of things, it has made a massive body of information ever more accessible and usable.

      Moreover, I've seen all the flaws you've listed in closed collaborative projects as well. Like all projects, Wikipedia is both a beneficiary and a victim of human nature.

    4. Re:Already being done by Hurricane78 · · Score: 3, Interesting

      I've said this a thousand times before: Make Wikipedia a P2P project without a single point of control, and build a cascading network of trust relationships on top of it (think CSS rules, but on articles instead of elements, and one CSS file per user, perhaps including those of others), and you solve all problems with the then non-existent central authorities, and so also with censorship.

      The only caveat: People have to learn again whom to trust and whom not to. (Example of where this fails: political parties and other groups with advanced social engineering / rhetoric / mass psychology skills, like marketing companies.)
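
One way to read the "CSS file per user" idea, as a rough sketch and not a description of any existing system: each user has a trust file that may include other users' files, included rules are applied first, and the user's own rules override them. The file layout, `resolve_trust`, `pick_revision`, and the 0.5 threshold are all assumptions made up for illustration.

```python
def resolve_trust(user, trust_files, seen=None):
    """Return {editor: score} for `user`.

    Included files are applied first, in listed order; the user's own
    rules are applied last, so they override anything inherited.
    """
    seen = set() if seen is None else seen
    if user in seen:  # guard against include cycles
        return {}
    seen.add(user)
    entry = trust_files.get(user, {})
    merged = {}
    for other in entry.get("include", []):
        merged.update(resolve_trust(other, trust_files, seen))
    merged.update(entry.get("rules", {}))
    return merged


def pick_revision(revisions, scores, threshold=0.5):
    """Pick the latest revision whose editor meets the trust threshold."""
    for editor, text in reversed(revisions):
        if scores.get(editor, 0.0) >= threshold:
            return text
    return None


# Hypothetical trust files: dave includes carol's rules but overrides one.
trust_files = {
    "carol": {"include": [], "rules": {"editor_x": 0.9, "editor_y": 0.2}},
    "dave": {"include": ["carol"], "rules": {"editor_y": 0.8}},
}
# Revisions of one article, oldest first, as (editor, text) pairs.
revisions = [("editor_x", "v1"), ("editor_y", "v2")]
```

With these made-up files, carol and dave see different versions of the same article, which is the point of the proposal: the view depends on whose rules you cascade.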

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    5. Re:Already being done by wastedlife · · Score: 2, Insightful

      Unlike most open source projects, Wikipedia accepts anonymous contributions and then immediately publishes them without review or verification. That seems like a very strong difference to me.

      --
      Said, "It's just like dice but it's got more sides And it tells me who lives and who dies"
    6. Re:Already being done by lennier · · Score: 2, Interesting

      I've said this a thousand times before: Make Wikipedia a P2P project without a single point of control, and build a cascading network of trust relationships on top of it (think CSS rules, but on articles instead of elements, and one CSS file per user, perhaps including those of others), and you solve all problems with the then non-existent central authorities, and so also with censorship.

      I agree wholeheartedly. If I understand correctly, this is very like what David Gelernter is saying with his datasphere/lifestreams concept: a fully distributed system with no centre where any node can absorb and retransmit its own view of the data universe. Twitter and 'retweets' is a sort of lame, struggling, misbegotten attempt to shamble towards this idea.

      What would happen, I think, is that such a distributed Wikipedia would converge on a few 'trusted super-editors' who produced their own authorised versions - like Linux kernel forks or distributions - since the pressure to join a 'good enough' peer group would force forking to only happen where necessary. And yes, there'd probably emerge separate political factions: a Mainstream Wikipedia, a Citizendium, a Conservapedia, an Encyclopedia Dramatica, a UFOpedia, a Treknopedia, each of which has their own idea of what subjects are/are not 'noteworthy' or which sources are well-attested... but that's fine, we have that already. What we'd win in a truly distributed system is not the ability to fork (which the GPL already gives us) but the ability to easily remerge, which is currently a real pain.

      There's no reason, for instance, why Citizendium, TVTropes, Encyclopedia Dramatica, C2, MeatballWiki, etc all couldn't share the same technical base and content and link to and import/export from each other, and just provide different editorial policies or views. And I think we'd all win hugely if we could bring that about.

      --
      You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC
  5. Use Open Standards by The-Pheon · · Score: 4, Informative

    People could start by documenting their data in standardized formats, like DDI 3.

  6. Open data needs open data structure and owner by bokmann · · Score: 4, Insightful

    Interesting problem. Several things come to mind:

    1) The Pragmatic tip "Keep knowledge in Plain Text" (from The Pragmatic Programmer, the book that also brought us DRY). You can argue whether XML, JSON, etc are considered 'plain text', but the spirit is simple - data is open when it is usable.

    2) tools like diff and patch. If you make a change, you need to be able to extract that change from the whole and give it to other people.

    3) Version control tools to manage the complexity of forking, branching, merging, and otherwise dealing with all the many little 'diffs' people will create. Git is an awesome decentralized tool for this.

    4) Open databases. Not just SQL databases like Postgres and MySQL, but other database types for other data structures like CouchDB, Mulgara, etc.

    All of these things come with the power to help address this problem, but come with a barrier to entry in that their use requires skill not just in the tool, but in the problem space of 'data management'.

    The problem of data management, as well as the job to point to one set as 'canonical', should be in the hands of someone capable of doing the work. Perhaps there is a skillset worth defining here - some offshoot of library sciences?
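
Points 1 and 2 above can be sketched together: when a dataset is kept as plain text, a change can be pulled out as a diff and handed to someone else. The example uses Python's standard `difflib` in place of the command-line `diff`; the CSV rows and file labels are made up.

```python
import difflib

# Two copies of a small plain-text dataset (made-up rows).
original = [
    "id,city,population\n",
    "1,Alphaville,30000\n",
    "2,Betatown,25000\n",
]
updated = [
    "id,city,population\n",
    "1,Alphaville,30720\n",
    "2,Betatown,25000\n",
    "3,Gammaburg,12000\n",
]

# Unified diff: the same format `diff -u` produces and `patch` consumes.
patch = list(
    difflib.unified_diff(
        original,
        updated,
        fromfile="towns.csv (upstream)",
        tofile="towns.csv (mine)",
    )
)

# Keep only the added/removed rows, dropping the ---/+++ file headers.
changed = [
    line
    for line in patch
    if line[0] in "+-" and not line.startswith(("+++", "---"))
]

print("".join(patch))
```

The same change against a binary spreadsheet or a database dump would be much harder to extract, which is the practical argument for plain text.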

    1. Re:Open data needs open data structure and owner by GrantRobertson · · Score: 5, Informative

      Perhaps there is a skillset worth defining here - some offshoot of library sciences?

      That offshoot is called "Information Science." Most "Library Science" programs now call themselves "Library and Information Science" programs. There is now even a consortium of universities that call themselves "iSchools." In my preliminary research while looking for a graduate program in "Information Science" it seems as if the program at Berkeley has gone the farthest in getting away from the legacy "Library Science" and moving toward a pure "Information Science" program.

      I personally think that the field of "Information Science" is really where we are going to find the next major improvements in the ability of computers to actually impact our daily lives. We need entirely new models of how to look at dynamic, "living" data and track changes not only to the data but to the schema and provenance of that data. That is how "data" becomes "information" and then "knowledge." I won't write my doctoral thesis here, but suffice it to say that simply squeezing data into a decades-old model of software version control is not quite going to cut it. In software version control you don't have as much of a trust problem. Yes, you do care if someone inappropriately copies code from a proprietary or differently-licensed source. However, you don't have as much incentive for people to intentionally fudge the code/data one way or another. In addition, data can be legitimately manipulated, transformed, and summarized to harvest that "information" out of the raw numbers. This does not happen with code. Yes, there is refactoring, but with code it is not as necessary to document every minute change and how it was arrived at. With data, the equations and algorithms used for each transformation need to be recorded along with the new dataset, as do the reason for those transformations and the authority of those who did them.
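
A minimal sketch of what recording each transformation alongside the new dataset could look like, under the assumption that a dataset is just a list of numbers. The `Step` and `Dataset` classes, the analyst names, and the readings are hypothetical; nothing here is an existing standard or library.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Step:
    description: str   # the algorithm or equation that was applied
    reason: str        # why the transformation was done
    performed_by: str  # who did it (their "authority")


@dataclass
class Dataset:
    values: list
    history: list = field(default_factory=list)

    def transform(self, func, description, reason, performed_by):
        """Return a new Dataset carrying the extended history.

        The input dataset is left untouched, so earlier versions
        remain available for comparison.
        """
        step = Step(description, reason, performed_by)
        return Dataset(func(self.values), self.history + [step])


# Made-up readings for illustration.
raw = Dataset([10.0, 12.0, 400.0, 11.0])
cleaned = raw.transform(
    lambda v: [x for x in v if x < 100],
    description="drop readings >= 100",
    reason="sensor saturates at 100",
    performed_by="analyst-a",
)
summary = cleaned.transform(
    lambda v: [mean(v)],
    description="arithmetic mean",
    reason="daily summary",
    performed_by="analyst-b",
)
```

Every derived dataset then answers "how was this number arrived at, why, and by whom" without consulting anything outside the dataset itself.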

      Throw into the mix that there will be many different sets of similar data gathered about the same phenomena but with slightly different schemas and different actual data points which will all have different provenances but will need to be manipulated in ways to bring their models into forms that are parallel to all the other data sets associated with those phenomena while still tracking how they are different ... and you will see that we don't just need a different box to think outside of, we need an entirely different warehouse. (You know, the place where we store the boxes, outside of which we will do our thinking.)

      Many of the suggestions posted here are a start, but only a start.

  7. Standards by Domain needed. by headkase · · Score: 3, Interesting

    High-level: Save your differences from day to day, bittorrent those differences to others, merge back in differences from others. Low-level: OMG, we used different table-names.
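
A toy illustration of both levels, assuming records are plain dictionaries: names are first mapped onto a shared vocabulary, then the two copies are merged and any disagreements are reported. The `normalize` and `merge` helpers, the alias table, and the sample records are made up.

```python
def normalize(record, aliases):
    """Rename keys in `record` to canonical names using `aliases`."""
    return {aliases.get(k, k): v for k, v in record.items()}


def merge(mine, theirs, key):
    """Union two record lists by `key`.

    On conflict the local record wins, and the key is reported so a
    human can look at it.
    """
    merged = {r[key]: r for r in theirs}
    conflicts = []
    for record in mine:
        other = merged.get(record[key])
        if other is not None and other != record:
            conflicts.append(record[key])
        merged[record[key]] = record
    return list(merged.values()), conflicts


# The other group used different field names for the same things.
aliases = {"stn": "station", "temp_c": "temperature"}
mine = [{"station": "A", "temperature": 1.0}]
theirs_raw = [{"stn": "A", "temp_c": 1.5}, {"stn": "B", "temp_c": 2.0}]

theirs = [normalize(r, aliases) for r in theirs_raw]
records, conflicts = merge(mine, theirs, key="station")
```

Renaming is the easy case; the sketch does nothing for datasets whose tables are organized around different concepts.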

    --
    Shh.
    1. Re:Standards by Domain needed. by oneiros27 · · Score: 2, Insightful

      You're assuming that the differences are something that someone can keep up with in real time. If someone makes a change in calibration that results in a few months' worth of data changing, it might take weeks or even months to catch up (as you're still trying to deal with the ingest of the new data at the same time). As for bittorrent, p2p is banned in some federal agencies -- and as such, we haven't had a chance to find out how well it scales to dealing with lots (10s of millions) of small files (1 to 16MB).

      As for the low-level issues -- it's not even close. The problem is that people build their catalogs to handle the type of science they want to do; they often don't revolve around the same concepts, and they might have one or thousands of tables. See my talk Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs from the 2008 American Geophysical Union.

      I've been working for years with people who want to search the data from the systems I maintain, but the ways that they want me to describe the data to make it searchable aren't easy to define -- even terms like 'instrument' mean something different between their system and mine. (And I have a paper submitted for the Journal of Library Metadata's special 'eScience' issue, dealing with issues in terminology and other problems that the library field doesn't typically run into, but we have to deal with in science informatics.)

      Disclaimer : If it's not apparent from the message, I work in this field.

      --
      Build it, and they will come^Hplain.
  8. Parent not a troll. by aristotle-dude · · Score: 3, Informative

    Having lots of eyes looking at code is no substitute for using tools like Coverity on your software along with test-driven development. Humans can easily miss problems with code that a tool or smoke test can uncover.

    --
    Jesus was a compassionate social conservative who called individuals to sin no more.
  9. Wikipedia == Anarchy != Open Source by jonaskoelker · · Score: 2, Insightful

    What if we ran an open data project like an open source project? What would this look like?

    Wikipedia. With all the inherent problems of self-proclaimed authorities

    Who do not have commit access.

    That is one of the keys to running an open source project well: you, being the giant with some source code, let everybody stand on your shoulders so they can see farther. And you let others stand on their shoulders so they can see even farther still.

    But you don't let just about anyone become part of your shoulders. Especially not if that would weaken your shoulders (i.e. bad code or citation-free encyclopaedia entries).

    That's the difference between Open Source projects and the Wikipedia project: Wikipedia lets the midgets stand on the shoulders of the giant, even if that makes the giant shorter rather than taller. Well-run open source projects don't let that happen. And poorly run open source projects don't exist due to survivor bias ;-)

  10. Metadata handling with CKAN by Bazman · · Score: 2, Informative

    Looked at the CKAN software (www.ckan.net)? They run their own knowledge archive, and the software also powers the UK data.gov.uk site. RESTful API and Python client.

  11. OpenDAP by story645 · · Score: 2, Informative

    The main point of the OpenDAP project is to facilitate remote collaboration on data, and there are already a few organizations that use it to share data. I've used the Python variant for NetCDF files and been pretty happy with it, and the web interface is clean. The best part of the OpenDAP project is probably that the data doesn't need to be downloaded/copied to be processed, which is really important for anyone who can't afford the racks of hard drives some of these datasets need.

    --
    open source modern art: laser taggi