Slashdot Mirror


The Importance — and Limits — of Very Large Data Sets

New submitter kodiaktau writes "A recently presented paper discusses how large data sets can improve learning algorithms, but points out that researchers still need to account for bias and incompleteness before drawing conclusions. The paper also goes into the need for responsible business practices to manage these data sets. 'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.' The full paper is available through SSRN. Of particular importance is their assertion that even huge data sets can and will be affected by filters or the analyst who is interpreting it. '[Study co-author Kate Crawford] notes that many big data sets — particularly social data — come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.'"

17 comments

  1. It's honest about a "main tenet" of stats by Anonymous Coward · · Score: 0

    "Big data sets are never complete," Crawford says (from source article -> http://www.technologyreview.com/computing/38775/page1/ ).

    APK

    P.S.=> Thus, you can NEVER, EVER have a perfect dataset, because you're never going to have every possible sampleable item, period...

    ... apk

  2. This is the most obv article ever on /. by Anonymous Coward · · Score: 0

    Ever.

    1. Re:This is the most obv article ever on /. by Anonymous Coward · · Score: 2, Funny

      How sure are you your data-set is adequate to make that determination?

    2. Re:This is the most obv article ever on /. by jsnipy · · Score: 1

      Did you know when you google 'google', you get google.com?

      --
      -- if you mod me down, I will become more powerful than you can possibly imagine
  3. There's lots of data by MadKeithV · · Score: 2

    There's lots of data to support this article.

  4. oh come on... by Anonymous Coward · · Score: 0

    oh come on... an article on data availability not being available with a simile url. Meh!

  5. This is a problem with most data! by garcia · · Score: 3, Insightful

    From the blurb:

    Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.'"

    Even if you're using data from public institutions you still may have to pay for it (to cover staff time to procure the data--especially if you're asking for something they don't normally provide, which is quite often). While there won't be any limitations on what you can do with the data once you have it, because of lack of knowledge of their own data/bases the provider may simply provide you with incomplete or likely inaccurate data anyway.

    So yeah, welcome to the world of using data. Move along, nothing to see here.

    1. Re:This is a problem with most data! by oneiros27 · · Score: 1

      And even if you collect it yourself, if you're at an educational institution, you likely have to comply with IRB (institutional review board) rules if it involves people.

      They often don't like you looking for certain types of patterns, or using the data in a way that might harm the people you're studying.

      There's medical privacy rules, general privacy rules, etc. And even when not dealing with people, there's lots of moral issues in how you use the data. (And there's moral issues in sharing data -- some groups don't want to reveal info about endangered species locations in too much detail, as it helps poachers. ... But if you have a dataset that resulted in the loss of lives to collect (maybe not intentionally), sharing it means people don't have to repeat the process to collect it.)

      Of course we're still coping with the issues of providing proper credit & attribution for data, and standards for publishing data so that it can be re-used. I've been to lots of meetings in the last year that covered those sorts of issues -- DCC, BRDI, RDAP, DataCite, etc.

      --
      Build it, and they will come^Hplain.
  6. At least there IS very large social data sets by G3ckoG33k · · Score: 2

    At least there IS very large social data sets.

    Most sociologists today tend to describe the world using 'deep' interviews of 36 people in the surroundings of the campus, because that way they will get the result they wish to get.

    A cynic description, yes, but not too far the truth. So, it is good to see there IS large data sets, somewhere.

    1. Re:At least there IS very large social data sets by Anonymous Coward · · Score: 1

      IS a set, ARE sets... Doesn't saying what you wrote out loud trigger any warning sirens? Also, are you trying for "a cynic's description" or "a cynical description" ?

      Also, it would be nice if you had ended "A cynic description, yes, but not too far the truth" with a rationalization. Like, "based on my experience as a graduate assistant working in the sociology dept" or "based on my own exhaustive research" or even "based on what the voices in my head are telling me." Just how far from the truth is "too far" ?

    2. Re:At least there IS very large social data sets by Anonymous Coward · · Score: 0

      Based on experiences reading sociological papers.

      Try http://scholar.google.se/scholar?hl=sv&q=sociology+deep+interview&as_ylo=2000&as_vis=0

    3. Re:At least there IS very large social data sets by Anonymous Coward · · Score: 0

      [needs citation]

  7. Nothing new for me. by Anonymous Coward · · Score: 0

    I work as a biostatistician and often analyse data from so-called next-generation sequencing technologies. The amount of data per biological sample from these fantastic machines is absurd - which makes the analysis a fun (computational) challenge in itself on an (almost) regular PC. A great example of this issue is - in my experience - that the biologists are hell-bent on getting more "data" per sample rather than less data per sample and more samples. This, in reality, lowers the signal-to-noise ratio instead of raising it. The money simply goes into using the newest (and most expensive) machine instead of using older and cheaper technologies and getting more samples. This is just yet another example showing that "more data" does not always equal more confident conclusions - it sometimes has to be the right kind of "more data".
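
    A rough sketch of the arithmetic behind this comment, assuming a simple two-level noise model that is not taken from the comment itself: every biological sample deviates from the true value with standard deviation sd_bio, and every read adds independent technical noise with standard deviation sd_tech. The function name and all numbers are made up for illustration. Under that assumption the variance of the overall mean is sd_bio^2/n + sd_tech^2/(n*m), so with a fixed total number of reads, adding samples shrinks both terms while adding reads per sample only shrinks the second.

```python
def variance_of_mean(n_samples, reads_per_sample, sd_bio, sd_tech):
    """Variance of the grand mean under the assumed two-level model.

    sd_bio  : spread between biological samples (hypothetical value)
    sd_tech : measurement noise of a single read (hypothetical value)
    """
    between = sd_bio ** 2 / n_samples
    within = sd_tech ** 2 / (n_samples * reads_per_sample)
    return between + within


# Same total budget of 4000 reads, split two different ways.
deep = variance_of_mean(n_samples=4, reads_per_sample=1000, sd_bio=1.0, sd_tech=1.0)
wide = variance_of_mean(n_samples=40, reads_per_sample=100, sd_bio=1.0, sd_tech=1.0)

print(f"few samples, many reads each: variance = {deep:.5f}")
print(f"many samples, fewer reads each: variance = {wide:.5f}")
```

    With these assumed numbers the "wide" design gives a much smaller variance, because the between-sample term dominates and only more samples can reduce it.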

  8. Forget about bias and incompleteness for a moment by tinkerton · · Score: 1

    This statement
    'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.'

    is not about bias and incompleteness. The person who is looking at the data needs to have the necessary concepts, and it's a bad idea to call that bias. The data won't do the thinking for him (or her). They've just found 3 new exoplanets in old Hubble data. The data hasn't changed, but the people who are looking at it have.

  9. Not a surprise by gweihir · · Score: 1

    Those that claim a large dataset is all you need are typically bad scientists that happen to have access to such a dataset. Large datasets eliminate one thing, namely noise (random variations). Large datasets can be just as biased, incomplete and contaminated with data you do not suspect of being in there as small datasets. They are not in any way a better approximation of "the truth" than smaller datasets.

    But every good scientist knew that anyways.
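
    A small simulation of the point made in this comment, as a sketch only. The population (standard normal, true mean 0) and the filter (negative values are dropped half the time) are invented for illustration and do not come from the paper or the thread. The standard error keeps shrinking as the sample grows, but the biased estimate settles on the wrong value.

```python
import random
import statistics


def draw_sample(n, rng, biased):
    """Draw n values from a standard normal population (true mean 0).

    If biased is True, each negative value is discarded with probability 0.5,
    a made-up stand-in for a filter that under-represents part of the population.
    """
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)
        if biased and x < 0 and rng.random() < 0.5:
            continue  # this observation never makes it into the data set
        kept.append(x)
    return kept


def estimate(n, biased, seed=0):
    """Return (sample mean, standard error of the mean) for a sample of size n."""
    rng = random.Random(seed)
    sample = draw_sample(n, rng, biased)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / n ** 0.5
    return mean, stderr


for n in (100, 10_000, 100_000):
    fair_mean, fair_se = estimate(n, biased=False)
    skew_mean, skew_se = estimate(n, biased=True)
    print(f"n={n:>7}  unbiased: {fair_mean:+.3f} (se {fair_se:.3f})  "
          f"biased: {skew_mean:+.3f} (se {skew_se:.3f})")
```

    The shrinking standard error is the "noise" that a large data set takes care of; the offset in the biased column is the part that no amount of extra data from the same filtered source will fix.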

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.