Slashdot Mirror


Amazon Launches Public Data Sets To Spur Research

turnkeylinux writes "Amazon just launched its Public Data Sets service. The project encourages developers, researchers, universities, and businesses to upload large, non-confidential data sets to Amazon (things like census data, genomes, etc.) and then let others integrate that data into their own AWS applications. AWS is hosting the public data sets at no charge for the community, and as with all AWS services, users pay only for the compute and storage they consume with their own applications. Data sets already available include various US Census databases, 3-D chemical structures provided by Indiana University, and an annotated form of the Human Genome from Ensembl."

25 of 82 comments

  1. Finally! by jornak · · Score: 4, Funny

    Now I have somewhere I can store the index of my massive porn collection. Thanks, Amazon!

  2. Re:Check off privacy by kellyb9 · · Score: 2, Insightful

    One more step to a non private world CHECK

    Depends on what you upload. Census data isn't private.

  3. Re:Privacy? by russotto · · Score: 4, Insightful

    It is my understanding that this data was already obtainable in the first place.

    This is true. But the easier it is to obtain datasets like these, the easier it is for anyone to do data mining and correlate the public (presumably non-identified) datasets with any private data they do happen to have.

  4. Selling EC2 service? by bonyari · · Score: 4, Insightful

    This just looks like a way to sell their cloud computing services. They provide the free data and you provide the monthly service fee.

    1. Re:Selling EC2 service? by dubl-u · · Score: 2, Insightful

      This just looks like a way to sell their cloud computing services. They provide the free data and you provide the monthly service fee.

      I'd bet that's not quite how they think about it.

      I once had the fortune to work on a small project for a guy who had built a pretty large software company and then sold it. He said that he always looked to do something interesting first, and then figured out how to make it not lose money, because money-losers aren't sustainable.

      I don't know anybody at Amazon anymore, but from my pals who did work there, my guess is that AWS has a similar culture: they seek out the useful and interesting, and actually do the ideas they can make pay for themselves.

      If they had a culture that was mainly revenue-focused, I'd expect this idea to get shot down, because some penny-pincher would argue that they'd make more money from people uploading duplicates of these giant data sets over and over.

    2. Re:Selling EC2 service? by John+Hasler · · Score: 2, Interesting

      > If they had a culture that was mainly revenue-focused, I'd expect this idea to get shot
      > down, because some penny-pincher would argue that they'd make more money from people
      > uploading duplicates of these giant data sets over and over.

      And a clever marketing man would counter that this is an opportunity to achieve lock-in by establishing exclusive access to a large number of datasets. Once people have built large, complex applications that use a number of these datasets in Amazon's environment and format, it will be very difficult for them to move elsewhere. To marketing people "community"=="locked-in customers".

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
  5. Re:Privacy? by Frosty+Piss · · Score: 4, Informative

    The US Census Bureau charges for access to many of its datasets.

    --
    If you want news from today, you have to come back tomorrow.
  6. Catch 22 by Anonymous Coward · · Score: 3, Insightful

    Note that on Amazon's website they say that you can only access the data if you're paying them to crunch numbers on their cloud computers.
    That is, you can't just download the data off their sites, which would be the nice thing to do.
    As such, this article is nothing more than a slashvertizement.

    1. Re:Catch 22 by dubl-u · · Score: 3, Interesting

      Note that on Amazon's website they say that you can only access the data if you're paying them to crunch numbers on their cloud computers.
      That is, you can't just download the data off their sites, which would be the nice thing to do.

      And you know what you can do with a cloud computer, my little rocket scientist? You can set up a frickin' web server. And then you can download anything your precious heart desires.

  7. Patent pending... by owlnation · · Score: 4, Funny

    Expect a new slew of Amazon patents...

    "1-Sick" -- Health Data
    "1-Mick" -- Irish Census Data
    "1-Dick" -- Porn Movies Database
    "1-Lick" -- Lesbian Porn Movies Database
    "1-Fick" -- German Porn Movie Database
    "1-Hick" -- The George W. Bush Presidential Library catalog.
    "1-Kick" -- Pharmaceutical Index
    "1-Nick" -- Crime Data
    "1-Prick" -- Copyright Law Legal database
    "1-Trick" -- List of iKea-nu Reeves Movies.
    "1-Tick" -- Camping Places Data set.
    "1-Brick" -- The Lego Catalog.
    "1-Thick" -- Obesity Index.

  8. Re:Check off privacy by johnsonav · · Score: 4, Insightful

    One more step to a non private world CHECK

    Privacy, as we have experienced in the last hundred years, is on its way out anyway. The sheer volume, immortality, and interconnection of even publicly available datasets inadvertently reveal information most of us would rather keep private. It is much like how most people don't have a problem with beat cops regularly patrolling an area, but feel threatened by cameras monitoring, recording, analyzing, and storing information about the same public area.

    That said, it's here to stay. The data's here as long as we use credit cards for most purchases, use I-Pass (or similar) toll paying systems, carry GPS-enabled cell phones, and expect the police to protect us from 100% of terrorist and criminal bogeymen. We might as well get some private research done, rather than leave it all to the government and big business.

    --
    ... and that's when the C.H.U.D.'s came at me.
  9. Re:Check off privacy by Chyeld · · Score: 4, Interesting

    The less privacy we have, the less we have to worry about our privacy. That sounds flip, and along the lines of "if you have nothing to hide..." but it isn't.

    We want privacy primarily due to shame.

    We have shame because we wear masks almost 100% of the time.

    We wear masks because we don't want people to realize who we 'really are' either mentally or physically.

    We don't want people to really know us because we have been convinced to hold ourselves to standards that no one actually meets.

    We hold ourselves to these standards because everyone else is wearing masks and while we can tell ourselves that 'they are just like us', it's hard to grasp that cognitively without actual proof.

    If there were no privacy, no one could wear a mask. If no one were wearing a mask, we would realize that the standards we hold ourselves to are unrealistic. If we realize the standards we hold ourselves to are unrealistic, we are freed from shame. If we are freed from shame, we no longer find privacy necessary.

  10. Re:Check off privacy by truthsearch · · Score: 2, Insightful

    Privacy, as we have experienced in the last hundred years, is on its way out anyway.

    It was only recently on its way in. For most of history people lived in small communities where everyone knew each other's business. Privacy only seemed to become a major concern when technology let us share information across large distances and with many more people.

    I'm not commenting on whether that's a good or bad thing.

  11. Re:Check off privacy by tylerni7 · · Score: 5, Insightful

    We (or at least some of us) also want privacy to prevent annoyances and for protection.

    I certainly don't want to have to answer to the government anytime I say the word "bomb" or "terrorist" on the telephone, in email, or in an IM.
    I also don't want some company complaining anytime they see me buy a product from one of their competitors.
    I also don't want to have everyone on the internet knowing my social security number, address, license plate number, or telephone number.

    That isn't because of "shame"; that's because people can be assholes, and some people will abuse information. I don't care if people that I trust know these things, but I don't think shame or masks or whatever has anything to do with getting one's identity stolen, or having the government ensure you don't say anything bad about them.

    That said, I don't think this public dataset business really affects individual privacy. This is more a database of already public, but hard to find, data, that doesn't contain personally identifiable anything in it.
    Let's just hope they keep it that way.

  12. History repeats by sukotto · · Score: 2, Funny

    >users pay only for the compute and storage they
    >consume with their own applications

    Everything old is new again!
    Ah the good old days... when you had to PAY for cycles.... not like the young whippersnappers today with their "desktops" and "laptops" and more cycles than they know what to do with.

    --
    Come play free flash games on Kongregate!
  13. Sounds like "Give us data so we can charge you" by Morgaine · · Score: 4, Insightful

    If the uploaded data is not available for download, but is only available to AWS applications running on Amazon's (paid for) compute service, then Amazon deserves nothing but contempt and an "Up yours" for this.

    It seems that working for a living is out of fashion at Amazon. They expect people to supply them with resources so that they can charge them and others for their use. It's creative business bullshit, and not even remotely funny.

    Amazon, how about you PAY BACK for the privilege of having the datasets uploaded to you by hosting them freely for the Internet community, and only on the back of that you charge for local, higher-speed access by AWS applications? Or would that be too "fair" for an Amazon business practice?

    --
    "The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
    1. Re:Sounds like "Give us data so we can charge you" by cecille · · Score: 2, Interesting

      Oh, agreed...it's totally a business move for them, wrapped in the veneer of a good deed. On the other hand...if this is implemented correctly, it could be amazing. I say this as a researcher who has spent more time than necessary gathering data sets. Just as a quick (and painful) example...during my Master's degree, I was doing CI research for a hearing aid application. Without boring you with the details, the idea was to create a system to classify the audio background environment so it could be more effectively removed. For this, I needed a large set of ~1-sec clips of background noise with as much variety as possible. I didn't want to use what we normally call a "toy" data set because this was intended to be actually used. So I wanted variety, but I also wanted combo sounds - it's easy to tell a highway from a room of people, but what about a cityscape, with cars AND people AND a bah-zillion other sounds. Anyway, the result was that I spent MONTHS in a sound booth splitting audio files and listening to EACH 1-sec clip individually and recording exactly what sounds were in the clip and then parsing audio features. It SUCKED.

      Anyway, now that it's done, putting something like this on Amazon would be great (if I had the rights to the original clips). Not only would it save someone else the work, but researchers would be using a real, tough data set. Plus, it might get corrections (no way I didn't make at least a few mistakes in all those clips), and it might get added to (there are so many different sounds in this world, no way is this data set complete). Alternately, if I were a researcher now and I got my hands on this, it would save months of work, months of pay to an RA, and a semester's tuition, even if I did have to pay for cycles.

      On the other hand, I think there are a few places that do this, possibly for free. I want to say...Wolfram maybe? Plus, there's specialty ones. I think there's a big facial recognition set etc.

      --
      ...no two people are not on fire.
    2. Re:Sounds like "Give us data so we can charge you" by dubl-u · · Score: 4, Informative

      If the uploaded data is not available for download, but is only available to AWS applications running on Amazon's (paid for) compute service, then Amazon deserves nothing but contempt and an "Up yours" for this.

      Seriously? Or did somebody just put sand in your pancakes this morning?

      As an AWS user, I think this is great. It means I don't have to waste time and money copying over a public dataset. When I read about this I fired up a virtual Linux box, attached the census data as /dev/sdb, and spent a couple hours rummaging. Total cost: $0.70. If I had had to copy everything over first, it would have been $20 in bandwidth, plus a long time waiting for the 200 GB to transfer.

      You realize that these datasets are public, right? For the census one, you can already download it for free. Do you want Amazon to make it extra-super-free or something?

      I presume it's the same for the others. But if not, you should put your money where your very active mouth is. It would take maybe 15 minutes' work to get an Amazon server up and running, attach all the public datasets, and set up a web server.

      I'm so very tired of people who say "somebody should do X!" but aren't willing to be that somebody.

    3. Re:Sounds like "Give us data so we can charge you" by Slashdot+Parent · · Score: 2, Insightful

      A) Home bandwidth is a sunk cost. Transferring it wouldn't have cost you a penny more than you are already paying, assuming you pay a flat rate.

      My time is not a sunk cost.

      B) Transferring the data would have made it available to you for free, anytime.

      Most of these datasets are hundreds of GB in size. That's going to take a long time to download and it's going to mean buying a new hard disk and/or deleting your pornography collection.

      The whole idea here is that if you are an AWS customer, and you're crunching a bunch of numbers, and need to crunch some census/genome/whatever data, you can type 'ec2-create-volume --snapshot <snapshotId>' and now that dataset can be attached to any EC2 instance. You don't have to wait to transfer the data in, and you don't have to pay the $0.10/GB to transfer the data in. The data sets are there for you when you need them.
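As a rough illustration of the arithmetic in this thread, the sketch below estimates what copying a dataset into AWS yourself would cost and how long it might take. The $0.10/GB transfer-in rate and the 200 GB census size are the figures quoted in the comments above; the 10 Mbit/s link speed and the function names are assumptions made only for this example.

```python
def transfer_in_cost_cents(size_gb: int, cents_per_gb: int = 10) -> int:
    """Cost in cents of uploading size_gb gigabytes at a flat per-GB rate."""
    return size_gb * cents_per_gb


def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) over the given link."""
    bits = size_gb * 8 * 1000**3
    seconds = bits / (link_mbit_per_s * 1000**2)
    return seconds / 3600


if __name__ == "__main__":
    census_gb = 200        # dataset size quoted in the thread
    assumed_link = 10.0    # Mbit/s, a hypothetical uplink speed
    cents = transfer_in_cost_cents(census_gb)
    print(f"transfer-in cost: ${cents / 100:.2f}")
    print(f"upload time: {transfer_hours(census_gb, assumed_link):.1f} hours")
```

Under those assumptions the 200 GB set works out to $20.00 in transfer fees, matching the figure given earlier in the thread, plus well over a day of uploading.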

      If you are not an AWS customer, then this isn't for you. Move along, now.

      --
      They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
  14. Comment removed by account_deleted · · Score: 2, Insightful

    Comment removed based on user account deletion

  15. Not the same by Nerdposeur · · Score: 2, Insightful

    [Privacy] was only recently on its way in. For most of history people lived in small communities where everyone knew each other's business.

    Which is very different from a large society in which some people know everybody else's business.

    Even if this stuff is public, the time and money and knowledge necessary to use it will not be evenly distributed.

    1. Re:Not the same by johnsonav · · Score: 3, Insightful

      Which is very different from a large society in which some people know everybody else's business.
      Even if this stuff is public, the time and money and knowledge necessary to use it will not be evenly distributed.

      Information has never been evenly distributed. In small communities it was the neighborhood gossip, the corner pharmacist, the village priest, or the county sheriff who knew everybody's business. The replacement of social capital with monetary capital is the only difference.

      Those small communities had, however, a fast-acting, closely monitored feedback system. If someone abused their position of power and trust, it was caught quickly and it was easy to remove them from the loop. A similar system is needed now, only on a national or worldwide scale. I think the only way to accomplish this, without going back to a pre-computer society, is to make sure that as much information about the watchers as possible is publicly accessible. Hopefully, the same spirit that makes the OSS community so vibrant and quick to act will transfer to this new domain.

      --
      ... and that's when the C.H.U.D.'s came at me.
  16. What's the license? by SanityInAnarchy · · Score: 2, Informative

    You'll recall that Amazon's "cloud computers" (ugh) are by the hour, and are pretty much root access to a VM. Unless there's a specific legal reason you can't, it's always possible to just download the data -- you'd just pay a bit for the time that instance must be up, and for the data transferred.

    However, for those of us who already are using EC2, it's nice to not have to download the whole set -- which can be terabytes, for some of these -- and instead be able to simply mount it from wherever it is and work with it right away. Especially when you consider the cost of downloading terabytes worth of data from Amazon's web services, at 17 cents per gigabyte -- reasonable, but still probably more than you wanted to pay just to query the stuff.
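To put the 17 cents per gigabyte figure in perspective, here is a minimal sketch of the download-cost arithmetic. It assumes a flat rate with no tiering and decimal terabytes (1 TB = 1000 GB); the function name is hypothetical and the rate is simply the one quoted in this comment.

```python
GB_PER_TB = 1000  # decimal terabytes, an assumption for this sketch


def egress_cost_dollars(size_tb: float, cents_per_gb: int = 17) -> float:
    """Dollar cost of downloading size_tb terabytes at a flat per-GB rate."""
    return size_tb * GB_PER_TB * cents_per_gb / 100


if __name__ == "__main__":
    for tb in (1, 2, 5):
        print(f"{tb} TB out: ${egress_cost_dollars(tb):.2f}")
```

At that rate, each terabyte pulled out of AWS would run about $170, which is why mounting the data in place is attractive for one-off queries.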

    I suspect, also, that at least some of these will be made available via a web service of some sort, maybe even free, by some of those people using that service.

    --
    Don't thank God, thank a doctor!
  17. Re:Privacy? by dubl-u · · Score: 2, Insightful

    Yes, but at least now we are all able to do data mining in large databases.

    This is absolutely the case.

    The web has made vast amounts of information available, so you would think it would play into the "computers will bring about the age of big brother" fear that was so prominent during the 60s. But it hasn't. Instead, because everybody can afford computers and bandwidth, it has distributed power rather than concentrating it.

    The rich and powerful already have access to vast datasets, and the computing and human power necessary to mine them. Things like Google and Wikipedia and blogs have given everybody a taste of that power, and I'm in favor of anything that helps level the playing field.

  18. Re:Check off privacy by tylerni7 · · Score: 2, Insightful

    If my phone number and address were available, then people could easily contact and harass me. It's true that they could do the same to anyone, but that doesn't mean they will stop harassing people altogether. Instead, what would (probably) happen is that people would just choose who they want to harass. (Just think about 4chan, for instance: they don't do it because it's difficult, they do it to harass people.)

    Likewise, the government wouldn't just change laws, instead they would (probably) just use the information they have to go after people they don't like.

    I am just speculating of course, and you do have a lot of valid points, like with SSNs for instance. But I don't agree that if society were completely open, people would suddenly stop abusing their power and stop being assholes to other people. Instead, it would just be easier for them to do these things.