Amazon Launches Public Data Sets To Spur Research
turnkeylinux writes "Amazon just launched its Public Data Sets service (home). The project encourages developers, researchers, universities, and businesses to upload large (non-confidential) data sets to Amazon — things like census data, genomes, etc. — and then let others integrate that data into their own AWS applications. AWS is hosting the public data sets at no charge for the community, and like all of AWS services, users pay only for the compute and storage they consume with their own applications. Data sets already available include various US Census databases, 3-D chemical structures provided by Indiana University, and an annotated form of the Human Genome from Ensembl."
Now I have somewhere I can store the index of my massive porn collection. Thanks, Amazon!
One more step to a non private world CHECK
From the service description:
"They can then access, modify and perform computation on these volumes directly using their Amazon EC2 instances and just pay for the compute and storage resources that they use. "
It is not expensive, but it ain't free either.
It is my understanding that this data was already obtainable in the first place. So technically it isn't a huge invasion of privacy; it is just becoming more readily available. One of the public data sets provided is from the US Census Bureau, and those were for the public anyway.
The most perfidious way of harming a cause consists of defending it deliberately with faulty arguments. - Nietzsche
This just looks like a way to sell their cloud computing services. They provide the free data and you provide the monthly service fee.
One good thing is that it will be possible to standardize statistical tests and results against such a common database. One big problem with a lot of statistical analysis is the skewing of data due to insufficient size, vastly different population sample sets, and the presence of colored noise.
Take a simple radix sort algorithm applied to a telephone directory. Radix sort works if the pre-allocation of slots matches the data. One example where it breaks down is if you used a Boston matrix [having large concentrations of Irish names] on San Francisco's population [having clusters of Asian names].
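The slot-allocation point can be made concrete with a small sketch. This is not a full radix sort, only the first bucketing pass with fixed slot capacities, and the two name lists are made-up toy data rather than real directory entries. The helper names `slot_capacities` and `overflow` are hypothetical.

```python
from collections import Counter

def slot_capacities(sample, total_slots):
    """Pre-allocate slots per first letter in proportion to a reference sample."""
    counts = Counter(name[0].upper() for name in sample)
    n = len(sample)
    return {letter: max(1, round(total_slots * c / n)) for letter, c in counts.items()}

def overflow(names, capacities):
    """Count names that do not fit in the slots pre-allocated for their first letter."""
    counts = Counter(name[0].upper() for name in names)
    return sum(max(0, c - capacities.get(letter, 0)) for letter, c in counts.items())

# Toy data only: illustrative surnames, not taken from any real directory.
boston = ["Murphy", "McCarthy", "O'Brien", "O'Connor", "Kelly", "Sullivan"]
san_francisco = ["Wong", "Chen", "Chan", "Lee", "Nguyen", "Wang"]

# Capacities tuned on the Boston sample fit Boston but not San Francisco.
caps = slot_capacities(boston, total_slots=6)
print(overflow(boston, caps))
print(overflow(san_francisco, caps))
```

With slots tuned on one population, every name from the other population lands outside its pre-allocated slot, which is the mismatch described above.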
Given the tremendous progress in genomic research, I would be interested in comparing my DNA with Craig Venter's. I guess one drawback might be what I call the white lab mouse issue. White lab mice's DNA is becoming a laboratory benchmark because the mice are so well studied and are bred for consistency of data by the money-making labs that furnish them. However, white mice are very rare in nature (they are easily spotted by predators). So, we have an entire industry focused on investigating a marginal population.
However, as my mother likes to joke, we spend more money subsidizing Viagra [probably due to the large white male population of "elected officials"] than we do on Alzheimer's medication, so progress is good!
Amazon Launches Public Data Sets To drive sales of their AWS.
Note that on Amazon's website they say that you can only access the data if you're paying them to crunch numbers on their cloud computers.
That is, you can't just download the data off their sites, which would be the nice thing to do.
As such, this article is nothing more than a slashvertizement.
Expect a new slew of Amazon patents...
"1-Sick" -- Health Data
"1-Mick" -- Irish Census Data
"1-Dick" -- Porn Movies Database
"1-Lick" -- Lesbian Porn Movies Database
"1-Fick" -- German Porn Movie Database
"1-Hick" -- The George W. Bush Presidential Library catalog.
"1-Kick" -- Pharmaceutical Index
"1-Nick" -- Crime Data
"1-Prick" -- Copyright Law Legal database
"1-Trick" -- List of iKea-nu Reeves Movies.
"1-Tick" -- Camping Places Data set.
"1-Brick" -- The Lego Catalog.
"1-Thick" -- Obesity Index.
Can someone make these datasets available to download as aggregate torrents, or are they available only once someone writes an application and gets it working on AWS?
With hard drives getting bigger and bigger, to me it makes sense to have lots and lots of local mirrors of this sort of data.
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
Am I the only person who read "to upload large (non-confidential) data sets to Amazon -- things like census data, gnomes, etc --"?
[offtopic]Great, but how about they start hurrying up with making their MP3 downloads available worldwide? And update their mp3 album downloader for openSUSE (which is for 10.3)? Those 2 things would make me an Amazon customer.
Here's the secret to immortality:
They are copying the "computable data" initiative in Mathematica http://www.wolfram.com/products/mathematica/newin6/content/LoadOnDemandCuratedData/ and http://www.wolfram.com/products/mathematica/newin7/
>users pay only for the compute and storage they
>consume with their own applications
Everything old is new again!
Ah the good old days... when you had to PAY for cycles.... not like the young whippersnappers today with their "desktops" and "laptops" and more cycles than they know what to do with.
How many developers here have had to hunt around for a list of countries to populate a select box? Or chained select boxes for country -> county/state -> town? How nice would it be to have a central repository where you can download it all (in any mixed selection of languages) in CSV/XML/etc.?
Phillip.
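For the chained select boxes, a minimal sketch of what such a download could feed into, assuming a hypothetical CSV layout with `country`, `region` and `town` columns. The sample rows and the `build_tree` helper are illustrative only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout; a real repository could use different columns.
SAMPLE_CSV = """country,region,town
United States,California,San Francisco
United States,Massachusetts,Boston
France,Alpes-Maritimes,Nice
"""

def build_tree(csv_text):
    """Turn flat CSV rows into a nested country -> region -> [towns] mapping."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        tree[row["country"]][row["region"]].append(row["town"])
    # Convert to plain dicts so the result serializes cleanly (e.g. to JSON).
    return {country: dict(regions) for country, regions in tree.items()}

tree = build_tree(SAMPLE_CSV)
print(sorted(tree))                      # options for the first select box
print(sorted(tree["United States"]))     # options once a country is chosen
print(tree["United States"]["California"])
```

Each select box then only needs the keys one level down from the current selection.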
If the uploaded data is not available for download, but is only available to AWS applications running on Amazon's (paid for) compute service, then Amazon deserves nothing but contempt and an "Up yours" for this.
It seems that working for a living is out of fashion at Amazon. They expect people to supply them with resources so that they can charge them and others for their use. It's creative business bullshit, and not even remotely funny.
Amazon, how about you PAY BACK for the privilege of having the datasets uploaded to you by hosting them freely for the Internet community, and only on the back of that you charge for local, higher-speed access by AWS applications? Or would that be too "fair" for an Amazon business practice?
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
So Amazon says: We'll host the raw data for your study! I say: who vouches for the validity of the data set itself? I understand that some of the sets are already publicly available, but that doesn't mean all. Will Amazon provide information on who/where the datasets came from? If I can't trust my data set, then I can't trust my results...
I can no longer read Dilbert. It's too depressing, because it is too real. -- Hyperhaplo
Which is very different from a large society in which some people know everybody else's business.
Even if this stuff is public, the time and money and knowledge necessary to use it will not be evenly distributed.
You'll recall that Amazon's "cloud computers" (ugh) are billed by the hour, and are pretty much root access to a VM. Unless there's a specific legal reason you can't, it's always possible to just download the data -- you'd just pay a bit for the time that instance must be up, and for the data transferred.
However, for those of us who already are using EC2, it's nice to not have to download the whole set -- which can be terabytes, for some of these -- and instead be able to simply mount it from wherever it is and work with it right away. Especially when you consider the cost of downloading terabytes' worth of data from Amazon's web services, at 17 cents per gigabyte -- reasonable, but still probably more than you wanted to pay just to query the stuff.
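To put a rough number on that, here is a back-of-the-envelope sketch using the 17 cents per gigabyte figure quoted above. The `transfer_cost` helper is hypothetical, a terabyte is taken as 1000 GB, and instance-hour charges are ignored.

```python
def transfer_cost(gigabytes, usd_per_gb=0.17):
    """Estimate the outbound transfer cost in US dollars at a flat per-GB rate."""
    return gigabytes * usd_per_gb

# One and five terabytes, counting 1 TB as 1000 GB (an assumption).
for terabytes in (1, 5):
    cost = transfer_cost(terabytes * 1000)
    print(f"{terabytes} TB: about ${cost:.2f}")
```

At that flat rate a single terabyte comes to roughly $170 in transfer alone.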
I suspect, also, that at least some of these will be made available via a web service of some sort, maybe even free, by some of those people using that service.
Don't thank God, thank a doctor!
>I say: who vouches for the validity of the data set itself? I understand that some of the sets are already publicly available, but that doesn't mean all.
Did you even look at their public data sets page? Every one of the public data sets they've listed has source information. They are already, at this very moment, providing information on who/where they came from.
The one I looked at also had extensive sets of README files explaining the source and format of the data, and I'd imagine it's true for the others as well.
Actually - I just want to seed my pirated movie collection, download midget porn, and read my boss's email and be the only person to know about it.
Just disrupt the deflector shield with a tachyon burst.
How is there "no charge to the community" when the data is accessible only to paying Amazon customers? I have no objection to them doing this, but the hype is a bit much. I guess the only "community" that matters to Amazon is the one consisting of Amazon customers.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
I wonder if AOL will be submitting any "non-confidential" data...
plus much, much more at:
http://genome.ucsc.edu/
http://www.ensembl.org/index.html
This is just a way to access it from Amazon's compute cloud.
"We don't want people to really know us because we have been convinced to hold ourselves to standards that no one actually meets."
Yeah...well, I do have a bigger dick than everyone else.
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
This is just a benefit that they are giving to their customers.
They are storing huge, public, commonly-used datasets for their customers, free of charge. If you are a customer and want to use, say, census data, you don't have to waste your time uploading the data, and you don't have to pay the $0.10/GB to upload the 200GB of data. Amazon already hosts the data for you. You just run a simple command, and the data are now instantly available for you to use however you want.
If you are not an AWS customer, then this service probably will not do you any good. Just download the data from census.gov and be done with it.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
I have been promised some simple survey data (bipolar survey) - what is the best way to analyze and display - I am into java/database/oracle and would like to use business intelligence techniques to test growing my career that way.
I realize there are lots of ways to do this, most of which would increase my skillset.
Be Free: Free Software Tuition
OK, so you have the public data from Amazon - how do you analyze it? (I realize the problem here is the number of options available rather than being constrained to one path.)