Google's Academic TB Swap Project

Posted by ryuzaki0 on Wednesday March 7, 2007 @04:00AM from the hey-look-it's-chris dept.

eldavojohn writes "Google is transferring data the old fashioned way — by mailing hard drive arrays around to collect information and then sending copies to other institutions. All in the name of science & education. From the article, 'The program is currently informal and not open to the general public. Google either approaches bodies that it knows has large data sets or is contacted by scientists themselves. One of the largest data sets copied and distributed was data from the Hubble telescope — 120 terabytes of data. One terabyte is equivalent to 1,000 gigabytes. Mr. DiBona said he hoped that Google could one day make the data available to the public.'"

39 of 190 comments

Should we be continuing this fallacy? by garcia · 2007-03-07 04:02 · Score: 3, Informative

One terabyte is equivalent to 1,000 gigabytes.

Uhh, no it isn't. It's really 0.9765625 terabytes.
1. Re:Should we be continuing this fallacy? by Cristofori42 · 2007-03-07 04:06 · Score: 5, Funny
  
  umm a terabyte is really 1 terabyte. Though 1 terabyte = 1024 gigabytes not 1000... but whatever.
  
  --
  "Is that dad? Either that or Batman's really let himself go."
2. Re:Should we be continuing this fallacy? by wizzard2k · 2007-03-07 04:07 · Score: 2, Informative
  
  From wikipedia:
  A tebibyte (a contraction of tera binary byte) is a unit of information or computer storage, abbreviated TiB.
  
  1 tebibyte = 2^40 bytes = 1,099,511,627,776 bytes = 1,024 gibibytes
  
  The tebibyte is closely related to the terabyte, which can either be an (inaccurate) synonym for tebibyte, or refer to 10^12 bytes = 1,000,000,000,000 bytes, depending on context.
3. Re:Should we be continuing this fallacy? by garcia · 2007-03-07 04:10 · Score: 2, Informative
  
  Thanks for pointing out that I should have been hitting Preview instead of getting First Post :)
  
  1000GB = 0.9765625 TB, not 1TB.
4. Re:Should we be continuing this fallacy? by Professor_UNIX · 2007-03-07 04:53 · Score: 4, Insightful
  
  * 1 Terabyte = 1000 Gigabyte * 1 Tebibyte = 1024 Gibibyte
  Yea, yea, yea. And you also believe a hacker isn't someone who maliciously breaks into computer systems, it's just a curious innocent person right... crackers are the criminals! Give it up. The general public is never going to adopt "Tebibyte" into the language because terabyte sounds much more fucking cool.
5. Re:Should we be continuing this fallacy? by wolff000 · 2007-03-07 05:03 · Score: 2, Insightful
  
  WHO CARES?!? I have worked with mathematicians who did not squabble over these terms, so why the hell are we?!? My mother, who can hardly turn a computer on, knows damn well that 1000 megabytes is roughly 1 gigabyte. Now let's get back to the topic. It seems Google would have some brilliant way to push a terabyte through the "tubes" instead of just mailing drives, how archaic.
  
  --
  WTF?
6. Re:Should we be continuing this fallacy? by servoled · 2007-03-07 06:13 · Score: 2, Insightful
  
  Makes you wonder why some morons decided to do it in the first place when they tried to redefine kilo, mega, giga, etc... to be 2^x instead of 10^y.
  
  --
  "I have a porkchop, you have a porkchop. I have a veal, you have a veal".
7. Re:Should we be continuing this fallacy? by Anpheus · 2007-03-07 06:47 · Score: 3, Insightful
  
  That's not the problem, the problem is, when you buy a X GB drive, you don't know what you're getting until you find the fine print. Some manufacturers provide different sizes of the same labeled drive, differing only in whether it's "1 GB = 1,000,000 KB" or "1 GB = 1,000,000,000 B"
  
  So if you buy a set for RAID one day, the next day they may no longer stock the drive you need and your vital information is put at unnecessary risk because... what, because the hard drive manufacturers can't decide whether they want to screw you out of 7% (using 1 GB = 1 billion bytes) or 5% (using 1 GB = 1 million kilobytes, which they curiously agree on equaling 1.024 billion bytes. What a coincidence that KB is 2^10, but GB is 10^9?)
  
  Think about that for a moment before you lambast the argument for proper labeling of drives.
8. Re:Should we be continuing this fallacy? by jonbryce · 2007-03-07 10:06 · Score: 2, Informative
  
  The other primary place where the prefixes are in use is RAM chips, and they do use 2^10 rather than 10^3.
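
The percentages and the 0.9765625 figure quoted in this thread can be checked with a few lines of arithmetic. The following is only a sketch: the constant and function names are made up for illustration, and it assumes the "5%" case means a gigabyte labelled as one million 1024-byte kilobytes.

```python
# Decimal (SI) and binary (IEC) sizes, in bytes.
GB = 10**9    # gigabyte as 10^9 bytes
GIB = 2**30   # gibibyte
TIB = 2**40   # tebibyte
KIB = 2**10   # kibibyte

def shortfall_percent(labelled_bytes, binary_bytes):
    """Percent by which a labelled unit falls short of the binary unit."""
    return (1 - labelled_bytes / binary_bytes) * 100

# Labelling with "1 GB = 1,000,000,000 B", compared with a gibibyte.
pure_decimal = shortfall_percent(GB, GIB)

# Labelling with "1 GB = 1,000,000 KB", assuming KB means 1024 bytes.
mixed = shortfall_percent(10**6 * KIB, GIB)

# 1000 binary gigabytes expressed in binary terabytes.
ratio = 1000 * GIB / TIB

print(f"{pure_decimal:.2f}% {mixed:.2f}% {ratio}")
```

Under those assumptions the two shortfalls come out just under 7% and just under 5%, and 1000 binary gigabytes is 1000/1024 of a binary terabyte, which is where 0.9765625 comes from.
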
Large datasets by BWJones · 2007-03-07 04:03 · Score: 4, Informative

This is absolutely the most cost-effective way of transferring large amounts of data like this. If you do the calculations on terabyte-size files, sneakernet (or FedEx net) is actually faster and less expensive. We also went to one of Jim Gray's seminars when he was here giving an Organick Memorial Lecture, and he made an incredibly compelling demonstration using a variety of data types. We ended up talking with him for some time afterward about new projects we are engaging in that will also be generating terabytes of data, and his suggestion was to pass applications rather than data, which was interesting.

This is becoming more and more the norm in scientific research and Google's work is quite welcome.

--
Visit Jonesblog and say hello.
1. Re:Large datasets by Sobrique · 2007-03-07 04:09 · Score: 4, Funny
  
  Never underestimate the bandwidth of a lorryload of backup tapes traveling at 60 miles an hour.
  Latency may leave something to be desired though :)
2. Re:Large datasets by dmayle · 2007-03-07 04:36 · Score: 2, Insightful
  
  I remember an article I read on this, I think back in the year 2000. There was a research scientist who built a standardized platform (that is to say, a specific PC case with a certain number of hard drive bays and certain network cards) so that he could exchange data with other universities. They would fill up the data on the networked PC, and they could ship it to any of the participating projects, knowing that they'd get back the same hardware in return.
  
  I remember at the time thinking it was just one of those smart little details that just make working together easier. It's not some great leap of genius, but enough of a well crafted idea that it could really help.
3. Re:Large datasets by BWJones · 2007-03-07 04:46 · Score: 2, Insightful
  
  Yeah, there have been a number of folks using variations on this theme for a while now. It's been interesting that network performance really has not followed the same performance curve as storage and CPU throughput. Add to that the growing amount of data being pushed through "consumer" pipes from people obtaining broadband and pushing sources such as YouTube and company and you have the makings for a bandwidth crunch. This of course is the reason for separate academic and government Internet paths, but it is still a limited commodity. In fact, at some universities engaging in data intensive projects, it is not uncommon for them to occupy the entire bandwidth of the university in off hours to transfer data around the country to various collaborators.
  
  --
  Visit Jonesblog and say hello.
4. Re:Large datasets by Agent+Orange · 2007-03-07 05:09 · Score: 2, Informative
  
  Yup. There was a paper a few years back entitled "TeraScale SneakerNet", by Jim Gray and a couple of guys at the MSFT research division on this. You can find it on the arXiv.
  
  This concept has also been applied to such things as the Sloan Digital Sky Survey. Astronomers do tend to generate a lot of data with large surveys such as this.
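
The "do the calculations" argument in this thread can be sketched roughly as follows. Only the 120 TB payload comes from the article; the 24-hour transit time and the 1 Gbit/s link are assumptions picked for illustration, and the time needed to copy data onto and off the drives is ignored.

```python
TB = 10**12  # decimal terabyte, as used in the article

def shipped_gbit_per_s(payload_bytes, transit_hours):
    """Effective throughput of a shipment, in gigabits per second."""
    return payload_bytes * 8 / (transit_hours * 3600) / 1e9

def network_days(payload_bytes, link_gbit_per_s):
    """Days needed to push the same payload over a fully used link."""
    return payload_bytes * 8 / (link_gbit_per_s * 1e9) / 86400

payload = 120 * TB  # the Hubble data set mentioned in the story

# Assumed: overnight courier, 24 hours door to door.
print(f"{shipped_gbit_per_s(payload, 24):.1f} Gbit/s by courier")

# Assumed: a dedicated, fully saturated 1 Gbit/s link.
print(f"{network_days(payload, 1.0):.1f} days over the network")
```

With these made-up numbers the box in the truck averages a bit over 11 Gbit/s, while the link would be busy for about 11 days. The latency joke still stands: nothing arrives until the whole shipment does.
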
so.. by mastershake_phd · 2007-03-07 04:08 · Score: 2, Interesting

Who's going to own the data? I hope Google isn't going to say they do, like they want to with the old books they're scanning. Every time you download a Hubble picture, will it have a Google watermark?

--
Libertarian Leaning Political Discussion Forum.
1. Re:so.. by cfulmer · 2007-03-07 04:23 · Score: 2, Interesting
  
  The ownership of data is presumably a case-by-case thing that depends on what the data is and how it was acquired.
  
  For example, Google does not own the copyright on out-of-copyright books that it scans in (nobody does, by definition.) At best, it might own the copyright on the scan that it did, but that's really unlikely--copyright protects creative expression and a straight scan doesn't add any.
  
  However, they probably have some rights under unfair competition law because they have gone through a lot of work acquiring all this data and it would be unfair for somebody else to piggyback on that work to compete with them.
  
  Recognize also that many of the "Hubble Pictures" you see are colorized versions of raw data that incorporates non-visible parts of the EM spectrum, assigning colors to things you can't see with your eyes. That assignment of colors to create something pleasing to the eye is certainly creative expression. So, if Google takes the raw data and does that color assignment itself, well, the result is theirs.
2. Re:so.. by oneiros27 · 2007-03-07 04:53 · Score: 2, Informative
  
  So, if Google takes the raw data and does that color assignment itself, well, the result is theirs.
  I'm not so sure that the result is theirs, necessarily. They'd need to properly attribute it. Many science archives have rules about how to properly attribute their work.
  
  Don't get me wrong -- many of the scientists want people to use their data (eg, see The Astronomer's Data Manifesto), but they also want to know who's using it, because it's how they justify the value of their projects, and the costs incurred from distributing the data (especially for non-active projects).
  
  The science community is also working on the Science Commons (an equivalent of the Creative Commons for marking scientific data) and various federated search engines (eg, night time (astronomy) virtual observatories, as well as other space and earth science discipline specific VOs.).
  
  --
  Build it, and they will come^Hplain.
Never underestimate ... by boyfaceddog · 2007-03-07 04:09 · Score: 2, Interesting

The bandwidth of a moving van full of disks.

Looks like Google is hoarding data. Seems they at least are equating information with power and money. And them that has the power and money makes the rules.

--
Here will be an old abusing of God's patience and the king's English.
In Other News by UnknowingFool · 2007-03-07 04:11 · Score: 4, Funny

FedEx delivered what appeared to be a ton of broken office chairs to Google headquarters this morning. When asked for the sender's ID, the severely beaten FedEx courier would only reply that the sender wished to remain anonymous.

--
Well, there's spam egg sausage and spam, that's not got much spam in it.
Other Uses for Mass Data Transfer by Anonymous Coward · 2007-03-07 04:12 · Score: 4, Funny

Moe: Say, Barn, uh, remember when I said I'd have to send away to NASA to calculate your bar tab?
Barney: Oh ho, oh yeah, you had a good laugh, Moe.
Moe: The results came back today. (reading a printout) You owe me seventy billion dollars.
Barney: Huh?
Moe: No, wait, wait, wait, that's for the Voyager spacecraft. Your tab is fourteen billion dollars.
Re:1TB = 1024 GB by 91degrees · 2007-03-07 04:12 · Score: 5, Insightful

Why?

Why is a Kilobyte 1024 bytes, if "Kilo" means 1000, both according to the SI and the greeks (Kilo is derived from khilioi). If 1 kg = 1000g, 1 kV = 1000V, 1 km = 1000m, why should hard disks break the pattern?

When we're talking about addressable computer memory, approximating the kilobyte to 1024 is a convenience, but since Terabyte gives such a huge error, and makes absolutely no sense for data transfer or disk sizes, it's really time we stopped this illogical naming convention just because some engineers found a term convenient 40 years ago.
Hubble Data by Ikyaat · 2007-03-07 04:13 · Score: 2, Funny

120 TB of data from the Hubble telescope? I wish I was paid to go through that. And this picture is of a...star and this one is a star And a star another star OMG its a FRICKIN STAR

--
"Luck is a tag given by the mediocre to account for the accomplishments of genius." -Heinlein
1. Re:Hubble Data by nharmon · 2007-03-07 05:56 · Score: 2, Funny
  
  "That's no moon"
Re:fixed by Macthorpe · 2007-03-07 04:19 · Score: 2, Funny

Wrong. One tebibyte is equal to 1024 gibibytes. One tarabyte equals 1000 gigabytes. If you're going to correct someone, do it right.

You meant 'terabyte', not 'tarabyte'. If you're going to correct someone, do it right.

--
"It does not do to leave a live dragon out of your calculations, if you live near him." - Tolkien
Re:1TB = 1024 GB by NinjaTariq · 2007-03-07 04:28 · Score: 2, Interesting

Use the kibibyte if you have a big problem with it.

But I have long since buried my problem with using the SI prefix with byte to mean a power of 2, actually not sure i ever had one, I just accepted it. I am happy with the 1024b=1Kb, 1024Kb=1Gb and 1024Gb=1Tb. The usable space is lower in the case of non-volatile storage anyway, 1Tb never means 1024Gb might be closer to 1000Gb (i don't know).
Bark! Bark! Bark! by ColdWetDog · 2007-03-07 04:35 · Score: 4, Funny

I'm so tired of this stuff. Byte me!

--
Faster! Faster! Faster would be better!
1. Re:Bark! Bark! Bark! by AchiIIe · 2007-03-07 06:36 · Score: 4, Funny
  
  > I'm so tired of this stuff. Byte me!
  
  I'm sorry, that's wrong too:
  
  * 1 byte == 2 nibbles
  * 1 byte != 1 bite
  
  --
  Byte nazi police, proudly serving since 2^1025
  
  --
  Nature journal lied in Britannica vs Wikipedia Ask to retrac
Mod parent up by ari_j · 2007-03-07 04:38 · Score: 2, Informative
Here's what happened when I FedExed my RMA to Newegg, packed very carefully. Note the bent motherboard - I didn't even know you could do that. The good news is that FedEx paid part of my claim ... they paid $100 plus the $8.33 that the FedEx store charged me to fax in the claim forms. The bad news is that they did not refund my original shipping or pay more than $100 on the over $280 of damage that they did. It also took about 4 hours of phone calls to even convince FedEx that I was not the seller, and then they lost my claim in their e-mail system (and did not reply to my e-mails) and closed it out for inactivity after a month or so, until I called them and asked what happened.
- box front - doesn't look bad from this angle
- box rear - run over by a truck?
- formerly a CPU - this was packed inside the box with lots of noodles
- motherboard top - not terrible-looking, but note the USB/Ethernet header
- motherboard end view - here you can see the curvature of the motherboard
- another picture of the CPU and motherboard boxes - almost certainly run over by a truck
On a side note, don't bother with UPS insurance. I insured something when I sent it to myself once, and they broke it and the insurance remedy was to return it to the origination address and ask to see an original purchase receipt to award the insurance claim. If you happened to make something yourself or even received something as a gift, don't insure it when you ship it. And hire a private courier (unless someone has found a common carrier that doesn't suck).
1. Re:Mod parent up by MajinBlayze · 2007-03-07 07:26 · Score: 5, Informative
  As a former UPS employee, (I worked as a package handler, the guy that beats the shit out of your boxes as he loads them on the truck) I will never ship anything of value without paying extra for the insurance. when you do that, a couple of things happen:
  
  the item goes into a big bag (by itself, not mixed with other items) with red/white stripes, so employees know not to mess with it
  
  it gets hand-carted to the destination truck, and is the last thing to be loaded, and first unloaded
  
  only seasoned workers ever touch your package, and generally care about the state that it's in
  
  finally, they are good about paying up if the item arrives damaged.
  
  did I forget to include ???? and Profit!
  --
  "Hate is baggage. Life's too short to be pissed off all the time." Danny Vinyard -American History X
2. Re:Mod parent up by evilviper · 2007-03-07 07:49 · Score: 2, Informative
  
  Besides, insurance is meant to cover damage due to normal mishandling, such as dropping a box by mistake, not the kind of (at least nearly) intentional damage that must have been involved in my case. Or maybe you have a theory of how my box got squashed that badly in the normal course of FedEx's business.
  
  I still don't know where you get that idea. Insurance is meant to handle any kind of damage, including being completely destroyed in plane crashes, car accidents, train derailments, theft, loss, and anything else that could possibly occur.
  
  For your package, I imagine it was run over by one piece of equipment or another. Forklifts, tractor trailers, etc. Or it may have been some sort of freak accident with equipment in their automated package handling system. I certainly don't have any reason to believe it was intentional, unless you have some reason to believe you've seriously pissed-off your local fedex office employees beforehand...
  
  --
  Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:1TB = 1024 GB by 91degrees · 2007-03-07 04:56 · Score: 2, Insightful

It's not illogical; it makes perfect sense to anyone who programs, well, anyone who does lower-level programming. If computers were to work in base 10... Sorry, I cannot even go there.

If we want to worry about that then use KiB and MiB. But that doesn't make a huge amount of sense. 1KiB = 400h bytes. 1MiB = 100000h bytes. Powers of 256 would make a lot more sense.
Re:Like days of old by meringuoid · 2007-03-07 05:06 · Score: 3, Interesting

This sounds almost like stories of scholars trading/copying books from long long ago.
According to what I'm told every time I watch a DVD, these scholars were in fact stealing books.

--
Real Daleks don't climb stairs - they level the building.
...why not tapes? by Penguinisto · 2007-03-07 05:18 · Score: 3, Interesting

I understand the whole "HDD w/ a common filesystem = more compatibility" thing, but wouldn't it be easier to simply send along some tapes of a type appropriate to the format/type that the scientific institution uses? LTO-3 can do 800GB compressed, SDLT can do up to 600... and neither is susceptible to data loss when it gets bounced too hard by FedEx/UPS/DHL/Whatever. (Plus it would make for a lighter package, wouldn't require some poor IT schmuck to disassemble a server or wait forever for USB to transfer all of it, etc...)
I'm not criticizing or anything; just curious is all.
/P

--
Quo usque tandem abutere, Nimbus, patientia nostra?
1. Re:...why not tapes? by kulover · 2007-03-07 06:29 · Score: 2, Interesting
  
  The reason for not using tapes is exactly because of the compression. The time it takes to compress that data and then send the data to the tape takes a lot of time. That same process would have to be repeated on the other end.
  
  Besides, using HDD for transfer means immediate access to the same data on the other end with speeds that are unmatched with tape backup systems. It might also be worthy to note that data sets that large usually are stored on large RAID systems like this one from LSI Logic, http://www.lsilogic.com/storage_home/products_home/external_raid/6998_storage_system/index.html, and are not installed into a computer like you may be thinking. It provides unmatched speed and reliability. A single rack system can sustain 1,600 MB of transfer to attached hosts, which is how Google will probably use it anyway. I highly doubt a single computer will be looking at that much information.
2. Re:...why not tapes? by K8Fan · 2007-03-07 06:40 · Score: 2, Interesting
  
  The "TeraScale SneakerNet" paper posted earlier anticipates and answers that. They ship a fully assembled computer with processor, RAM, OS and network interface. Plug it in to the wall, plug it in to the network and assuming you had previously agreed on a networking protocol, you're rolling as soon as it boots! No restoration, no decompressing, immediate access to the data.
  
  Does anyone have a Linux distro for this specific purpose? Preferably tiny enough to fit onto a USB key and optimized for bandwidth, preferably with a web server interface for configuring the discs and network?
  
  --
  "How perfectly Goddamn delightful it all is, to be sure" Charles Crumb
Re:1TB = 1024 GB by vidarh · 2007-03-07 05:41 · Score: 2, Insightful

Byte isn't an SI unit, so what makes you think we care?
Real geeks have no problem with overloading.
Nope by sheldon · 2007-03-07 06:32 · Score: 2, Informative

How you measure a terabyte depends on whether you are buying disk, or monitoring disk usage on your server.

The disk manufacturers define it as 1000 megabytes which is 1000 kilobytes which is 1000 bytes.

The OS measures it as 1024 megabytes, which is 1024 kilobytes, which is 1024 bytes

Why? Because when you're buying a drive, 750 Gigs sounds bigger than 698.5 gigs.
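
The 750 versus 698.5 figure can be reproduced with one division. This is a sketch that assumes the label counts a gigabyte as 10^9 bytes and the OS counts 2^30 bytes; `reported_gib` is a made-up helper name.

```python
GIB = 2**30  # bytes per "gig" as the OS counts it (assumed binary)

def reported_gib(advertised_gb):
    """Convert an advertised decimal-GB capacity to binary gigabytes."""
    return advertised_gb * 10**9 / GIB

print(round(reported_gib(750), 1))
```

The same drive holds the same number of bytes either way; only the size of the unit used to count them changes.
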
Re:1TB = 1024 GB by 91degrees · 2007-03-07 06:50 · Score: 3, Informative

Well, the IEC and IEEE as well as the CIPM and NIST all agree that there are 1000 bytes to a kilobyte and 1024 bytes to the kibibyte. So there :P
Not according to NIST by Ernesto+Alvarez · 2007-03-07 08:51 · Score: 3, Interesting

If you want to be strict, the SI defines the "tera" prefix as 10^12, so 1 terabyte = 1000 gigabytes.

If you want to use the binary values, you might as well use the correct "tebi" prefix. NIST says you should, and it looks like the IEC, IEEE and BIPM agree.

--
GPG 0x1B479C78