Google Sorts 1 Petabyte In 6 Hours

Kudos to Google by Anonymous Coward · 2008-11-23 04:54 · Score: 5, Funny

for knowing how important the Library of Congress metric is to us nerds!

Re:Kudos to Google by Zencyde · 2008-11-23 06:25 · Score: 1

Who cares how many Libraries of Congress it takes. I want to know how long it will take to hack the Gibson!

--
What day is it? Could you please tell me?
Re:Kudos to Google by canuck57 · 2008-11-23 06:30 · Score: 5, Funny

for knowing how important the Library of Congress metric is to us nerds!
But at least now we know Google can sort out petafiles.
Re:Kudos to Google by shutdown+-p+now · 2008-11-23 07:46 · Score: 4, Funny

Bah! To pay true homage, they need to add it to the list of units in Google Calc!
Re:Kudos to Google by tyrione · 2008-11-23 10:15 · Score: 1

True, but they can only brag about comparisons once the entire Library of Congress has actually been digitized. I doubt they will want to, though, seeing as that would be far larger than a petabyte of data.
Re:Kudos to Google by LingNoi · 2008-11-23 11:02 · Score: 3, Funny

So Google can sort through 12 LoCs in 6 hours.
Wow, that's 2 LoC/pH
Re:Kudos to Google by SEWilco · 2008-11-23 14:08 · Score: 1

But they didn't sort the Library of Congress by relevance to me.
Re:Kudos to Google by Anonymous Coward · 2008-11-23 16:01 · Score: 0

Maybe next they can sort out the pedofiles!
Re:Kudos to Google by Anonymous Coward · 2008-11-23 16:02 · Score: 1, Funny

Woooosh, dipshit.
Re:Kudos to Google by Anonymous Coward · 2008-11-23 16:52 · Score: 0

2 LoC per pH? What have logarithms and hydrogen ions got to do with it?
Re:Kudos to Google by Hal_Porter · 2008-11-23 18:01 · Score: 1

0.55 milliLoC per second.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Re:Kudos to Google by Anonymous Coward · 2008-11-24 00:51 · Score: 0

A slash and a p? Wouldn't that make it "2 Libraries of Congress per per hour"?
My life is in no way hollow and tragic.

Unit conversion by Zarhan · 2008-11-23 04:59 · Score: 4, Funny

Yay! We finally have unit conversion from 1 LoC to bytes! So...20 PB = 6LoC, means that 1 LoC = 3,333... PB :)

Re:Unit conversion by xZgf6xHx2uhoAj9D · 2008-11-23 05:15 · Score: 3, Informative

Don't you mean 1PB = 12LoC?
Re:Unit conversion by Neon+Aardvark · 2008-11-23 05:22 · Score: 4, Informative

No, 1 PB = 12 LoC, so 1 LoC = 0.0833... PB
Also, I'd like to make some kind of swimming pool reference.

--
Azural - instrumentals
Re:Unit conversion by neoform · 2008-11-23 05:39 · Score: 1

What format are they using for the books when doing this calculation as to the size of the LoC?
Raw Text?
PDF?
JPEG? ....

--
MABASPLOOM!
Re:Unit conversion by Anonymous Coward · 2008-11-23 05:45 · Score: 2, Interesting

Assuming it was written in binary in a font that allows for 1 digit per 2mm, the length of the data would be 183251938 m, or 1145324 times the perimeter of an olympic-sized swimming pool.
Re:Unit conversion by UltraAyla · 2008-11-23 06:10 · Score: 1

I like your thinking, but would like to modify it (I realize it was a joke). Considering the rate at which LoC archives data, we should put some datestamps on it so that, including the other correction, 1PB = 12 081123LoC. Just a thought
Re:Unit conversion by owlnation · 2008-11-23 06:34 · Score: 1

No, 1 PB = 12 LoC, so 1 LoC = 0.0833... PB. Also, I'd like to make some kind of swimming pool reference.
Yes, but how much is that in football fields?
Re:Unit conversion by Zarhan · 2008-11-23 06:47 · Score: 2, Informative

Oh darn. Clearly I was converting pound-congresses to kilos first.
Re:Unit conversion by Anonymous Coward · 2008-11-23 06:50 · Score: 1, Interesting

Yes, but how much is that in football fields?
You silly sod, you can't measure something in football fields! There's internationalization to take into account!
Canadian football fields are 100x59m, American football fields are 109x49m, and the rest of the world doesn't even play the same game on a football field. And THEIR sport has a standard range, anywhere from 90-120m by 45-90m (Thank you wikipedia).
You've now introduced variable-variables! We can't get an absolute number!
Re:Unit conversion by Yvan256 · 2008-11-23 08:04 · Score: 1

Can we get an absolute variable instead?
Re:Unit conversion by Tubal-Cain · 2008-11-23 08:07 · Score: 1

I vote for i.
Re:Unit conversion by ewanm89 · 2008-11-23 09:29 · Score: 2, Informative

Well, Americans don't even play *foot*ball with their feet.
Re:Unit conversion by RancidPeanutOil · 2008-11-23 10:17 · Score: 1

can we work elephant volume into it as well? Assuming a spherical elephant of course... QED
Re:Unit conversion by xZgf6xHx2uhoAj9D · 2008-11-23 10:41 · Score: 1, Funny

This is an excellent point. No American football player has used his feet since the NFL adopted hoverchairs into the rules in 1974.
Re:Unit conversion by AmonTheMetalhead · 2008-11-23 11:55 · Score: 1

This is an excellent point. No American football player has used his feet since the NFL adopted hoverchairs into the rules in 1974.
Yea, everybody loved Captain Pike's playing; however, the incessant bleeping at the referees ruined his career.
Re:Unit conversion by Gazzonyx · 2008-11-23 12:36 · Score: 1

Just to avoid confusion, we're using base 2 units, correct? Where a KB = 1024 bytes and a MB is 1024 KB, etc. At the PB level, the difference between a KB being 1000 bytes and 1024 bytes adds up.
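To put a number on "adds up", here is a minimal Python sketch comparing the two conventions at each prefix. The function name `binary_vs_decimal` is made up for illustration.

```python
# Ratio between binary (1024-based) and decimal (1000-based) units
# at each prefix level. All names here are illustrative only.
PREFIXES = ["KB", "MB", "GB", "TB", "PB"]

def binary_vs_decimal(level: int) -> float:
    """Return how many times larger the binary unit is than the decimal one.

    level 1 = KB, 2 = MB, ..., 5 = PB.
    """
    return (1024 ** level) / (1000 ** level)

for level, name in enumerate(PREFIXES, start=1):
    ratio = binary_vs_decimal(level)
    print(f"{name}: binary unit is {(ratio - 1) * 100:.1f}% larger")
```

The gap grows from about 2.4% at the KB level to about 12.6% at the PB level, which is roughly 126 decimal terabytes per petabyte.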

--
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Re:Unit conversion by Anonymous Coward · 2008-11-23 12:48 · Score: 0

Obviously we need to know how many olympic-sized swimming pools filled with 8GB flash drives this would be.
Re:Unit conversion by buchner.johannes · 2008-11-23 13:19 · Score: 1

How do you sort a swimming pool?

--
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Re:Unit conversion by Anonymous Coward · 2008-11-23 13:51 · Score: 0

Very wrong. You are short by a factor of about 8,000. Assuming each PB = 1024 TB, we have:
1 LoC = 0.0833 PB = 85.3 TB = 85.3 * 1024 GB = 85.3 * 1024 * 1024 * 1024 * 1024 Bytes
1 LoC = 93,824,992,236,885 Bytes.
Each Byte is 8 binary digits, so:
1 LoC = 750,599,937,895,082 digits.
Now, each digit is 2mm, so we have:
1 LoC = 1,501,199,875,790 meters or about 10,007,999,172 times the perimeter of an Olympic-sized swimming pool (150 meters). That's right: 10 thousand million times.
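This kind of conversion is easy to get wrong by a factor of 8 or 1024, so here is a sketch that redoes it with exact fractions. The assumptions are the ones used in this thread: 1 PB = 12 LoC, 1 PB = 1024^5 bytes, 2 mm per printed binary digit, and a 50 m x 25 m pool. The function name is invented for the example.

```python
# Re-derive the "one LoC printed as binary digits" length.
# Assumptions (from the thread): 1 PB = 12 LoC, 1 PB = 1024**5 bytes,
# 2 mm per printed binary digit, pool perimeter of a 50 m x 25 m pool.
from fractions import Fraction

BYTES_PER_PB = 1024 ** 5
LOC_PER_PB = 12
BITS_PER_BYTE = 8
METERS_PER_DIGIT = Fraction(2, 1000)
POOL_PERIMETER_M = 2 * (50 + 25)

def loc_as_printed_binary():
    """Return (bytes, digits, meters, pool_perimeters) for one LoC."""
    loc_bytes = Fraction(BYTES_PER_PB, LOC_PER_PB)
    digits = loc_bytes * BITS_PER_BYTE
    meters = digits * METERS_PER_DIGIT
    pools = meters / POOL_PERIMETER_M
    return loc_bytes, digits, meters, pools

loc_bytes, digits, meters, pools = loc_as_printed_binary()
print(f"bytes:  {int(loc_bytes):,}")
print(f"digits: {int(digits):,}")
print(f"meters: {int(meters):,}")
print(f"pool perimeters: {round(pools):,}")
```

Using `Fraction` keeps every intermediate value exact, so any rounding happens only when printing.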
Re:Unit conversion by Anonymous Coward · 2008-11-23 14:46 · Score: 0

Can we get an absolute variable instead?
if it was absolute it wouldn't be variable, now would it?
Re:Unit conversion by zeromorph · 2008-11-23 23:06 · Score: 1

This is an excellent point. No American football player has used his feet since the NFL adopted hoverchairs into the rules in 1974.
If that is enough foot for you, I really want to see the American handball team.

--
"Hannibal's plans never work right. They just work." Amy/A-Team
Re:Unit conversion by dwpro · 2008-11-24 01:38 · Score: 1

Yes, yards are close enough for us, but by God, we're still using English measurements.

--
Millions long for immortality who do not know what to do with themselves on a rainy Sunday afternoon. -- Susan Ertz
Re:Unit conversion by GeneralSense · 2008-11-24 04:20 · Score: 1

The chances of that reference surviving are 0.08333...repeating of course.

That's Easy by Lord+Byron+II · 2008-11-23 05:04 · Score: 4, Interesting

Consider a data set of two numbers, each .5 petabyte big. It should only take a few minutes to sort them and there's even a 50% chance the data is already sorted.

Re:That's Easy by Blakey+Rat · 2008-11-23 05:08 · Score: 5, Insightful

I came here to post the same thing. If they sorted a petabyte of Floats, that might be pretty impressive. But if they're sorting 5-terabyte video files, their software really sucks.
Not enough info to judge the importance of this.

--
Comment of the year
Re:That's Easy by farker+haiku · 2008-11-23 05:16 · Score: 5, Informative

I think this is the data set. I could be wrong though. The article (yeah yeah) says that
In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark.
Which led me to this page that describes the data (and it's available for download).

--
Your sig(k) has been stolen. There is a puff of smoke!
Re:That's Easy by Anonymous Coward · 2008-11-23 05:16 · Score: 5, Informative

From TFA: they sorted "10 trillion 100-byte records"
Re:That's Easy by sakdoctor · 2008-11-23 05:18 · Score: 4, Funny

And yet google don't even convert petabytes to libraries of congress in the google calculator.
Or perhaps I got the syntax wrong.
Re:That's Easy by sakdoctor · 2008-11-23 05:19 · Score: 4, Funny

Huh? This isn't the parent post I was trying to reply to.
Re:That's Easy by nebulus4 · 2008-11-23 05:27 · Score: 0

Consider a data set of just one number, about 1 petabyte in size, then it shouldn't really take much time to sort it, since we already know the data is sorted. Perfect excuse for using 4000 computers to beta-test Duke Nukem Forever.

--
"It would be wrong to refuse to face the fact that everything is fundamentally sick and sad."
Re:That's Easy by mR.bRiGhTsId3 · 2008-11-23 05:51 · Score: 1

I dunno, it depends on what criteria they are using to sort video files. If by file name, then yeah, not so impressive, but if they're sorting based on a measure of relevance of the contained contents, my jaw would drop and my eyes would pop out.
Re:That's Easy by Anonymous Coward · 2008-11-23 07:01 · Score: 0

I suppose in your model those numbers are stored on dvds and you're moving them around with a truck.
Re:That's Easy by JamesP · 2008-11-23 08:20 · Score: 1

Chances are now they are going to ask potential employees being interviewed there how to do it using half the time and one tenth of the machines...

--
how long until /. fixes commenting on Chrome?
Re:That's Easy by jtgd · 2008-11-23 09:07 · Score: 1

For it to take minutes to sort two numbers they would have to be identical for the first few gigabytes. If they differed in the first byte, it would only take a microsecond to sort them.
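A small sketch of that point in Python: a byte-by-byte comparison stops at the first differing byte, so the cost depends on where the two values diverge, not on how long they are. The helper `compare_counting` is invented for this example, and the one-megabyte inputs merely stand in for half a petabyte.

```python
def compare_counting(a: bytes, b: bytes):
    """Compare two equal-length big-endian numbers byte by byte.

    Returns (sign, bytes_examined); sign is -1, 0 or 1.
    """
    assert len(a) == len(b)
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return (-1 if x < y else 1), i
    return 0, len(a)

n = 1_000_000  # stand-in for half a petabyte

# Values that differ in the very first byte: one byte examined.
early = compare_counting(b"\x01" + bytes(n - 1), b"\x02" + bytes(n - 1))

# Values that differ only in the last byte: all n bytes examined.
late = compare_counting(bytes(n - 1) + b"\x01", bytes(n - 1) + b"\x02")

print(early, late)
```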

--
J
Re:That's Easy by halcyon1234 · 2008-11-23 16:14 · Score: 1

In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark.
Which led me to this page that describes the data (and it's available for download).

For the record, you can download a file that will generate the data for it. Because otherwise, well, posting a link to a 1TB file on Slashdot might melt the entire Internet.

--
UTF-8: There and Back Again
Re:That's Easy by Anonymous Coward · 2008-11-23 17:54 · Score: 0

that describes the data (and it's available for download).
All 1 petabyte of it?!
Re:That's Easy by Anonymous Coward · 2008-11-23 22:45 · Score: 0

Consider two peta-sandwiches. I can stack them up pretty fast. Eat that Google!
Re:That's Easy by sincewhen · 2008-11-23 23:18 · Score: 1

That's because you're in the metric section of the thread, LoCs are an imperial measurement!

--
-- Braden's law of data: All data spends some of its lifetime in an excel spreadsheet.

Need to benchmark against the best sorts by Animats · 2008-11-23 05:12 · Score: 4, Insightful

Sorts have been parallelized and distributed for decades. It would be interesting to benchmark Google's approach against SyncSort. SyncSort is parallel and distributed, and has been heavily optimized for exactly such jobs. Using map/reduce will work, but there are better approaches to sorting.

Re:Need to benchmark against the best sorts by perlchild · 2008-11-23 05:49 · Score: 1

And Google is trying to make money off mapreduce(as an api of sorts), so now you're surprised they're using their massive resonance over the market, especially geeks, in order to heighten awareness of their product?
On the other hand, what they're trying to prove is mapreduce's worth, as a workload divider(how to break-up 20PB for sorting), not necessarily how optimal it is in the current situation. They have a better test/sample of mapreduce, but it's a trade secret to them(how it's used to index the pages for google search), so they can't release that. I imagine they'll try another test, until they get a big name signing up to use mapreduce as an api.
Re:Need to benchmark against the best sorts by Anonymous Coward · 2008-11-23 10:10 · Score: 1, Insightful

I guess it's up to SyncSort to run a benchmark and publish the results, no?
Re:Need to benchmark against the best sorts by Pinball+Wizard · 2008-11-23 10:36 · Score: 2, Interesting

Parallel/distributed sorting doesn't eliminate the need for map/reduce, it just helps spread the problem set across machines.
Here's the thing though...it's the distributing of the problem set and the combining of the results that is the hard part - not map/reduce.
Map and reduce are simple functional programming paradigms. With map, you apply a function to a list - which could be either atomic values or other functions. With reduce, you take a single function(like add or multiply, for instance) and use that to condense the list into a single value or object.
That's my understanding of map/reduce from my functional language classes in school and that's exactly how Google describes it. I don't really see what the big deal is with map/reduce in itself.
Like I said, it's distributing the problem among thousands of machines that is the hard part.
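For anyone who hasn't met the two primitives, this is roughly what they look like using Python's built-in `map` and `functools.reduce`. It only illustrates the functional idiom and says nothing about how any distributed framework is built.

```python
from functools import reduce
from operator import add, mul

values = [1, 2, 3, 4]

# map: apply a function to every element of a list
squares = list(map(lambda x: x * x, values))

# reduce: fold a list into a single value with a two-argument function
total = reduce(add, squares)
product = reduce(mul, values)

print(squares, total, product)
```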

--
No, Thursday's out. How about never - is never good for you?
Re:Need to benchmark against the best sorts by Anonymous Coward · 2008-11-23 14:28 · Score: 0

Here's the thing though...it's the distributing of the problem set and the combining of the results that is the hard part - not map/reduce.
Absolutely true, but I think you're confused.
"MapReduce" doesn't refer to your "map" and "reduce" functions; it refers to the framework into which you plug your "map" and "reduce" functions. It spawns workers, handles input sources and output sinks, assigns shards to map workers close to the data, shuffles and sorts, retries on worker death/slowness, reports progress, etc. In short, it's what you referred to as the "hard part".
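To make the split between "your functions" and "the framework" concrete, here is a toy, single-process sketch in Python. The name `run_mapreduce` and its signature are invented for this example and are not Google's or Hadoop's API; the worker spawning, data placement, retries and progress reporting described above are deliberately left out.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Toy stand-in for the framework's job.

    Runs the user's mapper over every input, groups the intermediate
    (key, value) pairs by key (the "shuffle"), then calls the user's
    reducer once per key, in sorted key order.
    """
    groups = defaultdict(list)
    for item in inputs:                       # map phase
        for key, value in mapper(item):
            groups[key].append(value)         # shuffle: group by key
    return {key: reducer(key, groups[key])    # reduce phase, keys sorted
            for key in sorted(groups)}

# The only parts a user of the framework would write:
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    return sum(values)

counts = run_mapreduce(["to be or not", "to be"], word_mapper, count_reducer)
print(counts)
```

Everything inside `run_mapreduce` is the part that becomes hard once the inputs live on thousands of machines.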
Re:Need to benchmark against the best sorts by James+Youngman · 2008-11-23 22:44 · Score: 1

I suspect maybe you don't quite understand how MapReduce works. Take a look at the references section of the MapReduce paper; the paper's authors are well aware of research in the parallel sorting field. In particular their reference 1 is most relevant.
Re:Need to benchmark against the best sorts by ShakaUVM · 2008-11-24 00:22 · Score: 3, Insightful

>>Using map/reduce will work, but there are better approaches to sorting.
It kinda bugs me that Google trademarked (or, at least, what they named their software) after a programming modality that has been in parallel processing for ages. In fact, MPI has a mapreduce() function that, well, does a map/reduce operation. I.e., farms out instances of a function to a cluster, then gathers the data back in, summates it, and presents the results to someone.
It kind of bugs me (in their Youtube video linked in TFA, at least) that they make it seem that this model is their brilliant idea, when all they've done is write the job control layer under it. There's other job control layers that control spawning new processes, fault tolerance, etc., and have been for many, many years. Maybe it's nicer than other packages, in the same way that Google Maps is nicer than other map packages, but I think most people like it just because they don't realize how uninspired it is.
It'd be like them coming out with Google QuickSort(beta) next.
Re:Need to benchmark against the best sorts by poot_rootbeer · 2008-11-24 03:01 · Score: 1

Using map/reduce will work, but there are better approaches to sorting.
I think we can safely assume that the hordes of egghead computer scientists are already exploring the alternate approaches.
Perhaps SyncSort has better theoretical performance, but Map/Reduce yields better results in Google's real-world scenarios? I don't know, it's all way above my head.

Finally... by aztektum · 2008-11-23 05:17 · Score: 5, Funny

I will be able to catalog my pr0n in my lifetime:

Blondes, Brunettes, Red heads, Beastial^H^H^H^H^H "Other"

--
:: aztek ::
No sig for you!!

Re:Finally... by Pugwash69 · 2008-11-23 05:37 · Score: 2, Funny

How do you catalogue the topics? I mean "Clown" and "Monkey" are so different, but something with both elements could be difficult to sort.

--
Pro Coffee Drinker
Re:Finally... by Fumus · 2008-11-23 06:49 · Score: 1

For the love of puppies. Learn to spell "bestiality". Half the population can't spell it right :/
Re:Finally... by Anonymous Coward · 2008-11-23 07:40 · Score: 0

"For the love of puppies" indeed, you sick fuck.
Re:Finally... by Anonymous Coward · 2008-11-23 08:52 · Score: 0

and the other half doesn't want to...
Re:Finally... by troll8901 · 2008-11-23 11:15 · Score: 1

You've just made me waste 2 hours looking at ... (I mean, *ahem* doing RESEARCH on the net), ... you inconsiderate clod!
Re:Finally... by Anonymous Coward · 2008-11-23 17:41 · Score: 0

The fact that you know and care speaks to your interest in the subject.

One ups Yahoo & Hadoop by DaveLatham · 2008-11-23 05:23 · Score: 3, Interesting

It looks like Google saw Yahoo crowing about winning the 1 TB sort contest using Hadoop and decided to one up them!

Let's see if Yahoo responds!

Re:One ups Yahoo & Hadoop by Anpheus · 2008-11-23 05:52 · Score: 1

Hadoop uses MapReduce :) From their site:

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Re:One ups Yahoo & Hadoop by iwein · 2008-11-23 06:28 · Score: 1

With a larger dataset outscaling efficiency becomes more important than sorting efficiency. Sorting 1PB is different than sorting 1TB.
Since we're relating to human proportions today, I'll compare your comparison to comparing running 100m to running a marathon. Apply story telling skills and score.

--
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
Re:One ups Yahoo & Hadoop by gfody · 2008-11-23 06:42 · Score: 1

MapReduce isn't something invented by Google. It's a design pattern.

--

bite my glorious golden ass.
Re:One ups Yahoo & Hadoop by Patrick+May · 2008-11-23 07:21 · Score: 3, Informative

It's older than design patterns. Lisp has provided map and reduce functions for literally decades. It's a standard functional programming idiom.
Re:One ups Yahoo & Hadoop by jollyplex · 2008-11-23 11:49 · Score: 5, Interesting

Exactly. It's unclear if their better time was a software engineering or algorithmic feat, though. Hadoop was able to finish sorting the 1 TB benchmark dataset in 209 s; TFA states Google pulled the same event off in 68 s. The Yahoo blog post you linked to says their compute nodes each sported 4 SATA HDDs. Note TFA mentions Google's 1 PB dataset sort used 48,000 HDDs split between 4,000 machines, or 12 HDDs to a machine. If Google used the same machines to perform their 1 TB sort, then they had 3 times as many HDDs on each compute node, and could probably pull data from storage 3 times as fast. 209 s / 68 s ~ 3.1 -- coincidence, or not? =)
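The back-of-the-envelope ratio can be checked in a few lines of Python. The figures are the ones quoted above; that Google's 1 TB run used the same 12-disk machines is an assumption, not something TFA states.

```python
# Figures quoted in the comment above. The disk count per node for
# Google's 1 TB run is ASSUMED to match the machines of the 1 PB run.
HADOOP_SECONDS, GOOGLE_SECONDS = 209, 68
HADOOP_DISKS_PER_NODE = 4
GOOGLE_DISKS, GOOGLE_MACHINES = 48_000, 4_000

google_disks_per_node = GOOGLE_DISKS // GOOGLE_MACHINES
disk_ratio = google_disks_per_node / HADOOP_DISKS_PER_NODE
time_ratio = HADOOP_SECONDS / GOOGLE_SECONDS

print(f"disks per node: {google_disks_per_node}")
print(f"disk ratio: {disk_ratio:.2f}, time ratio: {time_ratio:.2f}")
```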

Sort? Sort what? by mlwmohawk · 2008-11-23 05:24 · Score: 1, Insightful

One quadrillion bytes, or 1 million gigabytes.

How big are the fields being sorted. Is it an exchange sort or a reference sort?

It is probably very impressive, but without a LOT of details, it is hard to know.

Re:Sort? Sort what? by nedlohs · 2008-11-23 05:34 · Score: 5, Informative

I realize, slashdot..., but maybe you could glance at the article which states:
10 trillion 100-byte records
Re:Sort? Sort what? by neoform · 2008-11-23 05:49 · Score: 1

Odds are they're using the mythical "google algorithm", so they're probably going to keep what they're doing quiet.

--
MABASPLOOM!
Re:Sort? Sort what? by Anonymous Coward · 2008-11-23 05:58 · Score: 0

I believe it was a Bubble-Sort of some type...O(n^2) FTW!
Re:Sort? Sort what? by Anonymous Coward · 2008-11-23 06:06 · Score: 0

One quadrillion bytes, or 1 million gigabytes. How big are the fields being sorted. Is it an exchange sort or a reference sort?
It's a bubble sort.
Re:Sort? Sort what? by Dpaladin · 2008-11-23 06:12 · Score: 5, Funny

Sorting a petabyte sounds pretty impressive, but I don't think it was a whole yotta work.

--
Bad puns gave me bad karma. =(
Re:Sort? Sort what? by mlwmohawk · 2008-11-23 06:21 · Score: 1

10 trillion records across 4,000 computers comes to 2.5 billion records per computer.
It took 6 hours for a computer to sort 2.5 billion records? 250G?
Yawn.
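Spelled out as a rough Python sketch, using only the figures quoted in this thread (10 trillion 100-byte records, 4,000 machines, about 6 hours). It treats the job as perfectly even and ignores replication and the traffic between machines.

```python
# Per-machine load for the sort, from the figures quoted in the thread.
# Assumes a perfectly even split and ignores replication and shuffle.
RECORDS = 10 * 10**12
RECORD_BYTES = 100
MACHINES = 4_000
SECONDS = 6 * 3600

def per_machine_load():
    """Return (records, bytes, bytes_per_second) handled by one machine."""
    records = RECORDS // MACHINES
    data_bytes = records * RECORD_BYTES
    throughput = data_bytes / SECONDS
    return records, data_bytes, throughput

records, data_bytes, throughput = per_machine_load()
print(f"{records:,} records, {data_bytes / 1e9:.0f} GB, "
      f"{throughput / 1e6:.1f} MB/s per machine")
```

That works out to about 11.6 MB/s of sorted output per machine, or roughly 46 GB/s across the cluster. A full sort has to read and write every record at least once and move most of them between machines, so the actual I/O per machine is higher than this figure.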
Re:Sort? Sort what? by nedlohs · 2008-11-23 06:46 · Score: 2, Insightful

You do have to merge them all back together at the end...
But I'm sure you can do better tonight.
Re:Sort? Sort what? by chaim79 · 2008-11-23 06:46 · Score: 2, Insightful

right, so it's 250gb sorted in 6 hours... now where does the sorting and integration of the 4000 250gb blocks of sorted data come in? :)

--
DEMETRIUS: Villain, what hast thou done?
AARON: Villain, I have done thy mother.
Shakespeare invents 'your mom'
Re:Sort? Sort what? by mlwmohawk · 2008-11-23 07:23 · Score: 0, Flamebait

You do have to merge them all back together at the end...
Technically speaking, that's not true. In fact, you wouldn't want to.
Assuming some sort of search paradigm, you'd keep the records on their 4000 separate servers, each server doing its own search functionality, and *only* merge the results of the searches as needed and cache them in the web layer.
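A minimal sketch of "merge only as needed", assuming each server already holds its own results in sorted order. `heapq.merge` from the Python standard library merges lazily, so asking for the first few results never touches the tail of any shard. The shard contents and the function name are made up for the example.

```python
import heapq
from itertools import islice

def first_k_across_shards(shards, k):
    """Lazily merge already-sorted shards and return the first k items."""
    return list(islice(heapq.merge(*shards), k))

# Hypothetical per-server results, each sorted locally.
shards = [
    [3, 8, 21],
    [1, 13, 34],
    [2, 5, 55],
]

print(first_k_across_shards(shards, 4))
```

This fits a search-style workload where only the top few results are wanted. It does not produce the fully ordered output that a sort benchmark requires.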
Re:Sort? Sort what? by mlwmohawk · 2008-11-23 07:30 · Score: 1

right, so it's 250gb sorted in 6 hours... now where does the sorting and integration of the 4000 250gb blocks of sorted data come in?
You wouldn't merge it into one set, you'd keep it all on their own servers and only merge the results as needed.
Re:Sort? Sort what? by Anonymous Coward · 2008-11-23 07:51 · Score: 0

*groan*
Re:Sort? Sort what? by Anonymous Coward · 2008-11-23 09:51 · Score: 0

That's about the size of emacs, right?
Re:Sort? Sort what? by chaim79 · 2008-11-23 13:44 · Score: 1
if you sort 4000 blocks of random data into an actual order, but don't combine the data in any serious way, what you have is tons of overlap in all these separate blocks of data. Just talking about 1-20 on 4 servers:
- Server 1: 1,5,6,9,13
- Server 2: 3,11,12,17,19
- Server 3: 2,4,7,15,20
- Server 4: 8,10,14,16,18
That data may be sorted but it's a mess, and doing this type of sort for a competition is nothing more than getting fast servers and sticking them in the same room and have them all sort random blocks of data, without putting them together it's still fairly random data, you're doing nothing more than single server sort many times over... kinda like lifting 10lb 10 times and saying you can lift 100lb, it may be technically correct but absolutely pointless.
To have any meaning the result of the sort needs to be in the form of:
- Server 1: 1,2,3,4,5
- Server 2: 6,7,8,9,10
- Server 3: 11,12,13,14,15
- Server 4: 16,17,18,19,20
Now the data is fully sorted, now you can actually use the sorted data (I need data item 7, that's on server 2) vs (I need data item 7, is it on server 1? no. is it on 2? no, is it on 3? yes)
You may still keep it all on their own servers, but there is sorting and combining going on between servers in order to get the data properly sorted.
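The 1-20 example can be sketched in Python as a range partition: every key is sent to the server that owns its range, and each server then sorts locally. The boundaries are picked by hand here, which is an assumption for the toy data; a real system would have to derive them from the data somehow.

```python
import bisect

# The four locally sorted servers from the example above.
servers = [
    [1, 5, 6, 9, 13],
    [3, 11, 12, 17, 19],
    [2, 4, 7, 15, 20],
    [8, 10, 14, 16, 18],
]

# Inclusive upper bound of the key range each server should own.
# Hand-picked for this toy data set.
boundaries = [5, 10, 15, 20]

def owner(key):
    """Zero-based index of the server whose range contains key."""
    return bisect.bisect_left(boundaries, key)

def range_partition(servers):
    """Send every key to its owning server, then sort each server locally."""
    result = [[] for _ in boundaries]
    for local in servers:
        for key in local:
            result[owner(key)].append(key)
    return [sorted(part) for part in result]

globally_sorted = range_partition(servers)
print(globally_sorted)
print("item 7 lives on server", owner(7) + 1)
```

After the exchange, finding item 7 is a lookup against the boundaries instead of a question put to every server in turn.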
--
DEMETRIUS: Villain, what hast thou done?
AARON: Villain, I have done thy mother.
Shakespeare invents 'your mom'
Re:Sort? Sort what? by GumphMaster · 2008-11-23 15:28 · Score: 1

It wasn't meant to be a yotta work. They were simply trying to zetta exa-mple to those that express tera at the thought of such giga-ntic numbers. Of course, they might have just done it for giga-ls.
(If bad puns gave you bad karma then I'm going straight to pungatory.)

--
Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
Re:Sort? Sort what? by Paiev · 2008-11-23 17:53 · Score: 1

I was under the impression that it was Bogosort.
Re:Sort? Sort what? by mlwmohawk · 2008-11-24 02:24 · Score: 1

How did someone see this as flamebait?
Re:Sort? Sort what? by Anonymous Coward · 2008-11-24 03:42 · Score: 0

It might be due to you consistently oversimplifying things in each post you made in the thread, but I'm not sure. It might be something else too. My bet is on the oversimplification, though.
Re:Sort? Sort what? by mlwmohawk · 2008-11-24 05:07 · Score: 1

you consistently oversimplifying
That's just the issue, isn't it? I mean, seriously, all the great theoretical work on sorting algorithms is done. No one is going to come along and give us an order of magnitude better performance in a general purpose algorithm. It just isn't going to happen.
So, it *is* a simple problem for which there are ample tools from which to choose. The challenge is not the *sort* but the scale, and that is pretty pedestrian as well. The cluster "divide and conquer" approach is not new; there are many tools from which to choose, or you could write your own. You are still taking the performance of the sort algorithm and spreading it across 4000 machines.
6 hours doesn't sound astonishing in any way.
It reminds me of an EMC press release a number of years ago where they touted their engineering prowess at doubling their RAID storage capacity. Whoo Hoo! BFD, they started buying a newer Seagate SCSI disk with twice the capacity of the previous model. Did that deserve any credit at all?
This sorting thing doesn't sound astounding, it could very well be (and probably is) something pretty pedestrian on 4000 of the latest and greatest machines with quad CPUs, 10,000 RPM disks, and a gigabit switch backplane.

tagging by Hao+Wu · 2008-11-23 05:34 · Score: 4, Interesting

I will be able to catalog my pr0n in my lifetime:

It's not enough to sort by blond, black, gay, scat, etc. Some categories are a combination that don't belong in a hierarchy.

That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.

--
I suggest you read Slashdot

Re:tagging by gardyloo · 2008-11-23 05:53 · Score: 5, Funny

pr0n for Geeks, volume 18: Sorting On-the-Fly
Re:tagging by AbRASiON · 2008-11-23 14:40 · Score: 1

I swear I shouldn't admit this but good lord you're right - I _REALLY_ wanted WinFS to come out and I wanted it to be good.
A database-driven filesystem would be so goddamned useful; it would change the way we work with computers. But noooo, Microsoft screwed up (furthermore, WinFS was a hack on top of NTFS anyhow; I heard it was SQL Server or something, which sounds messy)
Porn is a fantastic example, I realise it's kind of immature but I mean it would be genuinely useful.
Set tags like:
threesome
oral
brunette
money shot
anal
Sad, I know but could genuinely save time.
On to something which might be more applicable in the workplace or something though you could set documents for example like so:
Finance
Letter
Superannuation (401k for the US readers)
Invoice
Final example, perhaps for your photo collection:
Family
Girlfriend
Holiday
Christmas
Drinks
Road trip
Fishing
Really, tags on files would be lovely, difficult to manage but lovely.
I think my real concern with this is the ability for the tags to STAY WITH the file, even when copied to USB key or given to someone.
It's a pretty complex little problem and I don't know of a solution to this without having the operating system constantly adding / editing the contents of a file, which could cause all kinds of problems for the file itself (especially triggering virus scanners etc)
(You could do a 'thumbs.db' style index per folder but that's not ideal either)
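As a rough idea of what querying by tags could look like, here is a small in-memory Python sketch. The class `TagIndex`, its methods and the file paths are all hypothetical, and it does nothing about the harder problem of keeping tags attached to a file when it is copied elsewhere.

```python
from collections import defaultdict

class TagIndex:
    """In-memory tag index: tag -> set of file paths (illustrative only)."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach one or more tags to a file path."""
        for t in tags:
            self._by_tag[t.lower()].add(path)

    def find(self, *tags):
        """Return files carrying all of the given tags, sorted by path."""
        if not tags:
            return []
        sets = [self._by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*sets))

index = TagIndex()
index.tag("2008/super.pdf", "Finance", "Letter", "Superannuation")
index.tag("2008/acme-invoice.pdf", "Finance", "Invoice")
index.tag("photos/xmas.jpg", "Family", "Christmas", "Holiday")

print(index.find("Finance"))
print(index.find("Finance", "Invoice"))
```

No tag sits above another here: a file shows up under any combination of its tags, which is the part a folder hierarchy can't do.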
Sorry to go off topic but I find it a fantastic concept for filesystems and computers, a real potential leap forward but likely difficult to implement properly as well as to backup and transport to new machines when formatting or migrating servers etc, none the less I do wish it was here now, it's a bit of a shame it's taking so long.
Anyhow back to E:\Porn - I'll keep using folders the old-fashioned way.
*zzzzziiipp*
Re:tagging by Rockoon · 2008-11-23 15:57 · Score: 1

All my prOn can be sorted into various scatagories. Softscat, Hardscat, and Holyscat.

--
"His name was James Damore."

Its About Time.... by Anonymous Coward · 2008-11-23 05:36 · Score: 2, Funny

Finally... A system with enough power to run vista efficiently.

Re:Its About Time.... by poetmatt · 2008-11-23 05:55 · Score: 3, Informative

Are you sure? It wasn't marked Vista capable.
Re:Its About Time.... by peragrin · 2008-11-23 06:09 · Score: 3, Funny

Not only that, the extra processors aren't covered under the EULA and require special extra licenses.

--
i thought once I was found, but it was only a dream.
Re:Its About Time.... by not+already+in+use · 2008-11-23 13:46 · Score: 0, Flamebait

My computer runs Vista just fin.... Wait a second. I see what you did there! You are making a joke! That is hilarious! I have never heard this joke before. How witty and original! I got one for you... watch this.... M$. See that! I put a $ instead of an S! You saw it here first!

--
Similes are like metaphors

Not impressive... by g0dsp33d · 2008-11-23 05:41 · Score: 4, Funny

Not a big deal, that's just the data they have on you.

--
lol: You see no door there!

Is it new data by moteyalpha · 2008-11-23 05:49 · Score: 1

As memory gets cheaper and I can store more locally, what I really need to know is whether it is unique or new to me. I can read Frits P0st a million times and never get tired of it. There was a very good article on slashdot the other day and it got over 2000 comments, some of which were very insightful and useful. I need a way to know for myself what is new to me. It would be nice if the browser interacted more with Google to help me with that. I just looked, and RTFM is indexed 4.5 million times which of course includes xkcd#293, and that is really all I need to know.

0s and 1s by johno.ie · 2008-11-23 05:56 · Score: 2, Funny

That's a lot of computing power to use just to get 4,000,000,000,000,000 0s and 4,000,000,000,000,000 1s.

--
872835240

BCA's? by Anonymous Coward · 2008-11-23 06:01 · Score: 0

Can we convert that to number of bad car analogies?

Re:BCA's? by SEWilco · 2008-11-23 14:10 · Score: 2, Funny

Can we convert that to number of bad car analogies?

Sure, it's -4.15 Edsels.
Re:BCA's? by cerberusss · 2008-11-23 23:19 · Score: 1

Can we convert that to number of bad car analogies?
Sure, it's -4.15 Edsels.
That's rounding it off a bit generously, don't you think?

--
8 of 13 people found this answer helpful. Did you?

nice one, Google... by Tastecicles · 2008-11-23 06:03 · Score: 2, Funny

...fancy doing my mp3 collection?

--
Operation Guillotine is in effect.

Re:nice one, Google... by andy_t_roo · 2008-11-23 12:27 · Score: 1

only if each mp3 is 100 bytes ...
(they sorted 10 trillion 100-byte records)
Re:nice one, Google... by Tastecicles · 2008-11-24 07:57 · Score: 1

hmmm... not having audited them for a long while, I glance at my shelf and see eight 320GB pocketdrives and know they're all jam packed. Average bitrate is 160 and duration is 7m, so figure an average filesize of 10MB. There's a lot of live stuff in there as well as my entire CDA collection and a fair few audiobooks and vinyl/minicassette rips. That's 31,000 tracks per drive, or 248,000 tracks total (give or take). With my hardware and assuming I could be arsed, that's a month's work, although I really should get on with it instead of posting obscenely large numbers to slashdot...

--
Operation Guillotine is in effect.

Libraries of congress? by TinBromide · 2008-11-23 06:03 · Score: 2, Insightful

First of all, this isn't a straight-up "Libraries of Congress" (better known, and mentioned in prior posts, as a LoC). It's the web archiving arm of the LoC. I call for the coining of a new term, WASoLoC (Web Archival System of Library of Congress), which can be defined as X * Y^Z = 1 WASoLoC, where X is some medium that people can relate to (books, web pages, documents, tacos, water, etc.), Y is a volume (Libraries, Internets, Encyclopedias, end to end from A to B, swimming pools, etc.) and Z is some number that marketing drones come up with because it makes them happy in their pants.

Honestly, how am I supposed to know what "...the amount of archived web data in the US Library of Congress as of May 2008" looks like!? I've been to the Library of Congress, I've seen it, it's a metric shit-ton of books (1 shit-ton = shit * assloads^fricking lots), but I have no clue what the LoC is archiving, what rate they're going at it, and what the volume of it is.

--
Is it sad that I am more likely to recognize you and your posts by your sig than your name or UID?

Re:Libraries of congress? by Anonymous Coward · 2008-11-23 17:05 · Score: 0

If you can solve that.. then they can shut down the LHC

Wow by ice_nine6 · 2008-11-23 06:04 · Score: 1

That must have taken a lot of monkeys.

Re:Wow by tomhuxley · 2008-11-24 16:27 · Score: 1

That must have taken a lot of monkeys.
Pigeons, son, Google uses pigeons.

clever strategy by stimpleton · 2008-11-23 06:12 · Score: 2

Good.

They clearly have the ability to respond to emergencies. And this puts it out there that they can...

e.g.:
1) Foot n mouth outbreak in cattle
2) A supplement to census data
3) Finding information of dissidents/traitors(bloggers)

--

In post Patriot Act America, the library books scan you.

Re:clever strategy by Anonymous Coward · 2008-11-23 07:11 · Score: 0

4) Profit!

Re:How is this flamebait? by iwein · 2008-11-23 06:14 · Score: 1, Interesting

There is a thing called meta humor, I'll give you an example:

You got baited into a flame in a very elaborate scheme to mock your intelligence (or lack thereof).

There is no category meta-flamebait, so you're proving the mods right I'd say.

I hope this helps.

--
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.

20,111 Servers ?? by johnflan · 2008-11-23 06:17 · Score: 1, Interesting

With a little bit of Excel: if it takes 4,000 servers 362 minutes to sort a 1PB job, it takes 1,440 minutes (24 hours) on 20,111.11 servers to sort 20PB (if it was just plain sorting they were doing). And just as a side note, from their numbers, one of their servers can process 741.5 MB per minute!
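The spreadsheet arithmetic above can be reproduced in a few lines. This is only a sketch under the post's own assumptions: sorting is taken to scale linearly with data size, and 1 PB is taken as 1024^3 MB (the convention that yields the 741.5 figure). The function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures in the post.
# Assumptions (from the post, not from Google): linear scaling, 1 PB = 1024**3 MB.
SERVERS = 4000
MINUTES = 362
MB_PER_PB = 1024 ** 3


def servers_needed(petabytes, minutes_available):
    """Servers required to sort `petabytes` in `minutes_available`, assuming linear scaling."""
    server_minutes_per_pb = SERVERS * MINUTES
    return petabytes * server_minutes_per_pb / minutes_available


def mb_per_server_minute():
    """Average megabytes sorted by one server in one minute during the 1 PB run."""
    return MB_PER_PB / (SERVERS * MINUTES)


print(round(servers_needed(20, 24 * 60), 2))  # 20111.11
print(round(mb_per_server_minute(), 1))       # 741.5
```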

Re:20,111 Servers ?? by chaim79 · 2008-11-23 06:50 · Score: 3, Insightful

Yah, but you gotta wonder at the computing cost of integrating all those datasets into one complete sorted block of data. It could be that those servers can sort at 1GB per minute but the overhead for combining is 25% of the computing time.

--
DEMETRIUS: Villain, what hast thou done?
AARON: Villain, I have done thy mother.
Shakespeare invents 'your mom'
Re:20,111 Servers ?? by johnflan · 2008-11-23 07:04 · Score: 2, Informative

Agreed, but even if it takes 40,000 servers, with losses and extra overhead, to calculate their daily workload, it makes you wonder what their other estimated 410,000 servers are doing. (2006 estimate)
Re:20,111 Servers ?? by smallfries · 2008-11-23 07:21 · Score: 3, Insightful

Oh dear. 4000*362 ~= 1440*20111 / 20. So you assumed that the sorting would scale linearly. fail.
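To put a rough number on the non-linearity: a comparison sort does on the order of n log n comparisons, so 20 times the data costs somewhat more than 20 times the work. The sketch below uses the 10 trillion records mentioned elsewhere in the thread and an idealised n*log2(n) cost model. It ignores disk, network and merge overhead, which may well dominate in practice, so treat it as a lower bound on the penalty rather than a prediction.

```python
import math

RECORDS_1PB = 10 * 10**12  # 10 trillion 100-byte records, per the thread


def comparison_cost(n):
    """Idealised n*log2(n) comparison count; ignores I/O, merging and network."""
    return n * math.log2(n)


def scaling_factor(multiple, base=RECORDS_1PB):
    """How much more comparison work `multiple` times the data needs."""
    return comparison_cost(multiple * base) / comparison_cost(base)


factor = scaling_factor(20)
print(f"20x the data costs about {factor:.1f}x the comparisons, not 20x")
```

Under this model the penalty over linear scaling is around ten percent.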

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php

just in perspective... by wjh31 · 2008-11-23 06:38 · Score: 2, Interesting

I make this about 48GB/s. My hard drive manages about 20MB/s, my mid-range RAM manages only ~6.4GB/s, and top-end RAM will reach only ~13GB/s (according to Wikipedia), so even ignoring the ability to process that much data in that time, the ability to simply move that much data is quite impressive (at time of print; may not hold one year down the line).

Re:just in perspective... by Kent+Recal · 2008-11-23 10:54 · Score: 1

They probably didn't hold the source data on a single machine in the first place (or did Seagate break the petabyte barrier yet?).
48GB/s broken down over 4000 servers boils down to "only" 12 Mbyte/s.
So indeed, impressive aggregate performance, but the individual nodes were "only" sustaining about 100 Mbit/s each, roughly a tenth of Gigabit Ethernet's line rate.
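For anyone who wants to check these throughput figures, here is a small sketch. It assumes a binary petabyte (2^50 bytes), the 362-minute run time and 4,000 servers quoted earlier in the thread, and it counts only a single pass over the data. A real sort reads, shuffles and writes the data, so actual per-node traffic would be higher.

```python
PETABYTE = 2 ** 50   # bytes; binary prefix assumed
SECONDS = 362 * 60   # 362 minutes, as quoted earlier in the thread
SERVERS = 4000

aggregate = PETABYTE / SECONDS    # bytes per second across the whole cluster
per_server = aggregate / SERVERS  # bytes per second per machine

print(f"aggregate:  {aggregate / 2**30:.1f} GiB/s")
print(f"per server: {per_server / 2**20:.1f} MiB/s")
print(f"per server: {per_server * 8 / 1e6:.0f} Mbit/s")
```

Gigabit Ethernet tops out at 1,000 Mbit/s (125 MByte/s), so a single-pass average per node sits well below that.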

Holy shit... by Taken07 · 2008-11-23 06:50 · Score: 1

That's a lot of data...

Re:Holy shit... by troll8901 · 2008-11-23 11:12 · Score: 1

Loop 10,000,000,000,000 times
    Generate 100 bytes of random data
    Store into database
End of loop
;)

I'm surprised by TheSpoom · 2008-11-23 07:37 · Score: 0, Redundant

...that Google hasn't implemented the Libraries of Congress metric into their auto-calculator.

C'mon Google, get on the ball(s)!

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs

I assume it was... by Anonymous Coward · 2008-11-23 07:45 · Score: 0

bubblesort?

Is this our standard of measurement? by moniker127 · 2008-11-23 08:19 · Score: 1

Just like I measure my distance to work (452.75 football fields), I measure the data on my computer by Libraries of Congress?

Amazing feat... by Duncan3 · 2008-11-23 08:32 · Score: 5, Funny

Today from Google, the god of all things and doer of all things good in the universe, many millions of dollars in computer equipment were able to sort lots of things, in about the amount of time you would think it would take for millions of dollars of equipment to sort things.

In other news, a woodchuck was found chucking wood as fast as a woodchuck could chuck wood.

Congrats Google, you have a HUGE data set, and an even bigger wallet.

--
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/

MapReduce = map + reduce by Bitmanhome · 2008-11-23 08:45 · Score: 3, Interesting

If you feel the urge to play with MapReduce (or read the paper), you don't need a fancy Linux distro to do it. MapReduce is simply the map() and reduce() functions, exactly as implemented in Python. Granted, Google's implementation can work with absurdly large data sets, but for small data sets, Python is all you need.

--
Not that this wasn't entirely predictable.

Re:MapReduce = map + reduce by boyter · 2008-11-23 09:39 · Score: 3, Informative

True, but not quite the point. The map and reduce functions, as you say, are implemented in Python (and a great many other languages), but what makes MapReduce special is that you replace the map function with one that distributes the work out to other computers. Because any map function can be run in parallel, you get a speed boost for however many machines you have (dependent on network speeds etc.).
So yeah, you can do it in Python, but you aren't going to be breaking any records until you implement your own infrastructure that lets you span it out to thousands of computers. The nice thing is that you don't need to write any new code to take advantage of the speed when you do.
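A toy illustration of that last point, using only the Python standard library: the function being mapped stays the same, and only the thing doing the mapping changes. Threads in one process stand in for remote machines here, which is an assumption made purely for the sake of a runnable example. It says nothing about how Google's system actually ships work around.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The per-item work; it never needs to know where it runs."""
    return x * x


data = list(range(1, 6))

# Sequential: the built-in map.
sequential = list(map(square, data))

# "Distributed": same call shape, but the work is handed to a pool of workers.
# Threads stand in for remote machines in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))

print(sequential == parallel)  # True: pool.map returns results in input order
```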
Re:MapReduce = map + reduce by Varun+Soundararajan · 2008-11-23 09:54 · Score: 1

Also, one of the biggest problems in distributing tasks is *handling failures*. MR does it as part of the library, which greatly simplifies distributed computing tasks.
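As a very rough picture of what "the library handles failures" can mean, here is a sketch of a wrapper that simply re-runs a task when a worker dies. Everything in it is hypothetical: the helper names, the `RuntimeError` used to signal a dead worker, and the scripted failure of input 3. A real framework would also reschedule the task on a different machine.

```python
def run_with_retries(task, arg, max_attempts=3):
    """Re-run a failed task, the way a scheduler might reassign it to another worker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


# Hypothetical failure script: input 3 fails twice before succeeding.
failures_left = {3: 2}


def flaky_square(x):
    """Squares x, but simulates a worker crash while failures remain for x."""
    if failures_left.get(x, 0) > 0:
        failures_left[x] -= 1
        raise RuntimeError(f"worker died while processing {x}")
    return x * x


results = [run_with_retries(flaky_square, x) for x in [1, 2, 3, 4]]
print(results)  # [1, 4, 9, 16]
```

The caller only ever sees the finished list, which is the simplification the comment is describing.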
Re:MapReduce = map + reduce by Pinball+Wizard · 2008-11-23 10:42 · Score: 2, Informative

Exactly. There is nothing special to map and reduce.
Here's an example. Map and reduce are functional programming tools that work with lists. So we'll start with a simple list.
1 2 3 4 5
Now we'll take a function - x^2, and map it to the list. The list now becomes:
1 4 9 16 25.
Now, we'll apply a reduce function to our list to combine it to a single value. I'll use "+" to keep it simple. We end up with:
55
And that is pretty much all there is to map and reduce.
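The same walk-through in Python, for anyone who wants to run it. Note that in Python 3 `reduce` lives in `functools` rather than being a built-in.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4, 5]

# Map x^2 over the list.
squares = list(map(lambda x: x ** 2, numbers))

# Reduce the mapped list to a single value with "+".
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 55
```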

--
No, Thursday's out. How about never - is never good for you?
Re:MapReduce = map + reduce by Pinball+Wizard · 2008-11-23 10:47 · Score: 1

But it's the distributing part that is special, not the map/reduce part.
You're basically just dividing up a huge list and sending each part to a different machine. Tacked on to each list are the map and reduce functions themselves so each machine knows what to do with the list.
It's the parallelization of the problem that is the hard part. Map does not mean the mapping of the problem to thousands of machines - it means the mapping of a function to a list, and that is not a terribly difficult problem.

--
No, Thursday's out. How about never - is never good for you?
Re:MapReduce = map + reduce by adpowers · 2008-11-23 17:33 · Score: 2, Informative

Almost, but not quite. MapReduce has a slightly different format than just map() and reduce(). Here is the signature of map and reduce from a theoretical functional language:
map(): A* -> B*
reduce(): B* -> C
Whereas in MapReduce:
map: (K, V)* -> (K1, V1)*
reduce: (K1, (V1)*)* -> (K2, V2)*
I think that is mostly accurate. Read a more accurate/detailed report in MapReduce revisited [PDF].
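A single-machine sketch of those key/value signatures, using word counting as the example. The names `map_fn`, `reduce_fn` and `map_reduce` are made up for illustration, and the grouping step in the middle is a plain dictionary standing in for the shuffle a real framework performs across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """(K, V) -> list of (K1, V1): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """(K1, [V1]) -> list of (K2, V2): sum the counts for one word."""
    return [(key, sum(values))]


def map_reduce(records, map_fn, reduce_fn):
    """Run map over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k1, v1 in map_fn(key, value):
            groups[k1].append(v1)
    output = []
    for k1 in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reduce_fn(k1, groups[k1]))
    return output


lines = [(0, "to sort or not to sort"), (1, "sort it")]
print(map_reduce(lines, map_fn, reduce_fn))
# [('it', 1), ('not', 1), ('or', 1), ('sort', 3), ('to', 2)]
```

The difference from plain map() and reduce() is visible in the types: reduce runs once per key over that key's values, rather than once over the whole list.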
Re:MapReduce = map + reduce by TeknoHog · 2008-11-24 02:33 · Score: 1

IMHO it's noteworthy that the language contains a keyword for parallel operations. So you can start using map() right now, and when the implementation and hardware improves, your existing code will scale up. (I've experienced a similar development with matrix operations in Fortran.)

--
Escher was the first MC and Giger invented the HR department.

But.... by VonSkippy · 2008-11-23 09:07 · Score: 1

It really only took Two Hours - the rest of the time was used stuffing in paid ads.

Clear the trolls out by troll8901 · 2008-11-23 10:52 · Score: 1

Whoever mods you "-1" is getting an extra small turkey this Thursday!

And the AC troll shall get only turkey bones and lumps of coal!

Re:How is this flamebait? by Anonymous Coward · 2008-11-23 10:53 · Score: 0

You should know, that people always block something that would destroy their reality or self-respect too much. It does not matter if it's true. This is, because else, they would become insane. Ask a studied psychologist about it.

That's the height of stupidity and maybe arrogance that a person would go insane before they would build their reality and self-respect on a more solid foundation, such as truth and the capability of dealing with it. I reject this notion whether it's popular or not and whether it comes from "studied psychologists" or not. Some of us are willing and able to think for ourselves.

If you really want him to change his mind, you must form your words in a way, that allows him to still accept himself and his reality. The best way is, to tell him he can make his life even better by changing his mind that way, and that not he was wrong but he fell for a trick or something like that. Yeah, it's distorting reality. But it's his distorted reality that you use. And do you want it to work, or just flame? :)

So the best way to challenge blatant failure and inaccuracy is to incorporate that failure and that inaccuracy in my own response? I just don't buy it. It's one of those things that sounds so off-kilter that people probably assume it must contain some obscure wisdom, even though it's self-evident that it doesn't. Let's get one thing straight: if someone tells you how you failed and how you can do better, how you were wrong and how you may correct that wrong, they are doing you a gigantic favor. You can decide not to appreciate that favor and you can decide not to experience gratitude on the basis of some petty grievance, like maybe that person cares about truth far more than he/she cares about whether you are offended by how it is presented. But if you decide not to correct your failures because your fevered inflated ego couldn't get past the fact that there are human beings who won't cater to your every sensibility, well, that doesn't harm the person who tried to tell you something. It is strictly your loss.

I know that our instinct tells us to back-attack like you did. But this does not work if he does not respect your opinion and does not listen to you. The above way on the other hand works nicely if done right. You can even make new friends out of enemies and fix their distortions somewhat.

That wasn't an attack of any sort. If you think it is, be grateful and celebrate the fact that you have never truly been attacked and while you celebrate this, try entertaining the notion that you might not be the best authority on what does or does not constitute an attack. It was a disagreement. Also, truth is not a matter of opinion or consensus or respect; far more important than these things is a simple idea called falsifiability. I don't visit Slashdot.org so I can make friends and frankly, I would consider myself a sad person if I did. Furthermore, I am under no illusion that I am going to fix anyone's "distortions". All I can do is point them out; it is up to that individual to make the decision and carry out the effort of fixing his or her own "distortions". That sort of personal enrichment is not something another person can give to you. Not the real thing, anyway.

By the way: Every human has distortions. But an alpha-male knows how to convince others, that his views are the best. ;)

And what has that gained us? Consumerist culture? Marketing? The nanny state? The easily offended? The apathy and disregard for real truth and real well-being that is so widespread, indeed, the same one you are trying to convince me to embrace so that I may feel like I've been more convincing? The only thing worse than the initial problem, which is mostly ignorance, are the ten thousand excuses for it and with them, the expectation that healthy people who need no one to think for the

MapReduce by laddiebuck · 2008-11-23 11:24 · Score: 1

It's always bugged me that they've been heralding MapReduce, something any functional programmer has known for the past 50 years, as something revolutionary and new. The worst of it is how all the self-styled geeks, who by rights ought to be familiar with the concept, have been lapping it all up.

Re:MapReduce by adpowers · 2008-11-23 17:43 · Score: 4, Informative

The individual functions map and reduce are quite standard. The innovation here is the systems work they've done to make it work on such a large scale. All the programmer needs to worry about is implementing the two functions, they don't have to worry about distributing the work, ensuring fault tolerance, or anything else for that matter. That is the innovation.
They mention in the article that if you try and sort a petabyte you WILL get hard disk and computer failures. Hell, you can only read a terabyte hard disk a few times before you encounter unrecoverable errors. The system for executing those maps and reduces is what is important here. The important parts are in the design details, such as dealing with stragglers. If you have 4000 identical machines, you won't necessarily get equal performance. If a few of those machines have a bit flipped and started without disk cache, they might see a huge decrease in read/write performance. The system needs to recognize this and schedule the work differently. That can make a huge difference in execution time. If you graph the percentage complete of a MR job, you'll often see that it quickly reaches 95% and then plateaus. The last 5% may take 20% of the time, and good scheduling is required to bring this time down.
But like I said, the innovation isn't in the idea of using a Map and Reduce function, it is the system that executes the work.
Re:MapReduce by Just+Some+Guy · 2008-11-24 05:05 · Score: 0, Flamebait

Hell, you can only read a terabyte hard disk a few times before you encounter unrecoverable errors.
Umm, what?

--
Dewey, what part of this looks like authorities should be involved?
Re:MapReduce by adpowers · 2008-11-24 06:01 · Score: 1

I should have been more specific/clear. If you do a full read of a terabyte disk a dozen times, you are likely to see an unrecoverable read error:
"Typically, [Unrecoverable Error Rate (UER) for read operations] will be 1 per 10^14 bits read for consumer class drives and 1 per 10^15 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read."
That quote is from a Sun blog that has lots of information about Mean Time To Data Loss. His other posts are interesting as well.
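The arithmetic behind the 12.5 TB figure, plus a rough probability estimate, can be sketched as follows. The assumptions are mine and are loudly simplified: a decimal 1 TB drive, the quoted consumer-class rate treated as an exact average rather than a spec-sheet bound, and independent errors at a constant rate (a Poisson approximation). Real drives may do considerably better or worse.

```python
import math

BITS_PER_ERROR = 10 ** 14  # consumer-class figure from the quote above
DISK_BYTES = 10 ** 12      # a 1 TB drive; decimal terabyte assumed

bytes_per_error = BITS_PER_ERROR / 8
full_reads_per_error = bytes_per_error / DISK_BYTES


def p_at_least_one_error(full_reads):
    """Chance of one or more unrecoverable errors, assuming independent errors."""
    bits_read = full_reads * DISK_BYTES * 8
    return 1 - math.exp(-bits_read / BITS_PER_ERROR)


print(bytes_per_error / 1e12)                 # 12.5 (TB read per expected error)
print(full_reads_per_error)                   # 12.5 full reads of the disk
print(round(p_at_least_one_error(1), 3))      # 0.077
print(round(p_at_least_one_error(12), 3))     # 0.617
```

Under these assumptions a single full read is unlikely to hit an error, which fits with owners of terabyte drives not having seen one, while a dozen full reads makes it more likely than not.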
Re:MapReduce by Just+Some+Guy · 2008-11-24 07:32 · Score: 1

As the owner of terabyte drives who hasn't had unrecoverable errors (yet), I was expressing my skepticism that such a thing was inevitable after only a few reads. That's not flamebait, but a request for further support of what I considered to be an unlikely statement.

--
Dewey, what part of this looks like authorities should be involved?
Re:MapReduce by Anonymous Coward · 2008-11-24 12:08 · Score: 0

When's the last time you read the whole thing?

This did it by Anonymous Coward · 2008-11-23 12:53 · Score: 0

I cannot believe how something along the trivial lines of parallelizable problems parallelize made it into /. twice. I now consider /. broken, and will remove it from my bookmarks. I guess a massive company like google has enough manpower to vote crap up here, so thinking people will need to find some backwaters for talking.

Also I was always waiting for the point in time when google would start to be taken over by the usual business s... hats. Negative karma is growing.

I love how people are moaning over this. by Anonymous Coward · 2008-11-23 13:10 · Score: 0

It's not that they are saying "OH MY GOD MAPREDUCE IS SO AWSUM WE TOTALLY OWNED YOU N00BS"

They just wondered how long it would take to sort a 1PB dataset, after they saw Yahoo! perform a sort with the 1TB set.

All it is showing is improvements of technology.

We can now use these figures as a benchmark in 5 / 10 years to see how much better it has gotten since then.

I, for one, welcome our new PB sorting personal supercomputer overlords.

Re:Kudos to Niggers by Nazlfrag · 2008-11-23 14:08 · Score: 1

Try the low bandwidth view and/or disabling the dynamic comments, then filter at 1. Oh and hand in your geek card for not being able to circumvent censorware at work.

hadoop by voxner · 2008-11-23 15:45 · Score: 1, Interesting

I developed a search engine in C++ in grad school and I remember applying the concepts of the vector space model and the Okapi model in it. Looking at the relatively snail's-pace speed I got from the C++ STL data structures (with limited hardware resources), I still can't help but wonder how Google pulls off the "magic" of split-second results.
MapReduce is a powerful concept and a good starting point is Hadoop. Google runs its own proprietary MapReduce implementation on top of its own file system, GFS.
We once had a session with Mr. Raghavan, who was the Yahoo search chief. He advised that students in general should learn to work on projects that involve a large corpus of data, and that PhDs are especially valued in the field.

Thats almost as BIG as... by Anonymous Coward · 2008-11-23 17:44 · Score: 0

That's almost as big as my downloaded MP3 library!

Re:Kudos to Niggers by mini+me · 2008-11-24 03:41 · Score: 1

Why don't you address the problems with your "censorship" program instead? It appears to be completely broken.

USSR by Anonymous Coward · 2008-11-24 04:34 · Score: 1, Funny

In Soviet Russia, 6 petabytes sort YOU in ONE hour.

Re:USSR by triso · 2008-11-25 12:40 · Score: 1

In Soviet Russia, 6 petabytes sort YOU in ONE hour.
In Soviet Russia, jokes are allowed to die after being repeated endlessly.

Not even close. by Hillgiant · 2008-11-24 05:58 · Score: 1

Dude, that's barely half my porn stash.

--
-

New Unit for Nerds by noppy · 2008-11-24 07:06 · Score: 1

Next time we have to reference amounts of data sorted as n Google Sort.

Re:Kudos to Niggers by Anonymous Coward · 2008-11-26 09:09 · Score: 0

Fucking piece of shit negroes.
