Microsoft Research Introduces Record-Beating MinuteSort Tech
mikejuk writes "A team from Microsoft Research has taken the lead in the MinuteSort data sorting test using a specially devised technology: Flat Datacenter Storage. The figures are impressive: 1401 gigabytes in 60 seconds, using 1033 disks across 250 machines. Not only is this three times as much as the previous record, but it also uses only one-sixth of the hardware resources, according to a blog post about the test from Microsoft. One thing that's interesting about the success is the technology used. While solutions such as Hadoop and MapReduce are traditionally used for working with large data sets, Microsoft Research created its own technology called 'Flat Datacenter Storage,' or FDS for short. This isn't just academic research, of course. The team from Microsoft Research has already been working with the Bing team to help Bing accelerate its search results, and there are plans to use it in other Microsoft technologies."
Their support for research and innovation is top-notch. They are pretty much the only one of the large companies that fund this kind of research and they fund it with billions. Their work does lots of good for the world. Good job guys.
Sorted by Microsoft
between their ass and a hole in the ground..
/smart people working for dumb people working for smart people
...yet MinuteSort still takes a minute!
My God can beat up your God. Just kidding...don't take offense. I know there's no God.
"using 1033 disks across 250 machines"
They sorted a terabyte and a half in 60 seconds using 1000 disks and 250 machines.
You realize that sort is one of those things where I can simply pre-section the data into smaller chunks (e.g. a,b,c,da-df,dg-dz,...etc) and then sort the smaller chunks?
With 192MB PCs it's trivial to pre-filter during the load, then sort in memory.
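For anyone who wants to see the pre-sectioning idea spelled out, here is a minimal single-machine sketch in Python. The function name `partition_sort` and the split keys are made up for illustration; a real distributed sort would pick its boundaries by sampling the input and would place each bucket on a different machine.

```python
import bisect
import random

def partition_sort(records, boundaries):
    """Sort by splitting records into key ranges, then sorting each range.

    `boundaries` is a sorted list of split keys (hypothetical values here;
    a real system would choose them by sampling the input).
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        # bisect_right tells us which key range this record falls into.
        buckets[bisect.bisect_right(boundaries, rec)].append(rec)

    result = []
    for bucket in buckets:
        # Each bucket is sorted on its own. Buckets are already ordered
        # relative to one another, so concatenating them is fully sorted.
        bucket.sort()
        result.extend(bucket)
    return result

random.seed(0)
words = ["".join(random.choice("abcdef") for _ in range(4)) for _ in range(1000)]
out = partition_sort(words, ["b", "c", "da", "dg", "e"])
```

The catch is that the pre-sectioning pass still has to move every record to its bucket, so on a cluster the network and disks do most of the work rather than the in-memory sort.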
The team from Microsoft Research has already been working with the Bing team to help Bing accelerate its search results, and there are plans to use it in other Microsoft technologies.
So Bing is going to scrape their search results from Google *and* other search engines? :-)
It must have been something you assimilated. . . .
The team from Microsoft Research has already been working with the Bing team to help Bing accelerate its search results,
Ah, so now you can get irrelevant results even faster!
Did they actually do anything, or is this just from them using better hardware? The article says the record holder is Yahoo from 2009. With 3 years of better technology and an unmentioned (from what I read) amount of funds, I would assume anyone could beat it.
And the filesystem was remote...
I would almost bet it's the updated SMB protocol on Server 2012.
Did they actually do anything, or just build a machine using today's hardware and lots of funding? A team from Yahoo got the record in 2009. Hardware has changed a lot in the 3 years, and when money is not an object, couldn't anyone do about the same?
It only works using IE6.
More irrational Microsoft hatred from the peanut gallery. Interesting accomplishment from Microsoft Research (a group which has produced all kinds of useful advances in computing and software development, and which has very little to do with shipped products like Outlook, IE6, etc.); Average /. luser interpretation? LOL SHILL ARTICLE FROM TEH MICRO$OFT FAGGORTZ YOU SUCK LOL.
Good to see that a nerd site is inundated with droves of empty-headed group-think religious fanatics!
When you're done masturbating to your imaginary universe, maybe you'd like to sit down with the likes of Simon Peyton-Jones and discuss some of the finer points of the terrible work he and his peers have been doing.
Baa-hahahaha. Right.
The important part is not that this is a new approach, but that they beat the previous record using less hardware.
Non-expert here, but they require a "full bisection bandwidth network" in which every computer gets full bandwidth to every other computer.
I understand that communication adds time and complexity, but doesn't this number seem quite low? I wrote a simple External Sort program in Java that can sort 1GB in 2 mins restricted to 150MB of memory on a mid-range laptop, so 1401GB (and especially the previous record of 500GB) doesn't seem that impressive! I understand that the complexity is more than linear, but I really would've thought the world record would be higher than that...
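For reference, an external sort of the kind described above can be sketched roughly like this. This is Python rather than the Java program mentioned, and it is not that program: the function name `external_sort` and the `chunk_size` parameter are hypothetical. It assumes the input is an iterable of newline-terminated strings, sorts fixed-size chunks in memory, spills each chunk to a temporary file, and then does a streaming k-way merge.

```python
import heapq
import os
import tempfile

def external_sort(lines, chunk_size):
    """Yield newline-terminated strings in sorted order using bounded memory.

    At most `chunk_size` lines are held in memory while the sorted runs
    are being created.
    """
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to disk as one sorted run.
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge performs the k-way merge lazily, reading the runs
        # line by line instead of loading them back into memory.
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

data = [f"{n:05d}\n" for n in (42, 7, 19, 3, 88, 61, 5)]
result = list(external_sort(data, chunk_size=3))
```

On one laptop this is limited by a single disk, which is read and written twice. The benchmark numbers depend on how many disks and network links can be kept busy at once, so the two figures are not directly comparable.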
Website: http://sortbenchmark.org/
PDF: MinuteSort with Flat Datacenter Storage
The sorts were accomplished using a heterogeneous cluster consisting of 256 computers and 1,033 disks, divided broadly into two classes: storage nodes and compute nodes. Notably, no compute node in our system uses local storage for data; we believe FDS is the first system with competitive sort performance that uses remote storage. Because files are all remote, our 1,470 GB runs actually transmitted 4.4 TB over the network in under a minute. No strong assumptions are made around key or record lengths; keys and records of other lengths can be handled with only a performance-neutral configuration change.
Summary
FDS is a general-purpose scalable parallel blob store that exploits a full-bandwidth interconnect to expose the entire cluster’s disk bandwidth to remote clients. The sort performance results in this paper demonstrate the power of the architecture: in both Daytona and Indy sorts, the system reads the data remotely to the sort machines, sorts the data across the network, and writes it remotely back to storage.
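(A quick sanity check on the 4.4 TB figure quoted above, not part of the paper: it assumes each byte crosses the network three times, once for the remote read, once for the exchange between sort machines, and once for the remote write. That three-pass reading is an interpretation of the excerpt, and the variable names are made up for illustration.)

```python
dataset_gb = 1470      # size of the sort run quoted in the excerpt
network_passes = 3     # assumption: remote read, exchange, remote write
elapsed_s = 60         # upper bound; the excerpt says "under a minute"

total_gb = dataset_gb * network_passes
aggregate_gb_per_s = total_gb / elapsed_s

print(total_gb)            # 4410, i.e. about 4.4 TB
print(aggregate_gb_per_s)  # 73.5
```

(Since the run finished in under a minute, 73.5 GB/s is a lower bound on the aggregate network rate under that assumption.)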
Performant remote file access imparts a flexibility absent in contemporary distributed storage systems. Beyond sort, FDS supports a broad variety of scalable large-data applications. It does so without demanding that cluster nodes balance compute and disk performance; more importantly, it does so without demanding that applications observe locality constraints.
tomorrow who's gonna fuss
Could someone knowledgeable comment on their "tract locator table" (or TLT) metadata system and its possible relation to P2P protocols? If Bittorrent didn't focus on peer-speed as measured by reads and writes, couldn't it gain an advantage using this? TLT is expected to have consistent membership, but if it was updated once a minute (say), wouldn't that be enough to get the advantages without it taking too long to join a group?
tomorrow who's gonna fuss
I think it's unfair to say that they are the only company funding this sort of research. Plenty of research is done by other companies such as Intel, IBM and Google. Granted, (as usual) it seems the real issue being debated here is whether Microsoft is evil or not, and I'd have to say that the answer is a resounding No. I applaud this accomplishment. I still despise their products and general philosophy, but credit should be given where credit is due, and this deserves credit. I think this development sounds really cool and I hope that their research department continues to delve into interesting issues like this. Time alone will tell what will come from this, whether I like the company or not.
~theCzar
It's rare that I seem to hear much about what Microsoft does in the basic research areas.
They used 10 GigE with a very advanced set of switches that support OpenFlow so that they could get the full bisectional bandwidth. They could have used InfiniBand and probably done much better with FDR adapters capable of 56 gigabits per second. Even "old" IB adapters were faster. Most of the IB switches supported full bisectional bandwidth right out of the box. MS should look at the High Performance Computing world. They need to handle large amounts of data with low latency.
-- soldack
MSFT Research has been a leader there for a decade. The technical program was just announced Tuesday.