Getting Students To Think At Internet Scale
Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think at Internet scale, in terms of petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like that of the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."
Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se; it's computer engineering. It's useless to say 'hey, I'm processing 30 TB' if you're processing it wrong. Scientific method and principles are what count, and they don't change.
I worked for one of the detectors at CERN, and I strongly agree with the notion of science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in the case of the experiment I was working for). Unsurprisingly, not all of it can be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, a lot happens to the data before it is even seen by a physicist.
Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.