Getting Students To Think At Internet Scale
Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think at Internet scale, in petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."
Science has always been about extracting knowledge from thoughtfully generated and processed data. Managing enormous datasets is not science per se; it's computer engineering. It's useless to say 'hey, I'm processing 30 TB' if you're processing it wrong. Scientific method and principles are what count, and they don't change.
They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".
When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did maths or problem-solving abilities change or improve because of the calculator? No. Students today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers. So long as what they learn is good, they'll be able to transpose their knowledge to a vaster scale, or invent the next Big Thing. I don't see the problem.
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
Summary uses data and information as if they are synonyms. They are not.
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
"Science these days has basically turned into a data-management problem," says Jimmy Lin.
This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.
I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at http://www.scientificblogging.com/daytime_astronomer/petabyte_problem. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.
What makes it different from previous eras (when MB was big, then when GB was big) is that, before, the storage was expensive, yes, but bandwidth wasn't as much of a problem for transmitting the data, even if only locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly, since bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time, particularly for browsing or inquiries.
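To put rough numbers on that, here is a back-of-the-envelope sketch of how long a single sequential pass over a dataset takes. The 100 MB/s sustained read rate is purely an assumed figure for illustration, not a measurement of any real system, and the function and constant names are made up for this example.

```python
# Back-of-the-envelope: time for one sequential read of a dataset.
# ASSUMED_THROUGHPUT is an illustrative assumption, not a measured value.

SECONDS_PER_DAY = 86_400

GB = 10**9   # decimal units
TB = 10**12
PB = 10**15

ASSUMED_THROUGHPUT = 100 * 10**6  # bytes per second (100 MB/s, assumed)


def scan_time_days(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Days needed to read size_bytes once at a sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_bytes / throughput_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, size in (("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
        days = scan_time_days(size, ASSUMED_THROUGHPUT)
        print(f"{label}: {days:.4f} days for one pass")
```

At that assumed rate a gigabyte takes seconds and a terabyte a few hours, but a petabyte takes months for a single pass, so even a locally attached copy is too slow for casual browsing. Reading many drives in parallel shortens the time, but only in proportion to how many independent readers you can keep busy.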
So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.
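A minimal sketch of what that looks like in practice: keep a small catalog of per-file metadata, query the catalog, and only then touch the bulk data. Everything here (the `FileRecord` fields, the paths, the sizes, the sky coordinates) is hypothetical and invented for illustration; it is not the layout of any real archive.

```python
# Sketch: choose which files to read by querying a small metadata catalog,
# instead of scanning the bulk data. All records below are made up.
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FileRecord:
    path: str         # hypothetical location of the bulk data
    size_bytes: int   # size of that file
    ra_deg: float     # sky position covered by the file (hypothetical field)
    dec_deg: float


def select(catalog: Iterable[FileRecord],
           ra_range: Tuple[float, float],
           dec_range: Tuple[float, float]) -> List[FileRecord]:
    """Return the records whose position falls inside the requested box."""
    ra_lo, ra_hi = ra_range
    dec_lo, dec_hi = dec_range
    return [r for r in catalog
            if ra_lo <= r.ra_deg <= ra_hi and dec_lo <= r.dec_deg <= dec_hi]


CATALOG = [
    FileRecord("night1/img_0001.dat", 4 * 10**9, ra_deg=12.5, dec_deg=-1.0),
    FileRecord("night1/img_0002.dat", 4 * 10**9, ra_deg=140.0, dec_deg=30.0),
    FileRecord("night2/img_0003.dat", 6 * 10**9, ra_deg=15.0, dec_deg=2.5),
]

hits = select(CATALOG, ra_range=(10.0, 20.0), dec_range=(-5.0, 5.0))
bytes_to_read = sum(r.size_bytes for r in hits)
total_bytes = sum(r.size_bytes for r in CATALOG)

if __name__ == "__main__":
    print(f"reading {len(hits)} of {len(CATALOG)} files, "
          f"{bytes_to_read} of {total_bytes} bytes")
```

The catalog is tiny compared with the data it describes, so the query is cheap; the expensive read is limited to the files that matched.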
A.