Getting Students To Think At Internet Scale

Posted by kdawson on Monday October 12, 2009 @09:08PM from the peta-here-a-peta-there dept.

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think in terms of Internet scale: petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like that of the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

13 of 98 comments

Data management problem by razvan784 · 2009-10-12 21:19 · Score: 5, Insightful

Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.
1. Re:Data management problem by Trepidity · 2009-10-12 21:55 · Score: 2, Insightful
  
  I agree, and I don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only with toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but whose only flaw was that they were used to processing data on the order of a few megabytes and needed to learn how to scale up, well, that'd be a good problem for us to have.
  Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.
  
  --
  10 PRINT CHR$(205.5+RND(1)); : GOTO 10
2. Re:Data management problem by Interoperable · 2009-10-12 23:37 · Score: 4, Insightful
  
  Yeah, no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake: after 4 hours of shoving floppies in the drive I sat down and wrote a script in 2 hours. Ah well.)
  But the point is, a 400 data point trace may be exactly what you need to get the information you're looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.
  
  --
  So if this is the future...where's my jet pack?
3. Re:Data management problem by FlyingBishop · 2009-10-12 23:49 · Score: 3, Insightful
  
  It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why CERN's data capacity is measured in terabytes, and they want to grow it as much as possible. That's why we have so much genetic data.
  Scientific method and principles count, but they are not enough.
4. Re:Data management problem by WarpedMind · 2009-10-13 01:51 · Score: 2, Insightful
  
  I'm afraid you are limited by a short time horizon. I remember working and computing on systems where 100MB was just as difficult and expensive to deal with as 100PB is today. 2MB was the amount of mountable storage on small systems. Anything larger and you had to go to "big iron".
  Real work was done on those small systems, and good scientific principles and methods were the key then and are the key now.
  Just remember that the "laptop" 10 years from now will have over 8TB local SSD.
  I operate an archive for the university. 10 years ago when we started it, a 10MB file was considered a pretty big file. Today it is the smallest size file we like to see stored in the archive. We store several PBs and I consider ours a small archive. 100PB in a few years will be nothing. But those 100 exabyte files... now those will be difficult to work with. It will be "difficult to find hardware capable of storing that much data."
5. Re:Data management problem by Anonymous Coward · 2009-10-13 01:52 · Score: 2, Insightful
  
  I don't think you're going against the spirit of normalized tables. You've added a persistent cache which happens to be implemented in the database, that's all. Most high-end databases support what you're doing via materialized views (or materialized query tables, or summary tables, or whatever; the name varies). The RDBMS basically just writes the triggers for you, but provides the added benefit of using the MQTs for optimization somewhat like an index. Properly done, you can write your queries against the (normalized) base tables, and the query planner will use the MQT instead if it can.
  Really, the reason to push normalized tables is the whole "Code first; optimize later, if at all" thing. Put all your source data in the database because you never know exactly how much of it you need or can benefit from using. Normalize the tables because you never know exactly how you will be using them. Only when your code is quite stable will you know what queries are too slow or complex, and then you can optimize them by creating summary tables. Optimizing too soon will result in a lot of wasted effort and make your job harder down the road.
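  To make the "summary table maintained by triggers" idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. SQLite has no materialized views, so the triggers that a higher-end RDBMS would generate are written by hand. The schema (reading, reading_summary) and the helper names are hypothetical, chosen only for illustration, and only inserts are handled.

```python
import sqlite3

# Hypothetical schema for illustration: a normalized base table of raw
# readings plus a per-sensor summary table kept current by a trigger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reading (
    id     INTEGER PRIMARY KEY,
    sensor TEXT NOT NULL,
    value  REAL NOT NULL
);
CREATE TABLE reading_summary (
    sensor TEXT PRIMARY KEY,
    n      INTEGER NOT NULL,
    total  REAL NOT NULL
);
-- Hand-written stand-in for what a materialized view would automate.
-- Only INSERT is covered; UPDATE and DELETE would need their own triggers.
CREATE TRIGGER reading_after_insert AFTER INSERT ON reading
BEGIN
    INSERT OR IGNORE INTO reading_summary (sensor, n, total)
        VALUES (NEW.sensor, 0, 0.0);
    UPDATE reading_summary
        SET n = n + 1, total = total + NEW.value
        WHERE sensor = NEW.sensor;
END;
""")

conn.executemany(
    "INSERT INTO reading (sensor, value) VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)


def summary_from_cache(conn):
    """Read the precomputed aggregates (cheap, no scan of the base table)."""
    return conn.execute(
        "SELECT sensor, n, total FROM reading_summary ORDER BY sensor"
    ).fetchall()


def summary_from_base(conn):
    """Recompute the same aggregates from the normalized base table."""
    return conn.execute(
        "SELECT sensor, COUNT(*), SUM(value) FROM reading "
        "GROUP BY sensor ORDER BY sensor"
    ).fetchall()
```

  Both functions return the same rows. The difference from a real materialized query table is that here the application has to choose which one to query, whereas a query planner with MQT support can make that substitution itself.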
6. Re:Data management problem by Anonymous Coward · 2009-10-13 03:20 · Score: 1, Insightful
  
  "Google has invented things like Map/Reduce"
  Yikes. Google has been good about applying existing parallel and distributed computing concepts into their engineering, but they didn't invent the CS fundamentals. Map-reduce constructs are a basic idiom of most functional programs and parallel programs (whether functional or not) in scientific computing. What Google may have invented was a way to finally teach such basics to the hipsters who otherwise think the CS literature starts with their own first programming task.
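  For readers who have not met the idiom, here is a minimal sketch of map and reduce as plain functional constructs, using only the Python standard library. It is a single-process word count, not Google's MapReduce system; the input chunks and the function names map_phase and reduce_phase are made up for illustration.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map step: turn one chunk of text into partial word counts."""
    return Counter(chunk.split())


def reduce_phase(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical input, already split into independent chunks.
chunks = ["big data big", "data small data"]

# Each chunk is mapped independently, then the partial counts are folded.
word_counts = reduce(reduce_phase, map(map_phase, chunks), Counter())
```

  Because merging counts is associative, the partial results can be combined in any grouping, which is what allows the same pattern to be spread across many machines.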
  Similarly, their Python guru Guido did not invent a bunch of programming language concepts so much as cherry pick and apply some into his own bastard language. In this regard, he has more in common with Larry Wall creating Perl than with the real programming language theorists who made all the breakthroughs since the early days of the Lambda calculus.
7. Re:Data management problem by Hognoxious · 2009-10-13 04:52 · Score: 2, Insightful
  
  Surely a chemist should know about chemistry, a biologist about biology and so on.
  If either needs to do computation beyond his own capabilities, he needs to get a CS person to help him. That's what specialists do, they specialise.
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Students don't need to think at internet scale by Rosco P. Coltrane · 2009-10-12 21:48 · Score: 2, Insightful

They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".
When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did maths or problem-solving abilities change or improve because of the calculator? No. Students today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers. So long as what they learn is good, they'll be able to transpose their knowledge to a vaster scale, or invent the next Big Thing. I don't see the problem.

--
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
1. Re:Students don't need to think at internet scale by adamchou · 2009-10-12 23:17 · Score: 2, Insightful
  
  A LOT of research has been put into improving algorithms for working on large scales. If we don't teach our youth in school all that we have learned, they are just going to have to figure it out themselves and continue to reinvent the wheel. How are we supposed to advance if we don't put them in a situation to learn and apply our newfound knowledge?
Wrong by Hognoxious · 2009-10-12 21:54 · Score: 2, Insightful

Summary uses data and information as if they are synonyms. They are not.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Huge Misstatement by Jane Q. Public · 2009-10-12 22:11 · Score: 3, Insightful

"Science these days has basically turned into a data-management problem," says Jimmy Lin.

This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.
The Petabyte Problem by ghostlibrary · 2009-10-12 22:12 · Score: 4, Insightful

I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at
http://www.scientificblogging.com/daytime_astronomer/petabyte_problem. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.
What makes it different from previous eras (where MB was big, where GB was big) is that, before, storage was expensive, yes, but bandwidth wasn't as much of a problem for moving the data, even locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly; bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time, particularly for browsing or inquiries.
So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.
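A rough back-of-the-envelope calculation shows why a full pass over a petabyte is impractical for browsing. The sketch below is Python, and the throughput figures in it (100 MB/s sustained from a single drive, a 10 Gbit/s aggregate link) are assumptions picked for illustration, not measurements.

```python
PB = 10**15  # bytes in a (decimal) petabyte
GB = 10**9
MB = 10**6
SECONDS_PER_DAY = 86_400


def scan_days(total_bytes, bytes_per_second):
    """Days needed to read total_bytes once at a sustained throughput."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Assumed: one drive streaming at a sustained 100 MB/s.
single_drive = scan_days(PB, 100 * MB)   # about 115.7 days

# Assumed: the whole rack behind one 10 Gbit/s link (1.25 GB/s).
shared_link = scan_days(PB, 1.25 * GB)   # about 9.3 days
```

Even the optimistic case is more than a week for one sequential read, before any computation is done on the data.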

--
A.