Slashdot Mirror


Getting Students To Think At Internet Scale

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information — so the next generation of computer scientists will have to learn think in terms of Internet scale of petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

9 of 98 comments (clear)

  1. Data management problem by razvan784 · · Score: 5, Insightful

    Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.

    1. Re:Data management problem by adamchou · · Score: 4, Informative

      thats absolutely not true. the process is vastly different when it comes to working with 100 MB or 10 petabytes. lets take databases for instance. if you have 100MB of data, you can just store the entire database on one server. when it comes to 100 PB of data, its even difficult to find the hardware capable of storing that much data. you need to start looking at distributed systems and distributed systems is such a broad field in itself.

      when i graduated in 2005, a lot of the techniques i was taught worked great for working with database systems that handled a few hundred thousand rows. then i got a job at an internet company that had tables with over 80 million rows. all that normalization stuff i learned in school had to be thrown out. times may have changed now, but when i was in school, not only did i not learn how to handle "internet scale" data sets, i was taught the wrong methods to handle large data sets.

      undergrad college students should at least get a basic intro to large data sets, if not have a class completely dedicated to learning on how to work with those data sets. school is supposed to prepare you for the work force. at least give the students the option to take a class that covers those topics if they want to go into those industries. i sure wish i had that option

    2. Re:Data management problem by Interoperable · · Score: 4, Insightful

      Yeah no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake, after 4 hours of shoving floppies in the drive I sat down and wrote a script in 2 hours, ah well.)

      But the point is, a 400 data point trace may be exactly what you need to get the information your looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.

      --
      So if this is the future...where's my jet pack?
    3. Re:Data management problem by FlyingBishop · · Score: 3, Insightful

      It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why Cern's data capacity is measured in terabytes, and they want to grow it as much as possible. That's why we have so much genetic data.

      Scientific method and principles count, but they are not enough.

    4. Re:Data management problem by Hal_Porter · · Score: 3, Interesting

      That's not true. The way you solve the problem changes radically depending on the amount of data you have. Consider

      100 KB - You could use the dumbest algorithm imaginable and the slowest processor and everything is fine.

      100 MB - most embedded systems can happily manage it. A desktop system can easily, even in a rather inefficient language. Algorithms are important.

      100 GB - Big ass server - you'd definitely want to make sure you were using an efficient language and had an algorithm that scaled well, certainly to 2 processors and most likely to 4 processors. Probably should be 64 bit for efficiency.

      100 PB+ You'd want a Google like system with lots of nodes. Actually I think at this point the code would look nothing like the 10 MB case. I remember someone saying that Google is "just a hash table". Now I think that misses the point. Google has invented things like Map/Reduce and has custom file systems. They've also spent a lot of time trying to cut costs by studying the effects of temperature on failure rates.

      Now I think these guys are spouting buzzwords. But if you want to process 100PB of data on

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
  2. The LSST? by aallan · · Score: 4, Informative

    Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night.

    Err no it doesn't, and no they aren't. The telescope hasn't been built yet? First light isn't scheduled until late in 2015.

    Al.

    --
    The Daily ACK - Eclectic posts by yet another hacker
  3. Indeed by saisuman · · Score: 5, Interesting

    I worked for one of the detectors at CERN, and I strongly agree with the notion of Science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in case of the experiment I was working for). Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, there is a lot that happens with the data before it is even seen by a physicist. Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.

  4. Huge Misstatement by Jane+Q.+Public · · Score: 3, Insightful

    "Science these days has basically turned into a data-management problem," says Jimmy Lin.

    This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.

  5. The Petabyte Problem by ghostlibrary · · Score: 4, Insightful

    I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at
    http://www.scientificblogging.com/daytime_astronomer/petabyte_problem. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.

    What makes it different from previous eras (where MB was big, where GB was big) is that, before, the storage was expensive, yes, but bandwidth wasn't as much of a trouble for transmitting, if even locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly-- bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time. Particularly for browsing or inquiries.

    So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.

    --
    A.