Slashdot Mirror


Getting Students To Think At Internet Scale

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think in terms of an Internet scale of petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

25 of 98 comments

  1. Data management problem by razvan784 · · Score: 5, Insightful

    Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.

    1. Re:Data management problem by Trepidity · · Score: 2, Insightful

      I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only with toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but their only flaw was that they were used to processing data on the order of a few megabytes, and needed to learn how to scale up bigger--- well that'd be a good problem for us to have.

      Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.
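
      The point about exploring a subset of the full data set can be sketched in code. This is a minimal illustration, assuming the records arrive as a stream too large to hold in memory; `reservoir_sample` is a hypothetical helper name, and the technique shown is standard reservoir sampling (Algorithm R), not something the comment itself prescribes.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Draw a 1,000-record working subset from a million-record stream.
subset = reservoir_sample(range(1_000_000), k=1000)
```

      Only the k sampled records are ever held in memory, so the same function would work on a stream far larger than the machine's RAM.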

    2. Re:Data management problem by adamchou · · Score: 4, Informative

      that's absolutely not true. the process is vastly different when it comes to working with 100 MB or 10 petabytes. let's take databases for instance. if you have 100MB of data, you can just store the entire database on one server. when it comes to 100 PB of data, it's difficult even to find the hardware capable of storing that much data. you need to start looking at distributed systems and distributed systems is such a broad field in itself.

      when i graduated in 2005, a lot of the techniques i was taught worked great for working with database systems that handled a few hundred thousand rows. then i got a job at an internet company that had tables with over 80 million rows. all that normalization stuff i learned in school had to be thrown out. times may have changed now, but when i was in school, not only did i not learn how to handle "internet scale" data sets, i was taught the wrong methods to handle large data sets.

      undergrad college students should at least get a basic intro to large data sets, if not have a class completely dedicated to learning how to work with those data sets. school is supposed to prepare you for the work force. at least give the students the option to take a class that covers those topics if they want to go into those industries. i sure wish i had that option

    3. Re:Data management problem by Interoperable · · Score: 4, Insightful

      Yeah no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake, after 4 hours of shoving floppies in the drive I sat down and wrote a script in 2 hours, ah well.)

      But the point is, a 400 data point trace may be exactly what you need to get the information you're looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.

      --
      So if this is the future...where's my jet pack?
    4. Re:Data management problem by FlyingBishop · · Score: 3, Insightful

      It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why CERN's data capacity is measured in terabytes, and they want to grow it as much as possible. That's why we have so much genetic data.

      Scientific method and principles count, but they are not enough.

    5. Re:Data management problem by Enter+the+Shoggoth · · Score: 2, Interesting

      I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only with toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but their only flaw was that they were used to processing data on the order of a few megabytes, and needed to learn how to scale up bigger--- well that'd be a good problem for us to have.

      Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.

      I disagree with your agreement :-)

      I suspect that what the article is getting at is that when you deal with very large sets of data you have to think about different algorithmic approaches rather than the cookie-cutter style of "problem solving" that most software engineering courses focus on.

      These kinds of problems require a very good understanding of not just the engineering side of things but also a comprehensive idea of statistical, numerical and analytical methods as well as an encyclopaedic knowledge of computability, complexity and information theory.

      Just think about how different the Lucene library or MapReduce are from the way most developers would have approached the problems that these tools address.
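
      To make the contrast concrete, here is a toy sketch of the MapReduce programming model. It is a single-process illustration only: real systems spread the map and reduce steps across many machines and handle failures, and none of the function names below come from any actual framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big problems", "small data"])
```

      The map and reduce functions never share state, which is what lets a real framework run them on separate nodes; that constraint is the part that differs from how a loop over a single in-memory list would usually be written.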

      --
      Andy Warhol got it right / Everybody gets the limelight
      Andy Warhol got it wrong / Fifteen minutes is too long.
    6. Re:Data management problem by Hal_Porter · · Score: 3, Interesting

      That's not true. The way you solve the problem changes radically depending on the amount of data you have. Consider

      100 KB - You could use the dumbest algorithm imaginable and the slowest processor and everything is fine.

      100 MB - most embedded systems can happily manage it. A desktop system can easily, even in a rather inefficient language. Algorithms are important.

      100 GB - Big ass server - you'd definitely want to make sure you were using an efficient language and had an algorithm that scaled well, certainly to 2 processors and most likely to 4 processors. Probably should be 64 bit for efficiency.

      100 PB+ - You'd want a Google-like system with lots of nodes. Actually I think at this point the code would look nothing like the 100 MB case. I remember someone saying that Google is "just a hash table". Now I think that misses the point. Google has invented things like Map/Reduce and has custom file systems. They've also spent a lot of time trying to cut costs by studying the effects of temperature on failure rates.

      Now I think these guys are spouting buzzwords. But if you want to process 100PB of data on

      --
      echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
    7. Re:Data management problem by autocracy · · Score: 2, Informative

      One example: I deal with healthcare claims. We keep everything normalized on insertion, but we also create some redundant, denormalized tables (data warehousing). Almost every query needs the same basic claim information, but I'm doing it in a query with one or two joins instead of 10.

      If something goes south with my manipulated tables, or I need a strange field, I still have my source data in a pure form. For a standard query, though, I can operate an order of magnitude faster by adding redundant tables that only have to be written once on insert by a trigger.
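
      The pattern described above, normalized source tables plus a redundant read table filled by a trigger on insert, can be sketched with SQLite from Python. The schema here (`patients`, `claims`, `claim_summary`) is entirely hypothetical and much smaller than any real claims system; it only shows the mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized source tables (hypothetical schema).
CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE claims (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(id),
    amount REAL NOT NULL
);

-- Redundant, denormalized copy used for routine queries.
CREATE TABLE claim_summary (
    claim_id INTEGER PRIMARY KEY,
    patient_name TEXT NOT NULL,
    amount REAL NOT NULL
);

-- Written once per claim, at insert time, by the trigger.
CREATE TRIGGER claims_after_insert AFTER INSERT ON claims
BEGIN
    INSERT INTO claim_summary (claim_id, patient_name, amount)
    SELECT NEW.id, p.name, NEW.amount
    FROM patients AS p
    WHERE p.id = NEW.patient_id;
END;
""")

conn.execute("INSERT INTO patients (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO claims (id, patient_id, amount) VALUES (10, 1, 250.0)")

# A routine query reads the summary table and needs no join.
rows = conn.execute(
    "SELECT claim_id, patient_name, amount FROM claim_summary"
).fetchall()
```

      The source tables stay untouched, so the summary can be rebuilt from them if it ever drifts.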

      --
      SIG: HUP
    8. Re:Data management problem by WarpedMind · · Score: 2, Insightful

      I'm afraid you are limited by a short time horizon. I remember working and computing on systems where 100MB was just as difficult and expensive to deal with as 100PB is today. 2MB was the amount of mountable storage on small systems. Anything larger and you had to go to "big iron".

      Real work was done on those small systems and good scientific principles and methods were the key then and the key now.

      Just remember that the "laptop" 10 years from now will have over 8TB local SSD.

      I operate an archive for the university. 10 years ago when we started it, a 10MB file was considered a pretty big file. Today it is the smallest size file we like to see stored in the archive. We store several PB's and I consider ours a small archive. A 100PB in a few years will be nothing. But those 100Exabyte files... now those will be difficult to work with. It will be "difficult to find hardware capable of storing that much data."

    9. Re:Data management problem by Anonymous Coward · · Score: 2, Insightful

      I don't think you're going against the spirit of normalized tables. You've added a persistent cache which happens to be implemented in the database, that's all. Most high-end databases support what you're doing via materialized views (or materialized query tables, or summary tables, or whatever; the name varies). The RDBMS basically just writes the triggers for you, but provides the added benefit of using the MQTs for optimization somewhat like an index. Properly done, you can write your queries against the (normalized) base tables, and the query planner will use the MQT instead if it can.

      Really, the reason to push normalized tables is the whole "Code first; optimize later, if at all" thing. Put all your source data in the database because you never know exactly how much of it you need or can benefit from using. Normalize the tables because you never know exactly how you will be using them. Only when your code is quite stable will you know what queries are too slow or complex, and then you can optimize them by creating summary tables. Optimizing too soon will result in a lot of wasted effort and make your job harder down the road.

    10. Re:Data management problem by Hognoxious · · Score: 2, Insightful

      Surely a chemist should know about chemistry, a biologist about biology and so on.

      If either needs to do computation beyond his own capabilities, he needs to get a CS person to help him. That's what specialists do, they specialise.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  2. The LSST? by aallan · · Score: 4, Informative

    Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night.

    Err no it doesn't, and no they aren't. The telescope hasn't been built yet? First light isn't scheduled until late in 2015.

    Al.

    --
    The Daily ACK - Eclectic posts by yet another hacker
    1. Re:The LSST? by Thanshin · · Score: 2, Funny

      You clearly aren't prepared to think in a future frame of reference.

      That's the consequence of studying with equipment that existed at the moment you were working with it.

      Future generations won't have that problem, as they're already studying with equipment that will be paid for and released to the university several years after their graduation.

    2. Re:The LSST? by Shag · · Score: 2, Interesting

      What aallan said - although, 2015? I thought the big projects (LSST, EELT, TMT) were all setting a 2018 target now.

      I went to a talk a month and a half ago by LSST's lead camera scientist (Steve Kahn) and LSST is at this point very much vaporware (as in, they've got some of the money, and some of the parts, but are nowhere near having all the money or having it all built.) Even Pan-STARRS, which is only supposed to crank out 10TB a night, only has 1 of 4 planned scopes built (they're building a second), and has been having optical quality problems with that one. By the time kids born at the turn of the century are leaving high school, though, yes, we do expect things like these to be up and running.

      But at the risk of sounding like that one college that publishes a list every year of what the freshman class of that year does and doesn't know, kids born around the turn of the century (my daughter is one) don't have the "OMG a TB!" mentality that we grownups have. The smallest capacity hard-drive my daughter will probably remember was 5 gigs - and that was in an iPod. Things like 64-bit, gigahertz speeds, multiprocessing, fast ethernet, wifi, home broadband... always been there. DVD-R media has, to her knowledge, always been there. (I did once have to explain to her that CDs used to be the size of platters and made of black plastic, after she found some Queensrÿche vinyl.)

      She's ten now, and you can put a half-terabyte or more in a laptop, so while the idea of some big scientific project spitting out 50 or 60 laptops worth of data in a night is clearly a lot of data, it's not something that can't be envisioned.

      --
      Village idiot in some extremely smart villages.
  3. A fantastic idea by Anonymous Coward · · Score: 2, Interesting

    This is a great idea.
    Even in business we often hit problems with systems that are designed by people who just don't think about real-world data volumes. I work in the ERP vendor space (SAP, ORACLE, PEOPLESOFT and so on) and their in-house systems aren't designed to simulate real-world data, so their performance is shocking when you load real throughput into them. And so many times I have seen graduates think Microsoft systems can take enterprise volumes of data, and be shocked when they build something that collapses under a few terabytes or so! I'm used to having to post millions of transactions a day and there isn't an MS system in the world that deals with that. No offence to MS - we use Excel for reporting and drilldowns and Access a lot - but understanding the limitations of the tools, and what they can really handle and scale to, is essential. As well as understanding what large data volumes actually are these days!

    I know of a large bank that put in an ERP system using INTEL and MS SQL SERVER (with LOTS of press). We were a bit shocked actually because that bank was larger than we were and we had mainframes struggling to cope with our transaction load.
    In fact I was hauled over the coals for the cost of our hardware, so I investigated. The INTEL / MS solution failed so miserably they quietly shut it down and moved back to their mainframe - no press! It wasn't able to cope with the merest fraction of the load and couldn't have. However the people involved had no conception of what large meant (and they thought that a faster processor was all you needed - it never occurred to them you get something for all the extra money you pay for a mainframe!)

    I think this is a terrific idea - and not only for the whole internet: they should teach this so that students understand these concepts for any large corporation they may work for!

  4. Students don't need to think at internet scale by Rosco+P.+Coltrane · · Score: 2, Insightful

    They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".

    When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did maths or problem solving abilities change or improve because of the calculator? No. Students today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers, so long as what they learn is good, they'll be able to transpose their knowledge on a vaster scale, or invent the next Big Thing. I don't see the problem.

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    1. Re:Students don't need to think at internet scale by adamchou · · Score: 2, Insightful

      A LOT of research has been put into improving algorithms for working on large scales. By not teaching our youth all that we have learned in school, they are just going to have to figure it out themselves and continue to reinvent the wheel. How are we supposed to advance if we don't put them in a situation to learn and apply our new found knowledge?

    2. Re:Students don't need to think at internet scale by Strange+Ranger · · Score: 2, Informative

      I don't see the problem.

      ^Maybe this illustrates the point?

      Really really big numbers can be hard for the human brain to get a grip on. But more to the point, operating at large scales presents problems unique to the scale. Think of baking cookies. Doing this in your kitchen is a familiar thing to most people. But the kitchen method doesn't translate well to an industrial scale. Keebler doesn't use a million gallon bowl and cranes with giant beaters on the end. They don't have ovens the size of a cruise ship. Just because you can make awesome cookies in your kitchen doesn't qualify you one bit to work for Keebler.
      Whether it's cookies or scientific inquiry it's a good idea to prepare students to process things on the appropriate scale.

      --
      Operator, give me the number for 911!
    3. Re:Students don't need to think at internet scale by vxvxvxvx · · Score: 2, Funny

      So when it comes to really really big numbers, we need to rely upon elfs in trees?

  5. Wrong by Hognoxious · · Score: 2, Insightful

    Summary uses data and information as if they are synonyms. They are not.

    --
    Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  6. Indeed by saisuman · · Score: 5, Interesting

    I worked for one of the detectors at CERN, and I strongly agree with the notion of Science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in case of the experiment I was working for). Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, there is a lot that happens with the data before it is even seen by a physicist. Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.

  7. Huge Misstatement by Jane+Q.+Public · · Score: 3, Insightful

    "Science these days has basically turned into a data-management problem," says Jimmy Lin.

    This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.

  8. The Petabyte Problem by ghostlibrary · · Score: 4, Insightful

    I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at
    http://www.scientificblogging.com/daytime_astronomer/petabyte_problem. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.

    What makes it different from previous eras (where MB was big, where GB was big) is that, before, the storage was expensive, yes, but bandwidth wasn't as much of a trouble for transmitting, if even locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly-- bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time. Particularly for browsing or inquiries.

    So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.
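
    The bandwidth argument above comes down to simple arithmetic, sketched here. The 100 MB/s sequential read rate per reader is an illustrative assumption, not a figure from the meeting notes, and `scan_time_days` is a hypothetical helper.

```python
PB = 10**15  # bytes in a petabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)


def scan_time_days(size_bytes: float, bytes_per_second: float, readers: int = 1) -> float:
    """Days needed to read size_bytes once, split evenly across parallel readers."""
    seconds = size_bytes / (bytes_per_second * readers)
    return seconds / 86_400  # seconds per day


# Assumed throughput: 100 MB/s per reader.
one_reader = scan_time_days(1 * PB, 100 * MB)                      # about 115.7 days
thousand_readers = scan_time_days(1 * PB, 100 * MB, readers=1000)  # about 0.12 days
```

    Under that assumption a single pass over one petabyte takes months on one reader and a few hours on a thousand, which is why narrowing the data down with metadata before reading anything becomes so attractive.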

    --
    A.
  9. Well... by DavidR1991 · · Score: 2

    If you swap the focus from smaller size problems to the mega-scale problems, then you get a bunch of students who can only do mega-scale problems (reverse of the trend the article talks about)

    Here's the rub: It's easier to scale up than it is to scale down. Most big problems are made up of lots of little problems. Little problems are rarely made up of mega-scale problems...

    I think what they need to do is to keep the focus on the small/'regular' stuff, but also show how their knowledge applies to the "big stuff" (so they can 'see' problems from both ends) - not just focus on one or the other

  10. Work at enterprise... by SharpFang · · Score: 2, Interesting

    It was a very surprising experience, moving from small services where you get 10 hits per minute maybe, to a corporation that receives several thousand hits per second.

    There was a layer of cache between each of 4 application layers (database, back-end, front-end and adserver), and whenever a generic cache wouldn't cut it, a custom one was applied. On my last project there, the dedicated caching system could reduce some 5000 hits per second to 1 database query per 5 seconds - way overengineered even for our needs, but it was a pleasure watching the backend compressing several thousand requests into one, and the frontend split into pieces of "very strong cache, keep in browser cache for weeks", "strong caching, refresh once/15 min site-wide", "weak caching, refresh site-wide every 30s" and "no caching, per visitor data" with the first being some 15K of Javascript, the second about 5K of generic content data, the third about 100 bytes of immediate reports and the last some 10 bytes of user prefs and choices.
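
    The effect described above, thousands of hits collapsing into one database query per refresh interval, can be sketched with a small time-to-live cache. This is a single-threaded illustration under assumed names (`TTLCache`, `query_database`); it is not the system the comment describes, and a production cache would also need locking and eviction.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Serve cached values and call the loader at most once per key per TTL window."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no backend call
        value = self._loader(key)  # expired or missing: hit the backend once
        self._entries[key] = (now, value)
        return value


stats = {"db_queries": 0}


def query_database(key: Hashable) -> str:
    """Stand-in for an expensive database query (hypothetical)."""
    stats["db_queries"] += 1
    return f"report for {key}"


# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TTLCache(query_database, ttl_seconds=5.0, clock=lambda: fake_now[0])

for _ in range(5000):      # 5000 hits inside one 5-second window
    cache.get("front-page")
fake_now[0] = 6.0          # the window expires
cache.get("front-page")    # this hit triggers the second query
```

    With a 5-second TTL the 5001 requests above reach the database twice. The four frontend tiers in the comment would correspond to different TTL values, from weeks down to no caching at all.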

    --
    45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2