Slashdot Mirror


The Big Promise of 'Big Data'

snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"

13 of 78 comments

  1. Re:Big Data? by abigor · · Score: 2, Funny

    It's actually his hip-hop name.

  2. LiveSQL by ka9dgx · · Score: 3, Interesting

    I think that the real innovation will be a variation of SQL that allows for the persistence of queries, such that they continue to yield new results as new data is found to match them in the database. If you have a database of a trillion web pages, and you continue to put more in, it doesn't make sense to re-scan all of the existing records each time you decide you need to get the results of the query again. It should be possible, and far more computationally efficient, for a LiveSQL query to feed a continuous stream of results instead of running in batch mode.

    I've registered the domain name livesql.org as a first step to helping to organize this idea and perhaps set up a standard.

    1. Re:LiveSQL by starsky51 · · Score: 2, Insightful

      Couldn't this be done using regular sql and an indexed timestamp column?
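
For illustration, here is a minimal sketch of the timestamp idea, using Python's built-in sqlite3 module. The `pages` table, its columns, and the `poll_new_matches` helper are all hypothetical names made up for this example. Each poll only looks at rows added since the previous high-water mark, so existing records are not re-scanned.

```python
import sqlite3

# Hypothetical schema: a "pages" table with an indexed integer timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT, added_at INTEGER)"
)
conn.execute("CREATE INDEX idx_pages_added_at ON pages (added_at)")


def poll_new_matches(conn, pattern, last_seen):
    """Return matching rows added after last_seen, plus the new high-water mark."""
    # Fix the upper bound first so the mark and the result set agree.
    mark = conn.execute(
        "SELECT COALESCE(MAX(added_at), ?) FROM pages", (last_seen,)
    ).fetchone()[0]
    mark = max(mark, last_seen)
    rows = conn.execute(
        "SELECT id, body, added_at FROM pages "
        "WHERE added_at > ? AND added_at <= ? AND body LIKE ? "
        "ORDER BY added_at, id",
        (last_seen, mark, pattern),
    ).fetchall()
    return rows, mark


conn.executemany(
    "INSERT INTO pages (body, added_at) VALUES (?, ?)",
    [("hadoop intro", 1), ("cooking tips", 2)],
)
first, mark = poll_new_matches(conn, "%hadoop%", 0)

# New data arrives; the next poll only considers rows past the old mark.
conn.execute(
    "INSERT INTO pages (body, added_at) VALUES (?, ?)", ("hadoop at scale", 3)
)
second, mark = poll_new_matches(conn, "%hadoop%", mark)
```

This only covers appended rows and simple filters. Updates, deletes, and aggregate queries would need more than a timestamp column.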

      --
      There are 2 types of people in this world. Those who understand ternary and those who don't.
    2. Re:LiveSQL by Anonymous Coward · · Score: 2, Funny

      You are posting AC, but I bet your IP address resolves to Japan

    3. Re:LiveSQL by oldspewey · · Score: 2, Funny

      Too late, I already registered regularsqlwithanindexedtimestampcolumn.com

      --
      If libertarians are so opposed to effective government, why don't they all move to Somalia?
    4. Re:LiveSQL by Laxitive · · Score: 4, Informative

      There are some serious technical challenges to overcome when you think about actually implementing something like this.

      Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

      This issue is also present in queries using ordered results (as changes to a single value participating in the ordering would affect the global ordering of results for that query).

      The issue that "Big Data" presents is really the need to run -global- data analysis on extremely large datasets, utilizing data parallelism to extract performance from a cluster of machines.

      What you're suggesting (basically a functional reactive framework for querying volatile persistent data), would still involve a number of limitations over the SQL model: basically disallowing the usage of any truly global algorithm across large datasets. Tools like Hadoop get around these limitations by taking the focus away from the data model (which is what SQL excels in dealing with), and putting it on providing an expressive framework for describing distributable computations (which SQL is not so great at dealing with).

      -Laxitive

    5. Re:LiveSQL by Rakishi · · Score: 2, Informative

      Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

      Stddev is trivial to recompute on the fly and I'd be surprised if any decent sql engine didn't compute it one row at a time. Store mean(column) and mean(column^2). SD = sqrt(mean(c^2)-mean(c)^2) not considering the unbiasing stuff. Add new row value deltas to both, do some simple math and you're done.
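
The formula above can be sketched as follows. This is only an illustration, not how any particular SQL engine does it; the `RunningStddev` class and its method names are made up for the example. It keeps running sums of c and c^2, so both a new row and a point mutation are constant-time updates.

```python
import math


class RunningStddev:
    """Population standard deviation maintained from running sums (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0     # sum of values
        self.total_sq = 0.0  # sum of squared values

    def add(self, x):
        """Account for a newly inserted row."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def replace(self, old, new):
        """Account for a point mutation of one existing value."""
        self.total += new - old
        self.total_sq += new * new - old * old

    def stddev(self):
        """SD = sqrt(mean(c^2) - mean(c)^2), without the unbiasing correction."""
        if self.n == 0:
            raise ValueError("no rows")
        mean = self.total / self.n
        variance = self.total_sq / self.n - mean * mean
        # Clamp tiny negative values caused by floating-point rounding.
        return math.sqrt(max(variance, 0.0))


rs = RunningStddev()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rs.add(value)
before = rs.stddev()

rs.replace(9, 1)  # change one entry without rescanning the column
after = rs.stddev()
```

One caveat: subtracting two large, nearly equal means loses precision, so for data with a large mean relative to its spread, Welford's online algorithm is the numerically safer variant of the same idea.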

      Now medians and quantiles are a bitch.

      Frankly complex data mining of large data is a pain in the ass on hadoop as much as anywhere else. You can't do anything too global with hadoop because then you'd need to send all your data to one box anyway. You need specialized complex algorithms since you can only keep a fraction of your data in memory at a time. Simple regression? Have fun.

      That said if you're already using hadoop it's quite possible you're using some sort of online learning algorithm anyway for just that reason so converting it to real time updating would be easy.

    6. Re:LiveSQL by BitZtream · · Score: 4, Informative

      I think that the real innovation will be a variation of SQL that allows for the persistence of queries

      That's been done for years: materialized views, using triggers on INSERT/UPDATE/DELETE to update the views on the fly.

      Streaming results as needed is done with cursors.
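
As a rough sketch of the trigger approach, here is an example using Python's built-in sqlite3 module. SQLite has no native materialized views, so an ordinary summary table kept current by triggers stands in for one here; that substitution, and the `orders` and `customer_totals` tables, are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);

-- Stand-in for a materialized view: one running total per customer.
CREATE TABLE customer_totals (customer TEXT PRIMARY KEY, total INTEGER NOT NULL);

CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
    INSERT OR IGNORE INTO customer_totals (customer, total) VALUES (NEW.customer, 0);
    UPDATE customer_totals SET total = total + NEW.amount WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
    UPDATE customer_totals SET total = total - OLD.amount WHERE customer = OLD.customer;
    INSERT OR IGNORE INTO customer_totals (customer, total) VALUES (NEW.customer, 0);
    UPDATE customer_totals SET total = total + NEW.amount WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
    UPDATE customer_totals SET total = total - OLD.amount WHERE customer = OLD.customer;
END;
""")

conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("acme", 10), ("acme", 5), ("bolt", 7)],
)
conn.execute("UPDATE orders SET amount = 20 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 3")

# Iterating the cursor pulls rows one at a time instead of loading them all.
totals = {}
for customer, total in conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
):
    totals[customer] = total
```

Reading `customer_totals` never rescans `orders`; the cost is paid a little at a time on each write.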

      I know you think you're probably talking about something that 'materialized views and cursors don't do'. Fortunately, you're wrong and just don't understand how to use them.

      It really bothers me how people who talk about problems with SQL really have no fucking clue what they are talking about or how to work with the data in the first place.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  3. Re:What's the promise? by Sarten-X · · Score: 4, Insightful

    It isn't about Facebook so much as it's a shift in what problems are practically solvable.

    First, realize that traditional approaches like SQL are limited mostly by the single box (or the few mirrors) the platform runs on. Querying a large (a billion rows) table can take minutes on a very fast machine, hours if there's significant disk access needed, and months if the query's complex enough. Clusters can process those same billion records far faster, bringing that time down from months to hours, or even seconds for a simple scan. Advances in cluster computing over the last few years have made this parallel processing much easier.
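
To make the parallel scan concrete, here is a toy sketch of the split, scan, and combine pattern. Threads on one machine stand in for cluster nodes, which is purely an assumption for illustration; the function names are hypothetical and the "query" is just counting and summing even numbers.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split the rows into n_parts roughly equal, independent slices."""
    return [rows[i::n_parts] for i in range(n_parts)]


def scan_partition(rows):
    """Local step: each worker scans only its own slice."""
    matching = [r for r in rows if r % 2 == 0]
    return len(matching), sum(matching)


def combine(partials):
    """Merge step: partial counts and sums simply add up."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


def parallel_scan(rows, n_workers=4):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(scan_partition, partition(rows, n_workers)))
    return combine(partials)


result = parallel_scan(list(range(1000)))
```

The scan parallelizes cleanly because each slice can be processed without knowing anything about the others. Queries whose partial results do not merge this simply are where the hard work is.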

    The promise is that problems that were previously too big to even think about are now easy. If your solved problem is something people want, like showing what their friends are up to, your product will do well.

    --
    You do not have a moral or legal right to do absolutely anything you want.
  4. End of Science by mysterons · · Score: 2, Informative

    Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).

  5. Why it works for Google/Yahoo/Facebook by BitZtream · · Score: 5, Interesting

    Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly.

    Because their business is based entirely on how that data correlates.

    99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out.

    Walmart might care, but they aren't a small business.

    The local auto mechanic, or plumber, or even small companies like lawn services or maid services simply aren't big enough to justify having a staff of nerds to make the data useful to them, and they really don't have enough data to matter. It simply is too expensive on the small scale.

    Companies that can REALLY benefit from the ability to comb vast quantities of data have been doing it for well over a hundred years. Insurance companies are a prime example. You know what? They aren't small in general, so they have the staff to do the data correlation and find out useful information because it works on that scale.

    Anyone who cares about churning through massive amounts of data already has ways to do it. Computing will make it faster, but it's not going to change the business model.

    I'm kind of puzzled why virtualization has anything to do with this, unless someone is implying that a smart thing to do is set up a VM server and then run a bunch of VMs on it to get a 'cluster' to run distributed apps on ... if that's the point being made then I think someone needs to turn in their life card (they clearly never had a geek card).

    So now that I've written all that, I went and read the article.

    Now I realize that article is written by someone who has absolutely no idea what they are talking about and simply read a Wikipedia page or two and threw in a bunch of names and buzzwords.

    Hadoop doesn't help the IT department do anything with the data at all.

    It's the teams of analysts and developers that write the code to make Hadoop (which is only mentioned because of its OSS nature here) and a bunch of other technologies and code all work together and produce useful output.

    This article is basically written like the invention of the hammer made it so everyone would want to build their own homes because they could. That's a stupid assumption and statement.

    Slashdot should be fucking ashamed that this is posted anywhere, let alone front page.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  6. Re:Big Data Need by Primitive Pete · · Score: 2, Informative

    Mainframes and large multiprocessor machines have been handling multi-billion row data sets on RDBMS systems for a very long time. Data warehouses are commonly into the billions of rows. What commodity clusters provide is not efficiency--they often make poorer use of available cycles and repeat work to achieve goals.

    What large commodity clusters provide is a price per cycle low enough that the owner doesn't have to worry about efficiency. For example, Google's Dean and Ghemawat ("MapReduce: Simplified Data Processing on Large Clusters") managed to successfully sort 10^10 100-byte records over 891 seconds, or about 6MB sorted per processor per second. Very fast overall, but hardly efficient use of modern hardware. There's an important place for the new big dataset system, but the argument is cost, not efficiency.
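
The arithmetic behind those figures can be checked with a few lines. The record count, record size, and elapsed time are the ones quoted above; the processor count is not stated in this comment, so the sketch only derives what the quoted 6 MB/s figure would imply if taken at face value.

```python
records = 10**10
bytes_per_record = 100
seconds = 891

total_bytes = records * bytes_per_record  # 10**12 bytes, about 1 TB
aggregate_rate = total_bytes / seconds    # about 1.12 GB/s across the whole cluster

# Figure quoted in the comment above, taken at face value.
quoted_per_processor = 6e6  # 6 MB per processor per second
implied_processors = aggregate_rate / quoted_per_processor


def per_processor_rate(n_processors):
    """Per-processor throughput in bytes/second for an assumed processor count."""
    return aggregate_rate / n_processors
```

Whatever the actual processor count was, the aggregate figure is what makes the sort fast, and the per-processor figure is what makes it cheap rather than efficient.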

  7. Re:Big Data Need by Sarten-X · · Score: 2, Insightful

    Assuming the maximum configuration is thousands of cores, how does it compare in other aspects to Facebook's 23,000 cores and 36 petabytes of data, with unlimited scalability to come?

    For all intents and purposes, mainframes are still mainframes. They're parallel, and they grow, but they still have those limits that clusters just don't have.

    (I consider price to be a limit as well)

    --
    You do not have a moral or legal right to do absolutely anything you want.