Slashdot Mirror


The Big Promise of 'Big Data'

snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"

5 of 78 comments

  1. End of Science by mysterons · · Score: 2, Informative

    Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).

  2. Re:LiveSQL by Laxitive · · Score: 4, Informative

    There are some serious technical challenges to overcome when you think about actually implementing something like this.

    Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

    This issue is also present in queries using ordered results (as changes to a single value participating in the ordering would affect the global ordering of results for that query).

    The issue that "Big Data" presents is really the need to run -global- data analysis on extremely large datasets, utilizing data parallelism to extract performance from a cluster of machines.

    What you're suggesting (basically a functional reactive framework for querying volatile persistent data) would still involve a number of limitations over the SQL model: basically disallowing the usage of any truly global algorithm across large datasets. Tools like Hadoop get around these limitations by taking the focus away from the data model (which is what SQL excels at dealing with), and putting it on providing an expressive framework for describing distributable computations (which SQL is not so great at dealing with).

    -Laxitive
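
To make "an expressive framework for describing distributable computations" concrete, here is a minimal single-process sketch of the map/shuffle/reduce pattern. It is not Hadoop's API (Hadoop's native interface is Java); the names `map_reduce`, `count_mapper`, and `count_reducer` are hypothetical and chosen only for illustration, and the partitions are plain lists standing in for data spread across machines.

```python
from collections import defaultdict


def map_reduce(partitions, mapper, reducer):
    """Run mapper over every record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for partition in partitions:            # each partition could live on a different machine
        for record in partition:
            for key, value in mapper(record):
                groups[key].append(value)   # the "shuffle": route values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(line):
    # Emit (word, 1) for every word in a line of text.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    # Combine all counts emitted for one word.
    return sum(values)


partitions = [
    ["big data big promise"],
    ["Big clusters", "data everywhere"],
]
counts = map_reduce(partitions, count_mapper, count_reducer)
```

The user only writes the mapper and the reducer; how records are partitioned and where each piece runs is left to the framework, which is the shift of focus the comment describes.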

  3. Re:LiveSQL by Rakishi · · Score: 2, Informative

    Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

    Stddev is trivial to recompute on the fly and I'd be surprised if any decent sql engine didn't compute it one row at a time. Store mean(column) and mean(column^2). SD = sqrt(mean(c^2)-mean(c)^2) not considering the unbiasing stuff. Add new row value deltas to both, do some simple math and you're done.
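
The running-sums approach described above can be sketched as follows. This is a minimal illustration, not how any particular SQL engine implements `stddev`; the class name `RunningStd` and its methods are hypothetical. It computes the population standard deviation (no unbiasing) and treats a point mutation as a delete followed by an insert.

```python
import math


class RunningStd:
    """Population standard deviation maintained from running sums."""

    def __init__(self):
        self.n = 0
        self.total = 0.0      # sum of values
        self.total_sq = 0.0   # sum of squared values

    def insert(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def delete(self, x):
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x

    def update(self, old, new):
        # A point mutation is a delete of the old value plus an insert of the new one.
        self.delete(old)
        self.insert(new)

    def std(self):
        if self.n == 0:
            raise ValueError("no rows")
        mean = self.total / self.n
        variance = self.total_sq / self.n - mean * mean
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


rows = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
rs = RunningStd()
for value in rows:
    rs.insert(value)
before = rs.std()       # standard deviation of the original column
rs.update(9.0, 1.0)     # mutate one entry without rescanning the rest
after = rs.std()
```

One caveat: subtracting two large, nearly equal sums can lose precision in floating point, so a numerically stabler scheme such as Welford's algorithm may be preferable when values are large relative to their spread.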

    Now medians and quantiles are a bitch.

    Frankly complex data mining of large data is a pain in the ass on hadoop as much as anywhere else. You can't do anything too global with hadoop because then you'd need to send all your data to one box anyway. You need specialized complex algorithms since you can only keep a fraction of your data in memory at a time. Simple regression? Have fun.

    That said if you're already using hadoop it's quite possible you're using some sort of online learning algorithm anyway for just that reason so converting it to real time updating would be easy.
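
As a rough illustration of why an online learning algorithm converts easily to real-time updating, here is a sketch of a one-feature linear model trained by stochastic gradient descent, one row at a time. The class name, learning rate, and toy data are all assumptions made for the example; nothing here is specific to Hadoop.

```python
class OnlineLinearRegression:
    """One-feature linear model y = w*x + b, updated one row at a time by SGD."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # Gradient step on half the squared error of this single row.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


model = OnlineLinearRegression()
# Toy noiseless stream generated from y = 2x + 1.
stream = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
for _ in range(2000):
    for x, y in stream:
        model.update(x, y)
```

Each call to `update` touches only the current row and the model's two parameters, so the same call works whether rows arrive from a batch scan or from a live feed.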

  4. Re:LiveSQL by BitZtream · · Score: 4, Informative

    I think that the real innovation will be a variation of SQL that allows for the persistence of queries

    That's been done for years: materialized views, using triggers on INSERT/UPDATE/DELETE to update the views on the fly.
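
A small sketch of the trigger-maintained approach, using Python's built-in `sqlite3` module. SQLite has no native materialized views, so this emulates one with an ordinary summary table kept current by triggers; the table and trigger names (`orders`, `order_totals`, `orders_ai`, and so on) are hypothetical, and engines with real materialized views handle refresh in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL);

-- Hypothetical summary table standing in for a materialized view.
CREATE TABLE order_totals (n INTEGER NOT NULL, total REAL NOT NULL);
INSERT INTO order_totals VALUES (0, 0.0);

CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
    UPDATE order_totals SET n = n + 1, total = total + NEW.amount;
END;

CREATE TRIGGER orders_au AFTER UPDATE OF amount ON orders BEGIN
    UPDATE order_totals SET total = total - OLD.amount + NEW.amount;
END;

CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
    UPDATE order_totals SET n = n - 1, total = total - OLD.amount;
END;
""")

conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,), (30.0,)])
conn.execute("UPDATE orders SET amount = 25.0 WHERE id = 2")
conn.execute("DELETE FROM orders WHERE id = 1")

# Reading the summary is a single-row lookup, with no rescan of orders.
n, total = conn.execute("SELECT n, total FROM order_totals").fetchone()
```

This works cleanly because count and sum can be adjusted from the old and new row values alone; aggregates without that property need more bookkeeping or a rescan.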

    Streaming results as needed is done with cursors.

    I know you think you're probably talking about something that 'materialized views and cursors don't do'. Fortunately, you're wrong and just don't understand how to use them.

    It really bothers me how people who talk about problems with SQL really have no fucking clue what they are talking about or how to work with the data in the first place.

    --
    Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

  5. Re:Big Data Need by Primitive Pete · · Score: 2, Informative

    Mainframes and large multiprocessor machines have been handling multi-billion row data sets on RDBMS systems for a very long time. Data warehouses are commonly into the billions of rows. What commodity clusters provide is not efficiency--they often make poorer use of available cycles and repeat work to achieve goals.

    What large commodity clusters provide is a price per cycle low enough that the owner doesn't have to worry about efficiency. For example, Google's Dean and Ghemawat ("MapReduce: Simplified Data Processing on Large Clusters") sorted 10^10 100-byte records (about 1 TB) in 891 seconds, or about 6 MB sorted per processor per second. Very fast overall, but hardly efficient use of modern hardware. There's an important place for the new big dataset system, but the argument is cost, not efficiency.
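
The arithmetic behind the sort figures can be checked with a few lines. The record count, record size, and elapsed time come from the comment; the comment does not state how many processors were used, so the last line only derives the count that the quoted per-processor rate would imply. That derived number is not a figure from the paper and should be checked against it.

```python
records = 10**10
record_bytes = 100
seconds = 891

total_bytes = records * record_bytes              # 10^12 bytes, about 1 TB
aggregate_mb_per_s = total_bytes / seconds / 1e6  # cluster-wide sort rate in MB/s

# Per-processor rate as quoted in the comment above.
quoted_mb_per_s_per_processor = 6.0
# Processor count that the quoted rate would imply (derived, not sourced).
implied_processors = aggregate_mb_per_s / quoted_mb_per_s_per_processor
```

The aggregate rate comes out a little over 1.1 GB per second, so the per-processor figure depends entirely on how many processors that is divided across.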