The Big Promise of 'Big Data'
snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"
There are some serious technical challenges to overcome when you think about actually implementing something like this.
Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression from the previous result alone, given a point mutation to one of the entries for the column. Unless extra bookkeeping is kept alongside the result, any change cascades globally, and is hard to recompute on the fly without scanning all the values again.
This issue is also present in queries using ordered results (as changes to a single value participating in the ordering would affect the global ordering of results for that query).
The issue that "Big Data" presents is really the need to run -global- data analysis on extremely large datasets, utilizing data parallelism to extract performance from a cluster of machines.
What you're suggesting (basically a functional reactive framework for querying volatile persistent data) would still impose a number of limitations relative to the SQL model, chiefly disallowing the usage of any truly global algorithm across large datasets. Tools like Hadoop get around these limitations by taking the focus away from the data model (which is what SQL excels at dealing with), and putting it on providing an expressive framework for describing distributable computations (which SQL is not so great at dealing with).
-Laxitive
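As a rough illustration of the "distributable computation" style described in the comment above, here is a minimal single-process sketch in Python of a map step and a reduce step for a global standard deviation. It is not Hadoop code: the function names (map_partition, combine, distributed_stddev) and the sample partitions are made up for this example, and the map calls that a cluster would spread across machines simply run in a loop here.

```python
from functools import reduce
from math import sqrt


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: two partial summaries merge by element-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def stddev_from_summary(summary):
    """Population standard deviation computed from the merged summary."""
    n, total, total_sq = summary
    mean = total / n
    return sqrt(total_sq / n - mean * mean)


def distributed_stddev(partitions):
    # Assumption: on a real cluster each map_partition call would run on
    # the node holding that partition; here they run one after another.
    partials = [map_partition(p) for p in partitions]
    return stddev_from_summary(reduce(combine, partials))


# Hypothetical data, split into three partitions.
partitions = [[2, 4, 4], [4, 5, 5], [7, 9]]
result = distributed_stddev(partitions)
print(result)
```

The sum-of-squares formula is used only because it is short; it can lose precision on large or badly scaled data. Order-dependent results, such as the sorted queries mentioned above, do not reduce to a small fixed-size summary in the same way.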
That's been done for years: materialized views, using triggers on INSERT/UPDATE/DELETE to update the views on the fly.
Streaming results as needed is done with cursors.
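A minimal sketch of both ideas, using Python's built-in sqlite3 module. SQLite has no native materialized views, so a trigger-maintained summary table stands in for one here; the table and trigger names (readings, readings_summary, readings_ins and so on) are hypothetical, and a production database would differ in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);

-- Stand-in for a materialized view: a single summary row kept current
-- by the triggers below instead of being recomputed by a full scan.
CREATE TABLE readings_summary (n INTEGER NOT NULL, total INTEGER NOT NULL);
INSERT INTO readings_summary VALUES (0, 0);

CREATE TRIGGER readings_ins AFTER INSERT ON readings BEGIN
    UPDATE readings_summary SET n = n + 1, total = total + NEW.value;
END;
CREATE TRIGGER readings_upd AFTER UPDATE OF value ON readings BEGIN
    UPDATE readings_summary SET total = total - OLD.value + NEW.value;
END;
CREATE TRIGGER readings_del AFTER DELETE ON readings BEGIN
    UPDATE readings_summary SET n = n - 1, total = total - OLD.value;
END;
""")

# Each statement below fires a trigger that adjusts the summary row.
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(v,) for v in (2, 4, 4, 4, 5, 5, 7, 9)],
)
conn.execute("UPDATE readings SET value = 10 WHERE id = 1")
conn.execute("DELETE FROM readings WHERE id = 8")

summary = conn.execute("SELECT n, total FROM readings_summary").fetchone()

# Cursor: pull rows in small batches rather than loading the whole result.
cur = conn.execute("SELECT value FROM readings ORDER BY id")
batches = []
while True:
    rows = cur.fetchmany(3)
    if not rows:
        break
    batches.append([row[0] for row in rows])

print(summary, batches)
```

The summary row is adjusted by a constant amount of work per change, which is the point being made about triggers; the fetchmany loop shows a cursor handing back results a few rows at a time.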
I know you probably think you're talking about something that materialized views and cursors don't do. Fortunately, you're wrong and just don't understand how to use them.
It really bothers me how people who talk about problems with SQL really have no fucking clue what they are talking about or how to work with the data in the first place.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager