The Big Promise of 'Big Data'
snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"
I don't think Brent Spiner is any fatter than when he played that android on Star Wars.
I think that the real innovation will be a variation of SQL that allows for the persistence of queries, such that they continue to yield new results as new data is found to match them in the database. If you have a database of a trillion web pages, and you continue to put more in, it doesn't make sense to re-scan all of the existing records each time you decide you need to get the results of the query again. It should be possible, and far more computationally efficient, for a LiveSQL query to emit a continuous stream of results rather than running in batch mode.
I've registered the domain name livesql.org as a first step to helping to organize this idea and perhaps set up a standard.
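To make the persistent-query idea concrete, here is a minimal Python sketch, not any real LiveSQL implementation (no such standard exists yet). The names `LiveStore`, `LiveQuery`, `register`, and `insert` are all hypothetical. The point is only that existing rows are scanned once, at registration, and each new row is checked against the standing queries as it arrives.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


class LiveQuery:
    """A standing query: a predicate plus the results matched so far (hypothetical)."""

    def __init__(self, predicate: Predicate) -> None:
        self.predicate = predicate
        self.results: List[Record] = []


class LiveStore:
    """Toy in-memory table that supports standing queries (hypothetical)."""

    def __init__(self) -> None:
        self.rows: List[Record] = []
        self.queries: List[LiveQuery] = []

    def register(self, predicate: Predicate) -> LiveQuery:
        # Existing rows are scanned exactly once, when the query is registered.
        query = LiveQuery(predicate)
        query.results.extend(row for row in self.rows if predicate(row))
        self.queries.append(query)
        return query

    def insert(self, row: Record) -> None:
        # Only the new row is tested; stored rows are never rescanned.
        self.rows.append(row)
        for query in self.queries:
            if query.predicate(row):
                query.results.append(row)


if __name__ == "__main__":
    store = LiveStore()
    store.insert({"url": "a.example", "text": "big data and hadoop"})
    q = store.register(lambda r: "hadoop" in r["text"])
    store.insert({"url": "b.example", "text": "cooking tips"})
    store.insert({"url": "c.example", "text": "hadoop cluster notes"})
    print([r["url"] for r in q.results])  # ['a.example', 'c.example']
```

The cost per insert grows with the number of standing queries rather than with the size of the table, which is the trade-off the idea depends on.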
Big machines, not toys.
Yours In Moscow,
Kilgore T.
It can't promise to make your site the next Facebook. That only happens when you have sufficient tech, UI design, and luck. Once the network effect plays in your favor you just snowball along. After that you can even have slightly inferior tech and UI design, and you will continue to win. The inconvenience must not outweigh the switching costs. You figure out a way to make your product sticky, you increase the switching costs...
If "Big Data" is a prerequisite for all of that, then use it. Just don't get the idea that it's a silver bullet.
We have a ~100-node Hadoop cluster and it barely works. The software is utterly awful and every time we use it, the cluster falls apart in a few hours and has to be restarted. To read data, you have to copy it into HDFS, which means duplicating all your existing data. The design is awful, the implementation is awful, and it doesn't work.
Am I missing something? How is everyone else using it and making it work?
...and 'Big Data' tools like Hadoop are enabling IT organizations...
...these methods are widely available through Hadoop and other tools...
Oh... also... did I mention HADOOP!!??
Faith is a willingness to accept something w/o complete proof and to act on it. Reason allows you to correct that faith.
Looks like somebody got their PR spin piece relayed as a news story again. Bravo!
Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).
This is nothing new. It's called Business Intelligence and is already used in most of the big corporations. Wiki link: http://en.wikipedia.org/wiki/Business_intelligence
Democracy: Crowdsourcing a country near you
Because their business is based entirely on how that data correlates.
99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out.
Walmart might care, but they aren't a small business.
The local auto mechanic, or plumber, or even small companies like lawn services or maid services simply aren't big enough to justify having a staff of nerds to make the data useful to them, and they really don't have enough data to matter. It simply is too expensive on the small scale.
Companies that can REALLY benefit from the ability to comb vast quantities of data have been doing it for well over a hundred years. Insurance companies are a prime example. You know what? They aren't small in general, so they have the staff to do the data correlation and find out useful information because it works on that scale.
Anyone who cares about churning through massive amounts of data already has ways to do it. Computing will make it faster, but it's not going to change the business model.
I'm kind of puzzled why virtualization has anything to do with this, unless someone is implying that a smart thing to do is set up a VM server, and then run a bunch of VMs on it to get a 'cluster' to run distributed apps on ... if that's the point being made then I think someone needs to turn in their life card (they clearly never had a geek card).
So now that I've written all that, I went and read the article.
Now I realize that the article was written by someone who has absolutely no idea what they are talking about and simply read a Wikipedia page or two and threw in a bunch of names and buzzwords.
Hadoop doesn't help the IT department do anything with the data at all.
It's the teams of analysts and developers that write the code to make Hadoop (which is only mentioned here because of its OSS nature) and a bunch of other technologies and code all work together and produce useful output.
This article is basically written as if the invention of the hammer meant everyone would want to build their own homes just because they could. That's a stupid assumption and statement.
Slashdot should be fucking ashamed that this is posted anywhere, let alone front page.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Rummaging in the Bitlocker
Starring everybody's favorite...
Peta Bites
and costarring...
Bare Bones
and making his professional debut:
Big Data!
you need to know what to look for. In order to know what to look for you need to know what's meaningful and that requires some sort of useful model. Accumulating data in itself isn't that interesting.
EVAR?
The so-called computational scientists I used to work for through an entire alphabet soup of FFRDCs were barely able to program in FORTRAN, much less something as sophisticated as Hadoop.
Notice that the article skirts this issue -- yes, they work with "Big Data" but they don't use any dev tools developed post-1963 to do it in, believe me.
For the most part, Google has moved on to Caffeine and GFS2 for their support. Apparently, Bigtable was taking too long to regenerate the entire index, forcing Google to refresh only part of their index frequently. The new Caffeine framework supposedly lets Google get closer-to-real-time search results because newly indexed or crawled data can be continuously tossed into the search database without requiring an entire batch process. Perhaps that's why quotes from Slashdot comments show up in Google so quickly. This technology allows Google to chase news, blogs, and Twitter feeds while they're still relevant, which is pretty freaking cool.
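As a rough illustration of the batch-versus-incremental distinction described above, here is a toy Python sketch of an inverted index that is updated one document at a time. It is not how Caffeine works internally (those details aren't given here); `IncrementalIndex`, `add`, and `search` are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Set


class IncrementalIndex:
    """Toy inverted index: term -> ids of documents containing it (hypothetical)."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.doc_terms: Dict[str, Set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Only the postings touched by this one document change;
        # nothing else in the index is rebuilt.
        old_terms = self.doc_terms.get(doc_id, set())
        new_terms = set(text.lower().split())
        for term in old_terms - new_terms:
            self.postings[term].discard(doc_id)
        for term in new_terms - old_terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = new_terms

    def search(self, term: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self.postings.get(term.lower(), set()))
```

A freshly crawled page becomes searchable as soon as `add` returns, whereas a batch design would make it wait for the next full rebuild.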
The guys who were complaining about Google Instant and how Google should make better search results didn't mention Caffeine. Hopefully, Google can figure out how to use this technology to weed out the spam links and SEO crap that dominates some searches.
A NYC lawyer blogs. http://www.chuangblog.com/
Vertica can handle lots of data in a very fast manner (at least for data warehousing). They use an MPP architecture: commodity hardware in a cluster running Linux.
No need for big machines. You can use lots of little ones.
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
@CmdrTaco, et al., You might go to the 'in-depth guide' Ohlhorst mentions [http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml], and assess that separately. We did a lot of research with the CIO and the rest of the C-suite in mind as a target audience.
Of course Google has moved on beyond Bigtable, etc. According to @royans [http://www.royans.net/arch/pregel-googles-other-data-processing-infrastructure/], Google uses Pregel to mine graphs. Allegedly 20% of their data they mine with Pregel; the other 80% they mine with MapReduce. Two of the Google engineers presented on Pregel at SIGMOD in June. In other words, these companies are developing and using different methods to mine different kinds of data. Much of the tool innovation happens at the companies doing the mining.
@BitZstream, et al., Try to step back a bit and think about the frogs in the ponds next to yours. There is life beyond SQL and relational data. IT departments at large enterprises, particularly those with a significant Web presence or large collections of less-structured data, are using Hadoop, and we cite some of them in the issue. We have spoken to others since we published in the spring. Hadoop is a true ecosystem with lots of developers who've been plugged in for years, and they work at Web scale. Yes, the challenge of operating at Web scale is not a challenge everyone has, but it's a challenge more will face.
@AlanMorrison
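For anyone who hasn't seen the MapReduce model mentioned above, here is a minimal single-process Python sketch of its three steps (map, shuffle, reduce) applied to word counting. This is plain Python, not the Hadoop API, and the function names are made up for illustration. A real cluster runs the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc: str) -> Iterator[Tuple[str, int]]:
    # Emit a (word, 1) pair for every word in one document.
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all emitted values by key, as the framework would between phases.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Collapse the values for one key into a single count.
    return key, sum(values)


def word_count(docs: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
    # {'big': 2, 'data': 1, 'cluster': 1}
```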