Google's BigQuery Vs. Hadoop: a Matchup
Nerval's Lobster writes "Ready to 'Analyze terabytes of data with just a click of a button?' That's the claim Google makes with its BigQuery platform. But is BigQuery really an analytics superstar? It was unveiled in Beta back in 2010, but recently gained some improvements such as the ability to do large joins. In the following piece, Jeff Cogswell compares BigQuery to some other analytics and OLAP tools, and hopefully that'll give some additional context to anyone who's thinking of using BigQuery or a similar platform for data. His conclusion? In the end, BigQuery is just another database. It can handle massive amounts of data, but so can Hadoop. It's not free, but neither is Hadoop once you factor in the cost of the hardware, support, and the paychecks of the people running it. The public version of BigQuery probably isn't even used by Google, which likely has something bigger and better that we'll see in five years or so."
With Google's tendency to randomly quit working on products and technologies, I would never use them. That is why Hadoop is a much better option.
"The public version of BigQuery probably isn't even used by Google, which likely has something bigger and better that we'll see in five years or so"
With in-depth analysis like that who needs the full article.........
That's so 2008. Wake me up when I can process terabytes of data with the sound of my voice, the wave of my hand, or the wave pattern of my brain. ;P
My God can beat up your God. Just kidding...don't take offense. I know there's no God.
avoid the protracted outages, painful licensing, free access by federal authorities and data mining by a private multinational and just do this instead http://hypertable.org/
Good people go to bed earlier.
wrong link in article
Google really is too ubiquitous. There really do need to be limits on how far-reaching one company can become.
Take, for example, Google's expansion into being an ISP. There is not a hint of benevolence there in wanting people to have a fast connection. Google wants more eyeballs for its ads. Full stop. You are not the customer with companies like Google -- you're the product. I wouldn't choose Google as my ISP if they were free. I have never trusted Google or anyone else with my data. They are too big and people love it. I can hear the shouts of capitalistic joy even now...
test to see how this works.
IMHO, Splunk crushes all of these solutions. While it is not "free", it has an incredibly short time to value. The first time I stood it up in production, it took less than two hours - most of our time was spent checking our work. Now, I can quickly analyze large volumes of data, and only have to manage a single software component. I love it!
robots obey what the children say - TMBG
I mean, if you want to have an ad agency host your databases, you've got lots of other options:
R&R Partners
McCann Erickson
BBDO
J Walter Thompson
Omnicom Group
Young & Rubicam
DDB
Ogilvy & Mather
Saatchi & Saatchi
Leo Burnett
Personally, I think I'll try to find a company where cloud computing is their core business, so they don't just write off the service a few months down the road as not-profitable and leave me hanging.
Why the hell would you want to have your mission-critical systems hosted by an ad agency?
First of all, this article isn't a comparison or matchup - it's just a speculative post by someone who has done very little research and obviously lacks domain knowledge in the space. There is no mention of use cases, data sizes, performance, costs.
Hadoop is an open-source framework for distributed data processing, specifically an implementation of the MapReduce framework. BigQuery is a hosted service that allows you to run queries over massive datasets via an API. There are tools built on top of Hadoop that allow for fast querying over large datasets (Impala), and there are even tools that are not Hadoop-based that provide this as well (Spark + Shark). However, actually using these tools is a whole different game - the author makes no mention of how many nodes/VMs would be required to get query performance comparable to BigQuery's.
Then there's data sizes. The author makes a strange claim that BigQuery "queries don’t run instantly; one of the samples took 3.3 seconds to grind through 3.49 Gigabytes of data. But that’s clearly fine for quick lookups." Huhn? What tool(s) are you comparing against? BigQuery allows users to run full table aggregate ad-hoc queries over really really big datasets (i.e. terabytes). In public talks, Google has demonstrated that it is possible to run regular expression match queries, with sums and aggregations, over several terabytes of data in under a minute. In order to do this with a MapReduce-based system, what needs to be done - perhaps use something like Hive, or write a custom MapReduce function - and what is the performance in this case? For the same use case, what is the cost of using some of the "OLAP" tools that the author describes? Would love to see some benchmarks.
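To make the "write a custom MapReduce function" option concrete, here is a toy sketch in plain Python of a regex match with a sum per matched key. It is not Hadoop, Hive, or BigQuery code and says nothing about their performance; the sample rows, the function names (map_phase, reduce_phase, regex_sum), and the pattern are all made up for illustration.

```python
import re
from collections import defaultdict
from itertools import chain

# Hypothetical sample records: (page title, view count).
ROWS = [
    ("Google_BigQuery", 120),
    ("Apache_Hadoop", 300),
    ("Google_Search", 80),
    ("Hadoop_Distributed_File_System", 50),
]


def map_phase(rows, pattern):
    """Emit (matched_text, views) for every row whose title matches."""
    regex = re.compile(pattern)
    for title, views in rows:
        match = regex.search(title)
        if match:
            yield match.group(0), views


def reduce_phase(pairs):
    """Sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def regex_sum(partitions, pattern):
    """Run the map step over each partition, then reduce the combined output."""
    mapped = chain.from_iterable(map_phase(part, pattern) for part in partitions)
    return reduce_phase(mapped)


if __name__ == "__main__":
    # Two partitions stand in for data split across worker nodes.
    partitions = [ROWS[:2], ROWS[2:]]
    print(regex_sum(partitions, r"Google|Hadoop"))
```

On a real cluster the map step would run in parallel on each node and the framework would shuffle the keys to the reducers; how many nodes that takes to finish a multi-terabyte scan in under a minute is exactly the benchmark the article is missing.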
Re: "In the end, BigQuery is just another database."
Huhn? BigQuery is not a database at all - it doesn't support CRUD operations on data - rather it is an append-only analytics tool. And conversely, databases, relational or not, aren't really the right tools for full table scan ad-hoc queries over many terabytes, which is what BigQuery is designed to do. BigQuery is a developer's product, and one that can be integrated with existing web apps via a RESTful API. Hadoop has its own development role and story (and tools like Cascading are really great) but it's not designed as the backend for interaction via a RESTful API out of the box - it takes a bit more work to provide Hadoop as a service for developers to integrate with an application.
Re: "The public version of BigQuery probably isn't even used by Google, which likely has something bigger and better that we'll see in five years or so."
BigQuery is based on Google's internal Dremel, which is used every day by Google. There is a very public research paper describing Dremel (much the same as how Google described MapReduce years ago). Read about what is available in Dremel versus what is available in BigQuery: http://research.google.com/pubs/pub36632.html
Mining of Big Data is a problem solved in 2013 with zgrep.
[I]nstead of a dialog, this post got a -1.
You're talking about politics and conspiracy theories in an article about big data. Yes, that is off topic.
Why does the Internet always have to be about "monetization"? I'd like to see open, standards-compliant offerings that are truly "free" as in freedom and very low cost...
You're living in a dreamland. Like it or not, electricity, hardware, and wires cost money.
I'm hoping Firefox OS proves to be one of these. Let's hope as a non-profit...
FYI, Mozilla Foundation is funded, in large part, by Google.
Look at OpenBSD, for example. Not much better in terms of a secure server environment.
And it has scant adoption. Meanwhile, the rest of us are charging ahead and getting stuff done with steadily advancing tools rather than messing around with arcane operating systems that have 10-year-old feature sets.
You use Vertica