Web Analytics Databases Get Even Larger
CurtMonash writes "Web analytics databases are getting even larger. eBay now has a 6 1/2 petabyte warehouse of user data running on Greenplum, to go with its more established 2 1/2 petabyte Teradata system. Between the two databases, the metrics are enormous: 17 trillion rows, 150 billion new rows per day, millions of queries per day, and so on. Meanwhile, Facebook has 2 1/2 petabytes managed by Hadoop, not running on a conventional DBMS at all, Yahoo has over a petabyte (on a homegrown system), and Fox/MySpace has two different multi-hundred terabyte systems (Greenplum and Aster Data nCluster). eBay and Fox are the two Greenplum customers I wrote about last August, when they both seemed to be headed to the petabyte range in a hurry. These are basically all web log/clickstream databases, except that network event data is even more voluminous than the pure clickstream stuff."
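For a sense of scale, here is a rough back-of-the-envelope sketch of what the eBay figures in the summary imply. It assumes decimal petabytes (10^15 bytes), that the 17 trillion rows are spread over the combined 9 petabytes, and that the 150 billion new rows arrive evenly over a day. None of that is stated in the summary, and the function names are made up for illustration.

```python
# Rough arithmetic on the eBay figures quoted in the summary.
# Assumptions (not from the source): decimal petabytes, rows spread
# over both warehouses combined, and a perfectly even daily load.

PETABYTE = 10**15  # bytes, decimal definition (assumption)

GREENPLUM_BYTES = 6.5 * PETABYTE   # "6 1/2 petabyte warehouse"
TERADATA_BYTES = 2.5 * PETABYTE    # "2 1/2 petabyte Teradata system"
TOTAL_ROWS = 17 * 10**12           # "17 trillion rows"
NEW_ROWS_PER_DAY = 150 * 10**9     # "150 billion new rows per day"

SECONDS_PER_DAY = 24 * 60 * 60


def avg_bytes_per_row(total_bytes, rows):
    """Average storage footprint per row, in bytes."""
    return total_bytes / rows


def rows_per_second(rows_per_day):
    """Average ingest rate if the daily load were spread evenly."""
    return rows_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    total_bytes = GREENPLUM_BYTES + TERADATA_BYTES
    print(f"bytes per row: {avg_bytes_per_row(total_bytes, TOTAL_ROWS):.0f}")
    print(f"rows per second: {rows_per_second(NEW_ROWS_PER_DAY):,.0f}")
```

Under those assumptions it works out to roughly half a kilobyte per row and about 1.7 million new rows per second on average. Real load is surely burstier than that.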
"Web analytics databases are getting every larger. eBay now has a 6 1/2 petabyte ...
Um, was there a major development in the English language while I was sleeping last night?
My work here is dung.
What's "every larger"? Can I get one, too?
Databases "get every larger"? WTF? Maybe it's a second language for the poster, but as far as I know, CmdrTaco is a plain white-bread murriken who has had a couple of decades to practice the language.
At least these won't get out in the open that easily because someone copied them to a USB drive and lost it somewhere.
For shame, Taco...
...but do they move every zig?
Slow news day alert!
The topic in question is blindingly obvious to anyone who has heard of this newfangled "Internet" thing, and frankly is not worth an article in the first place. /. reporting at its finest... For shame.
Furthermore, such a blatant error in the headline and summary is simply ridiculous. Do the submitters or editors not reread text prior to submission? This is sloppy.
There is no psychiatrist in the world like a puppy licking your face - Ben Williams
is this ok?
If I asked them nicely, can they copy it to a floppy and send it to me?
Hopefully it'll compress nicely.
...since that's the database on which Greenplum is based. PostgreSQL 8.4 is coming out soon and looks like it's got a lot of improvements. Too bad replication didn't make it in... hopefully in 8.5.
One of the improvements that looks good is the parallelized restore; RubyForge's upgrade from PostgreSQL 8.2 to 8.3 took 30 minutes to restore the db and it seems like this feature will speed that up considerably.
The Army reading list
These little puppies, i.e. recursive queries, look pretty cool too. Sounds like a good tool for threaded comment systems or finding related items in a table:
Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example is this query to find all the direct and indirect sub-parts of a product, given only a table that shows immediate inclusions:
WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part, p.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) AS total_quantity
FROM included_parts
GROUP BY sub_part
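If you want to poke at that query without a PostgreSQL 8.4 install handy, here is a minimal sketch that runs the same statement from Python. It uses SQLite (which also supports WITH RECURSIVE) in place of PostgreSQL. The `parts` table layout, the sample rows, and the helper names `build_demo_db` and `total_sub_parts` are all made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, invented parts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parts (part TEXT, sub_part TEXT, quantity INTEGER)"
    )
    conn.executemany(
        "INSERT INTO parts VALUES (?, ?, ?)",
        [
            ("our_product", "frame", 1),
            ("our_product", "wheel", 2),
            ("wheel", "spoke", 32),
            ("wheel", "tire", 1),
            ("frame", "bolt", 4),
            ("other_product", "bolt", 9),  # not reachable from our_product
        ],
    )
    return conn


def total_sub_parts(conn, product):
    """Return {sub_part: summed quantity} for every direct or indirect sub-part."""
    rows = conn.execute(
        """
        WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
            SELECT sub_part, part, quantity FROM parts WHERE part = ?
          UNION ALL
            SELECT p.sub_part, p.part, p.quantity
            FROM included_parts pr, parts p
            WHERE p.part = pr.sub_part
        )
        SELECT sub_part, SUM(quantity) AS total_quantity
        FROM included_parts
        GROUP BY sub_part
        """,
        (product,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    for sub_part, qty in sorted(total_sub_parts(conn, "our_product").items()):
        print(sub_part, qty)
```

Note that the query sums the quantity column as stored. It does not multiply quantities down the tree, so with this sample data "spoke" comes out as 32, not 2 x 32.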
They'll get replication some day soon. But there is a lot of cool, very useful stuff with every new release. I usually feel like a kid in a candy store wondering what's new that I can exploit.
get every larger
2/12 can be expressed more simply as 1/6.
If you have ever touched one of their Web sites and caught their cookie, your tracks can be followed into unexpected places. This data is a gold mine for them, if they can figure out how to sell it without pissing off users with how much they know.
2/12? Most people would just write that as 1/6, but I guess that doesn't sound as impressive?
Remember RFC 873!
Who cares about eBay and MySpace... tell me about the major players! What is Google running?
Bow before me, for I am root.
This astounds me. These numbers only represent a few companies. Consider that it would take about 5,790 yottabytes* to store a 150lb human body (at a byte per atom). Now consider that people keep in their pocket more storage than existed on the planet 30 years ago. So in another 30 years.... wow. Just think about that for a minute.
* giga tera peta exa zetta yotta
This is running on MS Access, right?
torrent?
They use MySQL for storing AdWords data and Google Analytics for web site metrics (which itself stores data in Bigtable).
Bigtable holds a mind-bogglingly huge amount of information. The amount of stuff in their MySQL clusters is merely "absurdly large" by comparison.
-B
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
With all that user data, you'd think they would know me better by now. But I still get these lame recommendations.
"You might be interested in action DVDs because you bought one in the past" - BRILLIANT!!
I wonder how much of that is my adcrawler bot with its referer and user agent randomizer. You guys all clear cookies after every unique domain visit, right?
These articles make me believe that Greenplum has some good PR working, because in all the analytics I have done, people tend to scoff at Greenplum.
Hadoop clusters are more scalable, more flexible, and strangely more supportable than Greenplum. When I worked with Greenplum, we were able to bring down the server easily by executing simple 'select * from table' queries.
Netezza, which is strangely not mentioned, is much better for doing distincts, which are used quite often in analytics. Greenplum chokes on correlating the data sets.