Web Analytics Databases Get Even Larger
CurtMonash writes "Web analytics databases are getting even larger. eBay now has a 6 1/2 petabyte warehouse running on Greenplum (user data) to go with its more established 2 1/2 petabyte Teradata system. Between the two databases, the metrics are enormous: 17 trillion rows, 150 billion new rows per day, millions of queries per day, and so on. Meanwhile, Facebook has 2 1/2 petabytes managed by Hadoop, not running on a conventional DBMS at all, Yahoo has over a petabyte (on a homegrown system), and Fox/MySpace has two different multi-hundred terabyte systems (Greenplum and Aster Data nCluster). eBay and Fox are the two Greenplum customers I wrote about last August, when they both seemed to be headed to the petabyte range in a hurry. These are basically all web log/clickstream databases, except that network event data is even more voluminous than the pure clickstream stuff."
For shame, Taco...
...since that's the database on which Greenplum is based. PostgreSQL 8.4 is coming out soon and looks like it's got a lot of improvements. Too bad replication didn't make it in... hopefully in 8.5.
One of the improvements that looks good is the parallelized restore; RubyForge's upgrade from PostgreSQL 8.2 to 8.3 took 30 minutes to restore the db and it seems like this feature will speed that up considerably.
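For anyone who wants to try it, here is a rough sketch of what a parallel restore invocation could look like, wrapped in a small Python helper. The helper name, the dump file name, the database name, and the job count are all made up for illustration; the only real pieces are pg_restore's --jobs and --dbname options, which arrive with 8.4. As far as I know, parallel restore in 8.4 needs a custom-format dump (pg_dump -Fc), so that is assumed here.

```python
import shlex


def parallel_restore_cmd(dump_path, dbname, jobs=4):
    """Build a pg_restore argument list that restores with several workers.

    Hypothetical helper: dump_path is assumed to be a custom-format
    archive made with `pg_dump -Fc`.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return ["pg_restore", f"--jobs={jobs}", f"--dbname={dbname}", dump_path]


# Illustrative names only; substitute your own dump file and database.
cmd = parallel_restore_cmd("rubyforge.dump", "rubyforge", jobs=4)
print(shlex.join(cmd))  # shell-ready command line for copy and paste
```

How much this helps depends on the number of cores, the disk, and how many large tables and indexes can be rebuilt concurrently, so the right job count is something to measure rather than guess.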