Wal-Mart's Data Obsession
g8oz writes "The New York Times covers Wal-Mart's obsession with collecting sales data.
Fun fact: 'Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters.
To put that in perspective, the Internet has less than half as much data, according to experts.'
That much information results in some interesting data-mining. Did you know hurricanes increase strawberry Pop Tarts sales 7-fold?"
Walmart probably doesn't even know what all that data means. Think of the processing power needed to make sense of it all. I'm sure there are countless interesting trends that are lost in that data ocean.
-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
MaxPower (2263)
"I got it from a hair dryer."
As a guest of WalMart I was able to enter their data center and see this Terraplex first hand. It's massive. It's thousands upon thousands of disks in ~8' frames, rows upon rows of racks. I walked down it and across it and up it and was simply awestruck by the idea of that many disks in one spot.
The gentleman who gave me the tour indicated they have something like 72 weeks (1 year plus 2 weeks) of purchase data on LIVE disk arrays, plus huge archives of the same data on tape. If you buy anything and use your credit, debit, or whatever card, they can figure out your sales history obscenely quickly. Be afraid. Be very afraid.
I also got to see Walmart.com (Sun E15k) and Samsclub.com (A bunch of HP boxes in a smallish frame), they were creepy, in a sense... all those sales going on at once, converging on a spot not a few feet from me.
With SQL.
Teradata was built to handle processing very large datasets from day 1. 460 Terabytes distributed across a large number of CPUs and disks working in parallel with a robust SQL implementation isn't really the challenge. The hard part is keeping all those disks spinning when you start pushing MTBF limits, handling the thousands of concurrent users all banging away at the data, and the constant streaming of new data into the system in order to support near real-time DSS.
For those inclined to know more, check here.
007: "Who are you?"
Pussy: "My name is Pussy Galore."
007: "I must be dreaming..."
People who call themselves "experts" but are really just talking out of their asses do. Consider that The Internet Archive alone contains more than a petabyte (1024 terabytes) of data, all of it accessible, and that they are adding on the order of 20 terabytes a day, and you start realizing how much bigger the Web is.
We learned a lot about Walmart and Data mining in my database 101 class. And the professor asks "Why do you think Walmart is so successful?"
And everyone says something about leveraging technology and JIT delivery, etc.
Professor Liu says "Nope. Location."
Walmart chose most of their initial locations in cities/regions where there was no other competition. Places where there was no Kmart, no department stores, no malls. And they flourished.
In the future, I would want to not be isolated from my friends in the Space Station.
I know this is a joke, but as far as I know, Wal-Mart does not collect individual customer names for most purchases; there is no customer card like there is at a lot of supermarkets. I suppose they could collect data via credit cards, but I doubt that is legal.
Monstar L
The Internet definitely has more data than Wal-Mart. Consider this old 2002 study. The "deep web" alone, made up mostly of databases, contains 91,850 TB of data. And this was a couple of years ago. It doesn't include email or P2P either.
The definition they used for "Internet" was probably "web pages indexed with a search engine", which is definitely not the entire Internet.
I know a guy who worked for Wal-Mart for ~8 years as some sort of data analyst and architect at the main offices in Bentonville. While he didn't go into too much detail, he told me that a lot of the back-end querying is done, surprisingly, with Perl-DBI on Oracle databases. When I asked why his team didn't use something like flat C, C++ or Java, portability was cited as a principal motivation and that, after a certain point, speed gains were only marginal. He also said when he left ~1.5 years ago, that a small cluster migration to DB2 was being talked about. I have no idea if they license search and query code, but I got the distinct impression that there was a team of software engineers who custom crafted search algorithms for the data.
"You and your third dimension."
Uh, except that Google hasn't indexed all of the publicly available WWW. It's only indexed a small fraction of it. And the WWW isn't the Internet. They're different. Secondly, the Internet Archive alone has archived 1 petabyte of data so the figure of 230 terabytes of data on the Internet is obviously wrong.
Support the First Amendment. Read at -1
Also, don't forget that the internet includes Usenet and other services under the protocol, which has TONS of additional data. Chances are, the internet is not 230 terabytes large and the idiot who made that claim...is an idiot.
A blog like any other.
Your number is wrong. From their FAQ:
The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month.
That's 20 terabytes per month, not per day.
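For anyone who wants to check the scale of these numbers, here is a rough back-of-the-envelope sketch in Python. It only uses figures quoted in this thread (460 TB for Wal-Mart, "less than half" of that for the Internet claim, 1 petabyte and 20 TB per month for the Wayback Machine). The variable names are made up for illustration, and the 1024 TB per petabyte conversion is the one used earlier in the thread, not an official figure from the archive.

```python
# Figures quoted in the thread; nothing here is measured independently.
TB_PER_PB = 1024                      # binary convention used earlier in the thread

walmart_tb = 460                      # Wal-Mart's Teradata store, per the article
claimed_internet_tb = walmart_tb / 2  # "less than half as much", so an upper bound
archive_tb = 1 * TB_PER_PB            # Wayback Machine, per the quoted FAQ
growth_tb_per_month = 20              # per the quoted FAQ (per month, not per day)

# How long the archive would take to add one Wal-Mart's worth of data.
months_to_add_walmart = walmart_tb / growth_tb_per_month

# How many times larger the archive is than the claimed size of the Internet.
archive_vs_claim = archive_tb / claimed_internet_tb

print(f"Months to add {walmart_tb} TB at 20 TB/month: {months_to_add_walmart:.0f}")
print(f"Archive vs. claimed Internet size: {archive_vs_claim:.2f}x")
```

So even at the corrected monthly rate, the archive alone is several times the size the "experts" gave for the whole Internet.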
1511565 MB, ~1.5 terabytes in PC games being shared.
There were 44977 seeds and 196735 downloaders. After all those listed torrents are downloaded, there will be 241712 peers with all that data on their hard drives connected to the internet.
I calculated that total and got 338394133 MB, ~338 terabytes.
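The conversions above can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the decimal convention of 1,000,000 MB per TB, which is the one that reproduces the "~1.5" and "~338" figures, and it takes the megabyte totals as given, since the per-torrent sizes are not listed here.

```python
# Totals as quoted in the comment above; per-torrent data is not available.
MB_PER_TB = 1_000_000    # decimal convention (assumption; matches the quoted figures)

games_mb = 1_511_565     # PC games being shared
total_mb = 338_394_133   # poster's total once every downloader finishes
seeds = 44_977
downloaders = 196_735

games_tb = games_mb / MB_PER_TB
total_tb = total_mb / MB_PER_TB
peers = seeds + downloaders

print(f"PC games shared: ~{games_tb:.1f} TB")
print(f"Total after all downloads: ~{total_tb:.0f} TB")
print(f"Peers holding the data: {peers}")
```

With the binary convention (1,048,576 MB per TB) the total would come out closer to 323 TB, so the poster appears to have used decimal units.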