Wal-Mart's Data Obsession
g8oz writes "The New York Times covers Wal-Mart's obsession with collecting sales data.
Fun fact: 'Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters.
To put that in perspective, the Internet has less than half as much data, according to experts.'
That much information results in some interesting data-mining. Did you know hurricanes increase strawberry Pop Tarts sales 7-fold?"
would like to welcome our new (evil) data collecting overlords.
"Sanity is not statistical", George Orwell, "1984"
My company alone has over 50 terabytes of data available for download on the internet. Whoever thinks there's that little data on the internet is very poorly-informed.
you fools have no idea that I would never let you hurt the Wall-Mart
they're storing them on a huge cluster of their $200 Lindows systems. ;)
Marge, get me your address book, 4 beers, and my conversation hat.
Correlation doesn't imply causation!!!!!
I mean what if a third factor caused both the hurricanes and strawberry Pop Tart sales to increase 7-fold????
Somebody was going to blurt that bromide out at that statement, so it may as well be me.
Seastead this.
As a guest of WalMart I was able to enter their data center and see this Terraplex first hand. It's massive. It's thousands upon thousands of disks in ~8' frames, rows upon rows of racks. I walked down it and across it and up it and was simply awestruck by the idea of that many disks in one spot.
The gentleman who gave me the tour indicated they have something like 72 weeks (1 year plus 2 weeks) of purchase data on LIVE disk arrays, plus huge archives of the same data on tape. If you buy anything and use your credit, debit, or whatever card, they can figure out your sales history obscenely quickly. Be afraid. Be very afraid.
I also got to see Walmart.com (Sun E15k) and Samsclub.com (A bunch of HP boxes in a smallish frame), they were creepy, in a sense... all those sales going on at once, converging on a spot not a few feet from me.
More to the point - how do they back it up?
select sFirstName,sLastName,iPhone from LargeAssDatabase where bWelFare = False;
go on vacation for a week or ten..
deal with resulted data.
I'm a big retard who forgot to log out of Slashdot on Mike's computer! LOOK AT ME.
A few years ago when I worked in retail, everything was going smoothly. Every night the managers would go around with electronic guns and see what needed ordering the next day. Except for the busiest times of the year the backroom was pretty much empty of stock, and on top of the aisles the extra stock was minimal.
Then one day, the managers were really excited, as we were going to have a computer order everything for us, from records of sales from before and it would "predict" what we would need. They said the extra stock on top of the aisles would be eliminated. We would be able to concentrate on customer service.
Well, the day came, and for a few months you could tell the computer was fighting with limited data. Some weeks we would be ridiculously overstocked on a few items; other weeks, the leading sellers in the store would have empty shelves. When it finally settled down after a year, it was worse than before the computer.
The top of aisles were jammed to the ceiling with stock, there was never any room to put anything up there, and getting to the bottom for something you needed cost a lot of time. Plus, the backroom was packed with stock. You could hardly move around, and trying to find the last box of something buried underneath these huge piles was a task that killed your morale. During the slow months, one stocker for the whole store was enough for a night, now 3 were common to deal with all the stock.
With SQL.
Teradata was built to handle processing very large datasets from day 1. 460 Terabytes distributed across a large number of CPUs and disks working in parallel with a robust SQL implementation isn't really the challenge. The hard part is keeping all those disks spinning when you start pushing MTBF limits, handling the thousands of concurrent users all banging away at the data, and the constant streaming of new data into the system in order to support near real-time DSS.
For those inclined to know more, check here.
007: "Who are you?"
Pussy: "My name is Pussy Galore."
007: "I must be dreaming..."
Firstly, there is no way they can be talking about all the data available on the internet. Filesharing networks alone have WAY more data than this, and when you add all the FTP servers and mirrors, the webmail archives, the home Windows users with insecure shares...
There is no way this can be true. Even if you ONLY take publicly available WWW pages, it would far exceed their measly estimate.
If it's in your sig, it's in your post.
Wal-Mart employees who use their employee discount cards have every purchase tracked and monitored.
Activity of the cards is ACTUALLY monitored for discrepancies in buying habits to find abusive employees who buy things for their friends?
Did you also know Wal-Mart's employee name badges have RFID tags (and have had for many years) that allow Wal-Mart to track where an employee is at any given time?
Another interesting tidbit, did you know at Wal-Mart's Jewelry warehouses they actually WEIGH the amount of metal in your body when you enter and leave? (And I don't mean they ask you to put things in a dish and weigh the dish - they scan YOU)
Another interesting thing, Wal-Mart has a fallout facility in Oklahoma that has a near-real-time backup of each BIT of that 460 terabytes of data?
Wal-Mart could survive a direct nuclear blast and still keep on a truckin'.
And, of course, if you're in a Wal-Mart home office - ISD building - distribution center - et al... and dial 911 - BOOM - you get Wal-Mart's private security? Niiice, hope it's not a real emergency, you first have to explain it to them - then if they deem it necessary THEY will call the REAL 911!
Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters. To put that in perspective, the Internet has less than half as much data, according to experts.' ...
normally, but I guess they didn't check when I was sharing my pr0n on direct connect.
The Internet definitely has more data than Wal-Mart. Consider this old 2002 study. The "deep web" alone, made up mostly of databases, holds 91,850 TB of data. And this was a couple years ago. It doesn't include email or P2P either.
The definition they used for "Internet" was probably "web pages indexed with a search engine", which is definitely not the entire Internet.
I know a guy who worked for Wal-Mart for ~8 years as some sort of data analyst and architect at the main offices in Bentonville. While he didn't go into too much detail, he told me that a lot of the back-end querying is done, surprisingly, with Perl-DBI on Oracle databases. When I asked why his team didn't use something like flat C, C++ or Java, portability was cited as a principal motivation and that, after a certain point, speed gains were only marginal. He also said when he left ~1.5 years ago, that a small cluster migration to DB2 was being talked about. I have no idea if they license search and query code, but I got the distinct impression that there was a team of software engineers who custom crafted search algorithms for the data.
"You and your third dimension."
Uh, except that Google hasn't indexed all of the publicly available WWW. It's only indexed a small fraction of it. And the WWW isn't the Internet. They're different. Secondly, the Internet Archive alone has archived 1 petabyte of data so the figure of 230 terabytes of data on the Internet is obviously wrong.
Support the First Amendment. Read at -1
... do they have a freezer big enough for 460TB worth of drives?
Hugh Hefner?!? Dude, didn't think you'd be posting anonymously! Share the wealth, man :)
Condemnant quod non intellegunt.
That means that the internet has well over a petabyte of information on it; much of the information is probably the same, but it is on the internet.
1.5 megabytes of data at walmart
Understanding that your method of assessing the data lumps in data about vendors, data about shipping, inventory status (alone, a huge category), etc., 1.5 MB "per person" isn't huge. The error is in your model, as most of the system contains data about things other than customers.
That said, you would be surprised what /is/ tracked about customers. I worked with a Fort Lauderdale company a few years back that provided the back-end processing and data warehousing for many grocery discount card programs. They would routinely demonstrate that of the three-hundred data points they collected on a given consumer, one of them was the time of the month a woman had her period. Men weren't exempt either, as they tracked items such as condom sales and kept a score for us as well.
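For anyone wondering where a figure like 1.5 MB "per person" might come from, here is a rough sketch of the arithmetic. The population number is my own assumption (roughly the US population), not something stated in the thread, and decimal units are assumed too.

```python
# Back-of-the-envelope check only; the population figure is an assumption.
TOTAL_BYTES = 460 * 10**12      # 460 TB, decimal units assumed
ASSUMED_PEOPLE = 300_000_000    # hypothetical: roughly the US population

bytes_per_person = TOTAL_BYTES / ASSUMED_PEOPLE
mb_per_person = bytes_per_person / 10**6
print(f"{mb_per_person:.2f} MB per person")
```

The division treats every byte as customer data, which is exactly the modeling error pointed out in this comment.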
The best thing a consumer can do to counteract this consumer surveillance is to toss junk into the system. Here are a few suggestions:
- borrow your mom's/mother-in-law's card and go on a shopping spree for frozen pizzas, candy corn, condoms and saran wrap.
- apply for new cards all the time. provide creative answers as to your address, occupation (animal disposal officer is one of my favorites - someone must be puzzled how many dead animals there are in my city from all the people with this occupation). BE SURE TO ONLY USE CASH with these cards so they don't get an identification anchor.
- spike the data with sustained purchases of one product for a period of time. this is especially fun at smaller retailers that use inventory management - keep buying them out of one product (preferably low cost and low shelf inventory so it is easier and cheaper to do). keep it up for 90 days. then stop buying it and go to another store.
The more you can junk up purchases (especially on anchored cards like friends, in-laws, etc. that have different buying habits), the less valuable the database is.
Your number is wrong, from their faq:
The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month.
That's 20 terabytes per month, not per day.
They got their Internet statistics from the Chinese government.
1511565 MB, ~1.5 terabytes in PC games being shared.
There were 44977 Seeds and 196735 Downloaders. After all those torrents listed are downloaded, there will be 241712 peers with all that data on their hard drives connected to the internet.
I calculated that total and got 338394133 MB, ~338 terabytes.
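A quick sanity check on those conversions, as a minimal sketch assuming decimal units (1 TB = 1,000,000 MB). The variable names are made up for illustration; the figures are the ones quoted in this comment.

```python
# Figures quoted in the comment; decimal units are an assumption.
MB_PER_TB = 1_000_000

shared_mb = 1_511_565        # PC games being shared, in MB
seeds = 44_977
downloaders = 196_735
total_mb = 338_394_133       # total quoted by the commenter, in MB

peers = seeds + downloaders
shared_tb = shared_mb / MB_PER_TB
total_tb = total_mb / MB_PER_TB

print(f"peers: {peers}")
print(f"shared: {shared_tb:.2f} TB")
print(f"total: {total_tb:.0f} TB")
```

With binary units (1 TB = 1,048,576 MB) the totals come out a bit lower, so the "~338 terabytes" figure depends on the decimal convention.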
... More than 640 Terabytes anyway, right?
(did I just say that out loud?)...
[RM101's mind boggles]
Dude, do you seriously have nothing better to do than spend this crazy amount of time feeding junk data into a supermarket computer? Go outside. Breathe the air.
I dunno, maybe you WILL lay on your death bed, not thinking of your wife, or children, but you'll be proud of how many hours you spent contaminating some database.
Sometimes it's best to just let stupid people be stupid.
I graduated from the Sam M. Walton College of Business at the University of Arkansas with a B.S.B.A in Information Systems. Wal-Mart was nice enough to donate a big chunk (~1 Terabyte) of information for us to datamine. It's pretty interesting stuff and very CPU intensive, as you can probably imagine; we tried not to do any CD burning while waiting on our results ;)
IIRC, It seems like one of the strange correlations we found is that the two items most commonly purchased together were beer and baby diapers. Go figure...
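For the curious, finding "items most commonly purchased together" can be sketched as plain pair counting over baskets. This is only an illustration under my own assumptions: the function name `most_common_pairs` and the sample baskets are invented, and it is not the method or the data we used in class.

```python
from collections import Counter
from itertools import combinations

def most_common_pairs(baskets, top_n=3):
    """Count how often each unordered pair of items shares a basket."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(top_n)

# Made-up baskets for illustration only; not Wal-Mart data.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers", "beer"],
    ["milk", "bread"],
]
print(most_common_pairs(baskets, top_n=1))
```

Counting every pair grows quadratically with basket size, which is one reason this kind of mining gets CPU intensive on real data.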