bbiles · Slashdot Mirror


User: bbiles

bbiles's activity in the archive.

Stories: 0
Comments: 1
First seen: 2006-04-06
Last seen: 2006-04-06
Profile: (view on slashdot.org)

Comments · 1

Deduplication and compression on New 25x Data Compression? · 2006-04-06 04:55 · Score: 1

OK, my bad.

Diligent is not using the term compression AFAIK, but neither are they really deploying this approach yet outside of initial testbeds. Data Domain has been selling a product like this for years, has hundreds of happy customers using it and more than a thousand units in the field. And we came up with a brand, Global Compression™, in 2003 to mean the combination of finding long sequences and storing them uniquely across many TBs of stored data (see below) plus traditional LZ-style compression.

We sell our system only as a target for backup data, which is extremely redundant. On a first full, we tend to see a 2x-4x compression effect. Subsequent file incrementals, 6x-8x. Subsequent fulls, 50x-60x. Aggregate compression effect across a couple months of retention tends toward 20x in a weekly full / daily incremental policy. Exchange or Oracle fulls-daily can be 50x, short retention can be 10x. Mileage varies, especially by backup policy, but also (within a 2x factor) by data type. And as mentioned in the postings, the challenge is to get it to go fast; our implementation does this. Early alternatives, such as the Venti filesystem in Plan 9, don't.
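
To make the aggregate figure concrete, here is a rough sketch of how the per-backup ratios combine over a retention period. The aggregate is total logical bytes divided by total stored bytes, i.e. a size-weighted harmonic mean of the individual ratios, so the poorly compressed first full and the incrementals pull it well below the 50x-60x seen on later fulls. The function names, the mid-range ratios (3x, 7x, 55x), and above all the assumption that each incremental is 5% of a full are illustrative guesses, not figures from any product; the result moves a lot with those inputs.

```python
def aggregate_ratio(backups):
    """Aggregate compression for (logical_size, ratio) pairs.

    Computed as total logical size / total stored size.
    """
    logical = sum(size for size, _ in backups)
    stored = sum(size / ratio for size, ratio in backups)
    return logical / stored


def weekly_full_daily_incremental(weeks, full_size=1.0, incr_fraction=0.05,
                                  first_full=3.0, later_full=55.0, incr=7.0):
    """Build a hypothetical schedule: one full plus six incrementals per week.

    All default values are assumptions chosen for illustration only.
    """
    backups = []
    for week in range(weeks):
        # Only the very first full sees the low "first full" ratio.
        ratio = first_full if week == 0 else later_full
        backups.append((full_size, ratio))
        backups.extend([(full_size * incr_fraction, incr)] * 6)
    return backups


if __name__ == "__main__":
    for weeks in (1, 4, 8):
        schedule = weekly_full_daily_incremental(weeks)
        print(weeks, "weeks:", round(aggregate_ratio(schedule), 1), "x")
```

Longer retention amortizes the first full over more highly redundant fulls, which is why the aggregate climbs with retention and why short retention comes out lower.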

Should it be called compression? For lack of a better term, at least compression is descriptive to a user -- the effect is to compress the backup data. In network equipment they call this technology Wide Dictionary Compression, but it has a half dozen other names. The mechanism of finding a sequence and referring to the original the next time it comes up is pretty much the same as traditional compression; it's just harder to put into silicon because of the size of the referencing window. But it wasn't anticipated by the seminal compression papers many years ago, so there's some debate. In storage, lately, it's starting to get called Deduplication, despite the existing use of that term in databases, and despite another half-dozen vendor terms. Examples of alternatives include capacity optimization, factoring, data coalescence and sequence reduction. It's only starting to settle down.
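
For readers who want the mechanism spelled out, here is a toy sketch of the general idea: split the stream into chunks, fingerprint each chunk, store each distinct chunk once (LZ-compressed via zlib), and keep a list of fingerprints as the "recipe" for rebuilding the data. Fixed 4 KB chunks, SHA-256 fingerprints, and the `DedupStore` class are all assumptions made for illustration; this is not a description of any vendor's implementation, and it ignores the hard part mentioned above, which is making it fast at scale.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size; real systems may vary it


class DedupStore:
    """Toy store: unique chunks kept once, referenced by fingerprint."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> zlib-compressed chunk bytes

    def write(self, data):
        """Store data and return its recipe (ordered list of fingerprints)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in self.chunks:
                # First sighting: apply ordinary LZ-style compression.
                self.chunks[fingerprint] = zlib.compress(chunk)
            # Repeat sightings cost only a reference in the recipe.
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(zlib.decompress(self.chunks[fp]) for fp in recipe)

    def stored_bytes(self):
        """Bytes held for chunk data (recipes not counted)."""
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    full_backup = b"mailbox contents " * 10000
    first = store.write(full_backup)
    after_first = store.stored_bytes()
    second = store.write(full_backup)  # the next "full" of unchanged data
    print("extra bytes for second full:", store.stored_bytes() - after_first)
    assert store.read(second) == full_backup
```

The second write of identical data adds no chunk data at all, only a new recipe, which is the effect behind the large ratios on subsequent fulls.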

Full disclosure: I was at VA Linux in the team that acquired Andover, thus Slashdot, back in the day. Hope that worked out OK.