Automated Tiered Storage Coming to Desktops?
roj3 writes "Tiered storage has been the scourge of administrators because the vendors tell us to hold meetings with all departments and then classify data into storage tiers based on its type or relative importance. eWeek has a story about a new approach to tiered storage: sorting it all by usage patterns. Regularly used data goes on high-performance storage, idle data goes on slower/cheaper storage. Volumes and files even span several types of drives or RAID levels. Is automated tiered storage headed to desktops?"
I can see the usefulness of this technology over a busy network with multiple users and masses of files and storage... I just can't see needing anything more than a mirror&stripe RAID array on a PC with only one user. Even that could be considered excessive.
This is exactly what everyone is looking for. People defrag their hard drives in the hope of increasing performance. There is no reason why storage that is accessed more shouldn't be on the high-performance drives. Or at least there should be some sort of class rating that defines what storage may need high performance. For example, automatically installing and saving 3D Max to RAID 0 media, and saving Word documents to the lesser-performing drives.
I try to follow this idea all the time with my system. Fast stuff goes on RAID 0; slow stuff and backup stuff go on the ol' 200 GB backup drive.
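For what it's worth, the usage-based placement being discussed can be sketched in a few lines. This is only an illustration, not how any vendor's product works: the FileStats record, the 30-day access counter, and the assign_tiers function are all made-up names, and a real system would more likely work at the block level and migrate data gradually.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    size_gb: float
    accesses_last_30d: int  # hypothetical usage counter


def assign_tiers(files, fast_capacity_gb):
    """Put the most frequently accessed files on the fast tier until it is full."""
    placement = {}
    remaining = fast_capacity_gb
    # Consider the hottest files first.
    for f in sorted(files, key=lambda f: f.accesses_last_30d, reverse=True):
        if f.accesses_last_30d > 0 and f.size_gb <= remaining:
            placement[f.name] = "fast"
            remaining -= f.size_gb
        else:
            # Idle files, or files that no longer fit, go to cheap storage.
            placement[f.name] = "slow"
    return placement


files = [
    FileStats("3dsmax", 2.0, 40),
    FileStats("report.doc", 0.1, 3),
    FileStats("old_video", 5.0, 0),
    FileStats("game", 4.0, 25),
]
placement = assign_tiers(files, fast_capacity_gb=6.0)
```

With a 6 GB fast tier, the two busiest items fill it and everything else lands on the slow tier, including the small but occasionally used document. That is exactly the kind of trade-off people in this thread are arguing about.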
Registers, CPU cache, on-chip cache, RAM, local disk, Network/Removable Media, Paper/Human memory...
It's all about feeding that data hungry CPU, as quickly as possible.
I was using systems that did this 10 years ago. Granted, back then it was disk+tape not different speed disks, but it's the exact same thing.
Looks to me like an excuse to charge 8-10x what you should be paying for storage of that size.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
Put two 10k Raptors in RAID 0 for your games and other stuff you need REALLY FAST, and then have a big 250GB 7200RPM drive for everything else. People are doing that already.
All you would need is some software for automatically moving it around. Though most people with desktop rigs like that probably would rather control what is on which drives themselves.
"idle data goes on slower/cheaper storage"
So that special little something that you need once a year, but when you need it, you need it RIGHT NOW is tied to the foot of a pigeon fluttering around the warehouse somewhere. Frequency of use does NOT denote importance.
Bad experience is a school that only fools keep going to.
Tiered storage has been around for ages. In the old days it was disk with tape as a backing store.
I do like the idea of this product. Similar performance gains can be had by having the OS manage the data. It's a different-yet-similar concept, but some desktop OSes do this already with code libraries, putting them all in a single directory with little or no fragmentation within the file to allow for faster loading. Other OSes play similar tricks with system library metadata.
--
This would have been FIRST POST but I decided to actually write something.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
This scheme reminds me of low-power optimization at the circuit level. The critical paths in the circuit are governed by low-threshold transistors (ensuring high performance, i.e. speed). The non-critical paths are governed by high-threshold transistors (ensuring low leakage in standby mode with no particular degradation of speed, since they sit on non-critical paths, that is, the idle paths). It is nice to see the core of this idea at a macro scale.
There's plenty of room at the bottom! Richard P. Feynman
"Is automated tiered storage headed to desktops?"
I thought OS X did something like this-- ie, moving most-often accessed files into optimal places on the hard drive.
This makes sense for a large-scale organization on the order of hundreds of employees, but I doubt very much that it would be viable on the desktop (watch, as I say it, HP and Dell are rubbing their hands...). This is for a number of reasons, mostly revolving around performance.
For example, take an MP3 collection. I go to open up my old Soviet music collection (which I have), but I haven't listened to it in months, possibly even years. This would put it on the low-end of the priority and I would have to wait for the data to be retrieved, all the while watching paint dry. Similarly, if I have a game that I haven't played in a long while, but that was installed on my computer, you would see HUGE performance delays as each file has to be retrieved.
There is also the question of quality. In large-scale organizations, where you might have your volatile backups on this medium, in case you need to restore from it, you really do need something of high quality, not something that is "cheap". Likewise, when the PHB opens a finance sheet only to see it has been corrupted due to the "cheap" media failing, there will be hell to pay. I will say, however, that this technology does have some very interesting applications outside of your general company server.
Anyway, I for one welcome our two-tiered storage overlords...
In Soviet Russia, two-tiered storage retrieves you!
I'm a third-tier storage you insensitive clod!
CowboyNeal!
-PixelPirate
Apply "frequency of use = urgency" to BIGNUM pieces of data and you will have a very useful albeit sub-optimal algorithm.
Yes, there are exceptional cases, like the President's access to the Nuclear Briefcase. It hasn't been used for real in a long time if ever but when he needs it it had better be close at hand. However, these special cases can be treated as the special cases they are.
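Treating the special cases as special cases could be as simple as a pin list that overrides the frequency rule. A rough sketch, assuming a made-up choose_tier function, a hypothetical pinned_fast set maintained by the admin, and an arbitrary threshold of 5 accesses per month:

```python
def choose_tier(path, accesses_per_month, pinned_fast, hot_threshold=5):
    """Frequency of use decides the tier, unless the file is explicitly pinned."""
    if path in pinned_fast:
        # Rarely used but urgent when needed: always keep it close at hand.
        return "fast"
    return "fast" if accesses_per_month >= hot_threshold else "slow"


pinned = {"nuclear_briefcase.dat"}  # illustrative name only
tier_pinned = choose_tier("nuclear_briefcase.dat", 0, pinned)
tier_idle = choose_tier("old_song.mp3", 0, pinned)
tier_busy = choose_tier("daily_report.doc", 20, pinned)
```

The heuristic stays sub-optimal for anything nobody thought to pin, which is the original objection, but it keeps the common case automatic.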
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
From its beginnings, the Hard Drive has leveled the playing field for all files. Everyday files can have their content read by thousands, even millions of processes.
The Coalition of Unused Files believes that the desktop is a crucial engine for personal and economic growth. They are working together to urge System Admins to preserve IDE Neutrality, the First Amendment for the Desktop Hard Drive that ensures that the Desktop remains open to innovation and progress.
IBM mainframes that literally pumped water were doing this decades ago.
What, you say water cooling is coming back too?
That's why you have HDDs with cache. That's the whole concept of "virtual memory". The next step might be hybrid HDDs (solid state / mag platters). But I don't think it will go much farther than that. Multiple RAID arrays are overkill for the average desktop.
please excuse my apathy
$50k for a 6TB fileserver? What's that extra $40000 paying for that a normal fileserver loaded with RAM can't do just as fast?
Cheetos go in the easy-to-reach cabinet next to the fridge.
Beer goes in the front on the top shelf of the fridge, milk (eventually cheese, typically) goes on the bottom shelf in the back.
This is automated, since I simply shove things onto the shelves when I get home from the supermarket. Anything I consume and replace ends up at the front. Anything I buy because I 'should' be eating it (like fiber biscuits, or whatever) ends up pushed to the back.
It's automated via metatag, too. Anything tagged 'ice cream' goes in the door of the freezer, anything tagged 'vegetable' gets relegated somewhere in the back, where it quickly develops an inch of ice crystals, to slowly dry out to a freezer-burnt state of suspended animation until I buy a new fridge unit.
This costs no more than regular kitchen storage space, but if you'd like a custom design for you and your loved ones, my consulting fee is $75/hr, or a bag of chips and a six-pack.
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
Automatic tiered storage is definitely coming, but probably not in the form of multiple disks that run at different speeds or RAID levels.
Microsoft announced a while back that Windows Vista would support three technologies designed to improve disk speed called SuperFetch, ReadyBoost, and ReadyDrive. SuperFetch is simply a way of preloading applications and data when the OS anticipates that you'll be loading those soon.
ReadyBoost and ReadyDrive both utilize persistent memory caches to speed up access to the disk.
ReadyBoost treats normal USB keys and flash disks like temporary caching locations for data from the disk.
ReadyDrive is essentially the term Microsoft uses to describe their support for hybrid hard drives, which are disks that have a built-in flash memory module that's used as a persistent cache.
Not only do hybrid disks dramatically increase performance, but they also result in huge power savings for mobile devices like laptops and media players.
I could see a use for something like this. Personally, I've stopped throwing stuff away. With the exception of temporary and cache files, storage is cheap enough that I just don't delete anything on the off chance that I might want it again. Every email, every instant message, every dictated note (I use a little Olympus digital recorder), every digital photo, it's all saved. By the time I fill up my main hard drive with stuff, I can just buy another one that's probably between two and five times the size, dump everything onto it, and keep the old one as a historical backup. (I keep online backups as well, but I won't bore you with it here.)
I don't think I'm that atypical in this regard. GMail brought the idea of saving all your email, forever, to the masses; Flickr gives you an unlimited amount of photo storage; and technologies like Apple's Spotlight make it relatively easy to search through gigabytes of saved information and pull up related items. What we haven't seen yet is a lot of popular interest in redundant backup systems: that'll come later, once people start realizing how much of their lives they've stored away on the crummy OEM drive in their Dell. (Probably after a lot of them fail and we hear some real horror stories.)
It's not hard to imagine a near future where people just get used to not throwing anything away. In that situation, tiering storage -- allocating the fastest media to the most frequently accessed information -- could have big performance gains. And assuming that you have a relatively static amount of frequently-accessed information, and basically only add information to the "infrequently accessed" category, a tiered system means that you only really have to add storage to the bottom tier. It's a pyramid where the base gets larger and larger, but the upper part remains basically the same size.
So for example, as you save more and more emails (infrequently accessed information), they automatically get saved onto inexpensive, slower drives, which are then mirrored to each other for redundancy. A single, fast drive could hold the system -- maybe solid state storage? -- and more frequently-accessed data. A smart system would know what information needs to be moved up to faster storage to be very useful (uncompressed digital video, for example, wouldn't be much fun to work with off of a slow drive), and what can be left there as it's accessed (MP3s and compressed video could be played directly from slower media).
I think it's an interesting technology with a lot of possible applications, but as with a lot of other things, it'll be the home user who arrives last to the party, because their storage is the least centralized. Unless there's a move away from storage on individual desktop PCs and towards storage on per-home servers, it'll be a while before most people require or see the benefit in such a thing.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
I've worked for several years both creating programs inside the database and on a server layer outside it (and also just about every other layer).
I have to agree with grassbeetle above.
Software architecture-wise:
- You can't make a scalable architecture if you put everything in one single place (in this case the database).
- You will be hard-pressed to create a failure tolerant architecture if you stuff everything in a single point of failure.
- Databases are NOT application servers. They are designed with data storage and retrieval in mind, not reliable execution of complex business logic. Amongst other things databases do not make available in an easy and/or reliable way some of the standard application server functionality.
- All external components of the application (for example UIs) have to connect to the database. You're now stuck using the connection protocols of the chosen database. This might cause all sorts of problems with security, firewalls, use of asynchronous messaging, availability of adaptors on the platform you are deploying your applications to, etc...
- Splitting your application across several servers or in a multi-tiered geographical distribution is much harder.
- All coders have to have a good knowledge on how to work with the specific database you are using.
- Programming inside databases is not standardized. Different databases, and indeed different versions of the same database, sometimes have different versions of the same language or different libraries available. The languages/libraries have not been as thoroughly used/tested/examined by a big user community (while, for example, standard C/Java/etc. libraries have been thoroughly debugged over billions of man-hours of use). This means more library bugs and a lack of third-party tools for software design and development inside the database.
- Facilities such as version control, source control, etc are either not available or difficult to use in a reliable manner.
- Availability of compatible 3rd party libraries or application modules is very, very restricted by comparison to NOT having your server side logic all inside the database.
- Forget about moving databases in the future. Also, simply migrating to a newer version of the database can be a nightmare.
Software design-wise, the design of the software will be strongly constrained by the internal structure of the database:
- Information flows will mostly have to be database-like information flows
- A true object-oriented structure is pretty much impossible. At most you can do weakly connected islands with an object-oriented structure. If the database language you have to use is procedural, forget about OO design.
- Server-side initiated connections to outside entities, thread control, distributed transactions and other more advanced functionality are pretty much impossible.
- Usage/integration with 3rd party libraries or application modules is very hard or even impossible.
Software programming-wise, and from my experience (mostly Oracle):
- The language sucks.
- The application libraries (not the DBA ones) suck big time.
Simply put, a software architect that puts all server-side logic inside the database is with this single choice removing almost all his other architecture options and creating/fortifying vendor lock-in of the application to the database itself and 3rd party tools and also of the development team itself by means of the knowledge experience they have/will gain with said database and said 3rd party tools.
Such a person should IMHO either be demoted to a place where he/she can't cause any damage or fired outright.
This is hardly a new concept — mainframes have been migrating untouched datasets to tape for years. If this really is a new idea in the SAN market, SANs must suck worse than I'd previously supposed.
And “Is automated tiered storage headed to desktops?” Well, no, unless there's something cheaper than hard disks, which there currently really isn't.
Mind the Gap
One application of something similar is definitely coming to desktops (and laptops in particular) in hybrid hard drive arrangements--caching commonly used files to flash memory to be able to spin down the platters and conserve power, or for performance gains. (Although I remain wary of Vista using USB thumb drives as caches . . . finite read/write cycles and all.)
This is interesting, because when you read about old operating systems that ran on computers with several types of memory--fast magnetic core memory for the active programs, slower rotating-drum memory for less active data, large and slow hard drives, and automatic tape drives--they did exactly this. It makes sense that, given that we have L1 cache, L2 cache, and system RAM, each of which is slower and larger than the one before, we would extend this to hard drives, having a small, fast drive for often-used data and larger, slower hard drives for archived data. This would be the sort of thing I would expect Hans Reiser to want to add to reiser4 (or maybe reiser5)--the ability to span a filesystem across multiple block devices to optimize performance.
ttuttle is a rankmaniac
I clearly see a benefit of using the client machine (PC) as part of the storage hierarchy, since the data being moved belongs to a specific user. You can apply usage patterns, policies based on server storage available, etc. Email could be moved from the client to the server transparently over IMAP, even without modifying the protocol. For most cases this makes it irrelevant whether you are given 100MB or 2.7GB of email storage by your email (online spyware) provider. Here are my 2 cents. http://blogs.hk.com/index.php?/archives/56-The-need-for-Hierarchical-Email-storage..html
No. Absent other data, it only denotes frequency of use, period. Playboy.com gets more hits than the general ledger webapp if you unblock your company firewall, but the general ledger is more important to the company.
There is actually very little correlation between what the average user wants and what s/he needs, as is empirically obvious. If the image from the "fly-fishing.com" website that they've set to come up as their background image every morning fails to load, they can still work, but if the once-a-year corporate audit checklist gets put on slow, old storage and then gets lost in a hardware failure, the company stock price may flutter and certainly heads will roll in the corporate IS department.
I don't think that word means what you think it means.
I should never have to empty my recycle bin manually, except where I want to perform a security erase - which should be a function delivered with my operating system. This is the height of stupidity.
It's not even a hard problem! There are functions which programs use to check for free space. Lie to them. Don't count files in the recycle bin against the available free space. If you're about to run out of space, delete the least recently used file. Perhaps you might also base things on total number of accesses, or other criteria, but I believe (perhaps naively) that making the trash can an automatic FIFO from which files are automatically deleted when disk space is low would be about a hundred times better than what we have now.
Also, I want this functionality on all operating systems. Unless I explicitly request deletion, no file should ever be unlinked, deleted, or whatever you call it when I delete it, whether through the command line or the GUI.
This is not hard and it would make everyone a lot happier.
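To show how little is involved, here is a toy model of that self-emptying trash. It is a sketch under assumptions, not any OS's actual behavior: the AutoTrash class and its method names are invented, sizes are plain numbers, and it purges in FIFO order of deletion rather than tracking last access.

```python
from collections import OrderedDict


class AutoTrash:
    """Toy disk model whose trash does not count against reported free space."""

    def __init__(self, disk_size, live_bytes=0):
        self.disk_size = disk_size
        self.live_bytes = live_bytes   # space used by files that are not deleted
        self.trash = OrderedDict()     # name -> size, oldest deletion first

    def reported_free(self):
        # The "lie": trashed files are reported as if they were free space.
        return self.disk_size - self.live_bytes

    def physical_free(self):
        return self.disk_size - self.live_bytes - sum(self.trash.values())

    def delete(self, name, size):
        """Deleting a live file only moves it into the trash."""
        self.live_bytes -= size
        self.trash[name] = size

    def write(self, size):
        """Store new data, purging the oldest trashed files only if needed."""
        if size > self.reported_free():
            raise OSError("disk full")
        purged = []
        while self.physical_free() < size:
            name, _ = self.trash.popitem(last=False)  # FIFO eviction
            purged.append(name)
        self.live_bytes += size
        return purged


disk = AutoTrash(disk_size=100, live_bytes=90)
disk.delete("a.tmp", 30)
disk.delete("b.tmp", 20)
purged = disk.write(25)  # only 10 units are physically free, so trash is purged
```

Here the write succeeds by silently dropping only the oldest trashed file, while the newer one stays recoverable. Swapping the FIFO for least-recently-used or access-count ordering would just change which entry gets popped.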
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Sure, your desktop connects over the network to a SAN attached server in some fashion, but I don't see anywhere in the article that says this product:
A. runs a desktop agent of some sort that classifies your data based on access patterns
B. is meant to be directly attached to desktop machines
Where did desktops come from in the article summary? This isn't for your workstation folks.
Apple's "About disk optimization with Mac OS X" (basically telling you that you don't need to defrag), says "Mac OS X 10.2 and later includes delayed allocation for Mac OS X Extended-formatted volumes. This allows a number of small allocations to be combined into a single large allocation in one area of the disk."
There's also a reference to a "hot band," a region of the drive where data is written that's used during startup, in order to increase performance and I assume lessen boot times.
There's also a reference to some automatic defragging in a macosxhints article on HFAC. So that seems to be the deal; if anyone else has more information, I'd be interested to hear about it.
There's also a MacSlash article on HFAC and a discussion on Ars that includes a post of the source code.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
This is so simple: you have your good failsafe RAID 1 setup with 10,000 rpm hard drives for the IMPORTANT data, and the rest of your drives are just 7200 rpm drives to store what is not important. It's really easy to determine which drive the data goes to, too:
if filename=pr0n
    store_on_good_drive
else
    store_on_slow_drive
Arguing with an engineer is like wrestling a pig in mud. Soon, you realize the pig is dirty, and he likes it.
So, if I watch a movie once, it's rarely used, put on the slowest disk, and stutters when played? ;p Well, anyway, where on a desktop system would you need really expensive high-speed data storage? Normal disks these days are extremely large and fast, and they never come under stress anywhere near that of a high-performance webserver or anything like it. While backup systems will get more important as disk space gets cheaper and cheaper, I don't see any need for more performance. People will just put up a RAID with some 500GB drives pretty soon ...
Wouldn't it make more sense to work on making more intelligent/efficient automatic backups? I mean, there are (generally) two types of computer users: nerds(gamers/IT/programmers/etc.) and non-nerds(grandma/sister/college students/most home-office people). Nerds, for the most part, know exactly how to make their machines run as fast as possible when it comes to what to store where, and that's because they care about performance, be it top speed, reliability, or both. Non-nerds want the damned thing to work, and get really pissed if it doesn't. 99% of said non-nerds don't make any kind of backup whatsoever, so why not take those extra drives you're proposing and stick a copy of their work on there? Automatically? It can't be that hard to automatically copy word documents...
I really wish that my local host's storage were used only as scratch space for encrypting all my data for network storage, and a local cache. Why should I lug "my" PC around when there are PCs everywhere? Maybe if my PC were really better than the others, but for most of my data access, any Web terminal will do. Combine that with a biometric/password protected mobile "phone" containing my keyring and bookmarks, and I'm literally "good to go".
--
make install -not war
Tape is flat out, I could see doing it periodically and putting tapes off site somewhere but it's incredibly slow compared to drives. It's nice in theory and I could see it having a market for archival type stuff, but consumer grade stuff and small business stuff just sucks.
I can't help but think good solid backup technology will be a killer app on the desktop and small business in the future. The whole backup software biz is kind of built around the idea of tapes and such, if you have any tape rotation policy, pulling incrementals together can be a pain in the ass and most tape systems are so bloody slow that you can't afford to do a full backup very often.
Even with a set of FireWire drives, some machines are network attached and it's slower. Having some intelligence that does the mirroring would be sweet; also, for stuff that isn't touched very often, compressing it would be sweet. It seems like it's time for backup technology to be retooled and brought up to par with the technology. It seems like there are a lot of things you could do to optimize this kind of setup. Compress data that isn't changed very often, find duplicate data between machines (like the MP3 library that is copied onto 4 different machines at my house), stuff like that. I also really like having a non-proprietary format for my data.
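Finding duplicate data between machines is the easy part of that wish list. A minimal sketch, assuming a made-up find_duplicates function that is handed (machine, path, content) tuples already in memory; a real tool would walk the filesystems and hash files in chunks instead of loading them whole:

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group files by SHA-256 of their content; return groups with more than one copy."""
    by_digest = defaultdict(list)
    for machine, path, data in files:
        digest = hashlib.sha256(data).hexdigest()
        by_digest[digest].append((machine, path))
    return [locations for locations in by_digest.values() if len(locations) > 1]


# Hypothetical inventory: the same song on two machines, plus one unique file.
inventory = [
    ("desktop", "/music/song.mp3", b"same-bytes"),
    ("laptop", "/home/me/song.mp3", b"same-bytes"),
    ("desktop", "/docs/notes.txt", b"different-bytes"),
]
duplicates = find_duplicates(inventory)
```

A backup system could then store one copy of each group and keep only references for the rest.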
I already do this at my home.
... hmmm...another tier :)
Big files that I don't mind losing (ripped dvds and cds) are on a local, cheap raid-5 array.
Everything else resides on my PC.
Every night, my PC runs an automatic rsync job that syncs it all up to my rsync.net filesystem.
I guess, theoretically, I could take it a step further, and add a layer of geographic (and even political) redundancy by making my account sync to California and Colorado, and not just the primary CA site.
rsync.net just announced sites in Switzerland and India
Quality of Storage Service (QoSS) has been a feature in Veritas Storage Foundation for two years or so.
My single disk raid 0 setup works great...
I've been doing this for the past decade, at least. Two drives, First one partitioned to hold OS and system files on one partition, swap file on another partition. Other drive partitioned into frequently used apps/data and [essentially] archival. Initially, it was a by-product of the limit imposed by WinNT, on drive size -- but it's grown into a habitual application of tiered storage to all of my lab machines, and - even - my home boxes. If it wasn't for the !@&^$#*&$ system registry b.s., this would be an even better setup (give me back my .ini files, damn it! - I want portability!).
I've read about this company before. However, I'm not sold on it, and at last check (a couple of months ago), their website was remarkably bereft of useful technical detail.
My biggest question is how they handle free space tracking? Unless this box has "hooks" into the filesystem, it is not going to have the faintest clue when data has been deleted.
Also, can you say "Holy Fragmentation, Batman!"? Again, pretty intense "hooks" into the filesystem are going to be required in order to keep files even remotely together. A tape backup of a large file on this mess is going to take all freakin' week.
I'll take good-'ol-fashioned file or volume-based HSM, which has been around for a great many years, over this block-based stuff any day of the week. I might change my mind if they published some nice juicy technical papers on how they handled free-space and fragmentation issues.
SirWired
Evaluating sentences one by one and changing their meanings is not fair play. I see your so-called insightful comment(!)... First of all, you should understand this > PRIORITIES. Everything in science, daily life, health, insurance, things to do OR, equivalently, data storage has priorities. You can look it up if you do not grasp the meaning. ---Absent other data, it only denotes frequency of use, period---- Frequency of use HAS nothing to do with period. Period is the inverse of frequency. Frequency stands for the NEED for that information here. This means NECESSITY. You are right that playboy.com gets a lot of hits. Then IT SHOULD HAVE PRIORITY THAT MUST BE FAST. Even for your use. It is PRIORITY. IT is NEEDED. Can you understand that? Let me deliver the coup de grace: ---There is actually very little correlation between what the average user wants and what s/he needs, as is empirically obvious.--- IF the user does not know what he needs, or he cannot correlate his needs with his PRIORITIES, then he is an ineffective user and it REALLY is his problem. Oh, then we should accommodate DATA STORAGE SPEED vs. PERFORMANCE according to the grandnannies who can't separate their needs from their wishes. Then they complain just like the comment posted by the guy who does not, unfortunately, understand statistics. QED.
There's plenty of room at the bottom! Richard P. Feynman
I built something like this 10 years ago. A big corporation's in-house marketing & PR department, lotsa project files full of artwork and such for campaigns, big files used daily for months then ignored for years. It was MacOS 9 & Windows 95 clients, Netware 4.1 on a HP server with RAID 5 and 2 DLTs w/ loaders.
One DLT was for backups using ARCServe (before they got bought by CA). It was simply a matter of shipping cartridges in and out of the storage vault & off-site as required, replacing individual tapes when they got too old.
The other DLT was 2nd tier storage. As files aged on the RAID 5 array they'd first be compressed by Netware for space savings, then after they were inert some period of time they'd be migrated to tape. If read they'd be automatically pulled back from tape, decompressed, and returned to active use on the RAID array until they aged back to 2nd tier storage again.
The whole architecture was nothing more than a few settings under Netware, it was invisible to clients, and seemed to manage itself quite well. I did a lot of tests; pulling tapes, inserting damaged tapes, pulling drives, setting yesterday age-dates and then pulling back files, etc. - it ran flawlessly. I recall being particularly impressed that backups & indexing could be exempted from counting as 'reads' and not pulling every file back to active use.
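For anyone who hasn't seen this kind of setup, the lifecycle described above boils down to a small state machine. The following is only a sketch of the idea and has nothing to do with Netware's actual settings or APIs: the TieredFile class, the age and read functions, and the 30-day and 180-day thresholds are all invented for illustration.

```python
class TieredFile:
    def __init__(self, name, now):
        self.name = name
        self.state = "active"   # lifecycle: active -> compressed -> tape
        self.last_used = now    # day number of the last real read


def age(f, now, compress_after=30, migrate_after=180):
    """Demote a file according to how many days it has sat idle."""
    idle = now - f.last_used
    if idle >= migrate_after:
        f.state = "tape"
    elif idle >= compress_after and f.state == "active":
        f.state = "compressed"


def read(f, now, is_backup=False):
    """A normal read recalls the file; backup and indexing reads do not."""
    if is_backup:
        return  # exempt: the file stays on its current tier
    f.state = "active"
    f.last_used = now


f = TieredFile("campaign_artwork.psd", now=0)
history = []
age(f, now=40)
history.append(f.state)          # idle 40 days: compressed on the array
age(f, now=200)
history.append(f.state)          # idle 200 days: migrated to tape
read(f, now=201, is_backup=True)
history.append(f.state)          # backup pass leaves it on tape
read(f, now=202)
history.append(f.state)          # a user opens it: recalled to active use
```

The backup exemption is the important detail: without it, every nightly backup would count as a read and drag the whole archive back onto the fast array.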
Sadly the department was outsourced a few weeks after the new architecture went live, so I never really got to see it all in ongoing use. I took advantage of the change to make my own departure to a saner environment (IT's employee-retention average was literally weeks & after 2 years I'd had enough.)
However, the tiered storage was a thing of beauty, and dead easy to administer after it was set up. I've always suspected that was one of Netware's problems: It didn't need much baby-sitting so it kinda fell off most folks' radar; it just did a few things but it did them really really well, maybe too well.
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
It would be nice to see the technology adapted to consumer price points, but it probably won't be as long as huge ATA disks are $200.
It was about 1962, when IBM was touting something they called "Percolate & Drip" storage. The idea was that things that were used often "percolated" up to the fastest storage medium, while data that was only infrequently used would "drip" down to the most capacious media. Why do children get to claim everything they imagine is somehow NEW? Mature adults try to stand on the shoulders of giants.
That's how things used to work on ICL George 3/4 circa 1977.
(Little-used files got shoved off to mag tape. Still showed up in the filestore. When you accessed them a message was sent to the operator: "PLEASE LOAD VOLUME ASBHJ123 FOR :HUGHES.SOMEFILE(1/FORT)"; if the lazy bugger didn't want to load the tape, or if he couldn't find it, he'd type "CANTDO LOAD VOLUME" and you'd get a horrid error.)
The joys of waiting for an operator to load a tape so you could edit a file, hoping he wouldn't CANTDO.