Ask Slashdot: Smarter Disk Space Monitoring In the Age of Cheap Storage?
relliker writes: In the olden days, when monitoring a file system of a few 100 MB, we would be alerted when it topped 90% or more, with 95% a lot of times considered quite critical. Today, however, with a lot of file systems in the Terabyte range, a 90-95% full file system can still have a considerable amount of free space but we still mostly get bugged by the same alerts as in the days of yore when there really isn't a cause for immediate concern. Apart from increasing thresholds and/or starting to monitor actual free space left instead of a percentage, should it be time for monitoring systems to become a bit more intelligent by taking space usage trends and heuristics into account too and only warn about critical usage when projected thresholds are exceeded? I'd like my system to warn me with something like, 'Hey! You'll be running out of space in a couple of months if you go on like this!' Or is this already the norm and I'm still living in a digital cave? What do you use, on what operating system?
I never run out of disk space.
How does performance change as the big disks approach full? That was always one reason for the rule of thumb about keeping at least 10% free space on UNIX.
"Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
Isn't smart enough to track trends, but it does do graphs so you can easily see where you're headed and how fast.
Windows 7. :P.
Seriously though, you do have a good question. Every environment is different. A stable environment with very little fluctuation can be a few hundred MB (plus whatever the OS needs for temporary files) away from capacity for years on end - set the alarm at that level plus 1. A drive that's used for archiving everything-ever-created in a video-editing shop will grow to infinity quite fast - set the alarm so you catch it in time to add more space and consider a second alarm that monitors for increases in the rate of growth. A "temp drive" that fluctuates wildly but has only hit 75% once and probably never will again can probably have the alarm set at 76%.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The bigger question is how to reserve less than 1% for the superuser?
I am becoming gerund, destroyer of verbs.
But the 'just monitor percentages' crowd always wins by demanding an across-the-board standard of specific percentages to alert on.
Percentages still make sense. Much more sense than absolute numbers.
It's possible that the alarm thresholds we've chosen might be tweaked, but percentages are still the way to go.
If you don't understand why we use percentages in the first place, you probably shouldn't be working in IT.
At a previous job, I had set up a cron job that would record nightly to a database the amount of disk space used for each file system. I would then use Excel to chart and project consumption trends. Using Excel I predicted I would run out of space about 2 months after my server refresh. This was pretty close to when I actually would have run out of disk space, but since I had moved to a new server with twice as much space, it was a non-issue.
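For anyone wanting to try the nightly-recording approach, here is a minimal Python sketch of the recording half. It is not the poster's actual setup: the table name `disk_usage`, its columns, and the `record_usage` helper are all made up for illustration, and SQLite stands in for whatever database was really used.

```python
import shutil
import sqlite3
import time


def record_usage(conn, paths, now=None):
    """Store one (timestamp, path, total, used) row per filesystem path.

    The table layout is a hypothetical choice for this sketch.
    """
    now = int(time.time()) if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disk_usage ("
        "ts INTEGER, path TEXT, total_bytes INTEGER, used_bytes INTEGER)"
    )
    rows = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        rows.append((now, path, usage.total, usage.used))
    conn.executemany("INSERT INTO disk_usage VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Example nightly run (hypothetical database file and mount points):
#   conn = sqlite3.connect("disk_usage.db")
#   record_usage(conn, ["/", "/home"])
```

Run from cron once a night, this builds up the history you can then chart or export to a spreadsheet for projection.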
Today, however, with a lot of file systems in the Terabyte range, a 90-95% full file system can still have a considerable amount of free space but we still mostly get bugged by the same alerts as in the days of yore when there really isn't a cause for immediate concern.
When we had drives in the 100s of MB range, we used a few MB at a time. Now that we have drives in the multi-TB range, we tend to use tens of GB at a time. In my experience, a 90 percent full drive has as much time left before running out as it did a decade ago.
Perhaps more importantly, running at 90% of capacity kills your performance if you still use spinning glass platters as your primary storage medium (not so much when talking about a SAN of SSDs). In general, when you hit 90% full, you have problems other than just how long you can last before reaching 100%.
I install the shareware version of Hard Drive Sentinel on all my Windows systems. It not only will warn you about hard drive usage (%); it will also warn you about errors on the drive -- and in my case I was able to predict that two drives were going to fail (saving data) before they actually failed.
Their support has been very responsive and courteous, and their product can work through (see drives behind) most RAID controllers.
And no, I don't have any affiliation with HDS.
...when there really isn't a cause for immediate concern.
It all depends what one is concerned about. Is maximizing disk space down to the last possible byte important to you? Or is performance in accessing random data important to you? Or is wanting to keep artificial limits imposed by monitoring systems important to you?
Once you determine what is actually important to you, then you monitor for that parameter.
Whatever is measured is optimized.
The problem is the monitoring group is reluctant to make "custom" changes due to the size of the environment. OS and hardware level alerts are a pretty minor part of the overall monitoring environment in terms of the number of configuration changes required. With mirroring and system/geographic redundancy, we can wait until the morning status reports to identify systems before they get to critical.
[John]
Shit better not happen!
You insensitive clod! In the age of MBs, we were producing KBs of data. In the age of GBs we were producing MBs of data. And in the age of TBs we are producing GBs of data. And so on. Thus a 90% full filesystem is as bad as it was 10 years ago. Unless you are still producing KBs of data.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
I went through this years ago (a decade ago probably) at my last job... and I agree that while % thresholds are OK for some things, they're not the be-all/end-all of monitoring. I got in an argument with a person who wanted to "automatically add space" to NAS/SAN storage when it hit 95% full... and, of course, my argument was that's somewhat useless - if it's a database volume it might *always* be 95% full, and never growing, because it's one big "file" (db area, or maybe a couple but...) that doesn't grow. Or it might be a disk with some log files where some application hit an error, started spewing errors into a log, and filled 10% of the disk in an hour before it was noticed... adding disk would be a waste; fix the application, zip/delete the log (maybe save it somewhere to analyze), etc. It's not a "cut and dried" thing.
If you have a 2TB drive you store data on, it's 90% full and growing at 1%/year, you've got a long time before it's a problem. If it's 90% and growing at 1%/week, you've got a big problem in just a couple months. If it's at 90% and 'historically' has grown at 1%/yr, and suddenly it's at 98% tomorrow, you might want to suspect a problem or something way out of the ordinary - and not just go throwing more space at it.
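To put rough numbers on those three cases, here is a small Python sketch. It assumes "1%" means one percentage point of total capacity and that growth is linear; both function names are invented for this example.

```python
def weeks_until_full(percent_used, points_per_week):
    """Weeks left if usage keeps rising linearly by points_per_week."""
    if points_per_week <= 0:
        return float("inf")  # flat or shrinking: no projected fill date
    return (100.0 - percent_used) / points_per_week


def jump_in_years_of_growth(jump_points, points_per_year):
    """Express a sudden jump as years of growth at the historical rate."""
    return jump_points / points_per_year


# 90% full, growing one point per week: about ten weeks left.
print(weeks_until_full(90, 1))
# 90% full, growing one point per year: about ten years left.
print(weeks_until_full(90, 1 / 52) / 52)
# Going from 90% to 98% overnight is eight years of "normal" growth.
print(jump_in_years_of_growth(98 - 90, 1))
```

The last figure is the interesting one: a jump worth years of historical growth is a signal to go look for the cause, not to buy disk.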
This is why you hire experienced sysadmins and don't just rely on automation and stupid 'rules' that don't always apply, and why sometimes (as much as we were always pushed to - so they could 'offshore' it all) you can't just 'document how to fix it' when there's lots of possibilities.
Storage is going to be cheaper and cheaper, so percentage is OK for today and the future. You don't need trends and heuristics, just increase the percentage and you are good to go.
On the operating system I use, the only alert I get is errno set to ENOSPC, and programs printing "No space left on device".
No idea what you're talking about with annoying warnings and alerts at 90%.
Although, most filesystems perform poorly when they are nearly full, as it is difficult to allocate new extents and it requires more system RAM to construct the mappings to many little extents.
While "only 5% of my disk" is now many times larger than it used to be, so are the things I'm moving around, so "95% full" is just as bad now as it used to be.
Basically, once we got past quotas measured in single or double-digit numbers of kilobytes, this stopped changing for me. 95% full on a 100MB disk and 95% full on a 500GB disk work the same for me.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
Most monitoring consists of taking a snapshot of something at fixed intervals. While you can generate alerts based directly on monitoring, they might not be as helpful as you would like, what with all those usage spikes and the drone of warnings when the thresholds are too low. When you look at a graph of monitored stats over time, this is actually a reporting system, and it can allow you to determine change over time, observe trends, and other good stuff. Use the reporting system to create second-order monitoring information, and use that for your alerts.
free_space / (current_usage - minus_1_day_usage) = number_of_days_until_full. Send an alert when this value is under 60.
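The formula above translates almost directly into code. This is a minimal Python sketch using the same variable names; the function names and the handling of zero or negative growth (return `None`, never alert) are assumptions added here, since the one-line formula divides by zero on a day with no growth.

```python
def days_until_full(free_space, current_usage, minus_1_day_usage):
    """Project days until full from the last day's growth (same units throughout)."""
    daily_growth = current_usage - minus_1_day_usage
    if daily_growth <= 0:
        return None  # not growing today, so there is no projected fill date
    return free_space / daily_growth


def should_alert(days, threshold_days=60):
    """Alert when the projection drops under the threshold (60 days in the post)."""
    return days is not None and days < threshold_days
```

A single day's delta is noisy, so in practice you would probably want to feed this an average over a longer window from the reporting data rather than just yesterday's number.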
You're living in a digital cave IMHO.
:).
Don't worry, I was too until recently...
Always mucked with fast external storage as the "main" solution -- firewire, thunderbolt, etc. This system is the main and had a few externals hooked up, that system had another, another over there for something else. It was a mess all around. How to back it all up??
Gave them all away -- bought a Synology
Then bought another (back it up
180-200M/sec throughput is the norm. On the network. Beats out most external drives I've ever come across. Everything ties into / backs up to the array. Home and work now too.
I use everything but Microsoft products. They're shit.
My filesystem is 60T w/ under 10T used today. I'll consider plugging in more drives or changing them out in the Synology somewhere between 2017 and 2020...
I have less concern about the amount of data being stored than I do about the incredible number of files that a typical system stores. Do an ls -lR / on a typical system and you will get tens or even hundreds of thousands of files.
As recently as the days of Windows NT 4 I could probably keep the gist of the entire structure in my head -- what each sub-tree is for and in most cases what each directory/file is for. Somewhere since then it has become impossible to do so, and that goes for Windows, MacOS X, or almost any Linux distribution.
We switched to Check_MK for monitoring. It's basically a collection of software that sits on top of Nagios.
The default disk monitoring allows alerting based on trends (full in 24hours, etc.) or thresholds based on a "magic factor." Basically it scales the thresholds so that larger disks alert at a higher percentage, adjustable in quite a few different ways to suit your tastes.
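To give a feel for what size-dependent scaling like this can look like, here is a Python sketch. It is only an illustration of the idea: the formula, the 20 GB reference size, the 0.8 exponent, and the function name are assumptions for this example and should not be read as Check_MK's documented behavior.

```python
def scaled_threshold(percent_threshold, size_gb, magic=0.8, reference_gb=20.0):
    """Raise a percentage threshold for disks larger than a reference size.

    All defaults are hypothetical. With magic=1.0 the threshold is unchanged;
    smaller values push large disks toward higher alert percentages.
    """
    relative_size = size_gb / reference_gb
    felt_size = relative_size ** magic   # large disks "feel" smaller than they are
    scale = felt_size / relative_size    # below 1 for disks above the reference size
    return max(0.0, 100.0 - (100.0 - percent_threshold) * scale)
```

With these assumed defaults, a 90% warning level stays at 90% on a 20 GB disk and moves to roughly 96% on a 2 TB disk, which is the kind of behavior described above.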
Not only should a monitoring system look at historical growth versus estimated time left until full, it should also keep track of storage additions, so that it can smartly tell us 3, 2, and 1 months ahead, each with one email... and then it should also keep track of sudden increases and send an emergency alert if, for example, the past 7 days have climbed more than the whole month before... stuff like that.
None of the existing monitoring tools these days are set up that way, and I wish folks would get more creative and diligent with their work.
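The escalation wished for here (one email each at 3, 2, and 1 months out, plus an emergency alert on a sudden climb) is not hard to sketch. The Python below is one possible reading of it, with invented function names, a month taken as 30 days, and daily usage samples assumed as input.

```python
def notice_due(days_until_full, already_sent):
    """Return the month mark (3, 2 or 1) to email about now, or None.

    already_sent is the set of marks that have been emailed before, so each
    mark produces at most one email.
    """
    for months in (1, 2, 3):  # check the most urgent mark first
        if days_until_full <= months * 30:
            return None if months in already_sent else months
    return None


def is_sudden_increase(daily_usage):
    """True if the last 7 days grew more than the 30 days before them.

    daily_usage holds one used-space sample per day, oldest first; at least
    38 samples are needed to cover both windows.
    """
    if len(daily_usage) < 38:
        return False
    last_week = daily_usage[-1] - daily_usage[-8]
    month_before = daily_usage[-8] - daily_usage[-38]
    return last_week > month_before
```

Keeping track of storage additions would then mostly mean clearing `already_sent` whenever capacity grows, so the countdown starts over against the new size.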
If my drive is 90% full, I don't care if it is 2 MB or 100 GB, I have a situation I need to know about.
If you are an enterprise shop, you likely have so many disks spread across so many servers that you probably have an admin team responsible for projecting utilization for the next 12 months, so that procurement and installation costs can be budgeted.
For the home user, or a small business, 90% is still a good rule of thumb. I would hate to see some additional process running in the background constantly projecting when the disk will be full. Just throw a warning for the user when you reach 80-90% capacity, and let them figure it out. They are probably more likely to fill their thumb drives than they are the local media.
It is ridiculous in 2014 that we have to worry about running out of disk space. We have the technology to solve this problem. The government should provide cheap unlimited disk space to all.
Oh wait, I thought I was on the broadband thread...
One thing that I've noticed.
Internet browsers tend to allocate cache space in terms of a %, and so does the OS itself (checkpoints, deleted space and so on). Those percentages haven't changed much over the years and allocating 5-10% seems quite reasonable at first.
However, do the math on a modern HDD. Take the latest 8 TB drives: just a 5% allocation for any caching purpose gives that cache roughly 400 GB! Now that's a humongous cache, and many caching algorithms cannot efficiently use such an enormous cache space.
The lesson is that as storage volumes continue to grow in capacity, optimal configuration requires that the percentage allocated to reserved system uses be scaled back.
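The arithmetic, and one simple way to scale the allocation back, can be sketched in Python. The absolute cap is an assumption added for illustration (the 10 GB figure is arbitrary), and decimal TB/GB are used as drive vendors do.

```python
TB = 10**12
GB = 10**9


def cache_budget(capacity_bytes, percent=5, cap_bytes=None):
    """Percentage-based cache allocation, optionally limited by an absolute cap."""
    budget = capacity_bytes * percent // 100
    if cap_bytes is not None:
        budget = min(budget, cap_bytes)
    return budget


print(cache_budget(8 * TB) // GB)                      # 5% of 8 TB, in GB
print(cache_budget(8 * TB, cap_bytes=10 * GB) // GB)   # same drive, hypothetical 10 GB cap
```

On a small disk the percentage still governs; the cap only kicks in once 5% grows past what the cache can sensibly use.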
You can build advanced, predictive analytics with Splunk. It can do exactly what you asked for.
robots obey what the children say - TMBG
The actual number you are looking for is 85%.
Straight out of Donald Knuth volume 3: Sorting and Searching; at 85% fill, a perfect hash starts degrading in performance.
The basis of the Berkeley Fast File System warn level was an 85% fill on the disk, which the filesystem effectively hashed data allocations onto. As people started getting larger and larger disks, they began to be concerned about "wasted space" in the free reserve, and moved the warnings down to 10%, then 8%, and so on.
This is what the OP is suggesting (again) for very large disks, but without something like an LFS and a background defragger, fundamentally, most FS implementations' performance still starts to drop off at an 85%+ fill. Background defraggers/"cleaner daemons" have their own performance issues (e.g. like garbage collectors, they tend to run at the worst possible times, as in when you are putting performance pressure on the system already).
But as Ken Thompson said: "The steady state of disks is full".
I use http://www.monitorix.org/ both at home and at work.
It monitors (nearly) everything, filesystem(s) usage included.
Can trigger a script of your choice when an arbitrary threshold is reached.
Has nice colored graphs too.
Mastering the English language is fucking easy: all you have to do is to put an f* word in every fucking sentence.
I've been working on an open source storage solution (http://rockstor.com) to address some of these concerns, which I broadly categorize as "smart storage management". So far we have a few dashboard widgets that give insight into usage patterns and some probes for storage analytics. I think that alert mechanisms should model storage consumption and I/O patterns at the very least. Not only is it important to alert, but also to provide recommendations so the admin/user has a clear action to follow up with. For example, "hey, you are running out of space at rate X, mainly due to files of type Y and you have W weeks until you completely run out of space. You can migrate these Z-set of files to archival storage which gives you M more months of time." We hope to get there with Rockstor.
Most enterprise type monitoring packages (HP OM, IBM Tivoli Monitoring, CA Spectrum) have a predictive feature either installed by default or obtainable as a free bolt-on. As mentioned above, Nagios has Check_MK also.
If you're using a filesystem like ZFS, 80% is a critical threshold. Even if you have 50+ terabytes of storage.
As the title says, for a large site you'll typically need 3-6 months notice to get from desire to delivery.
You need to allow time for financial approvals, corporate governance, power approvals, floor space and cabling, quarterly forecast budget vs. exemption processes, etc.
In that type of environment you need to monitor usage and pipeline, and initiate the procurement process at 60-80% capacity. The alternative is to risk running out of capacity before the new kit is operational (and having to explain that to the business).
There are commercial monitoring products (BMC, IBM, etc.) that can give you that level of information/alerting out of the box. The open source solutions are usually not that smart; in those cases you usually have to save your metrics to a data warehouse, and from there you can do capacity planning and alert or automate stuff based on that.
I would suggest reviewing your thresholds and updating them accordingly (and generating baseline monitoring thresholds for all your servers; yes, apply the same baseline to all similar servers) to avoid getting alerts when there is no need. Usually only actionable items should be alerted on by default.
On the servers I manage, the usage is fairly stable so we have alerting set at various levels for each file system. Some are set above 95% and others as low as 60%. I want to know when disk usage changes abnormally, no matter what the absolute level is.
Some disks are less important than others so they just send email alerts. The file systems that are critical send text messages since we're a 24x7 shop.
"Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
I would always shoot for more disk but then issues arise from managing such large disks in the 1+ TB range that we tend to fill up fast.
For a laptop or desktop, I am targeting two large drives in a RaidZ mirror on Linux. I would do the same for a desktop.
For more data and centralization for my house or office, I would choose an iXsystems FreeNAS Mini. It has all the features that you need for your data and can be easily configured to send out warning messages on various measurements like disk space, SMART messages and raidz warnings. I think that de-dupe is coming soon if it's not there already. The Mini is super powerful for its size and power footprint.
ZFS solves the nasty issue of having to recover files on massive disks like those we get today. There is nothing worse than waiting for fsck, surface scans or recovery operations on 1TB+ drives; it takes forever. With a well-maintained ZFS system those issues are gone.
Another really cool thing about ZFS is the ability to maintain a perfect audit on the faults in your drives. Once ZFS starts saying there are issues with the drive, you send it back to the vendor in warranty period with the error messages and you get a brand new drive. I met someone at BSDCan this year who has not purchased a new drive in years because he keeps finding errors before the warranty expires. Pretty sweet.
Why not just monitor gigabytes/terabytes free rather than a percentage then?
I think most individual server filesystem monitoring for free space is kind of a waste of time anymore, or at least low priority.
SANs and virtualized storage and modern operating systems can extend filesystems easily. Thin provisioning means you can allocate surpluses to filesystems without actually consuming real disk until you use it. Size your filesystem with surpluses and you won't run out.
Now you only have to monitor your SAN's actual consumption, and hopefully you bought enough SAN to cover your growth until you can buy another one.
Interesting things to monitor are I/O rates and read/write latency. More esoteric things might be stats about most active files and directories or percentage of recently accessed data -vs- inactive data. But these are more analysis than monitoring. What other parameters would a sysadmin want to look at?
RLH
But does that server use local disc?
The discussion is a bit closer to the metal here than something in a virtual machine dealing with data on a SAN even though that technically is also a server. It's just not a file server.
The linked article used to be about how RAID was going to stop working in 2005 or similar.
It didn't because disks and controllers got much faster as well as dealing with more capacity, while the premise assumed nothing but a change in capacity.
So now we have arrays 10x larger that rebuild in less than half the time of the old ones. We also have stuff like ZFS that acts like RAID6 in many ways (with raidz2) but can have much shorter rebuild (resilver) times because it only copies data instead of rebuilding the full capacity of the disk like a hardware RAID controller would do.
I'd expect someone running FreeNAS to know more than a journalist rewarming an old article that was a poor prediction in the first place, but I suppose seeing it in magazine format does make it look more credible.
It's a block size vs available space issue so 90% full kills performance on small drives with big blocks (eg. SSDs from a couple of years back) but at 90% of 4TB you've still got a vast quantity of available blocks so it still performs very well.
So although I'm not the poster above I've had experience of both - the percent full number is only a rough guide and falls down when the block size is very small compared with the available space.
No point nitpicking just because the "b" denoting Megabits was forgotten. A speed of 200Mb/s is not huge but it's not too bad either, even though a fairly old machine (6 years) with a few disks in an array can get close to five times that and saturate gigabit (or even twice over if a second connection is going somewhere else).
It runs on every machine and on almost every OS. And yes, I have one server running at my home!
I don't need the computer to tell me when a big disk is nearly full. That would be something I was aware of for some time.
In an enterprise setting where there could be many disks... one would assume the sysadmin has set reasonable alert levels rather than leaving everything on default.
So... I guess this is relevant to non-power users in residential contexts? But then how is a non-power user filling a terabyte hard drive? I mean... seriously.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... log in.
The disk fills up with the same relative speed.
Okay, the OS does not have a big problem with a 99% full disk, but your media collection does. You still need to upgrade your storage when it's getting full, because you will still get new big files.