In the US, you're allowed to drop anything you want as long as you ensure that anyone or anything on the ground will not get hurt or damaged from it. That is spelled out in FAR 91.15.
Wow, you don't see common sense lawmaking like that very much any more.
Yeah, we're all very proud of you, that you can solder teeny tiny little things. I can solder with a gasoline torch, you whippersnapper. Now get off my lawn.
Well, my friend, that was a joke (did you read the comic?) but like any good joke it had some relation to truth. I guess that may be the only way it was like a good joke.
The last time the USA had a civil war, we killed off two percent of our total population. And that was before the advent of air power or modern artillery or air-cooled man-portable machine guns.
If it's up to me, I'm agin it. I'd like to believe we learned our lesson last time around.
Same here. But honestly you probably don't want to be in the top 1% for income; those people are always the first up against the wall when the revolution comes.
I have a similar situation myself, due to a large evergreen tree. I buckle up after I clear the tree because I'm more concerned about not running over a child or pet than I am about somebody driving off the road, through the woods, over my yard, and colliding with me in my own driveway.
In my car I can't lean far enough over to get a good view past the tree if I've got the seat belt on.
You've got a very good point about my high expectations for competence... I plead guilty due to advanced age!
When I got started you didn't call yourself a sysadmin if you couldn't parse a core dump. You had to have already been a systems programmer, and you usually didn't get to do that until after you'd been a successful apps programmer. Nowadays most systems are so much simpler to administer (and core dumps so much less useful, too) that you don't necessarily need programming experience to achieve some minimal level of competence.
I personally still wouldn't hire an admin who couldn't code to a standardized syscall, though. If you ignore accreditation (college degrees etc.) and just focus on actual ability, you can find some really smart, capable people out there looking for IT jobs.
The easiest way to do that is to cut up the files into blocks and compress them. If two blocks are the same, you don't recompress them but just put in a link to the previous compression. This is what (roughly speaking) BZIP2, RAR, ACE and some other formats do.
That's very interesting! Thank you - I will look into BZIP2 more deeply as time permits.
My experience in the field has been that premature compression can be the bane of efficient business continuity planning. Real life example: your client wants to make nightly offsite backups of a live, highly active email system. This can be done using a combination of LVM snapshotting and rsync --link-dest (and you could multicast that backup to multiple sites if rsync batch mode actually worked, which I'm sure it will someday). But if there are re-organization and compression jobs already running on the source system, you'll run out of bandwidth, because there will be too many daily changes and the client can't afford more than a couple T1s. If you stop doing disk space optimization on the source system and instead just add more disk (use cheap multi-terabyte AoE arrays if necessary), you may be able to bring down the number of changes found by rsync's block checksumming to where you can fit easily in the site-to-site WAN bandwidth constraints. Now you've got to worry about db performance, so you shove the mail into maildirs and use the filesystem as your db and you're good to go.
As you point out, the magic lies in creating a working system from all the abstract theory. Typically you have to make compromises in one area to suit another... which is why I think a competent sysadmin is the key to getting any systems job done right. Buying fancy products or worshipping the approach of a particular vendor just doesn't cut it once you've passed a certain level of complexity.
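For anyone who hasn't seen the rsync --link-dest trick mentioned above, here is a rough sketch of how the nightly command line might be put together. It only builds the argument list, it doesn't run anything. The paths, the date-named snapshot directories and the function name are all made up for illustration; -a, --delete and --link-dest are real rsync options, but your own job will probably want more flags than this.

```python
from pathlib import PurePosixPath


def build_snapshot_command(source, snapshot_root, today, yesterday=None):
    """Return the rsync argv for one nightly snapshot (it is not executed here).

    All names are hypothetical: `source` is the tree being backed up,
    `snapshot_root` holds one directory per night, `today` / `yesterday`
    are the directory names for the new and the previous snapshot.
    """
    root = PurePosixPath(snapshot_root)
    cmd = ["rsync", "-a", "--delete"]
    if yesterday is not None:
        # Files unchanged since the previous snapshot become hard links
        # into it instead of fresh copies. An absolute path is used because
        # rsync resolves a relative --link-dest against the destination dir.
        cmd.append(f"--link-dest={root / yesterday}")
    # Trailing slash on the source: copy its contents, not the directory itself.
    cmd.append(source.rstrip("/") + "/")
    cmd.append(str(root / today))
    return cmd


if __name__ == "__main__":
    print(" ".join(build_snapshot_command(
        "/srv/mail", "/backup/mail", "2011-10-25", "2011-10-24")))
```

Each night's directory then looks like a full copy, but only the changed files take up new space.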
Dedupe isn't at the file-level. It wouldn't be nearly as useful if it just got rid of exact copies of files.
You can implement de-duplication at whatever level is appropriate to your needs. Most people do it at file level; I use it to reduce my nightly rsync snapshot load. Gets me more than 90% savings in disk space because of my specific use case. Understanding what you need and how you can efficiently satisfy those needs is the key to good systems architecting and management.
Several people have pointed out that block-level de-dup is inherently best suited to being implemented in the filesystem, but if your toolset (such as compression utilities, for example) wasn't written to suit such filesystems, you can still get screwed. Again, it depends on your use case - does your database backup software change every bit in every block if one byte in the first 512 changes? If so, any form of de-dup may get you nothing - especially if you keep those backups on a dedicated partition on your hot backup site - you'll just be wasting processor time.
There's no substitute for knowing what you're doing, unfortunately. One-size-fits-all solutions usually don't.
inotify can't watch an entire filesystem. No current *notify kernel hooks can offer this, unfortunately.
Yoiks, you're right. It's been over a year since I wrote an inotify interface, and I forgot that! In real use - well, OK, in my real use anyway - if I try to set up a completely recursive structure that adds new watches as new folders are created, the number of inotify events due to normal user operations becomes so high that events start getting dropped with IN_Q_OVERFLOW yadda yadda yadda. This turned out not to be a problem for me specifically but that was only because I wasn't de-duping, I was just triggering events on client file transfers, which were restricted to specific folders anyway.
I like being corrected, because I don't like being wrong. Thanks!! And thanks for the Stearns link, too - I may use some of that code (with soft links instead of hard links, though, for a particular use case I have in mind).
I think file-level de-dupe is usually a lot less effective because it can't accommodate files that differ only slightly but are otherwise the same, whereas block-level de-dupe works with everything.
Not really. There are real-world advantages to both, and you'll have to examine your specific environment to see what's best. Generally, block-level will work better for giant database files where there are very localized changes, file level will work better for systems with zillions of small files (mailservers etc.), and nothing really works well with giant compressed image files because very small changes can cause every single block to change after compression (the same is true of large databases that are frequently re-orged for performance).
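To make the block-level case concrete, here is a toy sketch that hashes fixed-size blocks and counts how many blocks of a new version already exist in the old one. The 4096-byte block size, SHA-256 and the function names are my own assumptions for illustration; real block-level de-dup products each do this their own way.

```python
import hashlib


def block_hashes(data, block_size=4096):
    """Hash each fixed-size block of `data` (the last block may be short)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(old, new, block_size=4096):
    """Count the blocks of `new` whose hash already appears in `old`."""
    known = set(block_hashes(old, block_size))
    return sum(1 for h in block_hashes(new, block_size) if h in known)


if __name__ == "__main__":
    # 64 distinct blocks standing in for a big database file.
    old = b"".join(i.to_bytes(4, "big") * 1024 for i in range(64))
    new = bytearray(old)
    new[10] ^= 0xFF  # one very localized change, inside the first block
    print(shared_blocks(old, bytes(new)), "of 64 blocks are still shared")
```

A one-byte change costs you one block here. If the same file were re-compressed or re-orged after the change, the bytes downstream of it could all move and you'd share far fewer blocks, which is the point about compressed images above.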
I also don't know what happens in your scheme when you have "de-duped" a file that's the same in 4 different directories but then one application wants to change "its" version of the file. It sounds like it trashes the file for the three other uses of it since there's no way to automate copy-on-write with your shell script but maybe my clue isn't working.
It's not necessarily a problem, depends on the filesystem and use case. In snapshot systems based on rsync --link-dest the toolset handles that issue transparently to the user. And if the files in question absolutely shouldn't be forked (payroll files, perhaps?) then this might be desired behavior.
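Here's a small sketch of why it depends on how the file gets written. It hard-links two names to one file, writes through one name in place, then replaces that name by writing a temp file and renaming it over the top (which, as far as I know, is what rsync does by default when you don't pass --inplace). The file names and the function are hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def demo_hardlink_writes(workdir):
    """Show in-place writes vs. write-then-rename on hard-linked files."""
    a = os.path.join(workdir, "a.txt")
    b = os.path.join(workdir, "b.txt")
    with open(a, "w") as f:
        f.write("original")
    os.link(a, b)  # a.txt and b.txt are now two names for one inode

    # In-place write through one name: the other name sees the change too.
    with open(b, "w") as f:
        f.write("changed in place")
    with open(a) as f:
        a_after_in_place_write = f.read()

    # Write a new file and rename it over b.txt: this breaks the link,
    # so a.txt keeps whatever it had before.
    tmp = os.path.join(workdir, "b.txt.tmp")
    with open(tmp, "w") as f:
        f.write("replaced")
    os.replace(tmp, b)
    with open(a) as f:
        a_after_replace = f.read()
    with open(b) as f:
        b_after_replace = f.read()

    return {
        "a_after_in_place_write": a_after_in_place_write,
        "a_after_replace": a_after_replace,
        "b_after_replace": b_after_replace,
        "still_linked": os.path.samefile(a, b),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo_hardlink_writes(d))
```

So an application that rewrites "its" copy in place will change every linked copy, while one that writes a new file and renames it will quietly fork its own version. Which of those you want is the use-case question.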
You really do need to know what you are doing, though - that's why I started this thread by recommending a competent Linux sysadmin. Somebody who knows what you are trying to accomplish, what the tools can do, how to write an inotify interface, etc... if you don't want to pay for highly skilled staff this is not the road to take. Rather often it's worthwhile for a business to invest in high quality people, though.
Very good point! Things like time machine, dirvish, my own recipe, they all deal quite poorly with giant image files.
Often such files are compressed on the fly at creation time, too, so they may get very little benefit from de-duping at the block level, either. Depends on your compression algo...
Isn't de-dup a fairly trivial application for a DB of MD5sums, even if you don't have the chops to use the filesystem at a more fundamental level?
I mean, off the top of my head, use "find" to get today's files, "md5sum" to MD5 them, "grep" or "gawk" to check your flat file of existing sums, "ln" to link dups to a single copy, and a couple dozen lines of shell to glue it all together and append new sums.
You could probably code it all up in busybox in an afternoon, and run it from cron every hour.
But you'd have to have a clue. If you don't have a clue, you can always buy a package. Yay freedom of choice!
If you're exceptionally clueful you can do this with inotify hooks and existing file checksums.... and "say it doesn't affect performance almost at all" if that sort of linguistic double-dutch appeals to you.
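For the curious, the find / md5sum / ln recipe above might look roughly like this. I've sketched it in Python rather than shell, so treat it as an illustration of the idea and not the actual script; the function names and the in-memory index (instead of a flat file of sums) are my own assumptions. It also does a byte-for-byte compare before linking, since matching MD5 sums alone don't prove two files are identical.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to keep memory use flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_tree(root, index=None):
    """Hard-link identical files under `root`; return how many were replaced.

    `index` maps digest -> first path seen with that content (hypothetical
    stand-in for the flat file of existing sums).
    """
    index = {} if index is None else index
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            keeper = index.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked to it
            if not filecmp.cmp(keeper, path, shallow=False):
                continue  # same sum but different bytes: leave it alone
            # Link under a temp name, then rename over the duplicate, so
            # the path never goes missing partway through.
            tmp = path + ".dedupe-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Everything has to live on one filesystem for the hard links to work, and any app that writes those files in place will change every linked copy, so pick your directories with care.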
I frequently find web sites that only work in IE, and sometimes find sites that work in everything but IE, but at least IE lets me visit http://unqualifiedhostname:9000/ - which Chrome does not.
Sure, I'd love to know the magic settings to make Chrome act like a browser instead of just a fancy UI for Google search, and I'd love to know the settings to make IE9 standards compliant, but it's honestly not worth my time when every new version of Firefox "just works" on all three of the platforms I use every day.
I live near DC. I hear TV and radio commercials related to some upcoming government policy change or decision all the time and they all follow that exact theme.
Getting off topic, but for those outside the DC area: it is surprising how many commercials are played on local radio and TV for the Joint Strike Fighter, Boeing, health care, telecom, network neutrality, cleaning up the Hudson, etc. I guess if you can't lobby the Pentagon and government officials directly, you catch them on their commute, waiting in traffic and listening to the radio.
And thus the phrase popular outside DC, when referring to federal government - "those people are living in their own little bubble".
PHP is like DCL or Perl or TCL; it evolved by accretion during actual use, rather than being built to a design. It has no overarching consistency to its syntax or structures. So, unsurprisingly, it is both ugly and very useful, like a battered old 4-way lug wrench.
People whose sense of self-worth is dependent on their programming chops will prefer Java or ASP or C#, because those languages are so much harder to learn. If you're a basement-dwelling man-child with no girlfriend, you can always console yourself with your elite status as a lisp programmer or whatever.
OK, I was just kidding about lisp, no need for the torches and pitchforks, I - hey! Ow! Stop that!
Yeah, I see that one all the time too. I tell 'em there are all sorts of perfectly "natural" poisons and they get mad at me.
I never really liked that word anyway. Why is a beaver dam more natural or less artificial than a dam made by humans? It's not like humans exist completely outside of nature.
Here, let me Google that for you.
Interestingly enough, the bubonic plague may have selected for HIV resistance.
You mean like youtube? That just might work.
Well, isn't that the official Republican position on healthcare?
I'm kidding, OK, kidding. Back away from the flamethrower.
In the US, you're allowed to drop anything you want as long as you ensure that anyone or anything on the ground will not get hurt or damaged from it. That is spelled out in FAR 91.15.
Wow, you don't see common sense lawmaking like that very much any more.
Although some people are still trying to enshrine common sense, I guess.
That's both a blessing and a curse. Anywhere I go, I always have awesome job security!
The last time the USA had a civil war, we killed off two percent of our total population. And that was before the advent of air power or modern artillery or air-cooled man-portable machine guns.
If it's up to me, I'm agin it. I'd like to believe we learned our lesson last time around.
Pertinent:
http://en.wikipedia.org/wiki/United_States_military_casualties_of_war
http://boards.straightdope.com/sdmb/archive/index.php/t-369691.html
Same here. But honestly you probably don't want to be in the top 1% for income; those people are always the first up against the wall when the revolution comes.
http://amultiverse.com/2011/10/24/eat-the-rich/
Shoot for somewhere in the top 5% and you might not end up wearing a bad sweater, eh?
Inside information, score zero. Welcome to the post-Taco slashdot, I guess.
Good one!
Well, I back up 12 terabytes a night, so I am amazed at your awesomeness, that you consider this trivial.
Nah, dirvish uses the native rsync --link-dest, which is easier.
What I get out of this is that it's impossible for anyone with real talent to work productively for the people in charge of HP's WebOS.
Opera's guys have major talent, as you pointed out, so they'd probably be unable to work with them either.
Can't prove you wrong because you're right.
I personally don't care for the UI, but I have to admit the underlying OS design has always been better than the other home-user systems.