I wouldn't say error. It was designed with parity protection only, so it was incapable of correcting single-bit errors, only detecting them. Hence the crashes (i.e. it detected a bit flip). If two bits were flipped you would never know.
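To make the parity point concrete, here is a minimal sketch of a single even-parity bit over one word. It is only an illustration of the principle, not how the actual cache hardware was wired; the word value and function names are made up.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def matches_parity(word: int, stored_parity: int) -> bool:
    """True if the word still agrees with the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


original = 0b10110010            # illustrative data word
stored = parity_bit(original)    # parity recorded when the word was written

one_flip = original ^ 0b00000100     # a single bit flipped
two_flips = original ^ 0b00010100    # two bits flipped

# A single flip changes the parity, so it is detected (but cannot be located).
single_detected = not matches_parity(one_flip, stored)
# Two flips cancel out in the parity sum, so the corruption goes unnoticed.
double_detected = not matches_parity(two_flips, stored)

print("single flip detected:", single_detected)
print("double flip detected:", double_detected)
```

Parity can only say "something is wrong", which is why the only safe reaction is to stop. ECC stores enough extra bits to locate and fix a single flipped bit as well.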
I worked in Sun front-line call support during this time, and explaining this over and over to customers was somewhat painful. Never mind the years of mocking that still come from telling customers "it was a cosmic ray". Sun put massive effort into tracking, diagnosing and fixing this issue though. Some customers got versions of CPUs with "mirrored" SRAMs. Sad to say, I remember customers still getting errors with those...
The US-III chips came out with end-to-end ECC protection, but the problems remained. In the end it turned out to be a host of socket mounting, pin contact and design specification issues that caused the errors, mostly solved by the time the 1200MHz CPUs were out. I wouldn't be surprised if it was something similar with the US-II.
As for Toyota, if they don't have end-to-end ECC they only have themselves to blame.
I would suggest our democracy was formed by a vote on a referendum for federation rather than a minor civil disturbance.
Because they are territories of the federal government, and IANAL but I believe they fall under the Federal court.
ZFS detects the checksum failure, then picks the data up from the other mirror or ditto block and replaces the corrupted copy.
You did set up disk redundancy, didn't you?
Well, that's the nature of de-duplication. Either your data is full of duplicates, or it isn't. If you are not sure what your data will do, you should probably assume no reduction until you have actual numbers.
We had a disk library vendor try to sell us on the dedupe in their VTL, which we put to work backing up Exchange data. They told us stories of huge compression ratios, how Exchange mailboxes compress so easily, and how we could easily fit weeks of backups onto it. The fact was the machine had already been bought by management, and we just looked after the backups, not Exchange, so we set it up as we were ordered to.
Turns out their examples were all based on sites with 1G mailboxes or more. Our bizarre setup with 50M mailboxes and thousands of employees was not that good for dedupe, because all the repeat mails got moved to people's PSTs immediately or their mailboxes would fill. So after a week of backups the VTL was full and we were stuffed. Then we had to move it all off to tape...
Turns out that while it did back up and run its de-dupe fine, un-duping the data for restoration was hopeless - less than 1MB/s throughput - so it took days to move off.
And that experience is why I now believe that de-dupe for backups is fundamentally flawed. You want a backup of all your data - not the data some borken firmware decides is unique. Any corruption and, bang, all backups are useless because they are all missing that highly duplicated block.
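A toy model of the two points above: the ratio you get depends entirely on how repetitive the data is, and one bad shared block takes out every backup that references it. This is a sketch under simple assumptions (fixed blocks, SHA-256 digests, made-up backup names), not how any particular VTL implements dedupe.

```python
import hashlib


def dedupe(backups):
    """Store each unique block once; each backup keeps only a list of digests."""
    store = {}     # digest -> block data (the only physical copy)
    recipes = {}   # backup name -> ordered digests needed to rebuild it
    for name, blocks in backups.items():
        recipes[name] = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            recipes[name].append(digest)
    return store, recipes


def dedupe_ratio(backups, store):
    """Logical blocks written divided by unique blocks actually stored."""
    logical = sum(len(blocks) for blocks in backups.values())
    return logical / len(store)


def damaged_backups(recipes, bad_digest):
    """Every backup whose recipe references the corrupted block."""
    return sorted(name for name, digests in recipes.items() if bad_digest in digests)


# Hypothetical nightly backups that share one common block.
backups = {
    "mon": [b"common-header", b"mail-a"],
    "tue": [b"common-header", b"mail-b"],
    "wed": [b"common-header", b"mail-c"],
}
store, recipes = dedupe(backups)
shared = hashlib.sha256(b"common-header").hexdigest()

print("ratio:", dedupe_ratio(backups, store))
print("backups hit by one bad shared block:", damaged_backups(recipes, shared))
```

With mostly unique data the ratio stays close to 1, which is why it pays to measure on your own data before believing a vendor's figure.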
I have a case of it. Some time ago an unofficial work server I ran needed some disk, so the SAN admin kindly donated some 1TB of unwanted 5400 rpm PATA Clariion disks, even though I only wanted 300G or so. I put ZFS on it and left it at that.
Anyway, now the server is important, it performs like a dog because the PATA disks are crap, and the Clariion is on the way out. So now it needs to be moved.
So my only options are ZFS send/recv, which is reasonably slow, or a backup/restore, again slow. However, being on UFS would have made no difference: it would still be a dump/restore operation.
At a guess the only combination of filesystem/VM that might have done this is VxFS + VxVM, but that is nowhere near free and in my experience seems to cause as many panics as UFS did.
So it's not really a ZFS-only issue per se, but occasionally the need is there.
My word from inside Sun was that BP rewrite was putback a few months ago. This was from the organiser of the Australia conference.
How many cores does it take to run a parallel algorithm?
100 - 1 to do the processing, 1 to fetch the data and 98 to calculate an efficient way to make the whole thing run in parallel.
I am talking about 6TB in the time of DLT7000 drives and 9G disks. As I recall a "failure" was mostly window overruns caused by jammed tape drives or crap performance. I think it also used AdvFS clones, which had some issues of their own.
The moral of the story anyway was backing up what was needed, not what was there.
An article linked above suggested the cause was a firmware upgrade failure on an HDS array - sounds like maybe it lost the config or did something nasty during the upgrade. At any rate, the core question is: where is the backup tape?
Which begs the question, where are THOSE backups then?
Dubious backups? Depends. We had a system which was a 6TB cluster that was notoriously difficult to back up. This went on for years: it took too long, failures caused issues downstream, etc. Then someone took a moment to realise that the application was not capable of re-using that 6TB of data if it was restored - once the data came in it was processed and archived. To recover the application all they had to do was back up a few gig of config and binaries, and restart slurping data from upstream again. Voila - backup stripped down to nothing, 6TB a day less data to back up, and next to no failures as it was now so quick to back up.
Then there is the case of an application for which the vendor and application developer signed off on a backup solution using a daily BCV snapshot. What they failed to tell us was that the application not only held data in a database, but also in a 6G binary blob file buried deep in the application filesystem. If the database and the binary were out of sync in any way, it could mean missed or replayed transactions, or generally that the application was inconsistent. As this was an order management platform, that was bad. You can guess the day we found out about this dependency... yup, data corruption. Bad vendor advice screwed the binary file, and all we had to go on was a backup some 23 hours old where the database was backed up an hour after the application. Because of a corresponding database SNAFU, the recovery point was actually another day before that, with the database having to be rolled forward. It was at this point we found out that, despite the signed-off backup solution, the vendor's documented recommendation (which was not supplied to us) was that the only good backup was a cold application one - not possible on a core order platform. Thankfully, after some 56 hours of solid work the application vendor managed to help sort the issue out and the restore from backup was not actually needed. The backups were never really tested, as the DR solution worked on SRDF - the DR consideration for data corruption was never really part of the design (from a very high level, not just this platform).
So there you have it. Two dubious Enterprise backups - one not needed, the other not usable.
Because unless they state they are only publishing positive reviews, it is misleading to show that all feedback from "users" is positive. It is deceptive to filter out the negatives, as it portrays the product as good based on what is supposedly unbiased user feedback rather than vendor advertising.
For advertising, yes, of course you only show positive reviews, it stands to reason to choose what supports the product (movie etc).
The T-series (sun4v platform actually) have LDoms, which are very similar to LPARs but a bit more simplistic in their implementation. You can virtualise storage and networking on a control domain (i.e. like a VIO server) and create domains out of the available threads and memory on the box. So with this you can run an individual OS in each LDom. It even has dynamic migration now, where you can migrate a live running LDom between two machines (akin to VMware vMotion or the LPAR equivalent, the name of which escapes me now).
I see zones and LDoms complementing each other though. I see zones as great for "environment" isolation, where you can make multiple copies of the same application in zones using the cloning and integration with ZFS. You make an LDom to separate any applications that have real OS versioning restrictions, and put zones within the LDoms to separate environments.
That way you can patch a particular application as you want, but you can easily provide new/fresh/cloned application environments using zones.
What workload needs 64 POWER or SPARC procs anymore? More often than not, if it needs that much CPU it is horizontally scalable, in which case buying 2x as many T-series boxes would be cheaper anyway. Most of the time the reason you have boxes with 64 CPUs installed is for partitioning with LPARs or domains.
And for scaling it all depends what you are doing with the box. We have an application which consumed a full 48 core E6900 (i.e. the box was 100% on CPU) because it ran all its components on the one box. We moved a component that consumed 50% of the load onto a single T5240. The T5240 was only consuming 15% CPU with improved response time (granted it was a Borland Java application, which suited a T-series box a lot better).
For the cost of an E6900 uniboard, we could buy 2-3 T5240s to replace the E6900 and the T5240s would handle 6 times the load.
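As a rough back-of-the-envelope check on that figure, using only the two numbers quoted above. The assumptions are mine, not measurements: load scales linearly with CPU utilisation, and the rest of the application would behave like the component that was moved.

```python
def e6900_equivalents(boxes, share_moved=0.50, utilisation=0.15):
    """E6900-loads that `boxes` fully utilised T5240s could absorb.

    share_moved: fraction of the E6900 the moved component consumed.
    utilisation: fraction of one T5240 the same component consumed.
    Assumes linear scaling, which real workloads rarely deliver exactly.
    """
    return boxes * share_moved / utilisation


for boxes in (2, 3):
    print(boxes, "x T5240 ~", round(e6900_equivalents(boxes), 1), "x the E6900 load")
```

Under those assumptions two boxes already come out somewhere above six times the load, so the claim is in the right ballpark for that workload.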
Looks like a cheap, downscaled, undersized version of a Sun X4500/X4540.
And as others have pointed out, you pay a vendor because in 4 years they will still be stocking the drives you bought today, whereas for this setup you will be praying they are still on eBay.
Because 1024 approximates 1000 - hence it is close to a "kilo" unit, and the rest extrapolate from there.
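A quick illustration of how that approximation drifts as the units get bigger. The gap between the power-of-two value and the power-of-ten value compounds with each prefix:

```python
PREFIXES = ["kilo", "mega", "giga", "tera"]


def drift(power: int) -> float:
    """Percentage by which 1024**power exceeds 1000**power."""
    return (1024 ** power / 1000 ** power - 1) * 100


for power, name in enumerate(PREFIXES, start=1):
    print(f"{name}: 2^{10 * power} is {drift(power):.1f}% larger than 10^{3 * power}")
```

At "kilo" the difference is small enough to ignore, but by "tera" it is roughly a tenth, which is where most of the arguments about disk sizes come from.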
As I recall it was something to do with the routers: if they lost power, they lost their configuration - something to make sure that if gear was stolen it didn't come up with any of the secure network's details.
From memory someone viewed this as him setting up some sort of timebomb instead of good security practice, and charged him as such.
Well, a fallible point in your logic is the assumption that the creator was "god", whereas creationism in purity could/should pertain to any "creator". It doesn't; it presumes the certainty of "god" as the creator and steps from there. Using your argument I could ask you to present evidence of this "god" as the creator.
Don't get me wrong - there is nothing to prove the argument of evolution except that what it is based on is provable, and demonstrated in nature. It does not complete the picture back to the dawn of time necessarily, but the pattern is so consistent and simple that it needs to make far fewer leaps of "faith" to paint the picture.
I see it more as the arrogance of "man" as a species to presume we could not have evolved from apes or supposedly lower species, which also fits with the invalid belief we have "dominion" over all life on earth. Maybe we are doing pretty good now with this whole "technology" thing since we started by throwing rocks, but nature has shown time and again we are not in charge. The belief that we are in the image of "god" fits more to comforting us to believe we are actually special, not just along for the ride.
But good on you for keeping an open mind.
What do you expect from creationists? Rational thought based on your own judgment of presented evidence?
Simple - use ZFS + snapshots. There are tools out there already that do periodic and rolling snapshots - bad delete, go into the snapshot and copy it back, that easy.
The real trick with undeletion is making sure you don't overwrite the data before you are really sure it doesn't need to be undeleted...
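For what "rolling" means in practice, here is a minimal sketch of the retention logic such a tool might use: keep the newest N automatic snapshots and prune the rest. The `auto-` prefix, timestamp format and dataset names are hypothetical, and the sketch only decides what to prune; it does not call zfs itself.

```python
from datetime import datetime

PREFIX = "auto-"            # hypothetical naming convention for periodic snapshots
STAMP = "%Y-%m-%dT%H:%M"    # hypothetical timestamp format in the snapshot name


def snapshot_time(name: str) -> datetime:
    """Parse the timestamp out of 'dataset@auto-YYYY-MM-DDTHH:MM'."""
    label = name.split("@", 1)[1]
    return datetime.strptime(label[len(PREFIX):], STAMP)


def snapshots_to_prune(names, keep):
    """Return automatic snapshots older than the newest `keep`, oldest first.

    Snapshots without the automatic prefix (manual ones) are never returned.
    """
    auto = [n for n in names if n.split("@", 1)[1].startswith(PREFIX)]
    auto.sort(key=snapshot_time)
    return auto[:-keep] if keep > 0 else auto


existing = [
    "tank/home@auto-2024-01-01T00:00",
    "tank/home@auto-2024-01-02T00:00",
    "tank/home@manual-before-upgrade",
    "tank/home@auto-2024-01-03T00:00",
]
print(snapshots_to_prune(existing, keep=2))
```

The retention count is effectively your undelete window, so size it for how long it takes someone to notice a bad delete.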
For hardware support it really depends what segment of the market you are arguing about. If you are talking white box, low end, mostly self-supported stuff then no doubt, Linux wins hands down. But as a sysadmin I find Linux to be one of the most painful platforms to work on compared to Solaris or AIX - predominantly because of the lack of standardised, stable and properly supported management interfaces.
Fibre channel support is a joke. Sure, for the most part you can dynamically bring stuff in and out, and udev goes a short way to bringing some consistency. The problem is when something goes wrong you are left with pretty much just rebooting - messages tell you nothing - is the device there or not? Usable details are buried away in /proc and /sys and typically are only useful for developers. Solaris and AIX had cfgadm/cfgmgr and lsdev and friends to tell you what state things are in or what has happened. There are useful and informative error messages (typically). So far on RHEL 3/4/5 all I ever see is odd octal dumps from drivers when errors occur, and weird hangs and IO errors when devices get broken. It gets worse as you change fibre drivers and versions. Options which exist in one disappear in others. Vendor drivers add customisations which cause other issues.
The lack of stability in terms of being able to do things between versions gets me as well. On AIX/Solaris you write a script for, say, Solaris 8, and it just works going forwards to other versions. Solaris 10 changes things a bit, but for the most part you can still poke around the same places or the same way to get info back. In short they tend not to break things that work.
Linux goes the other way - a change is made, and that's that; it seems to be up to you to either track it or figure it out. You find yourself having to customise things for many, many variations of platform - not just major versions, but minor versions as well. Changes to config file locations, the ways those files are defined, etc.
Don't get me wrong, I got into UNIX on Linux and I won't dispute its strength in drivers or community, but that community is not "Enterprise" focused. It's why I use it for my PVR and not my file server. The rapid changes in Linux are why the DVB-T cards I got became supported so quickly after the hardware changed. I get the differences, but it's not one size fits all.
Read the ZFS White Paper. Just because the disk checks its blocks doesn't prevent other sources corrupting, overwriting or generally tampering with data. For example, say your el cheapo fibre card corrupts one bit in every 2 billion writes - on disk it's fine, SMART never sees it, it never complains. When ZFS reads the corrupted blocks it will see a checksum error, and repair if necessary.
It also doesn't cover the case of deliberate/administrative corruption such as accidentally overwriting the wrong disk etc. With a normal mirrored device you could read off either side, it simply returns the data and the data would be blockwise correct. With ZFS again you would see failures and if possible it could correct. In fact this is how the early demonstrations of ZFS would work - simply using dd to clobber one half of the mirror and watching it fix itself.
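The dd demonstration can be mimicked with a toy model: two copies of each block, a checksum held separately from the data, and a read path that verifies before returning. This is only a sketch of the idea; the class, the use of SHA-256 and the in-memory layout are my assumptions and not ZFS's actual on-disk format or code.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Two copies of every block, with the checksum kept apart from the data."""

    def __init__(self):
        self.sides = [{}, {}]   # block id -> data, one dict per "disk"
        self.checksums = {}     # block id -> expected checksum

    def write(self, block_id, data):
        for side in self.sides:
            side[block_id] = data
        self.checksums[block_id] = checksum(data)

    def read(self, block_id):
        expected = self.checksums[block_id]
        for side in self.sides:
            data = side[block_id]
            if checksum(data) == expected:
                # Good copy found: rewrite any side that does not match it.
                for other in self.sides:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: no copy matches its checksum")


mirror = ToyMirror()
mirror.write(0, b"order #1234")
mirror.sides[0][0] = b"garbage!!!!"   # clobber one half, like dd over one disk

result = mirror.read(0)               # verified data from the surviving side
repaired = mirror.sides[0][0]         # the clobbered side after the read
print(result, repaired)
```

A plain mirror has no checksum to compare against, so it would have happily returned the garbage if it happened to read that side first.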
And for ZFS I would absolutely recommend ECC memory. I had exactly this case: a blown capacitor caused random memory errors on a motherboard I had, and any new or modified file would throw up checksum errors when re-read. Without ZFS I probably would not have known until I got some weird panics from corrupted metadata or something.
Really? What sort of test was it?
We took a Java application off an E6900 where it was using 35% of 48 1.35GHz US-IV cores. We put it on a T5240 with 16 1.4GHz cores and saw it use only 14% of the machine, with improved user response time.
We also ran a database benchmark for some tests we were running between some AIX and Linux boxes and threw it against a T5240 running Oracle 11g for comparison. Because it was predominantly a single-threaded operation it ran slower than the 2.2GHz Power5 LPAR, but the overall difference was about the same ratio as the difference in clock speeds. The thing to note was the machine was only a few percent utilised, so we could have run another 16 or so instances and coped easily.
These machines are workhorses. Granted, you need the right workload, but highly parallel/highly transactional work like Java web applications or web serving absolutely flies.
ZFS is managed by two commands - zpool and zfs. Both commands are so simple, straightforward and consistent it's almost pointless wrapping them in something else.