First, you want to get some very experienced engineers who have done this type of thing before. Try ones with a background in either avionics or medical devices, since both are life-critical / mission-critical arenas. Second, you may want to look at companies that make fail-safe systems, as these usually require special-purpose hardware. HP has a computer line called NonStop which may be worth looking into (no, I don't own any HP stock :)).
In terms of techniques:
1. NEVER, NEVER, NEVER, NEVER... NEVER execute a loop that waits for some
event to happen without a bailout mechanism, even if it's just counting
a variable up to (or down from) a few million or so (however long you've
determined the maximum wait interval should be). If a piece of hardware
breaks or a sibling thread crashes, you'll be out to lunch.
2. Try to use a real-time operating system that is already used
commercially in fail-safe systems.
3. Don't use Windows. No matter how defect-free / error-free you make
your system, it won't matter, because Windows will have more than
enough defects and flaws to make your system fail in weird and
mysterious ways.
4. Use a journalling file system such as ext3 or ReiserFS.
5. Keep a recent copy of your operational state / data somewhere safe,
such as non-volatile memory. If your system has to restart itself,
this data will help you become operational again much faster.
6. Use a watchdog timer. Basically, this is a piece of hardware that
your code has to "feed" at regular intervals. If your code gets hung
up in an infinite loop somewhere, it stops feeding the watchdog, and
the watchdog timer asserts the reset line and starts things up again.
That's where your "warm" data from item 5 comes into play.
7. As many here have mentioned, try to partition your system in such a way
that you can stay away from C++ as much as possible.
8. As some here have mentioned, real-time Java or a commercial garbage
collector library could help a lot in avoiding pesky memory leaks.
9. Assume you will mess up the first time. It's a much more realistic
assumption than assuming you'll get it right the first time. Hey,
most of us didn't even get our first KISS right the first time, and
what you are looking at is a lot more complicated than that :)). So,
schedule enough time to build it twice (call the first one an R&D
program), and collect enough information about your design decisions
and rationale to help you understand where you went wrong and do
better the second time around. Good Luck.
10. You've gotten a lot of good comments from a lot of very intelligent
and experienced people on this list. Read them over carefully.
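Item 1 above can be sketched in Python as a polling helper with a hard iteration cap. This is a minimal illustration, not code from the post: the function name `wait_for`, its `max_polls` and `poll_interval_s` parameters, and the example device flag are all hypothetical. The point is that the function always returns, and the caller must decide what to do when it returns False.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool],
             max_polls: int = 1_000_000,
             poll_interval_s: float = 0.0) -> bool:
    """Hypothetical helper: poll `condition` until it is True or the budget runs out.

    Returns True if the event happened, False if we bailed out. The caller
    must handle False (log, retry, go to a safe state) instead of hanging.
    """
    for _ in range(max_polls):
        if condition():
            return True
        if poll_interval_s > 0:
            time.sleep(poll_interval_s)
    return False  # bailout: the event never arrived within max_polls attempts


# Example: a device flag that never gets set (e.g. the hardware died).
device_ready = False
if not wait_for(lambda: device_ready, max_polls=1000):
    print("timed out waiting for device; entering safe state")
```

Counting iterations matches the suggestion in item 1; a deadline based on `time.monotonic()` is a common alternative when the poll rate varies, since it bounds wall-clock time rather than loop count.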
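A rough Python sketch of item 5, assuming the "safe place" is a file on a journalling file system (item 4) rather than dedicated non-volatile memory. The function names `save_checkpoint` / `load_checkpoint` and the JSON format are hypothetical choices for illustration. The state is written to a temporary file first and then renamed over the old checkpoint, so a crash mid-write leaves the previous copy intact.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Hypothetical helper: write `state` so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to the storage device
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # don't leave stray temp files behind
        raise


def load_checkpoint(path: str) -> Optional[dict]:
    """Hypothetical helper: return the last saved state, or None if none is usable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

On restart, the system would call `load_checkpoint` and fall back to a cold start when it returns None. For stronger durability on POSIX systems, the containing directory can also be fsync'd after the rename; the sketch leaves that out for brevity.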
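Item 6 describes hardware, so the following Python class is only a simulation of the feed/expire logic in discrete ticks, meant to show the idea; `Watchdog`, `feed`, `tick`, and `on_expire` are hypothetical names, not a real driver API. In a real system the expiry is the hardware asserting the reset line, and the restarted code would then reload its warm state as described in item 5.

```python
from typing import Callable


class Watchdog:
    """Hypothetical software model of a hardware watchdog timer.

    A real watchdog is a hardware counter; this class simulates one in
    discrete ticks so the feed/reset logic can be shown and tested.
    """

    def __init__(self, timeout_ticks: int, on_expire: Callable[[], None]) -> None:
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks
        self.on_expire = on_expire
        self.remaining = timeout_ticks

    def feed(self) -> None:
        """Called by healthy application code; restarts the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        """Advance the simulated clock by one tick; fire the reset at zero."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_expire()  # stands in for asserting the reset line
            self.remaining = self.timeout_ticks  # after a reset, counting starts over


resets = []
wd = Watchdog(timeout_ticks=3, on_expire=lambda: resets.append("reset"))

# Healthy main loop: feeds the watchdog every tick, so no reset happens.
for _ in range(10):
    wd.feed()
    wd.tick()

# Hung code (e.g. stuck in an infinite loop): stops feeding, so the watchdog fires.
for _ in range(3):
    wd.tick()

print(resets)  # → ['reset']
```

On Linux, hardware watchdogs are commonly exposed as `/dev/watchdog`, where writing to the device counts as a feed; the kernel's watchdog API documentation covers the details.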
Good Luck
dennis
Well, an "always on" connection does have a higher risk factor. However, you can get similar benefits just by unplugging your Ethernet cable. It generally plugs into the back of your PC, and the other end plugs into either a jack on the wall or an intermediary box such as a router, a switch, or whatever. When you are not doing internet-related stuff, unplug one end of the cable or the other. Terribly low-tech, but... it works. :)