Do Car Safety Problems Come From Outer Space?
Hugh Pickens writes "As electronic devices are made to perform more and more functions on smaller circuit chips, the systems become more sensitive and vulnerable to corruption from single event upsets. This is especially true of Toyota, which has led the auto industry in its widespread inclusion of electronic controls in the manufacture of their various car models. 'These circuit families store not just data, but their basic function electrically,' says Lloyd W. Massengill, director of engineering at the Vanderbilt Institute for Space and Defense Electronics at Vanderbilt University. 'In the unfortunate event of a particle flipping just the right bit, a circuit configured to carry out a benign action may be reprogrammed to carry out some unintended action.' Denise Chow writes in Live Science that some scientists are pointing to cosmic ray radiation as a plausible mechanism behind the sudden, unexplained acceleration reported to have occurred with the late model Toyotas."
"As the design of automobile systems continues to evolve from mechanical to electronic controls, relying more and more on various circuitry and chips, these electronic components may be vulnerable to being confounded by high-energy radiation," writes Chow. Federal regulators were prompted to look into the possible role that cosmic rays played in Toyota's product recall fiasco after an anonymous tipster suggested the design of Toyota's microprocessors, software and memory chips could make them more vulnerable (PDF) to interference from radiation than those of other automakers. 'What's not known is what direction Toyota and other automakers are taking in terms of finding and correcting these issues,' says senior researcher Ewart Blackmore."
There should be some checksum that wouldn't add up. When a fault is detected, control should pass to a backup program that safely shuts down the car.
Or how about a computer redundancy system, where a group of computers that are all capable of controlling the car watch the behavior of the computer that is actually controlling it? Through a voting system they could decide to hand control of the car over to another computer in the event that the controlling computer doesn't act in a way that is deemed safe. This way the car could continue to operate normally while signaling that there is a problem that needs to be addressed.
Actually, it was due to a design error, as the cache wasn't ECC protected and occasional bit-flips weren't detected.
http://www.sparcproductdirectory.com/artic-2001-dec-1.html
>>Confirmed cases of runaway acceleration are virtually non-existent.
And how do you confirm it? Ask the person?
My '84 Cutlass Supreme went out of control accelerating when I was driving on the campus loop (back in '97 or so), but how could you confirm this? It did happen, but how can you verify it? (I've posted the story on Slashdot before, if you really dig back into my history, long before the runaway Toyota thing entered our national consciousness.)
And to the snarky people posting on this - it's terrifying as fuck for your car to accelerate arbitrarily fast (especially when you run a stop and have to dodge pedestrians), and no, the brakes didn't work. Long story short, I had to kill the gas and use non-power assist brakes to come to a stop, fortunately without killing anyone.
The problem is that many microcontrollers used in automotive systems don't support ECC or any other hardware-based error-checking mechanism. A lot of these systems use only the memory on the microcontroller chip. If there is external RAM on the unit, ECC memory isn't always used since it is more expensive. Flash is typically checksummed/CRCed/MD5-checked, but you don't typically see flash cells get flipped in the field. I've seen one unit get its flash corrupted (out of many millions of possible units) in the last 11 years.
It will be interesting to see if they get to the root cause of the problem. If it is an electromagnetic interference problem, it will be very difficult.
There's a reason it's always hitting the same system in the car.
It may be that the system or packaging in which the processor or memory is embedded emits alpha particles at an unusually high rate. It wouldn't be the first instance of that happening.
I think it's highly likely that Toyota would have included checksums for their data. They put their cars through a lot of testing, and I'm sure all the mobile phone, Bluetooth, and other RF interference would have been tested in their labs. They know their cars last 20+ years, so I'm sure they would have tested their electronics so they can handle degraded and faulty wires and interference.
Yeah sure, some cosmic particle could flip a bit in your data, but with a checksum you'd throw away that corrupted packet and keep going.
Given that the electronics are responsible for everything in the car (including the timing of every spark in the cylinders), you'd think other things, like an engine misfiring, would be the most likely to happen. These cars have data flowing through them all the time.
It sounds more and more like a software bug the more I read. Sure something could have mucked up the software - but you'd get random outcomes of that.
If the common outcome is sudden unintended acceleration - then it sounds like the bug is in the same section of code - sounds like a software bug - not some random "act of god" liability reducing cosmic particle that's figured out how to change the same bit on multiple cars spread across the globe.
Maybe they should have gone for the more internet-friendly headline "aliens attack toyota model cars with accelerating retractor beams" - it'd sound just as plausible as their cosmic ray problem.
More to the point, they generate secondary showers of ionizing radiation when they traverse metallic shields, so we should be careful not to make the problem worse by creating showers of particles with a greater cross section.
At the risk of sounding like a geezer, I remember back in the late 70's when this was a problem in early designs of mini-computers. Back then we used to see single bits get flipped and crash computers from a variety of sources, including cosmic radiation and alpha particles that came from the spontaneous decay of elements in the ceramic chip housings. More recently, when I purchased my 2005 Cadillac CTS, it experienced a variety of problems similar to this when I would drive through a toll station that was equipped with RFID systems. Behaviours included sudden acceleration, engine stalling, indicator lights on the instrument panel going "crazy", On-Star calling for help when nothing was wrong, the driver's seat suddenly driving forward to the steering wheel (making it really hard to steer), etc. At the time the only solution was to pull over, shut off the car, remove the key, open the door, wait for everything to shut down and then restart. After many frustrating weeks of "we can't duplicate the problem" it was discovered that the car had faulty shielding on one of the cables that makes up the in-car network. Once fixed, the "gremlins" went away. The real crime here is that, because the problem can't be replicated on demand, Toyota is blaming the behaviour on attention-seeking owners. This bizarre response was recently repeated on the floor of Congress by one of Toyota's congressional tools. (I mean duly elected government representative.)
When I was working for NASA, on the NISN network, we'd get these weird router crashes for the old Cisco router located at (or very near) the South Pole in Antarctica. It was always a memory problem, and I'd always have to call someone to get them to powercycle the router. It irritated me to keep bothering those guys, so I opened a case with Cisco TAC.
The TAC guy sent a terse response, saying that particular crash was a "transient memory error" due to "alpha radiation or sun spots." That really pissed me off -- Cisco TAC just gave me a standard BOFH response! I escalated, and swung the NASA club around some, and finally got a senior engineer on the phone. "You said this router's at the South Pole, right? So that means it's at very high altitude, with very little ozone shielding, right?" "Umm, yeah." "Well there you go. There's a lot more radiation at that altitude than at sea level. Our stuff's only rated for sea level. See if they can .. I dunno, put a lead blanket over it or something."
I relayed the info to my contact at McMurdo, and he laughed and said he'd figure something out.
On a hunch, I checked the other two "high-altitude" routers we had, and sure enough, they both had a statistically higher failure rate for "transient memory errors".
> In order for it to interfere with a digital circuit, it first has to be radiation of the "ionizing" category
Neutron radiation isn't directly ionizing, yet interactions between the neutrons and the silicon in a typical chip will create charged particles that cause current surges. These current surges can interfere with the correct operation of a circuit, and that includes individual transistors, not just bits in memory.
I remember a news story from several years ago that even made the evening news. Someone had a Saturn car that they realized they couldn't afford and tried to return. The dealer wouldn't just take it back for a full refund, since it was now a used car.
Over the next few months, the driver had several "emergencies" with it, each time having it towed back to the dealership, where they couldn't find a problem. In one in particular, which was videotaped by the police, the car was circling in a parking lot and the driver called 911. She insisted the car wouldn't stop. They told her to step on the brakes, use the emergency brake, throw it in neutral, shut it off, etc, etc, etc... She circled for something like 30 minutes. Finally they got her to open the driver's window, and an officer got in the middle of where it was circling. He ran for the side of the car, grabbed the wheel, and then turned off the key. The car (amazingly enough) came to a stop.
Of course, she claimed it wouldn't stop for her. There was all kinds of talk about lemon laws, and how Saturn vehicles weren't safe. She made a whole bunch of noise, and the dealership traded her car for another one. The problems persisted for her. Obviously Saturns were amazingly dangerous vehicles. Someone from the dealership (I think the owner) actually started driving her original car to work every day, to find out what the problem really was. He never had a problem.
Eventually, she was charged, I believe with reckless endangerment. Pretty much, she was driving dangerously, and endangered the officers who tried to help her.
I won't say whether the mystery Toyota problem is driver error or a mechanical fault, but the cases that have been in the news have massive parallels in other vehicles too, where drivers just did the wrong things.
An older lady in a Buick several years ago was pulling into the parking lot where I worked. I happened to be in the front of the store, and heard her tires squeal. She smashed into a parked car. That broke the parking pawl and sent the parked car across the parking lot into two other parked cars. One of those cars belonged to one of my coworkers, who wasn't exactly very happy that his car was totaled. I ran out to see if she was ok (once the cars stopped moving). She said "What happened?" I told her what she did. She was very insistent that she hit the brakes. I told her she spun the tires before hitting the first car. She said the other car must have done it. The driver of the other car was in the store at the time. At least everyone with wrecked cars had a good sense of humor about it, and no one was hurt. The funniest part was, her car was fine. There was absolutely no damage. It wasn't even scratched. The other three cars were severely damaged though. Her insurance gave my coworker full book value on his car, even though it was a rusted piece of junk that barely ran. They were fully aware of it; they were just avoiding potential legal problems.
> And how do you confirm it?
You replicate it and see if it happens again, or look for physical causes that might lead to that result. Loose floormats have been confirmed to cause it. Rusty/sticky throttle cables have been confirmed to cause it. Bad cruise control units have been confirmed to cause it (mostly because of physical errors; not all are electronic).
But "the car accelerated, I applied the brake and only the brake once the acceleration started and pushed it as hard as I could and the vehicle continued to accelerate out of control" cases have, as far as I know, *never* been replicated. The brakes are somewhere around ten times more powerful than the engine. If you slam the brake pedal to the floor with all your might, you will stop all cars, unless your brakes failed before you tried to use them. So, every case of "I pressed the brakes as hard as I could with my foot off the throttle" defaults to someone that didn't have their foot on the brake and off their throttle.
> And to the snarky people posting on this - it's terrifying as fuck for your car to accelerate arbitrarily fast (especially when you run a stop and have to dodge pedestrians), and no, the brakes didn't work. Long story short, I had to kill the gas and use non-power assist brakes to come to a stop, fortunately without killing anyone.
Another reason why manuals are better. You just put in the clutch, and the car stops accelerating. And turning off the car or putting it in neutral is so easy one wonders about the competency of the California trooper who was out of control for over a minute.
But for brakes to not stop a car means the brakes are so bad that their failure should have been evident before the incident. Would you say the car you were in when this happened was in excellent mechanical shape without any problems braking or accelerating ever before that incident? I had a Cutlass Ciera of about that age that accelerated out of control once. It was the cruise control that got stuck in the "accelerate" position. The brakes worked. But the car is so crappy that if I'd used the brakes to hold the constant speed for 10+ seconds before trying to stop as fast as possible, they would have faded to the point they would be useless. So when people make reports, it's also interesting to me how long people are holding the brakes at low pressure before going to high pressure. Because, especially in crappy American cars, like Oldsmobiles, the brakes fade fast. They have more than enough power to stop you from 100+ mph under full acceleration, but can't do so after riding them for a mile.
> Radiation that can upset bits in an electronic circuit don't come from your cell phone, TV/radio stations or microwave oven
> You may get enough EMI to interfere with your radio, but flipping individual bits in a chip pretty much requires an ion
You don't need to flip individual bits in a chip to cause problems with car electronics. I suspect if something flipped dozens or thousands it would still cause problems. So you shouldn't get so fixated on individual bit flips.
From the perspective of car safety, the people that are saying "outer space" seem like they're clutching at straws.
As for the removal of lead: it actually made the tin-whisker problem bigger and thus made stuff less reliable.
I strongly doubt the removal of lead had anything to do with making stuff more reliable by avoiding lead decay. If you can provide a decent citation for that, that'll be interesting.
Having worked in digital electronics for 30 years, I have seen some pretty strange ways to introduce noise into digital circuits:
1) Inadequate grounding - two circuits are communicating, but are grounded to two different ground planes. Over time they build up a potential difference, and the 5 volts necessary to form a "1" bit starts to look like 3 or 4 volts to the other side. The signal just "stops", until you power down and the charge bleeds off. It won't reoccur during short tests.
2) Static electricity. Cars develop thousands of volts in static electric potential from air friction, just like airplanes. You may laugh, but static can be devastating to digital circuits. It can make craters in chips and even when it doesn't destroy them it can flip bits undetected until they are accessed. I worked on one system that would reboot whenever my boss walked by and brushed against it wearing a wool suit jacket - true story.
3) Temperature-sensitive dielectric in the capacitors. Capacitors shield the power lines on the bus from digital information, which behaves just like high-frequency noise. The capacitors get hot from engine heat, the dielectric loses its resistance to electrons, and the capacitors fail temporarily and allow digital noise onto the power lines, which then bleeds into the attached circuits, causing random errors all over the place.
4) The antenna effect - circuits operating in the multi-hundred megahertz to gigahertz frequencies start to radiate from copper conductors on the circuit board - these signals can be picked up by other copper traces on the circuit board and cause "ghost" signals. It is often necessary to use micro-coax cable instead of etched copper traces to quell this problem.
Toyota should let their computer geeks go back to playing WoW, and let a couple of good high-frequency electrical engineers look at the problem.
This was indeed a real problem in the late 70's, particularly for DRAM chips and only ceased to be a problem when manufacturers tightened up on the allowable level of impurities in materials near the memory chips, such as the encapsulating plastics and the chip coatings used within ceramic ICs. Many elements have naturally occurring isotopes that are radioactive and DRAM errors are dependent on the concentration of these within materials surrounding the memory chip and the radioactive decay method. Back then of course we had atmospheric atomic testing and straw packing material was a good way to capture atmospheric fallout (and a good way to get fogged photographic film too). When you consider the effect of Moore's Law on the size of the capacitor used within the DRAM over the last 30 years (the bit flip is caused by the radioactive decay particle discharging this capacitor) and the fact we can't make perfectly pure materials at an economic cost, it is surprising that this problem is not more obvious now. I suspect software bugs are more likely to be the cause however.
My dad was an IBM CE (Customer Engineer) specialist on one of the models in the IBM System/360 mainframe range. He used to like telling the story about how he and another engineer were out on a customer's site trying to determine an intermittent fault. They would bring the machine up and sure enough there would be this glitch at precise intervals. They just couldn't figure out what was causing it. That was, until the other CE took a look out the window.
After a bit he said 'Tell me when it happens'. OK... '...now' my dad said. Then he said 'I'll tell you when the next one happens' and a few seconds later said '...now'. Which is exactly when it did glitch.
It turned out that the customer's DP center was situated close to an airport. The CE could see the radar dish revolve at the end of the runway. When it pointed straight at him was when the glitch occurred. Needless to say the computer room received some RF shielding.
You misunderstand my argument. That's OK -- it happens to me all the time.
Allow me to rephrase: What are the chances of the RAM being marginally-bad in such a way as to allow unintended acceleration, while not producing any other symptoms?
The chances of it being bad to begin with are slim (after all, all RAM is tested, often by more than one party). But this won't be just any RAM -- this will be, in today's terms, glacially slow RAM which has been tweaked to perfection over the past decade (or more), because the stuff that a Prius does just doesn't require anything lightning fast. (See, also: US space program.)
I'll go ahead and answer the question: The chances of bad RAM causing unintended and irrevocable acceleration and no other badness are about the same as bad RAM causing your PC to boot up and say "Hello, world!" instead of loading an OS. Could it happen? Why, sure! (In other news: A thousand monkeys and a thousand typewriters will, eventually, produce the complete works of Mark Twain as long as you replace the parts when they wear out.)
Will it happen? Ummm.......
Will it happen more than once? Uh. Erm. *ahem*
Hearing all these stories really makes me wonder. I live in Belgium, where cars with manual gearboxes are the norm, and I've had my car accelerate like nuts once (the pedal got stuck because of the floormat). I shifted to neutral, turned off the engine and used my momentum to get to the side of the road, where I could dislodge the mat.
Are manual gearboxes that rare in the States?
Wrong analogy. Windows does crash a lot. It should be "It reminds me of Windows users who say Linux isn't ready for the desktop".
Funny, this is the first time I ever saw a computer analogy used to explain a car problem in Slashdot. But, come to think of it, this is a rather neat analogy. Toyota is blaming their problems on driver error, Microsoft says third-party drivers are the only cause of crashes in Windows ever since XP came out.
Both of these corporations are *wrong* about that: any system should be resistant to outside errors.
A computer shouldn't crash just because a hardware driver fails. I have seen several Linux computers freeze when running some graphics applications (ATI cards are particularly prone to this), but you can still log in over the network and kill the offending application or, at worst, restart the windowing system. The fault with Windows is not the third-party hardware driver, it's the windowing system being built into the operating system.
Likewise, a car shouldn't depend entirely on one computer system for operation. Brakes, even with anti-lock, should have a hydraulic system that should always be able to stop the wheels from turning if the driver presses hard enough on the pedal. The transmission should have a mechanical lever that puts it into neutral. Steering should be operable by mechanical links from the wheel if the power-assisted system fails.
All this because a broken mechanical link or a leaking hydraulic system can be seen, or heard, but a software bug will remain lurking there undetected until it kills you.
The effect of random bit flips on software is going to be hard to define. Modern hardware probably has all of the code running in RAM, not ROM as it would have been back in the 80's. A bit flip in a register could cause very odd things to happen. Perhaps someone coded a loop like:
    for (i = 0; i != 10; i++)
        do_something();
Flip a bit in the register and that loop will not terminate until the register overflows.
I don't think you can code so that random bit flips will not be a problem. The hardware needs to be robust enough to catch them and either fix them or at least throw an error so that things can be reloaded.
I haven't looked at the communications protocols in use between the various modules, but it wouldn't surprise me if there were a lot of possibilities for errors in there as well. Software engineers will put a lot of reliance on "checksums" and swear up and down that there is no possibility for things to go wrong, but in the end it turns out the checksums used are not very robust. TCP/IP checksums, for example, are almost worthless, but most TCP/IP communication takes place over links with robust checksums, so they're not tested very much. I implemented very simple links (TCP/IP over a VME bus - don't ask, it was a wacky idea) and found that single bit errors in the hardware could get through a single layer of the checksums quite easily (that is, it would pass the IP checksums but the TCP checksums would catch things).
I wouldn't say error; it was designed with parity protection only, so it was incapable of correcting single bit errors, only detecting them. Hence the reason for the crashes (i.e., it detected a bit flip). If two bits were flipped you would never know.
I worked in the Sun front line call support during this time, and explaining this over and over to customers was somewhat painful. Never mind the years of mocking that still come from telling customers "it was a cosmic ray". Sun put massive effort into tracking, diagnosing and fixing this issue though. Some customers got versions of CPUs with "mirrored" SRAMs. Sad to say, I remember customers still getting errors with those.....
The US-III chips came out with end to end ECC protection, but the problems remained. In the end it turned out to be a host of socket mounting, pin contact and design specification issues that caused the errors, mostly solved by the time the 1200MHz CPUs were out. I wouldn't be surprised if it was something similar with the US-II.
As for Toyota, if they don't have end to end ECC they only have themselves to blame.
The last few process generations of DRAM have not become more susceptible to radiation induced soft errors as originally predicted but instead have leveled off or even gotten a little better. CPU static RAM based cache has an order of magnitude higher susceptibility for a number of different reasons but there, ECC (or parity for instruction cache since bad instructions can just be reloaded) has been routine for quite a while. Larger memory sizes make systems as a whole more susceptible though and the cosmic ray induced soft error rate is measurable on modern PCs with altitude making a difference of at least 2 orders of magnitude. Sea level has about 1/10th the rate of Denver which has about 1/10th the rate of a cruising passenger jet airplane.
For DRAM, I suspect what is going on is that the smaller charge storage volume means that any given ionization event is spread over more cells while each cell's higher charge density makes it less susceptible.
I have had full ECC support on my last three home workstations (P3 1GByte, P4 2GByte, and now a Phenom 2 8GByte since Intel was not an option) but have not recorded enough events to draw a meaningful conclusion.
Before the engine was started, the main micro would cycle the throttle plate. The current draw of the H-bridge would get monitored during that test to verify the spring was working. The whole test took 500 ms. So a broken spring would get detected, and the car would be put into "limp home" mode, where the engine was only allowed to idle.
The whole thing was crazy like that. There were so many tests.
It's quite common actually, and many documented studies have proven it does occur. You don't hear much because well, the effects are minimal in most cases. A flipped bit in RAM does nothing if it's just unused memory, for example. Or maybe it flips the bit in an unused register (that's getting reloaded with new data). Or alters the result of an unused computation unit. Heck, there were old RAM chips made with somewhat radioactive encapsulation - the computers they were in crashed more frequently than normal.
Other times, it may show up as a graphical glitch in a game - a fiddly pixel that goes away on next refresh, or other unnoticed operation. If it damages a critical data structure, well, an application just crashes. If it gets really lucky and gets a crucial kernel data structure, then the computer crashes/panics/BSODs.
The amount of data damaged is on the order of a bit. Depending on the whole system, that bit could be nothing (i.e., unused), unnoticeable (a flicker in a pixel in the framebuffer), or crucial (application/OS crashes).