Toyota's Killer Firmware
New submitter Smerta writes "On Thursday, a jury verdict found Toyota's ECU firmware defective, holding it responsible for a crash in which a passenger was killed and the driver injured. What's significant about this is that it's the first time a jury heard about software defects uncovered by a plaintiff's expert witnesses. A summary of the defects discussed at trial is interesting reading, as well as the transcript of court testimony. 'Although Toyota had performed a stack analysis, Barr concluded the automaker had completely botched it. Toyota missed some of the calls made via pointer, missed stack usage by library and assembly functions (about 350 in total), and missed RTOS use during task switching. They also failed to perform run-time stack monitoring.' Anyone wonder what the impact will be on self-driving cars?"
I'm convinced. I'll give up my career as a computer programmer now, and go use my bare hands for subsistence farming now. Sorry, I was wrong.
Those working on self-driving cars and those that are watching the technology already know that any such car would need an absolutely 100% rock solid OS.
This changes nothing.
The owner of a self-driving car will have had to accept the EULA and agree not to hold the manufacturer liable for software defects. (half joking but I wouldn't rule it out)
Sure, they will be more safe. Just like in the aviation industry, where each incident/crash is investigated meticulously, and flying has become safer ever since 1903. With non-selfdriving cars 99% of the incidents were caused by human error. Now no more, so we can fix it!
2nd link, 5th paragraph:
Anyone wonder what the impact will be on self-driving cars?
A longer chapter on debugging in the first edition of "Programming Self-Driving Cars: The Missing Manual."
Brought to you by Carl's Junior.
If there's no human fall back or ability to overthrow the computer's control of the car I'll never drive it. I don't think this will change anything except maybe give the people that are rushing for self-driving cars some pause. Every developer knows the risks of self-driving computer controlled cars (if they don't, well they're naive). Between human error in programming and human maliciousness, there are two camps. People who think they can overcome the possibilities of putting a semicolon in the wrong place and prevent hackers from compromising the software's integrity. And people who realize the first people are fooling themselves.
Your post demonstrates a complete lack of understanding of what JIT manufacturing (i.e. lean) is and what it tries to accomplish. Hint: it's not about doing more with less. Further, you either willingly fail to mention Kaizen (continuous improvement) or just aren't aware that THIS is the heart and soul of the true Toyota Way.
Whatever the reasons they failed in software engineering, neither JIT nor Kaizen would be to blame because neither of those try to nor should they translate to "engineer badly".
'Although Toyota had performed a stack analysis, Barr concluded the automaker had completely botched it. Toyota missed some of the calls made via pointer, missed stack usage by library and assembly functions (about 350 in total), and missed RTOS use during task switching. They also failed to perform run-time stack monitoring.'
Huh? I'm a software engineer and don't understand the relevance of this statement, how can a jury? How does it confirm that there was a defect?
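For anyone else wondering what "run-time stack monitoring" in the quoted testimony refers to: one common approach is to paint the stack with a known pattern at start-up and periodically check how much of the pattern survives. Here is a minimal sketch, assuming a descending stack modelled as a plain byte array. The names stack_paint, stack_unused_bytes, stack_margin_violated and the 0xA5 fill value are all made up for illustration and are not taken from Toyota's code or from any particular RTOS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u  /* hypothetical paint pattern */

/* Paint the whole stack region once at start-up, before the task runs. */
void stack_paint(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    for (size_t i = 0; i < size; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count the bytes that still hold the paint pattern, scanning up from the
 * low end. Assumes a descending stack: usage starts at stack[size - 1] and
 * grows toward stack[0], so untouched bytes sit at the low addresses. */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}

/* Returns 1 if the remaining headroom has dropped below min_free bytes. */
int stack_margin_violated(const uint8_t *stack, size_t size, size_t min_free)
{
    return stack_unused_bytes(stack, size) < min_free;
}
```

A periodic task would call stack_margin_violated() and log or reset when it returns 1. The count is only an estimate, since a legitimately stored byte that happens to equal the fill value looks like unused space, but it catches the case a static analysis missed before the stack runs into whatever sits next to it.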
My God can beat up your God. Just kidding...don't take offense. I know there's no God.
Any engineering project requires that the engineers have to answer for what they've done. The mantra is, "As an engineer, if you fuckup, someone dies." Every engineer, regardless of discipline, needs to understand this and if they don't, should consider going into Liberal Arts or something equally useless where the worst they can do is fuck up my food or drink order.
some karma... and kinda lukewarm about it.
"If there's no human fall back or ability to overthrow the computer's control of the car I'll never drive it."
by definition you wouldn't be driving it.
The Kruger Dunning explains most post on
Trust? No, I'd want to see test results. Believe that it's possible? Hell yes.
You mean humans, who get it wrong 10 million times a year in the USA alone?
10M accidents out of 250M drivers isn't a very good error rate.
Good lord, they have got to be kidding? If Toyota (or their parts suppliers) are making those kinds of errors, you can bet your ass that other manufacturers will be making them as well.
There needs to be a very strict set of standards for car control systems. We have standards for OBD, so why not strict, over-engineered and thoroughly coded critical systems standards? Even better, why not make them open standards, including the hardware?
Standardising would make parts cheaper as well as stopping manufacturers from building closed black box units that may be of dubious quality. It would also make it easier to maintain and repair modern cars as they get older, and allow third parties to provide new hardware long after the manufacturer loses interest.
As an aside, I do wonder what we're going to do in ten years time when the failure rate for most of the control hardware starts creeping up. Would they fail safely? Would the repair cost be prohibitive?
It would be a sad irony if these environmentally conscious efficiency improving measures resulted in cars being scrapped en masse because the ECU that superseded a $10 throttle cable costs a grand.
Actually, there is absolutely zero proof that they did fail.
NASA certainly could not find any way to fault the system.
What this decision is based around is a bunch of technical argument that they could have tried harder to prove that the system could not fail, but with absolutely zero proof that it does or even can fail. No procedure to make the software fail was presented, no theory of a set of inputs that could result in the theorised output was presented, only a critique of their testing and analysis procedure that poked a few holes in that.
This is a VERY concerning direction for programmers in the USA, as of course complex software by definition cannot be proven correct (at least there currently exists no known way). It opens the door for all sorts of development-process based litigation, which is a very very bad direction for things to take.
Again, so far ZERO evidence, proof, or test case has been provided that the software is in any way responsible for this problem.
It's an ECU, not a desktop. All those latencies you're used to are OK when you're browsing the internet or programming the Next Big Thing, but they are not acceptable when you're adjusting fuel ratios, timing detonations, responding to impact sensors etc.
You clearly have no idea what you're on about, or why real-time operating systems are real things that have actual niche use.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Ha ha, classic:
"Svenson (1981) surveyed 161 students in Sweden and the United States, asking them to compare their driving safety and skill to the other people in the experiment. For driving skill, 93% of the US sample and 69% of the Swedish sample put themselves in the top 50% (above the median). For safety, 88% of the US group and 77% of the Swedish sample put themselves in the top 50%." cite.
I'd be happy with a car OS that kills less than 30,000 people per year.
If a car manufacturing defect kills anybody at all, then there should be manufacturer's liability for it.
They don't get a free pass just because of the kind of manufacturing defect, there's no privilege against liability just because it's a software defect.
-wb-
There was a time after automated elevators first came out when people refused to use them because they didn't trust them without a "human fall back or ability to overthrow the computer's control". Today, when nearly all the elevators we've ever seen were automated, this seems crazy.
In 50 years, when most people have never seen a manually operated car, we'll seem just as crazy for not trusting them.
Did you read TFA?
In a nutshell, the team led by Barr Group found what the NASA team sought but couldn’t find: “a systematic software malfunction in the Main CPU that opens the throttle without operator action and continues to properly control fuel injection and ignition” that is not reliably detected by any fail-safe.
That's proof, not an argument that they could have tried harder to find the system could fail. The bottom line is that it's software that puts people's lives at risk. It's reasonable to hold that type of code to a higher standard. There are millions of other cars, trains, and planes out there with similar software but without this type of problem. At some point you should be responsible for the things you create.
Oh certainly, there's lots of reasons to have all sorts of things wireless, and I fully expect all the fancy media systems, etc to go that route. I just don't think the autopilot will be so, any more than the engine control module is today. A wireless kill switch is one thing, but that doesn't need to be connected to the autopilot, just its power line. And so long as the producers aren't shielded from liability for faulty security I would expect them to take a heavily safe route.
That's not to say that I would be surprised by a networked navigation computer/robotic chauffeur/etc. I just don't think there is any reason to integrate it into the autopilot. There's no reason it couldn't just relay navcomp style "turn left in 1/4 mile" type instructions over a simple high-security text mode serial link with an extremely limited vocabulary. So long as the autopilot itself is heavily defended against intrusion the worst that's likely to happen is that a distracted passenger gets driven to a dangerous destination (the observant passenger would presumably flip the override switch)
Actually, for nefarious purposes the ideal autopilot hack would likely be to simply swerve suddenly into oncoming traffic, preferably into a cement truck or something, in which case it will all be over before a human could even reach the override switch - so perhaps an override delay would be advisable to prevent a panicked rider from screwing up the collision avoidance while still giving them time to take over for any less immediate threats. Maybe a two-stage switch - flip off the autopilot, then 20 seconds later press the button on the wheel to confirm that you really mean it and are in control - just to avoid the scenario where a panicked person tries to take control, gets stunned/unnerved/disoriented by the extreme recovery maneuverings, and proceeds to drive themselves off a cliff.
In fact we probably want multiple autopilot settings - On and Off of course, but also "panic mode" where the autopilot takes over when a collision is imminent but still avoidable - great for when the kids are learning to drive, or you decide to go for a drive after you've had a few. And maybe something like a co-piloted "driving instructor mode" as well.
--- Most topics have many sides worth arguing, allow me to take one opposite you.
> Again, so far ZERO evidence, proof, or test case has been provided that the software is in any way responsible for this problem.
I had a car that didn't have a tape deck and only five buttons for the radio. ...
And we LIKED it.
-- Tigger warning: This post may contain tiggers! --
Just in case that wasn't enough:
Vehicle tests confirmed that one particular dead task would result in loss of throttle control, and that the driver might have to fully remove their foot from the brake during an unintended acceleration event before being able to end the unwanted acceleration. A litany of other faults were found in the code, including buffer overflow, unsafe casting, and race conditions between tasks.
That will be feasible in software when signoff by the equivalent of a PE is required. If PEs couldn't hold a project hostage until it was actually safe, we'd see a lot more cut corners by management. In software, nothing prevents the corner cutting currently.
A software engineer who attempts to dig in and demand more QA and debugging time will be reassigned (possibly to the unemployment line).
I've been reading the transcript. It's fantastic. The expert explains clearly and lucidly in terms that (I imagine are) understandable by non-techies.
The transcriber made some funny mistakes... Let me tell you about "parody bits" and "pointer D references" :)
Elevators have a mechanical safety that you as a passenger have no control over, so it doesn't address neoritter's demand for a human fall back. And that mechanical safety only protects you from a cable failure. It does nothing to protect you from out of control elevator computers bouncing you up and down the shaft.
And such a device could easily be put on a car.
Which device, a big red stop button? That's only true for stopping the engine. It wouldn't work for steering or brakes, as would be needed in a self-driving car.
It's also presumptuous to assume his fear is irrational. He stated his reasons (and he sounds like a programmer, so he's not just talking about a bogey man he doesn't understand). If you disagree with him it doesn't necessarily mean his fear is irrational.
Couple of details here:
Toyota had no software testing procedures, no peer review, etc. The secondary backup CPU code was provided by a third party in compiled form; Toyota never examined it.
Their coding standards were ad hoc and they failed to follow them. Simple static analysis tools found massive numbers of errors.
They used over ten thousand global variables, with numerous confirmed race conditions, nested locks, etc.
Their watchdog merely checked that the system was running and did not respond to task failures or CPU overload conditions, so it would not reset the ECU even if most of the tasks crashed. Since this is the basic function of a watchdog, they may as well not have had one.
They claimed to be using ECC memory but did not, so anything from single bit errors to whole page corruption went undetected and uncorrected.
A bunch of logic was jammed in one spaghetti task that was both responsible for calculating the throttle position, running various failsafes, and recording diagnostic error codes. Any failure of this task was undetected by the watchdog and disabled most of the failsafes. Due to no ECC and the stack issue below, a single bit error would turn off the runnable flag for this task and cause it to stop being scheduled for CPU time. No error codes would be recorded.
They did not do any logging (eg of OS task scheduler state, number of ECU resets, etc), not even in the event of a crash or ECU reset.
The code contained various recursive paths and no effort was made to prevent stack overflows. Worse, the RTOS kernel data structures were located immediately after the 4K stack, so stack overflows could smash these structures, including disabling tasks from running.
They were supposed to be using mirroring of variables to detect memory smashing/corruption (write A and XOR A to separate locations, then compare them on read to make sure they match). They were not doing this for some critical variables for some inexplicable reason, including the throttle position so any memory corruption could write a max throttle value and be undetected.
Instead of using the certified, audited version of the RTOS like most auto makers, they used an unverified version.
Thanks to not bothering to review the OS code, they had no idea the OS data structures were not mirrored. A single bit flip can start or stop a task, even a life-safety critical one.
These are just some of the massive glaring failures at every level of specifying, coding, and testing a safety-critical embedded system.
I am now confident in saying at least some of the unintended acceleration events with Toyota vehicles were caused by software failures due to gross incompetence and negligence on the part of Toyota. They stumbled into writing software, piling hack on top of hack, never bothering to implement any testing, peer review, documentation, specifications, or even the slightest hint that they even considered the software something worth noticing.
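To make the watchdog point above concrete, here is a minimal sketch of a watchdog that refreshes only when every monitored task has checked in during the current period, so a single dead task eventually forces a reset. Everything in it is an assumption for illustration: the task count, the function names, and the hw_kicks counter standing in for the hardware refresh register. It is not how Toyota's ECU or any specific RTOS does it, and in real firmware the flag updates would need to be atomic or interrupt-protected.

```c
#include <assert.h>
#include <stdint.h>

#define TASK_COUNT 3u                            /* hypothetical number of monitored tasks */
#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u) /* one bit per monitored task */

static uint32_t checkin_flags; /* bit i set = task i reported alive this period */
static unsigned hw_kicks;      /* stand-in for the hardware watchdog refresh */

/* Called by each monitored task once per loop iteration. */
void wdt_task_checkin(unsigned task_id)
{
    assert(task_id < TASK_COUNT);
    checkin_flags |= (1u << task_id);
}

/* Called once per watchdog period. The hardware is refreshed only if every
 * task checked in; otherwise the refresh is withheld and the hardware
 * watchdog would time out and reset the ECU. Returns 1 if refreshed. */
int wdt_service(void)
{
    int all_alive = (checkin_flags == ALL_TASKS_MASK);
    if (all_alive) {
        hw_kicks++; /* real firmware would write the refresh key to the peripheral here */
    }
    checkin_flags = 0u; /* start a fresh observation period */
    return all_alive;
}

/* Number of refreshes issued so far (for inspection in this sketch). */
unsigned wdt_kick_count(void)
{
    return hw_kicks;
}
```

The design choice is that the refresh is the thing that has to be earned: a task that stops being scheduled simply never sets its bit, and nothing else in the system needs to notice for the reset to happen.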
Natural != (nontoxic || beneficial)
If you read the sentence before that: As single bits in memory control each task, corruption due to HW or SW faults will suspend needed tasks or start unwanted ones. It only took a single bit in non-error-detecting RAM getting flipped to cause that particular fault, something that could easily happen due to cosmic rays or minor radioactive contamination in the ECU packaging - and that's before you even take into account all the other potentially memory-trashing code. It's more like a manufacturer deciding not to ground the case at all and just hoping nothing will come loose and short to it.
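The variable mirroring mentioned earlier in the thread is the usual software defence against exactly this kind of single-bit flip when the RAM has no error detection. A minimal sketch follows, assuming "XOR A" means the bitwise complement of the value; the type and function names are hypothetical, and the two copies sit in one struct only for brevity, whereas real firmware would place them in separate memory regions.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t value;   /* primary copy */
    uint16_t inverse; /* bitwise complement of the primary copy */
} mirrored_u16;

/* Store the value together with its complement. */
void mirrored_write(mirrored_u16 *m, uint16_t v)
{
    assert(m != 0);
    m->value = v;
    m->inverse = (uint16_t)~v;
}

/* Returns 1 and stores the value in *out if the two copies still agree.
 * Returns 0 if they do not, which means one of them was corrupted;
 * *out is left untouched in that case. */
int mirrored_read(const mirrored_u16 *m, uint16_t *out)
{
    assert(m != 0 && out != 0);
    if ((uint16_t)(m->value ^ m->inverse) != 0xFFFFu) {
        return 0;
    }
    *out = m->value;
    return 1;
}
```

For example, after mirrored_write(&throttle, 1234), flipping any single bit of throttle.value makes mirrored_read() return 0, so the caller can fall back to a safe default instead of acting on a corrupted throttle value. This detects corruption but cannot correct it, which is one reason it complements rather than replaces ECC.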