How Your Compiler Can Compromise Application Security
jfruh writes "Most day-to-day programmers have only a general idea of how compilers transform human-readable code into the machine language that actually powers computers. In an attempt to streamline applications, many compilers actually remove code that they perceive to be undefined or unstable and, as a research group at MIT has found, in doing so can make applications less secure. The good news is the researchers have developed a model and a static checker for identifying unstable code. Their checker is called STACK, and it currently works for checking C/C++ code. The idea is that it will warn programmers about unstable code in their applications, so they can fix it, rather than have the compiler simply leave it out. They also hope it will encourage compiler writers to rethink how they can optimize code in more secure ways. STACK was run against a number of systems written in C/C++ and it found 160 new bugs in the systems tested, including the Linux kernel (32 bugs found), Mozilla (3), Postgres (9) and Python (5). They also found that, of the 8,575 packages in the Debian Wheezy archive that contained C/C++ code, STACK detected at least one instance of unstable code in 3,471 of them, which, as the researchers write (PDF), 'suggests that unstable code is a widespread problem.'"
Humans write unstable code.
Running static analysis tools on a whole repository gives lots of warnings.
Who'da thunk it?
If my C code contains *foo=2; the compiler can't just leave that out. If my code contains if (foo) { *foo=2; } else { return EDUFUS; } it can verify that my code is checking for NULL pointers. That's nice; but the questions remain:
What is "unstable code" and how can a compiler leave it out? If the compiler can leave it out, it's unreachable code and/or code that is devoid of semantics. No sane compiler can alter the semantics of your code, at least no compiler I would want to use. I'd rather set -Wall and get a warning.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
"Unstable" code is not a technical term used by any self-respecting programmer. Researchers love to make up terms that nobody but themselves use. Props to the MIT News article for correctly avoiding that term.
Compilers have been ignoring meatspace problems for years. It's well known that most compilers will both ignore some bad chunks of code and do their own optimizations (like unrolling).
If the binaries it compiles to work as intended and pass validation, what's the issue? The compiler being a point of trust is something that's been rehashed constantly with people continually reposting the 70 year old ken article
Since C/C++ is fairly liberal about allowing undefined behavior
No, it's not. The language forbids undefined behavior. If your program invokes undefined behavior, it is no longer well-formed C or C++.
I'd rather set -Wall and get a warning.
There are some undefined behaviors that can't be detected so easily at compile time, at least not without a big pile of extensions to the C language. For example, if a pointer is passed to a function, is the function allowed to dereference it without first checking it for NULL? The Rust language doesn't allow assignment of NULL to a pointer variable unless it's declared as an "option type" (Rust's term for a value that can be a pointer or None).
Compilers ought to have switches that deliberately branch to the error cases they're trying to optimize away. Getting rid of a divide by zero? Force the error instead so it gets attention. Coder forgot to declare volatile variables? Make local static shadow copies of static variables for comparison at every reference. And so on. Development environments ought to be helping with this stuff, not confounding developers.
An example of "unstable code":
char *a = malloc(sizeof(char));
*a = 5;
char *b = realloc(a, sizeof(char));
*b = 2;
if (a == b && *a != *b)
{
launchMissiles();
}
A cursory glance at this code suggests missiles will not be launched. With gcc, that's probably true at the moment. With clang, as I understand it, this is not true: missiles will be launched. The reason for this is that the spec says that the first argument of realloc becomes invalid after the call, therefore any use of that pointer has undefined behaviour. Clang takes advantage of this, and defines the behaviour of this to be that *a will not change after that point. Therefore it optimises if (a == b && *a != *b) into if (a == b && 5 != *b). This clearly then passes, and missiles get launched.
The truth here is that your compiler is not compromising application security – the code that relies on undefined behaviours is.
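For what it's worth, here is a minimal sketch of the pattern that sidesteps the whole question: once realloc succeeds, only the returned pointer is ever used again. The helper name resize_buffer is made up for illustration and does not come from the article or the paper.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: resize *buf to new_size bytes.
 * Returns 0 on success; returns -1 on failure, in which case
 * *buf is left untouched and still valid. */
int resize_buffer(char **buf, size_t new_size)
{
    assert(buf != NULL && new_size > 0);

    char *tmp = realloc(*buf, new_size);
    if (tmp == NULL)
        return -1;      /* realloc failed: the old block is still owned by *buf */

    *buf = tmp;         /* from here on, the old pointer value is never read */
    return 0;
}
```

Because the caller only ever holds one pointer, there is no stale copy left around to compare or dereference.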
"To understand unstable code, consider the pointer overflow check buf + len < buf shown in Figure 1 .. While this check appears to work with a flat address space, it fails on a segmented architecture" ref
Do you think most-all exploits are down to the defective x86 segmented memory architecture?
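A bounds check like the one in the quoted passage can be written without ever forming an out-of-range pointer, assuming the caller knows where the buffer ends. This is only a sketch; range_fits and buf_end are hypothetical names, not taken from the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if len bytes starting at buf fit before buf_end.
 * Precondition (assumed): buf and buf_end point into the same array,
 * with buf <= buf_end. */
bool range_fits(const char *buf, const char *buf_end, size_t len)
{
    assert(buf != NULL && buf_end != NULL && buf <= buf_end);

    /* Compare lengths, not pointers: buf + len is never computed,
     * so there is nothing for the optimizer to assume away. */
    return len <= (size_t)(buf_end - buf);
}
```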
I haven't heard of any compiler that removes code just because it contains undefined behavior. All compilers I know of leave it in, and whether it misbehaves at run-time or not is... well, undefined. It may work just fine, eg. dereferencing a null pointer may just give you a block of zeroed-out read-only memory and what happens next depends on what you try to do with the dereferenced object. It may immediately crash with a memory access exception. Or it may cause all mounted filesystems to wipe and reformat themselves. But the code's still in the executable. I know compilers remove code that they've determined can't be executed, or where they've determined that the end state doesn't depend on the execution of the code, and that can cause program malfunctions (or sometimes cause programs to fail to malfunction, eg. an infinite loop in the code that didn't go into an infinite loop when the program ran because the compiler'd determined the code had no side-effects so it elided the entire loop).
I'd also note that I don't know any software developers who use the term "unstable code" as a technical term. That's a term used for plain old buggy code that doesn't behave consistently. And compilers are just fine with that kind of code, otherwise I wouldn't spend so much time tracking down and eradicating those bugs.
What is "unstable code" and how can a compiler leave it out?
The article is actually using that as an abbreviation for what they're calling "optimization-unstable code", or code that is included at some specified compiler optimization levels, but discarded at higher levels. Basically they think it's unstable due to being included or not randomly, not because the code itself necessarily results in random behaviour.
If I set -Wall and the compiler fails to warn me that it optimized out a piece of my code then the compiler is wrong. Period. Full stop.
I don't care what "unstable" justification its authors gleaned from the standard, don't mess with my code without telling me you did so.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
No wonder my app came out with 0 bytes.
Table-ized A.I.
Back in the day when I was doing C++ work, I used a product called PC Lint (http://www.gimpel.com/html/pcl.htm) that did basically the same thing STACK does. Static analysis of code to find errors such as referencing NULL pointers, buffer overflows, etc... Maybe they should teach History at MIT first...
Visit the Arcade Restoration Workshop @ http://www.arcaderestoration.com
It's a pretty cool critter, but I don't know if they actually sell it as a product. It might be something that they only use internally:
http://www.research.ibm.com/da/beam.html
http://www.research.ibm.com/da/publications/beam_data_flow.pdf
Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
Yes it leads to real bugs - Brad Spengler uncovered one of these issues in the Linux kernel in 2009 and it led to the kernel using the -fno-delete-null-pointer-checks gcc flag to disable the spec correct "optimisation".
Another, more common example of code optimizations causing security problems is this pattern:
int a = [some value obtained externally];
// integer overflow occurred ...
int b = a + 2;
if (b < a) {
}
The C spec says that signed integer overflow is undefined. If a compiler does no optimization, this works. However, it is technically legal for the compiler to rightfully conclude that two more than any number is always larger than that number, and optimize out the entire "if" statement and everything inside it.
For proper safety, you must write this as:
int a = [some value obtained externally];
// integer overflow will occur ...
if (INT_MAX - a < 2) {
}
int b = a + 2;
Check out my sci-fi/humor trilogy at PatriotsBooks.
YOU SUNK MY BATTLESHIP!
Science advances one funeral at a time- Max Planck
The TFA links to the actual paper. Maybe you should read that.
Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
I don't need to test my programs.. I have an error correcting modem.
"What every C programmer should know about undefined behaviour" (part 3, see links for first 2 parts).
For example, overflows of unsigned values is undefined behaviour in the C standard. Compilers can make decisions like using an instruction that traps on overflow if it would execute faster, or if that is the only operator available. Since overflowing might trap, and thus cause undefined behaviour, the compiler may assume that the programmer didn't intend for that to ever happen. Therefore this test will always evaluate to true, this code block is dead and can be eliminated.
This is why there are a number of compilation optimisations that gcc can perform, but which are disabled when building the linux kernel. With those optimisations, almost every memory address overflow test would be eliminated.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
This project is an NSA wet dream! Essentially its a factory for creating CVEs...
The C standard needs to meet with some realities to fix this issue. The C committee wants their language to be usable on the most esoteric of architectures, and this is the result.
The reason that the results of signed integer overflow and underflow are not defined is that the C standard does not require the machine to be two's complement. Same for 1 << 31 and the negative of INT_MIN being undefined. When was the last time that you used a machine whose integer format was one's complement?
Here are the things I think should change in the C standard to fix this:
* Fixation of two's complement as the integer format.
* For signed integers, shifting left a 1 bit out of the most-significant bit gets shifted into the sign bit. Combined with the above, this means that for type T, ((T) 1) << ((sizeof(T) * CHAR_BIT) - 1) is the minimum value.
* The result of signed addition, subtraction, and multiplication are defined as conversion of all promoted operands to the equivalent unsigned type, executing the operation, then converting the result back. (In the case of multiplication, the high half is chopped off. This makes signed and unsigned multiplication equivalent.)
* When shifting right a signed integer, each new bit is a copy of the sign bit. That is, INT_MIN >> ((sizeof(int) * CHAR_BIT) - 1) == -1.
That should fix most of these. Checking a pointer for wraparound on addition, however, is just dumb programming, and should remain the programmers' problem. Segmentation is something that has to remain a possibility.
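The "convert to unsigned, operate, convert back" rule proposed above can already be approximated by hand today. A rough sketch for 32-bit addition follows; wrapping_add32 and to_signed32 are made-up names, and the conversion back is spelled out arithmetically so it does not lean on implementation-defined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Map a uint32_t bit pattern to the int32_t with the same
 * two's complement representation, using only in-range arithmetic. */
static int32_t to_signed32(uint32_t u)
{
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;
    /* u is in [2^31, 2^32 - 1]; UINT32_MAX - u is in [0, INT32_MAX]. */
    return -1 - (int32_t)(UINT32_MAX - u);
}

/* Signed addition with wraparound, done in unsigned arithmetic,
 * which the standard defines as modulo 2^32. */
int32_t wrapping_add32(int32_t a, int32_t b)
{
    return to_signed32((uint32_t)a + (uint32_t)b);
}
```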
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
The article doesn't summarize this very well, but the paper (second link) provides a couple examples. First up:
They then give another example, this time from the Linux kernel:
The basic issue here is that optimizers are making aggressive inferences from the code based on the assumption of standards-compliance. Programmers, meanwhile, are writing code that sometimes violates the C standard, particularly in corner cases. Many of these seem to be attempts at machine-specific optimization, such as this "clever" trick from Postgres for checking whether an integer is the most negative number possible:
The remainder of the paper goes into the gory Comp Sci details and discusses their model for detecting unstable code, which they implemented in LLVM. Of particular interest is the table on page 9, which lists the number of unstable code fragments found in a variety of software packages, including exciting ones like Kerberos.
Er... so your conclusion is that we should all "run screaming" from C/C++ and hire a bunch of people who "truly know what they're doing... on all levels of detail" with C/C++, who are liable to be a relatively limited number and command top wages.
Top plan! One would almost think you view yourself as one of those people who "truly known what they're doing"! Even "on all levels of detail!"
The third alternative is "Elucidate how compilers are (rightfully, according to the standard) introducing potential dangers, and educate people accordingly", on whatever level you feel that involves. Alas, that alternative never occurred to you. Which, given the kind of abstractions a paper I read recently pointed out -- where Clang and G++ acted quite differently and equally dangerously -- is a bit of a pity since I strongly suspect you'd probably be caught out by an unexpected trap too.
(If you think you wouldn't you've never developed professionally for a living. We all have, and no-one should be ashamed to say they have.)
Is this a really pathetic way of saying "HEY I WORKED FOR IBM!!!!!!!!!!!!!!!!!!!!" without trying to be quite so clear about it?
Gee, I wish I programmed all the things you program! Then I'd NEVER need anything but unsigned integers! Ah, there but for the grace of God...
(In case you missed the subtext: prick. You aren't everyone and the 99% of the things *you* program probably form the 1% of the things that *I* program, and that does not make what I program worthless any more than it makes what you program worthless... unlike you, since you're an arrogant piece of shite.)
For example, overflows of unsigned values is undefined behaviour in the C standard.
I'm glad I didn't know that when I used to play with software 3d engines back in the 90s. 16-bit unsigned integer "wrap around" was what made my textures tile. I do seem to vaguely recall that there was a compiler flag for disabling integer traps and that I disabled it. It was Microsoft's C compiler, and it's been a loooooong time.
OK, I'm looking through the options on the 2005 free Visual Studio... I can find a flag to disable floating point traps, but not integer. Maybe the full version lets you do that. I used to have the full version. I suppose if it were really important I could track down the magic assembly voodoo incantation to do it. I'm guessing the MS disables integer overflow traps by default...
That code is bad for many reasons, not the least of which is that it's semantically ambiguous whether the result of malloc() should be assigned to a or *a.
However, the compiler here clearly can't make any valid assumptions about the contents of *a following the realloc. That's what undefined means: it holds a value about which you can't make any assumptions. Because the behavior is undefined, no *valid* optimization is possible.
Clang is wrong. If it's smart enough to recognize the undefined behavior then it should (a) warn the user and (b) make no optimization attempts to any code which later references *a.
I'd rather set -Wall and get a warning.
I see your -Wall, and raise you a -Werror -pedantic
Overflows of unsigned values are well-defined in C (they wrap). (Technically the standard says unsigned values can't overflow because they're wrapped)
Overflows of signed values are undefined.
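A small sketch of what "unsigned values wrap" buys you in practice, along the lines of the texture-tiling trick mentioned earlier in the thread. The function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Advance a 16-bit texture coordinate; the conversion back to
 * uint16_t reduces the result modulo 65536, so the texture tiles. */
uint16_t tile_coord(uint16_t u, uint16_t step)
{
    return (uint16_t)(u + step);
}

/* Plain unsigned arithmetic is defined modulo UINT_MAX + 1. */
unsigned wrap_increment(unsigned x)
{
    return x + 1u;
}
```

Both results are fully specified by the standard, so an optimizer cannot treat the wraparound case as impossible.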
A cursory glance at this code suggests missiles will not be launched. With gcc, that's probably true at the moment. With clang, as I understand it, this is not true: missiles will be launched.
It's not quite correct. a == b is not a use of the argument that has been invalidated. a was a variable containing an address of the object that was passed by value to the realloc() function.
In case the value of a is no longer valid, then the b = realloc ... assignment, would not have returned the same value;
therefore, a == b would evaluate to false, and with the short-circuit && operator, the *a != *b test would never have been executed.
The first mistake was using signed integers. unsigned integers always have well-defined overflow (modulo semantics), which means it's easier to construct safe conditionals
Not in C and C++ they don't. The compiler is allowed to perform that optimization with either signed or unsigned integers.
Checked out their git repo and did a build. They have a couple sketchy-looking warnings in their own code. A use of an uninitialized variable; storing a 35-bit value in a 32-bit variable...
lglib.c:6896:7: warning: variable 'res' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
lglib.c:6967:10: note: uninitialized use occurs here
plingeling.c:456:17: warning: signed shift result (0x300000000) requires 35 bits to represent, but 'int' only has 32 bits [-Wshift-overflow]
That's funny. My first takeaway is that the programmer is assuming malloc never fails. Let's get past that and assume that malloc and realloc both returned something. Most of us would assume it's unusual for realloc to do anything. We expect a==b to be true which makes (*a!=*b) impossible and the body of the if-block unreachable. So. I'm with you so far.
OK, if the spec says that a is undefined after the call to realloc, then IMHO the compiler should change the type of a from char * to UNDEFINED and complain. Based on what you're saying, it sounds like Clang is wrong. It sounds like they're treating undefined behavior as implementation defined behavior.
I'm sure somebody will correct me if I'm wrong on that one.
Somehow this reminded me of that slideshow on Deep C. After reading it, all I know is that I know nothing about C.
However, it is technically legal for the compiler to rightfully conclude that two more than any number is always larger than that number, and optimize out the entire "if" statement and everything inside it.
It's a good deal worse than that. The compiler is allowed to do ANYTHING. It can replace the code inside the if with code that sends all your customer data to your competitor. It can install a virus. Anything.
Under C99 all machines must be both 2s-complement and have 8-bit bytes. IIRC both fall out from inttypes.h. Word is this wasn't intentional, but it had been so long since anyone actually used other architectures that no one noticed that implication.
Socialism: a lie told by totalitarians and believed by fools.
Erm, I have to deal with negative numbers on a constant basis.
I once had some code that confused me when the compiler optimized some stuff out.
I had a macro that expanded to a parenthesized expression with several sub-expressions separated by commas that used a temp variable, e.g.:
#define m(a) (tmp = a, f(tmp) + g(tmp))
because the argument (a) could be an expression with side effects.
Now, I knew that the order of evaluation of function arguments wasn't defined, but I never read that as meaning that a compiler could optimize away parts of a function call such as: x(m(1), m(2)); this particular compiler effectively acted as if it was evaluating both arguments in parallel, thus the value of tmp was undefined throughout (I think it eliminated one of the initial assignments).
Changing it to an in-line function made it work; it had initially been code written for a compiler that didn't have in-line functions and was in the middle of a very tight loop.
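Roughly what the fix looks like, as a sketch: the macro's shared tmp is assigned twice with no sequencing between the two argument evaluations, while an inline function gives each call its own local. f, g and x here are trivial stand-ins I made up so the example is complete; the real ones were whatever the original code used.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real f, g and x. */
static int f(int v) { return v + 1; }
static int g(int v) { return v * 2; }
static int x(int p, int q) { return p * 100 + q; }

/* Replacement for: #define m(a) (tmp = a, f(tmp) + g(tmp))
 * Each call gets a private tmp, and the argument is evaluated
 * exactly once, so side effects in the argument are still safe. */
static inline int m(int a)
{
    int tmp = a;
    return f(tmp) + g(tmp);
}
```

With this version x(m(1), m(2)) is well-defined regardless of the order in which the two arguments are evaluated.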
Not in C and C++ they don't. The compiler is allowed to perform that optimization with either signed or unsigned integers.
I take back this statement... it is not correct, at least in C99.
Compilers are free to assume that the code does not contain undefined behaviour. This allows for better optimization. But things can get tricky. To give an example:
int my_divide(int a, int b) {
if (!b) diediedie("oh noes");
return a / b;
}
An overzealous optimization may move the division up (it's a high-latency instruction) all the way ahead of the diediedie call.
It'd be legal if diediedie is assured to return. The compiler has erroneously assumed that the lack of a noreturn attribute constitutes some guarantee here.
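One way to make the intent explicit, sketched under the assumption of a C11 compiler: mark the error routine _Noreturn so nobody, human or optimizer, has to guess whether control comes back. The body of diediedie is invented here, and the extra INT_MIN / -1 guard is my addition, since that quotient also overflows.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed implementation: report the message and terminate. */
_Noreturn void diediedie(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    abort();
}

int my_divide(int a, int b)
{
    if (b == 0)
        diediedie("oh noes: division by zero");
    if (a == INT_MIN && b == -1)
        diediedie("oh noes: quotient does not fit in int");
    return a / b;   /* only reached with a defined result */
}
```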
As for altering the semantics, the standard has a notion of abstract semantics and actual semantics, which need to agree at certain points. In other words, optimization and magic are allowed if the result is right. The compiler may substitute
for (i = 0; i != 3; i++) printf("%d", i);
With
fputs("012", stdout);
Special precautions are needed when you e.g. run a benchmark that is a big nop in essence.
No, the compiler is allowed to do anything it damn well pleases wherever the standard calls behaviour "undefined". One of my favorite quotes ever from a standards discussion:
When the compiler encounters [a given undefined construct] it is legal for it to make demons fly out of your nose
Nasal demons can cause code instability.
It really should be time that 99.9% of the code written ought not to be in languages that have undefined behaviour. It's time we all use languages which are fully defined.
Having said that, if something in code is undefined, and the compiler knows it, then it should generate an error. Very easily solved. If this STACK program is so clever, it should be in the compiler, and it should be an error to do something undefined.
In the movie Dark Star, our intrepid explorers travel the galaxy looking for "unstable planets" and blowing them up. Maybe a Dark Star compiler blows up unstable programs?
Anyway, Dark Star is a classic camp SF movie. Check it out!
My statement is contradictory. I recommended a course of action for undefined behavior, while maintaining that Clang is wrong for documenting a course of action for undefined behavior.
My understanding of "undefined behavior" in the C spec is that it means "anything can happen and the programmer shouldn't rely on what the compiler currently does". Of course, in the real world *something* must happen. If a 3rd party documents what that something is, the compiler is still compliant. It's the programmer's fault for relying on it.
OTOH, if the behavior was "implementation defined" then the compiler authors can define it. If they change their definition from one rev to another without documenting the change, then it's the compiler author's fault for not documenting it.
In other words:
undefined -- programmer's fault for relying on it.
implementation defined -- compiler's fault for not documenting it.
Who are you, a government accountant?
OK, if the spec says that a is undefined after the call to realloc, then IMHO the compiler should change the type of a from char * to UNDEFINED and complain. Based on what you're saying, it sounds like Clang is wrong. It sounds like they're treating undefined behavior as implementation defined behavior.
I'm sure somebody will correct me if I'm wrong on that one.
You're wrong on that one. :-)
First, let's start with this specific case. First of all, the type of a variable can't "change", because the type of a variable in languages without type-state stuff is static. (Aside: this is a useful way to think about the distinction between statically typed languages and dynamically typed ones -- in statically typed languages, variables have types, while in dynamically typed languages, values have "types.") In this case it's pretty easy to see how the compiler can deal with it, but in general it's not:
Can that program provoke undefined behavior? Depends on the conditions, which means it's undecidable in general. In the type viewpoint, what's the type during the ellipsis? Is it char* or is it inaccessible? It's char* down one branch but inaccessible down the other, and there's not a fully-general way to merge those two types (in a way that type-checking is still decidable).
Second, undefined behavior means the compiler is allowed to do anything -- it's less restrictive than implementation-defined. For implementation-defined behavior, the compiler needs to make a choice, stick with it (at least with a consistent set of compiler settings), and document it. For undefined behavior, the compiler can do anything it wants to, for any reason it wants to, can do different things in the same situation in different places because why the hell not, etc. -- the standard imposes no restrictions on what happens once undefined-behavior is triggered. See here for more.
This doesn't sound right to me. The intX_t types, if present, have to be more 2s-complementy, but they aren't really required to be present, as I recall.
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
If I tell the compiler to give me warnings, it detects a code whose behavior is undefined in the standard, but then fails to issue a warning then the compiler is broken. If it goes on to make a fancy assumption about the undefined behavior instead of letting it fall through to runtime as written then it's doubly broken.
Under C99 all machines must be both 2s-complement and have 8-bit bytes. IIRC both fall out from inttypes.h. Word is this wasn't intentional, but it had been so long since anyone actually used other architectures that no one noticed that implication.
You are incorrect. C99 (and C11) still explicitly allow two's complement, one's complement and sign-and-magnitude representation for signed types. You are probably confusing it with the type definitions int8_t, int16_t etc. which ARE required to be two's complement (if they exist). But the standard does not require those type definitions to exist.
, fucked up computer languages allow "undefined code", ie. C / C++.
Every language has some undefined behavior (and there are libraries with undefined behavior in every language), except maybe Ada.
Java leaves a wide area undefined when it comes to multi-threaded code.
Python has the same, plus it inherits some undefined behaviors from C.
C/C++ leaves a wide area undefined to support oddball system architectures. For example, if you have some memory that only can store floating point numbers, and some general-purpose memory, the address ranges might overlap - that's why pointer subtraction is undefined unless within an array. In practice most programmers can treat all memory as one contiguous byte array, but on special-purpose hardware you can still use C. Most of C's undefined behavior comes from the much wider variety of system architectures when C was young, but can still be useful for embedded systems.
OK, that explains why I've been getting away with assuming they wrap since the Clinton administration. I don't know if anybody ever explained it to me in C terms. I always assumed that behavior was baked in at the CPU level, and just percolated up to C. I never felt inclined to do any "bit twiddling" with int or even fixed-width signed integers because on an intuitive level it "felt wrong". What's that four-letter personality type thing? I'm pretty sure I had the I for "intuitive" there...
This makes no sense. The dereference is undefined, and therefore sk may be undefined iff tun IS null but not tun.
I.e. by the time execution reaches the if statement one of the two is true:
tun != null && sk == {something valid} -or-
tun == null && sk == {undefined}
sk being undefined is possible but that undefined-ness can't be used as a way to infer tun != null--the only thing that causes it is tun == null! It's illogical for the compiler to do what you say and remove the if check. The standard says sk can be undefined, therefore something being in an undefined state is possible, not that the compiler can presume that undefined is impossible to occur and put its hands over its ears and go la-la-la.
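Whatever one thinks of the compiler's reasoning, the ordering that gives it nothing to infer is easy to sketch: test the pointer first, dereference second. The struct and function names below are hypothetical stand-ins, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures under discussion. */
struct sock_stub { int ready; };
struct tun_stub  { struct sock_stub *sk; };

#define POLL_ERROR (-1)

int poll_checked(const struct tun_stub *tun)
{
    if (tun == NULL)            /* check before any dereference */
        return POLL_ERROR;

    const struct sock_stub *sk = tun->sk;   /* tun is known non-null here */
    if (sk == NULL)
        return POLL_ERROR;

    return sk->ready;
}
```

With the load of tun->sk placed after the check, no path dereferences tun before testing it, so the check cannot be classified as redundant.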
"You saved 1968." - Ms. Valerie Pringle to the crew of Apollo 8
Too many changes, dropping support for hardware that FreeBSD still supports (paid contracts) along with the issue that each run could not produce identical binaries. If you can't ensure duplicate binaries, you can't ensure that someone hasn't backdoored/trojan'd each and every binary on the system because the code produced is no longer human readable.
Mod me up/Mod me down: I wont frown as I've no crown
This blog post explains why they can't always warn about it.
First, undefined behaviour doesn't require a diagnostic. Even when it's theoretically possible to detect such behaviour at compile time, this isn't required.
Second, if your program exhibits undefined behaviour, then "as written" is meaningless; the behaviour is ... undefined. You probably meant "as intended", but the compiler cannot read your mind, isn't required to try, and in all probability won't try.
It's not quite correct. a == b is not a use of the argument that has been invalidated. a was a variable containing an address of the object that was passed by value to the realloc() function.
I also thought this first, but the standard seems to be quite picky about it. It is undefined behavior if "The value of a pointer that refers to space deallocated by a call to the free or realloc function is used". I interpret this so that just using the address value is UB, even if the pointed memory block is not accessed.
If the runtime moved memory around during a realloc, this code wouldn't work. However, you'd never notice if you use the same runtime all the time. This is why it's a good thing to compile/target different platforms and compilers, and to do a -Wall (or the equivalent) at every optimization level. You have to do it at every optimization level because some compilers only do checks like this during their optimization phase (gcc?).
This type of thing wouldn't get caught by any automated tools when I was doing C. Funny that there isn't a way to specify "this argument gets borked" in any language I can think of.
I examine all the code I write with a disassembler (alongside unit testing it), regardless of how time consuming this is, quality goes first. I don't see why others can't do the same?
That's an awfully narrow description of the kinds of behaviors that are undefined in C and C++. Here's an incomplete list of actions that will provoke undefined behavior in C++ off the top of my head, all of which are perfectly relevant and possible to accidentally do on every-day desktop architectures:
I'm lazy so I'll stop typing now.
That reminds me of this gem: Overflow in sorting algorithms
That little bug just sat around for a few decades before anyone noticed it.
Quick summary: (low + high) / 2
May have an overflow which is undefined behavior. Really every time we add ints it's possible. Just usually our values don't pass the MAX.
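The usual rewrite, as a sketch assuming non-negative indices with low <= high: take the difference first, so no intermediate value can exceed high. The function name is made up.

```c
#include <assert.h>

/* Midpoint of [low, high] without computing low + high.
 * Assumes 0 <= low <= high, which holds for array indices. */
int midpoint(int low, int high)
{
    assert(low >= 0 && low <= high);
    return low + (high - low) / 2;   /* high - low cannot overflow here */
}
```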
Cwm, fjord-bank glyphs vext quiz
What if: a = INT_MIN; externally?
Let's use 16 bit ints for readability. INT_MIN could be -32768; INT_MAX could be 32767.
if (INT_MAX - a < 2) {
if ( 32767 - (-32768) < 2 )
if ( [integer overflow] < 2 )
if ( a <= INT_MAX - 2 ) {
Our exchange just really illustrates the cluster fuckery of the bad language design.
Now, since extern int a; means this value's range may never be known to the compiler at compile time, indeed 'a' could be read from a file or user input, you must first check the value of every variable obtained externally before use to avoid integer overflow. This is, quite frankly, asinine & the very definition of "doing it wrong"(tm) when you consider that to avoid integer overflow you essentially have to write twice the code for nearly all logic, or provide "sanitization" on all variables not locally defined (eg: function parameters); Instead, it would be better (require less complexity) to simply define the behavior of overflows and provide the API to query the chip for such overflow (carry) state if desired.
Look, at the end of the day we're running code on specific platforms. We can hold up the idyllic goal of truly cross platform language, but such does not exist in practice. In practice the code is tested on the platform, and exceptions are made where specific platform capabilities differ, typically. Platform specific modifications are nearly always needed. See: the guts of your stdint.h file on various platforms; Now understand that nearly all programs today use stdint.h (normalized byte sizes) and that stdint.h really isn't a part of the language itself, but part of the runtime or API -- associated non-essential component of the language. The point is that the language is abstracted from the platform, so it must interact with platform specifics in order to provide normalized symbol meanings, SO DO THAT; Otherwise simply provide guaranteed behaviors for variable types. Note: your programmers will use the former to achieve the latter anyway.
C's undefined integer behaviors are the problem. They should not be so. This decision means that the entropy spreads into nearly all areas of the code -- Thus most cross platform frameworks avoid int like the plague. The fixed size (and thereby behavior) stdint.h types should be the only ones in the language. Rarely, if ever, do you ever actually need an 'int' -- a variable that changes size with the platform. Yes, it makes code more efficient to utilize a variable that is the platform's native width, but this is in direct conflict with portability of the code. It's folly to ignore the platform features and apply such undefined behavior. Not every "language" does this. Assembly takes the right approach. It leaves no behavior undefined. All of my toy languages have fully defined behavior for any piece of code in every environment it can run on. There is a finite number of platforms in existence, and I explicitly give the programmer the ability -- within the language -- to detect what the platform capabilities are at compile time. Instead of C's int type, one must use the equivalent of #ifdef blocks to supply your typedef mapping:
// C-ish pseudocode for creating your flexible int type, that you actually rarely (if ever) need:
#if ( __CPU_WORD == 32 )
typedef int32_t int;
#elseif ( __CPU_WORD == 64 )
typedef int64_t int;
#elseif ( __CPU_WORD > 64 )
#warn "Emulating 64 bit integer compatibility."
typedef int64_t int;
#else
#warn "Emulating 32 bit integer compatibility."
typedef int32_t int;
#endif
You only have to do this in one place, and it lets the programmer explicitly define behaviors so they can depend on them. This code allows every p
> If it goes on to make a fancy assumption about the undefined behavior instead of letting it fall through to runtime as written then it's doubly broken.
What does "letting it fall through to runtime as written" mean? The spec explicitly says that the behaviour is undefined, so there is no standard behaviour.
Another example is "int i = 1; f(i++, i++)", which invokes undefined behaviour. What are the values of the parameters passed to f? You might expect that one would be 1 and the other 2, possibly either way around. Since it's undefined it's also allowed that both be 1, both be 2, both be 666, it refuses to compile, or demons fly out of the programmer's nose.
There's nothing in the standard about "warnings", though most compilers are good about it when it comes to common problems. But even with a warning, optimizer's gonna optimize.
Socialism: a lie told by totalitarians and believed by fools.
Myers-Briggs test, and it's 'N' for intuitive. :)
Incorrect – "undefined" means that absolutely any result at all is correct. That means that the compiler can do anything at all that it likes (because all of those will fall into correct), but *you* can't expect anything in particular. Clang is absolutely correct here (as would any behaviour at all be).
GCC fails to warn on stripping code in a way that leads to security vulnerabilities.
IBM has a tool that catches those same vulnerabilities.
And people wonder how the NSA gets so many cool zero-days.
Signed integer overflow (props to some people elsewhere in the thread that taught me that this isn't true for unsigned!)
Writing or reading past the end or beginning of an array
Dereferencing a NULL pointer
Accessing an object of one type via a pointer of another type (violating the strict-aliasing rules)
All of these are exactly what I was talking about - different needs for different architectures. I've coded on a platform where writing to 0 was legal, and did something bad, unless you did it on purpose. No fun at all, but possible to code for.
Accessing memory at an address that has been free()ed or deleted
Calling several STL algorithms with iterator pairs that don't form a valid range, e.g. copy(vec1.begin(), vec2.begin(), vec1.end()) (I think I ordered those right)
These are important for library optimization. Without the optimization they allow, people would have written their own, faster libraries and that would have sucked far worse.
Assigning the same scalar value twice without an intervening sequence point (e.g. i = i++; not only doesn't have a well-defined evaluation order but also provokes undefined behavior entirely)
I never did understand what they gained from that one, but the examples I've seen of the sequence point thing are horrible code anyway, so it doesn't bother me.
The one big one in C++, the biggest gotcha for even veteran programmers, is the undefined lifetime of "compiler temporaries" (usually unnamed objects created in the argument list to a function call), which is a landmine for shared_ptr. I hope they fixed that one in C++0X
>If my C code contains *foo=2, the compiler can't just leave that out
Well, it could if the program produces no further output before exiting, or if "foo" is unassigned.
Based on the headline, I thought it was going to be about Ken Thompson's self-referencing compiler that not only inserted a back door whenever it saw that it was compiling the UNIX login command, it also inserted the back door insertion code whenever it saw it was compiling the compiler source code.
According to the article "unstable code" is anything with undefined behavior according to the C++ standard. This could be as simple as an integer overflow or divide by zero which in debug or "zero optimization" mode would always cause an error, but which in an optimized release may simply be removed.
The behaviour is also undefined if realloc returns NULL. Also, sizeof(char) is 1 by definition.
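A rough sketch of a realloc pattern that avoids both traps: the result goes into a temporary first, and the old pointer value is never looked at again after a successful call. The helper name grow_buffer is hypothetical, and the sketch assumes new_size is greater than zero, since realloc with size 0 behaves differently across implementations.

```c
#include <assert.h>
#include <stdlib.h>

/* Resize *buf to new_size bytes (new_size > 0 assumed).
 * Returns 1 on success, with *buf updated to the new block.
 * Returns 0 on failure, with *buf untouched and still valid. */
int grow_buffer(char **buf, size_t new_size)
{
    assert(buf != NULL && new_size > 0);
    char *tmp = realloc(*buf, new_size);
    if (tmp == NULL)
        return 0;      /* old block is still allocated; caller keeps using *buf */
    *buf = tmp;        /* the previous pointer value is overwritten, never compared or dereferenced */
    return 1;
}
```

The caller only ever holds one pointer, so the question of whether the old and new addresses happen to be equal never comes up.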
I think the compiler would be violating sequence points if it moved the division up.
However, I see your point with the for-loop and have experienced it first hand when I wanted to see how fast such a loop would run. I had put some stupid addition or something in there, and the sneaky compiler went ahead and optimized my loop into oblivion. I had to put a function call in the loop to make it generate loop code.
After reading over responses to my original post, and to other posts around here I've come to the following conclusion:
Programmers are invoking undefined behavior.
OK, aside from that I always figured that invoking undefined behavior could make your program blow up at runtime. I never thought about the possibility of undefined behavior occurring at compile-time. I certainly wouldn't rely on such behavior, no matter how fast it made my program run. I'd be at the mercy of the compiler author. Even if I used #ifdef checks for the operating system, compiler version, etc. I could get screwed. Such checks are legitimate for implementation defined behavior in the compiler or quirks of the operating system on which the program will run. They are NOT legitimate for getting away with undefined behavior, not if you want to claim your program is C or C++.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
>a == b is not a use of the argument that has been invalidated
Yes it is. Evaluating the expression "a" causes undefined behaviour if "a" is indeterminate. "a" is considered to no longer have a value, any attempt to refer to its value causes UB. (It has the same status as a variable that has been defined but not initialized, i.e. "int a;")
The only thing that can be done with "a" thereafter is to assign a new value to it (or take its address, or do "sizeof a"; can't think of any other exceptions).
What if INT_MAX - a overflows?
Hmm I seem to have messed up a few >s and <s... That's my fault, 0: For not giving a fuck -- It's futile to try deconverting a zealot; and 1: it's 2013 and we're still escaping HTML manually?
Truly, the whole computing world is shit strung together with bubble gum and twine. I mean, really... No isolation for code and data pointers or sacrificing a register for offset / segmentation and not giving us a new offset register so we could ACTUALLY do the heap code pointer protections.
How fucking dumb can everyone be? The language and systems programmers don't interact with the hardware makers and vice versa. What the actual fuck. I'd love just ONE MORE hardware execution permission ring level, so that SANDBOXES could actually work... Nope, not on ARM, or AMD... Just 2 levels -- Hardware designed for a monolithic kernel. It's fucking disgusting.
>The dereference is undefined, and therefore
Stop right here. Once undefined behaviour occurs, "all bets are off" as they say; the remaining code may have any behaviour whatsoever. C works like this on purpose, and it's something I agree with. It means the compiler doesn't have to insert screeds of extra checks, both at compile-time and run-time.
There are plenty of other languages you can use if you want a different language definition :)
"Overflows of unsigned values" is NOT undefined. You can assign out-of-range values to unsigned types, and also perform arithmetic operations which exceed the bounds of the type; and the value is adjusted using modular arithmetic.
Some would be facetious and say that "unsigned types cannot overflow", meaning that they always have well-defined behaviour on operations that would generate an out-of-range value, but that's just an issue of pedantry with English.
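To make the modular rule concrete, here is a small sketch (the function names are invented for illustration). Because unsigned addition is reduced modulo UINT_MAX + 1, the wrapped result is well defined and can even be used to detect that wrapping happened:

```c
#include <limits.h>

/* Unsigned addition: the result is reduced modulo UINT_MAX + 1,
 * so this is well defined for every pair of inputs. */
unsigned wrap_add(unsigned a, unsigned b)
{
    return a + b;
}

/* Returns 1 if a + b wrapped around, 0 otherwise.
 * A wrapped sum is always smaller than either operand. */
int add_wrapped(unsigned a, unsigned b)
{
    return a + b < a;
}
```

The same after-the-fact test is not available for signed int, because there the overflowing addition is already undefined before the comparison runs.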
With all due respect, this is a silly example of an obvious coding mistake (making assumptions about the location of a dynamically allocated pointer after calling realloc) followed by melodramatic consequences.
I think you must be mis-remembering the details slightly. The comma operator is a sequence-point, so "tmp" must be assigned the value of "a", and f() and g() must both be called with a value that is the value of "a" converted to the type of "tmp". The two functions can be called in either order though (or in parallel) but there is no issue there.
Of course, the compiler can do anything it likes so long as the program's output is equivalent to what I just described. So, for example, it might not allocate a memory location to "tmp", it could just push the value of "a" onto a register and then call f and g with it. Or if f or g do nothing and have no side-effects, the assembly code might not show calls to f and g. But there is no way you could know these things by running the program, which is the whole point.
The first mistake was using signed integers.
The problem is C's promotion rules. In C, when promoting integers to the next size up, typically to the minimum of "int", the rule is to use signed integers if the source type fits, even if the source type is unsigned. This can cause code that seems to use unsigned integers everywhere break because C says signed integer overflow is undefined. Take the following code, for example, which I saw on a blog recently:
uint64_t MultiplyWords(uint16_t x, uint16_t y)
{
uint32_t product = x * y;
return product;
}
MultiplyWords(0xFFFF, 0xFFFF) on GCC for x86-64 was returning 0xFFFFFFFFFFFE0001, and yet this is not a compiler bug. From the promotion rules, uint16_t (unsigned short) gets promoted to int, because unsigned short fits in int completely without loss or overflow. So the multiplication became ((int) 0xFFFF) * ((int) 0xFFFF). That multiplication overflows in a signed sense, an undefined operation. The compiler can do whatever it feels like - including generate code that crashes if it wants.
GCC in this case assumes that overflow cannot happen, so x * y is positive (when it's really not at runtime). This means the uint32_t cast does nothing, so it is omitted by the optimizer. Now, the code generator sees an int cast to uint64_t, which means sign extension. The optimizer this time isn't smart enough to know again that it's positive and therefore can ignore sign extension and use "mov eax, ecx" to clear the high 32 bits, so it emits a sign-extending "cdqe" opcode instead.
So no, avoiding signed integers does not always save you.
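One possible fix, sketched here under a different name (multiply_words) so it isn't confused with the blog's original: convert one operand to uint32_t before multiplying, so the multiplication happens in an unsigned type at least 32 bits wide and never in signed int.

```c
#include <stdint.h>

/* Same interface as the example above, but the multiplication is forced
 * into unsigned arithmetic by converting x to uint32_t first. y is then
 * converted to match, so no signed overflow can occur. */
uint64_t multiply_words(uint16_t x, uint16_t y)
{
    uint32_t product = (uint32_t)x * y;
    return product;
}
```

The largest possible product, 0xFFFF * 0xFFFF = 0xFFFE0001, fits in 32 bits, so nothing wraps either.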
"Screw Sun, cross-platform will never work. Let's move on and steal the Java language." - Visual J++ Product Manager
This code would also launch the missiles with gcc if the C library took the perfectly valid action of freeing a after allocating a new object of size 1 or more, which is the reason this is UB. In a security-minded implementation, each allocation would be placed on at least one page, which would be marked as not readable after free; that would save us from the missiles but not from a drone crash.
Clang optimizes to the first implementation. It should error out.
I like this one, because it shows a very common weakness in high level languages.
In most machine languages, getting the average of two unsigned numbers up to UINT_MAX is absolutely trivial -- add the two, then shift right including the carry. The average of two signed numbers rounding to zero is a little more difficult (x86 makes it harder than it should be by not setting flags in a convenient manner), but still a few instructions.
In C? Assuming low and high are unsigned
(low >> 1) + (high >> 1) + (low & high & 1). Ick. The answer given in your article is inadequate; it gets you one more bit.
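Wrapped up as a function, in case anyone wants to try it (the name average_unsigned is just for illustration). Each operand is halved before adding, and the last term restores the 1 that is lost when both operands are odd:

```c
#include <limits.h>

/* Floor of (low + high) / 2 for any two unsigned values,
 * computed without a wider type and without wrapping. */
unsigned average_unsigned(unsigned low, unsigned high)
{
    return (low >> 1) + (high >> 1) + (low & high & 1u);
}
```

Unlike low + (high - low) / 2, this does not need low <= high.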
Of course, now we have 64 bit integers and the problem is solved ONCE AND FOR ALL.
But there are algorithms which need the average of 64-bit unsigned numbers too..
ONCE AND FOR ALL
Why do people do this crap?
This is simply a bug in the compiler, which has made semantic assumptions that are incorrect. It has nothing to do with removing this so-called 'unstable' code, which I'm still yet to see a real example of.
For all intents and purposes "for all intensive purposes" is a mishearing of "for all intents and purposes". (Just thought I'd mention it, because I used to say that too, and became really conscious of it since someone pointed it out to me).
Believing something doesn't make it true. Not believing something doesn't make it false.
In other words crappy, buggy code can cause undefined behaviours from the compiler and at runtime.
News flash.
Code written with such erroneous assumptions has long been at fault for everything from BSODs to the loss of satellites and deep space probes. Compilers are not mind readers. They can only work with what's been provided; they can't guess what your intentions were.
I do not fail; I succeed at finding out what does not work.
It doesn't need to make any assumptions. The choices are:
1) The test would fail. In this case the body of the conditional is irrelevant.
2) The test would succeed, in which case the program's behaviour was already undefined and any further course of execution is legal - including ignoring the conditional.
In both cases the body of the conditional is irrelevant, and it is legal to remove it. Since the conditional test itself has no side effects, it too is irrelevant and can be removed.
If low and high are unsigned, then (low + high)/2 is well defined, because unsigned arithmetic is defined as modular.
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
(low+high)>>1 === low + ((high-low)>>1) // ONCE AND FOR ALL
Works fine for any values if low and high are signed..
If unsigned, then you need to verify low <= high
(edit for slashcode)
Wrapping might have worked, and might continue to "work", on signed integers too if the compiler simply translates every addition into a machine instruction that adds and wraps on overflow. If you went to another architecture where such an instruction doesn't exist, the operation being undefined allows the compiler to choose some other instruction that works fine for addition whenever there is no overflow. Otherwise, it might need to add a bunch of instructions to handle overflow if the CPU doesn't wrap it. Undefined operations on a boring platform might actually sometimes do what you want and/or expect, at least until optimization is done.
Pretty much every arithmetic operation on signed integers can produce undefined behaviour. Do you want to see warnings about that for every single line that uses signed integers?
Also what do you mean by "as written"? The code as written has undefined behaviour, and by the standard anything at all can happen at runtime - including doing nothing.
"Write a C function that correctly computes the average of two ints" is one of my favorite little programming puzzles, probably mostly because I ran into that exact problem. Speaking from total ignorance from what makes a good interview question, I think it might make a neat little puzzle for a programming job where you're expected to do low-level stuff like that.
I find it very curious that you'd be prohibited from using the value of the pointer proper. Do you have a citation for it being undefined? I don't see anything that seems to me to say that around the description of realloc in the C99 draft standard, but it also doesn't explicitly say that you can't access the memory that was deallocated in the section I'm looking at either.
What do you mean "word is"? Nobody.. and I mean nobody... should ever be coding in C unless they've read the standard from top-to-bottom, carefully, at least once, and then keeps it handy. It's quite easy to understand, unlike the C++ standard, which is written completely differently and is 10x longer.
To quote C99: "The typedef name intN_t designates a signed integer type with width N , no padding bits, and a two’s complement representation." C99 7.18.1.1p1 (N1256).
That is completely intentional. However, it doesn't resolve the issue because signed overflow is still undefined behavior regardless of the width or representation. If you want modulo behavior just use unsigned. My rule of thumb: unsigned for everything unless you know you need signed arithmetic. And make sure you keep it unsigned (i.e. be careful of unsigned short -> int promotions). Why? Because unsigned behavior is always well defined, and it's easier to code to precisely one set of rules. Plus you can make use of unsigned rules to help avoid buffer overflows (i.e. size_t overflow simply wraps around to something smaller, not bigger, which if you don't catch the overflow because you're being sloppy is usually the better behavior, anyhow).
"mit stack checker", and it's the first URL returned: css.csail.mit.edu/stack
I really wonder about people that complain google is unusable and never gives them that for which they are searching. are they lazy? uncreative? semiliterate? Maybe a well constructed online test could discern the truth.
Yes, (low + high)/2 is well-defined for unsigned ints even when (low + high) > UINT_MAX. It's not the definition you want, though. The average of (UINT_MAX/2 + 1000) and (UINT_MAX/2 - 500) should not be 249.
I also like the following possibility that arises from warnings being completely compiler-dependent.
Warning L-8272: Your code is perfectly valid and well-formed.
I wish vendors would implement this, just to be a thorn in the side of those who advocate those unthinking zero-warnings policies.
Signed integer overflow being undefined is a mistake in the C spec. No one will make a ones-complement machine ever again.
Now to extirpate big endian.
In that case, yes. The case in TFA where password is memset to zero, then freed is another matter. The code is unambiguous and clearly serves a security function. However, looked at narrowly, the memset is wasted because the memory will be freed in the next line. But if the memset is skipped, the password is left floating around in unallocated memory. Worse, it might end up in swap.
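A common workaround is to do the wipe through a volatile pointer so the stores are not treated as dead. Below is a minimal sketch with made-up names (secure_zero, discard_password). Note the assumption: compilers in practice do not elide volatile stores, but this is a convention rather than an ironclad guarantee from the standard. Where available, platform routines such as explicit_bzero or C11's optional memset_s are meant for exactly this job.

```c
#include <stddef.h>
#include <stdlib.h>

/* Overwrite len bytes at buf with zeros through a volatile pointer,
 * so the compiler has to emit every store even if buf is freed next. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    while (len--)
        *p++ = 0;
}

/* Wipe a heap-allocated secret and then release it. */
void discard_password(char *password, size_t len)
{
    secure_zero(password, len);
    free(password);
}
```

This does nothing about copies that already ended up in swap or in registers; it only keeps the wipe itself from being optimized out.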
-Wall doesn't catch aliasing issues see here: http://blog.worldofcoding.com/2010/02/solving-gcc-44-strict-aliasing-problems.html
Myers-Briggs test, and it's 'N' for intuitive. :)
Yep, the I is for introverted.
Good point.
He said "negative", not "imaginary".
What do you mean "word is"? Nobody.. and I mean nobody... should ever be coding in C unless they've read the standard from top-to-bottom, carefully, at least once, and then keeps it handy.
As in, (public) discussions on the standard between standards committee members. Reading the standard top-to-bottom only gives part of the picture, like reading the Constitution without reading the Federalist Papers. When late in the game someone writes something like "do you realize this means a char can only be 8 bits?" one can reasonably speculate that mandating an 8-bit byte (goodbye PDP) wasn't the intention.
It's quite hard to write a standard that's both intelligible and unambiguous - English doesn't work that way. Sometimes googling around for the discussions sheds light.
Nice. Nice. I like the way you think.
It's humans writing the code, and humans are not perfect. If this tool actually works as advertised, it should make a fine addition to us not-so-perfect coders' toolbox. Somehow I would have expected that the compiler would warn you when it is going to leave a piece of code out. But I guess that shows how little I actually code.
Ah, sorry. I screwed up the example slightly. In a proper example, there's always a little more code to ensure that "a" is a strictly positive integer, because if it isn't, then a + 2 is guaranteed to not overflow, making the check superfluous.
I find it very curious that you'd be prohibited from using the value of the pointer proper.
Makes sense to me that doing so would have undefined results. Plenty of more sophisticated memory managers will monkey with the contents of pointers from time to time. Once you deallocate the block it pointed to, God only knows what's left in the pointer. Probably the same value for simple memory allocators. Probably...
"Convictions are more dangerous enemies of truth than lies."
This makes no sense. The dereference is undefined, and therefore sk may be undefined iff tun IS null but not tun.
I.e. by the time execution reaches the if statement one of the two is true:
tun != null && sk == {something valid} -or-
tun == null && sk == {undefined}
sk being undefined is possible but that undefined-ness can't be used as a way to infer tun != null--the only thing that causes it is tun == null! It's illogical for the compiler to do what you say and remove the if check. The standard says sk can be undefined, therefore something being in an undefined state is possible, not that the compiler can presume that undefined is impossible to occur and put its hands over its ears and go la-la-la.
"sk being undefined"? You misunderstand what "undefined" means in this case. It's not a question of whether "sk" has a defined value or not, because if tun was null, it's not that the value of sk becomes undefined, rather, the behavior of that dereference statement is undefined. "By the time execution reaches the if statement"? It may never reach it! Since the behavior of "struct sock *sk = tun->sk" is undefined in the case tun == null, it's perfectly acceptable for that statement to be treated as a "return" statement if tun is null. Or for it to be treated as a command to reformat your hard drive. Or to simply make demons fly out your nose. The if statement can be eliminated, since if the compiler wants to treat it as unreachable code in the event tun == null, it has every right in the world to do so.
The problem is C's promotion rules. In C, when promoting integers to the next size up, typically to the minimum of "int", the rule is to use signed integers if the source type fits, even if the source type is unsigned.
I know. C's handling of integer overflow is "undefined". In Pascal, integer overflow was a detected error. DEC VAX computers could be set to raise a hardware exception on integer overflow, and about thirty years ago, I rebuilt the UNIX command line tools with that checking enabled. Most of them broke.
In the first release of 4.3BSD, TCP would fail to work with non-BSD systems during alternate 4-hour periods. The sequence number arithmetic had been botched due to incorrect casts involving signed and unsigned integers. I found that bug. It wasn't fun.
C's casual attitude towards integer overflow is why today's machines don't have the hardware to interrupt on it. Ada and Java do overflow checks, but the predominance of C sloppiness influenced hardware design too much.
I once wrote a paper, "Type Integer Considered Harmful" on this topic. One of my points was that unsigned arithmetic should not "wrap around" by default. If you want modular arithmetic, you should write something like n = (n +1) % 65536;. The compiler can optimize that into machine instructions that exploit word lengths when the hardware allows, and you'll get the same result on all platforms.
I never said it always saved you. There are obviously caveats, one of which is promotions of unsigned char and unsigned short to int. Another is something like this:
unsigned x = UINT_MAX, y = UINT_MAX;
unsigned long z = x + y;
Problem? If unsigned is 32-bits but unsigned long is 64-bits, z is assigned the truncated value, which may or may not be what you wanted. (In the common case of object manipulation, inadvertent unsigned truncation is usually the least harmful effect, and if used purposefully and knowingly, very handy).
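A small sketch of the difference, with illustrative function names. The fix is to widen one operand before the addition rather than widening the result. Whether that helps depends on unsigned long actually being wider than unsigned, which is an assumption about the platform (true on typical 64-bit Unix targets, not on 64-bit Windows):

```c
#include <limits.h>

/* The addition is done in unsigned, wraps modulo UINT_MAX + 1,
 * and only the already-truncated result is widened. */
unsigned long sum_then_widen(unsigned x, unsigned y)
{
    return x + y;
}

/* One operand is widened first, so the addition itself is done in
 * unsigned long. If unsigned long is wider than unsigned, nothing wraps. */
unsigned long widen_then_sum(unsigned x, unsigned y)
{
    return (unsigned long)x + y;
}
```

Both versions are fully defined. They just compute different things when the sum exceeds UINT_MAX.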
But my ultimate point is that if one sticks to unsigned, they've drastically reduced complexity. You still need to understand the semantics, but there are fewer issues to keep in mind. That's no excuse for not understanding all the semantics for all types, but when you're jamming away on code you want to use the method which requires the least amount of rules to be resident in your head, so as to minimize mistakes. One wants to habituate oneself to patterns and techniques which, on the whole, maximally mitigate the effects of bugs.
Interestingly, one counter argument I've heard to using unsigned is that it _prevents_ some of the kinds of optimizations that compilers use. For example, unsigned arithmetic in for-loop conditionals is often less than optimal. But if one is aiming for consistency and correctness, it's usually the way to go, not the least because for-loops often operate on vectors or arrays, and negative values are almost always non-sensical. (Obviously there are caveats to unsigned in for-loops, such as the infamous ">= 0", but that's beside the point).
Apparently lots of programmers start with "IN-"... although I still haven't decided if Myers-Briggs is insightful analysis or modern astrology. It would help if they could agree with each other (I've been variously told I'm INTP or INFP, depending on the test, or maybe what kind of mood I'm in, or perhaps somehow related to the phases of the Jovian moons)...
The original x86 had 4 ring levels and segmentation, and at the time we got AMD64 no one was using them (or at least Windows/Linux/*BSD weren't using them). AMD removed security measures because they weren't popular.
I don't think that's the problem the OP is referring to. The problem is that the compiler assumes *a is invalid and "optimizes" it even if the realloc returned the same memory address, making (a==b) true. If the compiler did nothing (*a==*b) should also be true but because the compiler replaced it with something incorrect, (*a!=*b) ends up being true instead.
That's the problem with loose standards -- if the behavior is undefined, there will be people who through ignorance or "cleverness" will end up abusing the undefined behavior of a specific system (in this case compiler) and have their code break in ways that are sometimes extremely difficult to debug -- especially if it's been working that way so long that you've forgotten it's technically an undefined behavior -- you end up completely overlooking the culprit line of code because it looks correct to you.
You're never going to see this comment but one of the LLVM lead programmers outlines the reasons why the compiler can't always warn that the programmer has invoked undefined behaviour in C. He outlines three reasons:
The article outlines some steps programmers can take but ultimately concludes that C just isn't a "safe" language (but that's partly why it can go so fast).
that's absolutely incorrect. while signed-preserving compilers are standards compliant, unsigned-preserving compilers are too. unsigned-preserving compilers are conservative in the principle of least surprise.
that's not correct. a does not become undefined. its value does not change since c is pass by value. and either realloc() assigns new storage or it does not. so either a==b or a!=b, either one is valid but there is no third "undefined" option. the compiler is simply wrong if it thinks that a function call can make a previously defined value go random.
Isn't right shift for negative values implementation defined? So your code may work on some platforms, but not on others.
The Tao of math: The numbers you can count are not the real numbers.
I like even more the fact that, given that a diagnostic is sometimes required, but never disallowed, and the text of a diagnostic is not regulated, a conforming compiler may just output on every compilation:
warning: This program might contain errors.
Useless, but completely conforming.
It doesn't really have to _detect_ undefined behaviour. Maybe it even can't detect it. It just has to produce code that will work as defined in cases when the behaviour is defined.
Let's say I need an algorithm A that, given a Turing machine M and an input word w, produces M(w) if M would halt on w; otherwise, A(M, w) is undefined. A can just run M on w and not worry (or warn) about it potentially not halting. In fact, that is the only available option.
More realistically (and in almost all userland scenarios), if tun is null then the dereference causes a segmentation fault, so the test is guaranteed to fail (i.e. tun is guaranteed non-null) if the code gets as far as the test. To get any real risk, you need a compiler that both optimizes out the test and does something really bizarre on a null pointer dereference (which they're technically allowed to do, but generally don't). Or be writing a kernel, of course.
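For the kernel-style case, the ordering that leaves the compiler nothing to infer is simply to test before dereferencing. A minimal sketch follows; the struct definitions are simplified stand-ins invented for illustration, not the real kernel types, and tun_sock is a hypothetical helper.

```c
#include <stddef.h>

/* Simplified stand-ins for the structures discussed above. */
struct sock { int id; };
struct tun_struct { struct sock *sk; };

/* The NULL test comes before any dereference of tun, so there is no
 * earlier undefined behaviour from which the compiler could conclude
 * that tun is non-null and drop the check. */
struct sock *tun_sock(struct tun_struct *tun)
{
    if (tun == NULL)
        return NULL;
    return tun->sk;
}
```

With the dereference placed first, the check is legally removable; with the check first, it has to stay.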
Try this one: if (a < a+1) { /* do stuff */ }
If a is a signed int, a smart compiler will leave out the condition and assume it is always true, because it assumes undefined behaviour never happens. Under that assumption, a < a+1 is true even when a = INT_MAX.
> if (INT_MAX - a 2) {
Oh please no. Don't get clever with security checks. The rule is pretty simple: In almost all cases, the value to be validated should stand alone.
Which means
if (a = INT_MAX - 2)
It will generate better code even with a dumb compiler and it actually works. Because your code will still overflow for a < 0, this time not in the addition but in the condition. If you're lucky that is a less critical failure type. If you're unlucky, it's worse than what you tried to fix.
Well, it works when plain text posting actually works and doesn't change >= to = ...
Also things get tricky (and the rule doesn't work) when you must validate two untrusted value to stand in some relation to each other.
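For the two-value case, a minimal sketch of a check that never evaluates the risky sum, assuming plain int operands. The helper name add_would_overflow is made up for illustration and does not come from the thread.

```c
#include <limits.h>

/* Returns 1 if a + b would overflow int, 0 otherwise.
   Hypothetical helper: it only compares against INT_MAX / INT_MIN,
   so the check itself never performs a signed overflow. */
int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow when b > 0 */
    return a < INT_MIN - b;       /* INT_MIN - b cannot overflow when b <= 0 */
}
```

Because no undefined operation appears in the condition, the compiler has no licence to treat the test as always false.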
> For undefined behavior, the compiler can do anything it wants to
In particular, relevant for the case of optimizations discussed here, it is free to assume that case can never happen and optimize code before/surrounding it accordingly. ...
So if you have (a being int)
int b = a + INT_MAX;
if (a > 0) {
}
It can optimize the if away, since it can only be true when the calculation of b would have overflowed. This is even true if you never use b and it is old code you forgot to remove and is far away in the source file.
This is not correct, however there is a set of constraints in C and a set of constraints in POSIX which, when composed, mean that 8 bits is the only valid size for a char, so if you are writing C code that only targets POSIX machines then you can safely assume that char is 8 bits.
I am TheRaven on Soylent News
Nice, but one of the proposed solutions is incorrect:
mid = ((unsigned int)low + (unsigned int)high) >> 1;
if low and high are equal or greater than 2^30, then it overflows.
For example, low=131, high=131, then you get mid=0 instead of mid=131.
It's 1<<31, of course !
Slashdot removes the unescaped <
Here's another one: ordered comparisons between pointers to different objects (less-than or greater-than). These are undefined in C and C++, and yet every STL implementation I've seen relies on them being both defined and stable for many collections (map, set and friends) to work. C++11 is explicitly written to allow implementations with GCs that modify pointer values, so this could cause some interesting issues in the future...
I am TheRaven on Soylent News
Switching rings was too damn slow. So slow, that only academics ever used them, and not very well. It is no wonder it died.
Separating code and data pointers is supported through the NX bit if you forbid self-modifying code and trampolines, so just say no to any crap that requires them, and place all data on separate pages with NX enabled. This is supported in Linux+gcc+LD, but it requires explicit action and can't be used for everything due to dumbass language designs that require trampolines.
Properly partitioning memory between ring 0 and ring 3 (separating kernel mode from user mode) is supported through the SMEP and SMAP features (and their AMD equivalents): Linux already implements it if you have a processor that has the support. I don't know about the BSDs, but I doubt they'd fall behind on this.
And there is now a new "Intel MPX" thing that might fix the array-out-of-bounds disease by actually making it painless enough that we can just enable runtime checking everywhere (static analysis of this is already a MUST).
Now, if only we could kill ring -1 *DEAD* to get rid of both vendor-added bug-ridden crap SMM, and hypervisor/NSA viruses/trojans, it would be REALLY nice. You can only do it right now if you design the platform from the ground up and you're an Intel hardware partner (or AMD hardware partner) to get complete access to all BIOS/EFI reference code for platform setup.
Segment registers were very useful, but if you look at chip errata you'll get a glimpse of the hellpit we'd be in should those things still work in long mode: the TLBs and MMUs are already buggy as all hell [in silicon], and you have extremely nasty interactions of even the microcoded "basic instructions" with memory pages and alignment boundaries in X86 (think "fast strings" mode of operation). Add segments to that again, and chances are we will just have to junk the whole arch for good. Which would not be a bad idea, it would sink Microsoft crap along with it, while Linux and the BSDs work extremely fine with everything else worthy of notice under the sun.
It is a damn good thing that compilers are mostly lazy, then. And the easiest is to just omit that section of code ;-)
It should be ingrained in every programmer's brain that undefined behaviour is exactly that, so I have absolutely no sympathy for anyone whose code suffers from this. Personally I try not to use languages like C unless I'm forced to interface with some particular library; not only are the semantics full of undefined behaviour, but they're also incredibly complex. This means programmers struggle to understand what's going on *exactly* and have to think in terms of approximations, which leads to some situations becoming unexpected (e.g. integer overflow). This also leads to teams of very smart people having to invest a lot of time to make tools like the one in TFA, which is clearly a sign that the language is too complex.
Now, the really *interesting* security problems caused by optimisation are the side-channel attacks. For example, you might have code which checks a security token; if this is done character-by-character and fails on the first mis-match then an attacker can guess each character in turn by timing how long it takes to fail. Quick failure == first character is wrong, slightly slower failure == first character is right, and so on. You might decide to guard against this by looping through the whole token, so that every check takes the same amount of time. However, is your compiler going to optimise this away? Maybe you'll call some random number generator in your loop to force it to run, but maybe your compiler will move this out of the loop without affecting the semantics, etc.
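A minimal sketch of the "loop through the whole token" idea, assuming both buffers have the same known length n. The function name ct_equal is hypothetical. The volatile accumulator is one common attempt to keep the optimiser from short-circuiting the loop; nothing in the C standard promises constant timing, so treat this as an illustration rather than a vetted primitive.

```c
#include <stddef.h>

/* Compares n bytes without an early exit.
   Returns 1 if all bytes match, 0 otherwise.
   Every byte is visited regardless of where the first mismatch is. */
int ct_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    volatile unsigned char diff = 0;   /* volatile: discourage loop elision */
    for (size_t i = 0; i < n; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);
    return diff == 0;
}
```

Checking the generated assembly for the target compiler and flags is still the only way to know whether the loop survived.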
Augmented C for static analysis can do it, which is just C with lots of __foo crap added everywhere (and preprocessor code to kill that when not running under the static analyser). Refer to "sparse" and have a look at the Linux source code which is full of sparse annotations for extended typechecking.
For the "missile launch example":
After the statement b = realloc (a, sizeof (char));
the value of a is indeterminate unless realloc failed, that is b = NULL. So immediately afterwards a comparison a == b must give the correct result if b == NULL. However, there was an assignment to *b, which tells the compiler that b != NULL and therefore a is indeterminate.
So it's safe to do b = realloc (a, sizeof (char)); if (b == NULL) {
/* handle the error and a is unchanged */ }
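The pattern described here can be wrapped up as a small helper. This is a sketch only: the name grow_buffer and its return convention are invented for illustration, and it assumes new_size is greater than zero.

```c
#include <stdlib.h>

/* Hypothetical helper: resizes *buf to new_size bytes.
   Returns 0 on success, -1 on failure.
   On failure *buf is untouched and still valid.
   On success the old pointer value is overwritten and never read again. */
int grow_buffer(char **buf, size_t new_size)
{
    char *tmp = realloc(*buf, new_size);
    if (tmp == NULL)
        return -1;      /* realloc failed: the old block is unchanged */
    *buf = tmp;         /* replace the stale pointer immediately */
    return 0;
}
```

The old pointer is only ever inspected on the failure path, which is the one case where its value is still guaranteed to be meaningful.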
Unsigned arithmetic in C is not defined as modular. Integer overflows are "undefined", leading to unintended compiler optimizations where it assumes overflows just can't happen.
There's nothing in the standard about "warnings", though most compilers are good about it when it comes to common problems. But even with a warning, optimizer's gonna optimize.
There aren't "warnings", but the c89 and c99 std both prescribe diagnostics to be issued under certain circumstances.
I'm a minority race. Save your vitriol for white people.
This is in the paper (read it, it is worth your time if you write any code in something other than Ada), section 6.3:
Postgres. Stack reported 68 bugs in total. The developers promptly fixed 9 of them after we demonstrated how to crash the database server by exploiting these bugs, as described in 6.2.1. We further discovered that Intel’s icc and PathScale’s pathcc compilers discarded 29 checks, which Stack identified as unstable code (i.e., urgent optimization bugs), and reported these problems to the developers. At the writing of this paper, the strategies for fixing them are still under discussion. Stack found 26 time bombs (see 6.2.3 for one example); we did not submit patches to fix these time bombs given the developers’ hesitation in fixing urgent optimization bugs.
For context, you have to read this about a bug the PostgreSQL people tried to fix and got all wrong *again* (and therefore did not fix properly): read sections 6.2.3 and 6.2.1.
I am not amused. When someone who KNOWS WHAT THEY'RE DOING points out your code is crap, you FIX it instead of jumping around on one foot chanting "I know more than you" (when you obviously don't).
Now, I already run all my code through three static checkers as well as LLVM and newest GCC in -Wall -Werror -pedantic mode, but as soon as the MIT "Stack" site unclogs (it is currently overloaded or something), I will add a fourth :)
You young whippersnappers!
C99 standard, section 6.2.4 paragraph 2 and annex J.2
plenty of more sophisticated memory managers will monkey with the contents of pointers from time to time.
The memory heap manager library is passed the value of the pointer, not a reference to the pointer itself.
e.g. if you had p =malloc(10);
To release the memory, you do free(p); not free(&p);
The key thing to keep in mind is, the memory manager has no knowledge of the object 'p'; only the address of the object that p had pointed to.
Furthermore, there could be plenty of other copies of the pointer lying around, that the memory manager does not know the address of. A perfectly valid construct would be...
struct { char * x; } bar4; char *bar1, *bar2, *bar3;
bar3 = malloc(256); bar1 = bar3;
bar2 = bar1;
bar4.x = bar2; free(bar4.x);
In this case, you have 4 copies of the pointer in scope, each containing a memory offset within the program's address space, plus the copy that gets created when free is called. At no time is any memory management library able to change the contents of any of those 4 copies in a C program. The memory manager can only change the object stored at the address referenced by the pointers.
the value of a is indeterminate unless realloc failed, that is b = NULL.
In practice, I see most developers neglecting or ignoring the condition where b is NULL; realloc, malloc, and calloc are assumed to always succeed.
On their platform of choice, it may be the case that the system kills the process with OOM when allocation fails, instead of these procedures returning NULL.
>> Quick summary: (low + high) / 2
> In C? Assuming low and high are unsigned
> (low >> 1) + (high >> 1) + (low & high & 1). Ick.
(low & high) + ((low ^ high) >> 1);
Every bit in the & expression would be included twice in the sum, and then halved. So leave them alone.
Every bit in the ^ expression would be only included once in the sum, on one side or the other, we care not, and then halved. So just half those.
Also FatPhil on SoylentNews, id 863
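The bit-twiddling form above, written out as a function for anyone who wants to try it. The name avg_floor is made up for this sketch; it assumes unsigned operands, as in the quoted question.

```c
/* Floor of (low + high) / 2 for unsigned values, without overflow.
   Bits set in both operands (low & high) would appear twice in the sum
   and then be halved, so they are kept as they are.
   Bits set in exactly one operand (low ^ high) appear once, so they
   are halved with a shift. */
unsigned int avg_floor(unsigned int low, unsigned int high)
{
    return (low & high) + ((low ^ high) >> 1);
}
```

Unlike the unsigned-cast trick, this never forms the full sum, so it works across the whole unsigned range.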
Yes it is. Evaluating the expression "a" causes undefined behaviour if "a" is indeterminate. "a" is considered to no longer have a value, any attempt to refer to its value causes UB.
No.... it's *a that becomes indeterminate. I suppose the case could be made, that some strange clause of one of the new drafts could be interpreted as 'a' the pointer, instead of the value of the pointer becoming indeterminate. But nonetheless, in traditional C, it is perfectly valid, and a compiler will have to support it, to avoid breaking backwards compatibility.
I suppose this becomes something like the argument, that the programmer should not be able to rely on MySQL coercing a NULL value to 0, when inserting a SQL row, containing a NOT NULL column, where the NULL has been presented; after all, the formal papers from the SQL standardization effort don't show that the DB engine can coerce NULL in such a manner.
Nonetheless, there is sometimes one established convention or another that predates the standard, and has more authority than the standard.
> Switching rings was too damn slow. So slow, that only academics ever used them, and not very well. It is no wonder it died.
I can't think of any current system that doesn't have something running at ring 0 and something running at ring 3.
Also FatPhil on SoylentNews, id 863
The memory heap manager library is passed the value of the pointer, not a reference to the pointer itself.
The C standard plays it safe even for hardware which does paranoid checking on pointers. The hardware may barf just from trying to assign unmapped address to a memory access register even before you actually try to dereference it.
That's very strange - I thought one of the design goals of Clang was to raise an error for undefined behaviour. I've certainly noticed that happening much more in Clang than in MSVC, in which I've had many cases of the compiler 'knowing what I mean' and then filling in the blanks.
The problem with C compilers that remove unstable code is that nearly every C program fed into it gets optimized down to hello_world. These new C compilers can just spit out a warning: "Your program is unsafe", and go on their merry way now. Actually having the compiler perform a check to verify that this is the case removes the 0.5% false positives, but most will find the extra compilation time not worth that.
I'm exaggerating of course. Hello_world is horribly unsafe too, because it uses printf.
No, I meant as written. When you encounter something undefined, turn the optimizer off and do the statements exactly as written. Whatever happens, make it happen in the order and steps that the programmer wrote. It'll still be wrong (most likely) but it'll be the programmer's wrong, not the compiler's wrong.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
From the blog post:
Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the "int" as if it were a "float"). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results. The rules for this are quite nuanced and I don't want to go into the details here (there is an exception for char*, vectors have special properties, unions change things, etc). This behavior enables an analysis known as "Type-Based Alias Analysis" (TBAA) which is used by a broad range of memory access optimizations in the compiler, and can significantly improve performance of the generated code. For example, this rule allows clang to optimize this function:
float *P;
void zero_array() {
int i;
for (i = 0; i < 10000; ++i)
P[i] = 0.0f;
}
Okay, maybe it's too early in the morning but where exactly did this function cast an int* to a float*? Where's the "undefined behavior"?
And anyway, how is casting int* to float* undefined behavior? You set this pointer to that pointer and now you're looking at the data a different way. It won't be sensible data unless you know what architecture you're programming to, but who says you don't?
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
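For reference, the memcpy route that the quoted blog text mentions looks roughly like this. It is a sketch under two assumptions that the C standard does not guarantee: float and uint32_t have the same size, and float uses the IEEE 754 single-precision format. The name float_bits is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Returns the object representation of a float as a 32-bit integer.
   Copying bytes with memcpy is defined behaviour; reading the float
   through *(uint32_t *)&f is what the aliasing rules forbid. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Mainstream compilers generally turn the fixed-size memcpy into a plain register move, so the defined version need not cost anything at run time.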
I'd expect it to work left to right, pass f(1,2) and leave the value 3 in i. Just like everything else in the language that works left to right and top to bottom. If I got any other result, I'd call the compiler broken.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
That's probably the best answer (at least by some metric), but it needs just a tiny tweak to be perfect: a/2 + (a%2 + b%2)/2 + b/2.
If you have a platform where INT_MIN is odd and a negative number divided by 2 rounds toward negative infinity* (I'm pretty sure such a platform is legal), then avg(INT_MIN, INT_MIN) will overflow. I think my tweak fixes that.
* I'm not 100% sure of what standards allow what behavior, but at least in C++98/03, -3/2 can return either -1 or -2, as long as (a/b)*b + a%b == a (so if -3/2 == -1 then -3%2 == -1 and if -3/2 == -2 then -3%2 == 1)
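The tweaked expression as a function, for experimenting. This sketch assumes C99 or later, where integer division is required to truncate toward zero, so the rounding question in the footnote does not arise. The name avg_int is made up here.

```c
#include <limits.h>

/* Average of two ints without ever forming a + b.
   Each operand is halved first; the middle term restores the unit that
   is lost when both remainders have the same sign.
   For mixed cases the result is within one of the exact average. */
int avg_int(int a, int b)
{
    return a / 2 + (a % 2 + b % 2) / 2 + b / 2;
}
```

With this form avg_int(INT_MAX, INT_MAX) gives INT_MAX and avg_int(INT_MIN, INT_MIN) gives INT_MIN, the two cases where the naive (a + b) / 2 overflows.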
> I find it very curious that you'd be prohibited from using the value of the pointer proper. Do you have a citation for it being undefined?
N1570 L.3p2
"The value of a pointer that refers to space deallocated by a call to the free or realloc function is used (7.22.3)."
realloc() *always* deallocates, unless it returns NULL. Simply growing an area counts as a deallocation of the old area followed by a new allocation at the same spot.
Also FatPhil on SoylentNews, id 863
That reminds me of this gem: Overflow in sorting algorithms
Why would anyone make array indices signed ints in the first place? As a C programmer that sets off alarm bells for me, so I'd immediately suspect the rest of the function too. I'm amazed no-one noticed it for so long. Did nobody review that code?
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
There's a catch: the expression f(tmp = a, tmp, tmp) contains no sequence points. Comma used as a separator for function arguments is not regarded as comma the operator. So unless you explicitly add parentheses to force the comma to be treated as the operator, any side-effects made inside the function argument list may not take effect until after the function returns.
In this instance there is a better way. We know that low < high because we already tested for low==high as our loop exit condition. Therefore we can safely add 1 to low without fear of an overflow.
ave = ((low+1 ) >> 1) + (high >> 1)
This takes care of rounding up. Alternatively, just re-write the binary search to work with rounding down, a trivial modification. Then you can just do
ave = (low >> 1) + (high >> 1)
which can be compiled down to three assembly instructions on many architectures.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Not following the logic that an infinite loop is "undefined." Seems pretty well defined to me. Bugged. But perfectly well defined.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
An interrupt is not a good way to handle integer overflow, especially since it is often the desired behaviour. Most modern CPUs can detect overflow and set a CPU flag, which the code can then test. That's the best way to handle it - test the flag if you care, or not if you don't.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
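On GCC and Clang the "test the flag if you care" approach is reachable from C through a compiler builtin. A sketch, assuming one of those compilers: __builtin_add_overflow is their extension, not standard C, and the wrapper name checked_add is invented for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b.
   Returns true and stores the sum in *out when it fits in an int.
   Returns false and leaves *out untouched when the addition would
   overflow. Relies on a GCC/Clang extension. */
bool checked_add(int a, int b, int *out)
{
    int sum;
    if (__builtin_add_overflow(a, b, &sum))
        return false;
    *out = sum;
    return true;
}
```

On x86 these compilers typically compile the builtin to an add followed by a branch on the overflow flag.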
The result of arithmetic overflow is undefined in the language, not the architecture. Don't warn. Punt to the CPU and accept its result. If you optimize in a way that potentially corrupts the result from the CPU then I expect you to warn... and give me a compiler flag to turn the specific optimization off.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
You sound like one of these Terrorists who want to deny Raytheon their multi-billion dollar Cyber Warfare Pork !
Well defined, but sometimes wrong:
arch x86_64, gcc 4.7.2 20121109 (Red Hat 4.7.2-8) (GCC)
char: 7f , 7f = 7f
unsigned char: ff , ff = ff
short: 7fff , 7fff = 7fff
unsigned short: ffff , ffff = ffff
int: 7fffffff , 7fffffff = ffffffff
unsigned int: ffffffff , ffffffff = 7fffffff
long: 7fffffffffffffff , 7fffffffffffffff = ffffffffffffffff
unsigned long: ffffffffffffffff , ffffffffffffffff = 7fffffffffffffff
(Under either -O0 or -O2. Sizes smaller than int work because the arithmetic will be done under promotion up to int. Sizes int and larger fail because the MSbit is lost.)
No, I meant as written.
When you say "as written", what you seem to mean is a different compiler spec that you believe is obvious. I don't believe "as written" has any meaning, because that spec just doesn't exist and each programmer might have a different definition of what "as written" means. That's why we have specs.
This is a common error. On real hardware, arithmetic overflow is well-defined. It may not be the behavior you want, but it's well-defined. In the C standard, arithmetic overflow is undefined. Sure, if you intentionally overflow ints, you will get something, based on the hardware behavior of the instructions generated by the C compiler. But according to the C standard, arithmetic overflow is undefined. That means that if you write in C, and what you write in C overflows, you may not have any expectations about the behavior: it can be anything.
This brings up another tricky thing in C: pointers that are not pointers.
Arrays are just pointers to pre-allocated memory. This allocates both the memory for the array (100 bytes) and the memory for a pointer to the array (typically 4 bytes on a 32 bit system):
uint8_t a[100];
&a gives you a pointer to a pointer to the first element of the array. Consider this:
typedef struct {
uint8_t a[100];
} oddity_t;
oddity_t st;
Here we have a struct 100 bytes in size. However, st.a is not a pointer. &st.a gives you the pointer of the first element of the array. a itself gives you the first element of the array and is of type uint8, but I'm not sure if that is undefined/compiler dependent and can't be bothered to look it up now.
Now try doing sizeof(a) and sizeof(st.a).
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
From your link:
Lets look at an example: even though invalid type casting bugs are frequently exposed by type based alias analysis, it would not be useful to produce a warning that "the optimizer is assuming that P and P[i] don't alias" when optimizing "zero_array" (from Part #1 of our series).
float *P;
void zero_array() {
int i;
for (i = 0; i < 10000; ++i)
P[i] = 0.0f;
}
But that statement makes no sense. P[i] is not a pointer, it's a single float. It can't be an alias for the pointer P. Hence there is no assumption for the optimizer to warn about.
P+i would be a pointer. But P[i] is the same as saying *(P+i).
And yes, I did see your comment. Did you see my response?
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Sorry, I was thinking C++, where it is defined.
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Everyone who cares about correctness of their C or C++ programs at all, should read carefully all of the following articles:
Dangerous Optimizations and the Loss of Causality (PDF)
Understanding Integer Overflow in C/C++ (PDF)
A Guide to Undefined Behavior in C and C++, Part 1 (blog post)
Finding Undefined Behavior Bugs by Finding Dead Code (blog post)
Then complain to your local compiler manufacturer. :P
There are at least 191 kinds of undefined behaviour in C99. It's not reasonable to expect programmers to always perfectly avoid all of them.
We need compiler-writers to meet us half-way, by agreeing not to aggressively optimize out code that triggers undefined behaviour the way they are doing. There are millions of existing time-bombs in billions of lines of C/C++ code out there, any one of which could suddenly become a serious problem when a new version of a popular compiler starts taking advantage of the fact that it is relying on undefined behaviour and optimizes it out, breaking code that used to work (even though that code was "incorrect"). Compiler writers need to be shamed into not doing this, because it's a bad thing for everything and everyone except for their fucking benchmarks.
All modern optimizing compilers can delete code that invokes undefined behavior. All modern optimizing compilers assume integer overflows won't happen, allowing them to do things with loop induction vars that they just couldn't do if they had to accommodate the possibility of 2's complement overflow. Programmers need to learn to avoid undefined behavior, because the compiler will fuck you if you don't.
Microsoft's compiler does it too.
I read the article but I'm not following the author's point.
int a = 0x12345678;
short *b = (short *)&a;
b[1] = 0;
What's wrong with that? You take the location of a and assign it to a pointer to a short. b[0] now contains the first two bytes that comprise a and b[1] now contains the second two bytes that comprise a. Which index contains 0x1234 and which one contains 0x5678 (before b[1]=0 sets it to zero) depends on the endianness of your machine but that's beside the point. It's very clear what this statement should do: act on the bytes which comprise the 32 bit integer and do it 16 bits at a time.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
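If the goal is simply to zero one 16-bit half of a 32-bit value, masking does it without any pointer cast, so neither the aliasing rules nor endianness come into play. A sketch with made-up function names:

```c
#include <stdint.h>

/* Clears the upper 16 bits of a 32-bit value with a mask. */
uint32_t clear_high16(uint32_t v)
{
    return v & 0x0000FFFFu;
}

/* Clears the lower 16 bits of a 32-bit value with a mask. */
uint32_t clear_low16(uint32_t v)
{
    return v & 0xFFFF0000u;
}
```

Which of the two corresponds to b[1] = 0 in the snippet above depends on byte order, which is exactly the ambiguity the masks avoid.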
NULL pointers are actually GOOD, because they will facilitate Fast Fail
Only if your compiler's manual states that it extends the C language such that NULL pointers facilitate Fast Fail. The point of the article is that dereferencing NULL pointers in C is undefined behavior, and compilers aren't required to facilitate Fast Fail in cases of undefined behavior. The proper way to facilitate Fast Fail in C or C++ is to check for NULL pointers near the start of each function using a construction like assert.
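A sketch of the check-first ordering, using an explicit test rather than assert so that it also survives builds with NDEBUG defined (which compiles assert away). The function set_value and its error convention are hypothetical.

```c
#include <stddef.h>

/* Validates the pointer before any dereference.
   Returns -1 for a null pointer; otherwise stores value through p
   and returns 0. */
int set_value(int *p, int value)
{
    if (p == NULL)
        return -1;      /* fail fast, before *p is ever evaluated */
    *p = value;
    return 0;
}
```

Because the test comes before the first dereference, the compiler cannot infer that p is non-null and drop the check.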
I suspect it's because of signed ints being the default. Many people simply think "float or int?" and aren't even thinking about the sign unless they specifically want to double their addressable space. Also the habit of using -1 as an error code prevents unsigned ints being passed back from functions.
Unsigned ints are very deliberate so perhaps only used a fraction of the time they could be, maybe predominately in structs.
That's a research paper just waiting to be written up. :-)
Cwm, fjord-bank glyphs vext quiz
What would be a safer low-level systems language than C? I'd love to see one, preferably, one with a LOT less undefined behavior, but still within 2-3x the performance of C, and with the ability to call C or C++ libraries when necessary. I'm not looking for a fully managed environment like Java or .NET, or a higher level language like Python or Ruby. Definitely *not* looking for C++ either . . . I know one can write safer code in it but one can also, quite by accident, write very unsafe code in it as well. Maybe something like D?
Nonaggression works!
The average of (UINT_MAX/2 + 1000) and (UINT_MAX/2 - 500) cannot be expressed in a standard int. If you're assigning the value to an unsigned int, you get UINT_MAX/2 + 250, which is reasonable. You can't, by definition, represent that number in an int. In that case, why are you complaining about a specific result, when all possible results are wrong?
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Wrong. Unsigned arithmetic in C is defined as modular. Signed arithmetic isn't, and overflows are undefined behavior.
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
Wrong. Only signed overflow and underflow are undefined. Unsigned arithmetic is, indeed, defined as modular. From ISO/IEC9899:TC3:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
"The state is that great fiction by which everyone tries to live at the expense of everyone else." - Bastiat
No, the response I got was that since the order of evaluation of function arguments is undefined, they can even be done in parallel. Each of the two expressions has sequence points within them, but the comma in the function call does not define a sequence point.
It isn't about the order of f() and g() being evaluated, but the two arguments to x():
x( (tmp = 1, f(tmp) + g(tmp) ), (tmp = 2, f(tmp) + g(tmp) ) );
Now, the value of tmp after the call to x() is obviously undefined, but apparently even the two arguments to x() are undefined.
Maybe the specific language specification has changed since then, I don't know, this was around 10-15 years ago, on an Alpha with DEC's ucode-based optimizing compiler.
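One way to sidestep the question entirely is to compute each argument in its own statement. A sketch, in which the bodies of f, g and x are invented placeholders (the post does not say what they do):

```c
/* Placeholder functions, purely for illustration. */
static int f(int v) { return v * 10; }
static int g(int v) { return v + 1; }
static int x(int first, int second) { return first - second; }

/* Each argument is computed in a separate full statement, so the
   writes to tmp are sequenced and the result no longer depends on
   the order in which function arguments are evaluated. */
int call_x_sequenced(void)
{
    int tmp;

    tmp = 1;
    int first = f(tmp) + g(tmp);    /* 10 + 2 = 12 */

    tmp = 2;
    int second = f(tmp) + g(tmp);   /* 20 + 3 = 23 */

    return x(first, second);        /* 12 - 23 = -11 */
}
```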
http://www.youtube.com/watch?v=2taViFH_6_Y
It can, however, be expressed in an unsigned int. And you do NOT get UINT_MAX/2 + 250 if you naively average two variables with those values; you get 250.
$ ./foo
a 2147484647, b 2147483147, expected_avg 2147483897, avg 249, real_avg 2147483897
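The numbers in that output can be reproduced with fixed-width types. A sketch with made-up function names; avg_safe assumes the caller passes lo <= hi.

```c
#include <stdint.h>

/* Naive average: the sum is reduced modulo 2^32 before the division,
   which is defined behaviour for unsigned types but not the
   mathematical average. */
uint32_t avg_naive(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return sum / 2;
}

/* Overflow-free average: halves the distance instead of the sum.
   Assumes lo <= hi. */
uint32_t avg_safe(uint32_t lo, uint32_t hi)
{
    return lo + (hi - lo) / 2;
}
```

With a = 2147484647 and b = 2147483147 the wrapped sum is 498, so avg_naive returns 249, while avg_safe(b, a) returns 2147483897.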
Here we have a struct 100 bytes in size. However, st.a is not a pointer.
Where? Of course st.a is a pointer.... there are rules in C of pointer-array interchangeability.
st.a has type uint8_t [100]. And sizeof both of those is the same.
&st.a has type uint8_t (*)[100]
Just try it.... http://pastebin.com/7an0MS9g
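In case the pastebin link goes away, here is a small sketch along the same lines (not the pastebin's contents; the function names are invented). It shows that sizeof sees the whole array, and that the array converts to a pointer to its first element when used in an expression.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t a[100];
} oddity_t;

/* sizeof applied to an array yields the size of the whole array,
   not the size of a pointer. */
size_t array_size(void)
{
    oddity_t st;
    return sizeof st.a;                 /* 100 */
}

/* In most expressions the array converts to a pointer to its first
   element. &st->a has a different type but the same address. */
int same_address(oddity_t *st)
{
    uint8_t *first = st->a;             /* uint8_t *        */
    uint8_t (*whole)[100] = &st->a;     /* uint8_t (*)[100] */
    return (void *)first == (void *)whole;
}
```

So the array itself is not a stored pointer, but it behaves like one wherever the conversion applies.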
Remember: when you are defining an array of two dimensions dynamically, there are two completely different approaches.
1. List of pointers technique
int **Array = malloc(sizeof(int*) * OuterMax);
And 2. Built-in array type
Both result in the a[i][j] notation.
Both are structured completely differently.
Hilarity ensues, when a programmer accidentally forgets which type of 2D array it is, and tries to resize (a[i]); or bcopy/memcpy on 'a' in an array created using the list-of-pointers technique
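A sketch of the two layouts side by side, with small fixed dimensions. ROWS, COLS and the function names are chosen for illustration only.

```c
#include <stdlib.h>

enum { ROWS = 3, COLS = 4 };

/* 1. List of pointers: an array of ROWS row pointers, each row
      allocated separately. Rows are not contiguous with each other. */
int **make_pointer_list(void)
{
    int **a = malloc(sizeof *a * ROWS);
    if (a == NULL)
        return NULL;
    for (int i = 0; i < ROWS; i++) {
        a[i] = calloc(COLS, sizeof **a);
        if (a[i] == NULL) {             /* undo the partial allocation */
            while (i-- > 0)
                free(a[i]);
            free(a);
            return NULL;
        }
    }
    return a;
}

/* Each row must be freed before the list itself. */
void free_pointer_list(int **a)
{
    for (int i = 0; i < ROWS; i++)
        free(a[i]);
    free(a);
}

/* 2. Built-in array type: one contiguous block of ROWS * COLS ints,
      released with a single free(). */
int (*make_contiguous(void))[COLS]
{
    return calloc(ROWS, sizeof(int[COLS]));
}
```

Both results are indexed as a[i][j], but a memcpy over the first kind copies row pointers rather than elements, which is the mix-up described above.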
Part 1 of the series has more. Here's the result, and then I'll explain:
"[The strict-aliasing rule] allows clang to optimize [zero_array] into "memset(P, 0, 40000)". This optimization also allows many loads to be hoisted out of loops, common subexpressions to be eliminated, etc. This class of undefined behavior can be disabled by passing the -fno-strict-aliasing flag, which disallows this analysis. When this flag is passed, Clang is required to compile this loop into 10000 4-byte stores (which is several times slower) because it has to assume that it is possible for any of the stores to change the value of P."
Now, for the explanation (I don't think the LLVM blog explains well):
That code, taken on its own, doesn't violate the strict-aliasing rules or have UB. The UB would arise (unrelated, so far, to zero_array) if you wrote something like P = (float*)&P;
If you did that and then called zero_array, what would happen (practically speaking, when there is no optimization) is that on the first iteration of the loop the compiler would write 0.0f at the address of P[0] = *(P+0) = *((float*)&P), which is the storage of P, thus changing the value of P itself. On the next loop iteration, P would have changed.
The strict-aliasing rule allows the compiler to assume that P does not change between loop iterations, which allows it to generate better code.
The short answer is because "the standard says so".
But here's a very realistic situation for which forcing semantics would be very detrimental. Assume you're on a platform with 32-bit ints, 64-bit doubles, and for which a 64-bit memory load must be aligned to 8 bytes. (This is a very realistic architecture.) Now suppose you do a somewhat different type-punning cast:
What should this code do? If you run it on the architecture I said above with a naive compilation, it will probably bus error: probably x will be nicely-aligned but then y will probably be exactly not on an 8-byte boundary and when foo dereferences pd it will be a misaligned load.
In the absence of the strict-aliasing rule -- if the load from address pd had to produce at least some value -- the compiler would have to assume that every memory access it could not establish as safe could potentially be misaligned, and either insert code to catch the trap if possible or perform the appropriate correction.
There are other ways in which the strict-aliasing rule makes sense (e.g. similar code but y is at the end of a page for which the next page isn't mapped), but that's probably the most convincing one I can come up with off the top of my head because most of the others would involve made up memory models and stuff that have probably never been built but are permitted by the standard anyway. :-)
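For comparison, here is a sketch of how a program can read a double from an address that may not be suitably aligned without relying on a type-punning cast. The helper names are hypothetical. Copying the bytes with memcpy is well defined regardless of alignment, and it is roughly the "appropriate correction" the compiler would otherwise have to insert everywhere.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: read a double from any address, aligned or not. */
double load_double_unaligned(const void *p) {
    double d;
    assert(p != NULL);
    memcpy(&d, p, sizeof d);    /* byte-wise copy, no alignment requirement on p */
    return d;
}

/* Hypothetical helper: store a double to any address, aligned or not. */
void store_double_unaligned(void *p, double d) {
    assert(p != NULL);
    memcpy(p, &d, sizeof d);
}
```

On platforms that tolerate misaligned loads, compilers commonly turn such a memcpy into a single load, so the cost is only paid where the hardware requires it.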
Cool, thanks!
...either insert code to catch the trap if possible or perform the appropriate correction
Which, of course, in the latter case would be slllllooooow.
I've already written way too much for this /. story (but this is the kind of story that fits very snugly in my areas of interest and, to a lesser extent, expertise), but it just occurred to me: my example is much more convincing if you substitute char for int and "any_type_larger_than_1_byte_t" for double. :-)
In other words, without interprocedural optimizations, a compiler for a platform where misaligned loads and stores cause a trap could never compile
to a simple load instruction unless it was prepared to deal with the trap and restart.
An interrupt is not a good way to handle integer overflow, especially since it is often the desired behaviour.
Very seldom, if ever, is integer overflow desired behavior. Other than for computing simple checksums, there are very few use cases.
Loop counters, RNGs, timers, doing maths on values with more bits than the architecture can handle at once. I can think of many situations where it is used.
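Two of those cases, sketched in C under the assumption of a free-running 32-bit tick counter (the function names are hypothetical). Note that in C only unsigned arithmetic wraps by definition; signed overflow is undefined behavior, so code that wants wraparound should use unsigned types as here.

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 32-bit counter.
   The subtraction is modulo 2^32, so it stays correct across a wrap. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now) {
    return (uint32_t)(now - start);
}

/* One step of a linear congruential generator; it relies on the product
   wrapping modulo 2^32. Multiplier and increment are the Numerical Recipes
   constants. */
uint32_t lcg_next(uint32_t state) {
    return (uint32_t)(state * 1664525u + 1013904223u);
}
```

For example, a reading of 0xFFFFFFF0 followed by 0x00000010 after the counter wraps gives an elapsed count of 0x20, which is what a timer wants.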
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
What I mean by "not a pointer" is that it does not contain an address. If you printf it you get the first value of the array, not a pointer to the array. You have to use an ampersand to get the address of the first item of the array.
Neither arrays nor pointers work that way. There are good reasons why, it's just confusing and inconsistent.
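A small sketch of how arrays and pointers relate in standard C, using a hypothetical four-element array. In most expressions an array name converts to a pointer to its first element (so passing it to printf with %p, cast to void *, prints an address, not the first value); sizeof and unary & are the main exceptions.

```c
#include <stddef.h>

int values[4] = {10, 20, 30, 40};   /* hypothetical example array */

/* In a return expression the array name converts to &values[0]. */
const int *first_element(void) { return values; }

/* Under sizeof, `values` is still the whole array, not a pointer. */
size_t array_bytes(void) { return sizeof values; }

/* &values is the same address, but its type is int (*)[4]. */
const void *array_address(void) { return &values; }
```

So the array object itself holds the elements rather than an address, yet using its name almost anywhere yields a pointer, which is where much of the confusion comes from.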
In that program, you're sticking lots of values where they just don't fit. The variables a and c are initialized with values that don't fit in an int, and avg will return a value that doesn't fit in an int. You are also printing out ints with a %u specifier, which doesn't match. Given that there's no actual requirement for a C implementation to use two's complement, the result of printing out an int with %u is not well defined (I don't know which category of "not well defined" it falls in, offhand).
Since the variables you use cannot hold the values you use, any value is incorrect, and you're asking that it be incorrect in a way you like. If you use variables of a type that can hold the values (unsigned int, possibly long or long long), I think you'll find that everything looks fine.
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes
It's *your* responsibility as a C programmer to find out what the rules of the game are; you should accept that responsibility. However ...
You're correct that I messed up some types. However, making everything unsigned produces exactly the same results.
$ ./foo
a 2147484647, b 2147483147, expected_avg 2147483897, avg 249, real_avg 2147483897
This is a well-defined program, but avg() returns an incorrect value. The issue is the intermediate value (a + b), which is well-defined, but is 498 instead of UINT_MAX + 499.
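The program under discussion is not shown, so the following is only a sketch of the same effect with hypothetical function names and fixed-width 32-bit unsigned types. The naive average wraps in the intermediate sum, while the second version never forms a + b and so cannot wrap.

```c
#include <stdint.h>

/* Naive average: the intermediate sum wraps modulo 2^32. */
uint32_t avg_naive(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) / 2;
}

/* Overflow-free average: halve each operand first, then add back the
   1 that is lost when both operands are odd. */
uint32_t avg_safe(uint32_t a, uint32_t b) {
    return a / 2 + b / 2 + (a & b & 1u);
}
```

With the values from the output above, avg_naive(2147484647, 2147483147) yields 249 and avg_safe yields 2147483897.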
" rather than have the compiler simply leave it out. "
I always knew that C was a "Bug Farm", but I never knew that some C compilers intentionally insert bugs !!