Slashdot Mirror


Arrays vs Pointers in C?

UOZaphod asks: "A recent sub-discussion on Slashdot (in which, I confess, I was involved) piqued my curiosity because of several comments made about C compiler optimizations. I was informed that said optimizations have made it so that indexing an array with the [] operator is just as fast as using an incremented pointer. When the goal is maximum performance across multiple CPU architectures, can one always assume that this is true?" "Here are my own thoughts on the issue:

For discussion purposes, I present the following two equivalent functions which reverse the contents of a string. Note that these code fragments are straight C, and do not account for MBCS or Unicode.

The first function uses array indexing:
void reversestring_array(char *str)
{
int head, tail;
char temp;
if (!str) return;
tail = strlen(str) - 1;
for (head = 0; head < tail; ++head, --tail)
{
temp = str[tail];
str[tail] = str[head];
str[head] = temp;
}
}
The second function uses pointers:
void reversestring_pointer(char *str)
{
char *phead, *ptail, temp;
if (!str) return;
ptail = str + strlen(str) - 1;
for (phead = str; phead < ptail; ++phead,--ptail)
{
temp = *ptail;
*ptail = *phead;
*phead = temp;
}
}
While there are obvious optimizations that could be done for both functions, I wanted to keep them as simple and semantically similar as possible.

Arguments have been made that the compiler will optimize the first example using register indexing built into the CPU instruction set, so that it runs just as fast as the pointer version.

My argument is that one cannot assume, in a multi-architecture environment, that such optimizations will always be available. Semantically, the expression array[index] must always be expanded to *(array + index) when the index is variable. In other words, the expression cannot be reduced further, because the value of the index is unknown at run time.

Granted, when I compiled the above examples on an x86 machine, the resulting assembly for each of the two functions ended up looking very similar. In both cases, I enabled full compiler optimization (Pentium Pro). I will present just the inner loop for each function...

The array function:
forloop:
mov bl,byte ptr [esi+edx]
mov al,byte ptr [ecx+edx]
mov byte ptr [ecx+edx],bl
mov byte ptr [esi+edx],al
inc esi
dec ecx
cmp esi,ecx
jl forloop
The pointer function:
forloop:
mov bl,byte ptr [ecx]
mov dl,byte ptr [eax]
mov byte ptr [eax],bl
mov byte ptr [ecx],dl
inc ecx
dec eax
cmp ecx,eax
jb forloop
While this example appears to prove the claim that compiler optimizations eliminate the differences between array and pointer usage, I wonder if it would still be true with more complicated code, or when indexing larger structures.

I'd certainly be interested in hearing more discussion on the matter, accompanied by examples and references."

4 of 308 comments (clear)

  1. It optimizes out by klossner · · Score: 5, Interesting
    An optimizing compiler, such as gcc -O, will rearrange the array code into the pointer code -- it doesn't require a base-index address mechanism. This is called strength reduction.

    Back in the day, we all learned about this because a compiler construction course was required for a comp sci degree.

  2. GCC experimental results by RML · · Score: 5, Interesting

    Just for fun, I tried the sample code on gcc (GCC) 4.1.0 20050723 (experimental), with -O3 -march=pentium-m. The loop from the array version:

    L13:
    movzbl  -1(%ebx), %edx
    movl    %esi, %ecx
    decl    %edi
    movl    8(%ebp), %eax
    movb    %dl, -13(%ebp)
    movzbl  -1(%esi,%eax), %edx
    movb    %dl, -1(%ebx)
    decl    %ebx
    movzbl  -13(%ebp), %edx
    movb    %dl, -1(%esi,%eax)
    incl    %esi
    cmpl    %ecx, %edi
    jg      L13

    The loop from the pointer version:

    L5:
    movzbl  1(%esi), %edx
    movl    %esi, %ecx
    movzbl  (%ebx), %eax
    movb    %al, 1(%esi)
    decl    %esi
    movb    %dl, (%ebx)
    incl    %ebx
    cmpl    %ecx, %ebx
    jb      L5

    Time to execute the array version 100,000 times on a 10,000 character string: 0m4.515s
    Time to execute the pointer version 100,000 times on a 10,000 character string: 0m3.936s

    So the pointer version actually generates somewhat faster code with the compiler I used on this example, which surprises me. But there's no substitute for actually testing.

    --
    Human/Ranger/Zangband
  3. Re:Why do you care? by JanneM · · Score: 5, Insightful

    For any real programming task, the question has to be: why do you care baout that? Is it, specifically, a bottleneck in your code as detected with profiling tools?

    When the programming task is something like real-time image processing (computer vision), this kind of thing can make a serious difference. If 90% of your time is spent running these kinds of loops over and over again, an improvement in time will make a real difference on what combination of methods you have time for; or how exhaustively you can search for features during one frame; or what resolution image you are able to work on.

    If your code does something nice and graphical where 99% of the time is spent waiting for the user, sure, you're absolutely right. And if your system is doing something inherently bounded - it works until it's done, then it stops and waits until it's time again - then all you need is to make it fast enough and no faster. But there are real-world systems that today, and for the foreseeable future, are bounded by the available processing power and that can always benefit from any improvement in execution time.

    --
    Trust the Computer. The Computer is your friend.
  4. From the Compiler Trenches by Anonymous Coward · · Score: 5, Informative
    I develop highly optimizing compilers for a living, so use that to put in context what I'm going to say. Things look very different from the other side of the source file.

    I'll admit that I'm always slightly bemused by these sorts of discussions. That bemusement quickly turns to disiniterest after I realize that a lot of people are burning a lot of cycles arguing about it.

    Quite frankly, your example has little relevance in the real world. The optimization you are talking about is covered by strength reduction, as others have pointed out. But that's not the point of my message. This sort of piddly optimization means almost nothing when one looks at the whole application. If this piece of code is in an inner loop that takes 90% of the application time and it proves to be the bottleneck, then one can think about taking a closer look.

    We have customers that come to us all the time with just such examples. They literally tell us, "You missed an opportunity to use a register here," and we know it's important because we know the customers are serious about profiling code and finding bottlenecks. So when they come to us we are happy to look at the "piddly stuff."

    I've seen all kinds of different code. They basically fall into two categories: code that the compiler can do something significant with and code the compiler has no hope of automatically optimizing in any truly meaningful way. When I say "significant" and "meaningful," I explicitly mean "not Dragon Book stuff, except for scheduling (which can provide a significant performance win)" Simple optimizations like common subexpression elimination and copy propagation are more useful at enabling other optimizations than in any cycles gained in their own right. They are important, but not to make the code run significantly faster in and of themselves.

    If one writes an application that does a lot of branching and pointer chasing (say, like, a traditional compiler*) then there's not much an optimizer can do with it. The aliasing difficulties alone will kill most optimizations. It's more important to write these kinds of applications in an understandable way because that is where the programmer time is most costly.

    That said, judicious use of directives for compilers that support them can go a long way toward making these kinds of codes run really fast. Think of threading a tree search, for example. But the compiler is not going to have much hope of converting such a low-level piece of code without help.

    An example of code that a compiler can do really, really well on are the traditional scientific applications. Here parallelism is everything, be it data-level, thread-level or at the distributed memory level (for really big machines). In these cases the loop optimizations are orders of magnitude more important speed-wise than sequential optimization (though, as I said, sequential optimization can enable some of these loop transformations). Some of the more important loop restructuring transformations are:

    • Interchange
    • Unroll
    • Fusion
    • Fission
    • Unroll-and-Jam
    • Invariant hoisting (both data and control)
    • Vectorization
    • Cache Blocking

    When the compiler is done with these, one hardly recognizes the code anymore. :)

    In my experience, the fundamental problem is that compilers are really hard to understand. People argue about what a compiler can and cannot do because they are enormously complex systems that require arcane knowledge of language standards and hardware architecture to really dig into. It's slightly less difficult to understand the broad strokes, for example, simple cases of vectorizable loops. It's a lot more difficult to understand how to parallelize a loop that compresses a sparse array into a dense one.

    If there's one lesson that I like to convey to programmers, it's to not sweat the optimization details. Don't hand-optimize the code. I can tell you fro