Slashdot Mirror


Arrays vs Pointers in C?

UOZaphod asks: "A recent sub-discussion on Slashdot (in which, I confess, I was involved) piqued my curiosity because of several comments made about C compiler optimizations. I was informed that said optimizations have made it so that indexing an array with the [] operator is just as fast as using an incremented pointer. When the goal is maximum performance across multiple CPU architectures, can one always assume that this is true?" "Here are my own thoughts on the issue:

For discussion purposes, I present the following two equivalent functions which reverse the contents of a string. Note that these code fragments are straight C, and do not account for MBCS or Unicode.

The first function uses array indexing:
void reversestring_array(char *str)
{
int head, tail;
char temp;
if (!str) return;
tail = strlen(str) - 1;
for (head = 0; head < tail; ++head, --tail)
{
temp = str[tail];
str[tail] = str[head];
str[head] = temp;
}
}
The second function uses pointers:
void reversestring_pointer(char *str)
{
char *phead, *ptail, temp;
if (!str) return;
ptail = str + strlen(str) - 1;
for (phead = str; phead < ptail; ++phead,--ptail)
{
temp = *ptail;
*ptail = *phead;
*phead = temp;
}
}
While there are obvious optimizations that could be done for both functions, I wanted to keep them as simple and semantically similar as possible.

Arguments have been made that the compiler will optimize the first example using register indexing built into the CPU instruction set, so that it runs just as fast as the pointer version.

My argument is that one cannot assume, in a multi-architecture environment, that such optimizations will always be available. Semantically, the expression array[index] must always be expanded to *(array + index) when the index is variable. In other words, the expression cannot be reduced further, because the value of the index is unknown at run time.

Granted, when I compiled the above examples on an x86 machine, the resulting assembly for each of the two functions ended up looking very similar. In both cases, I enabled full compiler optimization (Pentium Pro). I will present just the inner loop for each function...

The array function:
forloop:
mov bl,byte ptr [esi+edx]
mov al,byte ptr [ecx+edx]
mov byte ptr [ecx+edx],bl
mov byte ptr [esi+edx],al
inc esi
dec ecx
cmp esi,ecx
jl forloop
The pointer function:
forloop:
mov bl,byte ptr [ecx]
mov dl,byte ptr [eax]
mov byte ptr [eax],bl
mov byte ptr [ecx],dl
inc ecx
dec eax
cmp ecx,eax
jb forloop
While this example appears to prove the claim that compiler optimizations eliminate the differences between array and pointer usage, I wonder if it would still be true with more complicated code, or when indexing larger structures.

I'd certainly be interested in hearing more discussion on the matter, accompanied by examples and references."

15 of 308 comments (clear)

  1. Re:Hmm.. by Surye · · Score: 2, Informative

    Oops, messed up the tag at the top, and it ate the quote portion:

    I'd certainly be interested in hearing more discussion on the matter, accompanied by examples and references

    Preview! Gah.

  2. Strength Reduction by The+boojum · · Score: 3, Informative
    My argument is that one cannot assume, in a multi-architecture environment, that such optimizations will always be available. Semantically, the expression array[index] must always be expanded to *(array + index) when the index is variable. In other words, the expression cannot be reduced further, because the value of the index is unknown at run time.
    Yes, semantically array[index] has to have the same effect as *(array+index). But the compiler is free to generate conceptually equivalent code in any way that it pleases. Any decent C/C++ compiler optimizer that can perform strength reduction ought to be able to see how the index changes the memory location and turn it into simple pointer incrementation accordingly. And strength reduction is a well known optimization that's been around for ages -- if memory serves, even the old Red Dragon book talks about how it works in this context. If your compiler can't handle this you need to find a better compiler.
  3. There cannot be any difference. by Anonymous Coward · · Score: 2, Informative

    The whole point is dumb.
    Every C programmer should know the answer to the following question:

    What is 6["abcdefghijkl"]?

    Answer: 'g'.

    How is this determined?
    By definition x[y]=*(x+y)=y[x].

    Don't believe me, check the standard.

    ( Yeah this is a degenerate case, like Duff's device. Still a compiler has to support it. )

  4. Re:Professor Cormen said... by RAMMS+EIN · · Score: 3, Informative

    You're not completely right.

    First:

    ``int i[20] followed by int *k = i, then i[4] is the same as *(k + (4 * 4))''

    You're trying to get the 5th element of the array by using an offset of 4 times 4, assuming sizeof(int) == 4. First off, don't make that assumption; always write sizeof(int) when that's what you mean. Secondly, the C compiler automatically multiplies your offset by the size of the elements, so you would have to write *(k + 4) instead.

    Secondly, you're not getting the point (probably, you were misled by the headline, as I was). The question is not whether variables holding arrays are really holding pointers to the arrays (they are), but whether, say, iterating through an array by updating a pointer is faster than iterating by using an index variable as an offset. In other words, it's not whether a[0] is the same as *a, but whether while(*a) a++; is faster than while(a[i]) i++;.

    --
    Please correct me if I got my facts wrong.
  5. your mileage will vary, but... by blackcoot · · Score: 3, Informative

    ... my experience has been that this matters more in the multidimensional array case than in the single dimensional array case (for those who are curious: my goal is to write algorithms which do non-trivial amounts of processing on VGA or larger video at full frame rates [>= 15Hz], any time i can make array operations faster my entire program benefits significantly). when dealing with two dimensional arrays, you can either do the addressing yourself (location [i,j] in a r x c matrix maps to [i*c+j] in a flat array). if you are clever about how you explain your indexing to the compiler, it should realize that you're passing through consecutive addresses in memory and generate code accordingly. if, on the other hand, you're doing something like A[i][j], the compiler has to generate two deref ops plus pay the cost of whatever cache misses result from using the two levels of indirection --- in this case, working with pointer / index arithmetic relative to the base address is a big win.

    have you tried this with intel's c/c++ compiler or other compilers? i'd be curious to see if what you're seeing is a result of how gcc is limited in the number of optimizations it can apply directly to the parse tree because it can't assume (at that stage) a particular target machine.

  6. And C++/STL? by XenonOfArcticus · · Score: 4, Informative

    And what of STL container classes under C++?

    Seriously though, there is no generalized answer. Good compilers will do what you want. Bad compilers (and there are more than you realize out there) will make lousy code.

    If your target involves an environment where you might be using a more primitive compiler, or you can't predict the environment and compiler, it might be an issue. This is why code like the PNG and JPEG libs go for tight/cryptic. As well, the performance of the runtime platform (CPU, memory, bus) have bearing. In some cases, though one piece of assembly might look less efficient than the other, the sheer brute force of CPU parallelism, out of order execution and other black juju might render it meaningless.

    Finally, you have to consider the cost/benefit of the situtation. Making cryptic fast code is worthwhile if you're writing some wicked FFT code or image processor main loop, where it'll run a few quadrillion times. Other places in the same codebase though, it's probably worthwhile to trade absolute performance for a bit better code readability and maintainability.

    Remember, profile before optimizing. Only optimize things that will really make a significant performance difference. Rewriting the UI display loop to use pointers instead of lists is probably pointless. Heh. Pointless.

    I'm a big fan of C++ STL containers now. If I _know_ a block of code is going to be a critical bottleneck, I'll use something else from the start. I've known people who coded UIs in assembly, for no good reason, and others who wrote image processing code in interpreted RAD scripting environments. I've written and optimized code (C++/C and assembly) on systems all the way back to the 6502 (yay! two 8-bit index registers!) and it's not hard, as long as you proceed sensibly and logically.

    That being said, the Microsoft VC++6 compiler (still in common use today) has a terrible code generator. It fails to perform simple loop invariant hoisting operations that my old SAS/C++ compiler (Amiga, yah, I'm one of _them_...) did in 1990. VC++7/2003 and Whidbey/2005 are showing signs of being MUCH more caught-up, and the Intel and Codeplay compilers (despite Intel's AMD-phobia) are much better too.

    When performance really counts, a whole new set of tools and processes come into play.

    --
    -- There is no truth. There is only Perception. To Percieve is to Exist.
    1. Re:And C++/STL? by WARM3CH · · Score: 2, Informative
      OK I tried it with Visual Studio 2003 and also tried STL. Of course with STL you only need one line of code to reverse a character string:
      reverse(str, str+strlen(str));
      Now, the interesting part is what the optimizing compiler outputs for each of the three vairants (I only include the inner loop):
      With pointers we have:

      $L13583:
      mov bl, BYTE PTR [ecx]
      mov dl, BYTE PTR [eax]
      mov BYTE PTR [eax], bl
      mov BYTE PTR [ecx], dl
      inc ecx
      dec eax
      cmp ecx, eax
      jb SHORT $L13583

      With array we have:

      $L12527:
      mov bl, BYTE PTR [ecx+esi]
      mov dl, BYTE PTR [eax+esi]
      mov BYTE PTR [eax+esi], bl
      mov BYTE PTR [ecx+esi], dl
      inc ecx
      dec eax
      cmp ecx, eax
      jl SHORT $L12527

      and with STL we have:

      $L13663:
      mov dl, BYTE PTR [ecx]
      mov bl, BYTE PTR [eax-1]
      dec eax
      mov BYTE PTR [ecx], bl
      inc ecx
      cmp ecx, eax
      mov BYTE PTR [eax], dl
      jb SHORT $L13663
      Quite nice, no? ;) Now, I made one last step and used the RDTSC instruction to actually count how many clock cycles it takes to run each version of the function to reverese a 80 characters string. This way we can also see the effect of the parts not inside the loop. This is the result:
      function with the array: 661 cycles
      fucntion with the pointer: 616 cycles
      function with the STL: 607 cycles
      So although the core loop section is almost identical with the STL and pointer version, the STL version is tiny bit better with the setup section. All in all, I think this an example to show nice and neat C++ code can compete fairly well with optimized C.
  7. Re:there is a difference... by Waffle+Iron · · Score: 4, Informative
    what you in fact have is in the array 4 additional additions (between registers),

    Actually, most x86s have a dedicated address generation units which handle those index additions in parallel on separate logic from the main ALU. So both cases would actually run at the same speed on a modern x86.

  8. Re:It optimizes out by Anonymous Coward · · Score: 2, Informative

    Exactly, strength reduction changes the indexing operation to straight pointer arithmetic. this is done at least when i looked a few years ago in gcc/g++, in the later phases of the compilation, so that that the compiler is rearranging the RTL to eliminate the indexing variable. you can verify strength reduction by just setting the optimisation in gcc and looking at the assembler output.

    The point is now a bit moot since for many loops you do not want either indexing or pointer arithmetic, you want SIMD instructions which are a third alternative way of implementing the the programmers looping construct. this is now done in SSA at the front end of gcc. In most cases compilers are smart enough to ignore the programmers code and minor semantic differences in code, like indexing or arithmatic will be restructured to the most optimal solution on the target architecture. So the point is, leave it to the compiler.

    jxx

  9. From the Compiler Trenches by Anonymous Coward · · Score: 5, Informative
    I develop highly optimizing compilers for a living, so use that to put in context what I'm going to say. Things look very different from the other side of the source file.

    I'll admit that I'm always slightly bemused by these sorts of discussions. That bemusement quickly turns to disiniterest after I realize that a lot of people are burning a lot of cycles arguing about it.

    Quite frankly, your example has little relevance in the real world. The optimization you are talking about is covered by strength reduction, as others have pointed out. But that's not the point of my message. This sort of piddly optimization means almost nothing when one looks at the whole application. If this piece of code is in an inner loop that takes 90% of the application time and it proves to be the bottleneck, then one can think about taking a closer look.

    We have customers that come to us all the time with just such examples. They literally tell us, "You missed an opportunity to use a register here," and we know it's important because we know the customers are serious about profiling code and finding bottlenecks. So when they come to us we are happy to look at the "piddly stuff."

    I've seen all kinds of different code. They basically fall into two categories: code that the compiler can do something significant with and code the compiler has no hope of automatically optimizing in any truly meaningful way. When I say "significant" and "meaningful," I explicitly mean "not Dragon Book stuff, except for scheduling (which can provide a significant performance win)" Simple optimizations like common subexpression elimination and copy propagation are more useful at enabling other optimizations than in any cycles gained in their own right. They are important, but not to make the code run significantly faster in and of themselves.

    If one writes an application that does a lot of branching and pointer chasing (say, like, a traditional compiler*) then there's not much an optimizer can do with it. The aliasing difficulties alone will kill most optimizations. It's more important to write these kinds of applications in an understandable way because that is where the programmer time is most costly.

    That said, judicious use of directives for compilers that support them can go a long way toward making these kinds of codes run really fast. Think of threading a tree search, for example. But the compiler is not going to have much hope of converting such a low-level piece of code without help.

    An example of code that a compiler can do really, really well on are the traditional scientific applications. Here parallelism is everything, be it data-level, thread-level or at the distributed memory level (for really big machines). In these cases the loop optimizations are orders of magnitude more important speed-wise than sequential optimization (though, as I said, sequential optimization can enable some of these loop transformations). Some of the more important loop restructuring transformations are:

    • Interchange
    • Unroll
    • Fusion
    • Fission
    • Unroll-and-Jam
    • Invariant hoisting (both data and control)
    • Vectorization
    • Cache Blocking

    When the compiler is done with these, one hardly recognizes the code anymore. :)

    In my experience, the fundamental problem is that compilers are really hard to understand. People argue about what a compiler can and cannot do because they are enormously complex systems that require arcane knowledge of language standards and hardware architecture to really dig into. It's slightly less difficult to understand the broad strokes, for example, simple cases of vectorizable loops. It's a lot more difficult to understand how to parallelize a loop that compresses a sparse array into a dense one.

    If there's one lesson that I like to convey to programmers, it's to not sweat the optimization details. Don't hand-optimize the code. I can tell you fro

  10. Re:GCC experimental results by Piquan · · Score: 2, Informative

    Unfortunately, the article didn't say what compiler he was using. But since we're giving data points:

    gcc 3.4.2 3.4.2 [FreeBSD] 20040728, x86, -O3 -march=pentium-m. Generated essentially the same code as the article's.

    Array version:

    .L12:
    movzbl (%esi,%ecx), %edx
    movzbl (%esi,%ebx), %eax
    movb %al, (%esi,%ecx)
    decl %ecx
    movb %dl, (%esi,%ebx)
    incl %ebx
    .L10:
    cmpl %ecx, %ebx
    jl .L12

    Pointer code:

    .L23:
    movzbl (%ebx), %edx
    movzbl (%ecx), %eax
    movb %al, (%ebx)
    decl %ebx
    movb %dl, (%ecx)
    incl %ecx
    .L22:
    cmpl %ebx, %ecx
    jb .L23

    1.4 GHz Athlon. Array code time: 3.274s. Pointer time: 3.322s. Single (100000x) trial of each.

    I'd say that's within noise.

  11. Re:I echo the above statements by Paul+Jakma · · Score: 2, Informative

    lower likelihood of programming errors on the array option. .... IMHO there is no place for pointer arithmetic in modern software. If someone working for me wrote something like the second option, I would ask them to rewrite it.

    You realise that pointer arithmetic and array indices are the same thing, don't you? Ie given:

    int b[100];
    int i;

    The following are equivalent:

    b[i++]

    *(b + i++)

    Arrays *are* /nearly/ equivalent to pointers, indeed an array variable without a subscript degrades to a pointer. Further, note in both cases above i may or may not be in-bounds for the array, the array notation does not help check the bounds at all, and the programmer will have to go to same trouble to check bounds properly regardless.

    It sounds to me btw like you're not qualified to decide what place pointer arithmetic have in modern software.

    --
    I use Friend/Foe + mod-point modifiers as a karma/reputation system.
  12. You can't know it a priori by Anonymous Coward · · Score: 1, Informative

    On some compiler/architectures array indexed code is faster than direct pointer access. On others the opposite is true. I discovered this while writing portable code for Mac/Win32 ten years ago. The inner loop of my pointer code was running faster on 68k/PPC as compared to x86. When I switched it to array dereferencing, the x86 was faster. I confirmed the results by looking at the disassembly of the compiled code.

    The moral of the story: write the code that is most readable/maintainable to your eyes.

  13. Photoshop plug-ins by mwvdlee · · Score: 3, Informative

    I've written quite a lot of code for Photoshop plug-ins.

    Since this type of code typically iterates over a few hunderd million pixels you'd think that changing such details as array vs. pointer or some other common optimalization technique would have an impact.

    It does; it typically shaves about a few tenths of a second off of a 5 minute calculation.

    Then again, spending that same amount of time altering the algorithm will usually increase performance in a noticable way.

    Nowadays I don't bother optimizing code (usually the compiler does a better job at it anyway) but rather optimize the algorithms. Instead of opening the topic and waiting for a definitive answer on your quest for ultimate performance, you could probably have rewritten the algorithm and gained much more performance you'd ever get this way.

    --
    Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
  14. Re:Why do you care? by Anonymous Coward · · Score: 1, Informative

    For _real_ programming tasks, like scientific computing applications that run for months at a time on supercomputers, it matters a lot. In that world, a 1% performance improvements equates to days of wall clock (and thousands of CPU-days).

    Algorithmic improvements are always the best way to get better performance. But once you can't figure out how to make an algorithm any "better," the profilers come out and hot loops are tackled. One of the portable tricks that seems to have benefit on a variety of compilers/machines is changing how floating point operations are scheduled: even native compilers seem to have problems avoiding dependency stalls. The trick to this kind of optimization is that by breaking up complicated expressions into smaller chunks, you are giving the compiler more options for applying its bag of tricks. Of course, if you break up things too much, you can miss opportunities to use instructions like muladds (a*x+b). It just takes a lot of experience on the particular platforms that your code has to run on.

    Back to the original question, I haven't seen a whole lot of difference between pointers and indexing. There are big performance differences, however, between STL vectors (as implemented in g++) and arrays. You should be wary of implementations that define operator[](const size_t index) as *(begin()+index): that stuff doesn't always inline away like you might think, especially if begin() is returning a non-trivial iterator class rather than just a T*. There are also big differences between using multidimensional arrays (e.g. in C) and 1-D arrays with multidimensional indexing (a la FORTRAN), the latter being considerably faster.

    I'm happy to see that the poster is interested in understanding code at a deeper level. There is still a call for programmers that understand the rich interaction between compilers and hardware.

    BTW: the CHUD tools on a Mac are great for learning about this sort of thing, especially the Shark profiling tool. Gotta love the built in PPC instruction reference, side-by-side source and assembly, stall counts, and pretty good advice about stupid things in the code.