Arrays vs Pointers in C?
UOZaphod asks: "A recent sub-discussion on Slashdot (in which, I confess, I was involved) piqued my curiosity because of several comments made about C compiler optimizations. I was informed that said optimizations have made it so that indexing an array with the [] operator is just as fast as using an incremented pointer. When the goal is maximum performance across multiple CPU architectures, can one always assume that this is true?"
"Here are my own thoughts on the issue:
For discussion purposes, I present the following two equivalent functions which reverse the contents of a string. Note that these code fragments are straight C, and do not account for MBCS or Unicode.
The first function uses array indexing:
Arguments have been made that the compiler will optimize the first example using register indexing built into the CPU instruction set, so that it runs just as fast as the pointer version.
My argument is that one cannot assume, in a multi-architecture environment, that such optimizations will always be available. Semantically, the expression array[index] must always be expanded to *(array + index) when the index is variable. In other words, the expression cannot be reduced further, because the value of the index is unknown at run time.
Granted, when I compiled the above examples on an x86 machine, the resulting assembly for each of the two functions ended up looking very similar. In both cases, I enabled full compiler optimization (Pentium Pro). I will present just the inner loop for each function...
The array function:
I'd certainly be interested in hearing more discussion on the matter, accompanied by examples and references."
For discussion purposes, I present the following two equivalent functions which reverse the contents of a string. Note that these code fragments are straight C, and do not account for MBCS or Unicode.
The first function uses array indexing:
void reversestring_array(char *str)The second function uses pointers:
{int head, tail;}
char temp;
if (!str) return;
tail = strlen(str) - 1;
for (head = 0; head < tail; ++head, --tail)
{temp = str[tail];}
str[tail] = str[head];
str[head] = temp;
void reversestring_pointer(char *str)While there are obvious optimizations that could be done for both functions, I wanted to keep them as simple and semantically similar as possible.
{char *phead, *ptail, temp;}
if (!str) return;
ptail = str + strlen(str) - 1;
for (phead = str; phead < ptail; ++phead,--ptail)
{temp = *ptail;}
*ptail = *phead;
*phead = temp;
Arguments have been made that the compiler will optimize the first example using register indexing built into the CPU instruction set, so that it runs just as fast as the pointer version.
My argument is that one cannot assume, in a multi-architecture environment, that such optimizations will always be available. Semantically, the expression array[index] must always be expanded to *(array + index) when the index is variable. In other words, the expression cannot be reduced further, because the value of the index is unknown at run time.
Granted, when I compiled the above examples on an x86 machine, the resulting assembly for each of the two functions ended up looking very similar. In both cases, I enabled full compiler optimization (Pentium Pro). I will present just the inner loop for each function...
The array function:
forloop:The pointer function:
mov bl,byte ptr [esi+edx]
mov al,byte ptr [ecx+edx]
mov byte ptr [ecx+edx],bl
mov byte ptr [esi+edx],al
inc esi
dec ecx
cmp esi,ecx
jl forloop
forloop:While this example appears to prove the claim that compiler optimizations eliminate the differences between array and pointer usage, I wonder if it would still be true with more complicated code, or when indexing larger structures.
mov bl,byte ptr [ecx]
mov dl,byte ptr [eax]
mov byte ptr [eax],bl
mov byte ptr [ecx],dl
inc ecx
dec eax
cmp ecx,eax
jb forloop
I'd certainly be interested in hearing more discussion on the matter, accompanied by examples and references."
In my experience, which is somewhat dated (by about 5 years), what you state is generally true for simple loops.
However, for more complicated loops the compiler fails to opimize the loop very well. My 'experience' was in writing an alogirthm which took a raw camera sensor ccd data and ordered it into a proper bayer pattern. When moving from indexer based access to pointer based access I saw a significant increase in performance.
Even basing one pointer off of another was slower than maintaining separate pointers and manipulating them via arithmatic.
The loop looked pretty nasty when it was done (the loop also mixed several other stages of the image processing pipe at once, such as application of color correction & calibration data, dead pixel correction, etc; probably had something like 30 pointers that had to be maintained), but it was quick.
I ended up taking following the same approach converting the raw ccd data to a live preview and code which generated a thumnail preview, obtaining the same kind of gains.
When I was done with the thumnail code, it was capable of generating 108 fully processed thumnails (color 768x512 8bpc image) per second on a 700mhz PIII from a 6 megapixel source image (3072x2048, 16bpc bayer image). I was proud of that at the time...
Back in the day, we all learned about this because a compiler construction course was required for a comp sci degree.
{ }
That'll teach ya.
then don't wank around optimizing
Dude, best use of the word wank. Ever!
for single cycles on a machine that probably bleeds off a million cycles every time you raise a window
Computers have become more powerful and programmers have become more lazy. It's not strictly true because instead of focussing a lot of time writing efficient code programmers are now focussing a lot of time writing a lot of code to fill bigger machines. That million cycles is wasted doing crap and probably half of doesn't need to be done anyway...
I can still remember the days when machines ran in the sub 10MHz range (yes, 10MHz is 400x slower than today's 4GHz). Software was generally responsive, functional and minimal. Adding a zillion features and eye candy was not considered necessary. Programs were easy to use and intuitive, and did I mention functional and minimal? In the days where "nobody will ever need more than 640k" software was designed and optimised to be small and chew up few cycles.
Now, with RAM and Gigahertz available for next to nix software has just bloated out. It's nice to see a programmer thinking about efficiency/size even if it is purely academic. We should be encouraging that; I know I'd like my applications to work faster and carry less crap than they do currently.
I drink to make other people interesting!
Just for fun, I tried the sample code on gcc (GCC) 4.1.0 20050723 (experimental), with -O3 -march=pentium-m. The loop from the array version:
L13:
movzbl -1(%ebx), %edx
movl %esi, %ecx
decl %edi
movl 8(%ebp), %eax
movb %dl, -13(%ebp)
movzbl -1(%esi,%eax), %edx
movb %dl, -1(%ebx)
decl %ebx
movzbl -13(%ebp), %edx
movb %dl, -1(%esi,%eax)
incl %esi
cmpl %ecx, %edi
jg L13
The loop from the pointer version:
L5:
movzbl 1(%esi), %edx
movl %esi, %ecx
movzbl (%ebx), %eax
movb %al, 1(%esi)
decl %esi
movb %dl, (%ebx)
incl %ebx
cmpl %ecx, %ebx
jb L5
Time to execute the array version 100,000 times on a 10,000 character string: 0m4.515s
Time to execute the pointer version 100,000 times on a 10,000 character string: 0m3.936s
So the pointer version actually generates somewhat faster code with the compiler I used on this example, which surprises me. But there's no substitute for actually testing.
Human/Ranger/Zangband
These types of opimizations are virtually pointless on modern machines.
.Net and Java XML parsers. And yes, the original programmers tried several different methods/libraries to tweak the code (Sax, Xerces, whatever). They got it down to a best of four and a half hours.
I call bullshit. Optimizations are important regardless of the language or CPU.
My Pentium III test machine with 256Meg of Ram blew away a dual processor Intel system with 1Gig of Ram while parsing a 30Meg XML import/export file.
It took over six hours on the dual processor system with the native
My C program parsed it in a hour and half. And yes, I used pointers. Why? Because its more efficient.
Time is money, especially when your trying to push down 20,000 price changes from the Mainframe to 2000+ POS units during the off hours. The solution? We put my C routine into a shared library callable via C# or Java. Bonus, the 'C' code gets it done under an hour on the dual CPU machine. And yes, I tested the inputs for overflows, security problems whatever, before we went into production. Theres a big difference between a programmer who knows a language vs a programmer who understands it.
IMHO there is no place for pointer arithmetic in modern software. If someone working for me wrote something like the second option, I would ask them to rewrite it.
Thats your opinion and I'm glad you shared it. You do know that your C#/Java/VB/Python etc. VM calls all wind up as pointer arithmetic to the CPU? Don't you? I wouldn't want to work for you though. Your competitors will write a faster program that uses less memory and you will loose the contract/job.
No flame intended,
Enjoy.
It's just the normal noises in here.
Whereas C's ghost is parsimony and piety, décadence haunts the house of Java.
There are optimizations and then there are optimizations. If you run this particular loop once, or twice, it won't matter. Run it ten times or even a hundred, it won't matter. In these cases it would be a pointless optimization. Profile your software before you optimize. That way you can optimize where you need it and not waste time where you don't.
I remember one code review where someone told me to use prefix instead of postfix notation, for optimization reasons. Yet it occured in an initialization routine in a background thread that would run once per day. That's like worrying about a memory leak in a power-off interrupt handler...
A Government Is a Body of People, Usually Notably Ungoverned
> 10MHz range (yes, 10MHz is 400x slower than today's 4GHz). Software was generally responsive, functional and minimal. Adding a zillion features and eye candy was not considered necessary. Programs were easy to use and intuitive, and did I mention functional and minimal? In the days where "nobody will ever need more than 640k" software was designed and optimised to be small and chew up few cycles.
You forgot to mention that keeping an 80x25 column text display updated only required moving around a maximum of 4000 bytes at a time (80x25 = 2000 bytes for the monochrome text buffer + another 80x25 colour buffer for 16 background and 16 foreground colours per text block), and that a sub 10MHz CPU would certainly struggle to animate that at 30 frames per second or higher.
My 1920x1200 colour display requires 55296000 bytes (1920x1200*24), which is 13,824 times as much as that 80x25 text display.
Now despite my 2GHz CPU only being 200 times faster than the hypothetical 10Mhz CPU, it doesn't struggle at all - not only can it animate the whole screen at 60FPS or more, but it can also calculate positions for thousands of objects at the same time.
There are certainly some areas in which the software doesn't seem to have sped up in proportion to the processor speed - but my guess is thats mostly because they ceased being CPU bound years ago, not because the CPU is flat out wasting cycles trying to do the job.
That doesn't mean it's not the programmer's fault - but if it is, it's because they decided to block the interface while they waited for a DNS query (or similar bad design decision), not because they used a pointer offset instead of an array dereference, or vice versa.
Micro-optimisation is pretty much always a waste of time, if your program really is that CPU bound that you have to start worrying about internal implementation issues then you're better off just buying a newer CPU than wasting time optimising.
And if it's not, then you've got a design or algorithm problem which you should be fixing.
My Pentium III test machine with 256Meg of Ram blew away a dual processor Intel system with 1Gig of Ram while parsing a 30Meg XML import/export file.
l l_Data.ags.gzg ram/gene2xml/
Heh, a little offtopic but - This is why I hate XML. It's so bloated. You take 1 to 6 hours parsing a 30 megabyte XML file in C? I was just tasked with parsing out some select data from a 37 gigabyte XML file (870 million lines). I tried all sorts of optimizations and parsers upon finding that it might take days to parse. My solution - 50 lines of perl using regular expressions. I run this on a dual processor 3.something with 2 gigs of ram. It takes 5 minutes. If I coded it in C it would probably take 10 seconds but it's not worth the time.
Here's the file and convertor if anyone wants to fuck around with nearly a billion (bloated XML) lines of genetic data:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/A
ftp://ftp.ncbi.nlm.nih.gov/asn1-converters/by_pro
You forgot to mention that keeping an 80x25 column text display updated only required moving around a maximum of 4000 bytes at a time (80x25 = 2000 bytes for the monochrome text buffer + another 80x25 colour buffer for 16 background and 16 foreground colours per text block), and that a sub 10MHz CPU would certainly struggle to animate that at 30 frames per second or higher.
I grew up in the days if Amiga - and they certainly didn't have an 80x25 text console... so your analogy is fundamentally flawed. All of my favourite Amiga software was lightweight, efficient and responsive (except for the rey tracing engines I used, but that did _actually_ have to do some serious calculation). In fact, my Amiga ticked along on a 800x600 screen quite nicely. Your analogy is also flawed becase not all of the screen is updated at any one time; only the parts that have changed. Oh, and 4000 bytes x 30 FPS is 120kb;
My Commodore 64 has a 700kHz processor in it and it can certainly animate the full screen at 25 frames per second albeit at a slightly reduced resolution. My Atari 2600 has a CPU of about the same speed and it was capable of keeping up with 25 frames per second (again, lower resolution) and still running the game engines just fine. Your argument doesn't hold water pal!
My 1920x1200 colour display requires 55296000 bytes (1920x1200*24), which is 13,824 times as much as that 80x25 text display. Now despite my 2GHz CPU only being 200 times faster than the hypothetical 10Mhz CPU, it doesn't struggle at all - not only can it animate the whole screen at 60FPS or more, but it can also calculate positions for thousands of objects at the same time.
What exactly do you do on your 1900x1200 colour display? First, 1900x1200x24 bits is actually 1900x1200x3 bytes (6,840,000 bytes). That is a far cry from your piss-ant 55,296,000 bytes (55MB).
Given your eagerness to quote numbers that are practically meaningless to the point and blatantly inaccurate, and your "calculation of positions of thousands of objects" I suggest you are playing games on Windows.
Again, your numbers are flawed, because my old 80x25 text mode display is still drawn from individual pixels. Those pixels still have to be updated individually by the CPU (in the 10MHz days it was often the main CPU that drew every little dot). 80x25 is drawn on a 640x480 in 16 glorious colours. Now, by your argument, 640x480x2 (your 2 image argument from above) = 614,400 bytes. Therefore, your 1900x1200 screen only requires 12x as many bytes to move about but I really dont' care about those numbers because most of the work is done in the GPU now; not the main CPU.
There are certainly some areas in which the software doesn't seem to have sped up in proportion to the processor speed - but my guess is thats mostly because they ceased being CPU bound years ago, not because the CPU is flat out wasting cycles trying to do the job. That doesn't mean it's not the programmer's fault - but if it is, it's because they decided to block the interface while they waited for a DNS query (or similar bad design decision), not because they used a pointer offset instead of an array dereference, or vice versa.
Software bloat is because programmers can get lazy when the CPU gets fast (the old "oh there'll be better CPUs by the time we release it" excuse). Looking at the power consumption figures for a modern Pentium4 CPU and figuring out that it wastes billions and billions of cycles a day doing work that it should never have needed to do is scary. If you average it out over all the PCs running in the world the amount of energy turned into heat because of sloppy programming alone would be enormous.
What about all the wasted hours waiting for the computer to do something because of some sloppy programmer being willing to waste a few million CPU cycles here and there? It doesn't seem like much at the micro level, but think about it across all the processes that run on your machine of a day and the few
I drink to make other people interesting!
My C program parsed it in a hour and half. And yes, I used pointers. Why? Because its more efficient.
I call bullshit. Your crappy piece of code was faster because it probably didn't handle half the cases Xerces (in fastest mode, Sax) does: from proper charset decoding (usually UTF-8) to entity handling, error reporting (line numbers), entity expansion, input normalization (linefeeds to \n independent of input), namespace handling and so forth. Simple naive and non-compliant "solutions" are dime a dozen. I've written some for specific tasks, but I do not compare apples to oranges. Speed difference between C and Java/C#, for tasks like xml parsing, are nowadays somewhere between 10 and 25% at most.
And yes, I do know what I am talking about, having written a fully compliant xml parser (not xerces, but similar); and having profiled it, 40% of total processing time is for raw input handling (including charset decoding). Based on that I very much doubt you could have improved 3x the performance of a streaming parsing: it'd mean that you had no parsing overhead whatsover beyond simple input streaming.
This kind of "improvement", though, would be quite normal if one compared in-memory tree solutions (DOM) to streaming ones (Sax etc.)...
With modern compilers you should always use arrays. The reason is that for all but the most pathological cases of indexing, the compiler will do strength reduction and induction to turn the p[i*4] into *p,p+=4. The problem with using pointers is that if you have more than one of them, the compiler has no easy way of knowing if they point to the same place (this is known as pointer aliasing) so a bunch of useful optimizations have to be turned off. If you are doing a[i] = b[i] the compiler can much more easily find out that a and b are distinct memory locations than if you are doing *p = *q.
If you want to learn more about these kind of source level optimisations, look the AMD Athlon(TM) Processor x86 Code Optimization Guide is a good reference -- it includes a section on why you want to use array accesses instead of pointers, and also has a lot more up-to-date and useful advice on what compilers do to your code. It is available from AMD's website.
The interactive way to Go -- http://www.playgo.to/iwtg/en/
In the PC world, the textmode was handled in hardware by the graphic card. Yes, we had already hardware accelerated display at that time. In fact is still in the latest PC video card.
One byte for the character, one byte for the color attribute.
So your argument doesn't apply.
One of the more interesting things I've seen is to see how different software developers write a program for the following problem:
Interestingly enough, the approach nearly always used it horribly inefficent. I've never seen anyone run the program long enough to get an answer.
In contrast, the approach used less often solves it in a fraction of a second.
Wow. Just wow.
No one will probably read this comment because its been a day since the OP, but I'm amazed at the quantity of people who are slamming this guy for wanting to research something that's admittedly interesting.
For starters, if the submitter is a CompSci student then he definately gets my kudos. Too many CS students are just focused on "I wanna learn C# so I can go make money!" as opposed to actually LEARNING.
Secondly, what happened to just plain geekiness of research and studying things because its fun and interesting? Does everything we do have to have some specific applicable purpose? If you say yes, you are thinking like the MBAs that always get bashed around here instead of a real nerd.
Who knows? Its unlikely, but possible that thinking about this problem somehow leads to a train of thought that solves P=NP or something.
"You cannot find out which view is the right one by science in the ordinary sense." - C.S. Lewis on Intelligent Design
First of all, the array and pointers are not the reason why a C program run slowly.
"Unnecessary looping" is the cause of the problem. I have encountered and fixed many C programs created by programmers, who have left the company.
I have found that many programmers like use alot of for-loop and while-loop in their programs. Most of the time, the algorithm of their C programs can be modified to reduce the need for the loops and increase the speed of the C programs by 10 times or more.